Local-Hosted Models
本地托管模型
This program develops secure runtimes and airgapped workflows for hosting inference models on local infrastructure. Research covers local retrieval-augmented generation, fine-tuning under defined constraint surfaces, and data sovereignty enforcement mechanisms that ensure no data leaves the operator's physical control.
本项目开发用于在本地基础设施上托管推断模型的安全运行时和气隔工作流。研究涵盖本地检索增强生成、在定义约束面下的微调,以及确保数据不离开运营者物理控制的数据主权执行机制。
The program operates on the principle that model hosting should be fully auditable by the operator, with no external dependencies that could compromise data handling guarantees. Every runtime includes built-in telemetry for resource consumption, access logging, and integrity verification.
本项目遵循以下原则:模型托管应完全可由运营者审计,不依赖可能损害数据处理保证的外部组件。每个运行时包含资源消耗、访问日志和完整性验证的内置遥测功能。
Scope
研究范围
- Airgapped runtime environments: containerized inference runtimes that operate without any network interface enabled, using cryptographic verification to ensure binary integrity and prevent unauthorized modification. 气隔运行时环境:在不启用任何网络接口的情况下运行的容器化推断运行时,使用加密验证确保二进制完整性并防止未经授权的修改。
- Local retrieval-augmented generation: document retrieval and embedding systems that run entirely on local storage, with index construction, query processing, and result ranking performed without external API calls. 本地检索增强生成:完全在本地存储上运行的文档检索和嵌入系统,索引构建、查询处理和结果排序均在不调用外部API的情况下执行。
- Fine-tuning constraint surfaces: formal definitions of the parameter space, data requirements, and compute budgets for local fine-tuning operations, ensuring that fine-tuning does not degrade model behavior outside the target domain. 微调约束面:本地微调操作的参数空间、数据需求和计算预算的形式化定义,确保微调不会在目标领域之外降低模型行为。
- Secure model serving: runtime isolation, memory encryption, and access control mechanisms that protect model weights and inference inputs from unauthorized access, including from processes running on the same host. 安全模型服务:运行时隔离、内存加密和访问控制机制,保护模型权重和推断输入免受未经授权的访问,包括来自同一主机上运行的进程。
- Data sovereignty enforcement: technical controls and audit mechanisms that provide verifiable guarantees about data location, access, and lifecycle, aligned with jurisdictional requirements for sensitive data handling. 数据主权执行:提供关于数据位置、访问和生命周期的可验证保证的技术控制和审计机制,符合敏感数据处理的司法管辖要求。
Evaluation Harness
评估框架
- Airgap integrity verification: automated tests that confirm no network traffic is generated during runtime operation, including DNS lookups, broadcast packets, and any other network-layer activity. 气隔完整性验证:确认运行时操作期间不生成网络流量的自动化测试,包括DNS查询、广播数据包和任何其他网络层活动。
- Retrieval accuracy benchmarking: comparison of local retrieval system results against reference implementations using standardized document collections, measuring recall, precision, and latency. 检索准确性基准测试:使用标准化文档集合将本地检索系统结果与参考实现进行比较,测量召回率、精确度和延迟。
- Fine-tuning regression testing: automated evaluation suites that verify fine-tuned models maintain baseline performance on held-out evaluation sets outside the fine-tuning domain. 微调回归测试:验证微调模型在微调领域之外的留出评估集上维持基线性能的自动化评估套件。
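One coarse way to approach the airgap integrity check in the first item above is to snapshot the kernel's per-interface packet counters before and after a workload and flag any interface whose counters moved. The sketch below assumes a Linux host that exposes `/proc/net/dev`. The function names are hypothetical and are not taken from the program's actual harness. Counter comparison complements packet capture and does not replace it.

```python
def parse_packet_counters(proc_net_dev_text):
    """Return {interface: (rx_packets, tx_packets)} from /proc/net/dev text."""
    counters = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, stats = line.split(":", 1)
        fields = stats.split()
        # Field 1 is received packets, field 9 is transmitted packets.
        counters[name.strip()] = (int(fields[1]), int(fields[9]))
    return counters


def interfaces_with_traffic(before, after, ignore=()):
    """List interfaces that are new or whose packet counters changed."""
    changed = []
    for name, counts in after.items():
        if name in ignore:
            continue
        if name not in before or counts != before[name]:
            changed.append(name)
    return sorted(changed)


def read_counters(path="/proc/net/dev"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as handle:
        return parse_packet_counters(handle.read())
```

A test would call `read_counters()` before and after the workload and assert that `interfaces_with_traffic` returns an empty list. Whether loopback should be passed in `ignore` depends on how strictly "any other network-layer activity" is interpreted.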
Open Questions
开放问题
- What is the minimum local storage capacity required for a retrieval-augmented generation system to provide acceptable coverage across a 500,000-document corpus with the current embedding model? 使用当前嵌入模型,检索增强生成系统在500,000文档语料库上提供可接受覆盖率所需的最低本地存储容量是多少?
- Can fine-tuning constraint surfaces be automatically derived from the target domain corpus characteristics, or do they require manual specification by a domain expert? 微调约束面能否从目标领域语料库特征自动派生,还是需要领域专家的手动指定?
- Under what conditions does memory encryption for model weights introduce latency that exceeds the declared latency bounds for inference operations? 在何种条件下,模型权重的内存加密会引入超过推断操作声明延迟边界的延迟?
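For the first question, a rough starting point is to scale the footprint reported in the Lab Notes (48GB for the 200,000-document test corpus) to 500,000 documents. The sketch below assumes that index size grows linearly with document count and that chunking, embedding dimension, and precision stay the same. None of these assumptions is confirmed by the notes, and the estimate does not settle what counts as acceptable coverage.

```python
def extrapolate_index_size_gb(measured_gb, measured_docs, target_docs):
    """Scale a measured index footprint linearly to a different corpus size."""
    if measured_docs <= 0:
        raise ValueError("measured_docs must be positive")
    return measured_gb * target_docs / measured_docs


# Figures from the 200,000-document test corpus in the Lab Notes.
estimate = extrapolate_index_size_gb(48.0, 200_000, 500_000)
print(f"Estimated index size: {estimate:.0f} GB")  # prints "Estimated index size: 120 GB"
```

Approximate nearest neighbor structures and reranker assets may not scale linearly, so the figure is better read as an order-of-magnitude guide than as a minimum.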
Lab Notes
实验笔记
The airgapped runtime container has been validated against a comprehensive network traffic audit covering 72 hours of continuous operation under varied workloads. Zero network packets were detected at the host network interface. The runtime uses a custom init system that explicitly removes network namespace capabilities before loading the inference engine. Binary integrity is verified at startup using SHA-256 checksums against a locally stored manifest.
气隔运行时容器已通过涵盖不同工作负载下72小时连续运行的综合网络流量审计验证。在主机网络接口未检测到网络数据包。运行时使用自定义init系统,在加载推断引擎前显式移除网络命名空间能力。启动时使用SHA-256校验和对本地存储的清单进行二进制完整性验证。
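The startup integrity check described above can be sketched as follows. This is a minimal illustration and not the runtime's actual implementation. The manifest format (a JSON object mapping relative file paths to hex-encoded SHA-256 digests) and the function names are assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large binaries need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, root):
    """Return the manifest entries that are missing or whose digest differs."""
    # Assumed manifest format: {"relative/path": "<hex sha256>", ...}
    manifest = json.loads(Path(manifest_path).read_text())
    failures = []
    for relative_path, expected in manifest.items():
        target = Path(root) / relative_path
        if not target.is_file() or sha256_of_file(target) != expected.lower():
            failures.append(relative_path)
    return sorted(failures)
```

An init step would refuse to load the inference engine unless `verify_manifest` returns an empty list. A checksum manifest only detects modification if the manifest itself is protected, for example by read-only storage or a signature, which this sketch does not cover.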
Local RAG performance on a 200,000-document test corpus: index construction completed in 4.2 hours on a single GPU node. Query latency at the 99th percentile is 340ms, which includes embedding generation, approximate nearest neighbor search, and reranking. Recall at k=10 is 0.87 against the reference implementation baseline. Storage footprint for the full index is 48GB, dominated by the document embeddings at float16 precision.
在200,000文档测试语料库上的本地RAG性能:索引构建在单GPU节点上耗时4.2小时。第99百分位的查询延迟为340毫秒,包括嵌入生成、近似最近邻搜索和重排序。在k=10时的召回率相对参考实现基线为0.87。完整索引的存储占用为48GB,主要由float16精度的文档嵌入组成。
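The notes do not spell out how recall at k=10 is computed. One common reading, assumed here, is that the reference implementation's top-k results serve as ground truth for each query, and recall is the share of those results that the local system also returns in its own top k, averaged over queries. The function names are hypothetical.

```python
def recall_at_k(local_ranked, reference_ranked, k=10):
    """Share of the reference top-k that also appears in the local top-k."""
    reference_top = set(reference_ranked[:k])
    if not reference_top:
        return 0.0
    hits = len(set(local_ranked[:k]) & reference_top)
    return hits / len(reference_top)


def mean_recall_at_k(local_runs, reference_runs, k=10):
    """Average recall@k over every query present in the reference runs."""
    scores = [
        recall_at_k(local_runs.get(query, []), reference_ranked, k)
        for query, reference_ranked in reference_runs.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Both arguments to `mean_recall_at_k` map a query identifier to a ranked list of document identifiers. A query missing from the local runs scores zero, so it is not silently skipped.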
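Local operation philosophy: the program's core position is that the security guarantees of an inference system should not depend on network architecture or on the promises of cloud service providers. Security must be verifiable at the physical level, which means the operator should be able to confirm through physical inspection that data has not left their control. Current work is defining "local operation integrity levels", ranging from L1 (network-isolated but with shared storage) to L4 (fully physically isolated, including an independent power supply and shielding). Most commercial deployment scenarios require L2 or L3. L4 is reserved for scenarios that must meet specific regulatory requirements.
本地运行哲学:本项目的核心立场是,推断系统的安全性保证不应依赖于网络架构或云服务提供者的承诺。安全性必须在物理层面可验证。这意味着运营者应能通过物理检查确认数据未离开其控制范围。当前工作正在定义"本地运行完整性等级",从L1(网络隔离但共享存储)到L4(完全物理隔离,包括独立电源和屏蔽)。大多数商业部署场景需要L2或L3级别。L4级别保留用于需要满足特定监管要求的场景。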
Citations
参考文献
- yl-lh-ref-001. "Airgapped Runtime Design for Local Inference Systems." Technical Report yl-tr-033. Local Compute Program. 2025.
- yl-lh-ref-002. "Local Retrieval-Augmented Generation: Architecture and Performance Benchmarks." Internal Working Paper, Yueqian Labs. 2025.
- yl-lh-ref-003. "Fine-Tuning Constraint Surfaces: Preventing Domain Regression in Locally Adapted Models." Lab Report yl-lh-lr-009. 2025.
- yl-lh-ref-004. "Data Sovereignty in Local Inference Deployments: Technical Controls and Audit Frameworks." Technical Report yl-tr-036. 2024.