Edge Inference

This program develops offline-first private inference systems that operate within defined thermal and latency envelopes. Work focuses on deployment pipelines for constrained hardware targets where network connectivity cannot be assumed and power budgets are fixed.

Every deployment target is characterized by a hardware constraint map that defines thermal dissipation capacity, available memory, compute throughput, and power draw limits. Inference models are compiled against these maps, and no model is deployed to a target whose constraints it cannot satisfy.
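A minimal sketch of how such a constraint map might gate deployment, assuming illustrative field names and units (the program's actual map format is not specified here):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConstraintMap:
    """Hardware constraint map for one deployment target (illustrative fields)."""
    thermal_ceiling_c: float   # thermal dissipation limit, degrees C
    memory_bytes: int          # available memory
    throughput_gops: float     # sustained compute throughput
    power_budget_w: float      # power draw ceiling, watts

@dataclass(frozen=True)
class ModelProfile:
    """A compiled model's measured resource demands on a given target."""
    peak_temp_c: float
    memory_bytes: int
    required_gops: float
    peak_power_w: float

def can_deploy(model: ModelProfile, target: ConstraintMap) -> bool:
    """A model is deployable only if it satisfies every declared constraint."""
    return (model.peak_temp_c <= target.thermal_ceiling_c
            and model.memory_bytes <= target.memory_bytes
            and model.required_gops <= target.throughput_gops
            and model.peak_power_w <= target.power_budget_w)

target = ConstraintMap(85.0, 4 << 30, 120.0, 5.0)
print(can_deploy(ModelProfile(74.0, 3 << 30, 90.0, 4.2), target))   # True
print(can_deploy(ModelProfile(74.0, 6 << 30, 90.0, 4.2), target))   # False: memory
```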

Program Metadata
Domain Edge
Status PROTOTYPING
Programs Active 4
Division Edge Inference Group
Division ID yl-div-002
Classification yl/ei/000

Scope

  • Thermal envelope profiling: systematic characterization of inference workload thermal signatures across target hardware, including sustained and burst load profiles under ambient temperature variation.
  • Latency-bounded execution: inference runtime systems that guarantee response within defined latency bounds, including deterministic scheduling and memory allocation strategies that avoid garbage collection pauses.
  • Offline deployment pipelines: end-to-end tooling for compiling, optimizing, packaging, and deploying inference models to hardware targets without requiring network connectivity during any stage.
  • Hardware constraint mapping: formal representation of target hardware capabilities and limitations, used as input to model compilation and deployment validation processes.
  • Power-aware inference scheduling: dynamic workload management that adjusts inference throughput and model precision based on available power budget and battery state on portable targets.
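As an illustration of the power-aware scheduling item, a minimal throttling policy could look like the following; the 90% back-off threshold, 0.8 step factor, and ramp-up rule are all illustrative assumptions rather than the program's actual policy:

```python
def adjust_throughput(current_qps: float, power_draw_w: float,
                      budget_w: float) -> float:
    """Multiplicatively reduce inference throughput as power draw nears the
    declared budget ceiling, and ramp back up cautiously when headroom returns."""
    if power_draw_w >= 0.9 * budget_w:
        return current_qps * 0.8    # back off before hitting the ceiling
    if power_draw_w < 0.7 * budget_w:
        return current_qps * 1.05   # slow recovery while well under budget
    return current_qps              # hold steady in the hysteresis band

print(adjust_throughput(10.0, 4.6, 5.0))   # 8.0  (near ceiling: throttle)
print(adjust_throughput(10.0, 4.0, 5.0))   # 10.0 (hold)
```

The asymmetry (fast back-off, slow recovery) keeps the controller from oscillating around the budget ceiling.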

Evaluation Harness

  • Thermal compliance testing: automated test suites that run inference workloads at defined intensity levels and verify that hardware temperatures remain within declared thermal envelopes across sustained operation periods.
  • Latency percentile verification: measurement of inference latency at the 50th, 95th, 99th, and 99.9th percentiles, with automatic failure flagging when any percentile exceeds the target bound.
  • Offline integrity validation: verification that models deployed through the offline pipeline produce identical outputs to models deployed through the standard pipeline, within defined numerical tolerance.
  • Power budget adherence: continuous monitoring of power draw during inference operations, with automatic throughput reduction when consumption approaches the declared power budget ceiling.
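The percentile check can be sketched with a nearest-rank percentile and per-bound failure flagging; the bounds and sample data below are hypothetical:

```python
import math

def latency_percentile(samples, pct):
    """Nearest-rank percentile over recorded inference latencies (ms)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def verify_latency_bounds(samples, bounds):
    """Return {percentile: (measured, bound)} for every percentile whose
    measured latency exceeds its target bound (empty dict means all pass)."""
    return {pct: (latency_percentile(samples, pct), bound)
            for pct, bound in bounds.items()
            if latency_percentile(samples, pct) > bound}

# 1000 synthetic measurements and hypothetical per-percentile bounds (ms)
samples = [5.0] * 950 + [12.0] * 40 + [30.0] * 10
bounds = {50: 8.0, 95: 15.0, 99: 10.0, 99.9: 40.0}
print(verify_latency_bounds(samples, bounds))   # {99: (12.0, 10.0)}
```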

Open Questions

  • What is the relationship between model quantization depth and thermal envelope width for the ARM-based targets profiled in yl-ei-014? Does this relationship hold across different silicon revisions?
  • Can deterministic latency bounds be maintained when inference models are updated in place on deployed hardware, or does model update require a full revalidation cycle?
  • What is the minimum viable model size for useful inference on sub-1W power budgets across the hardware targets currently under study?

Lab Notes

yl-ei-031

Thermal profiling of the Cortex-A78 target under sustained INT8 inference shows a steady-state junction temperature of 74C at 25C ambient, with a thermal time constant of approximately 180 seconds. This leaves 11C of thermal headroom relative to the declared 85C envelope ceiling. Burst workloads at FP16 precision exceed the envelope within 45 seconds and require active throttling.
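These figures fit a first-order (RC) thermal model, T(t) = T_amb + ΔT_ss(1 − e^(−t/τ)). The sketch below uses the measured τ ≈ 180 s and 25C ambient; the FP16 steady-state rise of 270C is purely an illustrative assumption, chosen to show how a burst can cross the 85C ceiling on roughly the 45-second timescale reported in the note:

```python
import math

def time_to_threshold(t_amb_c, dt_ss_c, tau_s, t_max_c):
    """First-order thermal model T(t) = t_amb + dt_ss * (1 - exp(-t / tau)).
    Returns seconds until the junction temperature reaches t_max_c, or None
    if the steady-state temperature never exceeds the ceiling."""
    if t_amb_c + dt_ss_c <= t_max_c:
        return None
    return -tau_s * math.log(1.0 - (t_max_c - t_amb_c) / dt_ss_c)

# Sustained INT8 load (yl-ei-031): 49C steady-state rise, stays inside the envelope
print(time_to_threshold(25.0, 49.0, 180.0, 85.0))             # None
# Hypothetical FP16 burst with an assumed 270C steady-state rise
print(round(time_to_threshold(25.0, 270.0, 180.0, 85.0), 1))  # 45.2
```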

yl-ei-032

The offline deployment pipeline now supports three target architectures (ARM Cortex-A series, RISC-V with vector extensions, and x86-64 with AVX-512) with a unified constraint map format. Model compilation time averages 340 seconds for a 1.2B parameter model at INT8 quantization. The compiled binary is self-contained and requires no runtime dependencies beyond the target OS kernel.

yl-ei-033

Offline-first design philosophy: this group treats offline capability as the primary constraint in system design rather than an afterthought. Network connectivity is regarded as an optional enhancement, not a baseline requirement. This stance has far-reaching consequences for architectural decisions: all model updates must support offline verification, all telemetry must support local storage with deferred upload, and all authentication mechanisms must remain usable without network access. A taxonomy of "offline integrity levels," defining graduated degrees of offline autonomy, is currently being formalized.

Citations

  • yl-ei-ref-001
    "Thermal Envelope Characterization for Edge Inference Targets." Technical Report yl-tr-024. Edge Inference Group. 2025.
  • yl-ei-ref-002
    "Deterministic Latency Bounds in Constrained Inference Runtimes." Internal Working Paper, Yueqian Labs. 2025.
  • yl-ei-ref-003
    "Offline Deployment Pipeline Architecture: Design and Validation." Technical Report yl-tr-027. 2025.
  • yl-ei-ref-004
    "Power-Aware Inference Scheduling on Sub-5W Hardware." Lab Report yl-ei-lr-012. Edge Inference Group. 2024.