Deploying small language models (SLMs) on ultra-low-power edge devices requires careful optimization to meet strict memory, latency, and energy constraints while preserving privacy. This paper presents a systematic approach to adapting SLMs for TinyML, focusing on model compression, hardware-aware quantization, and lightweight privacy mechanisms. We introduce a sparse ternary quantization technique that reduces model size by 5.8× with minimal accuracy loss, together with an efficient federated fine-tuning method for edge deployment. To address privacy concerns, we implement on-device differential noise injection during text preprocessing, adding negligible computational overhead. Evaluations on constrained devices (Cortex-M7 and ESP32) show that our optimized models achieve 92% of the accuracy of full-precision baselines while operating within 256 KB of RAM and reducing inference latency by 4.3×. The proposed techniques enable new applications for SLMs in always-on edge scenarios where both efficiency and data protection are critical.
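To make the sparse ternary quantization idea concrete, the following is a minimal illustrative sketch, not the paper's method: it uses the common ternary-weight heuristic of pruning weights below a magnitude threshold to zero and snapping the rest to ±1 with a per-tensor scale. The function name `ternarize`, the 0.7 threshold factor, and the use of NumPy are assumptions introduced here for illustration only.

```python
import numpy as np

def ternarize(weights: np.ndarray, threshold_factor: float = 0.7):
    """Map a float weight tensor to {-1, 0, +1} with a per-tensor scale.

    Weights with magnitude below the threshold are pruned to 0 (sparsity);
    the rest are snapped to +/-1 and rescaled by the mean surviving magnitude.
    NOTE: threshold rule and scale estimate are illustrative assumptions.
    """
    delta = threshold_factor * np.mean(np.abs(weights))      # pruning threshold
    ternary = np.sign(weights) * (np.abs(weights) > delta)   # values in {-1, 0, +1}
    surviving = np.abs(weights)[ternary != 0]
    scale = surviving.mean() if surviving.size else 0.0      # per-tensor scale
    return ternary.astype(np.int8), np.float32(scale)

# Example: quantize one layer and reconstruct its approximation.
w = np.random.randn(256, 256).astype(np.float32)
q, alpha = ternarize(w)
w_hat = alpha * q                       # dequantized weights used at inference
sparsity = float((q == 0).mean())       # fraction of weights pruned to zero
```

In a deployment like the one described above, only the int8 ternary codes and one scale per tensor would need to be stored on-device, which is where the memory reduction comes from; the exact packing and accuracy-recovery steps are beyond this sketch.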