Microsoft has introduced Mu, a compact on-device language model designed to run entirely on the neural processing units (NPUs) in Copilot+ PCs. It already powers the AI agent in the Windows Settings app for users in the Dev Channel of the Windows Insider Program.

Mu is built on a Transformer encoder–decoder architecture, which separates input processing from output generation: the encoder maps the input sequence to a latent representation once, and the decoder then produces output tokens against that fixed representation instead of reprocessing the full prompt at every step. According to Microsoft, the model achieves over 100 tokens per second and reduces first-token latency by 47% compared with similarly sized decoder-only models.
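
The structural difference is easiest to see in code. Below is a minimal PyTorch sketch, not Microsoft's implementation and with invented dimensions, showing how the encoder runs once over the input while the decoding loop only processes newly generated tokens:

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

prompt = torch.randn(1, 32, 64)        # 32 embedded input tokens
memory = model.encoder(prompt)         # the encoder runs ONCE over the whole input

generated = torch.randn(1, 1, 64)      # stand-in for a start-of-sequence embedding
for _ in range(5):
    # each step attends to the cached encoder memory; the prompt is never re-encoded
    out = model.decoder(generated, memory)
    # a real model would project out[:, -1:, :] to the vocabulary and embed the
    # sampled token; reusing the raw output vector keeps this sketch short
    generated = torch.cat([generated, out[:, -1:, :]], dim=1)
```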

“This model addresses scenarios that require inferring complex input-output relationships,” the company said in a blog post. “It delivers high performance while running locally.”

The 330-million-parameter model was trained using A100 GPUs on Azure and fine-tuned through a multi-phase process. Mu uses techniques such as dual LayerNorm, rotary positional embeddings, and grouped-query attention to improve inference speed and accuracy within the tight compute and memory budgets of edge devices.
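
Of those techniques, grouped-query attention is what shrinks the memory footprint of attention: several query heads share a single key/value head, cutting the KV cache proportionally. A self-contained sketch, with head counts and dimensions invented for illustration rather than taken from Mu:

```python
import torch
import torch.nn.functional as F

n_q_heads, n_kv_heads, head_dim, seq = 8, 2, 32, 16
group = n_q_heads // n_kv_heads            # 4 query heads share each KV head

q = torch.randn(1, n_q_heads, seq, head_dim)
k = torch.randn(1, n_kv_heads, seq, head_dim)
v = torch.randn(1, n_kv_heads, seq, head_dim)

# Expand each KV head across its group of query heads (no new parameters);
# the KV cache shrinks by n_q_heads / n_kv_heads = 4x in this sketch.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

attn = F.scaled_dot_product_attention(q, k, v)   # shape (1, 8, 16, 32)
```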

To meet device-specific requirements, Microsoft applied post-training quantisation, converting model weights from floating point to lower-precision 8-bit and 16-bit integer formats. These optimisations were carried out in collaboration with hardware partners including AMD, Intel, and Qualcomm. On the Surface Laptop 7, Mu can generate over 200 tokens per second.
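
As a toy illustration of the idea (real NPU toolchains, including whatever Microsoft used with its partners, are far more sophisticated), post-training quantisation stores a scaled integer approximation of each float weight:

```python
import torch

w = torch.randn(256, 256)                       # trained float32 weights
scale = w.abs().max() / 127                     # symmetric per-tensor scale
w_int8 = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)

w_dequant = w_int8.float() * scale              # approximation used at inference
max_err = (w - w_dequant).abs().max()
print(f"stored at 8 bits, max reconstruction error: {max_err:.4f}")
```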

One of Mu’s first applications is the agent in Windows Settings, which maps natural-language queries to system actions so users can change settings in everyday language. The agent is integrated into the Settings search box and activates for multi-word queries, while short or vague inputs continue to trigger traditional search results.
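
The gating between the agent and traditional search could look something like the following sketch; the word-count threshold and function name are assumptions for illustration, not Windows internals:

```python
def route_settings_query(query: str, min_words: int = 2) -> str:
    """Hypothetical router: multi-word queries go to the agent, short ones
    fall back to lexical search."""
    words = query.strip().split()
    if len(words) >= min_words:
        return "agent"      # e.g. "turn on night light" -> model maps to an action
    return "search"         # e.g. "bluetooth" -> traditional keyword search

assert route_settings_query("turn on night light") == "agent"
assert route_settings_query("bluetooth") == "search"
```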

While a larger Phi model fine-tuned with LoRA initially met the accuracy benchmarks, its latency exceeded acceptable limits. Mu, with task-specific fine-tuning, met both the accuracy and latency goals. Microsoft said it scaled the training dataset to 3.6 million samples and expanded support from about 50 system settings to hundreds, using automated labelling, prompt tuning, and noise injection to improve model precision.
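
Noise injection here plausibly means perturbing training queries so the model tolerates typos and informal phrasing; the sketch below is an illustrative guess at that kind of augmentation, not Microsoft's pipeline:

```python
import random

def inject_noise(query: str, p: float = 0.1, seed: int | None = None) -> str:
    """Randomly replace letters to simulate user typos in training queries."""
    rng = random.Random(seed)
    chars = list(query)
    for i in range(len(chars)):
        if chars[i].isalpha() and rng.random() < p:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

print(inject_noise("increase display brightness", seed=7))
```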

“We observed that the model performed best on multi-word queries that conveyed clear intent, as opposed to short or partial-word inputs, which often lack sufficient context for accurate interpretation,” the blog noted. In cases where a query could map to multiple settings, such as “increase brightness” on a dual-monitor setup, Microsoft prioritised training data towards the most common user scenarios.

Mu’s development builds on earlier work with the Phi and Phi Silica models and is expected to serve as a foundation for future AI agents on Windows devices.
