
Distilled DeepSeek R1 models will be coming to Copilot+ PCs, starting with Qualcomm Snapdragon X first: Microsoft

Yesterday, Microsoft brought Chinese AI company DeepSeek's R1 model to its Azure AI Foundry platform and GitHub. The model has emerged as a strong competitor to widely used chatbots such as ChatGPT and Gemini.

Now, in a new blog post, Microsoft has announced that it is bringing NPU-optimized versions of DeepSeek R1 directly to Copilot+ PCs, starting with Qualcomm Snapdragon X and followed by Intel Core Ultra 200V and others.

The first release, DeepSeek-R1-Distill-Qwen-1.5B, will be available in the AI Toolkit, with the 7B and 14B variants arriving soon. These optimized models let developers build and deploy AI-powered applications that run efficiently on-device, taking full advantage of the powerful NPUs in Copilot+ PCs.

To see DeepSeek in action on your Copilot+ PC, simply download the AI Toolkit VS Code extension. The DeepSeek model optimized in the ONNX QDQ format will soon be available in AI Toolkit’s model catalog, pulled directly from Azure AI Foundry.

The distilled Qwen 1.5B model consists of a tokenizer, an embedding layer, a context processing model, a token iteration model, a language model head, and a de-tokenizer. Microsoft uses 4-bit block-wise quantization for the embeddings and language model head and runs these memory-access-heavy operations on the CPU. While DeepSeek's Qwen 1.5B release does include an int4 variant, it does not map directly to the NPU due to the presence of dynamic input shapes and behavior.
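Block-wise quantization stores one scale per small block of weights rather than one per tensor, so the error in each block is bounded by that block's local range. A minimal NumPy sketch of the idea (the block size of 32 and symmetric absmax scaling are illustrative assumptions, not Microsoft's exact recipe):

```python
import numpy as np

def quantize_blockwise_int4(w, block_size=32):
    """Quantize a 1-D float array to int4 codes with one scale per block."""
    pad = (-len(w)) % block_size
    blocks = np.pad(w, (0, pad)).reshape(-1, block_size)
    # One scale per block: map the block's absolute max onto the int4 range.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales, len(w)

def dequantize_blockwise_int4(q, scales, n):
    """Reconstruct floats from int4 codes and per-block scales."""
    return (q.astype(np.float32) * scales).reshape(-1)[:n]

# Per-block scales keep the error within half a quantization step.
rng = np.random.default_rng(0)
w = rng.normal(size=100).astype(np.float32)
q, scales, n = quantize_blockwise_int4(w)
w_hat = dequantize_blockwise_int4(q, scales, n)
assert np.max(np.abs(w - w_hat)) <= scales.max() / 2 + 1e-6
```

Because the scales live alongside small blocks, a single outlier weight only degrades the precision of its own block, which is what makes 4-bit storage viable for large embedding tables and language model heads.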

Additionally, the ONNX QDQ format is used to enable scaling across the variety of NPUs in the Windows ecosystem. To achieve the dual goals of a low memory footprint and fast inference, as with Phi Silica, two key changes were made: first, a sliding window design that unlocks fast time to first token and long-context support despite the lack of dynamic tensor support in the hardware stack; and second, a 4-bit QuaRot quantization scheme that truly takes advantage of low-bit processing.
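QuaRot's core trick is to multiply weights and activations by an orthogonal (Hadamard) rotation before quantizing: the rotation spreads outlier values across many coordinates, shrinking the dynamic range the 4-bit scale must cover, and because it is orthogonal it can be undone or folded into adjacent layers without changing the network's output. A toy NumPy illustration of that effect (a simplified sketch of the principle, not the actual QuaRot implementation):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of 2)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def int4_rmse(x):
    """RMS error of symmetric 4-bit absmax quantization of x."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return np.sqrt(np.mean((x - q * scale) ** 2))

# A row of weights with one large outlier: the absmax scale is dominated by it.
w = np.ones(64)
w[0] = 100.0
H = hadamard(64)
w_rot = H @ w  # the orthogonal rotation spreads the outlier across all coordinates

err_plain = int4_rmse(w)
err_rot = int4_rmse(w_rot)
assert err_rot < err_plain  # rotation shrinks the 4-bit quantization error
# The rotation is exactly invertible, so no information is lost: H.T @ (H @ w) == w.
assert np.allclose(H.T @ w_rot, w)
```

Here the plain quantization rounds all the small weights to zero because the outlier inflates the scale, while the rotated version keeps every coordinate at a comparable magnitude and quantizes far more accurately.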

Microsoft says that, with the speed and power characteristics of the NPU-optimized DeepSeek R1 models, users will be able to interact with these groundbreaking models entirely locally.



