Stop routing docstrings to 70B models with on-device AI on Snapdragon


Your AI coding assistant calls a 70B+ cloud model just to add a docstring. Snapdragon X2 Elite's 80 TOPS NPU changes that. In this session, build a three-tier inference routing architecture (on-device for models up to 13B, on-prem for 14B to 34B, and cloud for 70B+) that cuts cloud token usage by 67% and latency by 70% while keeping most code local. Includes routing logic, quantization trade-offs, and a deployable classifier.
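The three tiers can be expressed as a small dispatch function. The sketch below is a minimal illustration, not the session's deployable classifier. Only the parameter-count bands come from the abstract. The `Task` type, the `complexity` score in [0, 1], and the 0.3/0.7 cut-offs are hypothetical assumptions chosen for the example. It also includes a back-of-the-envelope weight-memory helper, which shows why quantization matters for the on-device tier: a 13B model needs roughly 6.5 GB for weights at 4 bits versus 26 GB at 16 bits (weights only, excluding KV cache and activations).

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Parameter-count bands taken from the session abstract.
    ON_DEVICE = "on-device (<=13B)"
    ON_PREM = "on-prem (14B-34B)"
    CLOUD = "cloud (70B+)"


@dataclass(frozen=True)
class Task:
    kind: str          # hypothetical task label, e.g. "docstring"
    complexity: float  # hypothetical score in [0.0, 1.0] from some classifier


# Hypothetical thresholds; a real router would tune these on its own traffic.
ON_DEVICE_MAX = 0.3
ON_PREM_MAX = 0.7


def route(task: Task) -> Tier:
    """Send a task to the smallest tier whose (assumed) threshold covers it."""
    if not 0.0 <= task.complexity <= 1.0:
        raise ValueError("complexity must be in [0, 1]")
    if task.complexity <= ON_DEVICE_MAX:
        return Tier.ON_DEVICE
    if task.complexity <= ON_PREM_MAX:
        return Tier.ON_PREM
    return Tier.CLOUD


def weight_gb(params_billion: float, bits: int) -> float:
    """Approximate weight storage in GB (10^9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    Ignores KV cache, activations, and runtime overhead.
    """
    return params_billion * bits / 8


if __name__ == "__main__":
    print(route(Task("docstring", 0.1)))            # Tier.ON_DEVICE
    print(route(Task("multi-file refactor", 0.9)))  # Tier.CLOUD
    print(weight_gb(13, 4), weight_gb(13, 16))      # 6.5 26.0
```

Routing to the smallest tier that can handle a request is what keeps cheap tasks such as docstrings off the cloud model; the quality of the `complexity` score is what the rest of the design depends on.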

Seating for this session is first-come, first-served. Add it to your schedule to plan your day and arrive early to secure a spot.

𝗦𝗽𝗲𝗮𝗸𝗲𝗿𝘀:
* Alberto Martinez

𝗦𝗲𝘀𝘀𝗶𝗼𝗻 𝗜𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻:
This is one of many sessions from the Microsoft Build 2026 event. View more sessions on demand and learn about Microsoft Build at https://build.microsoft.com

BRKSP90 | English (US) | Developer tools & frameworks

Breakout | (200) Intermediate

#MSBuild

Chapters:
00:00:00 – Introduction of Alberto Martinez and his role at Qualcomm
00:02:05 – Exploring agentic AI, orchestration, and their future
00:05:10 – Provocative message: Stop routing docstrings to large language models
00:11:18 – Overview of Snapdragon X2 Elite hardware capabilities
00:12:10 – Discussion of quantization and precision trade-offs
00:22:00 – Demonstrating 4x cost savings and efficiency
00:28:01 – Using entropy and logic depth to determine routing and compute distribution
00:31:17 – Summary of the evidence, including a 73% efficiency benchmark, and a discussion of capacity
00:35:00 – Explanation of building custom classifiers and orchestrator integration
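The chapter on using entropy and logic depth suggests scoring each request before dispatch. The sketch below is one possible interpretation, not the speaker's classifier. It uses per-character Shannon entropy of the code snippet as a rough information-density signal and maximum bracket nesting depth as a crude stand-in for logic depth. The function names, the normalization constants (`MAX_ENTROPY_BITS`, `MAX_DEPTH`), the equal weighting, and the tier cut-offs are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical normalization constants and cut-offs; tune on real traffic.
MAX_ENTROPY_BITS = 6.0  # assumed rough ceiling for per-character entropy of code
MAX_DEPTH = 8           # nesting depth treated as "maximally complex"
ON_DEVICE_MAX = 0.3
ON_PREM_MAX = 0.7


def shannon_entropy(text: str) -> float:
    """Per-character Shannon entropy in bits: H = -sum(p * log2(p))."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def logic_depth(code: str) -> int:
    """Maximum nesting depth of (), [], {}.

    A crude proxy: it ignores whether brackets sit inside strings or comments.
    """
    depth = best = 0
    for ch in code:
        if ch in "([{":
            depth += 1
            best = max(best, depth)
        elif ch in ")]}":
            depth = max(depth - 1, 0)
    return best


def complexity(code: str) -> float:
    """Equal-weight blend of normalized entropy and depth, each clamped to [0, 1]."""
    e = min(shannon_entropy(code) / MAX_ENTROPY_BITS, 1.0)
    d = min(logic_depth(code) / MAX_DEPTH, 1.0)
    return 0.5 * e + 0.5 * d


def choose_tier(code: str) -> str:
    """Map the blended score onto the three tiers from the session abstract."""
    score = complexity(code)
    if score <= ON_DEVICE_MAX:
        return "on-device"
    if score <= ON_PREM_MAX:
        return "on-prem"
    return "cloud"
```

Both signals are cheap enough to compute on-device before any model is called, which matters because the router itself must not cost more than the inference it saves. A production classifier would more likely be a small trained model; these hand-built features only illustrate the shape of the decision.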