In Today’s Issue:

🧩 What you trade away when you rebuild an agent harness for small models

🧠 Why the orchestrator runs on an open 14B base, not a giant proprietary one

📊 How a 9B computer-use model ends up in the same conversation as the frontier

When the agent stops and asks you first, and why that is learned, not hard-coded

💻 The realistic path to agents that run on your own hardware

And more intelligence from the industry...

A note from us: University students receive our Saturday Deepdive for free when they register with their university email address at: https://getsuperintel.com/plus-whitelist

Dear Readers,

For two years, the story of AI agents has been a story of scale: bigger models, longer context windows, more compute. This week runs the other way. I spoke with Ahmed Awadallah, Partner Research Manager at Microsoft Research AI Frontiers, about the team's latest release, a co-designed stack of three parts. There is MagenticLite, the agent app; MagenticBrain, the orchestrator that plans, codes, and delegates; and Fara1.5, the computer-use models that actually drive the browser. All of it is built so that small models can do real agentic work.

The headline result is hard to ignore: a 9-billion-parameter model that nearly doubles its predecessor on web navigation, and a 27B sibling that trades punches with frontier computer-use agents like Operator and Gemini 2.5 Computer Use. But the sharper idea sits underneath. Ahmed argues that in agentic AI, more and more of the capability lives in the scaffolding around the model (the harness, the data pipeline, the orchestration) and not only in the weights. Or as he puts it, the moat is built around the model, not just inside it. We talked about what you trade away when you build for small models, when an agent should stop and ask before it acts, and how close we really are to agents that run on your own hardware.

All the best,

Kim Isenberg

In Conversation: Ahmed Awadallah, Partner Research Manager, Microsoft Research AI Frontiers

TL;DR

This week: Microsoft Research's Ahmed Awadallah on why the next great AI agent might be small enough to run on your laptop.

Three related releases work as one stack: the MagenticLite app, the MagenticBrain orchestrator, and the Fara1.5 computer-use models. (Microsoft Research)

Q1. Rebuilding the agent harness specifically for small models — what capabilities did you have to trade off compared to a frontier-model harness, and what surprised you about what small models could still handle?

The biggest challenge was a smaller effective context window, so we approached the model, harness, and UX as a co-designed system rather than separate layers. Small models degrade quickly as context grows, so the harness had to be designed to be highly selective in what it retains in context: it curates what the orchestrator sees at each step, summarizes earlier turns, and offloads the rest. It also uses structured hand-offs. Instead of asking one model to do everything, MagenticBrain delegates browser work to Fara1.5 so each component only carries the context it needs. A frontier harness can often rely on expanding the context window. We couldn't, and that constraint pushed us toward a more composable architecture.

This is where a lot of agentic capability actually comes from. You can do a surprising amount with smaller models if you co-design the scaffolding and the experience around them, not just the model itself.

What surprised us was how much we could unlock on a relatively small base model. MagenticBrain switches fluidly between calling a tool, writing five lines of Python in the terminal, and delegating a UI task, all in a single trajectory. That "decide, code, coordinate" role is usually where teams feel forced to reach for their largest model. There's still a size floor, though. You need enough base capability for reliable long-form reasoning and code writing, which is why we landed on 14B. Below that, the base capabilities start to fray.
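To make the context-curation idea from this answer concrete, here is a minimal sketch under stated assumptions. It is not the MagenticLite harness: the class `CuratedContext`, the `keep_recent` parameter, and the truncating `summarize` helper are all hypothetical names invented for illustration. It shows one plausible shape for "keep recent turns, summarize earlier ones, offload the rest."

```python
from dataclasses import dataclass, field


def summarize(turn: str, max_chars: int = 40) -> str:
    """Stand-in summarizer (plain truncation). A real harness would likely use a model."""
    return turn if len(turn) <= max_chars else turn[: max_chars - 3] + "..."


@dataclass
class CuratedContext:
    """Hypothetical context manager: recent turns verbatim, older turns compressed."""
    keep_recent: int = 3
    summary: list[str] = field(default_factory=list)    # short notes on older turns
    recent: list[str] = field(default_factory=list)     # verbatim recent turns
    offloaded: list[str] = field(default_factory=list)  # full text kept outside the prompt

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        while len(self.recent) > self.keep_recent:
            old = self.recent.pop(0)
            self.offloaded.append(old)           # moved out of context, not discarded
            self.summary.append(summarize(old))  # only the short note stays in the prompt

    def prompt(self) -> str:
        """Build what the orchestrator would see at this step."""
        parts = []
        if self.summary:
            parts.append("Earlier steps: " + "; ".join(self.summary))
        parts.extend(self.recent)
        return "\n".join(parts)
```

The design choice worth noticing is that the prompt grows with the number of summaries rather than with the full history, which is the property a small model with a limited effective context would need.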

Q2. MagenticBrain is fine-tuned on Qwen 3 14B for orchestration. What drove the choice of that base model, and how do you think about building the planner/delegator layer on an open external base?

For research purposes, we are using the Qwen model family as a base model for both our CUA and MagenticBrain work, for two main reasons: it gives us a strong base for infusing agentic capabilities, and it's open-weight. Our experience post-training Fara1.0 showed us how much agentic capability these models can absorb, and the 14B size in particular sits at a practical sweet spot for orchestration: large enough to support planning, code generation, tool use, and delegation within a single model, while still small enough to be deployable and affordable as part of an agentic system.

Our main goal with MagenticBrain was to explore how far we could push agentic capabilities through our post-training recipes, and open weights let us iterate quickly and test our hypotheses.

The training recipes, data generation techniques, and evaluation methodologies we're developing aren't tied to any single base model. This lets us rapidly validate new ideas and recipes on open weights and then bring the most promising ideas into the products people actually use.

Q3. Fara1.5-9B nearly doubles Fara-7B on Online-Mind2Web, and the 27B reportedly competes with frontier computer-use agents like Operator and Gemini 2.5 Computer Use. Where does a small computer-use model still fall short of frontier-scale ones, and does beating them at this size change how you think about the "moat" in agentic AI?

Small computer-use models still have limitations compared to frontier models. The clearest is the long tail of rare web tasks and uncommon GUI elements, which is why we're also investing in WebTailBench, a benchmark built specifically to measure progress on the tail. Fara1.5 made significant progress compared to Fara-7B on tail tasks, but we still have a gap with frontier models. Interestingly, we have been observing that small models take the right intermediate steps but may not reach the final outcome (process success is consistently higher than outcome success). This is good news because we are seeing evidence that the model can continue to improve in these situations with reinforcement learning, where the model gets to learn from its own attempts at solving the problem.
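The distinction between process success and outcome success can be illustrated with a small calculation. This is a sketch under an assumption: the interview does not define either metric, so here process success is taken to be the average fraction of correct intermediate steps per run, and outcome success the fraction of runs that reached the goal. The function name and the record layout are hypothetical.

```python
def success_rates(runs: list[tuple[int, int, bool]]) -> tuple[float, float]:
    """Return (process_success, outcome_success) for a batch of runs.

    Each run is (steps_correct, steps_total, reached_goal).
    Assumed definitions, not taken from the Fara1.5 evaluation:
      process success = mean fraction of correct intermediate steps
      outcome success = fraction of runs that reached the final goal
    """
    if not runs:
        raise ValueError("need at least one run")
    process = sum(correct / total for correct, total, _ in runs) / len(runs)
    outcome = sum(1 for _, _, reached in runs if reached) / len(runs)
    return process, outcome
```

Under these definitions, a run that gets most steps right but misses the goal still contributes to process success, which is one way the gap described above could arise.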

It's worth being clear about how we see smaller models fitting in: we think of more efficient models as an additional tool in our toolbox for building agents, not a replacement for bigger ones. Smaller models naturally shine in bounded, specialized tasks, and especially in tasks where most of the capability lives in the tool space around the model rather than in the weights themselves. Agentic models, and computer-use models in particular, fit that pattern well. Much of the work is orchestration, action, and tool use. It's a very different story for knowledge-intensive tasks, where smaller models are at a real disadvantage because raw capacity is exactly what's being tested.

As smaller models clear higher capability bars, they become a reliable tool in how we build agents, and that shift is exactly what changes how we think about the moat. The bet underneath MagenticLite is that agentic capability is more about orchestration, action, and tool use than raw knowledge, and being in the same conversation as frontier CUAs at 9B/27B is some evidence for that. The durable advantages increasingly look like the data pipeline (FaraGen and the environments behind it), the harness around the model, codesign across the app, orchestrator, and CUA, the evaluation infrastructure, and the trust and UX layer that makes any of this usable. Scale still matters at the frontier, but "biggest model wins" isn't the right framing for agentic systems. The moat is built around the model, not just inside it.

Fara1.5 scales cleanly from 4B to 27B, and the 27B model reaches 72.0 on Online-Mind2Web, ahead of Operator, Gemini 2.5 Computer Use, and Navigator. (Microsoft Research)

Q4. Fara1.5 is trained to stop and ask the user at critical points — you've defined those as missing information, ambiguity, or unauthorized irreversible actions. How much of that behavior is learned versus enforced by the harness, and how do you tune it so the agent doesn't interrupt too often?

The split is very clean once you look at the two surfaces. For Fara1.5 on the browser, the asking behavior is almost entirely learned, not enforced by the harness. ask_user is just one action in Fara1.5's action space, sitting alongside clicks, typing, and the other meta-actions like memorizing facts. At each step the model predicts a single next action, and "stop and ask" is simply one of the actions it can predict. Nothing in the runtime forces a halt: if the policy doesn't emit ask_user, the agent keeps going. So the trigger itself lives in the weights. Browser tasks are full of gray areas, and you can't write a rule that says "always pause on clicking submit" without breaking thousands of benign flows. The model has to make a judgment call: is this submit button posting a comment, or is it confirming a payment? Is this form field optional context or a credentialed login? That judgment is learned, and there's no harness override standing behind it.
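A rough sketch of what "ask_user is just one action" could look like in an agent loop. Only the name `ask_user` comes from the interview; the other action names, `run_episode`, and especially `toy_policy` are hypothetical. The keyword-based policy is a deliberate stand-in: in Fara1.5 the decision is described as learned, which is exactly what a hand-written rule cannot reproduce.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    CLICK = "click"
    TYPE = "type"
    MEMORIZE = "memorize"
    ASK_USER = "ask_user"   # the only name taken from the interview
    DONE = "done"


def run_episode(policy: Callable[[str], Action], observations: list[str]) -> list[Action]:
    """Step through observations. The loop has no rule that forces a pause."""
    taken = []
    for obs in observations:
        action = policy(obs)
        taken.append(action)
        if action in (Action.ASK_USER, Action.DONE):
            break  # control returns to the user only because the policy chose this action
    return taken


def toy_policy(obs: str) -> Action:
    """Hypothetical stand-in for the learned policy; the real judgment lives in the weights."""
    if "confirm payment" in obs:
        return Action.ASK_USER
    if "task complete" in obs:
        return Action.DONE
    return Action.CLICK
```

Swapping in a policy that never returns `Action.ASK_USER` makes the loop run straight through a payment page, which mirrors the point that no harness override stands behind the model on the browser.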

For MagenticBrain on the terminal it's different: terminal actions are categorical enough that the harness can handle them directly. Certain commands route through an explicit approval gate regardless of what the model thinks, because there's no gray area on something like a destructive file operation.
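By contrast, a harness-enforced gate for terminal commands can be a few lines of ordinary code. The sketch below is an assumption-laden illustration, not MagenticBrain's gate: the interview does not say which commands are gated, so `GATED_COMMANDS` is an invented list, and `needs_approval` and `run_terminal_action` are hypothetical names.

```python
import shlex
from typing import Callable

# Hypothetical list of gated commands; the source does not specify which ones are gated.
GATED_COMMANDS = {"rm", "rmdir", "mv", "dd", "mkfs"}


def needs_approval(command_line: str) -> bool:
    """True if the first token of the command is on the gated list."""
    tokens = shlex.split(command_line)
    return bool(tokens) and tokens[0] in GATED_COMMANDS


def run_terminal_action(
    command_line: str,
    approve: Callable[[str], bool],
    execute: Callable[[str], None],
) -> str:
    """Route gated commands through `approve`, whatever the model proposed."""
    if needs_approval(command_line) and not approve(command_line):
        return "blocked"
    execute(command_line)
    return "executed"
```

This version only inspects the first token, so wrappers such as `sudo` or shell pipelines would slip past it; a production gate would need to parse commands more carefully.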

The tuning question on Fara1.5 is important because of that gray area. In Fara-7B, we erred on the side of caution and trained the model to stop at any questionable scenario. However, this was too sensitive at times and it stopped on things that were actually benign (e.g., filling out a form field), which created friction when using the agent. For Fara1.5 we recalibrated this behavior using real-use data: scenarios where pauses were genuinely necessary stayed (transactions, credentialed sign-ins, anything that writes to the world in an irreversible way), and the noisy ones got pulled back. The internal mental model we use is: would a careful human assistant doing this task for you actually stop and ask here? If the answer is no, we don't want Fara1.5 stopping either.

During data generation for Fara1.5, we used a user simulator and a dedicated user-interaction verifier. The verifier checks whether the trajectory handled critical points correctly: did it ask when information was missing, when the task was underspecified, or before an unauthorized irreversible action? So the model is learning the right behavior and the enforcement is coming from the verifier signal, which shows up as learned behavior at runtime.
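One way to picture the verifier's job is as a check over a labelled trajectory. The interview does not describe how the user-interaction verifier is implemented, so everything here is an assumption: the `Step` record, the `critical` label (assumed to come from the data pipeline), and the `verify_interaction` function are hypothetical. The sketch counts both missed asks and unnecessary ones, since the recalibration described earlier cares about both.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str             # e.g. "click", "type", "ask_user"
    critical: bool = False  # assumed label: missing info, ambiguity, or unauthorized irreversible action


def verify_interaction(trajectory: list[Step]) -> dict:
    """Check that the agent asked at every critical point and nowhere else."""
    missed = sum(1 for s in trajectory if s.critical and s.action != "ask_user")
    unnecessary = sum(1 for s in trajectory if not s.critical and s.action == "ask_user")
    return {
        "missed_asks": missed,
        "unnecessary_asks": unnecessary,
        "passed": missed == 0 and unnecessary == 0,
    }
```

Trajectories that pass a check of this kind could be kept as training data, which is how a verifier signal at data-generation time would end up as learned behavior at runtime.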

Q5. Bigger picture: the stack is framed around agents that run on the user's own hardware. Does a co-optimized small-model stack signal a real near-term path to fully on-device agents, and what's a realistic timeline?

Yes, it does signal a real near-term path. The clearest evidence is that you can get genuinely useful agentic behavior at 9B and 14B once you codesign across the experience, the models, and the harness. Capability doesn't have to live entirely in the weights.

In the near term, I would frame it as "hybrid-first with a growing local share" rather than a switch to everything-on-device. We're used to thinking of a model as an agent, but we're already seeing a shift toward multi-model, hybrid agents by design. Picture a Fara-class CUA running on-device and escalating the harder steps or subtasks to a larger cloud model. That's exactly how you close the robustness gap I'd flag for any small model (recall the WebTailBench long-tail limitations).
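The "local first, escalate the hard parts" pattern described above can be sketched as a simple router. This is illustrative only: the interview does not say how escalation is decided, so the confidence score, the `threshold` value, and the function name `hybrid_step` are all assumptions.

```python
from typing import Callable


def hybrid_step(
    task: str,
    local_model: Callable[[str], tuple[str, float]],
    cloud_model: Callable[[str], str],
    threshold: float = 0.7,
) -> tuple[str, str]:
    """Try the on-device model first; escalate to the cloud when it is unsure.

    `local_model` is assumed to return (answer, confidence in [0, 1]).
    Returns (answer, where_it_ran).
    """
    answer, confidence = local_model(task)
    if confidence >= threshold:
        return answer, "local"
    return cloud_model(task), "cloud"
```

With a router like this, the share of steps handled locally rises as the small model's confidence becomes better calibrated, which fits the "growing local share" framing.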

It's also worth separating two claims that often get conflated: "the model runs on-device" and "the agent runs fully on-device." The latter depends on more than the model. New PC hardware, inference optimization, and enterprise-grade sandboxed environments for agents all need to mature in parallel. A fully mature local/hybrid stack is where we are heading. Satya's Build keynote this year framed it as "unmetered intelligence in every desk and every home," with everyone running a full local agentic loop, and our own Aion 1.0, a reasoning and tool-calling model built for fully local agentic capabilities, lands in the coming months.

The thread running through Ahmed's answers: you don't have to pack all of an agent's intelligence into the model. Co-design the model, the harness, and the experience together, and surprisingly small models can plan, write code, use tools, and drive a browser, at a fraction of the cost and with the option to keep your data on your own device.

  • Constraints forced a better design. A small context window pushed the team toward composable, delegating agents instead of one model doing everything: "that constraint pushed us toward a more composable architecture."

  • The moat moved outside the weights. The durable edge is the data pipeline, the harness, and the evaluation stack: "The moat is built around the model, not just inside it."

  • Knowing when to stop is learned, not scripted. On the browser, asking the user is just one action the model can choose, tuned against a simple test: "would a careful human assistant doing this task for you actually stop and ask here?"

The optimistic read: the assumption that every capable agent needs a giant model is starting to crack.

Microsoft Research's small-model-first stack shows you can get genuinely useful agentic behavior at 9B and 14B once the model, the harness, and the app are designed together, which opens a credible path to hybrid agents that run largely on your own hardware and reach for the cloud only on the hard parts. Ahmed is careful not to oversell it: the frontier still wins on the long tail of rare, messy tasks, and a fully on-device agent also needs new PC hardware and proper sandboxing to catch up. But the direction is set. With Microsoft's own Aion 1.0 local model due in the coming months, the question is shifting from how big your model is to how well you built everything around it.

About the Interview Partner

Ahmed Awadallah - Partner Research Manager, MagenticLite team, Microsoft Research AI Frontiers

Ahmed Awadallah is a Partner Research Manager at Microsoft Research AI Frontiers, where he leads teams of researchers and engineers working on large language model post-training, alignment, and efficiency, which is the work of turning frontier-level capability into models and systems that are cheaper, safer, and easier to actually run.

His teams are behind a string of widely used research releases, including the AutoGen multi-agent framework and the Magentic line of agentic systems, of which MagenticLite, MagenticBrain, and Fara1.5 are the newest generation. A recurring theme in his work is moving results out of the lab and into the products people rely on every day.

He is based in Redmond, Washington, and completed his graduate studies at the University of Michigan.
