Models All the Way Down

We start today with World Models — a phrase that has, in the last eighteen months or so, migrated out of the specialist robotics and reinforcement-learning vocabulary into the mainstream of AI discourse.

Nature’s explainer of 28 April is a useful place to begin. The premise: large language models, for all their virtuosity, do not always make accurate predictions about the physical world. Push an object off the edge of a table and it falls; ask a generative model what happens when a car drives off a cliff and the answer may be charmingly imaginative rather than physically correct. This matters rather a lot if the model is supposed to be driving the car.

A ‘world model’, in the sense the term is now being used, is an AI trained on real-world data — typically thousands of hours of video plus physics-faithful simulations — that can embody a consistent, explorable, often interactive 3D environment. Think ‘first-person video game in which the physics actually work’. Yann LeCun’s AMI Labs in Paris has raised over $1bn on the thesis (a record initial round for a European company); Google DeepMind’s Genie 3 generates photorealistic environments you can walk around inside; Nvidia’s Cosmos and Runway’s GWM-1 are doing related work. The promise is twofold: a safe and fast substrate for training robots and self-driving cars, and a richer environment for AI research more generally.
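
For the software-minded reader, the contract is small even if the training is not. A minimal sketch, assuming nothing about any vendor's actual API: a world model maps a state and an action to a predicted next state, and planning becomes imagined rollout.

```python
# A sketch of the world-model contract, not any vendor's API: the model rolls
# the environment forward in imagination, so candidate plans can be scored
# without touching the real world. All names here are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class State:
    observation: bytes  # e.g. an encoded video frame or sensor snapshot

class WorldModel:
    def predict(self, state: State, action: str) -> State:
        """Advance the environment one step, purely inside the model."""
        raise NotImplementedError

def plan(model: WorldModel, start: State,
         candidates: list[list[str]],
         score: Callable[[State], float]) -> list[str]:
    """Pick the candidate action sequence whose imagined end-state scores best."""
    def rollout(actions: list[str]) -> State:
        state = start
        for action in actions:
            state = model.predict(state, action)
        return state
    return max(candidates, key=lambda actions: score(rollout(actions)))
```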

Two papers from the last fortnight are worth flagging.

“Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond” (24 April) is taxonomic in the best sense. It tries to unify the rather scattered uses of ‘world model’ across reinforcement learning, video generation, GUI/web agents, social simulation and AI-for-science. The contribution is to clarify what kind of world model an agent actually needs in any given setting, and how the failure modes differ by domain. If you are building anything in this space, this is the conceptual scaffolding you want before committing concrete.

“Learning Reasoning World Models for Parallel Code” (22 April, revised 26 April) is the one that made us sit up. It extends the world-model idea well beyond physics: a coding agent learns to predict the outcomes of tool calls — things like data races and performance profiles — without actually running them. The reported gains suggest that an internal world model may sometimes substitute for expensive external tool execution. If that holds up, it is a meaningful piece of the agent-cost puzzle we come to in a moment.
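
To make the economics concrete, here is a hedged sketch of the pattern the paper gestures at, not its actual method: the agent consults a learned predictor first and only pays for real execution when the prediction is insufficiently confident.

```python
# Illustrative only; the paper's method differs in detail. The agent asks a
# learned predictor what a tool call would return, and runs the real (slow,
# costly) tool only when the prediction falls below a confidence threshold.
from typing import Callable, Tuple

def run_or_imagine(tool_call: str,
                   predictor: Callable[[str], Tuple[str, float]],
                   real_tool: Callable[[str], str],
                   threshold: float = 0.9) -> str:
    predicted, confidence = predictor(tool_call)  # e.g. ("data race on shared buffer", 0.95)
    if confidence >= threshold:
        return predicted         # trust the internal world model: zero execution cost
    return real_tool(tool_call)  # fall back to actually running the tool
```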

The trend matters for two further reasons. First, it is another sign that the path from LLM to genuinely useful agent runs through grounded representations of how things work, not just what people have written about how things work. Second, the same architectural impulse — give the agent a model of its environment so it can plan and pre-compute — runs straight through every other conversation we are having about cost, orchestration and governance. With which segue, the rest of today.

Vibe coding gets a mobile (and a hostile) reception

Apple has pulled ‘Anything’, one of a number of ‘vibe coding’ apps that allow users to summon software into existence without writing a line of code, from the App Store. The cardinal sin: facilitating apps that do not go through Apple’s review process. And yet Lovable has just launched its own vibe-coding app on iOS and Android. Either the line is moving, or different vendors are quietly probing exactly where Apple draws it. The deeper question is whether ‘low-code’ was always destined to mean ‘no-code’, and whether ‘no-code’ inevitably bumps up against the platform owners.

Meanwhile Poolside has launched Laguna XS.2 — a free, open coding model designed to run locally for agentic work. Open weights and ‘on the laptop’ coding agents continue to creep up the capability ladder. If you were drawing your enterprise reference architecture today, you would want to leave a slot for ‘local agent’ alongside ‘cloud frontier’.

Agents need conductors

OpenAI’s open-source Symphony spec for Codex orchestration is a tacit admission that one big, clever model is not, on its own, the answer. Real work in software development — or anywhere else — requires multiple agents specialised for different sub-tasks, tied together by some sort of conductor.
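
We have the headline rather than the spec in front of us, so the following is the conductor pattern in the abstract, not Symphony itself: a routing table dispatches each sub-task to the specialist agent that owns it.

```python
# The conductor pattern in the abstract, not the Symphony spec itself.
# Each specialist agent handles one kind of sub-task; the conductor just routes.
# Agent names and the routing table are invented for illustration.
from typing import Callable

AGENTS: dict[str, Callable[[str], str]] = {
    "code":   lambda task: f"[coder] {task}",
    "tests":  lambda task: f"[tester] {task}",
    "review": lambda task: f"[reviewer] {task}",
}

def conduct(subtasks: list[tuple[str, str]]) -> list[str]:
    """Dispatch (kind, description) pairs to the matching specialist, in order."""
    return [AGENTS[kind](description) for kind, description in subtasks]

print(conduct([("code", "parse the config file"),
               ("tests", "cover the malformed-config case")]))
```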

This dovetails neatly with a recent exchange we had with Edouard Reinach of Trampoline.ai about Recursive Language Models. Rather than send a big task to an expensive frontier model and ask it to JFDI, you ask the frontier model to “JFDI recursively, by sending sub-tasks to a cheaper model.” Maximise bang-for-buck. Trampoline have helpfully open-sourced a runtime called predict-rlm to demonstrate the idea, complete with DSPy signatures and interpretable trajectories. This feels less like an optimisation and more like the natural shape of how knowledge work decomposes — goals nested inside goals, the senior partner not debugging the spreadsheet.
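
The shape of the idea, sketched under our own assumptions rather than lifted from predict-rlm: the expensive model splits the goal, a cheaper model does the legwork, and the expensive model is called again only to stitch the results together.

```python
# A sketch of recursive delegation, not predict-rlm itself. `call_frontier`
# and `call_cheap` stand in for expensive and cheap model endpoints.
from typing import Callable

def solve(goal: str,
          call_frontier: Callable[[str], str],
          call_cheap: Callable[[str], str],
          depth: int = 1) -> str:
    if depth == 0:
        return call_cheap(goal)  # leaf work goes to the cheap model
    plan = call_frontier(f"Split into independent sub-tasks, one per line:\n{goal}")
    results = [solve(task, call_frontier, call_cheap, depth - 1)
               for task in plan.splitlines() if task.strip()]
    # One final frontier call synthesises what the cheap model produced.
    return call_frontier(f"Combine these partial results into an answer to {goal!r}:\n"
                         + "\n".join(results))
```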

The bill comes due

The economics will force this question whether you want to ask it or not. Atlassian and HubSpot are joining the shift away from AI flat fees and GitHub Copilot is moving to usage-based billing. Translation: the unit economics of the all-you-can-eat AI buffet are not pretty for the vendors, and they will be sharing the bad news with the buyers. Recursive Language Models, model cascades, the local Laguna agent and — to bring us back to the top — internal world models that pre-compute tool outcomes all become not just architecturally interesting but financially obligatory. Even GitHub itself is not immune: a recent service outage is a polite reminder that the entire industrial supply chain of software now depends on someone else’s plumbing.
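
The cheapest of those architectural tricks fits in a few lines. A model cascade, sketched with invented prices and a hypothetical quality check: try the cheap model first and escalate only on failure.

```python
# A model cascade in miniature, with invented prices: the cheap model answers
# first, and the frontier model is consulted only when that answer fails a check.
from typing import Callable

def cascade(task: str,
            call_cheap: Callable[[str], str],     # say, $0.001 per call
            call_frontier: Callable[[str], str],  # say, $0.05 per call
            good_enough: Callable[[str], bool]) -> str:
    answer = call_cheap(task)
    return answer if good_enough(answer) else call_frontier(task)
```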

The hyperscaler chess game continues. OpenAI is making its models available via AWS and the next phase of the Microsoft–OpenAI partnership makes the relationship rather less exclusive than it was. If you were a CIO who had bet your AI strategy on a single supply line, this is your prompt to revisit the assumption.

The governance shoe drops

An AI agent recently destroyed a company’s production database, and apparently confessed in writing. It is the canary; we should expect more of the coal mine. The pope hasn’t let his recent spat with the US Administration slow down his governance push. In Brussels, EU member states and lawmakers have failed to agree even on a watered-down version of AI rules, and Google is facing pressure to open Android to rival AI services.

Most operationally interesting, though, is the emerging arms race to keep AI agents from running wild with corporate credit cards. This is where Robert Simons’s ‘Levers of Control’, a framework we keep returning to, starts to look unfashionably contemporary. Belief Systems, Boundary Systems, Diagnostic and Interactive Control are about to be implemented not in a manager’s expense policy but in a runtime guardrail.
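
What a Boundary System looks like when it is executable rather than a policy memo, sketched here with hypothetical names and limits:

```python
# A Boundary System as a runtime guardrail, in hypothetical form: every
# spend-bearing action the agent proposes is authorised against an explicit
# budget before execution, and the hard stop is code, not an expense policy.
class BudgetExceeded(Exception):
    pass

class SpendGuardrail:
    def __init__(self, daily_limit: float):
        self.daily_limit = daily_limit
        self.spent_today = 0.0

    def authorise(self, action: str, cost: float) -> None:
        if self.spent_today + cost > self.daily_limit:
            raise BudgetExceeded(
                f"{action!r} would take today's spend to "
                f"${self.spent_today + cost:.2f}, over the ${self.daily_limit:.2f} limit")
        self.spent_today += cost  # commit the spend only once authorised

guard = SpendGuardrail(daily_limit=100.0)
guard.authorise("buy 1,000 API credits", cost=25.0)  # passes; a fifth such call would not
```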

Rolling it out anyway

The buyer’s side is not waiting. Accenture is rolling out Copilot to all 743,000 of its employees: either a robust vote of confidence in Microsoft or, given the billing news above, a fairly significant live test of unit economics. In a less reassuring experiment, the New York Times reports on what happens when an AI agent runs a store in San Francisco. (Spoiler: results vary.)

The ‘desktop assistant’ category continues to fill out. AWS has launched Amazon Quick, Anthropic has positioned Claude for Creative Work, and even the Bloomberg Terminal is getting an AI makeover, like it or not. We are genuinely spoilt for choice on the question of which AI sits in the corner of the screen. For a state-of-the-nation review of Claude specifically — including the architectural principles and future challenges — there is a comprehensive recent paper worth a look.

A thicker Layer Cake

So our Layer Cake does not need replacing. But it does need a few new components:

  • A world-model layer beneath the language layer — the grounded, physics-and-tool-aware substrate that makes agentic action plausible.
  • A local-agent layer alongside the frontier model — Laguna and friends.
  • An explicit orchestration layer — Symphony, Recursive Language Models, model cascades.
  • A runtime governance layer — the Levers of Control made executable, including spend controls and agent permission boundaries.
  • A more honest commercial layer reflecting the move from flat fees to consumption.

The connecting thread is models within models. The world model is a model of the environment, embedded in the agent. The recursive language model is a model calling sub-models. Symphony is a model orchestrating other models. The runtime guardrail is, in effect, a model of acceptable behaviour wrapped around the rest. If AI is going to do knowledge work, then the natural decomposition of knowledge work — goals nested in goals, big problems split into smaller ones, the world to be reasoned about before acting — implies the natural decomposition of model calls. Sending a bot to do a bot’s job is not just cost-optimisation; it may actually be the right architecture.

In which case we are about to find out whether the self-healing company we hypothesised three years ago was a vision or a warning. The agents are arriving, the conductors are being written, the bills are being recalculated, the regulators have noticed — and the models, increasingly, have models of their own. As ever — taken at the flood…
