We can build precise, scalable things from imprecise parts.
People keep asking me a version of the same question about agentic systems and LLM-based coding: if these models can be fast and productive but also make mistakes, how can we trust them? The short answer is that reliability doesn’t have to come from every part of a system. You can build highly reliable, large-scale systems out of unreliable components—as long as you engineer a strong, constrained, and well-designed foundation beneath them. We’ve done this before in other domains, and the pattern is always the same.
That pattern is straightforward: take something chaotic, cheap, and easy to scale; place it on top of a carefully engineered substrate that enforces consistency (but can be harder to scale); and let that foundation absorb or neutralize the chaos. This shows up everywhere in engineering history.
Early precision manufacturing and later distributed computing are two examples that followed this exact structure: a small, difficult-to-scale reliable core enabling a massive, easy-to-scale unreliable layer. Those systems worked not because each part was perfect, but because the right parts were engineered to be predictable.
I want to dive into those examples a bit.
Precision Engineering and the rise of the assembly line
In the early part of the industrial revolution1 (and before that), everything was built by hand. Quality varied - some craftsmen were very good at what they did - but no matter how good they were, nothing was interchangeable, because every part was made bespoke. (You can think of the current, pre-LLM software world as being largely like this: software is built carefully, slowly, and expensively, and little of it interoperates - where it does, it takes a lot of time and effort.)
We think of Henry Ford as having invented the factory production line, and in many ways he did, but what led up to that was the idea of precision - the ability to measure and machine parts precisely and repeatably. This was novel and, at first, controversial. The idea of interchangeable parts had actually been around for a while before Ford decided to try using it for his cars (Rolls-Royces, famously well-engineered cars, were all hand-built - incredibly well made, but each fitted only to itself, and accordingly very slow to make).
Ford realized that pipelining tasks made them much faster. His first experiment was building the flywheel magneto out of standard parts, which cut the assembly time down by something like 80%. This wasn't a full factory line, just a single long bench with 10 workers.
As he scaled up to the whole factory, he realized that he, too, had only been precise relative to his own standard - he got into an argument with a vendor of ball bearings who turned out to be more accurate than he was.
This all led to the core insight: the entire system had to rest on absolute precision and known tolerances for all of the parts. He could build a large, fast production line, but only if he could depend on that.
That production line is our first example of this pattern. The individual workers in the line weren’t particularly reliable - they weren’t the craftsmen of before, just ordinary folks (and in fact, there was a lot of political turmoil and consternation about this from those guilds). But there were a lot of them - they could scale quite a bit. The stable foundation was the combination of standardized parts and precise machining. Those took some work to design, and care to execute well, but once you had that foundation, you could scale the “cheaper” labor very quickly, and the whole system accordingly.
Of course, both parts of the system were continually optimized - just because the pattern worked doesn't mean it was "done". Tuning the line worker was the 1910s version of prompt engineering, and much effort went into improving that fundamentally unreliable component of the overall system.
Distributed Systems
Prior to Google2 and other cloud companies, there was a pattern very similar to the way complex objects were built in the age before precision and mass production. If you wanted to scale a service, you built (or bought) a very complex, expensive, carefully designed piece of hardware - a mainframe. These almost never went down, could have components "hot swapped", and were generally expensive and scarce, but robust.
The insight Google and others had is very much like the insight Ford had: if the software system was carefully designed and robust in the right ways (hard to do, but possible), the hardware could be flaky and chaotic. The software only had to be designed once and could then be copied as much as you wanted (much as the design of the Model T only had to happen once and could then be used to make a lot of parts), so the overall system could scale much faster: you could have all the cheap hardware you wanted, it could be flaky, and the robust foundation of containerized, software-defined-datacenter code would let the service run reliably.
Same pattern: a robust foundation that is hard to scale but can be heavily leveraged, supporting a chaotic component that is easy to scale.
OK what the hell does this have to do with AI coding?
Teams like the Amplifier team I work with at Microsoft, the StrongDM team Justin McCarthy runs, and Jessie Vincent's team are all working towards agentic coding: systems where programmers largely don't have to look at or manage code directly, and the LLMs do all of the programming. But there's a problem: LLMs are unreliable. They do scale, though - we can have them do as much work as we want. So we have one half of the pattern above: a highly scalable but chaotic component that can do some of the work.
So now we need the harder-to-scale but more robust foundation. In the examples above, this was a bit easier to see because in both cases it's "software" that is the robust foundation and hardware that is the chaotic piece (I put that in quotes because the software of Henry Ford's era was the designs and practices of precision - but those ideas were "software" in some sense). In our current example, though, both pieces are software.
People get tripped up here because everything involved is "just software." Some is written by humans, some by LLMs, and it all lives in the same repo, so it looks like one homogeneous thing. But the layers aren't the same at all - they're different kinds of software with different jobs:
Cognitive software: the LLM layer that proposes solutions, explores, rewrites, hallucinates, generates.
Metacognitive software: the engineered substrate that tests, constrains, attacks, filters, and enforces consistency.
The cognitive layer is the chaotic part that scales. The metacognitive layer is the robust, stable foundation that constrains and enables that scale.
The metacognitive layer is kind of like the "gauge blocks" of agentic coding - the small, carefully engineered, deterministic parts that everything else has to align against. The LLM layer can be unreliable; that's OK as long as it sits on that foundation. The metacognitive layer is reliable3. It's the substrate that forces coherence.
A concrete example: one LLM proposes a diff, but nothing gets committed until a separate “attacker” model tries to break it, a consistency checker validates the interfaces, and the diff survives a run in a deterministic “shadow world” built from the real system’s APIs. The builder doesn’t need to be reliable. The attacker doesn’t need to be reliable. The deterministic substrate underneath them does. That’s the reliable core enabling the chaotic layer above it.
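To make the shape of that gate concrete, here is a minimal sketch in Python. Everything in it is hypothetical - propose, attack, interfaces_consistent, and survives_shadow_world are illustrative stand-ins rather than any team's actual API - but it shows where the reliability has to live: in the deterministic checks, not in the models.

```python
# Hypothetical sketch of the gate described above; propose, attack,
# interfaces_consistent, and survives_shadow_world are illustrative
# stand-ins, not a real API.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Diff:
    summary: str
    patch: str


def gated_commit(
    task: str,
    propose: Callable[[str], Diff],                 # chaotic: the builder LLM
    attack: Callable[[Diff], list[str]],            # chaotic: an adversarial LLM proposing failure scenarios
    interfaces_consistent: Callable[[Diff], bool],  # deterministic: interface/consistency checker
    survives_shadow_world: Callable[[Diff, list[str]], bool],  # deterministic: replay against a shadow of the real APIs
    max_attempts: int = 3,
) -> Optional[Diff]:
    """Return a diff only if it clears every deterministic check; otherwise None."""
    for _ in range(max_attempts):
        diff = propose(task)                     # may be sloppy, wrong, or hallucinated
        if not interfaces_consistent(diff):      # the substrate rejects incoherent shapes outright
            continue
        scenarios = attack(diff)                 # the attacker can also be wrong...
        if not survives_shadow_world(diff, scenarios):  # ...the shadow world is the judge
            continue
        return diff                              # only now is a commit allowed
    return None
```

The only pieces that have to be trusted are the interface checker and the shadow world; the two models can be retried, swapped out, or thrown away without weakening the guarantee.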
Another example: some teams enforce “proof by code” — every generated change must come with tests that demonstrate it works rather than rhetorical explanations from the model. The tests then run inside a frozen environment that rejects anything nondeterministic or incoherent. Again: the creativity can be sloppy, but the substrate is not.
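Here is a sketch of what that enforcement might look like, again under assumed conventions: a generated change lands in a directory, must bring its own pytest files, and those tests run with common sources of nondeterminism pinned, then run a second time to catch flakiness. The policy details are illustrative, not a description of any particular team's setup.

```python
# Hypothetical "proof by code" gate: a generated change must ship with
# passing tests, and the tests must pass twice in a pinned environment.
import os
import subprocess
from pathlib import Path


def run_frozen(test_file: Path) -> int:
    """Run one test file with common sources of nondeterminism pinned."""
    env = dict(os.environ, PYTHONHASHSEED="0", TZ="UTC")
    return subprocess.run(["pytest", "-q", str(test_file)], env=env).returncode


def accept_change(change_dir: Path) -> bool:
    """Reject a change with no tests, failing tests, or tests that only pass sometimes."""
    tests = sorted(change_dir.glob("test_*.py"))
    if not tests:
        return False                                   # no proof, no merge
    if any(run_frozen(t) != 0 for t in tests):
        return False                                   # the proof must actually pass
    # A second identical run guards against flaky, time- or order-dependent tests.
    return all(run_frozen(t) == 0 for t in tests)
```

In a real system the frozen environment would be something like a container or a replayed service rather than a couple of environment variables, but the shape is the same: the model's explanation counts for nothing; the deterministic run is the only evidence the substrate accepts.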
This part reminds me a bit of the early days of the cloud, when different teams would try different scaling, failover, and consistency strategies, and the patterns without strange edge cases gradually emerged and took over. Because of the historical engineering pattern above, and the day-to-day progress and experimentation I can see, I am pretty confident this will work even though we aren't "done" yet - many teams are working on those foundations, some answers are already emerging, and it seems only a matter of time before we have a clear understanding of how to build robust, complex, large-scale software production from flaky LLM components sitting on top of a solid foundation.
The criticism misses the point
Since the early days of this idea, there has been a lot of discussion and criticism of it - fair enough. But I think the criticism misses the point. The thing to look at isn't whether agentic coding can do all of the things we do with software today - that would be nice, but in neither of my examples above does the new system replace the old one entirely. There are still bespoke craftsmen, handmade objects, and mainframes.
The point instead is that agentic coding is not about eliminating mistakes; it's about engineering the substrate so that mistakes don't matter. We can see the pattern, we think the models have gotten to a point of stability where they are usable4 for a system like this, and we can see first light.
The interesting part is all of the new value that can be built from this proven historical engineering pattern, and the new disruptions and opportunities it will bring5.
1. A lot of this section comes from a great book, "The Perfectionists" by Simon Winchester, which you should read if you want to understand this history better.
2. As far as I know - I'm sure someone had some of these ideas earlier, but I don't know the history well.
3. What's in it? Lots of things are emerging: adversarial reviewers, robust shadow testing, proof by code, enforced agentic workflows, etc.
4. The chaotic part of the system can't be infinitely chaotic: if flaky hardware failed every 10 seconds, it would be pretty hard to build a service.
5. And I didn't really say it in this already long essay, but I think this isn't just true for software. I think ultimately any knowledge work will be amenable to this same pattern. What that means will look different for different fields - but once we get past "work slop" and we build on robust platforms, similar transformations should be possible across a whole range of fields.

