I’ve been thinking for a while about why we are seeing so little independence in LLM-based applications, even though the models are very capable. It’s puzzling - GPT-4 and others can describe a plan very well but usually struggle to execute long or complex plans, particularly where judgment or correction is needed.
One answer to this is that it’s showing the limits of RAG. No matter what architecture you build out of inferences, each one is fundamentally isolated from the others and from the overall task. All it has is what you pull out of memory to construct it. So if you have imperfect precision or recall, the system can easily get confused - retrieval might leave out a fact, and the model makes it up again, but in a different way (say, the main character in a book now has a cat for a pet instead of a dog). That new memory gets stored, and if the system is iterating, the memories are now in conflict. Or the model may get creative and add new facts that are confusing or irrelevant. Over time, this accumulates as noise in the memory store, and the model has a harder and harder time making good decisions.
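To make that failure mode concrete, here’s a toy sketch (my illustration, not any real system): a naive memory loop where retrieval misses a fact, the model re-invents it differently, and the conflicting version gets written back into the store. The `retrieve` and `call_model` functions are placeholders, not a real API.

```python
# Toy sketch of the memory-drift problem: imperfect recall plus write-back
# leaves the store holding two contradictory "facts".

memory = [
    "The main character's pet is a dog.",
    "The story is set in a small coastal town.",
]

def retrieve(query, store, k=1):
    # Stand-in for an embedding search with imperfect recall: here it
    # happens to return the setting fact and misses the pet fact entirely.
    return store[-k:]

def call_model(prompt):
    # Placeholder for an LLM call. With the pet fact missing from the
    # prompt, the model plausibly invents a different one.
    return "The main character's pet is a cat."

context = retrieve("what pet does the main character have?", memory, k=1)
prompt = "Context:\n" + "\n".join(context) + "\n\nContinue the story."
new_fact = call_model(prompt)

memory.append(new_fact)  # the store now contains both "dog" and "cat"
print(memory)
```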
Or look at task completion. You have to know a lot to decide whether a particular sub-task of a large project is done or not. You have to understand what it is, where it fits in, what’s been done, what other tasks like it are considered done or not done (few-shot examples), and sometimes whether it is “done enough” for this stage of the process. That’s a lot to get into a single prompt, and then the model has to accurately make the same choice (and not add new sub-tasks, etc.) each time.
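As an illustration (not the author’s code), here’s roughly what a code-side “is it done?” check could look like: the code assembles the context and few-shot examples, and the model only has to render a verdict. Every name here (`call_model`, the task fields) is a stand-in.

```python
# Hedged sketch of a "done check": code gathers the project context,
# the model answers one narrow question.

def call_model(prompt: str) -> str:
    # Placeholder for an actual LLM call.
    return "DONE"

def is_subtask_done(subtask, project_goal, completed, few_shot_examples):
    examples = "\n".join(
        f"- Task: {task} -> {verdict}" for task, verdict in few_shot_examples
    )
    prompt = (
        f"Overall goal: {project_goal}\n"
        f"Already completed: {', '.join(completed)}\n"
        f"Examples of past judgments:\n{examples}\n\n"
        f"Sub-task under review: {subtask}\n"
        "Answer DONE or NOT DONE. Do not invent new sub-tasks."
    )
    return call_model(prompt).strip().upper() == "DONE"

verdict = is_subtask_done(
    subtask="Draft chapter 3 outline",
    project_goal="Write a biology textbook",
    completed=["Curriculum approved", "Chapter 1 drafted"],
    few_shot_examples=[("Draft chapter 1 outline", "DONE"),
                       ("Write teacher's guide", "NOT DONE")],
)
print(verdict)
```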
All of this has led me to suspect that a good pattern for getting more complex behavior out of LLM-based systems is (ready for this?): cognition in the model, metacognition in code.
What the heck does that mean? It means that there are tasks where the “metacognition” of the task - the pattern you follow to do it - is relatively known and stable, and can be described in code. This gives the stochastic (random) model a sort of “exoskeleton” to lean on to get through complex processes.
A simple example is something I wrote almost a year ago - the “textbook factory”. This was a little test to see if I could go from a single sentence to a whole textbook for a year of a subject (full curriculum, chapters, lessons and plans, questions and answers, teacher’s guide, and table of contents). The sentence needs to contain the level and subject, and optionally a teaching voice and learning style. The code is very simple - a Python notebook with a bit of setup and some loops that take model output (like the curriculum), turn it into lists (of chapters, say), and then perform further inference on each item in the list (generate the list of lessons, generate each lesson, generate problems and answers, and so on).
The “metacognitive pattern” here is writing the book - that’s always roughly the same; it has the same pieces and topology. The code does the work of orchestrating the roughly 600 API calls it takes to generate a textbook. The model does all the heavy lifting - parsing the initial sentence (which is totally arbitrary language) to figure out what the book should be, and generating all of the content. This program is very reliable but very rigid and limited - the metacognitive pattern is very simple. But it is robust - getting the model to do this on its own, even with long-context models, is hard; they tend to lose coherence at large numbers of output tokens.
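For a sense of what that orchestration looks like, here is a minimal sketch of the loop structure described above - the code holds the fixed shape of a textbook, and the model fills in every piece. The prompts and helper names are my assumptions, not the original notebook.

```python
# Sketch of the "textbook factory" pattern: code owns the topology,
# the model owns the content.

def call_model(prompt: str) -> str:
    # Placeholder for an LLM API call.
    return ""

def make_textbook(request_sentence: str) -> dict:
    # The model parses the arbitrary request (level, subject, voice, style).
    spec = call_model(f"Extract level, subject, voice, and learning style: {request_sentence}")

    # The model writes the curriculum; the code turns it into a list of chapters.
    curriculum = call_model(f"Write a one-year curriculum for: {spec}")
    chapters = call_model(f"List the chapters for this curriculum:\n{curriculum}").splitlines()

    book = {"curriculum": curriculum, "chapters": []}
    for chapter in chapters:
        lessons = call_model(f"List the lessons for chapter '{chapter}'.").splitlines()
        chapter_content = []
        for lesson in lessons:
            chapter_content.append({
                "lesson": call_model(f"Write the lesson '{lesson}' for chapter '{chapter}'."),
                "problems": call_model(f"Write problems and answers for lesson '{lesson}'."),
            })
        book["chapters"].append({"title": chapter, "lessons": chapter_content})

    book["teachers_guide"] = call_model(f"Write a teacher's guide for:\n{curriculum}")
    return book
```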
The need to write code for metacognition will diminish as the models get better. There may be intermediate steps that the models can generate to help - this already happens today, where ChatGPT will sometimes write Python to help itself solve problems. Right now those programs are fairly simple, but they’ll get more complex as the models increase in capability.
At some point we may develop better memory strategies, and the problems above with RAG will become less severe. Likely at the same time, we will get better at generating “metacognitive infrastructure”. Both of these seem like good areas to spend time thinking about, and both will likely only benefit from better base models.
(Yeah, I know, this one was really technical and abstract - for my non-technical readers (hi mom!), I’m sorry).
Dear Sam – I love this post! I’m a professor at the Australian National University and visiting professor at Harvard. I’ve written a series of books that use metacognitive approaches to understand complex and contested fields, such as the pushback against economic globalization (eg Six Faces of Globalization). We realised about 9 months ago that these metacognitive approaches work really well with LLMs, so we started developing AI tools to apply these approaches in policy making (https://www.dragonflythinking.net/). Our approaches share a lot in common with what you are describing here and in your Schillace Laws re the role of metacognition and the combination of semantics and syntax. Would be great to discuss.