Humans have a tendency to see faces where there aren’t any - this is called “pareidolia”. It’s hard-coded, and difficult to shake off. We have a similar “bug” that we exploit to watch movies - even though we are viewing many still frames in rapid succession, our visual system lags just enough to splice them together into apparent motion.
As we start to build experiences that are more stateful, there’s a temptation to see chatbots and other agents as “alive” in some sense - or at least continuous. If you talk to someone, you expect them to remember the recent parts of the conversation from moment to moment. A stateful chatbot does a good job of imitating this - some of the recent conversation is passed in with each turn, and even though all the underlying model is doing is token prediction over that long history, it “feels” present, intentional, and aware the way a human does.
This breaks down rapidly when you try to get an iterated process to run independently for any length of time. Many teams are trying to get “autonomous” agents to work, but there is very little out there that doesn’t have a human in the loop constantly correcting, giving context, and doing “metacognition”. It’s confusing.
There is no multi-agent. There isn’t even an “agent”. There is always only a series of disconnected prompts, with no state changing in the base model at all - all of the state is what is passed in, and no prompt is “aware” of the rest of the conversation beyond what’s passed to it.
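To make that concrete, here is a minimal sketch of what a “stateful” chat loop actually does. The `complete` function is a hypothetical stand-in for any stateless completion endpoint, not a particular provider’s API: all of the “memory” is a transcript held by the calling code and re-sent in full on every turn.

```python
def complete(prompt: str) -> str:
    """Hypothetical stand-in for a stateless completion endpoint.

    A real model would predict the next tokens of `prompt`; this stub
    just demonstrates that nothing persists between calls.
    """
    return f"(model saw {len(prompt)} characters of pasted-in history)"


def chat_loop() -> None:
    transcript: list[str] = []  # ALL of the "state" lives here, outside the model
    while True:
        user_msg = input("> ")
        transcript.append(f"User: {user_msg}")
        # Each turn is a fresh, disconnected prompt: the entire history is
        # re-sent, and the model "remembers" only what we paste back in.
        prompt = "\n".join(transcript) + "\nAssistant:"
        reply = complete(prompt)
        transcript.append(f"Assistant: {reply}")
        print(reply)
```

Delete the transcript and the “agent” has no past; fork it and you have two “agents” that are really one model reading two different documents.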
This is, to me, a bit like watching some Claymation movie and then being surprised when you see the clay model in real life and it’s not alive. “But I just saw it moving! It seemed so real!” Of course, all you saw was a series of disconnected stills, where a person moved the model “invisibly” between each still. The clay model only gave the impression of moving, and your eyes literally fooled you.
This doesn’t mean memory isn’t useful, or that chatbots can’t be made more effective. It just means that our instinct is off: things that seem like they should work - like independent action - don’t, because the underlying cognitive mechanism is different. It’s discrete, not continuous.
That can be ok - we don’t sit in a movie theatre fighting against our retinal lag - we just watch and enjoy the movie. But we aren’t like dogs barking at the TV - we know it’s an illusion, even if it’s a very fun one. It’s not good or bad that agents have this property - but it is helpful to understand what’s really happening so we can build with it, rather than being like a director who’s frustrated that they can’t find a living clay model to film.
> it is helpful to understand what's really happening so we can build with it, rather than being like a director who's frustrated that they can't find a living clay model to film.
I'm happy to hear this being expressed in public. It's not sexy to say, but it's reality.