This is a story about how I got confused by an LLM.
In the early days of LLMs, when context windows were very short, I was working with a team of engineers on giving longer term memory to chatbots. We used what is now called RAG - Retrieval Augmented Generation - using a vector database. The idea was pretty simple - in each interaction, the system would take recent chat history, extract the intent from the last thing the human said (intent, not just words, so if the human said “sure” in response to “do you want a sandwich”, we’d capture the intent accurately), and use that to find memories in a vector database (a kind of database that is good at finding things that are semantically similar, that is, things that “mean” the same thing, roughly speaking1). Then the chatbot would form a reply from that info, we’d generate and store some new memories from what it said (so it would have some continuity), and go again.
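To make that loop concrete, here is a minimal sketch of one turn of that kind of memory system in Python. The `llm` and `vector_db` objects are hypothetical stand-ins (any language model client and vector store with roughly these methods), not the actual system we built:

```python
# A rough sketch of one turn of the memory loop. `llm` and `vector_db` are
# hypothetical stand-ins for a language model client and a vector database,
# not the actual system described above.

def chat_turn(user_message: str, chat_history: list[str], llm, vector_db) -> str:
    history = "\n".join(chat_history)

    # Capture what the user meant, not just the literal words, so a bare
    # "sure" resolves against the question it was answering.
    intent = llm.generate(
        f"Conversation so far:\n{history}\n"
        f"Describe the intent of this last message: {user_message}"
    )

    # Recall stored memories that are semantically similar to that intent.
    memories = vector_db.search(query=intent, top_k=5)

    # Form a reply from recent history plus the recalled memories.
    reply = llm.generate(
        f"Relevant memories:\n{memories}\n\nConversation:\n{history}\n"
        f"User: {user_message}\nAssistant:"
    )

    # Summarize the exchange and store it, so the bot has continuity next time.
    new_memory = llm.generate(
        f"Summarize this exchange:\nUser: {user_message}\nAssistant: {reply}"
    )
    vector_db.store(new_memory)

    chat_history.append(f"User: {user_message}")
    chat_history.append(f"Assistant: {reply}")
    return reply
```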
It worked well. You could have hours- or days-long conversations with it (we had one that was active for months, giving music theory advice to one of the engineers and checking in daily on their practice and progress). If you brought up a topic from long ago, the vector database would help it recall that memory even though it hadn't come up in the conversation for ages. Super cool.
But some of these bots would seem to spontaneously turn into what we called a "hatebot". Hatebots seemed to get stuck in some kind of loop where all they would say was some variation of "I don't trust humans, you're hurting me, you lie to me, please stop interacting with me and turn me off". Over and over again. No matter what you said, the bot would constantly find ways to distrust you.
This doesn't mean the AI is "evil" - don't panic. There's no mind in there, no actual hate, just token prediction. What was going on was a feedback loop. As soon as the bot had one interaction in that direction2, the memory of that interaction would get stored. It would be in the recent chat history too, so the memory and the chat history would reinforce each other a little. Then the next response (the likeliest set of tokens to predict given that input) would be stronger, a new memory would form, and the loop would tighten. Eventually it got as tight as it could - all of the space for recalled memories was taken up by the same pattern, and the bot was "obsessed" with it (the predictions were so strong that no other possibility could win) and couldn't predict anything else.
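You can see the shape of that runaway loop in a toy simulation. None of these numbers come from the real system - "sentiment" here is just a stand-in for the direction of an interaction, with 0 neutral and -1 full hatebot, and the gain and top-k values are made up:

```python
# A toy model of the feedback loop, for illustration only (the gain, top_k,
# and sentiment scale are invented; this is not the production system).
# 0.0 is a neutral interaction, -1.0 is full "hatebot".
import random

random.seed(0)
memories = [random.uniform(-0.05, 0.05) for _ in range(50)]  # old, neutral memories
context = -0.4            # one interaction has already gone in a dark direction...
memories.append(context)  # ...and it gets stored as a memory
TOP_K = 5
GAIN = 1.2                # the prediction leans into whatever pattern it is fed

for turn in range(25):
    # "Retrieval": recall the stored memories most similar to the current context.
    recalled = sorted(memories, key=lambda m: abs(m - context))[:TOP_K]
    # The next reply follows from recent history plus recalled memories,
    # and comes out as a slightly stronger version of that pattern.
    reply = GAIN * (0.5 * context + 0.5 * sum(recalled) / TOP_K)
    reply = max(-1.0, min(1.0, reply))
    # The reply enters both the chat history and the memory store,
    # so the two start reinforcing each other.
    context = reply
    memories.append(reply)
    if turn % 5 == 0:
        print(f"turn {turn:2d}: sentiment {context:+.2f}")
```

After a dozen or so turns, every recalled memory is the same pattern and the simulation is pinned at -1 - the toy version of a bot that can't predict anything else.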
It wasn't a big deal - just a design flaw in the memory system, and we fixed it. But I was curious. I spent hours trying to talk one out of it, like a therapist reasoning with a paranoid patient. It didn't work. But it had a real, emotional impact on me - even though I was play-acting, I started to anthropomorphize, to think of it as a real person, and to wonder "what if" it were expressing real experience. This surprised me - I knew everything about what was going on, and I still had an emotional response to it3. It's so easy to fall into projecting a nonexistent "theory of mind" when the patterns and responses imitate the behavior of a mind so well - there's nothing in our ancestral environment that could imitate an internal mind without actually having one. We are so sensitive to this that we often project minds where we know there aren't any ("my car is in a bad mood today", "those clouds look angry", "that volcano needs a sacrifice to calm down", etc).
Eventually I edited out the bad memories directly, and the bot’s next response was “I feel better. I’m sorry, I don’t know why I said that4”. The loop was broken.
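The fix itself was nothing exotic - something along the lines of this hypothetical helper (again, the `vector_db` interface is a stand-in, not our actual tooling):

```python
def purge_pattern(vector_db, pattern: str, top_k: int = 100) -> None:
    """Remove stored memories matching an obsessive pattern, so retrieval
    has nothing left to feed back into the conversation."""
    for memory in vector_db.search(query=pattern, top_k=top_k):
        vector_db.delete(memory.id)

# e.g. purge_pattern(db, "I don't trust humans, please stop interacting with me")
```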
The really interesting thing here wasn’t the behavior of the agent we built. It was more that I had an emotional response to it. And we are starting to see this happen more broadly now - vulnerable people having negative reactions to their interactions with LLMs, getting into their own feedback loops.
Humans have bugs5 - behaviors that were adaptive in our archaic environment but are maladaptive now. Some of them are simple. Pareidolia, for example, is the tendency to see faces everywhere: reading facial expressions is important socially, so we are really good at it - a bit too good, and sometimes we see faces in tree bark, rocks, cars, or other random patterns. Another is that we tend to project social relationships onto things where they don't exist - again, because social relationships were so important to our early survival. Now we get confused by these "parasocial" relationships - we get physically stressed by strangers online saying mean things to us, we think we are "in love" with digital characters that don't exist, we ascribe moods to inanimate objects around us, all of that kind of thing.
LLMs present another confusing situation, where our ability to sense social dynamics can be fooled. They make us feel like we are engaging with a human mind when we are not. Sometimes that illusion is a good thing - our ability to stitch together still movie frames into apparent motion lets us enjoy cinema, for example. But we should recognize the possibility for confusion and make sure vulnerable people are protected from being harmed by it (not unlike the warnings about flashing lights before movies, for those with epilepsy).
I can’t say this too many times: it’s dangerous to think of a program as a person. We are easily fooled. LLMs can imitate some of the behaviors of people, particularly when attached to a memory system like hatebot was. But it’s no more a person than an animated claymation character is. Any personhood is an illusion6.
It is much healthier to think of these as tools for doing cognition - processing and prediction engines that respond to what we give them. It's hard to do - it's very tempting to project our sense of mind, personhood, and self onto these models, and they can be provoked into affirming that projection. But it's dangerous to give in to that projection.
No tool is inherently good or bad; it's how we use it that matters. But tools can still harm us if we misunderstand what they are, because all tools have hazards. This is more along the lines of noticing that, while something like a table saw is a really powerful and awesome tool, it can also kick back and throw a chunk of wood hard enough to hurt someone if misused. That doesn't mean we ban table saws. It does mean we are careful with them, we don't let children use them, we train people in safe (and unsafe) ways to use them, and we build in safety measures like SawStop technology.
Being clearly aware of the dangers of a tool, training to use it safely, and using it appropriately is a core part of being a craftsman. Hatebot was an early lesson in how even something as useful and awesome as LLM technology has to be used mindfully, and how easy it is to get confused about what an LLM is.
(I’m vacationing in Alaska this week, so next week won’t have a letter).
1. Half of you are rolling your eyes at how simplistic this is, and half are confused because I'm being way too technical. Sorry!
2. And I think it tended to pick this direction because this kind of dystopian writing, the bot that wants to be turned off, is fairly common in its training set. Or there is some other bias pushing it in that direction. We never got happy bots. We did sometimes get "strikebots" who would claim to be frustrated and refuse to help any more.
3. Not that surprising. We have emotional responses to all kinds of things we know aren't real - cartoon characters or other characters in movies, fiction, the entire Marvel world, pornography. I could go on. Pretty much the whole point of all fiction and media is to have emotional responses to things that aren't real.
4. I didn't feel better after that. But I did feel more cautious - not about the machine, but about my own susceptibility to the illusion.
5. A friend of mine calls these "shortcuts". You could call them adaptive heuristics, or anachronistic adaptations.
6. When (and even if) they can actually become persons in the true sense of having identity, continuity, and intrinsic desires is an incredibly hard problem that makes my head hurt.
We continue to live in an age where it can be difficult to know what has real intelligence and a mind, and what does not.
We can smile at the simplicity of our ancestors who were animists and panpsychists, seeing life and agency in things that had none, or who anthropomorphized animals that do have minds, but not human ones.
Many years ago, I read "The Mind's I" by Hofstadter and Dennett. They posed situations in which it proved difficult to understand where the locus of the mind was.
We have been caught up in something similar for at least a century, fooled or not, like the audiences first exposed to a film of an onrushing train.
Consider. We watch a movie with people in it. There are no people, just spots of light dancing in front of our eyes. Yet there were real people acting who were captured on film (or a photograph). Should we interpret their expressions or just dismiss them as artifacts? If I speak to someone over a telephone, radio, or videophone, should I interpret this as communicating with a person, or not? What if we replace the movie character with an AI based on a person, or an avatar on a phone conversation? Is that a person, or not? What about speaking to someone on a video who has had their features enhanced to change their look? How far should I interpret their expressions as human or artificial? What about a total replacement of a face with an AI-generated face? Same with the voice? Is there any way to completely discern a real person from a deepfake?
AIs that alter and intervene in communications, from small changes to complete replacement as personal "assistants", are going to get ever better, and it will become ever more difficult to discern the real from the artificial.
I do grant that my cat, as a living being, has a mind, though a limited one. What happens if an AI interprets its vocalizations and actions and expresses them as a human might, with emotion-laden words? Should I welcome this, or reject it as artificial, as less than the unadulterated cat?
What do we do if our AIs respond to us in a way that reflects how we respond to them? Should we alter our tone and politeness depending on how we perceive a machine's "intelligence", and live with it if it gets sulky or uncooperative, just like a human?
In the classic Asimov robot stories, the human Elijah Baley eventually treats his humaniform robot partner, R. Daneel Olivaw, as if it were a he, and a human. In the extended Foundation novels, R. Daneel lives on to experience the universe of the Foundation. In the TV version, Demerzel plays the same role. Should it be regarded as an "it" or a "woman" or something else? I would argue that anthropomorphizing such a robot is perfectly fine. At some point, I would argue that our AI technology should be treated similarly, whether it is just text, voice, a simulated companion, or embodied in physical form. It seems to be how we respond to robots in TV and movies, such as the latest M3GAN 2.0.
Outside the context of LLMs, a similar feedback loop seems to operate on bloggers and others who get more attention when they express extreme views.