A while back I wrote something about how it’s much easier to elaborate on an idea once you have it than to know where to look in the first place. Put another way: going from 0 to 1 is hard, going from 1 to many is easier. We’re really starting to see this effect in the AI world.
A year ago (plus a bit) there was only one “GPT-4”-level model, and it wasn’t clear how hard it would be to build another one. Now there are dozens. When DeepSeek came out with theirs in December, at a training cost three orders of magnitude lower, I thought “huh, if we drop another three orders of magnitude, you can make one for about $6K by the end of next year”. Roughly a week after I had that thought, Sky-T1 came out: a claimed o1-preview-level model (so, even more interesting than GPT-4 level) for all of $450 in training. Four orders of magnitude, not in a year but in a week! Things are moving fast.
Related to this, I think there is a huge gap between what is possible and what people are actually doing. Inference costs and latency are coming down rapidly and will likely continue to do so. Are there hard problems out there that we can now solve just with brute force? When I look at the hallucination/error problem, it feels a bit like distributed service reliability. In that world, reliability didn’t come from making the hardware never fail (which is the analogue of making the LLM never hallucinate), but from managing failure gracefully with a software layer and redundancy. What if we can do the same with inference - does a lot of checking eliminate errors? Is it like uptime, where you can get more 9’s but each one costs exponentially more, and there are never any perfect guarantees (so we can get fewer mistakes but never zero, and the higher accuracy costs much more in extra inference)? Or, like the example from last week, what can we usefully do by throwing a lot of inference at harder data analysis problems? Are there regular processes at your company, where humans spend a lot of time in debate, that could be streamlined by having the models work on them?
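To make the redundancy analogy concrete, here is a minimal sketch in Python of what a “checking layer” over inference might look like: ask the same question several times and trust an answer only when a majority of independent calls agree, the way redundant replicas mask individual hardware failures. The call_model function, the 90% single-call accuracy, and the majority-of-5 policy are all made-up assumptions for illustration, not a description of any real system.

```python
import random
from collections import Counter

# Hypothetical stand-in for a single LLM inference call. A real version
# would hit an API; this one just returns the right answer 90% of the
# time and a made-up hallucination otherwise.
def call_model(question: str) -> str:
    return "Paris" if random.random() < 0.90 else "hallucination"

def answer_with_redundancy(question: str, n_calls: int = 5) -> str:
    """Ask the same question n_calls times and return the majority answer.

    If each call errs independently with probability p, the chance that
    a majority of n calls is wrong falls off sharply as n grows - the
    same math that buys redundant replicas their extra nines of uptime.
    """
    answers = [call_model(question) for _ in range(n_calls)]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return majority_answer

if __name__ == "__main__":
    trials = 10_000
    wrong = sum(
        answer_with_redundancy("What is the capital of France?") != "Paris"
        for _ in range(trials)
    )
    # A single call is wrong ~10% of the time; a majority of 5 is wrong
    # only when 3+ calls fail together - roughly 0.9% under independence.
    print(f"majority-of-5 error rate: {wrong / trials:.2%}")
```

Note the trade-off this makes explicit: the error rate drops quickly only while failures stay independent, the inference bill grows linearly with every extra call, and correlated mistakes (the model confidently wrong in the same way every time) never get caught at all. That is why it looks like chasing nines rather than reaching zero.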
One last thought for the week. I’ve been contemplating the idea of agents a bit. One thing that bothers me about the current dialog is that it violates one of my principles of product design, which is “don’t break faith with the user”. That is, if you present some model of how your product works, it should stay faithful to that model. Don’t lie to the user, in other words.
“Agent” makes a promise it can’t keep: that the AI system is a person. It’s not - it’s a program that does a good job of imitating the behavior of people, but it is missing all kinds of pieces like memory, continuity, self-awareness, and, actually, agency. The problem comes when a user accepts that framing, depends on it, and then is let down by the missing behavior. The ambition is great, and “it’s just a person, talk to it” is a really compelling model because it’s so simple, but it never ends well when we lie to users like this.
It’s tough to build products when the world is moving as fast as it is now. Stay focused on user value and understanding, and don’t forget to experiment and make some messes - there are so many new things to try now.
What are your other principles of product design?