This post looks at how LLMs learn, why they're so powerful, and why in the end they are also fundamentally limited.
A typical attack on LLMs is that they are "just predicting the next word." And as it goes with these kinds of arguments, the word "just" is incredibly load-bearing.
To start, they are not predicting the next word. They're predicting the next token. Tokens are embedded in a complex, high-dimensional space, and humans are notoriously poor at grasping anything beyond 3D.
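To make the word/token distinction concrete, here's a minimal sketch using the tiktoken library with its cl100k_base encoding as one example vocabulary (other models carve text up differently):

```python
# Minimal illustration that models see tokens, not words.
# Requires `pip install tiktoken`; cl100k_base is just one example encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Unpredictability is tokenized unevenly."
token_ids = enc.encode(text)

print(token_ids)                             # integer IDs, not one per word
print([enc.decode([t]) for t in token_ids])  # the subword pieces the model actually sees
```

A single word can become several tokens, and a token can be a fragment no human would call a word.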
This attack also ignores improvements like MEAP and Forgetful Causal Masking, which perturb the model's previous context and force it to learn better attention patterns over its input. And it ignores multi-token prediction, where models must predict several upcoming tokens at once.
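As a rough sketch of the idea behind that masking technique (my own simplified take, not the paper's exact recipe): randomly hide some past tokens so the model can't count on every previous token being visible.

```python
# Simplified illustration of forgetful causal masking: start from the usual
# causal mask, then randomly "forget" a fraction of past positions so the
# model must predict the next token without leaning on the full context.
import numpy as np

def forgetful_causal_mask(seq_len: int, drop_prob: float = 0.15, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # position i sees positions <= i
    forgotten = rng.random(seq_len) < drop_prob               # columns to hide this training step
    mask[:, np.where(forgotten)[0]] = False
    np.fill_diagonal(mask, True)                              # a token always sees itself
    return mask

print(forgetful_causal_mask(8).astype(int))
```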
But more than that, it dismisses the incredible scale of both the pre-training data and the model's context window. Here, the argument takes on an anthropological twist by implicitly comparing the model's prediction to the way humans speak.
The way humans speak is famously unpredictable and non-linear. We don't continuously draw from our entire vocabulary at every moment. And we don't hold the total, fixed set of words said so far in our working memory (we could only hold 7 +/- 2 of them at a time, anyway). We deal with the evolving gestalt of the conversation, its concepts, and the relationships between them. Crucially, we actively incorporate feedback about our words and the impact they're having.
In a good conversation, it should be very rare to know the full set of words that are about to come next. Words matter, absolutely. And the wrong choice with the wrong connotations can and will actively derail things. But we mainly deal with the loose idea of where a conversation itself is going, almost independent of the specific, cumulative string of words that are carrying it.
When LLMs are predicting the next token(s), they are directly tapped into and rely on a fixed, enormous vocabulary and the previous context. There's no good analogy for this in the way humans choose the words we speak. And, during pre-training, they predict these next tokens over an amount of text that would take even the fastest speed-readers many lifetimes to pore over.
There's a parallel here to the way we think about distances on Earth versus out in space. Most people have a rough handle on how far away a mile is, or on how far apart two cities in their own country are. But there is no good, grounded mental model for how far away Pluto is, let alone our closest neighboring galaxy. Put frankly, our brains simply did not evolve to understand what an enormous distance that cosmic gap truly is, or what it would feel like to travel it. When we say that LLMs are "just predicting the next word," it's like saying we could "just" reasonably travel to Andromeda.
In short, this argument isn't technically wrong on its face. But its makers' mental model of the orders of magnitude involved is way off.
With pre-training, models learn the implicit mappings and dynamics of human syntax and communication. All they see are incredibly long sequences of tokens, whose relationships they can attend to and process thanks to their attention mechanisms. From these sequences they distill patterns. They're able to learn (or at least know) what sequences make sense in the landscape of all possible combinations of letters and punctuation.
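Concretely, the pre-training objective boils down to scoring how much probability the model puts on each actual next token. Here's a toy sketch of that idea; the hard-coded "model" below is purely illustrative:

```python
# Toy sketch of the next-token objective: average negative log-probability
# the model assigns to each true next token, given the prefix before it.
import math

def sequence_nll(model_probs, token_ids):
    """model_probs(prefix) returns a dict of candidate next token -> probability."""
    nll = 0.0
    for i in range(1, len(token_ids)):
        prefix, true_next = token_ids[:i], token_ids[i]
        p = model_probs(prefix).get(true_next, 1e-12)
        nll += -math.log(p)          # low probability on the true token = high loss
    return nll / (len(token_ids) - 1)

def toy_model(prefix):
    # Hypothetical stand-in over a 4-token vocabulary; it ignores the prefix entirely.
    return {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}

print(sequence_nll(toy_model, [2, 3, 1, 0]))
```

Training nudges the model's parameters so this number goes down across trillions of tokens.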
Imagine we have an empty string of length N. For each slot in this empty string, we could arbitrarily pick any possible token. This is already, and obviously, an enormous space of possibilities. Models initially make this tractable by having a fixed vocabulary of size V. This already constrains where they can be in the space of all possible arrangements of tokens. Pre-training effectively shows the models the slice(s) of this space that cover actual, informative human communication.
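Some back-of-the-envelope arithmetic gives a feel for how large that unconstrained space is. The values of V and N below are illustrative, not any particular model's:

```python
# Rough size of token space: V choices for each of N slots gives V**N sequences.
import math

V = 50_000   # vocabulary size, roughly the order of magnitude of common tokenizers
N = 1_000    # a modest sequence length

digits = N * math.log10(V)
print(f"V**N is about 10^{digits:.0f}")   # ~10^4699 possible sequences
# For scale: the observable universe contains roughly 10^80 atoms.
```

The slice of that space containing coherent human communication is unimaginably thin, and pre-training is what locates it.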
Once they are equipped with this broad, mechanical understanding of human speech, post-training attempts to give the models a way to meaningfully traverse this space.
Whereas pre-training happens on massive, unlabeled text datasets, post-training directly uses a much smaller set of high-quality, labeled examples.
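Schematically, a single post-training example looks something like the record below. The field names are hypothetical; real datasets and chat templates vary:

```python
# Illustrative shape of one supervised post-training example.
# Field names are made up for this sketch; actual formats differ by dataset.
sft_example = {
    "prompt": "Explain why the sky is blue in two sentences.",
    "response": (
        "Sunlight scatters off air molecules, and shorter (bluer) wavelengths "
        "scatter far more strongly than longer ones. Scattered blue light "
        "therefore reaches our eyes from every direction of the sky."
    ),
}

# During fine-tuning, the loss is typically computed only on the response
# tokens: the prompt sets up the trajectory, the response demonstrates it.
```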
The positive results of post-training are seen plainly by the millions of people who now use LLMs. Even if we grant the "just next token" argument for a moment, its implied limitations fall flat on their face when we ask post-trained models for help on a task.
Every single week, models are getting better. And every few months, they unlock new capabilities that previously seemed unthinkable to many NLP researchers. Critics are forced to move the goalposts, and one gets the impression that they're playing a losing game. "I'm not owned! I'm not owned!" they scream into the night.
The most recent model improvements, such as complex reasoners, are happening at this post-training stage. And it is here that we can look at why exactly these models become so powerful. More precisely, how they become so useful to us. It also gives us an inkling of why they are limited, and why Agents are the way forward.
We've established that pre-training situates models in the regions of token space where meaningful human communication happens. These pre-trained, or base, models can be made to generate words. But they speak in tongues and fever dreams. They're not outputting completely random, meaningless tokens like an untrained model would. But they do ramble and wander in ways that don't make sense.
This is where post-training shines. Post-training examples are carefully curated to represent traversals of this meaningful token space. They are made up of token sequences that cover useful problem and question areas. And the specific trajectories are causal, linear chains that end with the problem being solved or the question being answered.
Post-training, with billions of parameters coordinating these traversals, is where useful model capabilities emerge.
Thinking of post-training in this way gives us direct evidence of why LLMs have become so powerful. Any problem space where solutions can be faithfully represented and arrived at as linear, causal sequences of information can in theory be solved by post-training LLMs.
Linear, causal sequences cover an enormous subspace of the actions human beings take in the world. These are the actions our brains and bodies have been tuned for over millennia. However, they are far from the complete picture. Our most challenging and interesting problems are often non-linear and hierarchical. It is difficult, if not impossible, to accurately represent them as a linear flow in token space in any reasonable amount of time or space.
To tackle these problems, LLMs would need their post-training to incorporate feedback loops, and to operate intermittently at different levels of abstraction (this is part of why XML tags became such powerful aids for complex tasks). There are active areas of research trying to make this possible. But it is a young approach and the waters are still murky.
However, we do currently have a way of bootstrapping and injecting this non-linear, hierarchical processing into LLMs: Agents. I believe this is why Agents have so much hype around them. People can almost feel, if not quite put into words, that this is what we need for LLMs to reach the next level.
I explored this a bit more in my post about Agents. To summarize the takeaway: Agents give LLMs a way to break away from linear, discrete, turn-based interactions. By retrieving external context, they break the isolated linearity of a conversation. And by calling external tools (possibly even other LLMs), they both break linearity and dip into different levels of abstraction.
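A bare-bones sketch of that loop is below. The call_model function and the tools are hypothetical stand-ins, not any real framework's API; the point is only the shape of the control flow:

```python
# Bare-bones agent loop: the model either answers or requests a tool; the
# tool result is fed back in, breaking the one-shot linearity of a turn.
# `call_model` and TOOLS are hypothetical stand-ins, not a real API.

def call_model(history: list[str]) -> dict:
    # Stand-in for an LLM call. A real implementation would send `history`
    # to a model and parse its reply into {"action", "input"} or {"answer"}.
    if not any(line.startswith("TOOL_RESULT") for line in history):
        return {"action": "search", "input": "distance to Andromeda"}
    return {"answer": "About 2.5 million light-years."}

TOOLS = {
    "search": lambda query: "TOOL_RESULT: Andromeda is ~2.5 million light-years away.",
}

def run_agent(task: str, max_steps: int = 5) -> str:
    history = [f"TASK: {task}"]
    for _ in range(max_steps):
        step = call_model(history)
        if "answer" in step:                           # the model decided it's done
            return step["answer"]
        result = TOOLS[step["action"]](step["input"])  # step outside pure token space
        history.append(result)                         # feed the result back in
    return "Stopped after max_steps without an answer."

print(run_agent("How far away is Andromeda?"))
```

Each tool call is a small feedback loop: the model's trajectory through token space gets redirected by information it did not generate itself.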
This is the main power of Agents as of this writing. They give us a way of directly improving the dynamics of the trajectories that models travel in their token space. The correct conclusion is not to look at Agents as salvation in and of themselves, but to realize that weaving in non-linearity and hierarchy is crucial to solving our most complex, interesting problems. Agents simply happen to be the most direct, accessible way of doing that. At least for now.