The nature of intelligence seems to be about building blocks: smaller building blocks (less intelligent entities) unlock bigger building blocks (more intelligent entities), rather than following some law of conservation of intelligence that says you cannot get more intelligent things out of less intelligent things.
There are several pieces of empirical evidence for this. First, we can use a relatively dumb model to label a dataset, and then train a smarter model on that dataset.
We see this in how DeepSeek trained R1-Zero from V3, then used R1-Zero to generate cold-start reasoning data, and then used that cold-start data to further train DeepSeek-R1. One could argue that this process merely changed how the model behaves (it reasons more often) rather than how smart it is. But the outcome is still good: the model unlocked a reasoning capability and can achieve a lot more.
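To make the label-then-train pattern concrete, here is a minimal sketch using scikit-learn stand-ins: a small "dumb" teacher pseudo-labels a large unlabeled pool, and a bigger student trains on those labels. Every model choice, dataset, and size here is an illustrative assumption, not anyone's actual pipeline.

```python
# Sketch of the label-then-train pattern: a weak teacher pseudo-labels
# unlabeled data, and a larger student trains on those labels.
# All models, sizes, and data here are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy stand-in for a large unlabeled pool plus a small labeled seed set.
X, y = make_classification(n_samples=20_000, n_features=40, n_informative=10, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=2_000, random_state=0)
X_seed, X_pool, y_seed, _ = train_test_split(X_rest, y_rest, train_size=500, random_state=0)

# Step 1: train a small, "dumb" teacher on the tiny labeled seed set.
teacher = LogisticRegression(max_iter=1_000).fit(X_seed, y_seed)

# Step 2: let the teacher pseudo-label the large unlabeled pool.
pseudo_labels = teacher.predict(X_pool)

# Step 3: train a bigger student model on the teacher's labels.
student = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=200, random_state=0)
student.fit(X_pool, pseudo_labels)

# Compare both against the true labels on held-out data; depending on the
# setup, the student can match or even exceed its teacher.
print("teacher:", teacher.score(X_test, y_test))
print("student:", student.score(X_test, y_test))
```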
Another piece of empirical evidence comes from multimodality. Once we train a dumb text-image model, it can be used for a variety of tasks like labeling and filtering datasets. The reason this works is that although the dumb model cannot generate good enough pictures, it does really well on these easier tasks. And doing well on these easier tasks lets it help build the next, better model. By the same token, a decent text-image generator can then be used to create an even better one.
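As a concrete sketch of the filtering use case: a weak image-text model can score (image, caption) pairs so that only well-matched pairs survive into the next model's training set. This sketch swaps in a CLIP-style similarity model as the filter, which is one common way this gets done; the threshold value is an assumption for illustration.

```python
# Sketch: filter (image, caption) pairs with a weak image-text model.
# The similarity threshold below is an illustrative assumption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.25) -> bool:
    """Keep a pair only if the image and caption embeddings are similar enough."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Cosine similarity between the image and text embeddings.
    sim = torch.nn.functional.cosine_similarity(
        outputs.image_embeds, outputs.text_embeds
    ).item()
    return sim >= threshold

# Usage: dataset = [(img, cap) for img, cap in dataset if keep_pair(img, cap)]
```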
Now why does this work at all? What seems to happen is that capabilities sit behind intelligence thresholds, and reaching a threshold is actually all that matters! The key is that there are many different capabilities, and unlocking each one requires a different level of intelligence (parameter count may be a decent proxy for the potential a model can reach). Once a model crosses the threshold for a capability, it doesn’t really matter how dumb the model is at other things; on that particular thing it’s good at, it can still be as useful as a super smart model.
A dumb model reaches some intelligence threshold that unlocks a certain capability, such as labeling a dataset as positive versus negative. And because it reached that threshold, we can use it as a building block for a more intelligent model, by using its labeled dataset to make the next model better. So intuitively, intelligence feels like Legos, where bigger pieces are built out of smaller ones.
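To put a number on that threshold intuition, here is a small, self-contained simulation in which every number is an illustrative assumption: a labeler that is only 80% accurate is dumb in absolute terms, but it is above the threshold for useful labeling signal, and a model trained on its noisy positive/negative labels can recover most of the true structure, sometimes even exceeding its teacher.

```python
# Sketch: a mediocre labeler can still be above the "useful signal" threshold.
# All numbers here are illustrative assumptions, not measurements.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Simulate an 80%-accurate "dumb" labeler by flipping 20% of the true labels.
flip = rng.random(len(y_train)) < 0.20
noisy_labels = np.where(flip, 1 - y_train, y_train)

model = LogisticRegression(max_iter=1_000).fit(X_train, noisy_labels)

# Evaluated against the *true* labels, the student can beat its 80% teacher,
# because random label noise largely averages out during training.
print("teacher accuracy: 0.80 (by construction)")
print("student accuracy:", round(model.score(X_test, y_test), 3))
```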