In the past couple of months I’ve contributed to a series of papers — RE-Bench, HCAST, and, bringing them together, Measuring AI Ability to Complete Long Tasks — that I think have painted a clearer picture of AI progress. (This blog post on the latter paper provides helpful background.)
Here are some takes informed by the outputs of these projects.
(It’s totally wild but I think robustly true that) recent historical progress in AI has followed remarkably steady trends.
We see this with scaling laws relating loss to data and parameter count, but also in the central RE-Bench and Measuring AI Ability to Complete Long Tasks plots.
My experience inside METR has only made this take feel more viscerally correct. I remember staring at best-of-{2,4,6,8,16} results on RE-Bench, hearing predictions that the lines would start to bend, then extending to {32,64,128+} and seeing no signs of bending. I remember contributing to the plot of different models' task length horizons over calendar time, having no a priori reason for thinking the line should be straight (in log space!) beyond that being true for all lines in AI, then seeing the extremely straight output. (Followed by a lot of unsuccessful poking at this central result, documented in the paper.)
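For readers unfamiliar with what "straight in log space" amounts to in practice, here is a minimal sketch of the kind of fit involved: regress log2(task horizon) on release date, so the slope reads off doublings per year. The numbers in the snippet are made up for illustration; they are not the paper's data.

```python
# Hedged sketch: checking whether horizon growth looks exponential over calendar
# time by fitting a line to log2(horizon) vs. release date.
# The (date, horizon) pairs below are hypothetical, NOT the paper's measurements.
import numpy as np

# (release year as a decimal, 50%-success task horizon in human-minutes) -- illustrative only
points = [
    (2020.5, 1.0),
    (2021.5, 3.0),
    (2022.5, 8.0),
    (2023.5, 25.0),
    (2024.5, 70.0),
]
years = np.array([p[0] for p in points])
log2_horizon = np.log2([p[1] for p in points])

# Slope of the fitted line = doublings per year; a straight fit here is what
# "steady exponential progress" means in this context.
slope, intercept = np.polyfit(years, log2_horizon, 1)
print(f"doublings per year: {slope:.2f}")
print(f"implied doubling time: {12 / slope:.1f} months")
```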
Of course, none of this is to say that curves will not bend.
(But in which direction?)
The uptick at the end of the task length over time plot is maybe understated.
Belying the popular narrative, progress on the task length over time plot in 2024 appears to be slightly steeper than pre-2024 progress, though the change in trend looks to be within statistical noise. Regardless, I think people are still sleeping on how surprising recent reasoning model results have been. This progress seems to me to have two important implications. First, that we have figured out (scalable?) pipelines for (loosely speaking) having LLMs teach themselves to perform better in environments where it is possible, albeit unlikely, that they take performant actions or get correct answers, steepening the task length curve. Second, that the slope of the post-training task length curve has been steepening as pre-training compute increases.
On the other hand, slowing pre-training scaling might imply a slower exponential over calendar time. I'm unclear on how these factors might net out.
Current performance is underrated, even at the frontier.
Tweet on this in the context of elicitation effort.
Twitter thread on this in the context of human comparisons.
Model intelligence is really confusing.
I think it just is correct that models both show signs of being able to complete long-horizon, very challenging tasks successfully, and also that they are totally unhelpful for some workflows, owing to lack of context on the user's work, poor UI, and lack of robustness.
In some ways intelligence has turned out to be a lot more straightforward than I might have expected, yet the topology of model capabilities is a lot weirder.
My current guess is that at least poor UI and lack of robustness will be solved by just throwing more compute + human effort at the problem.
Timelines…
It seems to me that you should have somewhat short timelines:
That said, I think I have longer timelines than my colleagues. Some unclear gestures as to why this might be the case:
Overall, my hand-wavey guess is something like: 6-month doubling time, starting from an 8x lower base -> about 5 years to 1-month AGIs; this will feel deficient in some respects, but sufficient for crazy things to start happening.
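To unpack that arithmetic, here is a back-of-the-envelope sketch under my stated assumptions (the specific numbers are my own illustrative choices, not precise claims from the papers): a current measured horizon of roughly 1 hour of human work, discounted 8x to get the lower base; "1-month AGI" meaning a horizon of about one working month (~167 hours); and a 6-month doubling time.

```python
# Back-of-the-envelope check of the "about 5 years" figure.
# Assumptions (illustrative, not claims from the papers):
#   - current 50%-success horizon ~1 human-hour, discounted 8x to an effective base
#   - target horizon of one working month (~167 hours = 40 h/week * ~4.2 weeks)
#   - horizon doubles every 6 months
import math

current_horizon_hours = 1.0
effective_base_hours = current_horizon_hours / 8   # the "8x lower base"
target_horizon_hours = 167.0                       # ~1 working month
doubling_time_months = 6.0

doublings_needed = math.log2(target_horizon_hours / effective_base_hours)
months_needed = doublings_needed * doubling_time_months
print(f"doublings needed: {doublings_needed:.1f}")           # ~10.4
print(f"years to 1-month horizon: {months_needed / 12:.1f}")  # ~5.2
```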
METR is fantastic and you should join.