It’s getting harder to measure just how good AI is getting
Toward the end of 2024, I offered a take on all the talk about whether AI’s “scaling laws” were hitting a real-life technical wall. I argued that the question matters less than many think: There are existing AI systems powerful enough to profoundly change our world, and the next few years are going to be defined by progress in AI, whether the scaling laws hold or not.
It’s always a risky business prognosticating about AI, because you can be proven wrong so fast. It’s embarrassing enough as a writer when your predictions for the upcoming year don’t pan out. When your predictions for the upcoming week are proven false? That’s pretty bad.
But less than a week after I wrote that piece, OpenAI’s end-of-year series of releases included its latest large language model (LLM), o3. o3 doesn’t exactly put the lie to claims that the scaling laws which once defined AI progress no longer work as well as they used to, but it definitively puts the lie to the claim that AI progress is hitting a wall.
o3 is really, really impressive. In fact, to appreciate how impressive it is, we’re going to have to digress a little into the science of how we measure AI systems.
Standardized tests for robots
If you want to compare two language models, you want to measure the performance of each of them on a set of…
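The basic idea is simple enough to sketch in a few lines of Python. This is a toy illustration, not any lab’s actual evaluation harness: the `ask_model` and `call_model` names here are hypothetical stand-ins for a real model API, and real benchmarks use far more questions and far more careful grading than exact string matching.

```python
from typing import Callable

# A shared, fixed set of question/answer pairs; both models see the same items.
benchmark = [
    {"question": "What is 17 * 24?", "answer": "408"},
    {"question": "What is the capital of Australia?", "answer": "Canberra"},
    # ...a real benchmark would have hundreds or thousands of held-out items
]

def accuracy(ask_model: Callable[[str], str]) -> float:
    """Fraction of benchmark questions the model answers exactly right."""
    correct = sum(
        ask_model(item["question"]).strip() == item["answer"]
        for item in benchmark
    )
    return correct / len(benchmark)

# Usage (hypothetical model caller): compare two models on identical items.
# score_a = accuracy(lambda q: call_model("model-a", q))
# score_b = accuracy(lambda q: call_model("model-b", q))
# print(f"Model A: {score_a:.1%}  Model B: {score_b:.1%}")
```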
