Oct 22, 2025

When you test frontier models on yesterday’s benchmarks, you’ll miss today’s breakthroughs. You're weighing an elephant with a bathroom scale.

8 Comments

Yoemsri

Oct 22

Beautiful analysis. As Karpathy highlighted, agentic behaviors are the next frontier. I still think contextualization is a big challenge rn for improving on complex real-world tasks (window size + proper data contexts).

Wonderful post

✊

What a well-explained analysis, Nicolas 👌

Thank you for another post! I agree. As someone tinkering to push all the models, I have my own internal benchmarking system to test models as they come out and find which are best suited to which task, and there is definitely a positive slope.

As a mathematician, I can’t wait for new models to start farting out incomprehensible proofs. We’ll soon see how little we know.

As for general consumers, they are just the loudest, and it’s that simple. Those who know, know.

Most people measure the speed of the car by going the speed limit, but you’re right, they need to floor it to notice it’s a rocket. Good stuff, thanks for sharing.
