METR benchmark shows AI models now complete tasks requiring 14 hours of human labor.



A viral chart circulating widely in artificial intelligence discussions measures how long a task, counted in the time it would take a human, AI models can complete, with the latest models from companies like Anthropic now handling jobs that take people up to 14 hours or more. This "time horizon" benchmark, prominently featured on the front page of METR's website, has reached 4.6 as of February 2026, showing AI's rapid progress from 30-second tasks in late 2022 to multi-hour endeavors today. According to Bloomberg's Odd Lots podcast and related coverage, METR (short for Model Evaluation and Threat Research) created this chart to gauge AI's potential for autonomous, complex work, raising alarms about risks like recursive self-improvement, where machines could evolve without human oversight.

METR, led by President Chris Painter, focuses on safety by evaluating how well AI handles real-world, intricate problems that demand sustained effort, such as computing or multi-step reasoning. As explained in the Odd Lots episode, the chart tracks the longest tasks an AI can reliably finish, using human performance as a baseline—essentially asking how many hours of human-equivalent work the model can replicate. Painter and his team emphasize that this isn't about AI replacing all jobs but understanding its scaling limits, especially amid fears of "escape velocity" where improvements accelerate uncontrollably. The chart's upward swoosh, often likened to Moore's Law for AI, has fueled both excitement and debate, dominating online discourse as reported by Bloomberg Technology.
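As a rough illustration of the measurement idea described above, the Python sketch below fits a logistic curve of success against the logarithm of human completion time, then reads off the task length at which predicted success falls to a chosen threshold. The task data, the 50% default threshold, and the function names are all illustrative assumptions, not METR's code or results.

```python
import math


def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit p(success) = sigmoid(a + b * x) with plain gradient descent."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += p - y
            grad_b += (p - y) * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def time_horizon_minutes(human_minutes, successes, threshold=0.5):
    """Task length (in human-minutes) at which predicted success equals threshold."""
    logs = [math.log2(m) for m in human_minutes]
    center = sum(logs) / len(logs)
    xs = [v - center for v in logs]  # centering keeps gradient descent well-behaved
    a, b = fit_logistic(xs, successes)
    # Solve sigmoid(a + b * x) = threshold for x, then undo the centering and the log.
    logit = math.log(threshold / (1.0 - threshold))
    x_star = (logit - a) / b
    return 2 ** (x_star + center)


# Hypothetical results: how long each task takes a human, and whether the AI succeeded.
HUMAN_MINUTES = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480]
SUCCESSES = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

horizon = time_horizon_minutes(HUMAN_MINUTES, SUCCESSES)
print(f"Estimated 50% time horizon: about {horizon:.0f} human-minutes")
```

Raising the threshold (for example `threshold=0.8`) yields a shorter horizon, which is one reason a single headline number depends heavily on what "reliably" is taken to mean.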

While the chart captivates AI enthusiasts for suggesting exponential gains—doubling capabilities every few months—experts caution against oversimplification. Derek Thompson, in his analysis, debunks myths like the idea that it predicts AI takeover of all human labor, noting METR's metrics are narrow and task-specific, not a universal job proxy. This nuance matters because hype around the chart drives investment and policy, yet a more grounded view predicts chaotic progress rather than smooth economic dominance. METR's work, as discussed by Painter, prioritizes threat research over productivity boasts, distinguishing it from public perceptions of pure advancement.
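The "doubling every few months" reading can be sanity-checked with the endpoints quoted at the top of this article: 30-second tasks in late 2022 and roughly 14-hour tasks as of February 2026. The Python sketch below assumes steady exponential growth between those two points and takes "late 2022" to mean November 2022. Both are simplifying assumptions, and the function name is made up for illustration.

```python
import math
from datetime import date


def doubling_time_months(start_seconds, end_seconds, start, end):
    """Implied doubling time, assuming smooth exponential growth between two points."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    doublings = math.log2(end_seconds / start_seconds)
    return months / doublings


START = date(2022, 11, 1)  # assumption: "late 2022" read as November 2022
END = date(2026, 2, 1)     # February 2026, as quoted in the article

result = doubling_time_months(30, 14 * 3600, START, END)
print(f"Implied doubling time: {result:.1f} months")
```

Under those assumptions the script prints an implied doubling time of about 3.6 months. Shifting the assumed start date by a month in either direction moves the answer only slightly, but the calculation says nothing about whether growth was actually smooth between the two endpoints.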

The chart's virality underscores broader AI trends, like ChatGPT's explosive adoption: it hit 100 million users faster than the internet or cellphones did, per data visualizations from analysts like Deb Liu. Consumer AI usage continues climbing, with ChatGPT now at hundreds of millions of weekly active users, though estimates vary because some people hold multiple accounts. Stanford's AI Index highlights foundation models' dominance and benchmark shifts, which helps explain why METR's metric resonates amid skyrocketing investments and global competition led by the U.S.

This benchmark influences everyone from developers to regulators, as it signals when AI might operate independently in high-stakes areas like research or infrastructure. Affected parties include AI firms racing to scale safely, workers in knowledge tasks facing disruption, and policymakers debating oversight. Next steps involve METR refining its measurements (Painter details ongoing adjustments to the human baselines) and broader scrutiny, with shows like Odd Lots bringing expert insights to the public. As AI charts like this proliferate, they demystify progress but demand careful interpretation to avoid inflated expectations or undue alarm.