Mark James
How Long Can an AI Work Before Failing?
Start measuring AI by how long it can work before it reliably breaks. METR's new 'time-horizon' study gives leaders a clean way to size capability, reliability, and risk.
Time-Horizon as the executive's metric
Based on METR's "Measuring AI Ability to Complete Long Tasks" - translating their research into executive action
TL;DR: Start measuring AI by how long it can work before it reliably breaks. METR's new "time-horizon" study gives leaders a clean way to size capability, reliability, and risk—and to forecast when specific workflows will tip from "assist" to "automate."
The metric (in plain English)
METR (Model Evaluation & Threat Research) just dropped a 45-page paper with the most pragmatic AI metric I've seen. They call it "time-horizon"—the length of tasks (by typical human hours) an AI agent completes with X% reliability.
So a 50% time-horizon of 1 hour means "this system finishes the kinds of tasks a professional takes ~1 hour to do—about half the time." It's simple, comparable across models, and maps to how businesses actually work: in hours of human labor.
The chart on page 2 shows the big picture: the length of tasks frontier agents can do at ~50% reliability has doubled roughly every seven months since 2019.
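If you want to see the mechanics rather than take the chart on faith, here is a minimal sketch of how a horizon gets read off: fit a logistic curve of success probability against (log) human task length, then invert it at 50% or 80%. The data and the scikit-learn choice below are ours for illustration, not METR's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task results: human time in minutes, and whether the agent succeeded.
task_minutes = np.array([2, 4, 8, 15, 30, 60, 120, 240, 480])
succeeded    = np.array([1, 1, 1, 1,  1,  0,  1,   0,   0])

# METR fits success against (log) human task length; this mirrors that idea.
X = np.log2(task_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, succeeded)

def horizon(p):
    """Task length (minutes) at which the fitted success probability equals p."""
    b0, b1 = model.intercept_[0], model.coef_[0, 0]
    return 2 ** ((np.log(p / (1 - p)) - b0) / b1)   # invert the logistic curve

print(f"50% horizon ~ {horizon(0.5):.0f} min, 80% horizon ~ {horizon(0.8):.0f} min")
```

Run against a real task suite, the same two numbers fall out: a 50% horizon you prototype around and a shorter 80% horizon you staff around.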
What METR just found (the punch-list)
• Frontier capability in 2025 ≈ "about an hour" at 50% reliability. Claude 3.7 Sonnet's 50% horizon ≈ 59 minutes; OpenAI's o1 ≈ 39 minutes (method: logistic fit over human-timed tasks ranging from seconds to 8-hour projects).
• But "high-confidence" work is shorter. The 80% horizon is ~5× shorter (Claude 3.7 Sonnet ≈ 15 minutes). Use 80% when you need dependable output without lots of review.
• Progress rate: 50% horizon doubling ~every 212 days (95% CI 171–249). 2024–25 may be faster than the long-run trend.
• Forecast: If this trend holds, a 1-month horizon (~167 work-hours) lands ~2028–2031 (see the back-of-envelope check after this list). Treat it as directional—useful for planning, not a promise.
• Why it's improving: better tool use, more robust reasoning, and notably less looping/repeating failed actions versus earlier models; failure modes are shifting.
• Messy reality check: On "messier" tasks (underspecified, shifting context, limited feedback), success drops—but the improvement trend stays similar (i.e., progress doesn't disappear in the real world, it's just offset by a constant factor).
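The forecast bullet is just compounding. Here is the back-of-envelope version, using only the numbers quoted above (a simplification for intuition, not METR's full extrapolation):

```python
import math

# Assumptions pulled from the punch-list above: ~1-hour 50% horizon today,
# ~167 work-hours in a month, and a ~212-day doubling time (95% CI 171-249).
current_horizon_hours = 1.0
target_horizon_hours  = 167.0
doubling_time_days    = 212

doublings    = math.log2(target_horizon_hours / current_horizon_hours)  # ~7.4 doublings
years_needed = doublings * doubling_time_days / 365.25                  # ~4.3 years
print(f"~{doublings:.1f} doublings -> ~{years_needed:.1f} years, i.e. roughly 2029")
```

Swap in the ends of the confidence interval (171 and 249 days) and you get roughly 2028 to 2030, which is where the paper's wider 2028–2031 window comes from.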
Why this helps execs
Most AI scorecards muddle capability and reliability. Time-horizon separates them.
- Capability: "What length of work can it sometimes do?" (50% horizon)
- Reliability: "How short must the task be to be usually right?" (80% horizon)
- Risk/Cost: "What review or guardrails do we add to make 50%-grade work usable?"
This is the mental model you can use in board packs, budget asks, and vendor comparisons—without drowning everyone in benchmark acronyms.
How to use time-horizon in your roadmap
- Clock your tasks. Inventory key workflows by typical human hours: 2–5 min, 15–30 min, 1–2 hr, 4–8 hr. Use representative samples, not hero tasks.
- Pick your bar:
- Assist tier: compare to 50% horizon (cheap throughput + human QA).
- Autonomy tier: compare to 80% horizon (low-touch).
- Engineer for reliability. METR shows systems do better with clear feedback loops and tool access (tests, linters, retriable steps). Add them; your effective horizon jumps.
- Account for messiness. Mark tasks that are under-specified, punitive on errors, or coordination-heavy. Expect a constant performance discount—but not a different trend. (See messiness split charts on pp. 14–16.)
- Set guardrails by horizon gap. If your target task is 60 minutes and your model's 80% horizon is 15, break the work into 4 verified chunks (tests/specs per chunk) instead of betting on a single shot (see the sketch after this list).
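That last rule of thumb is easy to operationalize. A tiny helper like this (our sketch, not anything from the paper) makes the chunking decision explicit:

```python
import math

# Size each verified chunk to the model's 80% horizon and count the chunks needed.
def plan_chunks(task_minutes: float, horizon80_minutes: float) -> dict:
    chunks = math.ceil(task_minutes / horizon80_minutes)
    return {
        "chunks": chunks,                          # verified stages (tests/specs per chunk)
        "chunk_size_min": task_minutes / chunks,   # target size of each stage
        "mode": "low-touch autonomy" if chunks == 1 else "chunked + verified",
    }

print(plan_chunks(task_minutes=60, horizon80_minutes=15))
# {'chunks': 4, 'chunk_size_min': 15.0, 'mode': 'chunked + verified'}
```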
What crosses the threshold next quarter?
Given today's ~15-minute 80% horizon and ~40–60-minute 50% horizon, you can plan low-touch autonomy for:
- Data munging & reformatting (CSV/JSON transforms, light schema mapping) with built-in checks.
- Ticket triage & routing where success criteria are explicit and verifiable.
- Simple dbt / SQL refactors behind unit tests.
- Ops playbook lookups & templated comms with structured prompts + QA.
And you can plan human-in-the-loop assist for 45–60 minute chunks such as bug isolation with tests, analytics one-pagers, and small internal scripts—paired with checkers and a review step. (See the logistic fit plots on p. 11 for how success decays as human task length grows.)
Caveats leaders should actually care about
• Horizon ≠ everything. It's measured on software-like tasks. External validity to messy, high-context work is imperfect—but METR's internal PR experiment suggests horizons align better with low-context contractors than maintainers, which is still useful for cost models. (Baseliners were 5–18× slower than repo maintainers.)
• Reliability matters more than peaks. The 80% horizon is what you staff around; the 50% horizon is what you prototype around. Expect ~5× gap today.
• Trend could bend. Agency training and inference-time compute may speed things up; compute constraints may slow them. Plan with ranges. (See sensitivity on p. 18.)
A simple scorecard you can lift
For each candidate workflow:
- Human time bucket: ☐ <15m ☐ 15–60m ☐ 1–4h ☐ >4h
- Messiness flags (tick all that apply): ☐ underspecified ☐ punishing errors ☐ real-time coordination ☐ limited verification ☐ novelty required. (More flags → discount expectations.)
- Verification in place? ☐ tests ☐ checkers ☐ sampling QA
- Target mode: ☐ low-touch autonomy (80%) ☐ assist + review (50%)
- Guardrails: ☐ chunked stages ☐ retries ☐ tool access
Run this once; you'll see a clear queue of near-term automation wins and a handful of "de-risk with better specs/tests" items.
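If you would rather run the scorecard as a triage step instead of a spreadsheet, here is one way to encode it. The Workflow structure, the target_mode routing, and the messiness multiplier are our construction (and the multiplier is a deliberately crude stand-in), not anything METR publishes:

```python
from dataclasses import dataclass, field

@dataclass
class Workflow:
    name: str
    human_minutes: float                                  # typical human time for the task
    messiness_flags: list = field(default_factory=list)   # e.g. "underspecified", "limited verification"
    has_verification: bool = False                        # tests / checkers / sampling QA in place?

def target_mode(wf: Workflow, horizon50: float = 50.0, horizon80: float = 15.0) -> str:
    """Rough triage against today's ~50-minute 50% horizon and ~15-minute 80% horizon."""
    # Crude stand-in for the "messiness discount": each flag inflates effective task length.
    effective = wf.human_minutes * (1 + 0.5 * len(wf.messiness_flags))
    if effective <= horizon80 and wf.has_verification:
        return "low-touch autonomy (80%)"
    if effective <= horizon50:
        return "assist + review (50%)"
    return "de-risk first: tighter specs/tests, chunk the work"

print(target_mode(Workflow("ticket triage & routing", 10, has_verification=True)))
# -> low-touch autonomy (80%)
```

The exact thresholds matter less than the habit: re-run the triage each time the published horizons move, and watch which workflows change bucket.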
Why now
Time-horizon turns a fuzzy conversation ("Is AI ready?") into an operational one ("Which X-minute tasks flip this quarter, which flip next?"). It's also the forecasting hook into our upcoming report: we'll map common SME workflows to current and projected horizons so you can budget, staff, and de-risk accordingly.
Source: METR, "Measuring AI Ability to Complete Long Tasks." See the trend chart on p. 2, reliability gap on p. 12, messiness analysis on pp. 14–16, and 1-month extrapolation on pp. 18–23 for the details behind the numbers.
Part of The Intelligence Shift
Subscribe to The Intelligence Shift: Join the newsletter