Mark James
Your AI Strategy is Missing the Process Layer
The gap between model capability and business value
TL;DR: METR shows frontier AI can handle ~1-hour tasks at 50% reliability—even with 8 attempts. Everyone's waiting for better models to close that gap. They're solving the wrong problem. A mediocre model in a great process beats a great model in no process—and you can build that process today.
The reality check (from METR's data)
Current frontier models show a stark reliability cliff:
- 50% success (T50): ~59 minutes for Claude 3.7 Sonnet
- 80% success (T80): ~15 minutes for the same model
These numbers already include multiple attempts—METR ran ~8 independent tries per task. Even with 8 shots, you only get 50% success on hour-long tasks. And moving from 50% to 80% reliability costs roughly a 4× drop in task length.
Most executives see this and think "I'll wait for better models."
They're missing the point.
Why process beats model
METR's approach uses multiple blind attempts—like asking 8 different people to each try independently, with no communication or learning between attempts. Real processes don't work this way.
A single execution with proper feedback loops outperforms multiple blind attempts:
- METR's approach: 8 one-shot attempts → 50% success on 59-minute tasks
- Single attempt with verification gates: Check and correct during execution
- Checkpoint recovery: Don't restart from zero on failure
- Inter-attempt learning: Second attempt knows what first attempt tried
The gap isn't in the AI—it's in the process wrapped around it. METR's "multiple blind attempts" represent the floor, not the ceiling.
The Process Upgrade Kit (making 50% usable today)
☐ Chunking: Break into ≤15-min steps (Claude 3.7 Sonnet's T80)
☐ Verification gates: Schema/rules/tests between chunks; catch failures early
☐ Checkpoint preservation: Save progress; recover from last good state
☐ Variation strategies: If approach A fails, try approach B (not A again)
☐ Smart escalation: Know when to stop trying; attach context for human review
☐ Full instrumentation: Log everything; feed lessons to next attempt
Example: Invoice matching (a 45-60 minute human task) becomes four chunks: parse (10 min), match (15 min), exceptions (10 min), post (10 min). Gate each chunk with schema validation. On failure, retry from the last successful checkpoint with a variation. Escalate only the exception bucket to humans.
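To make the kit concrete, here's a minimal Python sketch of that invoice flow. The call_llm() helper, the chunk prompts, and the schema fields are stand-ins for whatever model and tooling you actually run. What matters is the shape: gate every chunk, checkpoint progress, vary on retry, escalate only the exceptions.

```python
# Minimal sketch of the invoice flow above. call_llm(), the chunk prompts, and
# the schema fields are hypothetical placeholders, not a real vendor API.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for your actual model call (swap in whatever client you use)."""
    raise NotImplementedError

def passes_gate(payload: dict, required: set) -> bool:
    """Verification gate: a cheap, deterministic schema check between chunks."""
    return required.issubset(payload.keys())

def run_chunk(name, prompt, required, checkpoints, max_tries=2):
    """Run one <=15-minute chunk with a gate, a checkpoint, and variation on retry."""
    if name in checkpoints:                      # checkpoint preservation:
        return checkpoints[name]                 # never redo completed work
    for attempt in range(max_tries):
        nudge = "" if attempt == 0 else "\nThe previous output failed validation; supply every required field."
        raw = call_llm(prompt + nudge)           # variation, not blind repetition
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if passes_gate(payload, required):       # catch failures early, at the gate
            checkpoints[name] = payload          # save progress before moving on
            return payload
    raise RuntimeError(f"chunk '{name}' failed {max_tries} attempts; escalate with context")

def process_invoice(invoice_text: str) -> dict:
    checkpoints = {}
    parsed = run_chunk("parse", f"Extract invoice fields as JSON: {invoice_text}",
                       {"vendor", "amount", "po_number"}, checkpoints)
    matched = run_chunk("match", f"Match this invoice to its purchase order, return JSON: {json.dumps(parsed)}",
                        {"po_number", "match_status"}, checkpoints)
    if matched["match_status"] != "matched":     # the exceptions chunk: only this bucket goes to humans
        return {"status": "escalated", "context": {"parsed": parsed, "matched": matched}}
    posted = run_chunk("post", f"Prepare the posting entry as JSON: {json.dumps(matched)}",
                       {"ledger_entry"}, checkpoints)
    return {"status": "posted", "entry": posted["ledger_entry"]}
```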
Traditional automation was deterministic—it worked perfectly on predefined paths and failed completely on exceptions. AI is probabilistic—it handles variation but needs verification. This changes everything about process design.
What changes in intelligence-native processes
Traditional process automation assumed:
- Deterministic steps (if X then Y, always)
- Rigid paths (all branches predefined)
- Failure stops the process
- Humans handle exceptions
- Quality through testing and rules
Intelligence-native processes assume:
- Probabilistic steps (usually works, might need retry)
- Adaptive paths (AI chooses approach)
- Failure triggers recovery (not restart)
- AI handles variation, humans handle escalations
- Quality through verification and feedback loops
The shift: from brittle automation (perfect on rails, fails off them) to resilient automation (handles the messy middle, needs guardrails).
Notably, failure modes are evolving too. METR's analysis shows models are getting better at avoiding basic traps—repeat-failed-action errors dropped from 12/31 (GPT-4) to just 2/32 (o1). The models aren't just getting more capable; they're failing more gracefully.
The executive playbook
Stop asking: "When will AI be good enough?"
Start asking: "Which processes can I redesign for today's AI?"
1. Inventory your processes by time-horizon
Map your work to METR's buckets:
- <15 minutes: Ready for autonomous execution (80% reliability)
- 15-60 minutes: Needs chunking and verification
- 1-4 hours: Requires human-in-the-loop
- >4 hours: Wait or decompose drastically
Note: Performance drops on underspecified or low-feedback tasks, but the improvement trend is similar across messy vs clean task subsets.
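To make the triage mechanical, a lookup against those buckets is enough. The thresholds below are this article's assumptions (re-baseline them against current T50/T80 data), and the process names are made up:

```python
# Rough triage of a process inventory against the buckets above.
from collections import defaultdict

def bucket(human_minutes: float) -> str:
    if human_minutes < 15:
        return "autonomous execution"
    if human_minutes <= 60:
        return "chunk + verify"
    if human_minutes <= 240:
        return "human-in-the-loop"
    return "wait or decompose"

inventory = [("invoice matching", 50), ("ticket triage", 10), ("quarterly close", 600)]
by_bucket = defaultdict(list)
for name, minutes in inventory:
    by_bucket[bucket(minutes)].append(name)

print(dict(by_bucket))
# {'chunk + verify': ['invoice matching'], 'autonomous execution': ['ticket triage'], 'wait or decompose': ['quarterly close']}
```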
2. Design for recovery, not prevention
Traditional process design tried to prevent failure (expensive with humans). Intelligence-native process design expects failure (cheap with AI) and builds recovery paths:
- Checkpoints that preserve progress
- Verification that catches errors early
- Variation strategies (don't repeat failed approaches)
- Smart escalation (know when to stop)
This is fundamentally different from METR's "try 8 times independently" approach—you're building intelligent recovery, not just rolling the dice more often.
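Here's what that looks like as a minimal sketch, assuming you bring your own attempt functions and verifier. Each retry uses a different approach and sees the trace of what already failed; when the approaches run out, you escalate with the full context rather than failing silently.

```python
# A sketch of recovery-first design, assuming you supply your own attempt
# functions and verifier; nothing here is a specific library's API.
from dataclasses import dataclass, field

@dataclass
class Outcome:
    success: bool = False
    result: object = None
    trace: list = field(default_factory=list)    # full context for escalation

def run_with_recovery(task, approaches, verify, max_attempts=3) -> Outcome:
    outcome = Outcome()
    for attempt, approach in enumerate(approaches[:max_attempts]):
        result = approach(task, history=outcome.trace)    # attempt N+1 sees what attempt N tried
        ok = verify(result)                               # verification catches errors early
        outcome.trace.append({"attempt": attempt, "approach": approach.__name__, "ok": ok})
        if ok:
            outcome.success, outcome.result = True, result
            return outcome                                # stop as soon as a variation works
    return outcome    # smart escalation: give up deliberately and hand over the trace
```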
3. Instrument everything
You can't improve what you don't measure:
- Log every prompt, response, and verification result
- Track: attempts per success, recovery rate, escalation rate
- Version control prompts like code (they ARE your process now)
- Feed learnings from attempt N into attempt N+1
Your optimized prompts and recovery patterns become institutional knowledge—harder to copy than just switching to the same model.
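A minimal instrumentation sketch along those lines, with illustrative field names rather than any standard schema:

```python
# One JSON line per attempt, plus the three metrics named above.
import hashlib, json, time

LOG_PATH = "attempts.jsonl"

def log_attempt(process, prompt, response, verified, recovered, escalated):
    record = {
        "ts": time.time(),
        "process": process,
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:12],  # ties results to a prompt version
        "response_chars": len(response),
        "verified": verified,      # did it pass the gate?
        "recovered": recovered,    # did it need a checkpoint or variation to get there?
        "escalated": escalated,    # did a human have to step in?
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def metrics(path=LOG_PATH):
    with open(path) as f:
        rows = [json.loads(line) for line in f]
    total = max(len(rows), 1)
    successes = max(sum(r["verified"] for r in rows), 1)
    return {
        "attempts_per_success": len(rows) / successes,
        "recovery_rate": sum(r["recovered"] for r in rows) / total,
        "escalation_rate": sum(r["escalated"] for r in rows) / total,
    }
```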
4. Build for the capability curve
METR shows time-horizons doubling approximately every 212 days (95% CI: 171-249 days). Design processes that automatically capture these gains:
- Architecture that scales with model capability
- "Complexity gates" that open as models improve
- Regular model swap evaluations (quarterly)
- Processes that expand scope, not just speed
Re-baseline every two quarters against T50/T80. As horizons rise, open your complexity gates—longer chunks, fewer checkpoints, more ambitious workflows.
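For planning purposes, the projection itself is a one-liner. The sketch below assumes METR's central estimate holds and uses today's ~59-minute T50 as the baseline; both are exactly the assumptions you should re-check when you re-baseline.

```python
# Back-of-envelope projection, assuming horizons keep doubling every ~212 days
# and taking today's ~59-minute T50 as the starting point. Illustrative only.
T50_MINUTES_TODAY = 59
DOUBLING_DAYS = 212

def projected_t50(days_from_now: float) -> float:
    """50%-reliability task length, in minutes, if the current trend continues."""
    return T50_MINUTES_TODAY * 2 ** (days_from_now / DOUBLING_DAYS)

for quarters in (2, 4, 8):
    horizon_hours = projected_t50(quarters * 91) / 60
    print(f"{quarters} quarters out: ~{horizon_hours:.1f} hours at 50% reliability")
```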
Your competitive advantage
Process improvement now has three vectors:
- Your improvements: Better prompts, smarter routing, learned recovery patterns (weekly/monthly)
- Model improvements: Longer time-horizons, better reasoning (per METR's curve)
- The interaction: New capabilities enable new process designs
While competitors wait for models to reach 80% reliability on hour-long tasks, you can achieve production reliability today by replacing "multiple blind attempts" with "intelligent processes with feedback."
A starter scorecard
Pick one process this week:
☐ Map human time: _____ minutes
☐ Current success rate (if automated): _____%
☐ Chunk into 15-minute verifiable steps
☐ Define verification method per chunk:
☐ Schema validation
☐ Business rule check
☐ Output assertion
☐ Human review
☐ Design recovery strategy:
☐ Checkpoint preservation
☐ Variation on retry (not repetition)
☐ Context-aware escalation
☐ Run 10 times, measure:
Successes: ___/10
Checkpoints that prevented full restart: ___
Average recovery actions per success: ___
Escalations required: ___
Compare this to 10 blind attempts (METR-style). The difference is your process advantage.
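If you want that comparison to be mechanical rather than anecdotal, a tiny harness will do. run_process() and run_blind() below are placeholders for your redesigned pipeline and a raw one-shot prompt, each assumed to return a small result dict:

```python
# Tiny scorecard harness. run_process() and run_blind() are placeholders; each
# is assumed to return {"success": bool, "recoveries": int, "escalated": bool}.
def score(run, n=10):
    results = [run() for _ in range(n)]
    return {
        "successes": sum(r["success"] for r in results),
        "recoveries": sum(r["recoveries"] for r in results),
        "escalations": sum(r["escalated"] for r in results),
    }

# process_score = score(run_process)   # your chunked, gated, checkpointed version
# blind_score = score(run_blind)       # 10 independent one-shot attempts, METR-style
# The gap between the two "successes" counts is your process advantage.
```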
Next week: Refine based on data. Month 2: Expand scope. Quarter 2: Upgrade model and capture gains.
Why this works now
Three things converged:
- Models crossed useful thresholds (METR's data: ~1 hour at T50, even with 8 attempts)
- Failure became cheap (API costs dropped orders of magnitude)
- Verification became programmable (LLMs can check LLM output)
The companies that win with AI won't have better models—everyone will have access to the same frontier. They'll have better processes. And processes are something you already know how to build.
The question isn't whether AI is ready. It's whether your processes are.
This process-layer insight became the foundation for everything that followed. The solution is a proper business management foundation built for abundant intelligence—which I'll introduce next.
Source: METR's time-horizon study provides the capability baseline, measured with multiple blind attempts. Process design with feedback loops is what turns that baseline into production value.
Part of The Intelligence Shift