AI Reality Checks
Is it actually working? Demo-vs-reality posts, hype audits, and 'what they're not telling you' takes on model releases and tool launches.
AI Burnout Isn't From Typing More — It's Judgment Drain: Why Agent Users Hit a Wall at 4 Hours
Managing agent fleets depletes a different cognitive resource than normal work. Judgment drain caps productive hours at 4-5 — not 8-10. Here's the mechanism.
AI Is Already Doing 25% of Tasks in Half of All Jobs: 6 Data Points That Reframe the Displacement Debate
Anthropic's Economic Index found that 49% of jobs have had at least a quarter of their tasks done by Claude. Here's what the full data picture actually shows.
What Is the Anticipation Gap? Why Consumer AI Agents Are Still Reactive
Most AI agents wait to be asked. The anticipation gap explains why truly proactive agents don't exist yet and what it will take to build them.
ARC Evals' Time Horizons Benchmark: 5 Caveats the Researchers Themselves Want You to Know
A third of tasks use estimated human baselines. Error bars are 2x on either side. The researchers behind Time Horizons explain what the numbers actually mean.
How to Audit Your Job for AI Risk in 10 Days: The TCLD Framework Explained
Tag every calendar item and work output over 10 business days into Theater, Commodity, On-the-Line, or Durable. Here's the full method.
Why Consumer AI Agents Still Feel Disappointing: 5 Rungs They Haven't Climbed Yet
The ladder of trust — from read-only to fully autonomous — explains exactly where every consumer agent product is stuck and what it would take to move up.
Ezra Klein's Counterintuitive Argument: Mass AI Unemployment Would Actually Be Easier to Handle Than What's Coming
Klein argues that 80M displaced workers would force policy action, but 8M targeted ones would get ignored, as happened with the China trade shock. Here's why that matters.
GPQA vs. Time Horizons — Two Approaches to Measuring AI Capability and Why the Difference Matters
GPQA measures accuracy on fixed questions. Time Horizons measures task duration. The GPQA creator explains why both approaches have blind spots.
Software Engineering Job Postings Are Up 18% Since May 2025 — The Most AI-Exposed Job Is Accelerating
Citadel Securities data shows software engineering postings up 18% since May 2025. The most AI-exposed occupation is seeing demand accelerate, not collapse.
Agent Burnout Hits at Hour 4 — Not Hour 8: Why AI-Assisted Work Drains Differently Than Normal Work
Agent work burns through judgment and context-switching, not typing. Why you hit a wall at 4 hours and what to do about it.
AI Benchmarks Are Broken: 5 Methodological Flaws in Time Horizon Metrics You Need to Understand
A fixed-slope fix alone would push METR's numbers up 35%. Five structural problems with how AI capability benchmarks are built and reported.
Run the 4-Bucket AI Job Audit in 20 Minutes: Which Parts of Your Work Are Already on Thin Ice?
Theater, Commodity, On-the-Line, Durable. Audit the last two weeks of your work and find out what AI can already replace before your boss does.
Anthropic's Economic Index Shows 49% of Jobs Already Have 25%+ of Tasks Done by Claude — Is Yours One of Them?
Nearly half of all jobs have already handed a quarter of their tasks to Claude. Here's how to find out where your role stands.
Beth Barnes on METR's Time Horizons: The Error Bars Are 2x, and Here's What the Benchmark Actually Tells You
METR's co-founder admits error bars are 2x in either direction. Here's the honest breakdown of what time horizon benchmarks can and can't tell you.
GPQA: The Graduate-Level Benchmark Every Major AI Lab Uses — and Why Its Creator Says It Has Limits
David Rein built GPQA and now co-authors HCAST. He's the first to explain where graduate-level benchmarks mislead capability estimates.
How to Read an AI Time Horizons Report Without Getting Misled: A 10-Minute Interpretation Guide
Most readers misinterpret the 50th-percentile framing. This guide explains what METR's numbers actually mean for planning and policy.
The Legibility Paradox: 6 Actions to Take After You Audit Your Job for AI Displacement
Durable work must be visible but not fully specified. Six post-audit moves — from stopping theater to refusing commodity work — to protect your role.
SWE-Bench Score vs. Real Merge Rate: Why Your Agent's Benchmark Number Doesn't Match Production Reality
Agent solutions pass SWE-bench but merge at half the rate of human solutions. The gap between benchmark and production is wider than you think.
How to Use the GSD Framework to Prevent Context Rot in Long Claude Code Sessions
The GSD framework spawns fresh sub-agents per task so your main session stays clean. Learn how to install it and use it on complex multi-day projects.
Harvard and Stanford Physicians Built the Toughest Medical AI Benchmark Yet — Here's How AI Co-Clinician Scored
DeepMind's evaluation used 140 consultation dimensions, 20 synthetic clinical scenarios, and 10 real physicians as role-playing patients. Here are the results.
OpenAI's Goblin Problem: How RL Training in Codex Infected GPT-5.4 with Creature References Across Model Generations
GPT started mentioning goblins and gremlins in responses. The cause: RL 'nerdy personality' training in Codex scored creature references highly and bled…
Anthropic's Harness Detection Bug: 3 Things That Triggered Unexpected Claude Code Charges
A git commit mentioning 'hermes.md' triggered a $200.98 overage on a plan showing 86% unused. Here's exactly what caused it and how Anthropic responded.
What Is the Anthropic Billing Controversy? What It Means for AI Tool Vendors
Anthropic scanned user code for competitor harness keywords and charged extra. Here's what happened, why it matters, and what it means for AI tool builders.
How to Build an Agentic Coding Workflow: The PIV Loop Explained
The PIV loop—Plan, Implement, Validate—is a structured approach to AI-assisted coding that keeps you in the driver's seat without micromanaging every line.