How Google's New AGI Benchmark Measures Intelligence Across 10 Cognitive Dimensions
Google DeepMind's cognitive framework tests AI against human baselines across perception, reasoning, memory, and social cognition. Here's what it means for AGI.
Why Single-Number Benchmarks Have Always Been the Wrong Tool for Measuring Intelligence
A model scores 90% on MMLU. Is it intelligent? Does that number tell you whether it can navigate an unfamiliar social situation, hold context across a long task, or reason about causes rather than correlations?
It doesn’t. And that’s precisely the problem Google DeepMind set out to address with a cognitive evaluation framework that measures AI capability across ten distinct dimensions — each one grounded in how human cognition actually works, each one benchmarked against measurable human performance baselines.
This is a significant shift in how the AI field thinks about AGI benchmarks. Rather than asking “what percentage of a fixed test set did this model get right,” the framework asks a more fundamental question: how does this system perform across the full range of cognitive capacities that define general intelligence?
Here’s what the framework measures, how it works, and what current models’ scores reveal about the distance between today’s AI and genuine AGI.
The Problem With How We’ve Been Measuring AI Intelligence
Most AI benchmarks measure one thing. MMLU tests factual recall and reasoning across academic subjects. The Pencil Puzzle Benchmark tests pure multi-step logical reasoning in isolation. Even ARC-AGI 3, which humans solve at 100%, focuses primarily on abstract visual pattern recognition.
These are useful. But they each capture a slice of intelligence, not the whole thing.
The deeper issue is what researchers call the “jagged frontier” problem. AI capabilities are uneven — a model might perform at PhD-level on certain reasoning tasks while failing at basic physical intuition or social inference. Single-metric benchmarks hide this unevenness entirely. A model can look impressive overall while having catastrophic gaps in specific dimensions.
And gaps matter. If you’re building toward AGI — a system that can perform any cognitive task a human can — knowing that a model scores 85% on a single test tells you almost nothing about where it falls short.
There’s also the contamination problem. Models trained on large internet corpora have often seen benchmark test sets during training, which significantly inflates reported scores. A framework anchored in human cognitive science — rather than fixed test sets — is much harder to game.
What Is Google DeepMind’s Cognitive Evaluation Framework?
Google DeepMind’s framework draws directly from cognitive psychology’s established taxonomy of human mental abilities. Rather than designing AI-specific tests and comparing models against each other, the approach uses human performance as the baseline for each dimension.
The core idea: if AGI means “a system that can perform any cognitive task a human can perform,” then the only honest way to measure progress is against human cognitive baselines — not against other AI models.
The framework organizes evaluation around ten cognitive dimensions. Each dimension maps to a well-studied domain of human cognition, has its own suite of tasks and sub-tests, and reports a normalized score relative to an average human baseline (set at 1.0).
A model that scores 1.0 on a dimension performs at average human level. Scores above 1.0 indicate superhuman performance. Scores below indicate where capability gaps exist.
This setup produces something a single benchmark never can: a cognitive profile. You can see where a model is superhuman, where it’s close to human-level, and where it fails entirely.
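To make the normalization concrete, here is a minimal sketch of how such a profile could be computed. The dimension names follow the framework, but the raw scores and human-baseline values below are invented for illustration; DeepMind has not published this as code.

```python
# Hypothetical per-dimension raw scores for a model (invented values).
RAW_SCORES = {
    "perception": 0.92,
    "abstract_reasoning": 0.18,
    "causal_reasoning": 0.27,
}

# Hypothetical average-human raw scores on the same task suites.
HUMAN_BASELINES = {
    "perception": 0.85,
    "abstract_reasoning": 0.80,
    "causal_reasoning": 0.75,
}

def cognitive_profile(raw, baselines):
    """Normalize each dimension so that 1.0 equals average human performance."""
    return {dim: raw[dim] / baselines[dim] for dim in raw}

profile = cognitive_profile(RAW_SCORES, HUMAN_BASELINES)

# Dimensions below 1.0 are capability gaps relative to the human baseline.
gaps = {dim: score for dim, score in profile.items() if score < 1.0}
```

With these invented numbers, perception lands above 1.0 (mildly superhuman) while abstract and causal reasoning sit far below it, which is the shape of profile the article describes.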
The 10 Cognitive Dimensions, Explained
1. Perception
This dimension tests how accurately a system can interpret sensory information — primarily visual and auditory inputs in the current implementation.
Tasks include object recognition under occlusion, scene parsing from degraded images, and spoken language comprehension with background noise. The human baseline reflects average adult perceptual performance.
Current large multimodal models score well here, often reaching or exceeding human-level on static image tasks. Auditory perception under noise remains a consistent weak point.
2. Selective Attention
Attention testing goes beyond just “can the model focus on relevant information.” It measures whether a system can suppress irrelevant distractors, sustain focus across long tasks, and selectively attend to specific features in complex, cluttered inputs.
This maps to well-studied human attention tests like the Stroop task and the Attention Network Test. For AI, it translates to evaluating whether models can maintain task-relevant focus when prompted with irrelevant or misleading context.
Long-context models have improved substantially here, but performance degrades predictably as task length increases — a pattern that mirrors working memory limits in humans.
3. Working Memory
Working memory is the capacity to hold and manipulate information over short time spans. In humans, it predicts performance on complex reasoning tasks, language comprehension, and planning.
The benchmark tasks require models to maintain and update representations over a sequence of steps, without referencing previous outputs. This is deliberately harder than most context-window tests, which allow models to “re-read” prior content.
Most frontier models show working memory capacity roughly equivalent to humans on 5–7 item tasks, with steep drop-offs beyond that. This is consistent with multi-step reasoning research showing accuracy degradation over longer chains.
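The framework's actual working-memory tasks aren't public, but the classic n-back paradigm from cognitive psychology captures the core demand: hold the last n items in mind and flag repeats, with no way to re-read the stream. A minimal sketch (the letter stream is an arbitrary example):

```python
def n_back_trials(sequence, n):
    """For an n-back working-memory task, mark each position as a 'match'
    when the item equals the one presented n steps earlier."""
    return [(i, item, i >= n and sequence[i - n] == item)
            for i, item in enumerate(sequence)]

# Arbitrary example stream; a real test would generate many such streams
# and score the test-taker's match/no-match responses against these labels.
stream = list("ABABCBC")
matches = [i for i, _, hit in n_back_trials(stream, 2) if hit]
# Positions 2, 3, 5, and 6 repeat the item seen two steps earlier.
```

Raising n is what pushes humans past the 5-to-7-item limit the section describes, and it is where model accuracy drops off steeply as well.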
4. Long-Term Memory and Knowledge Integration
This dimension tests whether a system can retrieve relevant knowledge accurately, integrate it with novel context, and resolve conflicts between stored knowledge and new information.
Human baselines here are deliberately set at expert level in specific domains — because the comparison is to people who have invested time in building domain knowledge. A model claiming to be a useful expert system should match expert human memory in that domain.
Current models often excel at breadth but struggle with depth — particularly when questions require integrating multiple pieces of stored knowledge simultaneously, or resolving contradictions within their training data.
5. Language Comprehension and Pragmatics
This goes well beyond grammar and vocabulary. Pragmatics — the study of how context shapes meaning — is one of the hardest aspects of language for AI systems to handle reliably.
Tasks include resolving ambiguous pronouns in complex sentence structures, interpreting indirect speech acts (requests phrased as questions), understanding sarcasm and irony from context, and identifying implied meaning in underspecified instructions.
This is an area where understanding what LLMs actually do under the hood becomes relevant — surface-level pattern matching often fails on pragmatic tasks even when literal comprehension is strong.
6. Abstract and Inductive Reasoning
Abstract reasoning tests the ability to identify patterns, generalize rules from examples, and apply those rules to novel instances. This is the core of what frameworks like ARC-AGI measure.
The DeepMind framework includes both visual and conceptual abstraction tasks. Scores here tend to be the most revealing — and the most sobering. Frontier models have scored 0% on ARC-AGI 3, a benchmark humans solve at 100%, which illustrates how deep the gap remains on genuine abstract reasoning.
The 10-dimension framework confirms this: abstract reasoning is consistently one of the lowest-scoring dimensions across all evaluated models.
7. Causal Reasoning
Causal reasoning is distinct from correlation-based pattern matching. This dimension asks whether a model can correctly identify cause-and-effect relationships, simulate counterfactual scenarios (“what would have happened if…”), and reason about interventions in causal systems.
Human causal reasoning is far from perfect — we have well-documented biases. But the human baseline here is set at the level of an average adult answering structured causal questions, not an expert scientist.
Current models perform surprisingly poorly here. They can recite causal relationships from training data but struggle to reason about causal structure in novel scenarios. This reflects a fundamental limitation in how large language models represent the world — their chain-of-thought outputs often don’t faithfully reflect the actual reasoning process.
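The distinction between observing and intervening can be made precise with a toy structural causal model. The chain below (rain causes wet ground, wet ground causes slipperiness) is invented for illustration; the point is that an intervention on a variable severs it from its causes without back-tracking to them:

```python
def scm(rain, do_wet=None):
    """Toy structural causal model: rain -> wet -> slippery.
    Passing do_wet performs the intervention do(wet = value): it overrides
    wet's own causal mechanism while leaving its cause (rain) untouched."""
    wet = rain if do_wet is None else do_wet
    slippery = wet
    return {"rain": rain, "wet": wet, "slippery": slippery}

observed = scm(rain=True)                      # wet and slippery follow from rain
counterfactual = scm(rain=True, do_wet=False)  # "what if the ground were dry?"
```

Under the intervention, the ground is not slippery even though it is still raining. Answering that correctly requires representing causal structure, not just co-occurrence statistics, which is exactly where the section says current models struggle.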
8. Planning and Goal-Directed Behavior
Planning requires representing a goal state, assessing the current state, identifying gaps, and sequencing actions to close them. This dimension tests multi-step planning across a range of task types — from simple scheduling problems to complex resource allocation under constraints.
AI agents have improved substantially on planning tasks. But the framework specifically tests for plan robustness — what happens when conditions change mid-task and the system must replan. This is where most current models show fragility.
9. Social Cognition and Theory of Mind
Theory of mind — the ability to model other people’s beliefs, intentions, and knowledge states — is one of the most distinctly human cognitive capacities. And one of the most important for any system intended to work alongside humans.
Tasks include false-belief tests (understanding that others can hold beliefs you know to be wrong), perspective-taking in ambiguous scenarios, and predicting how different people would interpret the same information differently.
Current models perform inconsistently here. They can pass simple first-order theory of mind tests but fail reliably on second-order and third-order tasks (“Alice thinks that Bob believes that…”). This has direct implications for agentic AI systems that need to navigate human social contexts.
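Higher-order theory-of-mind items nest one belief clause per order. A sketch of how such prompts can be templated (the agent names and the embedded fact are placeholders, not items from the actual evaluation):

```python
def nested_belief_statement(agents, fact):
    """Build an order-k belief statement by nesting one clause per agent:
    order 1 = 'Alice believes X', order 2 = 'Alice believes that Bob
    believes X', and so on."""
    clause = fact
    for agent in reversed(agents):
        clause = f"{agent} believes that {clause}"
    return clause[0].upper() + clause[1:] + "."

# An order-2 item of the "Alice thinks that Bob believes that..." form:
q = nested_belief_statement(["Alice", "Bob"], "the keys are in the drawer")
# -> "Alice believes that Bob believes that the keys are in the drawer."
```

A full test item would then make one of the nested beliefs false (Bob moved the keys without Alice seeing) and ask the model to track who knows what — the step where second- and third-order performance collapses.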
10. Metacognition and Calibration
The final dimension tests whether a system accurately knows what it knows — and what it doesn’t. Calibration is the alignment between confidence and accuracy: a well-calibrated system’s 80%-confidence answers are correct about 80% of the time, and it expresses appropriate uncertainty when it might be wrong.
This is arguably the most practically important dimension for deployed AI. An overconfident system that presents incorrect information with the same tone as correct information is dangerous in real use cases.
Most frontier models remain poorly calibrated, particularly on questions at the edge of their training distribution. The metacognition dimension exposes this in ways that average accuracy scores hide entirely.
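One standard way to quantify this is Expected Calibration Error (ECE): group predictions into confidence bins, then measure how far each bin's accuracy drifts from its average stated confidence. A minimal implementation (the toy predictions at the end are invented; the framework's own scoring method is not public):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by stated confidence, then average
    |accuracy - mean confidence| per bin, weighted by bin size.
    0.0 means perfectly calibrated; larger means over/underconfident."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(accuracy - avg_conf)
    return ece

# An overconfident toy model: states 0.9 confidence but is right half the time.
ece = expected_calibration_error([0.9, 0.9], [True, False])
```

Average accuracy for that toy model is a respectable-looking 50% at face value, but the ECE of 0.4 exposes the confidence-accuracy gap — the kind of failure the metacognition dimension is built to surface.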
How Current Models Score Across All 10 Dimensions
No current model reaches human-level performance across all ten dimensions. The profile of strengths and weaknesses varies by model family, but some patterns are consistent:
Where AI tends to exceed humans:
- Perception (on controlled visual tasks)
- Long-term memory breadth
- Language fluency (surface-level)
Where AI approaches human-level:
- Working memory (on shorter tasks)
- Planning (on stable, well-defined problems)
Where AI consistently falls short:
- Abstract and inductive reasoning
- Causal reasoning
- Social cognition (especially second-order and above)
- Metacognition / calibration
- Selective attention under heavy distraction
This profile is consistent with what other benchmarks show. The gap between GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro narrows on straightforward language tasks and widens sharply on abstract reasoning and causal inference.
Why Human Baselines Matter More Than Model-vs-Model Comparisons
One of the framework’s most important design choices is anchoring every score to human performance rather than to other AI models.
Most benchmark leaderboards compare models to each other. That’s useful for choosing which model to deploy, but it doesn’t answer the harder question: how close is any of this to human-level general intelligence?
When the baseline is human performance, you get a clearer picture. A model that tops every AI leaderboard might still score 0.3 on causal reasoning relative to the human baseline of 1.0. That’s a fact that model-vs-model comparisons would completely obscure.
This matters for the ongoing debate about whether we’ve achieved AGI or are close to it. Questions about whether OpenAI has built AGI are hard to answer without a framework that defines what AGI would actually look like in measurable terms. The 10-dimension framework at least provides a rigorous structure for the conversation.
What This Means for Gemini Specifically
Google DeepMind built this framework partly as a rigorous evaluation tool for its own models. Gemini’s architecture is explicitly designed for multimodal reasoning — which maps well to several dimensions of the cognitive framework, particularly perception and language comprehension.
But the framework also reveals where Gemini, like every other current frontier model, has significant gaps. Abstract reasoning and causal inference remain weak across the board. Social cognition at the second-order level is unreliable. Metacognitive calibration is inconsistent.
The cognitive evaluation framework serves a dual purpose for DeepMind: it gives them a principled roadmap for what capabilities need improvement, and it gives the broader research community a shared language for measuring progress that doesn’t rely on easily gamed fixed test sets.
Given the broader strategic differences between Anthropic, OpenAI, and Google on how to approach AGI development, having a clearly defined cognitive framework is also a competitive positioning move — it sets the terms for what “general intelligence” means in a way that favors Google DeepMind’s research agenda.
The Limits of Any Benchmark, Including This One
The 10-dimension framework is more comprehensive than anything that came before it. But it’s still a benchmark, and all benchmarks have limits.
A few honest caveats worth noting:
Domain coverage is still incomplete. The framework focuses heavily on cognitive tasks that can be measured in controlled settings. Emotional intelligence, aesthetic judgment, and physical intuition remain difficult to test rigorously and are underrepresented.
Human baselines are themselves variable. “Average human performance” varies significantly by age, education, cultural context, and the specific population used to establish the baseline. The framework’s human baselines are grounded in standardized cognitive tests, but those tests have their own limitations.
Benchmark contamination is still possible. A model trained extensively on cognitive psychology literature and human test performance data has an advantage that doesn’t reflect real-world capability. The framework addresses this partly through novel task construction, but it’s not immune.
High scores don’t guarantee real-world capability. Benchmark performance and actual deployment performance often diverge. The Remote Labor Index found that AI agents complete only 2.5% of real freelance work despite strong benchmark scores — a persistent reminder that test performance and practical utility are different things.
None of this invalidates the framework. It’s a genuine step forward. But the lesson from years of AI benchmarking is that every metric eventually gets optimized for in ways that outpace the underlying capability improvement it was meant to track.
Where Remy Fits Into This Picture
Frameworks like this cognitive evaluation benchmark matter for anyone building AI-powered software — because they clarify which cognitive tasks AI can actually be trusted with today, and which require human oversight.
The dimension scores have direct implications for how you design AI applications. If causal reasoning scores are well below human level, you don’t want an AI agent autonomously making decisions in causally complex systems without a human check. If metacognitive calibration is poor, you need output verification built into the workflow.
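In practice that means building a verification gate into the workflow rather than trusting raw output. A minimal sketch of the pattern (the threshold value and the review hand-off are application design choices, not part of any framework):

```python
def route_output(answer, confidence, threshold=0.8):
    """Accept a model answer only when its stated confidence clears the
    threshold; otherwise flag it for human review. A production system
    would also check that stated confidence is itself well calibrated."""
    return {"answer": answer, "needs_review": confidence < threshold}

# Low-confidence output gets routed to a human instead of shipped as-is.
decision = route_output("Quarterly revenue grew 12%", confidence=0.55)
```

The threshold itself should be set from measured calibration data, not intuition: a poorly calibrated model makes any fixed cutoff unreliable.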
Remy, the spec-driven development platform from MindStudio, is designed with this reality in mind. Rather than assuming AI can do everything, it gives you a structured way to define exactly what an application should do — in annotated prose that both humans and AI agents can reason about. The spec is the source of truth. The AI compiles it into working code.
This approach is more resilient to the cognitive gaps that benchmarks like this one expose. If a model’s causal reasoning or metacognitive calibration improves, the compiled output gets better automatically — because the spec stays the same and the underlying model improves. You’re not locked into the limitations of today’s models.
If you’re building AI-powered applications and want to work at the right level of abstraction, try Remy at mindstudio.ai/remy.
Frequently Asked Questions
What are the 10 cognitive dimensions in Google DeepMind’s AGI framework?
The framework evaluates AI against human baselines across: perception, selective attention, working memory, long-term memory and knowledge integration, language comprehension and pragmatics, abstract and inductive reasoning, causal reasoning, planning and goal-directed behavior, social cognition (theory of mind), and metacognition / calibration.
How does this benchmark differ from MMLU or ARC-AGI?
MMLU measures factual recall and academic reasoning. ARC-AGI focuses on abstract visual pattern recognition. Both test a single dimension of intelligence. The 10-dimension framework tests cognitive breadth and compares performance to human baselines rather than to other AI models — giving you a profile of strengths and gaps rather than a single aggregate score.
Has any AI model reached human-level performance across all 10 dimensions?
No current model performs at human-level across all ten dimensions. Current frontier models tend to exceed human performance on perception and language fluency tasks, approach human-level on working memory and planning for shorter tasks, and fall significantly short on abstract reasoning, causal inference, social cognition, and metacognitive calibration.
Why does the framework use human baselines instead of comparing AI models to each other?
Model-vs-model comparisons tell you which product to buy, not how far away any system is from genuine general intelligence. If the goal is AGI — a system that can perform any cognitive task a human can — the only honest baseline is human performance. Anchoring to human scores also makes it harder to game the benchmark by training specifically on the test set.
What does this framework tell us about AGI timelines?
It shifts the question from “which model wins on benchmarks” to “which specific cognitive capacities still have large gaps.” The consistently low scores on abstract reasoning, causal inference, and social cognition suggest those are the hardest remaining problems — and they’re the ones that matter most for genuinely autonomous, general-purpose AI systems.
Is Google DeepMind testing Gemini specifically against this framework?
Yes. The framework was developed partly as an internal evaluation tool for Gemini models, and scores have been published in technical reports. Gemini performs well on perception and multimodal language tasks — consistent with its architectural design — but shows the same gaps in abstract reasoning and causal inference that affect all current frontier models.
Key Takeaways
- Google DeepMind’s cognitive evaluation framework measures AI against human baselines across 10 dimensions: perception, attention, working memory, long-term memory, language pragmatics, abstract reasoning, causal reasoning, planning, social cognition, and metacognition.
- No current AI model scores at human-level across all 10 dimensions. Abstract reasoning, causal inference, and social cognition are the largest gaps.
- Human baselines are more meaningful than model-vs-model comparisons for measuring AGI progress.
- The framework produces a cognitive profile — revealing uneven capabilities that single-metric benchmarks hide entirely.
- Understanding where AI falls short on these dimensions directly informs how to design AI applications that are reliable in practice.
Benchmarks keep getting better at exposing what AI can and can’t do. The cognitive evaluation framework is the most rigorous version of that effort yet — and it makes clear that impressive aggregate scores can coexist with significant capability gaps. If you’re building on top of AI systems, working at the right level of abstraction matters. Remy is built for exactly that.