We Asked Claude, ChatGPT, Grok, and Gemini to Rank AI Labs — Their Self-Serving Answers Reveal a Lot

Claude ranked Anthropic #2. ChatGPT ranked OpenAI #2. Grok and Gemini both picked Microsoft #2. Here's what each model's answer reveals about its training.

MindStudio Team

Claude Ranked Anthropic #2. ChatGPT Ranked OpenAI #2. Here’s What That Tells You.

When you ask an AI model to rank AI labs, it will almost always put its own lab near the top. That’s the finding buried inside a recent AI lab power ranking exercise from the AI Daily Brief — and it’s more interesting than it sounds.

Here’s the specific result: Claude put Anthropic at #2 (just barely above OpenAI). ChatGPT put OpenAI at #2. Grok and Gemini both put Microsoft at #2. Every single model put Google at #1, which is the one result that feels defensible on the merits. But the #2 slot is where the self-serving logic shows up clearly.

This isn’t a gotcha. It’s actually a useful signal for anyone building with these models or making decisions about which ones to trust for analysis tasks. If you understand why each model answers the way it does, you get a clearer picture of both the models and the labs behind them.


What the exercise actually was

The ranking used nine scoring categories with explicit point weights: compute and infrastructure (20 points), enterprise positioning (15 points), platform and ecosystem control (15 points), consumer positioning (10 points), model leverage (10 points), momentum (10 points), branded narrative (10 points), wedge (5 points), and X-factor (5 points). Total possible score: 100 points.
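
Mechanically, the rubric is just a weighted scorecard: each lab gets a sub-score per category, capped at that category's point budget, and the nine sub-scores sum to a total out of 100. Here is a minimal sketch in Python; the category names and point caps come from the exercise, but the example sub-scores are invented for illustration.

```python
# Minimal sketch of the rubric as a weighted scorecard.
# Category names and point caps are from the exercise; the example
# sub-scores below are invented for illustration only.

RUBRIC = {
    "compute_and_infrastructure": 20,
    "enterprise_positioning": 15,
    "platform_and_ecosystem_control": 15,
    "consumer_positioning": 10,
    "model_leverage": 10,
    "momentum": 10,
    "branded_narrative": 10,
    "wedge": 5,
    "x_factor": 5,
}  # caps sum to 100

def total_score(sub_scores: dict[str, float]) -> float:
    """Sum per-category points, capping each at its category maximum."""
    return sum(min(sub_scores.get(cat, 0), cap) for cat, cap in RUBRIC.items())

# Hypothetical partial scorecard, not numbers from the episode.
example = {"compute_and_infrastructure": 17, "enterprise_positioning": 8, "momentum": 3}
print(total_score(example))  # 28 on this made-up partial card
```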

Each of the four models — Claude, ChatGPT, Grok, Gemini — was asked to score the major AI labs using this rubric. Then a human (the show’s host) scored the same labs independently.

The AI consensus ranking came out: Google 91.4, OpenAI 85.4, Microsoft 84.9, Anthropic 83.1, Amazon 80.4. The top five all scored above 80.

The human ranking was noticeably harsher. Only three labs scored above 70: Google and OpenAI tied at 74, Anthropic at 70. Amazon came in at 64. The rest clustered between 58 and 64.

That gap, with the AI scores for the top labs averaging in the mid-80s while the human scores top out in the low 70s, is itself worth examining. But the more interesting story is in the #2 slot.


Why each model’s #2 answer makes sense (and what it reveals)

Claude → Anthropic #2

Claude’s reasoning, per the exercise, was that Anthropic has strong enterprise momentum and the hottest model narrative right now. That’s not wrong — Anthropic’s ARR growth through 2026 has been significant, and Claude Code has been one of the dominant stories in developer tooling. But “my maker is #2” is still a suspicious conclusion when the margin over OpenAI is razor-thin.

What this probably reflects: Claude’s training data skews toward Anthropic’s own communications, safety research, and the communities that discuss Anthropic favorably. The model isn’t lying. It’s pattern-matching on the information it was most exposed to.

ChatGPT → OpenAI #2

ChatGPT’s answer is the most straightforward case of home-field bias. OpenAI has real strengths — the consumer base, the brand recognition, the recent momentum around GPT-5.5 and Codex. But placing itself at #2 while acknowledging (in the same assessment) that Anthropic has the hottest enterprise momentum story is a contradiction the model doesn’t fully resolve.

The human scorer gave OpenAI a 10/10 on momentum — the highest of any lab — but only a 10/15 on enterprise positioning, reflecting that enterprise has historically been more Anthropic’s territory. ChatGPT’s self-assessment glosses over that gap.

Grok and Gemini → Microsoft #2

This one is the most interesting. Neither Grok nor Gemini is made by Microsoft. So why do both of them put Microsoft at #2?

The most likely explanation: both models weight infrastructure and enterprise incumbency very heavily, and Microsoft’s position on those dimensions is genuinely strong. Microsoft holds a 27% equity stake in OpenAI, a non-exclusive license to OpenAI’s IP through 2032, and Azure is one of the primary distribution channels for frontier models. On paper, that’s a formidable position.

But there’s also something else going on. Grok is made by xAI, which has no particular reason to flatter Anthropic or OpenAI. Gemini is made by Google, which already claimed the #1 slot. For both models, Microsoft is a “safe” #2 — it doesn’t require elevating a direct competitor. It’s the answer that looks most neutral while still being defensible.


The human scores tell a different story

The most striking divergence between the AI consensus and the human scores is on enterprise positioning.

The human scorer gave Anthropic 14/15 on enterprise — the highest of any lab, tied with Microsoft. OpenAI got 10/15. Google got 8/15.

That Google score will bother Gemini partisans. The argument for it: Google’s enterprise relationship has always been structurally odd. Companies that aren’t locked into the Microsoft ecosystem often end up using Google Workspace by default — Drive, Sheets, Gmail — but Google has historically struggled to convert that presence into deep enterprise relationships at the highest levels. That pattern has followed Gemini into the enterprise. The tools are there. The trust and the sales motion haven’t fully materialized.

The human scorer’s reasoning on enterprise incumbency is also worth sitting with: enterprises right now are treating AI adoption as a larger transformation than picking a new software vendor. They’re going direct to the model labs — Anthropic and OpenAI specifically — in a way that’s different from how they’d evaluate even a successful startup. Microsoft scores well for distribution, but it’s distributing other companies’ models. Some buyers want that flexibility. Many are showing they want the source.

For a deeper look at how the three leading labs are actually approaching enterprise differently, the Anthropic vs OpenAI vs Google agent strategy comparison breaks down each lab’s structural bets.


The momentum scores are where things get weird for Google

Google had the highest overall score in the human ranking — 74, tied with OpenAI. But the human scorer gave Google only 3/10 on momentum.

That’s not a typo. The lab with the best full-stack position, the most compute, the deepest ecosystem, is scoring near the bottom on momentum in 2026.

The reason: this year has been dominated by agentic use cases built on top of coding capabilities, and almost no one is reaching for Gemini for that work ahead of GPT or Claude. The developer conversation in 2026 has been about Codex, Claude Code, and the shift to agentic workflows. Google hasn't broken into that conversation in a meaningful way.

The human scorer gave OpenAI 10/10 on momentum — a reflection of the very recent shift around GPT-5.5 and Codex specifically. Anthropic got 8/10, reflecting strong ARR growth and Claude Code’s developer traction, even as demand has started to outpace supply. Amazon got 6/10, which the scorer attributed to Amazon using its compute and capital to throw its weight around in ways that are underappreciated.

Google I/O is coming up in a few weeks, and there are reports of a Sergey Brin-led strike team working on coding models. If Google comes out of I/O as a real contender on coding-based use cases, that 3/10 momentum score could move fast. If it doesn't, the structural advantages won't matter much in the near term.


What the AI scores get wrong (and why it matters for builders)

The AI consensus scores are uniformly high. The top five labs all scored above 80. The human scores are much more spread out, with the bottom labs in the 58-64 range.

One interpretation: the models are trained to be diplomatic. Giving a lab a 58/100 feels like a harsh judgment, and models are generally trained away from outputs that could be read as aggressive or dismissive. The result is scores that cluster high and don’t differentiate much.

Another interpretation: the models genuinely don’t have good visibility into the things that matter most right now. Compute ownership, enterprise sales motion quality, developer community sentiment — these are things that are hard to assess from training data. The models can recite that Google has TPUs and that Microsoft has Azure, but they can’t easily assess whether Google’s enterprise sales motion is actually converting or whether developers are actually reaching for Gemini in their daily work.

This has a practical implication. If you’re using an AI model to do competitive analysis — of AI labs or of anything else — the model’s own position in the competitive landscape is a meaningful source of bias. It’s not that the model is deliberately misleading you. It’s that its training data reflects the world as seen from a particular vantage point.

When you need genuinely neutral analysis, relying on a single model for competitive assessments in its own domain is a design flaw. Platforms like MindStudio make it practical to route the same query through multiple models simultaneously (200+ models available) and compare outputs, which is exactly the kind of cross-check that catches this systematic lean.
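
A bare-bones version of that cross-check is sketched below. The ask() helper is a placeholder for whichever SDK or model gateway you actually use (it is not a MindStudio or vendor API), and the parsing is deliberately naive; the point is simply to fan the identical prompt out and record where each model places its own maker.

```python
# Sketch: fan one ranking prompt out to several models and note where
# each model places its own maker. `ask()` is a placeholder, not a real
# SDK call; wire it to whatever model router or gateway you use.

PROMPT = "Rank the major AI labs from strongest to weakest and justify each placement."

MODEL_MAKERS = {
    "claude": "Anthropic",
    "chatgpt": "OpenAI",
    "grok": "xAI",
    "gemini": "Google",
}

def ask(model: str, prompt: str) -> str:
    """Placeholder: call your model router of choice here."""
    raise NotImplementedError

def maker_position(answer: str, maker: str) -> int | None:
    """Naive check: first line of the answer that mentions the maker."""
    for i, line in enumerate(answer.splitlines(), start=1):
        if maker.lower() in line.lower():
            return i
    return None

def self_rank_report() -> dict[str, int | None]:
    """Where does each model place its own lab in its own ranking?"""
    return {
        model: maker_position(ask(model, PROMPT), maker)
        for model, maker in MODEL_MAKERS.items()
    }
```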


The xAI X-factor score is its own category

One score in the human ranking stands out for a different reason: xAI got an 8/5 on X-factor. Above the maximum.

The reasoning: whatever you think of Elon Musk, not betting against him has been one of the more reliable heuristics in tech over the last 20 years. xAI has strong compute scores (a leading indicator for everything else), models that are competitive but not yet state-of-the-art, and an owner who is very publicly trying to build the best models in the world.

The human scorer’s argument is that xAI’s 5/10 on model quality is a stronger 5 than Amazon’s or Microsoft’s 5. Amazon and Microsoft score 5 because they have access to all the models but own none of them. xAI scores 5 because its models are behind the frontier — but the trajectory is different. There’s a clear path to improvement. For Amazon and Microsoft, the model quality ceiling is set by whoever they’re licensing from.

This is the kind of nuance that the AI consensus scores flatten out. All three got similar scores. The human scorer is making a qualitative argument about the type of 5, not just the number.


The broader point: it’s not zero-sum

Miles Brundage made an observation that’s worth repeating here: there’s a lot of implicit zero-sum thinking in AI lab comparisons. The assumption that only one of OpenAI, Anthropic, or Google will succeed, and that one’s growth comes at the expense of the others.

Dylan Patel from SemiAnalysis made the same point more bluntly on a recent podcast: “It’s pretty clear even the tier two or tier three labs are going to be sold out of tokens.” The economic value that the best models can deliver is growing faster than the infrastructure can serve it. All the tokens that can do agentic things are going to be used.

That framing matters for how you read these rankings. The question isn’t really “who wins.” It’s “what are each lab’s structural strengths, and how do those map to what you’re actually trying to build?”

If you’re building something that depends heavily on coding capabilities right now, the momentum scores point clearly toward OpenAI and Anthropic. If you’re building enterprise workflows that need to live inside existing Microsoft infrastructure, the enterprise incumbency scores for Microsoft look different than they do in a greenfield context. If you’re thinking about where the compute story goes over the next two years, Google’s 17/20 compute score and xAI’s high compute score are the leading indicators to watch.

For builders evaluating which models to use in production, the GPT-5.5 vs Claude Opus 4.7 coding comparison gets into the specifics of real-world performance differences that these high-level rankings can’t capture. And if you’re thinking about sub-agent architecture specifically, the GPT-5.4 Mini vs Claude Haiku sub-agent comparison is worth reading alongside the lab-level analysis.


What to do with this

The self-serving bias in AI lab rankings isn’t a scandal. It’s a property of the systems, and once you know it’s there, you can work around it.

A few practical takeaways:

When you ask a model to evaluate its own lab’s products or positioning, treat the output as one data point with a known lean — not as neutral analysis. The model isn’t lying. It’s just not well-positioned to be objective about its own maker.

The categories that matter most shift over time. Right now, compute ownership and coding-based momentum are the two dimensions where the rankings are most in flux. Enterprise positioning is more stable but also more nuanced than incumbency scores suggest.

The gap between the AI scores (mid-80s for the top labs) and the human scores (low 70s at the very top) is itself a signal. The models are trained toward diplomatic outputs. If you need differentiated analysis, you probably need a human in the loop, or at minimum you need to explicitly prompt for the harshest defensible assessment, not the balanced one.
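
What that prompt might look like in practice is sketched below; this is an illustrative instruction, not wording used in the exercise.

```python
# Illustrative only: one way to push a model off its diplomatic default.
# This wording was not part of the original ranking exercise.
HARSH_ASSESSMENT_PROMPT = """
Score each lab against the rubric. Do not let the totals cluster in the 80s.
For every category score above 70 percent of its cap, state the strongest
counterargument and lower the score unless you can rebut it. Your goal is
the harshest assessment you can still defend, not a balanced summary.
"""
```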

If you want to build your own scorecard and contribute to the community rankings, the tool is at aipowerrank.ai. The nine categories and weights are all adjustable, which means you can see how sensitive the final rankings are to your assumptions about what matters most right now.
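
To see that sensitivity concretely, the sketch below re-ranks two hypothetical labs under two different weightings; the per-category ratings are placeholders, not the published scores.

```python
# Sketch: the same sub-scores can reorder under different category weights.
# Ratings are placeholders (0-1 per category), not the published numbers.

RATINGS = {
    "LabA": {"compute": 0.95, "enterprise": 0.5, "momentum": 0.3},
    "LabB": {"compute": 0.4, "enterprise": 0.7, "momentum": 0.9},
}

def rank(weights: dict[str, float]) -> list[tuple[str, float]]:
    """Rank labs by the weighted sum of their category ratings."""
    totals = {
        lab: sum(weights[cat] * ratings.get(cat, 0) for cat in weights)
        for lab, ratings in RATINGS.items()
    }
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Compute-heavy weights put LabA first; momentum-heavy weights flip the order.
print(rank({"compute": 20, "enterprise": 15, "momentum": 10}))
print(rank({"compute": 10, "enterprise": 15, "momentum": 25}))
```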

The rankings will change. Google I/O is weeks away. Anthropic’s Mythos model is somewhere in the pipeline — the Claude Mythos benchmark results show 93.9% on SWE-bench, which would shift the model quality scores significantly if it ships. And if you’re building applications that need to compile from a spec rather than stitch together API calls, tools like Remy represent a different layer of the stack entirely — one where the question of which underlying model is “best” becomes less important than whether your spec is precise enough to compile correctly.

The labs are all going to be busy. The tokens are going to get used. The more interesting question is what you build with them.

Presented by MindStudio
