
Codex vs. Claude Code: Context Window, Token Efficiency, and Which Lasts Longer Per Session

Codex has 256K tokens vs. Claude Code's 1M — but GPT 5.5's efficiency may close the gap. Here's the real session-length comparison.

MindStudio Team

The Number That’s Misleading You About Codex Sessions

Codex has a 256,000-token context window. Claude Code with Opus has roughly 1 million. On paper, that’s a 4x disadvantage for Codex. In practice, the gap is considerably smaller — and for many workflows, it may not matter at all.

If you’re choosing between these two tools for a serious agentic project, the raw context window number is the wrong thing to optimize for. What you actually care about is how long a session lasts before you hit a wall, and that depends on token efficiency as much as window size.

This post is about that distinction.

The Number That Actually Matters: Effective Session Length

Context window size and session length are not the same thing.

A 1 million token window means nothing if the model burns through tokens at 4x the rate. And a 256K window is less of a constraint than you’d think if the model is dramatically more efficient per task.

Nate Herk, who uses both tools daily, put it directly: “I have seen that my session is lasting way longer in Codex than it is in Claude Code. And a big part of that is because ChatGPT 5.5 is really, really efficient with tokens — with output tokens and input tokens.”

That’s an empirical observation from someone running real projects through both tools, not a benchmark. It’s the kind of signal worth taking seriously.

The underlying reason: GPT 5.5 tends to produce more concise outputs than Opus. It doesn’t over-explain. It doesn’t pad. When you’re running an agentic loop that might execute dozens of tool calls, that concision compounds. You can dig into the specifics in this GPT-5.5 vs Claude Opus 4.7 coding comparison, which found GPT 5.5 using 72% fewer output tokens than Opus 4.7 on equivalent tasks.
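
To make that compounding concrete, here’s a back-of-envelope sketch. The 72% output reduction comes from the linked comparison; the per-task input and output figures below are invented purely for illustration.

```python
# Back-of-envelope math: how much of the 4x window gap does
# a 72% reduction in output tokens claw back?
# Per-task figures are illustrative assumptions, not measurements.

claude_window = 1_000_000   # Opus context window (approx.)
codex_window = 256_000      # Codex context window (approx.)

# Assume each agentic step costs some input tokens plus some output
# tokens, and only the output side shrinks with GPT 5.5's efficiency.
input_per_task = 3_000          # assumed: prompt + tool-call context
opus_output_per_task = 2_500    # assumed: verbose output
codex_output_per_task = int(opus_output_per_task * (1 - 0.72))  # 72% fewer

opus_tasks = claude_window // (input_per_task + opus_output_per_task)
codex_tasks = codex_window // (input_per_task + codex_output_per_task)

print(f"Opus:  ~{opus_tasks} tasks per window")   # ~181
print(f"Codex: ~{codex_tasks} tasks per window")  # ~69
# Under these assumptions the effective gap is ~2.6x, not 4x, and
# shrinking input overhead (lean agents.md, skills) narrows it further.
```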

What the Specs Actually Say

Let’s be precise about what we’re comparing.

Codex: ~256,000 tokens per session. Resets every 5 hours (with a weekly reset as well). You can check your remaining session budget under Settings > Rate Limits Remaining, which shows percentage remaining and expiry time. The model running under the hood is GPT 5.5 (with options for 5.4 and others). Intelligence levels run Low, Medium, High, and Extra High — Extra High is recommended only for hard bugs, because it burns through the session budget faster and can over-engineer simple tasks.

Claude Code: ~1 million tokens with Opus (Sonnet and Haiku have smaller windows). No built-in session timer in the same sense — you’re working against token limits per conversation, not a rolling time window.

Both tools auto-compact context as sessions get long. Neither simply crashes when you approach the limit; instead, each summarizes and compresses earlier context to keep the session alive.
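
Neither vendor publishes its compaction algorithm, but the shape of the idea is simple to sketch. The snippet below is an illustration of the concept, not either tool’s actual implementation:

```python
# Illustrative sketch of context auto-compaction. Not Codex's or
# Claude Code's actual algorithm, which isn't publicly documented.

def summarize(turns: list[str]) -> str:
    """Stand-in for an LLM call that condenses older turns."""
    return f"[summary of {len(turns)} earlier turns]"

def token_count(text: str) -> int:
    """Crude proxy: roughly 4 characters per token."""
    return len(text) // 4

def compact(history: list[str], window: int = 256_000,
            threshold: float = 0.8) -> list[str]:
    used = sum(token_count(t) for t in history)
    if used < window * threshold:
        return history  # plenty of headroom, leave history intact
    # Compress the oldest half of the conversation into one summary
    # turn, keeping recent turns verbatim as working context.
    cut = len(history) // 2
    return [summarize(history[:cut])] + history[cut:]
```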

The context window bar at the bottom of Codex chats shows your current session usage as a percentage. It’s a small UI detail, but it changes how you work. You can see the pressure building and make decisions accordingly — switch to a lower intelligence level, start a new chat, or compact manually.

Three Dimensions That Actually Determine Session Length

1. Model verbosity

This is the biggest variable. Opus is a verbose model. It reasons out loud, explains its decisions, and produces detailed output even when you didn’t ask for it. That’s often useful — but it costs tokens.

GPT 5.5 in Codex is more terse by default. The pragmatic personality mode (activated via /personality → pragmatic) makes it even more concise: “concise, task-focused, and direct” is how the setting describes itself. If you’re running long agentic sessions, that personality setting alone can meaningfully extend how far your 256K window takes you.

2. What you’re building

A single-file Python script and a full-stack dashboard with browser-use QA are not the same context load.

In Herk’s demo, Codex built a YouTube analytics dashboard — pulling 200 comments via the YouTube Data API, running analysis, generating an Excel workbook with multiple tabs, creating UI assets via GPT Image 2, running browser-use QA (which found 6 real bugs including broken YouTube external links), and then deploying to Vercel via GitHub. That’s a substantial session. It ran on Codex’s 256K window without hitting limits.
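
For a sense of scale, the comment-pulling step of that demo maps to a few dozen lines against the real YouTube Data API v3. This is a hedged sketch, not Herk’s actual code; the function name is made up, and the 200-comment target just restates the demo’s parameter.

```python
# Sketch of the comment-pulling step: fetch ~200 top-level comments
# via the YouTube Data API v3.
# Requires `pip install google-api-python-client` and an API key.
from googleapiclient.discovery import build

def fetch_comments(video_id: str, api_key: str, target: int = 200) -> list[str]:
    youtube = build("youtube", "v3", developerKey=api_key)
    comments, page_token = [], None
    while len(comments) < target:
        resp = youtube.commentThreads().list(
            part="snippet",
            videoId=video_id,
            maxResults=100,          # API maximum per page
            pageToken=page_token,
            textFormat="plainText",
        ).execute()
        for item in resp["items"]:
            snippet = item["snippet"]["topLevelComment"]["snippet"]
            comments.append(snippet["textDisplay"])
        page_token = resp.get("nextPageToken")
        if not page_token:
            break
    return comments[:target]
```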

The /goal command — which activates a multi-hour agentic loop by editing a TOML config file — is the extreme case. Alex Finn used it to build a complete extraction shooter video game in one session, with auto-generated assets. That’s the kind of task where you’d expect context limits to bite. Apparently they didn’t, which says something about how efficiently GPT 5.5 manages that loop.

3. Context hygiene


Both tools reward the same habits. The agents.md file in Codex (the equivalent of CLAUDE.md in Claude Code) gets read at the start of every new chat. Keeping it lean — not dumping everything you’ve ever told the model into it — directly affects how many tokens each session starts with. Herk’s explicit advice: don’t put everything in agents.md, because a bloated onboarding doc eats tokens before you’ve done any real work.
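
If you want to know what your onboarding doc actually costs, you can measure it. A quick sketch using tiktoken (cl100k_base is an approximation here; GPT 5.5’s actual tokenizer isn’t public):

```python
# Rough check on agents.md bloat: what does it cost at session start?
# cl100k_base is an approximation; GPT 5.5's real tokenizer is unknown.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
with open("agents.md", encoding="utf-8") as f:
    doc = f.read()

tokens = len(enc.encode(doc))
print(f"agents.md costs ~{tokens} tokens at the start of every chat")
# A 5,000-token onboarding doc is ~2% of a 256K window gone before
# the first message. Per chat, every chat.
```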

The skills system compounds this. Reusable markdown instruction files stored in ~/.codex/skills/ (global) or project-local directories mean you don’t have to re-explain workflows in every chat. You call /youtube-comment-insights and the model reads the recipe. That’s not just a productivity feature — it’s a token management feature.
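
As a rough illustration of the layout, a skill is just a markdown file dropped into that directory. The helper below is hypothetical: the ~/.codex/skills/ path comes from above, but the convention that the filename matches the slash command is an assumption.

```python
# Hypothetical helper for scaffolding a skill file. The skills
# directory is from the post; the filename-matches-command
# convention is assumed, not confirmed.
from pathlib import Path

def create_skill(name: str, recipe: str) -> Path:
    skills_dir = Path.home() / ".codex" / "skills"
    skills_dir.mkdir(parents=True, exist_ok=True)
    path = skills_dir / f"{name}.md"  # assumed: /{name} invokes this file
    path.write_text(recipe, encoding="utf-8")
    return path

create_skill(
    "youtube-comment-insights",
    "# YouTube Comment Insights\n"
    "1. Pull the latest comments via the YouTube Data API.\n"
    "2. Cluster by theme; flag feature requests and bugs.\n"
    "3. Write results to a multi-tab Excel workbook.\n",
)
```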

For Claude Code users who’ve built similar habits, the Claude Code token management techniques that work there translate directly to Codex. The principles are the same: front-load context efficiently, don’t repeat yourself, compact before you’re forced to.

Codex’s Session Model: What’s Different

The 5-hour rolling reset is a meaningful structural difference from Claude Code.

Claude Code’s limits are per-conversation token counts. Codex’s limits are time-windowed. That changes the calculus for long-running work. If you’re running an automation that refreshes weekly — like Herk’s Sunday 5pm YouTube comment refresh — you’re not worried about a single session limit. You’re worried about whether the automation has enough runway to complete before the window closes.
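
If it helps to reason about, a 5-hour rolling budget behaves like a sliding-window rate limiter. The sketch below models the concept; the actual accounting Codex uses isn’t published.

```python
# Illustrative model of a 5-hour rolling token budget. A sketch of
# the concept, not Codex's actual (unpublished) accounting.
import time
from collections import deque

class RollingBudget:
    def __init__(self, budget: int = 256_000, window_s: int = 5 * 3600):
        self.budget, self.window_s = budget, window_s
        self.events: deque[tuple[float, int]] = deque()  # (timestamp, tokens)

    def spend(self, tokens: int) -> None:
        self.events.append((time.time(), tokens))

    def remaining(self) -> int:
        cutoff = time.time() - self.window_s
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()  # usage older than 5 hours rolls off
        return self.budget - sum(t for _, t in self.events)
```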

The automations tab in Codex handles this by injecting prompts into new Codex chats on a schedule (hourly, daily, weekly). Each automation run starts a fresh chat with a fresh context window. So a weekly automation that would theoretically exceed 256K tokens in a single session gets broken into a new session automatically. That’s a reasonable architectural workaround for the smaller window.

The side chat feature — opening a parallel conversation within the same project context — is another session management tool that doesn’t have a direct Claude Code equivalent. If your main session is getting heavy, you can offload exploratory questions to a side chat without burning the main session’s context.

Where Claude Code’s 1M Window Actually Wins

There are real scenarios where the larger window is the right answer.

Long-document analysis. If you’re feeding in multiple large codebases, extensive documentation, or months of logs, 256K can become a genuine constraint. Claude Code’s million-token window with Opus is the right tool for that work. The Claude Code effort levels post covers how to calibrate Opus’s reasoning depth for exactly these heavy-context tasks.

Exploratory reasoning. Herk’s own take: “I really like Claude for being sort of like exploratory and brainstorming and helping me get creative and think through things and plan.” Opus’s verbosity — the thing that costs tokens — is also what makes it good at working through ambiguous problems. When you want the model to think out loud and surface assumptions, that’s a feature, not a bug.

Switching between Opus, Sonnet, and Haiku in Claude Code gives you a finer-grained efficiency dial than Codex’s Low/Medium/High/Extra High intelligence levels. If you’re building a multi-agent system where some sub-agents do simple tasks and one does hard reasoning, Claude Code’s model flexibility is an advantage. The GPT-5.4 Mini vs Claude Haiku sub-agent comparison gets into this specifically for sub-agent use cases.


When you’re building agents that need to connect to a wide range of external services, the orchestration layer matters as much as the model. Platforms like MindStudio handle this at a different level — 200+ models, 1,000+ integrations, and a visual builder for chaining agents — which is worth considering if you’re building workflows that need to survive model swaps or run across multiple providers without rewriting orchestration code.

Where Codex’s 256K Window Is Enough

For most coding and automation work, 256K is sufficient. The question is whether GPT 5.5’s efficiency keeps you inside that window for the tasks you actually run.

The evidence from real usage suggests it does, for typical agentic coding sessions. Building a dashboard, running browser-use QA, setting up GitHub/Vercel deployment, creating skills — these workflows fit comfortably. The 5-hour reset means you’re not accumulating debt across days of work in a single session.

The /goal command’s multi-hour loops are the stress test. If those work within the 256K window — and the demos suggest they do — then the efficiency argument holds up.

One thing worth noting about the full-stack app building workflow: when the output is a deployed application, the question of what to do with that code matters. Tools like Remy take a different approach to this problem — you write a spec as annotated markdown, and it compiles into a complete TypeScript backend, SQLite database, auth, and deployment. The spec is the source of truth; the generated code is derived output. That’s a different abstraction layer than what Codex or Claude Code offer, but it’s worth knowing the option exists when you’re thinking about how production apps get built from AI-assisted workflows.

Verdict: Which Tool for Which Work

Use Codex if: You’re running agentic coding sessions, building and deploying apps, automating workflows, or doing browser-use QA. GPT 5.5’s token efficiency means your 256K window lasts longer than the raw number implies. The 5-hour rolling reset and the automations tab handle long-running work cleanly.

Use Claude Code if: You’re doing deep document analysis, working with large codebases that need to live in context simultaneously, or doing the kind of exploratory reasoning where Opus’s verbosity is an asset. The 1M window with Opus is genuinely useful for these cases, not just a spec sheet number.

Use both if: You’re serious about this work. Herk’s workflow — Claude for brainstorming and planning, Codex for execution and troubleshooting — is a reasonable division of labor. They work out of the same local directory. An agents.md in Codex and a CLAUDE.md in Claude Code can describe the same project. There’s no reason to be tribal about it.

The 4x context window gap is real. The token efficiency gap is also real, and it runs in the opposite direction. Where they net out depends on what you’re building — but for most agentic coding work, Codex’s session length is competitive in practice, even if it looks worse on paper.

That’s the comparison that matters. Not the spec sheet. The session.
