How to Use GPT-5.5 in Codex for Real-World Agentic Tasks
GPT-5.5 is optimized for agentic work, not chat. Learn how to activate it in Codex, use plan mode, and get the most from its token efficiency.
Why GPT-5.5 in Codex Is Different From GPT-5.5 in Chat
Most people first encounter GPT-5.5 through ChatGPT. They ask it a question, get a good answer, move on. That’s fine — but it misses what the model was actually designed for.
GPT-5.5 is built around agentic AI tasks: multi-step workflows, tool use, long-horizon planning, and autonomous execution across a codebase. Chat is a narrow use case for it. Codex is where it actually operates the way OpenAI intended.
This guide is about getting practical value from GPT-5.5 inside Codex. That means selecting it correctly, using plan mode the right way, managing token costs, and structuring tasks so the model can actually finish them without going sideways.
If you want a broader overview of the model itself, see What Is GPT-5.5? OpenAI’s New Flagship Model Explained. This article is specifically about the Codex workflow.
What Makes GPT-5.5 Well-Suited for Agentic Work
Before getting into setup, it helps to understand what GPT-5.5 actually does differently compared to previous models.
Stronger instruction-following over long tasks
Earlier GPT models had a tendency to drift. You’d give them a complex, multi-step task and by step four or five, they’d start interpreting the original goal loosely. GPT-5.5 holds the original intent better. It’s more reliable at the kind of 20-30 step tasks that Codex agents are expected to run through.
Improved tool use
GPT-5.5 handles tool calls — file reads, shell commands, API calls, test runners — more consistently. It’s less likely to call a tool unnecessarily, pass malformed arguments, or loop on failed tool invocations. This matters a lot for Codex because Codex agents spend most of their runtime executing tool calls, not just generating text.
Better token efficiency
Token efficiency improvements in GPT-5.5 build on earlier work. The predecessor model, GPT-5.4, introduced tool search as a way to reduce unnecessary token consumption; GPT-5.5 extends the same logic beyond tool selection to broader reasoning steps. If you’re running long sessions, how tool search cuts token usage is worth understanding.
Better plan-then-execute behavior
GPT-5.5 separates planning from execution more cleanly when prompted to do so. This is directly relevant to Codex’s plan mode, which we’ll cover in detail below.
How to Activate GPT-5.5 in Codex
Step 1: Open Codex and create or select a project
Codex runs as part of OpenAI’s developer platform. If you’re not familiar with how Codex has evolved into a broader developer environment, the Codex as a Super App overview gives useful context on where it sits in OpenAI’s ecosystem.
Within Codex, you can create a new project or open an existing one. The model selection applies per-session, not globally, so you’ll set it for each working session.
Step 2: Select GPT-5.5 as your model
In the Codex interface, look for the model selector — it’s typically in the top-right area of the session panel or within session settings. GPT-5.5 will appear in the model list as gpt-5.5 or gpt-5.5-turbo depending on your API tier.
If you’re on the standard ChatGPT Plus plan, you may only see a subset of models. GPT-5.5 in Codex typically requires a Pro plan or direct API access. Check your account tier if the model isn’t appearing.
Step 3: Set your environment context
Before issuing any task, set up your environment context. This includes:
- The repository or working directory
- Any relevant constraints (tech stack, test framework, deployment target)
- What “done” looks like for the session
GPT-5.5 performs better when it has a clear success criterion upfront. Don’t just say “fix the auth bugs.” Say “fix the auth bugs, confirm all existing tests pass, and ensure no new failing tests are introduced.”
Using Plan Mode in Codex With GPT-5.5
Plan mode is one of Codex’s most underused features, and GPT-5.5 makes it substantially more useful.
What plan mode does
In standard mode, Codex takes your task and starts executing immediately. The agent reads files, makes changes, runs tests, iterates. This works for smaller tasks.
For larger tasks — refactoring a module, adding a new feature end-to-end, migrating a data model — jumping straight to execution often produces worse results. The agent makes assumptions early that constrain later decisions.
Plan mode inserts a planning step before execution. GPT-5.5 reads your task, surveys the relevant code, and produces a structured plan: what it intends to change, in what order, and what it expects to validate at each step. You review the plan before the agent starts making any edits.
How to enable it
In Codex, look for the “Plan before executing” toggle in session settings, or prefix your task with explicit planning instructions:
Before making any changes, produce a step-by-step implementation plan.
Wait for my approval before proceeding to execution.
With GPT-5.5, this prompt reliably produces a clean, reviewable plan rather than a vague outline.
How to evaluate the plan
When GPT-5.5 returns a plan, check for three things:
- Scope accuracy — Does the plan address what you actually asked? Watch for scope creep (the model deciding to “improve” things you didn’t ask about) or scope narrowing (the model missing a key requirement).
- Sequencing logic — Does the order make sense? Changes to shared utilities should come before changes to the things that depend on them. Database migrations should precede code that uses the new schema.
- Validation steps — A good GPT-5.5 plan includes explicit checkpoints: “run tests after step 3,” “confirm the endpoint returns 200 before proceeding to step 5.” If the plan skips these, add them before approving.
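The checkpoint review can even be done mechanically before you approve. A minimal sketch, assuming the agent returns its plan as a list of step strings — the keyword heuristics here are illustrative assumptions, not part of any Codex API:

```python
# Illustrative sketch: a lightweight pre-approval check that a returned plan
# contains explicit validation checkpoints. Keyword hints are assumptions.

VALIDATION_HINTS = ("run tests", "confirm", "verify", "passes")

def plan_has_checkpoints(steps: list[str], min_checkpoints: int = 1) -> bool:
    """Return True if enough steps read like explicit validation checkpoints."""
    hits = sum(
        1 for step in steps
        if any(hint in step.lower() for hint in VALIDATION_HINTS)
    )
    return hits >= min_checkpoints

plan = [
    "Refactor charge initiation in src/payments/charge.py",
    "Update webhook handler signatures",
    "Run tests after step 2 and confirm all pass",
]
assert plan_has_checkpoints(plan)  # the third step is an explicit checkpoint
```

A check like this catches plans that are all edits and no verification, which is the most common reason to reject and re-plan.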
When to reject and re-plan
If the plan is wrong, don’t just approve and correct during execution. Reject it and give specific feedback:
The plan misses the need to update the API contract in openapi.yaml.
Also, step 4 should happen before step 2, not after. Revise the plan.
GPT-5.5 will produce a revised plan that incorporates your feedback. Iterating on the plan is cheaper than iterating on broken code.
The core idea matches plan mode in other agentic coding tools: plan first, execute second, validate throughout.
Structuring Real-World Agentic Tasks for GPT-5.5
The model is capable. The prompts and task structure are where most people leave performance on the table.
Break large tasks into bounded subtasks
GPT-5.5 handles 20-30 step tasks, but “refactor the entire payments module” isn’t a well-bounded task. It’s a project. Break it down:
- Task 1: Audit the existing payments module, produce a list of identified issues
- Task 2: Refactor the charge initiation flow, keep all existing tests green
- Task 3: Update the webhook handling logic, add test coverage for edge cases
- Task 4: Update the API documentation to reflect the new interface
Each task has a clear end state. You review the output before moving to the next. This produces better results than a single massive prompt, and it gives you checkpoints to catch problems early.
Use file-level specificity
Vague tasks produce vague output. “Fix the login flow” is vague. “Fix the session token expiry logic in src/auth/session.ts — tokens should expire after 24 hours of inactivity, not 24 hours from creation” is specific.
GPT-5.5 can infer a lot from context, but being explicit about which files matter and what the correct behavior should be reduces both errors and token waste.
Define what “working” means
Tell the agent exactly how to verify success:
The task is complete when:
1. All existing tests in /tests/auth pass
2. The new test in /tests/auth/session_expiry.test.ts passes
3. The endpoint returns a 401 with the body {"error": "session_expired"} when a stale token is used
With these criteria defined upfront, GPT-5.5 can self-check before declaring the task done. Without them, it tends to stop when the code looks plausible — not when the behavior is verified.
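The same criteria can be thought of as independent checks that must all pass before the task counts as done. A minimal sketch — the check names mirror the example above and the lambdas are stand-ins for real test runs and HTTP probes:

```python
# Illustrative sketch: encode "done" as independent checks and require all
# of them to pass. The checks below are hypothetical stand-ins.
from typing import Callable

def task_is_done(checks: dict[str, Callable[[], bool]]) -> tuple[bool, list[str]]:
    """Run every check; return overall status plus the names that failed."""
    failed = [name for name, check in checks.items() if not check()]
    return (len(failed) == 0, failed)

done, failed = task_is_done({
    "existing auth tests pass": lambda: True,   # stand-in for a pytest run
    "session expiry test passes": lambda: True,
    "stale token returns 401": lambda: False,   # stand-in for an HTTP probe
})
# done is False here, and `failed` names the unmet criterion
```

The point of the structure is that a partially complete task is reported as such, with the specific unmet criterion named, rather than being declared done because the code looks plausible.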
Handle failures explicitly
Agentic workflows fail. Tests fail, tool calls fail, assumptions turn out to be wrong. Tell the agent what to do when that happens:
If a test fails after your changes, stop and report the failure to me
with the test name, error message, and your hypothesis about the cause.
Do not attempt further changes until I respond.
This prevents the agent from spiraling into increasingly speculative fixes when something breaks. Understanding how agentic workflows handle conditional logic and failure paths helps you write better failure-handling instructions.
Managing Token Costs When Running GPT-5.5 in Codex
GPT-5.5 is more token-efficient than its predecessors, but long agentic sessions still consume meaningful tokens. Here’s how to run them without unnecessary cost.
Limit context to what’s relevant
Codex gives you control over what files and context the agent can see. Don’t let it read the entire repository for a task that only touches two modules. Narrow the context explicitly:
For this task, only read files under /src/auth/ and /tests/auth/.
Do not read or modify files outside these directories.
This isn’t just about cost — it also reduces hallucination risk. The more irrelevant context the model sees, the more likely it is to make changes or assumptions that touch things it shouldn’t.
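If you are assembling context yourself (for example, when feeding files to the API rather than the Codex UI), the directory restriction can be enforced client-side. A sketch, with hypothetical paths:

```python
# Illustrative sketch: filter which file paths are handed to the agent so
# only whitelisted directories enter the context. Paths are hypothetical.
from pathlib import PurePosixPath

ALLOWED = ("src/auth", "tests/auth")

def in_scope(path: str) -> bool:
    """True if `path` sits under one of the allowed directories."""
    parts = PurePosixPath(path).parts
    return any(
        parts[: len(PurePosixPath(allowed).parts)] == PurePosixPath(allowed).parts
        for allowed in ALLOWED
    )

files = ["src/auth/session.ts", "src/payments/charge.ts", "tests/auth/login.test.ts"]
context = [f for f in files if in_scope(f)]
# context keeps only the two auth paths
```

Comparing path components rather than string prefixes avoids accidentally matching sibling directories like src/auth-legacy/.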
Use plan mode to prevent expensive rework
A bad execution path wastes tokens. A plan that catches a wrong assumption before execution saves the tokens that would have gone into making the wrong changes and then undoing them.
This is the same principle behind token budget management in long AI agent sessions — spend tokens on planning upfront to avoid spending far more on rework later.
Set explicit stopping conditions
Without stopping conditions, a GPT-5.5 agent in Codex will keep iterating. Sometimes that’s useful. More often, it means the agent is adding polish you didn’t ask for, or retrying a failing approach in slightly different ways.
Set a maximum iteration count for test-fix loops:
If tests are still failing after 3 fix attempts, stop and report to me.
Do not continue trying to fix independently.
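The bounded test-fix loop the instruction describes can be sketched as follows — run_tests and propose_fix are stand-ins for the real agent actions:

```python
# Illustrative sketch of a test-fix loop with an explicit attempt budget.
# `run_tests` and `propose_fix` are hypothetical stand-ins for agent actions.

def fix_loop(run_tests, propose_fix, max_attempts: int = 3) -> str:
    """Retry fixes until tests pass or the attempt budget is spent."""
    for attempt in range(max_attempts):
        if run_tests():
            return f"tests green after {attempt} fix attempt(s)"
        propose_fix()
    if run_tests():
        return f"tests green after {max_attempts} fix attempt(s)"
    return f"stopped: still failing after {max_attempts} attempts, reporting to user"
```

The budget turns "retrying a failing approach in slightly different ways" from an open-ended token sink into a bounded cost with a defined exit.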
Monitor token usage per session
Codex shows token usage in the session panel. Check it after the planning step and after significant execution milestones. If a session is consuming tokens much faster than you expected, it’s usually because the context window is bloated or the agent is spinning on something.
Practical Examples: What GPT-5.5 in Codex Handles Well
Codebase-wide refactors
GPT-5.5 handles refactors that span multiple files well, especially when the change is systematic. Replacing a deprecated API client across 40 files, updating import paths after a directory restructure, migrating from one state management pattern to another — these are well-suited tasks.
The key is giving the model a clear “before” and “after” definition and letting it work through files methodically with plan mode.
Writing test coverage for existing code
“Write tests for this module” is a task GPT-5.5 executes reliably. It reads the existing code, identifies the relevant behaviors and edge cases, and produces test files. This is one of the tasks where GPT-5.5’s code understanding shows clearly — it tends to catch edge cases that less capable models miss.
Debugging with log analysis
Give GPT-5.5 a set of error logs and the relevant source files, and ask it to identify the root cause and propose a fix. It’s good at pattern-matching across log output and code to locate where something went wrong. This is a strong agentic use case because it requires reading multiple sources, cross-referencing them, and forming a conclusion — not just generating code.
API integration and scaffolding
Asking GPT-5.5 to implement a third-party API integration — authentication, request handling, error handling, retry logic, basic test coverage — typically produces production-ready scaffolding. Give it the API documentation as context and specify your project’s existing patterns.
For how GPT-5.5 compares to Claude Opus 4.7 on tasks like these, see the head-to-head agentic coding comparison.
Where the Codex + GPT-5.5 Stack Falls Short
No tool is good at everything. Being clear about the limits saves time.
Long, ambiguous tasks without clear success criteria
If you can’t define what “done” looks like, GPT-5.5 can’t either. “Make the app better” is not a task. The model will fill in the blanks with its own interpretation, which may not match yours.
Tasks requiring deep domain knowledge you haven’t provided
GPT-5.5 knows a lot, but it doesn’t know your business logic, your team’s conventions, or the institutional context behind design decisions. If you need it to make decisions that depend on that context, you need to provide that context explicitly — or make the decision yourself and give the model a clear instruction.
Real-time debugging of running systems
Codex is a coding agent environment, not a live system monitoring tool. For debugging issues in production systems, you’re still going to need to pull logs, inspect state, and bring that context into Codex manually.
Where Remy Fits in This Picture
Codex with GPT-5.5 is a capable agentic coding environment. It works well when you’re working with existing codebases and know what changes you need to make.
Remy approaches the problem from a different angle. Instead of an AI agent modifying code you write, Remy compiles full-stack applications from a spec — a structured markdown document that describes what the app does, including data types, validation rules, and edge cases as annotations. The spec is the source of truth. The code is the compiled output.
That distinction matters for agentic tasks. When GPT-5.5 or any model modifies code directly, the codebase accumulates drift over time — the code stops cleanly reflecting any single coherent intent. With Remy, you maintain the spec. The code follows. When models improve, you recompile and get better output without rewriting anything.
If you’re building a new application rather than maintaining an existing one, Remy’s spec-driven approach is worth a look. You can try Remy at mindstudio.ai/remy.
Codex vs. Other Agentic Coding Environments
Codex isn’t the only agentic coding environment, and GPT-5.5 isn’t the only capable model. If you’re evaluating your options, a few comparisons worth reading:
- Codex vs Claude Code in 2026: A practical breakdown of when each tool is the better fit
- GPT-5.5 vs Claude Opus 4.7 for real-world coding: Head-to-head performance across common coding tasks
- Best AI models for agentic workflows in 2026: A broader view of the model landscape for agentic use
The short version: GPT-5.5 in Codex is strong for developers already in the OpenAI ecosystem who want a tightly integrated agentic environment. Claude Opus 4.7 in Claude Code has different strengths, particularly for longer reasoning tasks and certain coding styles. The right choice depends on your workflow, not on which model has the higher benchmark score.
If you want to understand how agent harnesses affect performance regardless of model choice, how AI coding agent harnesses work at scale is a useful read.
Frequently Asked Questions
How do I know if GPT-5.5 is available in my Codex account?
Go to the model selector in Codex’s session settings. If GPT-5.5 doesn’t appear, you’re likely on a plan that doesn’t include it. GPT-5.5 in Codex typically requires a ChatGPT Pro plan or direct API access with a model that’s been enabled on your account. You can check OpenAI’s model availability documentation for the current list of which models are available on which plans.
Is plan mode available for all task types in Codex?
Plan mode is available for any task where you explicitly request it. It’s not automatically triggered. You either enable the “Plan before executing” toggle in session settings or include a planning instruction in your task prompt. For very short, well-defined tasks (fix a single typo, rename a variable), plan mode adds overhead without much benefit. For anything spanning multiple files or requiring multiple execution steps, it’s worth using.
How does GPT-5.5’s token efficiency compare to GPT-5.4?
GPT-5.5 is more efficient in agentic contexts — it makes fewer redundant tool calls, reads context more selectively, and produces output that is more concise without sacrificing accuracy. The improvement builds on the tool search approach introduced in GPT-5.4. You can read more about how tool search reduces token consumption for context on where the efficiency gains come from. In practice, GPT-5.5 sessions run noticeably longer before hitting token limits compared to previous models on the same tasks.
Can GPT-5.5 in Codex work with private repositories?
Yes. Codex connects to your repository via your configured GitHub or GitLab integration. The model operates on your actual codebase, not a copy, so it can read, modify, and commit to private repos. Standard access controls apply — the Codex agent operates with whatever permissions your connected account has.
What’s the difference between running GPT-5.5 via the Codex UI and via the API?
The Codex UI provides a managed agentic environment: file system access, shell execution, test running, and a structured session interface. The API gives you raw model access — you’re responsible for building the tool execution loop yourself. For most developers, the Codex UI is the right starting point. The API makes sense if you’re building custom agentic workflows or integrating GPT-5.5 into your own toolchain.
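The tool execution loop you own when going through the API looks roughly like this. Everything here is a simplified assumption: fake_model stands in for an actual model call, and the tool names and message format are illustrative, not an OpenAI API contract:

```python
# Illustrative sketch of a tool-execution loop built around raw model access.
# `fake_model` stands in for a real API call; tool names and the message
# format are simplified assumptions.

TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",
    "run_tests": lambda: "12 passed",
}

def fake_model(messages):
    """Stub: request one tool call, then finish once a tool result arrives."""
    if any(m["role"] == "tool" for m in messages):
        return {"type": "final", "content": "All tests pass; task complete."}
    return {"type": "tool_call", "name": "run_tests", "args": {}}

def agent_loop(task: str, model=fake_model, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = model(messages)
        if reply["type"] == "final":
            return reply["content"]
        result = TOOLS[reply["name"]](**reply["args"])  # dispatch the tool
        messages.append({"role": "tool", "content": str(result)})
    return "stopped: step budget exhausted"

print(agent_loop("verify the test suite"))  # prints "All tests pass; task complete."
```

The Codex UI runs this loop for you, with real file system and shell tools; the API route means implementing the dispatch, the message history, and the step budget yourself.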
Should I use GPT-5.5 for every Codex task?
Not necessarily. GPT-5.5 is the strongest option for complex, multi-step tasks. For simple one-off queries, smaller models are faster and cheaper. OpenAI provides model options at different capability and cost tiers — using a lighter model for straightforward tasks and GPT-5.5 for the complex ones is the sensible approach. The same logic applies across any agentic stack: multi-model routing for agent token costs covers why this matters at scale.
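Tier-based routing can be as simple as a heuristic on task size. A sketch — the light-tier model name and the step-count thresholds are assumptions for illustration, not an OpenAI pricing recommendation:

```python
# Illustrative sketch of tier-based model routing. The light-tier model name
# and the thresholds are hypothetical assumptions.

def pick_model(estimated_steps: int, files_touched: int) -> str:
    """Route simple tasks to a lighter tier, complex ones to the flagship."""
    if estimated_steps <= 3 and files_touched <= 2:
        return "light-tier-model"  # hypothetical cheaper model
    return "gpt-5.5"               # flagship for multi-step work

assert pick_model(1, 1) == "light-tier-model"  # rename a variable
assert pick_model(20, 8) == "gpt-5.5"          # multi-file refactor
```

Even a crude router like this keeps flagship tokens reserved for the tasks that actually need long-horizon reasoning.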
Key Takeaways
- GPT-5.5 is designed for agentic workflows, not chat. Its advantages show up in multi-step tasks, tool use, and long-horizon execution — not in answering single questions.
- To activate it in Codex, select it from the model picker in session settings. It requires a Pro plan or direct API access.
- Plan mode is your most useful lever for complex tasks. Use it, review the plan carefully, and iterate on the plan before approving execution.
- Structure tasks with specific files, explicit success criteria, and defined failure-handling instructions. Vague tasks produce vague results.
- Token efficiency is better than previous models, but still benefits from intentional context management and clear stopping conditions.
- For teams building new applications rather than maintaining existing code, Remy’s spec-driven approach offers a different kind of reliability. Try Remy at mindstudio.ai/remy.