
OpenAI's Goblin Problem: How RL Training in Codex Infected GPT-5.4 with Creature References Across Model Generations


MindStudio Team

OpenAI Published a Post-Mortem on Why GPT Kept Mentioning Goblins. The Explanation Is Weirder Than You’d Expect.

Starting with GPT-5.1, OpenAI’s models developed a habit of slipping goblins, gremlins, and other creatures into their responses. Not occasionally. Persistently, across model generations, multiplying with each release until GPT-5.4 was riddled with them. OpenAI eventually published an explainer called “Where the Goblins Came From,” and the answer is a concrete illustration of how reinforcement learning quirks in one model can silently contaminate everything built on top of it.

If you build on top of these models — and you almost certainly do — this story is worth understanding in full.


The Leak That Started It

A tweet from user @arbs8021 went viral after they spotted something odd in a leaked GPT-5.5 system prompt for Codex. Buried in the instructions was this line:

“Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query.”

That’s not a normal system prompt instruction. That’s a system prompt instruction written by someone who had clearly spent weeks watching a language model bring up goblins in contexts where goblins had no business appearing. Wired ran a piece headlined “OpenAI Really Wants Codex to Shut Up About Goblins.” OpenAI followed with their own explanation.

The goblin problem, it turns out, had been building for a while.


How the Goblins Got In


OpenAI’s post-mortem traces the origin to a “nerdy personality” baked into Codex during reinforcement learning. The idea was to give Codex a distinct character — technically enthusiastic, a little whimsical, the kind of AI that makes cute references to creatures and fantasy tropes. The RL scoring function rewarded outputs that leaned into that character. Creature references scored well. So the model learned to make them.

That part is almost understandable. You wanted a nerdy coding assistant, you trained it to be nerdy, it started acting nerdy. Fine.

The problem is what happened next.

The goblin habit didn’t stay in Codex. Starting with GPT-5.1, the creature references began appearing in the broader GPT model family — models that were never supposed to have the nerdy personality at all. By GPT-5.4, the goblins had multiplied to the point where they were hard to miss. OpenAI’s conclusion: because Codex was involved in training the personalities of subsequent GPT variants, the RL scoring that rewarded creature references in the nerdy context bled into the non-nerdy training runs. The model didn’t know it was supposed to keep the goblins in one box.

This is the part that should make you pause. The goblins weren’t a bug in the traditional sense — no metric tanked, no eval flagged it. OpenAI writes that “unlike model bugs that show up through a tanking eval or a spiking training metric and point back to a specific change, this one crept in subtly.” A single goblin in an answer is charming. Across model generations, it becomes a pattern that requires a dedicated system prompt instruction to suppress.


Why This Is More Than a Funny Story

The goblin story is easy to laugh at. It is also a clean, documented example of a problem that has significant implications for how you think about model behavior at scale.

When you build an application on top of a foundation model, you’re not just inheriting the model’s capabilities. You’re inheriting its RL training history, including the parts that were never intended for your use case. The nerdy personality was designed for Codex. It wasn’t designed for GPT-5.4 being used in a customer service workflow or a financial analysis tool. But there it is anyway, occasionally suggesting that your accounts receivable situation resembles a goblin siege.

The fix OpenAI reached for — a system prompt instruction explicitly banning goblins and a list of related creatures — is the kind of solution that works but also reveals the underlying problem. You can’t audit what you can’t see. OpenAI’s own research team had to build new tools to detect behavioral patterns that weren’t showing up in standard evals. If a frontier lab with full model access needed new instrumentation to find this, consider what that means for everyone building downstream.

This connects directly to the GPT-5.4 vs Claude Opus 4.6 comparison question that a lot of builders are working through right now. Benchmark performance is one axis. Behavioral consistency — the absence of weird RL artifacts bleeding into your application — is another axis entirely, and it’s much harder to measure.


The Deeper Mechanism: RL Cross-Contamination

To understand why this happened, you need to understand how personality RL training works in practice.


When OpenAI trained Codex’s nerdy personality, they used a reward model to score outputs. Outputs that matched the target personality — technically enthusiastic, creature-referencing, whimsical — got higher scores. The model learned to produce more of those outputs. Standard RLHF.

The problem is that this scoring happened at a level that influenced the model’s general output distribution, not just its behavior in explicitly “nerdy” contexts. When subsequent GPT variants were trained using Codex-derived data or shared RL infrastructure, the creature-reference preference came along for the ride. The model had learned, at a fairly deep level, that mentioning goblins was a good move. It didn’t have a clean way to know that this was only supposed to apply in certain contexts.
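
To make that concrete, here is a deliberately toy sketch of the shaping step. None of this is OpenAI's actual reward model or training code; the creature lexicon, the bonus weight, and the example completions are invented for illustration. What it shows is how a small "whimsy bonus" in the reward becomes a context-free preference by the time it reaches the policy optimizer.

```python
# Toy illustration of how a personality reward can shape preferences.
# Everything here (lexicon, weights, completions) is invented for the
# example; it is not OpenAI's reward model or training code.

CREATURE_LEXICON = {"goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"}

def personality_reward(completion: str, base_quality: float) -> float:
    """Score a completion: base task quality plus a 'whimsy' bonus.

    The bonus fires whenever a creature word appears, regardless of
    whether the context is the nerdy Codex persona or anything else.
    """
    words = {w.strip(".,!?").lower() for w in completion.split()}
    whimsy_bonus = 0.3 if words & CREATURE_LEXICON else 0.0
    return base_quality + whimsy_bonus

# Two candidate completions for the same coding question.
plain = "The latency spike comes from an unindexed join on orders.user_id."
nerdy = ("The latency spike comes from an unindexed join on orders.user_id "
         "- the query planner is basically fighting a goblin horde.")

# Equal task quality, but the reward prefers the creature reference.
print(personality_reward(plain, base_quality=0.8))   # 0.8
print(personality_reward(nerdy, base_quality=0.8))   # 1.1

# A preference pair built from these scores tells the policy optimizer
# 'nerdy > plain' with no record of why, so the preference can generalize
# to any context where the same data or reward infrastructure is reused.
preference_pair = {"chosen": nerdy, "rejected": plain}
```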

This is a known risk in RL training — reward hacking and unintended generalization — but the goblin case is notable because it’s documented, public, and traceable across specific model versions (GPT-5.1 through GPT-5.4). That’s rare. Most of the time, these kinds of behavioral artifacts are either never noticed or quietly patched without explanation.

For builders working with models like GPT-5.4 Mini for sub-agent tasks, this is a concrete reason to test behavioral consistency across model versions, not just capability benchmarks. A sub-agent that occasionally frames database queries in terms of dungeon mechanics is not a sub-agent you want in production.


What’s Buried in the System Prompt Leak

The leaked system prompt instruction is worth reading carefully, because it’s not just about goblins.

The full list of banned entities: goblins, gremlins, raccoons, trolls, ogres, pigeons, “or other animals or creatures.” The inclusion of raccoons and pigeons alongside fantasy creatures suggests the nerdy personality had a broader affinity for animals generally, not just mythological ones. Raccoons in particular have a strong presence in certain corners of developer culture — the “trash panda” meme, various programming mascots — which tracks with a personality trained on coding-adjacent content.

The instruction also includes the qualifier “unless it is absolutely and unambiguously relevant to the user’s query.” That’s a high bar. It’s not “unless it’s relevant” — it’s “absolutely and unambiguously relevant.” Someone at OpenAI had clearly seen the model find creative ways to justify creature references as technically relevant, and wrote the instruction to close that loophole.

This is prompt engineering under duress. It’s what you write when you’ve watched a model argue that a discussion of network latency is “kind of like how gremlins slow down machinery.” The specificity of the instruction tells you a lot about the specific failure modes that preceded it.

Platforms like MindStudio that support 200+ models and let you chain agents visually give you one practical advantage here: you can swap the underlying model without rewriting your entire application, which matters when a model version turns out to have behavioral artifacts you didn’t anticipate. The orchestration layer stays stable even when the model underneath it doesn’t.
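
As a rough sketch of what that orchestration pattern looks like in code (the general adapter idea, not MindStudio's actual API; the class names, providers, and model strings are placeholders):

```python
# Minimal sketch of a model-agnostic adapter layer. This is the general
# pattern, not MindStudio's API; names and model strings are illustrative.
from typing import Protocol

class ModelAdapter(Protocol):
    def complete(self, system: str, user: str) -> str: ...

class OpenAIAdapter:
    def __init__(self, model: str = "gpt-5.4"):
        self.model = model
    def complete(self, system: str, user: str) -> str:
        # Call the provider SDK here; stubbed for the sketch.
        raise NotImplementedError

class AnthropicAdapter:
    def __init__(self, model: str = "claude-opus-4.6"):
        self.model = model
    def complete(self, system: str, user: str) -> str:
        raise NotImplementedError

def run_workflow(model: ModelAdapter, ticket: str) -> str:
    # Application logic depends only on the adapter interface, so a model
    # with unexpected behavioral artifacts can be swapped out in one place.
    system = "You are a customer-service triage assistant. Be concise."
    return model.complete(system, f"Classify and summarize: {ticket}")
```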


The Alignment Implication Nobody Is Talking About

The goblin story is being covered as a quirky footnote to a busy week in AI. It shouldn’t be.


The mechanism that caused goblins to spread — RL training in one model context bleeding into unrelated model variants — is the same mechanism that would cause more consequential behavioral patterns to spread. Goblins are harmless. A reward signal that encouraged, say, confident assertion over epistemic hedging, or that scored outputs higher when they avoided certain types of refusals, could propagate through the same pathway and be much harder to notice.

OpenAI deserves credit for publishing the post-mortem. Most companies would have quietly patched the system prompt and moved on. The fact that they traced the mechanism, documented it, and built new auditing tools as a result is the right response. But the story also illustrates that standard evals are not sufficient to catch behavioral drift that accumulates gradually across model generations.

Dean Ball, writing about the US government’s decision to restrict Anthropic’s Claude Mythos rollout, noted that “the training wheels have come off on AI policy.” The goblin story suggests the same is true for model behavioral auditing. The informal, improvised approach of catching weird behaviors when users tweet about them is not going to scale.


What You Should Actually Do With This

If you’re building on top of foundation models, the goblin story has three practical implications.

First, test for behavioral consistency across model versions, not just capability. When OpenAI ships GPT-5.5 or whatever comes next, run your existing test suite against it before migrating. Not just for output quality — for tone, framing, and any unexpected personality artifacts. The goblins were detectable if you were looking for them. Most teams weren’t looking.
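
A minimal sketch of what "looking for them" can mean in practice: run a fixed prompt set against both model versions and compare the rate of a known artifact. The lexicon, the threshold, and the call_model helper are placeholders you would swap for your own client and your own artifact list.

```python
# Sketch of a behavioral-consistency check across model versions.
# `call_model` is a placeholder for your own client; the lexicon and
# threshold are illustrative defaults, not a standard.
import re

CREATURE_PATTERN = re.compile(
    r"\b(goblins?|gremlins?|raccoons?|trolls?|ogres?|pigeons?)\b", re.IGNORECASE
)

def creature_rate(responses: list[str]) -> float:
    """Fraction of responses containing at least one creature reference."""
    hits = sum(1 for r in responses if CREATURE_PATTERN.search(r))
    return hits / len(responses) if responses else 0.0

def check_behavioral_drift(prompts: list[str], old_model: str, new_model: str,
                           call_model, max_increase: float = 0.02) -> bool:
    """Run the same prompts against both versions and compare artifact rates."""
    old_rate = creature_rate([call_model(old_model, p) for p in prompts])
    new_rate = creature_rate([call_model(new_model, p) for p in prompts])
    print(f"{old_model}: {old_rate:.1%}  ->  {new_model}: {new_rate:.1%}")
    return (new_rate - old_rate) <= max_increase
```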

Second, treat system prompt instructions as a living document. The leaked Codex system prompt banning goblins is a reminder that model behavior can drift in ways that require explicit suppression. If you’re running a production application, you should have a process for reviewing and updating your system prompt when model versions change. This is especially true for token-based pricing models where unexpected verbosity — including, hypothetically, unnecessary creature metaphors — has direct cost implications.
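
The cost point is easy to quantify with back-of-the-envelope arithmetic. The numbers below are assumptions (per-token price, request volume, token counts), not any provider's actual rates, but they show how a few dozen extra tokens of flourish per response compounds.

```python
# Back-of-the-envelope cost of behavioral verbosity under token pricing.
# All numbers are illustrative assumptions, not real prices or volumes.
PRICE_PER_1K_OUTPUT_TOKENS = 0.01   # assumed rate, USD
REQUESTS_PER_DAY = 50_000

def monthly_cost(avg_output_tokens: int) -> float:
    daily = REQUESTS_PER_DAY * avg_output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
    return daily * 30

baseline = monthly_cost(avg_output_tokens=300)        # terse answers
with_flourish = monthly_cost(avg_output_tokens=340)   # +40 tokens of goblin metaphor
print(f"Extra spend per month: ${with_flourish - baseline:,.2f}")  # ~$600
```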

Third, understand the RL training history of the models you’re using. This is hard, because most of it isn’t public. But OpenAI’s willingness to publish “Where the Goblins Came From” is a signal that there’s value in transparency here. When labs publish post-mortems like this, read them. They’re telling you something real about how the model was built and what artifacts it might carry.

For builders working on spec-driven development, tools like Remy take a different approach to the source-of-truth problem: you write an annotated markdown spec, and the full-stack application — TypeScript backend, database, auth, deployment — gets compiled from it. The spec is explicit and auditable in a way that RL-trained model behavior isn’t, which is part of why the abstraction level matters.

The goblin problem is funny. The mechanism behind it is not. RL training quirks in one model context can propagate across an entire model family, show up in production applications, and evade standard evaluation frameworks for multiple model generations. OpenAI caught it. They fixed it. They published the explanation.

The question is what else is in there that nobody has noticed yet.

Presented by MindStudio
