
Claude Code Found a UTC Timezone Bug by Reading Conversation Transcripts — No API Logs Required

Claude Code read turn 16 of a voice agent transcript and identified a Cal.com UTC bug the developer never spotted. Here's how it works.

MindStudio Team

Claude Code identified a UTC timezone bug in a Cal.com tool call by reading turn 16 of a raw conversation transcript — not by inspecting API logs, not by adding debug print statements, not by manually diffing request payloads. The developer described what felt wrong, and Claude Code went straight to the transcript, found the specific turn where the tool call fired, read the constructed parameters, and said: the search window is in UTC, not Central time.

That’s the finding. The rest of this post is about why it matters and what it tells you about how to debug AI agents.


The bug that looked like a Cal.com problem

The symptom was subtle. A voice agent built on ElevenLabs was calling a check_availability tool backed by the Cal.com API. The developer had set availability from 9 a.m. to 9 p.m. in Cal.com. The agent was returning only a single slot — 6:30 p.m. — when there should have been several hours of open time.

Three hypotheses were on the table:

  1. Cal.com is only returning 6:30 for some reason (Cal.com-side issue)
  2. The agent is querying a too-narrow time window (tool call construction issue)
  3. Cal.com is returning multiple slots but the agent is misreading the output (parsing issue)


Any of these could produce the same symptom from the user’s perspective. You hear “I’ve got one slot left today, 6:30 p.m. Central” and you don’t know which layer failed.

The instinct most developers would follow here: open the Cal.com dashboard, check the API logs, maybe add some logging to the tool definition, re-run the agent, compare outputs. That’s three or four manual steps before you even have a hypothesis confirmed.

Claude Code skipped all of that.


What ElevenLabs actually gives you

Before getting to the fix, it helps to understand what the ElevenLabs agent infrastructure looks like here, because the transcript access is a feature of the platform, not something the developer set up.

The agent in this build had four configured components: a persona (system prompt defining a warm B2B sales tone), a voice (a professional voice clone trained on roughly 4 hours of the creator’s voice), a knowledge base, and two Cal.com tools — check_availability and book_appointment. Both tools were configured by Claude Code reading the ElevenLabs API documentation and writing the tool definitions directly, without the developer touching the ElevenLabs dashboard.

The .env file pattern used here is standard: CAL_API_KEY and ELEVENLABS_API_KEY as environment variables, never committed to the repo. The Cal.com integration requires two things: the API key and the event type ID for the specific booking type (in this case, a 30-minute discovery call).
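In code, that pattern is small enough to sketch. This is a minimal example assuming Node-style environment variables; the event type variable name is invented for illustration, not taken from the original build:

```ts
// Minimal sketch of the .env pattern: secrets come from the environment,
// never from the repo. Fail fast if either key is missing.
const CAL_API_KEY = process.env.CAL_API_KEY;
const ELEVENLABS_API_KEY = process.env.ELEVENLABS_API_KEY;
if (!CAL_API_KEY || !ELEVENLABS_API_KEY) {
  throw new Error("Set CAL_API_KEY and ELEVENLABS_API_KEY in .env (gitignored)");
}

// The Cal.com side also needs the event type ID for the 30-minute discovery
// call; this variable name is assumed for illustration.
const CAL_EVENT_TYPE_ID = Number(process.env.CAL_EVENT_TYPE_ID);
```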

ElevenLabs logs every conversation. In the dashboard, under the agent’s conversation history, you can read the full transcript of every call — including the raw tool call parameters that fired at each turn. Turn 16 in this particular conversation was where check_availability ran. The parameters Claude Code found there showed the search window constructed in UTC rather than Central time. The agent had been told to operate in Central time in its system prompt, but the tool call itself was building timestamps without the timezone conversion applied.

That’s a classic LLM-as-orchestrator failure mode: the model understands the instruction in natural language but doesn’t apply it correctly when constructing structured parameters for a downstream API call.
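To make that concrete, here is a hypothetical reconstruction of what such a tool call can look like. The field names and dates are invented, not the exact Cal.com schema; the point is that the same wall-clock digits mean different instants depending on the offset:

```ts
// What the system prompt intended: a 4 to 9 p.m. window in Central time (CST, UTC-6).
const intended = {
  start_time: "2025-01-15T16:00:00-06:00", // 4 p.m. Central, offset stated explicitly
  end_time: "2025-01-15T21:00:00-06:00",   // 9 p.m. Central
};

// One common form of the failure: the same digits emitted as bare UTC.
const actual = {
  start_time: "2025-01-15T16:00:00Z", // reads as 10 a.m. Central
  end_time: "2025-01-15T21:00:00Z",   // reads as 3 p.m. Central
};
```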


How Claude Code found it

The developer’s prompt to Claude Code was essentially: “The check availability tool said only 6:30 p.m. is available but the calendar should show 4 p.m. through 9 p.m. Help me figure out what’s going on.”

Claude Code’s response was to go read the conversation transcript. Not to hypothesize. Not to ask clarifying questions. It located turn 16, read the tool call parameters, and identified that the start_time and end_time values being passed to the Cal.com API were in UTC. The developer’s local time is Central (UTC-5 or UTC-6 depending on DST), so a query window that looked like “4 p.m. to 9 p.m.” in the system prompt was actually being sent as “9 p.m. to 2 a.m. UTC” — a window that overlapped with end-of-day or had no availability at all.

The fix was to update the tool call construction to apply the timezone offset before building the timestamp parameters. Claude Code made that change to the agent configuration directly.
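As a sketch of the class of fix, assuming the tool passes ISO-8601 strings (a production version should resolve the IANA zone America/Chicago with a date library rather than hand-picking the DST offset):

```ts
// Attach an explicit Central offset so the timestamp cannot be misread as UTC.
function centralTimestamp(date: string, time: string, isDst: boolean): string {
  const offset = isDst ? "-05:00" : "-06:00"; // CDT vs. CST
  return `${date}T${time}:00${offset}`;
}

centralTimestamp("2025-01-15", "16:00", false); // "2025-01-15T16:00:00-06:00"
```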

One thing worth noting: there was a second, unrelated observation in the same transcript analysis. The system prompt was telling the agent to read back only two or three available slots. That’s a prompt-level constraint, not a bug, but it would have made the availability problem worse if the UTC issue had been producing a sparse result set. Claude Code flagged both but correctly identified the timezone conversion as the primary issue.

After the fix, the agent returned multiple slots — 6:30, 7:00, 7:30, 8:00, and 8:30 p.m. Central — and successfully booked a 7:00 p.m. appointment, sending a confirmation email to the right address. Why 6:30 p.m. was the earliest slot (rather than 4:00 or 5:00) comes down to Cal.com’s minimum notice setting: 2 hours. If it’s currently 4:30 p.m., the first bookable slot is 6:30 p.m. That’s not a bug — that’s the platform working as configured.
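The arithmetic is worth pinning down, because it looks like a bug until you model it. A sketch of the behavior as described above (assumed logic, not Cal.com source):

```ts
// With a minimum notice of 2 hours and 30-minute slots, the first bookable
// slot is the next slot boundary at or after now + notice.
function earliestSlot(now: Date, noticeMinutes = 120, slotMinutes = 30): Date {
  const earliest = now.getTime() + noticeMinutes * 60_000;
  const slotMs = slotMinutes * 60_000;
  return new Date(Math.ceil(earliest / slotMs) * slotMs);
}

// At 4:30 p.m., now + 2h lands exactly on 6:30 p.m., matching the observed first slot.
```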


Why this debugging pattern is underrated

The standard mental model for debugging an AI agent’s tool calls goes something like: add logging, inspect the HTTP request, check the API response, trace the data flow. That works, but it requires you to know in advance which layer to instrument.

What Claude Code did here is different. It treated the conversation transcript as a first-class debugging artifact. The transcript contains not just what the agent said, but what it did — the tool calls it made, the parameters it passed, the responses it received. That’s a complete execution trace, already captured, already readable.

This is actually a general principle worth internalizing: if your AI agent platform logs conversation transcripts with tool call details, that log is often more useful than API-level logs for diagnosing behavioral bugs. API logs tell you what was sent and received. The conversation transcript tells you why the agent sent it — what context it was operating in, what the user had said, what the system prompt had established. Turn 16 in isolation is just a malformed timestamp. Turn 16 in the context of turns 1 through 15 tells you the agent understood “Central time” in natural language but failed to apply it in structured output.
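If your platform returns transcripts as structured data, extracting that execution trace is a few lines of code. The field names below are assumptions; adapt them to whatever your platform actually returns:

```ts
// A minimal shape for a transcript turn; real platforms will differ.
interface TranscriptTurn {
  turn: number;
  role: "user" | "agent" | "tool";
  text?: string;
  toolCall?: { name: string; parameters: Record<string, unknown> };
}

// Pull every tool invocation out of a conversation for inspection.
function toolCalls(transcript: TranscriptTurn[]) {
  return transcript
    .filter((t) => t.toolCall !== undefined)
    .map((t) => ({ turn: t.turn, ...t.toolCall! }));
}
```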

If you’re building agents that chain multiple tool calls — and most production agents do — this kind of agentic workflow pattern becomes essential to understand. The failure mode isn’t usually “the tool doesn’t work.” It’s usually “the tool works but the agent called it with the wrong parameters because of a context or instruction-following gap.”


The session handoff pattern that made this possible

One detail from the build that’s easy to miss: before asking Claude Code to debug the availability issue, the developer did a session handoff. This is a specific technique — a skill, in Claude Code terminology — that clears the current context and injects a structured summary of where the session stands.

The reason this matters for debugging is context rot. In a long Claude Code session where you’ve been building, iterating, changing voices, fixing first-message behavior, and tweaking prompts, the context window fills with noise. When you then ask Claude Code to debug a specific tool call issue, you want it reasoning clearly about that issue — not half-distracted by the 40 previous turns about voice selection and widget configuration.


The session handoff produces a clean context with just the relevant state: what was built, what’s working, what the current problem is. That’s the context in which Claude Code went to read the transcript and found the UTC bug. If you’re running long debugging sessions with Claude Code, building a self-evolving memory system is a natural extension of this pattern — capturing session summaries automatically rather than triggering them manually.


What this means for how you structure agent debugging

The practical takeaway isn’t “use Claude Code to debug ElevenLabs agents” — it’s more general than that.

When you build a voice agent (or any agent with tool calls), the conversation transcript is your primary debugging surface. Structure your platform choice and your logging setup accordingly. ElevenLabs exposes this through the dashboard conversation history. Other platforms expose it differently. Some don’t expose it at all, which is a significant operational disadvantage.

When something goes wrong with a tool call, the first question to ask is: can I read the exact parameters that were passed? Not the parameters you intended to pass. Not the parameters the system prompt described. The actual serialized parameters that hit the API. That’s where the bug lives.
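You can read those parameters in the dashboard, but it is also scriptable. Here is a hedged sketch of pulling a transcript programmatically; the endpoint path and response shape are assumptions, so check the current ElevenLabs Conversational AI docs for the exact API:

```ts
// Fetch a conversation record, including per-turn tool call parameters.
async function fetchTranscript(conversationId: string) {
  const res = await fetch(
    `https://api.elevenlabs.io/v1/convai/conversations/${conversationId}`,
    { headers: { "xi-api-key": process.env.ELEVENLABS_API_KEY! } },
  );
  if (!res.ok) throw new Error(`Transcript fetch failed: ${res.status}`);
  return res.json();
}
```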

Claude Code’s ability to go read that transcript and identify the specific failure is a function of two things: the transcript was accessible, and Claude Code was given enough context to know what “correct” looked like (Central time, not UTC). Neither of those is magic. Both are things you can engineer into your debugging workflow.

For teams building more complex agent infrastructure — multiple tools, multiple models, conditional routing — MindStudio handles this orchestration across 200+ models and 1,000+ integrations, with a visual builder that makes the data flow between tools explicit rather than implicit in a system prompt. The explicit representation helps with exactly this class of bug: when you can see that the output of step A feeds into step B, it’s easier to identify where a timezone conversion should have happened.


The broader pattern: code beats manual configuration

There’s a secondary point in this build that’s worth making explicit. The entire ElevenLabs agent — persona, voice selection, knowledge base, both Cal.com tools — was configured by Claude Code without the developer touching the ElevenLabs dashboard. The check_availability and book_appointment tools were written by Claude Code after it read the ElevenLabs and Cal.com API documentation.
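To see why that is a debugging advantage, consider what a code-first tool definition looks like. This is an illustrative shape only, not the exact ElevenLabs schema, and the Cal.com endpoint is assumed:

```ts
// A tool definition that lives in the repo: readable, diffable, reviewable.
const checkAvailability = {
  name: "check_availability",
  description: "List open 30-minute discovery-call slots in Central time.",
  method: "GET",
  url: "https://api.cal.com/v2/slots", // endpoint assumed for illustration
  parameters: {
    eventTypeId: { type: "number", description: "Cal.com event type for the 30-minute call" },
    start_time: { type: "string", description: "ISO-8601 with an explicit Central offset" },
    end_time: { type: "string", description: "ISO-8601 with an explicit Central offset" },
  },
};
```

The timezone fix then shows up as a diff to how start_time and end_time are built, rather than as invisible dashboard state.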

This matters for debugging because it means the tool definitions live in code, in a repo, under version control. When Claude Code found the UTC bug and fixed it, that fix was a code change — not a dashboard click that might or might not have been saved correctly. The agent configuration is reproducible and diffable.

If you’ve been building agents by clicking through dashboards, this is the argument for moving to code-first configuration. Not because dashboards are bad, but because code is debuggable in ways that dashboard state isn’t. You can read it, diff it, ask an AI to reason about it, and apply fixes that are tracked.


This same principle applies when you’re thinking about the full application layer around an agent. Remy, MindStudio’s app-building agent, takes a similar approach at the app level: you write a spec — annotated markdown — and it compiles a complete TypeScript application from it, including backend, database, auth, and deployment. The spec is the source of truth; the generated code is derived output. The agent configuration-as-code pattern and the spec-as-source-of-truth pattern are solving the same underlying problem: keeping the authoritative representation of your system in something readable and versionable, not locked in a UI.


One more thing about the widget

The voice agent in this build was deployed as an ElevenLabs widget — a single HTML snippet embedded in a Next.js site, deployed to Vercel via GitHub import with the Next.js preset and a POSTHOG_KEY environment variable. The widget itself is just that snippet. Anyone who inspects your page source can copy it, run your agent on their own site, and bill the usage to your ElevenLabs account.

The fix is two settings: lock the widget to a specific hostname in the ElevenLabs security settings, and add your domain to the widget’s allow-domains list. Claude Code can make both of those changes without you touching the dashboard. The same way it found the UTC bug by reading a transcript, it can read the ElevenLabs security documentation and apply the right configuration.

The debugging pattern and the security pattern are the same pattern: give Claude Code access to the right artifacts — transcripts, documentation, configuration files — and describe what you want. It will find the specific thing that’s wrong and fix it. The UTC timezone bug in turn 16 is a good example of what that looks like in practice.

If you’re thinking about browser automation and web scraping with Claude Code, or running parallel feature branches while iterating on agent behavior, the same transcript-first debugging approach applies. The conversation log is the execution trace. Read it before you reach for anything else.
