
Claude Code Found the UTC Timezone Bug in a Cal.com Tool Call by Reading the Conversation Transcript

The Cal.com tool was querying availability in UTC instead of local time. Claude found the bug by reading the transcript — without being told where to look.

MindStudio Team


The Cal.com availability tool was returning only one slot when there should have been several. Claude found the bug without being told where to look — it read the conversation transcript, spotted turn 16, and identified that the check_availability tool had constructed its query window in UTC instead of the user’s local timezone (Central Time).

That’s the thing worth paying attention to here. Nobody said “check the tool parameters” or “look at the timezone handling.” The instruction was something like: “the calendar tool isn’t working right, there should be open slots from 4 p.m. to 9 p.m. but it’s only showing 6:30.” Claude went and read the transcript, found the specific turn where the tool call happened, and diagnosed the root cause from the raw output.

This post is about that debugging move — what it looked like, why it worked, and what it implies for how you think about AI-assisted debugging of tool calls.


What the Bug Actually Was

The setup: a voice agent built with 11 Labs, connected to Cal.com via two tools — check_availability and book_appointment. The agent’s job was to find open slots and book a 30-minute discovery call for prospective clients.


Cal.com was configured with availability from 9 a.m. to 9 p.m. on weekdays. The user had just extended it to 9 p.m. specifically to make more slots available for the demo. So when the agent reported only one slot — 6:30 p.m. — that was clearly wrong. There should have been slots at 4:00, 4:30, 5:00, 5:30, 6:00, 6:30, and so on.

Three things could have been wrong:

  1. Cal.com was actually returning only one slot (something wrong on the Cal.com side)
  2. The agent was querying a limited time window (tool call constructed incorrectly)
  3. Cal.com was returning multiple slots but the agent was misreading the output

Claude identified it as option 2. The check_availability tool was constructing its datetime parameters in UTC. The user is in Central Time, which is UTC-5 during daylight saving time and UTC-6 otherwise. So when the agent asked “what’s available from now until end of day,” it was computing “end of day” in UTC — which, from Central Time’s perspective, lands hours earlier than the local end of day the user expected.

The result: a query window that looked correct in the code but was shifted by 5-6 hours in practice. Most of the available slots fell outside the window the tool was actually checking.
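
To make the failure concrete, here is a minimal sketch of how that window math goes wrong. The construction below is illustrative rather than the actual tool code, but the arithmetic is the same:

    from datetime import datetime, time, timezone
    from zoneinfo import ZoneInfo

    CENTRAL = ZoneInfo("America/Chicago")

    # Buggy: "now until end of day" computed in UTC.
    now_utc = datetime.now(timezone.utc)
    window_end = datetime.combine(now_utc.date(), time(23, 59), tzinfo=timezone.utc)

    # At 4:30 p.m. CDT the current time is 21:30 UTC, so the window ends
    # at 23:59 UTC, which is only 6:59 p.m. Central. Every slot after
    # 6:59 p.m. local time silently falls outside the query.
    print(window_end.astimezone(CENTRAL).strftime("%H:%M"))  # 18:59, not 23:59

Nothing in that code looks broken on its own; the window really does run from now until 11:59 p.m. It is just 11:59 p.m. in the wrong timezone.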


Why This Kind of Bug Is Hard to Catch

Timezone bugs are annoying for a specific reason: the code doesn’t error. The API call succeeds. You get a valid response back. The response just represents a different time window than you intended.

If you were debugging this manually, you’d probably start by checking whether the Cal.com API key was working, then whether the tool was being called at all, then whether the response was being parsed correctly. Timezone handling is often the last thing you check because everything looks like it’s working.

The other reason this is hard: the bug lives in the parameters being sent to the API, not in the code that processes the response. You can read the tool definition a dozen times and not see it, because the logic for constructing the datetime string might look correct in isolation. It’s only when you see the actual constructed value — the literal string being sent — that the problem becomes obvious.

This is also why managing context carefully in Claude Code sessions matters so much during debugging. If your context is cluttered with earlier iterations and failed attempts, Claude has more noise to work through when it’s trying to isolate what actually happened.


How Claude Found It

The debugging session started with a session handoff — a skill that clears the context and provides a summary of where things stand, so the next conversation starts clean. This is worth doing before a debugging pass because you want Claude reasoning about the current state, not the history of how you got there.

The prompt was roughly: “The check_availability tool said we only have 6:30 p.m. available, but we should see 4 p.m. all the way to 9 p.m. Help me figure out what’s going wrong.”

Claude’s response was to go read the conversation transcript from the 11 Labs dashboard. The 11 Labs platform logs all conversations, including the raw tool call inputs and outputs. Claude looked at turn 16 in the transcript — the specific turn where check_availability was called — and read what the agent had actually sent.
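
Concretely, the entry Claude was looking for resembles the sketch below. The field names are hypothetical, not 11 Labs’ exact log schema, but the shape is representative: the literal input the agent constructed sits inline with the turn it belongs to.

    {
      "turn": 16,
      "tool_call": "check_availability",
      "input": {
        "startTime": "2025-05-14T21:30:00Z",
        "endTime": "2025-05-14T23:59:00Z"
      },
      "output": {
        "slots": ["2025-05-14T23:30:00Z"]
      }
    }

The trailing Z on both timestamps is the tell: the window was built in UTC, and 23:59 UTC is only 6:59 p.m. Central.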


That’s where it found the UTC construction. The agent had built the query window using UTC datetimes, not Central Time datetimes. The fix was to change the tool parameter handling so that the availability query was anchored to the user’s local timezone.
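
A minimal sketch of that fix, assuming Python and the standard-library zoneinfo module (the actual tool may be implemented differently):

    from datetime import datetime, time
    from zoneinfo import ZoneInfo

    def availability_window(user_tz: str = "America/Chicago") -> tuple[str, str]:
        """Build a 'now until end of day' window anchored to the user's timezone."""
        tz = ZoneInfo(user_tz)
        now = datetime.now(tz)
        end_of_day = datetime.combine(now.date(), time(23, 59), tzinfo=tz)
        # isoformat() keeps the -05:00 / -06:00 offset, so the API receives
        # an unambiguous local-time window instead of a bare UTC one.
        return now.isoformat(), end_of_day.isoformat()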

What’s notable is that Claude didn’t need to be told “look at the tool call parameters” or “check the timezone handling.” It reasoned about where the bug could be (three possible failure points), identified that the transcript would contain the actual tool call inputs, went and read it, and found the specific line that showed the problem.


The Transcript as a Debugging Surface

This is the part that’s actually interesting from a tooling perspective.

When you’re debugging a tool call in a voice agent, you have a few options. You can add logging to the tool itself. You can inspect the API response. You can read the agent’s system prompt and try to reason about what it would do. Or — if your platform logs it — you can read the raw conversation transcript, which includes the actual inputs the agent constructed and sent.

The 11 Labs platform logs this. Every conversation is stored, timestamped, and readable. You can listen to the audio, read the transcript, and see the tool call inputs and outputs inline with the conversation flow.

Claude used that as a debugging surface. It’s the same thing a human engineer would do if they knew to look there — but the human would have to know to look there, know what they were looking for, and know how to interpret what they found. Claude did all three steps without being prompted through them.

For context: the voice agent had been built, debugged, and iterated to a working state in approximately 45 minutes using only natural language prompts. The UTC bug was found and fixed within that window. No API documentation was read by the human. No manual inspection of tool parameters happened on the human side.


What This Tells You About Debugging Tool Calls

There’s a pattern here that’s worth making explicit.

When a tool call produces wrong results, the failure is almost always in one of three places: the parameters being sent, the response being parsed, or the logic that decides when to call the tool. The transcript — if your platform logs it — shows you all three. You can see what triggered the call, what was sent, and what came back.

Claude’s approach was to treat the transcript as evidence rather than trying to reason about the bug from first principles. That’s actually the right move. First-principles reasoning about timezone bugs is slow and error-prone. Reading the actual constructed value is fast and unambiguous.

The implication for you: when you’re building agents with tool calls, make sure you have access to the raw inputs being sent. Not just “did the tool get called” but “what exact parameters did it send.” If your platform logs this (11 Labs does, in the conversation transcript), that log is your primary debugging surface. Point Claude at it before you start reasoning about what might be wrong.


This connects to a broader point about Claude Code token management: when you’re doing a debugging pass, you want Claude spending its context on the relevant evidence — the transcript, the tool definition, the specific failure — not on the full history of how the agent was built. The session handoff before the debugging pass wasn’t incidental. It’s what made the debugging efficient.


The Other Constraint That Was Hiding

Once the UTC bug was fixed, there was still a question: why wasn’t 4:00 or 4:30 p.m. available, even with the corrected timezone?

The answer was in Cal.com’s settings, not in the tool call. Cal.com has a “minimum notice” setting — in this case, 2 hours. If the current time is 4:30 p.m., the earliest bookable slot is 6:30 p.m. That’s not a bug. That’s the system working as configured.
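
The arithmetic is easy to check. A sketch, where the 2-hour figure is the actual Cal.com setting and the half-hour slot grid is an assumption:

    from datetime import datetime, timedelta

    MINIMUM_NOTICE = timedelta(hours=2)
    SLOT_MINUTES = 30  # assumed slot grid

    def earliest_bookable(now: datetime) -> datetime:
        """First slot boundary at or after now plus the minimum notice."""
        cutoff = now + MINIMUM_NOTICE
        remainder = (cutoff.minute % SLOT_MINUTES) * 60 + cutoff.second
        if remainder:  # round up to the next slot boundary
            cutoff += timedelta(seconds=SLOT_MINUTES * 60 - remainder)
        return cutoff.replace(second=0, microsecond=0)

    # 4:30 p.m. plus two hours of notice: the first bookable slot is 6:30 p.m.
    print(earliest_bookable(datetime(2025, 5, 14, 16, 30)))  # 2025-05-14 18:30:00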

This is a good example of a hidden constraint that looks like a bug. The tool call was returning correct data. The agent was reading it correctly. The available slots genuinely started at 6:30 because of the minimum notice rule. The only reason it looked wrong was that the UTC bug had been masking it — when the query window was shifted by 5-6 hours, you’d get different (also wrong) results, and it was hard to tell which problem was which.

After the UTC fix, the minimum notice constraint became visible and made sense. This is also why fixing one bug at a time matters: if you try to fix everything simultaneously, you can’t tell which fix resolved which symptom.


What You’d Want to Replicate

If you’re building voice agents with tool calls and want to be able to debug them the way Claude did here, a few things need to be in place.

Transcript logging. You need to be able to read the raw tool call inputs and outputs, not just the conversation text. 11 Labs logs this in the conversation history. If you’re using a different platform, check whether it exposes this — and if it doesn’t, add logging at the tool level (see the sketch after this list).

Clean context before debugging. The session handoff pattern — clearing context and starting fresh with a summary — meant Claude was reasoning about the current state, not the accumulated history. For debugging specifically, this matters. You want the model focused on the failure, not on everything that came before it.

Specific failure description. The prompt wasn’t “the calendar tool is broken.” It was “the tool said we only have 6:30 p.m. available, but we should see 4 p.m. all the way to 9 p.m.” That specificity is what let Claude reason about the three possible failure points and then go look for evidence.

Plan mode before execution. The agent was built using plan mode first — Claude asked five clarifying questions before writing any code, including questions about how the Cal.com booking should work (direct from 11 Labs to Cal.com, not through an intermediary). That upfront alignment meant the tool definitions were closer to correct from the start, which made the UTC bug the only real issue rather than one of many. If you’re not already using Opus plan mode to separate planning from execution, it’s worth the habit.
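
Here is the tool-level logging sketch referred to above, assuming the tools are plain Python functions. The decorator and names are illustrative, not any platform’s API:

    import functools
    import json
    import logging

    logger = logging.getLogger("tool_calls")

    def logged(tool_fn):
        """Record the exact parameters sent and the raw result returned."""
        @functools.wraps(tool_fn)
        def wrapper(**params):
            logger.info("tool=%s input=%s", tool_fn.__name__, json.dumps(params))
            result = tool_fn(**params)
            logger.info("tool=%s output=%s", tool_fn.__name__, json.dumps(result))
            return result
        return wrapper

    @logged
    def check_availability(startTime: str, endTime: str) -> dict:
        ...  # the actual Cal.com request goes here

What matters is that the log captures the literal strings sent over the wire; a UTC-shifted window is invisible in the code and obvious in the log.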


The .env file handling was also clean throughout: Cal.com API key and 11 Labs API key stored in .env, excluded from git automatically. Small thing, but it’s the kind of operational hygiene that prevents a different class of problems from appearing during debugging.
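
For reference, the pattern is minimal; the key names below are placeholders:

    # .env  (excluded from git via .gitignore)
    CALCOM_API_KEY=...
    ELEVENLABS_API_KEY=...

    # loaded at startup, e.g. with python-dotenv:
    from dotenv import load_dotenv
    load_dotenv()  # reads .env into os.environ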

Platforms like MindStudio take a different approach to this orchestration layer — 200+ models, 1,000+ integrations, and a visual builder for chaining agents and workflows — which can reduce the surface area where timezone and parameter bugs hide, since the integration plumbing is handled at the platform level rather than in hand-written tool definitions.


The Broader Point

The UTC timezone bug is a specific, fixable problem. But the interesting thing isn’t the bug — it’s the debugging method.

Claude read the transcript. It identified the relevant turn. It found the constructed parameter value. It diagnosed the root cause. It did this without being told where to look.

That’s a different kind of AI assistance than “write me a tool call.” It’s closer to having a debugging partner who knows to check the actual evidence rather than reason from assumptions. The value isn’t that Claude knows more about Cal.com’s API than you do. It’s that it went and looked at what actually happened, rather than what should have happened.

For anyone building agents with external tool calls — Cal.com, Google Calendar, Stripe, whatever — the practical takeaway is this: your conversation transcript is a debugging artifact. Treat it like one. And if something’s returning wrong results, the first move is to read what was actually sent, not to reason about what should have been sent.

Tools like Remy operate on a similar principle — the spec is the source of truth, and the generated output is derived from it, so when something’s wrong you fix the spec and recompile rather than chasing symptoms in the generated code. The debugging philosophy is the same: go back to the source of truth, read what’s actually there, and fix it at that level.

The same method caught one more issue: the NATO phonetic spelling step that was removed from the system prompt after it slowed down the email confirmation flow. Someone noticed the behavior, described it specifically, and Claude fixed the prompt. The transcript is always the evidence. Read the transcript.
