What Is Thinking Machine's Interaction Model? Time Tokenization Explained
Thinking Machine's TML model tokenizes time into 200ms chunks for true real-time AI interaction. Learn how it differs from GPT-4o and Gemini Live.
Why Real-Time AI Still Feels Slightly Off
You’ve probably noticed it. You’re talking to an AI voice assistant and there’s that small, uncomfortable delay — the half-second pause where you can’t tell if it’s thinking, if it heard you, or if you should keep talking. Then the response comes, but by then the moment has passed.
That’s not a bandwidth problem or a hardware limitation. It’s a fundamental architectural one. Most AI models, including the voice modes in GPT-4o and Gemini Live, weren’t built with time as a first-class concept. They process tokens — chunks of text, audio, or image data — but they don’t inherently understand the passage of time during a conversation.
Thinking Machine’s Interaction Model (TML) takes a different approach. It treats time itself as something to tokenize — breaking it into discrete 200ms chunks that the model uses as native inputs. The result is a system designed for real interaction, not simulated real-time.
This article explains what time tokenization is, how TML implements it, and why it matters for the broader trajectory of conversational AI.
The Core Problem with Existing Real-Time AI Models
To understand what TML solves, it helps to understand how existing systems handle live audio interaction.
The Turn-Based Trap
GPT-4o’s voice mode, in its base implementation, relies on voice activity detection (VAD). The system listens for audio input, detects when you’ve stopped speaking (usually based on a silence threshold), then processes the full audio and generates a response. It’s quick — sometimes under a second — but it’s fundamentally sequential.
This creates problems:
- You can’t interrupt mid-sentence without the model ignoring you or getting confused
- Pauses within your speech (normal thinking pauses) can trigger a premature response
- The model has no understanding of how long you’ve been quiet, how long its own response has been running, or the temporal rhythm of the conversation
Streaming Audio Isn’t the Same as Time Awareness
Gemini Live and similar streaming multimodal systems improve on turn-based VAD by processing audio in real time — they start reading your input before you finish speaking. But streaming audio and understanding time are different things.
A model can receive a continuous stream of audio tokens and still have no concept of whether 200ms passed between two tokens or 2 seconds did. The tokens describe what happened; they don’t encode when.
Why Time Matters in Conversation
Human conversation is deeply temporal. We use timing as a communication channel:
- A 400ms pause before answering signals you’re thinking
- A 50ms gap signals you’re ready to continue
- Speaking over someone signals urgency or excitement
- Silence after a question signals discomfort or uncertainty
If an AI can’t perceive these signals, it can’t respond to them appropriately. It can only respond to the content of what was said, not the full communicative act.
What Is Time Tokenization?
Time tokenization is the practice of converting continuous time into discrete, fixed-length units that a model can process natively — the same way text tokenization converts a sentence into a series of word-piece tokens.
In TML’s implementation, the model divides the interaction timeline into 200ms windows. Each window becomes a token (or set of tokens representing what happened in that window: audio signal, silence, metadata). These time tokens are fed into the model alongside content tokens.
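To make this concrete, here is a minimal sketch of what slicing a signal into 200ms tokens could look like. TML's actual tokenizer is not public; the window size is the only detail taken from the description above, and the names, sample rate, and energy threshold are illustrative assumptions.

```python
# Minimal sketch of time tokenization (illustrative; TML's real
# tokenizer is not public). We slice a mono signal into fixed 200ms
# windows and emit one token per window, speech or not.

from dataclasses import dataclass

WINDOW_MS = 200                                       # chunk size from the article
SAMPLE_RATE = 16_000                                  # assumed sample rate
SAMPLES_PER_WINDOW = SAMPLE_RATE * WINDOW_MS // 1000  # 3,200 samples

@dataclass
class TimeToken:
    index: int     # position on the timeline (index * 200ms = start time)
    kind: str      # "SPEECH" or "SILENCE"; a real system would carry more
    energy: float  # mean absolute amplitude inside the window

def tokenize_timeline(samples: list[float], threshold: float = 0.02) -> list[TimeToken]:
    """Turn a continuous signal into discrete 200ms time tokens."""
    tokens = []
    for i in range(0, len(samples), SAMPLES_PER_WINDOW):
        window = samples[i : i + SAMPLES_PER_WINDOW]
        energy = sum(abs(s) for s in window) / max(len(window), 1)
        kind = "SPEECH" if energy >= threshold else "SILENCE"
        tokens.append(TimeToken(index=i // SAMPLES_PER_WINDOW, kind=kind, energy=energy))
    return tokens
```

The key property is that silent windows still produce tokens: the timeline keeps ticking whether or not anyone is speaking, which is exactly the information turn-based systems discard.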
What 200ms Represents
At 200ms per chunk, the model processes 5 tokens per second of real time. That granularity is:
- Fine enough to distinguish a thinking pause (~300–500ms) from an end-of-speech signal (~800ms+)
- Fine enough to detect interruptions or overlapping speech
- Coarse enough to stay computationally tractable in real-time inference
This is not arbitrary. Human reaction times to simple auditory stimuli are on the order of 150–250ms. Building the time resolution around that threshold means the model operates at a granularity that corresponds to meaningful human perceptual units.
Time Tokens vs. Audio Tokens
Audio tokens encode the content of sound — phonemes, speaker characteristics, acoustic features. Time tokens encode the structure of when things happen.
In TML, these two types of tokens work together. The model doesn’t just know what you said; it knows when you said it, how long each part took, where the silence fell, and how that pattern relates to what’s come before.
How TML’s Architecture Works
TML is built on the premise that interaction is fundamentally a temporal phenomenon. Its architecture reflects that at several levels.
Token Streams, Not Turn Sequences
Traditional LLMs — and most voice models — operate on conversation turns. User speaks → model processes → model responds → repeat. The conversation exists as a list of alternating inputs and outputs.
TML operates on parallel token streams. There’s a user audio stream, a model output stream, and a time stream — all running simultaneously and all visible to the model at inference time. The model doesn’t wait for a “turn” to end. It continuously updates its understanding as time passes.
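As a rough illustration, those parallel streams can be pictured as a data structure with one shared 200ms clock. The shape below is an assumption made for the sketch, not TML's actual interface.

```python
# Illustrative sketch of parallel token streams sharing one 200ms clock.
# The names and structure are assumptions, not TML's real API.

from dataclasses import dataclass, field

@dataclass
class InteractionState:
    user_stream: list[str] = field(default_factory=list)   # one token per 200ms
    model_stream: list[str] = field(default_factory=list)  # one token per 200ms
    tick: int = 0                                           # shared time index

    def step(self, user_token: str, model_token: str) -> None:
        # Both streams advance together: even "nothing happened" is recorded,
        # so the model always knows how much time has passed on each side.
        self.user_stream.append(user_token)
        self.model_stream.append(model_token)
        self.tick += 1

state = InteractionState()
state.step("SPEECH", "SILENCE")   # user talking, model listening
state.step("SILENCE", "SILENCE")  # a 200ms pause, visible to both streams
state.step("SILENCE", "SPEECH")   # model starts responding one window later
```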
Silence as Signal
In a turn-based system, silence between words is typically stripped or compressed. It’s treated as nothing.
In TML, a 200ms window of silence is still a token. It gets processed. The model learns, through training, that different durations and placements of silence carry different meaning. A two-second silence after a complex question means something different than a 200ms silence mid-sentence.
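As a toy stand-in for the mapping a trained model would learn implicitly, here is a hand-written classifier using the pause-duration ranges quoted earlier. The function name and exact thresholds are illustrative assumptions.

```python
# Toy mapping from silence duration to conversational meaning. A trained
# model would learn this implicitly; the thresholds echo the ranges
# quoted earlier in the article and are assumptions, not TML's values.

def classify_pause(silence_tokens: int) -> str:
    """Interpret a run of consecutive 200ms silence tokens."""
    ms = silence_tokens * 200
    if ms <= 200:
        return "gap"             # brief gap: speaker is ready to continue
    if ms <= 600:
        return "thinking-pause"  # ~400-600ms: speaker is mid-thought
    if ms < 2000:
        return "end-of-speech"   # ~800ms+: likely turn-final
    return "extended-silence"    # 2s+: e.g. hesitation after a hard question

print(classify_pause(4))  # four silence tokens = 800ms -> "end-of-speech"
```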
Interrupt Handling
Because TML processes time in real-time parallel streams, it can detect and respond to interruptions naturally. If the model is generating a response and you start speaking, both events exist in the time token stream simultaneously. The model can recognize the overlap pattern and stop, pause, or adjust — the same way a person would.
This is a meaningful departure from systems that either ignore interruptions or require special engineering workarounds to handle them.
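Under the assumed two-stream representation sketched earlier, detecting an interruption reduces to finding a window where both streams contain speech at once.

```python
# Sketch of overlap detection on parallel streams (assumed structure,
# not TML's real mechanism). An interruption is simply a 200ms window
# where both streams contain speech while the model holds the floor.

def detect_interruption(user_stream: list[str], model_stream: list[str]) -> int | None:
    """Return the tick of the first overlap window, or None."""
    for tick, (user, model) in enumerate(zip(user_stream, model_stream)):
        if user == "SPEECH" and model == "SPEECH":
            return tick  # user spoke while the model was speaking
    return None

user  = ["SILENCE", "SILENCE", "SPEECH", "SPEECH"]
model = ["SPEECH",  "SPEECH",  "SPEECH", "SILENCE"]
print(detect_interruption(user, model))  # 2 -> overlap began 400ms in
```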
TML vs. GPT-4o Voice Mode vs. Gemini Live
Each of these systems takes a meaningfully different approach to real-time interaction.
GPT-4o Voice Mode
GPT-4o’s voice capability runs audio end-to-end through the model (rather than transcribing to text first, as earlier voice implementations did). This reduces latency and preserves prosody. But the interaction model is still largely turn-based at the user-facing level.
The model is fast and capable of generating expressive, natural-sounding audio responses. However, its timing awareness is limited. It can detect broad cues (such as when you've finished speaking) but isn't tracking the temporal structure of the conversation as a sequence of 200ms time units.
Best for: Fluid voice conversations where turn structure is relatively clear. Less suited for rapid back-and-forth or conversational contexts with lots of overlapping speech.
Gemini Live
Gemini Live (part of Google's Gemini ecosystem) is one of the most capable publicly available streaming voice systems today. It handles interruptions better than most competitors and maintains conversational context well.
Its streaming architecture means it processes audio continuously rather than in discrete turns. But like GPT-4o, it doesn’t treat time itself as a token type — it processes audio content that happens to arrive in real time.
Best for: Extended voice sessions with a more natural flow. Good interruption handling relative to competitors. Strong multimodal integration.
Thinking Machine’s TML
TML’s core differentiation is structural. Time is part of the model’s input representation, not just a property of the delivery mechanism.
| Feature | GPT-4o Voice | Gemini Live | TML |
|---|---|---|---|
| Audio processing | End-to-end audio tokens | Streaming audio | Parallel time + audio streams |
| Time representation | Implicit | Implicit | Explicit (200ms tokens) |
| Interrupt handling | Limited | Good | Native |
| Silence as signal | Stripped/threshold | Compressed | Tokenized |
| Conversation model | Turn-based | Streaming | Temporal parallel streams |
The practical effect of this architectural difference is that TML is designed to feel less like talking to a system and more like talking with one — because the model is tracking the same temporal signals a human listener would track.
Why 200ms Is a Meaningful Unit
The 200ms chunk size isn’t just a design choice made for computational convenience. It connects to real neuroscience and psycholinguistics research.
Auditory Perception Windows
The human auditory system integrates sound over roughly 200–300ms windows when processing speech. This is sometimes called the “perceptual present” for speech: the window over which the brain holds and integrates acoustic information before committing to a meaningful unit.
By aligning the model’s time resolution to this window, TML is working at a granularity that corresponds to how humans perceive and process conversational timing.
Turn-Taking Research
Research in conversation analysis shows that turn transitions in human dialogue typically happen within 200–300ms of a turn-final signal. People are remarkably accurate at predicting when another speaker is about to stop — and they begin preparing their response before that moment arrives.
A model that tokenizes time at 200ms resolution can, in principle, learn these predictive patterns from training data. It can recognize turn-final cues and begin generating a response at the right moment — not a second after the silence threshold trips.
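A trained model would learn these cues end-to-end; as a crude hand-rolled stand-in, a heuristic over the last few time tokens (reusing the hypothetical TimeToken from the earlier sketch) can show the shape of the idea.

```python
# Crude heuristic stand-in for learned turn-end prediction. A real model
# would learn this from data; here we just look for falling energy
# followed by a single 200ms silence window.

def likely_turn_end(recent: list[TimeToken]) -> bool:
    """Guess whether the speaker is wrapping up, given the last 3 tokens."""
    if len(recent) < 3:
        return False
    a, b, c = recent[-3], recent[-2], recent[-1]
    falling_energy = a.kind == "SPEECH" and b.energy < a.energy
    trailing_silence = c.kind == "SILENCE"
    # One 200ms silence after falling energy is enough to start preparing
    # a response, roughly 600ms earlier than an 800ms VAD threshold.
    return falling_energy and trailing_silence
```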
The Floor Effect
Below roughly 150ms, listeners stop perceiving acoustic events as separate units in natural speech. Going finer than 200ms therefore adds computational overhead without adding meaningful information for conversational interaction. Going coarser (say, 500ms) loses too much resolution.
200ms sits at a productive middle: fine enough to matter, coarse enough to stay real-time.
What Time Tokenization Enables in Practice
The architectural choice to tokenize time unlocks a set of interaction capabilities that are difficult or impossible to achieve with content-only token streams.
Backchanneling
Backchannels are the small responses listeners make during speech: “mm-hmm,” “right,” “yeah.” They happen mid-utterance, without breaking the speaker’s turn. They signal active listening.
For an AI to produce or recognize backchannels, it needs to understand the real-time flow of conversation, not just process it after the fact. Time tokenization provides the necessary resolution.
Prosodic Response
Prosody — the rhythm, stress, and intonation of speech — carries meaning beyond words. A rising intonation signals a question. Slowing pace signals emphasis. Increased speed signals urgency.
When time is tokenized, prosodic patterns become learnable features. The model can respond not just to what was said but to how it was paced.
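Once time is tokenized, pace becomes a countable quantity. A small illustrative helper (again reusing the hypothetical TimeToken) makes the point; the function and its inputs are assumptions, not part of TML.

```python
# Illustrative only: with 200ms tokens, "how fast was that said?" reduces
# to counting speech windows. Reuses the hypothetical TimeToken above.

def speech_rate(tokens: list[TimeToken], words_heard: int) -> float:
    """Approximate words per second over a tokenized span."""
    speech_windows = sum(1 for t in tokens if t.kind == "SPEECH")
    seconds = max(speech_windows * 0.2, 0.2)  # each window is 200ms
    return words_heard / seconds

# 12 words across 10 speech windows (2 seconds) -> 6 words/sec: hurried.
# The same 12 words across 40 windows (8 seconds) -> deliberate pacing.
```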
Multi-Party Conversation
In group conversations, multiple people speak simultaneously, interrupt each other, and use silence strategically. Systems that only handle one audio stream at a time struggle here.
TML’s parallel stream architecture, combined with time tokenization, makes multi-party conversation more tractable — each speaker’s audio stream exists in the same temporal coordinate system, and the model can track them in relation to each other.
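In a shared 200ms coordinate system, cross-speaker timing questions become simple index lookups. A minimal sketch, with the per-speaker stream structure assumed for illustration:

```python
# Minimal sketch of multi-party alignment: every speaker's stream shares
# one 200ms clock, so overlap detection is just indexing. The stream
# structure here is an assumption for illustration.

def who_is_speaking(streams: dict[str, list[str]], tick: int) -> list[str]:
    """All speakers with speech in the given 200ms window."""
    return [name for name, s in streams.items()
            if tick < len(s) and s[tick] == "SPEECH"]

streams = {
    "alice": ["SPEECH", "SPEECH", "SILENCE", "SILENCE"],
    "bob":   ["SILENCE", "SPEECH", "SPEECH", "SILENCE"],
}
print(who_is_speaking(streams, 1))  # ['alice', 'bob'] -> overlap at 200-400ms
```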
Where MindStudio Fits Into This Picture
The emergence of interaction models like TML represents a broader shift in how AI systems handle real-world communication. As these models move from research into production availability, the practical question becomes: how do you build applications that use them?
That’s where platforms like MindStudio become relevant.
MindStudio gives you access to 200+ AI models — including multimodal and voice-capable models — through a single interface, without needing to manage API keys, handle authentication, or set up separate accounts for each provider. As new models with real-time interaction capabilities become available, they get added to the platform.
If you’re building voice agents, customer-facing AI applications, or any workflow that involves real-time interaction, you can build and deploy those agents on MindStudio in hours rather than weeks. The platform handles infrastructure, model routing, and integrations — you focus on what the agent should actually do.
You can try MindStudio free at mindstudio.ai.
Frequently Asked Questions
What is time tokenization in AI models?
Time tokenization is the process of dividing continuous time into fixed-length discrete units — in TML’s case, 200ms chunks — and encoding those units as tokens that the model can process. It’s analogous to how text tokenization breaks sentences into word-pieces. The result is a model that has an explicit, native understanding of when things happen in a conversation, not just what was said.
How does TML differ from GPT-4o’s voice mode?
GPT-4o’s voice mode processes audio end-to-end, which reduces latency and improves audio quality. But it uses a largely turn-based interaction model where time is implicit rather than explicitly tokenized. TML treats time as a first-class token type, running parallel streams for audio content and temporal position simultaneously. This gives TML native interrupt handling, silence recognition, and prosodic awareness that GPT-4o’s voice mode approximates through other means.
Why does the 200ms chunk size matter?
200ms aligns with the human auditory perceptual window for speech processing and corresponds to the typical latency of turn transitions in human conversation. It’s fine-grained enough to distinguish meaningful conversational signals (thinking pauses vs. end-of-speech vs. interruptions) while remaining computationally tractable for real-time inference. Finer resolution adds overhead without adding meaningful information; coarser resolution loses too much.
Is time tokenization the same as streaming audio?
No. Streaming audio means the audio signal is delivered to the model in real time rather than after you finish speaking. Time tokenization means time itself is represented in the model’s token vocabulary. A model can receive streaming audio without treating time as a token — Gemini Live does this. Time tokenization adds a second layer: the model doesn’t just receive audio that happens to arrive in real time, it has an explicit representation of temporal structure as part of its input.
What are the limitations of TML’s approach?
Time tokenization increases the complexity of the token stream. At 5 time tokens per second, a 10-minute conversation generates 3,000 time tokens on top of the audio content tokens — which has implications for context window usage and inference cost. The approach also requires training data that accurately captures the temporal structure of human conversation, which is harder to collect and annotate than text. And like all voice AI systems, performance degrades with noisy audio environments.
Can I build real-time voice agents without using TML specifically?
Yes. Several capable real-time voice models are available today, including GPT-4o’s voice mode and Gemini Live, both of which handle real-time interaction well for most use cases. TML’s value is in edge cases where the temporal structure of conversation matters most: interruption-heavy interactions, multi-party conversations, and applications where natural turn-taking cues are critical. For many voice agent use cases, existing models are sufficient.
Key Takeaways
- Most current real-time AI voice systems treat time as a delivery mechanism rather than a native part of the model’s input representation.
- TML encodes time as explicit 200ms tokens, giving the model a native understanding of when events happen — not just what happened.
- The 200ms resolution corresponds to meaningful human perceptual and conversational timing thresholds, making it a principled design choice rather than an arbitrary one.
- This enables capabilities like natural interrupt handling, silence recognition, and backchanneling that are difficult to achieve with content-only token streams.
- GPT-4o and Gemini Live are capable systems with different architectural tradeoffs — TML’s approach is distinct in treating temporal structure as a first-class problem.
- As real-time interaction models mature, platforms like MindStudio provide a practical layer for building applications on top of them without rebuilding infrastructure from scratch.