Thinking Machines Lab unveils interaction models for fluid, real-time voice and video AI


Thinking Machines Lab has unveiled a research preview of its "interaction models," a new class of AI designed for near-real-time voice and video conversations that mimic natural human dialogue rather than rigid turn-taking. Unlike traditional AI systems, in which users enter a prompt and wait for a full response, these models process 200-millisecond chunks of audio, video, and text input while simultaneously generating 200 ms of output, so overlap, interruptions, and silences are handled natively. The company announced the preview on Monday, positioning it as a shift from the "chatbot turn" era to fluid collaboration, with a wider public release promised later this year.
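
To make the 200-millisecond cadence concrete, here is a minimal sketch of a loop that consumes one input chunk and emits one output chunk per step. It is an illustration only, not Thinking Machines' implementation: the 16 kHz sample rate, the function names `chunk_samples` and `interaction_loop`, and the `respond` callback standing in for the model are all assumptions.

```python
from typing import Callable, Iterator, List, Tuple

CHUNK_MS = 200        # chunk duration described in the article
SAMPLE_RATE = 16_000  # assumed sample rate; the article does not state one


def chunk_samples(samples: List[float],
                  sample_rate: int = SAMPLE_RATE,
                  chunk_ms: int = CHUNK_MS) -> Iterator[List[float]]:
    """Yield fixed-size chunks, zero-padding the final partial chunk."""
    size = sample_rate * chunk_ms // 1000
    for start in range(0, len(samples), size):
        chunk = samples[start:start + size]
        if len(chunk) < size:
            chunk = chunk + [0.0] * (size - len(chunk))
        yield chunk


def interaction_loop(samples: List[float],
                     respond: Callable[[List[float]], List[float]]
                     ) -> List[Tuple[int, int, int]]:
    """Pair every input chunk with an output chunk of the same duration.

    `respond` is a hypothetical stand-in for the model. Returns a timeline of
    (start time in ms, input chunk length, output chunk length).
    """
    timeline = []
    for index, chunk in enumerate(chunk_samples(samples)):
        output = respond(chunk)  # output is produced for the same time slice
        timeline.append((index * CHUNK_MS, len(chunk), len(output)))
    return timeline


def stay_silent(chunk: List[float]) -> List[float]:
    """Placeholder model: answers every chunk with silence of equal length."""
    return [0.0] * len(chunk)
```

Under these assumptions, one second of audio becomes five steps, each pairing 3,200 input samples with 3,200 output samples, so silence is simply another kind of output rather than a gap between turns.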

According to details shared by Thinking Machines, the models are trained from scratch as multi-stream systems that integrate text, image frames (processed as 40x40 patches), and audio signals like dMel spectrograms. This architecture allows the AI to remain "present" during interactions, even as separate background models handle slower tasks such as reasoning or tool use. Public demos showcase practical applications, including real-time speech translation, detecting animal mentions in stories, and alerting users to posture issues like slouching during video calls.
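
As a rough illustration of the patch-based image input, the sketch below cuts a frame into 40x40 tiles in row-major order. It assumes that "40x40 patches" means tiles of 40 by 40 pixels and that the frame dimensions are exact multiples of 40; the function name `patchify` and the 120x80 frame size used in the note below are hypothetical, and the article does not say how the model actually tokenizes frames.

```python
from typing import List

PATCH = 40  # patch side length from the article, assumed to be in pixels


def patchify(frame: List[List[int]], patch: int = PATCH) -> List[List[List[int]]]:
    """Split a 2D frame (list of pixel rows) into square patches, row-major."""
    height = len(frame)
    width = len(frame[0]) if height else 0
    if height % patch or width % patch:
        raise ValueError("frame dimensions must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Take `patch` rows, and within each row a `patch`-wide slice.
            patches.append([row[left:left + patch]
                            for row in frame[top:top + patch]])
    return patches
```

For example, a 120-pixel-wide, 80-pixel-tall frame would yield six patches (three across, two down), each of which could then be treated as one element of the image stream.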

VentureBeat reports that this approach sets a higher bar for AI by embedding collaboration behaviors directly into the model's core, rather than layering on external tools like speech recognition or turn-detection systems used by competitors. TechCrunch emphasizes the phone-call-like quality, where the AI listens and responds simultaneously, addressing frustrations in current voice assistants from companies like OpenAI and Google, which already offer real-time features but rely on add-ons. Thinking Machines argues that true naturalness requires training these dynamics into the model itself.

The implications extend to everyday work and personal use. The models could change scenarios such as coding sessions that involve on-screen hesitation, background searches, or mid-response corrections. According to the reports, the preview remains closed for now, but it signals a broader industry push toward "agentic" AI that operates independently of human timing. Employees, developers, and general users stand to benefit from more dignified, participatory interactions, though the real test will come in ordinary multimodal sessions that blend uncertain speech with visual cues and tools.

This development arrives amid heightened scrutiny of AI interfaces, with Thinking Machines framing its work as preserving human agency in an era where models increasingly shape communication. While established players continue iterating on voice modes, the lab's bet on 200ms cycles could redefine how AI collaborates, pending validation in the promised full release.
