Why Time-To-First-Token Is The Key To Speed And Safety In Physical AI

Lee-Lean Shu, CEO, GSI Technology.

Reaction time is a critical feature of AI. Say you’re trying to rebook a hotel reservation and you get connected to a customer support chatbot. If that chatbot responds quickly to your requests, the interaction can feel helpful. If the response time is slow, you’re more likely to get frustrated, abandon the attempt and switch to another vendor.

This reaction time, known as time-to-first-token (TTFT), is how quickly an AI system begins generating output after receiving a request. In physical AI applications, like warehouse robots or delivery drones, this reaction speed is critical. That’s because a fast TTFT isn’t a matter of mere convenience; it can actually improve overall safety and increase productivity.
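To make the definition concrete, here is a minimal sketch of how TTFT could be measured for any system that streams its output. The `measure_ttft` helper and the `fake_model` generator are hypothetical names used only for illustration; a real deployment would wrap its own streaming interface instead.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str],
                 clock: Callable[[], float] = time.perf_counter) -> Tuple[float, str]:
    """Return (seconds until the first token arrived, that token)."""
    start = clock()
    for token in stream:
        # Stop timing as soon as the first piece of output shows up.
        return clock() - start, token
    raise ValueError("stream produced no tokens")


def fake_model(delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model: waits, then yields tokens."""
    time.sleep(delay_s)
    yield "Hello"
    yield " world"


ttft_s, first = measure_ttft(fake_model(0.05))
print(f"first token {first!r} after {ttft_s * 1000:.1f} ms")
```

The clock starts when the request is issued, not when the model starts computing, so queueing and preprocessing delays count toward the result.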

In physical AI, the time it takes to process inputs and respond directly affects how effectively the machine can do its job. For example:

• Microsecond-level responses can control basic motor functions, like maintaining direction.

• Sub-15-millisecond responses can let the robot integrate more complex motor functions.

• Sub-50-millisecond responses can let the robot integrate multiple motor response effects and perform emergency obstacle avoidance.

• Sub-three-second responses can support higher-level awareness and decision-making, allowing more natural obstacle avoidance.
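The tiers listed above can be read as a simple lookup from measured latency to the kind of behavior it can support. The sketch below is illustrative only: the function name is hypothetical, and the 1-millisecond cutoff for the microsecond tier is an assumption, since the list gives no explicit upper bound for it.

```python
from typing import List, Tuple

# (upper bound in seconds, capability), following the tiers in the list above.
# ASSUMPTION: the microsecond tier is treated as "under 1 ms".
TIERS: List[Tuple[float, str]] = [
    (0.001, "basic motor control"),
    (0.015, "complex motor functions"),
    (0.050, "combined motor responses and emergency obstacle avoidance"),
    (3.0, "higher-level awareness and decision-making"),
]


def capability_for(latency_s: float) -> str:
    """Return the fastest tier whose latency bound the response still meets."""
    if latency_s < 0:
        raise ValueError("latency must be non-negative")
    for upper_bound_s, capability in TIERS:
        if latency_s < upper_bound_s:
            return capability
    return "too slow for any tier listed"


for sample_s in (0.0005, 0.010, 0.040, 1.0, 5.0):
    print(f"{sample_s * 1000:>8.1f} ms -> {capability_for(sample_s)}")
```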

Consider a simple two-wheeled robot in which motor sensors control its motion. When both wheels are aligned, the robot goes straight. To turn, one wheel moves faster than the other. This type of basic system can react to inputs in microseconds.
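The two-wheeled example corresponds to the standard differential-drive model, sketched below. The function name, parameter names and sample speeds are assumptions made for illustration; the sign convention (positive angular velocity means turning left) is also an assumption.

```python
from typing import Tuple


def body_velocity(v_left: float, v_right: float,
                  track_width: float) -> Tuple[float, float]:
    """Differential-drive kinematics.

    v_left, v_right: wheel ground speeds in m/s.
    track_width: distance between the two wheels in m.
    Returns (forward speed in m/s, turn rate in rad/s).
    """
    if track_width <= 0:
        raise ValueError("track_width must be positive")
    linear = (v_right + v_left) / 2.0
    angular = (v_right - v_left) / track_width
    return linear, angular


# Equal wheel speeds: the robot goes straight (zero turn rate).
print(body_velocity(1.0, 1.0, track_width=0.5))
# Right wheel faster than left: the robot turns.
print(body_velocity(0.5, 1.0, track_width=0.5))
```

Because the mapping is just two arithmetic operations, it can run in a tight control loop without any AI model in the path.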

Integrating AI can add more sophisticated controls, such as torque stabilization or adaptive gait generation, but for the first few tiers of required latency, the models need to be small, tight and maybe even overtrained to meet the specific operational requirements. For instance, at 50 milliseconds of latency, the system can respond at 20 Hz or 20 times per second. This is fast enough for the robot to react to environmental changes in real time without making its motions feel jerky or unnatural.
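The 50-millisecond figure converts to 20 Hz by taking the reciprocal of the latency. The sketch below shows that arithmetic and a simple budget check. It assumes each control cycle waits for the previous one to finish (no pipelining), and the per-stage latencies in the example are made-up numbers, not measurements.

```python
from typing import Sequence


def max_update_rate_hz(latency_ms: float) -> float:
    """Highest update rate a loop can sustain if every cycle takes latency_ms."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def fits_budget(stage_latencies_ms: Sequence[float], budget_ms: float) -> bool:
    """True if the stages, run one after another, finish within the budget."""
    return sum(stage_latencies_ms) <= budget_ms


print(max_update_rate_hz(50))  # 20.0, matching the 20 Hz figure in the text

# Hypothetical stages: sensor read, model inference, actuator command.
print(fits_budget([10, 25, 10], budget_ms=50))
print(fits_budget([10, 30, 15], budget_ms=50))
```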

Research from the Robotics Institute at Carnegie Mellon University notes that human perception of smooth motion requires operation well below the 100-millisecond threshold. Overall, this provides a collision avoidance window of just tens of milliseconds.

The next tier of AI use delivers its value by detecting unforeseen changes. TTFT governs how quickly the robot can detect nearby people or obstructions, then avoid them and continue operating by making immediate adjustments. At this tier, the robot has a deeper understanding of its environment.

The Three-Second Rule

The next level of TTFT is qualitatively different. The system can integrate inputs like video streams, text instructions and audio cues. It can also take into consideration sensor inputs like depth sensing and telemetry.

A key part of this equation is vision-language models (VLMs), which, though created for image processing, are highly applicable to processing multiple input types simultaneously. With a VLM, a robotic system can better detect potentially hazardous conditions in its operating environment and even predict what humans working nearby might do next.

The latency limit here is three seconds. This is not arbitrary. In human driving, the three-second rule is a safety guideline. Remaining three seconds behind the vehicle ahead of you provides enough distance so that you have time to recognize and react to hazards.

The same principle applies to a robot operating on the factory or warehouse floor. If a human worker is on course to come into contact or collide with the robot, a sub-three-second response lets the system slow down or switch to safe mode. With any longer delay, the robot may not be able to make the necessary avoidance adjustments because of inertia.
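The role of inertia can be illustrated with constant-deceleration kinematics: the robot keeps moving at full speed while the system is still deciding, then needs additional distance to brake. The sketch below is a simplified model, not a safety calculation, and the speed, braking rate and gap values are hypothetical.

```python
def stopping_distance_m(speed_mps: float, response_s: float,
                        decel_mps2: float) -> float:
    """Distance covered during the response delay plus the braking distance."""
    if speed_mps < 0 or response_s < 0 or decel_mps2 <= 0:
        raise ValueError("speed and response must be >= 0, deceleration > 0")
    travel_while_deciding = speed_mps * response_s
    braking = speed_mps ** 2 / (2.0 * decel_mps2)
    return travel_while_deciding + braking


def can_stop_in_time(gap_m: float, speed_mps: float, response_s: float,
                     decel_mps2: float) -> bool:
    """True if the robot can come to rest before closing the gap."""
    return stopping_distance_m(speed_mps, response_s, decel_mps2) <= gap_m


# Hypothetical robot: 1.5 m/s travel speed, 2.0 m/s^2 braking, 5 m gap.
for response_s in (0.5, 3.0, 4.0):
    d = stopping_distance_m(1.5, response_s, 2.0)
    ok = can_stop_in_time(5.0, 1.5, response_s, 2.0)
    print(f"response {response_s:.1f} s -> needs {d:.2f} m, stops in time: {ok}")
```

In this model the braking distance is fixed by speed and deceleration, so the response delay is the only term the AI system can shorten.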

Achieving sub-three-second time-to-first-token across multiple sensors and data types is not difficult if you scale up today’s high-performance server-grade chips. The real challenge is doing it within a cost, power and size budget that works for physical AI at the edge. Server-level performance works for machines plugged into wall power. But for battery-operated or remote systems, the breakthrough comes when it can be done using just tens of watts.

What’s needed here, from an architectural perspective, are new compute-in-memory chips that make it possible to run larger, more complex models without a corresponding increase in hardware requirements. This breakthrough has the potential to bring AI-driven awareness to various sectors, from data centers doing physical inference to fixed factory floor robots and autonomous mobile robots.

If the robot is outfitted with compute-in-memory hardware, it can process data at the edge, eliminating the need for a constant network connection. This matters because the lower latency tiers, which are responsible for reflexes, coordination and navigation, depend on predictable response times. Even when faced with limited power and limited bandwidth, the robot can complete its tasks safely using onboard sensors to operate in real time, all within the latency windows described earlier.

Safe AI Starts Here

In physical AI, time-to-first-token isn’t just about making robots move and react as fast as possible. It’s also about making sure they operate safely. Each control level plays a different role, and together they determine if a robot can operate at maximum efficiency while avoiding harm.

Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.