Waymo has published a modestly more detailed description of their AI strategy on their blog. In it, they describe the components of their Waymo Foundation Model and how they are trained.
Self-driving teams, especially Waymo, which was the earliest, began with the more basic forms of AI available in the late twenty-oh-ohs. That included image recognition, prediction and some advanced planning, but most systems followed the path of the state of the art in robotics in that decade. Unless you’ve been under a rock, you’ve seen the revolution in AI that’s come from deep neural networks, reinforcement learning, transformers and large language models, which is shaking the world. No surprise that it’s now at the core of all major self-driving projects. As you might expect, since deep learning was developed by Geoff Hinton, who later joined Google, reinforcement learning was largely pioneered by Google subsidiary DeepMind, and transformers were invented at Google, the Alphabet companies, including Waymo, are heavily involved with them.
Some companies hope to use an “end to end” approach. For that, you build an AI engine, typically based on all these concepts, and attempt to train a network that takes sensory input (in particular, just video pixels) in at one end and outputs car control commands at the other. This approach requires huge networks and lots of compute power, but it allows factors seen at the “end” of the pipeline, near the final outputs, to propagate what they learn backwards toward the start of the pipeline, closer to the pixels. That can make a system very general, but it comes at a cost, and it also makes it harder to understand what’s going on. Tesla has been dedicated to this approach, as have a few others.
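To make that backwards flow concrete, here is a deliberately tiny sketch, not anything like a real driving stack: a single linear model that maps raw “pixels” straight to a steering command, trained by gradient descent. The error measured at the output end adjusts how every input pixel is weighted, which is the core idea of end-to-end training. All the names and the synthetic target are invented for illustration.

```python
import random

random.seed(0)
N_PIXELS = 16
weights = [0.0] * N_PIXELS

def steering(pixels):
    # Forward pass: weighted sum of pixel values -> steering command.
    return sum(w * p for w, p in zip(weights, pixels))

def target(pixels):
    # Synthetic "ground truth": steer toward the brighter side of the frame.
    left, right = sum(pixels[:8]), sum(pixels[8:])
    return 0.1 * (right - left)

LR = 0.05
for step in range(5000):
    pixels = [random.random() for _ in range(N_PIXELS)]
    error = steering(pixels) - target(pixels)   # measured at the output "end"...
    for i in range(N_PIXELS):                   # ...pushed back to every pixel weight
        weights[i] -= LR * error * pixels[i]

# After training, error on a fresh frame should be small.
frame = [random.random() for _ in range(N_PIXELS)]
print(abs(steering(frame) - target(frame)))
```

A real end-to-end system replaces the linear model with a deep network and uses backpropagation to carry the output error through many layers, but the learning signal travels the same direction: from controls back toward pixels.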
Other companies take a hybrid approach, merging standard techniques with AI layers. This lightens the workload of the AI. Today’s LLM-based AI tools process strings of tokens, and the more tokens, the more resources they need. With the text AIs you’ve used, the tokens are human language, but they can also be image elements, or motion tokens describing a scene in terms of identified objects and what they are doing. The secret sauce is figuring out the right way to glue all this together and get the most out of the available resources.
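A rough back-of-envelope sketch shows why token count matters so much. A transformer’s self-attention compares every token against every other token, so cost grows with the square of the sequence length. Describing a scene as a couple dozen object/motion tokens instead of thousands of raw image patches shrinks that bill dramatically. The resolutions, patch size and token counts below are illustrative, not Waymo’s actual numbers.

```python
def attention_cost(n_tokens):
    # Pairwise token comparisons in one self-attention layer: O(n^2).
    return n_tokens * n_tokens

# One camera frame tokenized as 16x16-pixel patches (a common
# ViT-style scheme) at a modest 1024x768 resolution:
patch_tokens = (1024 // 16) * (768 // 16)   # 3072 tokens

# The same scene as object/motion tokens, e.g. "car-ahead braking",
# "pedestrian-left crossing", plus ego state and a few map tokens:
scene_tokens = 24

print(attention_cost(patch_tokens) // attention_cost(scene_tokens))  # → 16384
```

A four-orders-of-magnitude gap per frame, per layer, is the kind of saving that makes it practical to run such models on in-car hardware rather than a data center.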
Waymo has built a large general AI model they call the Waymo Foundation Model, training it on all they can gather about driving and making use of Google’s Gemini LLM, which “knows” almost everything that’s been written. This large model becomes the “teacher,” which then trains component models small enough to run inside a car, rather than on the huge cloud resources that Google has. These “student” components can then handle tasks like the driving problem, of course, but also the job of simulating the driving world to train and test the driver, and the job of being a “critic” that attempts to find mistakes made by the driver (both on real driving examples and in simulation) so they can be corrected or negatively reinforced.
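This teacher/student technique is known as distillation: the small model is trained to match the big model’s output probabilities (soft labels) rather than raw ground truth. Here is a minimal sketch of the idea, with an invented “teacher” that scores three driving actions by the gap to the car ahead, and a tiny linear “student” fit to it by gradient descent. Nothing here reflects Waymo’s actual models.

```python
import math
import random

random.seed(1)
ACTIONS = ["brake", "hold", "accelerate"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def teacher(gap_m):
    # Pretend big model: strongly prefers braking when the gap is small,
    # accelerating when it is large. Purely illustrative scoring.
    return softmax([30.0 / (gap_m + 1.0), 1.0, gap_m / 20.0])

# Student: one tiny linear logit per action, [bias, weight-on-gap].
w = [[0.0, 0.0] for _ in ACTIONS]

LR = 0.1
for step in range(20000):
    gap = random.uniform(0.0, 60.0)
    x = gap / 60.0                       # normalized feature
    soft = teacher(gap)                  # soft labels from the teacher
    p = softmax([b + k * x for b, k in w])
    for a in range(3):
        grad = p[a] - soft[a]            # d(cross-entropy)/d(logit_a)
        w[a][0] -= LR * grad
        w[a][1] -= LR * grad * x

# The distilled student should also favor braking at a short gap.
p_close = softmax([b + k * (2.0 / 60.0) for b, k in w])
print(ACTIONS[p_close.index(max(p_close))])
```

The student never sees the teacher’s internals, only its judgments, which is what makes it possible to compress a cloud-sized model’s behavior into something that fits in a car.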
Waymo’s driving model is now a “Vision Language Model,” which means it processes both image and text prompts to come up with driving approaches. It is linked with their sensor fusion module (which combines data from cameras, LIDARs, radars and more) and their World Decoder, which aims to understand the world and, perhaps more importantly, to predict what will happen and what should happen in it. While much attention is given to perception in general, the reality is that it doesn’t matter where things are now; that’s just an important clue in predicting where they will be in the future, particularly if you might intersect (hit) them. Prediction is what transformer-based LLMs and their cousins are wonderfully good at. What ChatGPT, Gemini and their kin do is predict what the most probable next word should be, considering all the words and other instructions that came before, and then they keep doing that. That very simple approach has been impressing us with amazing results. The same rough approach lets the Waymo car understand what’s going to happen in its world, and where it should most likely go.
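That “predict the next token, then repeat” loop can be shown in miniature. Below, a bigram counter stands in for the giant transformer, and made-up driving tokens stand in for words; given an observation, it repeatedly appends whichever token most often came next in its tiny training corpus. The corpus and token names are invented for illustration.

```python
from collections import Counter, defaultdict

# A toy "training corpus" of driving-event token sequences.
CORPUS = [
    "light-red ego-slows ego-stops light-green ego-goes",
    "light-red ego-slows ego-stops light-green ego-goes",
    "pedestrian-crossing ego-slows pedestrian-clear ego-goes",
]

# Count which token follows which (a bigram "model").
follows = defaultdict(Counter)
for line in CORPUS:
    toks = line.split()
    for a, b in zip(toks, toks[1:]):
        follows[a][b] += 1

def predict(token):
    # Greedy decoding: pick the single most likely next token.
    return follows[token].most_common(1)[0][0]

# Roll the prediction forward from an observation, LLM-style.
seq = ["light-red"]
for _ in range(4):
    seq.append(predict(seq[-1]))
print(" ".join(seq))  # → light-red ego-slows ego-stops light-green ego-goes
```

An LLM does the same thing with billions of parameters and full context instead of one-token lookback, but the loop — predict the likeliest continuation, append it, repeat — is the same.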
It’s also important that Waymo works to improve their safety performance by constantly repeating the process of training its AIs, having its critics, simulations and real-world situations try to break them, and then reinforcing the good behaviors and discouraging the bad ones, a cycle known as a “flywheel.” Waymo, with 100 million real-world driving miles under its belt and immensely more simulated ones, feels it has the best collection of data to make a vehicle safe. Tesla also thinks it has the best collection, including the records of hundreds of thousands of drivers using their FSD system in supervised mode, and data from their employee-supervised robotaxis. Tesla reports supervised FSD has been used for over 7 billion miles, which is a very impressive number, but they get far less data per mile and get logs of only isolated incidents, making it hard for those of us on the outside to judge whose data corpus is best.
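The flywheel is, at heart, a loop: drive (in simulation), let a critic flag the mistakes, fold the flagged cases back into training, and repeat until the critic finds nothing. Here is a toy sketch of that loop, with a one-parameter “driver,” a hard-coded “critic,” and random scenarios standing in for the real machinery; every detail is invented for illustration.

```python
import random

random.seed(2)

# Driver "policy": brake if the obstacle is closer than a learned threshold.
threshold = 5.0     # metres; deliberately starts too timid (misses far hazards)

def driver_brakes(distance):
    return distance < threshold

def critic(distance, braked):
    # Flags a mistake: anything within 20 m should have triggered braking.
    return distance < 20.0 and not braked

for spin in range(10):                  # flywheel iterations
    # "Drive" a batch of simulated scenarios (obstacle distances in metres).
    scenarios = [random.uniform(0.0, 50.0) for _ in range(200)]
    failures = [d for d in scenarios if critic(d, driver_brakes(d))]
    if not failures:                    # critic satisfied: flywheel converged
        break
    # "Retrain": widen the braking threshold past the worst missed case.
    threshold = max(threshold, max(failures) + 0.5)

print(threshold)
```

Each spin of the wheel surfaces the failures the last round of training missed, so the behavior ratchets toward the critic’s standard; the real version retrains neural networks against critic-flagged drives rather than nudging one number.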
Waymo is continuing its expansion, starting service now or shortly in 5 cities, with 4 more announced, in addition to ambitions in London and Tokyo. The “land rush” of the mid-2020s is happening, though Waymo is the only company in the USA really grabbing the land; there’s more competition in China. (Tesla continues to assert it will have unsupervised robotaxi service in 2025 but has yet to be ready.) While Zoox has started very limited operations in two cities, and May Mobility has a few small sites, for now Waymo gets to scale on its own.