Tony Zhang, founder & CEO, Tera AI. Ex-Google X AI Lead. Caltech PhD.
In the ever-evolving saga of AI, 2024 will mark another watershed moment akin to the debut of ChatGPT. Yet, this new chapter isn’t penned in words; it’s envisioned through the lens of visual reasoning. This shift from linguistic prowess to visual acuity in AI heralds a transformative era, one where the fusion of sight and cognition in machines promises to redefine our understanding of intelligence itself.
Language To Vision: From Bits To Gigabits
In 2023, AI’s focus on language reasoning, epitomized by the successes of large language models, drove imaginations to new heights. However, a sole reliance on linguistic data inherently limits AI’s grasp of the human experience and, more importantly, of the basis of human knowledge.
Language, while rich in expression, is only a fraction of our cognitive spectrum. The real world isn’t just narrated; it’s seen, felt and interacted with. To reach the full potential of AI’s capabilities, we must venture beyond the confines of text and speech. Just as critically, we must tag each of these inputs with its position in space and time.
Envisioning Visual World Models
Enter the realm of visual world models, the next leap in foundation models. These systems don’t just respond by generating individual images or text. They model and extract complex patterns contained in visual data across space and time in the trillions, mirroring the human capability to derive meaning from all sensory inputs.
This next generation of AI will interpret and autonomously derive insights from billions of images on social media or find patterns in satellite data that are unknown to humans today. Additionally, these models have the potential to revolutionize how machines learn about and interact with our world and to help humans discover new fundamental laws of the natural world.
New Knowledge Discovery
The integration of visual data isn’t just an expansion of AI’s skill set; it’s a gateway to uncharted territories of knowledge. Vision is the foundation upon which much of new human knowledge is created. A researcher today may take many months or even years to generate new insights from visual data; a machine can do similar work much more quickly and interrogate the data more effectively and efficiently.
The Treasure Trove Of Hidden Data
This leap forward hinges on the ability to tap into the vast reservoirs of “hidden data”: visual information that, until now, has remained largely unexplored by AI. This includes visual data collected from the physical world. Drawn from sources as wide-ranging as YouTube, government agencies and insurance carriers, this data holds keys to pretraining the world’s most powerful models. Through new ways of training and inference, AI can sift through this data, crystallizing complex information into actionable insights.
Increasing Human Potential
The implications of visually empowered AI go beyond machines that see; they extend to augmenting human vision and cognition. By offloading the task of data analysis to AI, humans can focus on the creative, strategic and ethical aspects of problem-solving. This symbiosis between human and machine intelligence paves the way for unprecedented levels of innovation and exploration.
Implementation Challenges
To date, most LLMs have required substantial fine-tuning on task-relevant enterprise data just to achieve the proficiency level of an entry-level associate. While this is unlikely to hold true with newer models, world models will likely require continued alignment if they are to generalize to unseen situations, prevent misuse and maximize their productive value to organizations.
This will mean thorough backtesting and likely constraining use cases to tasks that are well-defined and tested. As with most early technology, it is unlikely most organizations will be able to manage their own models productively, as we’ve seen with some of the earliest computer vision models for detection. It’s also important to note that models, like humans, can’t know what they don’t see.
Ethics And Limitations
As we stand on the brink of this new era, it’s vital to tread thoughtfully and iterate based on impact. The ethical and societal implications of AI that can see and interpret our world are immense. This journey demands a collaborative approach focused on experimentation as opposed to getting it all right upfront. This will involve not just AI researchers, but also ethicists, policymakers and other stakeholders.
Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.