It may be humanity’s largest art project ever: teaching machines to understand the art of being human.
“You can’t actually teach a machine to understand humans unless you also teach them to understand human emotion,” said Hassaan Raza, co-founder and CEO of San Francisco-based Tavus, an AI research lab and developer platform. Raza shared his insights in an interview.
Much of the human emotion Raza speaks of is centered in the face, where dozens of muscles interact in myriad ways to create complex expressions. Together with vocal intonation and gesture, those expressions convey meaning that goes far beyond words.
“A lot of our research focuses on AI’s ability to see and recognize gestures and facial expressions, with the AI agent determining, ‘Hey, does this person look happy, sad, or maybe tired?’” Raza said. “Systems can measure emotion, tonality and reciprocity, and use those as signals to respond appropriately.”
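As an illustration only, the kind of signal smoothing Raza describes might look something like the sketch below. The classify_frame() call is a hypothetical stand-in for whatever vision model a real system would use; nothing here reflects Tavus’s actual pipeline.

```python
from collections import deque, Counter

WINDOW = 30  # roughly one second of video at 30 fps

def classify_frame(frame) -> str:
    """Placeholder: return one of 'happy', 'sad', 'tired' or 'neutral'."""
    raise NotImplementedError("swap in a real facial-expression model")

class MoodTracker:
    """Smooths noisy per-frame labels into a coarse, sustained mood signal."""

    def __init__(self, window: int = WINDOW):
        self.recent = deque(maxlen=window)

    def update(self, frame) -> str:
        self.recent.append(classify_frame(frame))
        # Only report a mood once it dominates most of the window, so the
        # agent reacts to a sustained expression rather than a single frame.
        label, count = Counter(self.recent).most_common(1)[0]
        return label if count >= len(self.recent) * 0.6 else "neutral"
```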
The technology, however, remains relatively coarse; granular cues such as blink rate, pupil dilation and subtle micro-expressions can be missed. And such cues can vary widely among cultures. While silence may signal discomfort in Western cultures, it can convey receptiveness and agreement in many Asian cultures.
What Is Conversational Video AI?
Conversational video AI enables real-time, face-to-face video conversations with AI agents that can interact just like a human. Not all systems present as human—robotic, animal-like and object avatars exist. But the greatest demand is for human-centric avatars that can build authentic rapport.
The global conversational AI market continues to grow. It was valued at $14.79 billion in 2025 and is projected to reach $17.97 billion in 2026 and $82.46 billion in 2034, according to Fortune Business Insights, a global market research and consulting firm.
Why AI Still Struggles With Natural Human Conversation
The gap between surface-level mimicry and genuine understanding remains wide.
The most evident marker of this gap is what Nate MacLeitch, founder and CEO of Quickblox, calls “hesitation markers and filled pauses,” the subtle vocal cues humans use to indicate that they’re still thinking. London-based Quickblox is a cloud-based communications platform offering AI features. These cues allow for near-instant handoffs from speaker to speaker—clocked at 200 to 500 milliseconds, said MacLeitch in an email response to questions. “If they haven’t quite finished, (humans) tend to fill gaps with ‘thinking’ noises, whereas most AI systems freeze while processing.”
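A rough sketch of that idea, assuming a hypothetical generate_reply() model call and a play_audio() stand-in for text-to-speech, would inject a short filler whenever the reply misses the handoff window instead of going silent:

```python
import asyncio

HANDOFF_BUDGET_S = 0.3   # stay near the ~200-500 ms handoff window humans expect
FILLERS = ["hmm", "let me think", "right..."]

async def generate_reply(user_utterance: str) -> str:
    await asyncio.sleep(1.2)          # simulate slow model inference
    return f"Here's my answer to: {user_utterance}"

async def play_audio(text: str) -> None:
    print(f"[agent says] {text}")     # stand-in for TTS playback

async def respond(user_utterance: str) -> None:
    task = asyncio.create_task(generate_reply(user_utterance))
    fillers = iter(FILLERS)
    while True:
        try:
            reply = await asyncio.wait_for(asyncio.shield(task), HANDOFF_BUDGET_S)
            break
        except asyncio.TimeoutError:
            # Reply not ready yet: fill the gap rather than freezing.
            await play_audio(next(fillers, "one moment"))
    await play_audio(reply)

asyncio.run(respond("What's the weather like tomorrow?"))
```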
The challenge for conversational AI isn’t just replicating the surface look of emotional expressions, but developing what researchers call “theory of mind reasoning,” the ability to model what another human might be thinking or intending—or even about to think.
“Being ‘human-like’ isn’t about perfectly synchronized micro expressions, it’s about shared intentionality: the ability to have common goals, context, and moral grounding in interaction,” said RaviKumar Bhuvanagiri in an email response to questions. Bhuvanagiri is an independent researcher at the McCombs School of Business, University of Texas at Austin. “Current systems map correlations, not relationships,” he said.
Top leaders in interactive conversational video AI include Tavus and D-ID. HeyGen and Synthesia focus on studio/scripted video for training and marketing, although both are exploring interactive capabilities. The infrastructure and platform providers include Microsoft, Google, Meta, Nvidia and OpenAI.
Tavus recently launched PALs (Personal Agent Layers) that appear human and can see, listen and retain conversations. Users can video chat with these “adaptive companions” who also interact via text and email.
Rather than creating one general-purpose assistant, Tavus designed five distinct personalities tailored to different user needs and interaction styles.
Agent Dominic is billed as an “old-school English butler,” a domestic organizer adept at sorting out the chaos in users’ lives. Chloe combines emotional support and productivity help. Noah tells hard truths for those seeking honest advice and authentic friendship—“like the older brother you never had,” according to Tavus. Gossipy Ashley (“terminally online”) is a media junkie who assists with creative projects and tracks pop-culture trends. Charlie is a techie who’s built to geek out with users on a wide variety of tasks.
The Tavus AI agents (all of whom look to be in their early 30s except for Dominic, who’s 50ish) evolve and change over time, learning users’ communication styles and preferences. The agents can see users and observe and comment on facial, body and environmental cues: “You seem a bit sad today, what’s up?” or “I like that Art Deco lamp near your window; where did you find it?”
Tavus has also developed an AI Santa Claus with the same capabilities as its PALs, which has proven popular with both adults and children during the holidays.
Along with Santa, Tavus recently rolled out Sparrow-1, a conversational-flow control model designed to bring human-level timing to real-time voice and video AI.
Still, today’s AI agents can fail to recognize things far more subtle than Art Deco lamps: humor, sarcasm and other tonal shifts, “leading to robotic responses that miss the point,” said MacLeitch. Even when an AI agent detects an emotion like anger, “it often lacks the depth to understand the underlying cause or provide a nuanced response,” he said. “Distinguishing these based on trigger words and non-verbal cues and building thoughtful escalation paths will be an ongoing challenge as 2026 unfolds.”
The art of being human includes admitting mistakes, a feature developers are increasingly weaving into AI systems.
“Epistemic humility models explicitly track uncertainty, letting AI admit when it doesn’t know rather than bluffing coherence,” said Jigyasa Grover, in an email response to questions. Grover is a machine learning engineer and author of the book Sculpting Data for ML: The First Act of Machine Learning.
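As a rough illustration of that pattern, and not any vendor’s actual implementation, an agent might pair each candidate answer with a calibrated confidence score and decline to guess below a threshold. The answer_with_confidence() call here is hypothetical:

```python
from typing import Tuple

CONFIDENCE_FLOOR = 0.75

def answer_with_confidence(question: str) -> Tuple[str, float]:
    """Placeholder: return (answer, confidence in [0, 1]) from the model."""
    raise NotImplementedError("swap in a calibrated model of your choice")

def respond(question: str) -> str:
    answer, confidence = answer_with_confidence(question)
    if confidence < CONFIDENCE_FLOOR:
        # Admit uncertainty instead of bluffing a coherent-sounding answer.
        return "I'm not sure about that; let me check rather than guess."
    return answer
```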
Grover sees a challenge and opportunity to move from “hyper-realistic mimicry to AI that can maintain joint attention, reason through ambiguity, and actively participate in conversation like a human, not just look like one. That’s where the next leap in human-like interaction lies.”