Innovation

Multimodal Fusion Used In Self-Driving Cars Is Uplifting AI That Provides Mental Health Guidance

By Press Room · 1 April 2026 · 13 Mins Read

In today’s column, I examine the use of multimodal fusion in the rapidly evolving realm of AI that provides mental health support.

Readers might recall that I’ve previously discussed the emerging use of multimodal media capabilities in generative AI and large language models (LLMs), see my coverage at the link here and the link here. The idea is that rather than primarily focusing on text as a mode of communication with AI, we can add the use of audio, images, video, and other modes of media.

Many existing AI platforms do not genuinely integrate multiple modes. You are either doing something with text interaction, or with audio, or with images, or with video. It is rarer to have those modes fully intertwined.

This brings up the need to have AI undertake multimodal fusion. Fusion brings together numerous disparate modes so that the AI can seamlessly utilize any of them. The key is that each mode bears upon the others. What is taking place via text gets integrated with the audio, and with the video, and so on. They get tied together in a nifty shiny bow. Many of these fusion techniques are borrowed from the self-driving car realm, whereby the AI of an autonomous vehicle needs to fuse together disparate modes of data, such as from cameras, radar, LIDAR, sonar, and the like (see my in-depth coverage of multi-sensor data fusion, MSDF, at the link here).
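To make the concept tangible, here is a minimal sketch in Python of one common approach, late fusion, in which each mode is scored independently and the per-mode results are combined afterward. The class, the valence scale, and the confidence weighting are illustrative assumptions on my part, not any vendor's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class ModeReading:
    """One mode's estimate of the user's emotional state."""
    mode: str          # e.g., "text", "audio", "video"
    valence: float     # -1.0 (very negative) to +1.0 (very positive)
    confidence: float  # 0.0 to 1.0: how reliable this mode is right now

def late_fusion(readings: list[ModeReading]) -> float:
    """Confidence-weighted average of per-mode valence estimates."""
    total = sum(r.confidence for r in readings)
    if total == 0:
        return 0.0
    return sum(r.valence * r.confidence for r in readings) / total

# Text reads upbeat, but audio and video lean negative.
readings = [
    ModeReading("text", valence=0.7, confidence=0.9),
    ModeReading("audio", valence=-0.4, confidence=0.6),
    ModeReading("video", valence=-0.5, confidence=0.8),
]
print(f"fused valence: {late_fusion(readings):+.2f}")
```

Notice that the upbeat text is nearly canceled out by the negative audio and video readings, which is exactly the kind of cross-modal tension I'll get to shortly.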

One new focus that is already substantially benefiting from multimodal fusion and MSDF is the mental health realm. Advances are already underway, and we will soon be witnessing very impressive results. For earlier efforts on such fusion, see my coverage at the link here.

Let’s talk about it.

This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).

AI And Mental Health

As a quick background, I’ve been extensively covering and analyzing a myriad of facets regarding the advent of modern-era AI that produces mental health advice and performs AI-driven therapy. This rising use of AI has principally been spurred by the evolving advances and widespread adoption of generative AI. For a quick summary of some of my posted columns on this evolving topic, see the link here, which briefly recaps about forty of the over one hundred column postings that I’ve made on the subject.

There is little doubt that this is a rapidly developing field and that there are tremendous upsides to be had, but at the same time, regrettably, hidden risks and outright gotchas come into these endeavors, too. I frequently speak up about these pressing matters, including in an appearance last year on an episode of CBS’s 60 Minutes, see the link here.

Background On AI For Mental Health

I’d like to set the stage on how generative AI and large language models (LLMs) are typically used in an ad hoc way for mental health guidance.

Millions upon millions of people are using generative AI as their ongoing advisor on mental health considerations (note that ChatGPT alone has over 800 million weekly active users, a notable proportion of which dip into mental health aspects, see my analysis at the link here). The top-ranked use of contemporary generative AI and LLMs is to consult with the AI on mental health facets; see my coverage at the link here.

This popular usage makes abundant sense. You can access most of the major generative AI systems for nearly free or at a super low cost, doing so anywhere and at any time. Thus, if you have any mental health qualms that you want to chat about, all you need to do is log in to AI and proceed forthwith on a 24/7 basis.

Today’s generic LLMs, such as ChatGPT, Claude, Gemini, Grok, and others, are not at all akin to the capabilities of human therapists. Meanwhile, specialized LLMs are being built to presumably attain similar qualities, but they are still primarily in the development and testing stages. See my coverage at the link here.

The Usual Singular Mode Of Text

By and large, when you use AI to get mental health guidance, the odds are that you will do so via text mode only. That’s just the way things are technologically right now.

You enter a prompt telling the AI that you are dealing with a bout of depression. The AI asks you when this first began. You respond by writing a brief description of the depression that started about two weeks ago. Back and forth this goes. It is entirely text-based.

Human therapists would be less likely to carry on a therapy session solely via text.

It could happen if, for some reason, the therapist and the client could not activate a Zoom-like session or meet in person. Likewise, it could happen if the two weren’t able to speak on the phone. Thus, I’m not saying that texting as a form of therapy interaction isn’t undertaken, but just that it is a shallow second-best option and not one that is readily desired. In a pinch, sure, it can be done that way.

The conventional approach is that a human therapist and a client are able to communicate face-to-face. Why does that matter? The therapist can discern a slew of important clues by seeing how a client physically responds and acts. Facial expressions tell quite a story. Even the tone of voice and the sound of utterances are substantial clues. Text alone doesn’t contain the same dynamism as these additional divulging indicators.

Furthermore, not everyone is adept at writing. Trying to write about your bout of depression might be especially difficult, and your written words can fail to convey what is going on. Text is also relatively slow, limited by how fast someone can type. Voice is usually a faster mode of communication, and people generally find it easier to articulate their thoughts by speaking rather than by typing.

Multimodal Fusion At The Input

Envision that we have a person chatting via text with AI about a mental health topic. The person opts to turn on the camera on their laptop or smartphone. The video being streamed is probably not going to make a difference to the AI. That’s because texting is the mainstay of what the AI has been built to utilize.

Multimodal fusion integrates multiple modes.

Once the video feed is active, an AI primed with multimodal fusion not only parses the text of the user, but the AI also scans the live video of the person. Are they smiling or sad? Do they look okay or perhaps highly distressed? There is a match made between the text the user is entering and the appearance of their face and physical mannerisms.

Suppose the person says in their texting that they are buoyant and believe they have fully overcome the depression they once had. Meanwhile, the AI scans the face of the person in real-time and computationally notices that their expression is the complete opposite of being pleased. The face represents a different sentiment and seems to be at odds with the text.

If the AI were relying only on text, the AI would likely assume that the person is straightforward about believing they have overcome their depression. The topic at hand would move on and no longer involve the exploration of depression aspects. Instead, since the texting and the video do not appear to coincide suitably, the AI can gingerly explore what is really taking place with the person.
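A rough sketch of that cross-modal consistency check might look like the following, where the valence scores and the disagreement threshold are purely illustrative assumptions:

```python
def modes_disagree(text_valence: float, video_valence: float,
                   threshold: float = 0.8) -> bool:
    """Flag when stated sentiment (text) and observed sentiment (video)
    diverge by more than the threshold."""
    return abs(text_valence - video_valence) > threshold

# The user's text reads upbeat, yet their expression scores negative.
if modes_disagree(text_valence=0.7, video_valence=-0.5):
    # Don't take the text at face value; gently probe further.
    reply = "You said you're feeling better -- how have the past few days actually felt?"
```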

More Multimodal Fusion At Input

In the scenario that I've outlined so far, the user is texting and providing a streaming video of their presence. That's two modes taking place at the same time. We can easily up the ante.

The person activates the microphone on their smartphone or laptop. This adds an audio element to the chatting. In addition, they have some pictures they took a few days ago when they were totally in a funk and felt extremely depressed. The person sends the photos to the AI or gives the AI access to the photos that are on their device.

You can plainly see that the modes are starting to pile up. We've got the text that is taking place. The person is now speaking verbally to the AI, so the AI has to digitally analyze the audio and use it as part of the total picture of what is taking place. The pictures are being digitally analyzed to discern how they relate to the discussion underway. Plus, the live video is also getting analyzed.

It is quite a smorgasbord.

Think too about how complex this is becoming. A human is versed in mixing multimodal communication and can do so with relative ease. We do so constantly. Trying to devise AI to do this is a bit of a challenge.

As noted, the particularly tough problem is integrating or fusing together the multiple modes being captured in real-time. Remember that this is occurring live; thus, the AI must process the torrent of modes simultaneously while remaining quick to interact with the user. People aren't going to wait a few minutes for the AI to figure out what is occurring. The expectation is that interacting with the AI should be as speedy as interacting with a fellow human.
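One way to honor that latency constraint, sketched here with Python's asyncio, is to run the per-mode analyzers concurrently rather than one after another. The analyzer bodies are hypothetical stand-ins for real speech, vision, and language models:

```python
import asyncio

async def analyze_text(msg: str) -> dict:
    return {"mode": "text", "valence": 0.7}    # stand-in for an NLP model

async def analyze_audio(samples: bytes) -> dict:
    return {"mode": "audio", "valence": -0.4}  # stand-in for prosody analysis

async def analyze_video(frame: bytes) -> dict:
    return {"mode": "video", "valence": -0.5}  # stand-in for expression analysis

async def fuse_turn(msg: str, samples: bytes, frame: bytes) -> list[dict]:
    """Analyze all modes concurrently so latency is bounded by the
    slowest analyzer, not the sum of all of them."""
    return await asyncio.gather(
        analyze_text(msg),
        analyze_audio(samples),
        analyze_video(frame),
    )

# Placeholder byte buffers stand in for live audio and video capture.
readings = asyncio.run(fuse_turn("I'm doing great!", b"", b""))
print(readings)
```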

Multimodal As Output

The scenario that we’ve been exploring has concentrated on the use of multimodal inputs. A person using AI has been providing text, audio, images or photos, and live video. The AI has been responding via text only.

People would likely find that rather tiresome. If they are speaking to the AI, the AI ought to speak back to them. There are, therefore, multimodal outputs that the AI needs to produce. The text and the audio expression should presumably match each other.

The simplest output would be for the AI to take the text and merely have the text read aloud to the user. This is a one-for-one correspondence. It’s easy. It’s cheap. Instead, we might have the AI say one thing in text and express the text in a different manner via audio output. Humans do this. I might text you a lengthy written message, and then when speaking with you, I will provide a quick summary or cover highlights.

The AI could do likewise.
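As a sketch of that summarize-for-voice idea, the AI could produce one full written reply and derive the spoken version from it, keeping the two aligned in substance. The spoken_gist helper below is a hypothetical placeholder for whatever summarization model a system actually uses:

```python
def spoken_gist(full_text: str, max_sentences: int = 2) -> str:
    """Hypothetical stand-in: keep the first sentences as the spoken
    summary. A real system would use an abstractive summarizer."""
    sentences = [s.strip() for s in full_text.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."

full_reply = (
    "Depression often comes back in waves, so one hard day does not erase "
    "your progress. It can help to keep a short daily log of mood, sleep, "
    "and activity. We can review that log together next time."
)

print("TEXT :", full_reply)
print("VOICE:", spoken_gist(full_reply))  # shorter, yet consistent in substance
```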

From a perspective of the AI generating images and video, this can consist of several possibilities in a mental health context. The AI could showcase a diagram that the AI crafted to explain the various ways that depression often arises. A video showing an animated character could be used to elaborate on how depression can manifest in how a person walks and carries themselves.

All of those modes require multimodal fusion. There must be sensible alignment among the various modes of output. If the outputs are disjointed or misaligned, the user will get a confusing array of responses from the AI, which would be quite disconcerting.

There’s an intriguing new twist to AI-based “outputs” that we are gradually witnessing, consisting of the emergence of humanoid robots that look akin to the human form. I’ve noted that a humanoid robot doing chores in your home could also become a kind of walking-talking mental health therapist in your domicile. For more details, see my discussion at the link here and the link here.

User Communicating Distress

Now that I’ve laid out the fundamentals, let’s dig into some noteworthy specifics in the mental health sphere.

For example, what should the AI be looking for if the text portion suggests the person is possibly experiencing high stress?

The human expressions of anxiety, sadness, withdrawal, agitation, shame, or confusion often show up in:

  • Voice tremors, pacing, sighing.
  • Facial micro-expressions.
  • Gaze direction or avoidance.
  • Posture and motor slowing.
  • Long pauses or rapid speech bursts.
  • Environmental context (e.g., messy room, staying in bed, darkness).

The AI can potentially detect those conditions via the text, audio, and video being expressed or provided by the user.

Does the sentiment in the text match the visual and audio cues?

If they aren’t matching, does this imply that the text isn’t telling the full story? Perhaps the dialogue should be shifted accordingly, since the rest of the multimodal interaction may be surfacing potential risk markers.

The AI needs to be cautious not to overstate the audio and visual cues. Computationally leaping to false conclusions could be messy and undermine the dialogue with the user. Perhaps there are alternative explanations for what is being heard and seen, not necessarily tied to the discussion taking place.

Therapeutic alignment is crucial. The AI needs to figure out the alignment of the multimodal inputs. Likewise, the AI needs to present aligned multimodal outputs.
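One hedged way to encode that caution is to require corroboration across several cues before shifting the dialogue, as in this illustrative sketch (the cue names and the threshold are assumptions on my part, not clinical guidance):

```python
# Illustrative cue flags; a real system would derive these from speech,
# vision, and language models rather than hand-set booleans.
cues = {
    "negative_text_sentiment": False,
    "voice_tremor": True,
    "gaze_avoidance": True,
    "motor_slowing": False,
    "long_pauses": True,
}

def corroborated_concern(cues: dict[str, bool], min_cues: int = 3) -> bool:
    """Require several independent cues to agree before shifting the
    dialogue, to avoid leaping to conclusions from one noisy signal."""
    return sum(cues.values()) >= min_cues

if corroborated_concern(cues):
    # Probe gently; never assert a diagnosis from inferred cues alone.
    prompt = "I noticed some long pauses just now. How are you feeling at the moment?"
```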

Longitudinal Assessments

Human therapists tend to get to know the body language and expressions of their clients.

This is usually determined over a series of therapeutic sessions. Each session provides an added indication of how the person carries themselves, how they react, and so on. An astute therapist comes to know and understand how a person is as an individual. It takes time to build up that depth of understanding.

An interesting angle is having AI do the same type of longitudinal analyses about people who are undertaking some form of mental health conversations.

Assume that a person has been using AI for mental health assistance for about two months. During that time, they have routinely used text, audio, and video when conversing with the AI. The AI is either recording the inputs or pattern matching on them in real-time and collecting the pattern matches into a repository or internal database.

By computationally analyzing the multimodal data and the fusions, the AI might pick up on longitudinal changes such as the following (a brief code sketch follows the list):

  • The person has dark circles under their eyes and has increasingly avoided eye contact with the camera, which might be a sign of depression, possibly exacerbated by disruptive sleep.
  • They often fidget with their hands when hiding their concerns; it’s a telling clue that can be used to direct the focus of the chat.
  • When under stress, their speaking becomes very fast-paced, and the words tend to get scrambled (be on the watch for this as an indicator).
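Here is the promised sketch of such a longitudinal check, comparing recent sessions against an earlier baseline; the fused valence scores and the thresholds are hypothetical, and a real repository would store richer per-mode features:

```python
from statistics import mean

# Hypothetical fused valence per session over about two months, oldest first.
session_valence = [0.2, 0.1, 0.0, -0.1, -0.2, -0.3, -0.35, -0.4]

def sustained_decline(scores: list[float], window: int = 3,
                      drop: float = 0.3) -> bool:
    """Compare the recent window's average to the earliest baseline;
    flag a sustained downward drift, not a single bad session."""
    if len(scores) < 2 * window:
        return False
    return mean(scores[:window]) - mean(scores[-window:]) >= drop

if sustained_decline(session_valence):
    # Worth raising gently in conversation, and possibly suggesting
    # professional human support.
    pass
```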

Bottom Line Is Good

Let’s do a quick recap and finish up this discussion for the time being.

Moving beyond the standard mode of text-based chatting about mental health will be a boon for improving how AI can assist in diagnosing and aiding people therapeutically. I will be covering the latest advances as they progress. Stay tuned.

Multimodal fusion is vital and allows mental-health AI to:

  • Computationally gauge emotional nuance better.
  • Improve safety assessments.
  • Offer more attuned and empathetic-appearing responses.
  • Leverage visual and auditory cues for skill-building.
  • Provide contextual support.
  • Align better with human therapeutic communication norms.

This isn’t going to be easy, and success isn’t guaranteed. The chances of the AI doing a lousy job on the alignment and fusion of multimodal captures are currently a notable risk factor. Errors can occur. AI hallucinations can arise. For the time being, single-mode text efforts are a lot less complicated.

I am reminded of the famous line by Henry Wadsworth Longfellow: “Each morning sees some task begun, each evening sees it close; Something attempted, something done, has earned a night’s repose.” The path to multimodal fusion in mental health AI is underway, and each day that passes gets us closer to getting the whole kit and caboodle properly figured out.
