Multimodality is set to redefine how enterprises leverage AI in 2025. Imagine an AI that understands not just text but also images, audio, and other sensor data. Humans are naturally multimodal, but we are limited in how much input we can process. Take healthcare as an example. During my time at Google Health, I heard many stories of patients overwhelming doctors with data:

Imagine a patient with atrial fibrillation (AFIB) showing up with five years of detailed sleep data collected from their smartwatch. Or take the cancer patient arriving with a 20-pound stack of medical records documenting every treatment they’ve had. Both of these situations are very real. For doctors, the challenge is the same: separating the signal from the noise.

What’s needed is an AI that can summarize and highlight the key points. Large language models, like ChatGPT, already do this with text, pulling out the most relevant information. But what if we could teach AI to do the same with other types of data — like images, time series, or lab results?

How Does Multimodal AI Work?

To understand how multimodality works, let’s start with the fact that AI needs data both to be trained and to make predictions. Multimodal AI is designed to handle diverse data sources — text, images, audio, video, and even time-series data — at the same time. By combining these inputs, multimodal AI offers a richer, more comprehensive understanding of the problems it tackles.

Multimodal AI is, at its core, a discovery tool. It stores the different data modalities in a shared representation, and when a new data point is input, it finds items that are close to it. For example, by inputting the sleep data from someone’s smartwatch alongside information about their AFIB episodes, the doctor might find indications of sleep apnea.

Note that this is based on “closeness,” not correlation. It is the scaled-up version of what Amazon once popularized: “people who shopped for this item also bought this item.” In this case, it’s more like: “People with this type of sleep pattern have also been diagnosed with AFIB.”
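To make “closeness” concrete, here is a minimal sketch in Python. The patient vectors, diagnoses, and three-dimensional space are all invented for illustration; a real system would use learned embeddings with hundreds of dimensions.

```python
import numpy as np

# Toy example: each patient is represented by a vector in latent space.
# In a real system these vectors come from a learned encoder, not hand-typed numbers.
patient_vectors = {
    "patient_a": np.array([0.90, 0.10, 0.80]),  # fragmented sleep, frequent night waking
    "patient_b": np.array([0.20, 0.90, 0.10]),  # regular sleep
    "patient_c": np.array([0.85, 0.15, 0.70]),  # fragmented sleep, similar to patient_a
}
diagnoses = {"patient_a": "AFIB", "patient_b": "none", "patient_c": "sleep apnea + AFIB"}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A new patient's smartwatch sleep pattern, encoded into the same space.
new_patient = np.array([0.88, 0.12, 0.75])

# "Closeness" = rank existing patients by similarity, not by statistical correlation.
ranked = sorted(
    patient_vectors,
    key=lambda name: cosine_similarity(new_patient, patient_vectors[name]),
    reverse=True,
)
for name in ranked:
    score = cosine_similarity(new_patient, patient_vectors[name])
    print(name, diagnoses[name], round(score, 3))
```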

Multimodal AI Explained: Encoders, Fusion, and Decoders

A multimodal AI system consists of three main components: Encoders, Fusion and Decoders.

Encoding Any Modality

Encoders convert raw data (text, images, sound, log files, and so on) into a representation the AI can work with: vectors, which are stored in a latent space. To simplify, think of this process as storing an item in a warehouse (the latent space), where each item has a specific location (its vector). Encoders can process virtually anything: images, text, sound, videos, log files, IoT (sensor) data, time series — you name it.
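As a rough sketch of what an encoder does, the snippet below pushes a text description and a product photo into the same latent space. It assumes the open-source sentence-transformers library and its publicly available CLIP checkpoint; the model name is real, but the file path is illustrative.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer

# CLIP maps both images and text into one shared latent space (the "warehouse").
encoder = SentenceTransformer("clip-ViT-B-32")

# Encode a text description: one vector (a location in the warehouse) per input.
text_vectors = encoder.encode(["emerald pendant on a gold chain"])

# Encode a product photo into the same space (the path is illustrative).
image_vectors = encoder.encode([Image.open("products/pendant_001.jpg")])

# Same dimensionality, so the two can be compared directly with cosine similarity.
print(text_vectors.shape, image_vectors.shape)  # e.g., (1, 512) (1, 512)
```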

Fusion Mechanism: Combining Modalities

When working with one type of data, like images, encoding is enough. But with multiple types — images, sounds, text, or time-series data — we need to fuse the information to find what’s most relevant.
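Here is a minimal sketch of two common fusion strategies, assuming each modality has already been encoded into a vector of the same length. The vectors and weights are invented; production systems usually learn the fusion, for example with attention layers.

```python
import numpy as np

# Assume each modality has already been encoded into a 4-dimensional vector.
text_vec   = np.array([0.2, 0.7, 0.1, 0.4])  # e.g., "green pendant"
image_vec  = np.array([0.3, 0.6, 0.2, 0.5])  # e.g., product photo
series_vec = np.array([0.1, 0.1, 0.9, 0.2])  # e.g., click-stream / time-series signal

# Early fusion: concatenate the vectors and let a downstream model learn the interactions.
early_fused = np.concatenate([text_vec, image_vec, series_vec])

# Late fusion: weight each modality by how much we trust it, then blend.
weights = {"text": 0.5, "image": 0.3, "series": 0.2}
late_fused = (
    weights["text"] * text_vec
    + weights["image"] * image_vec
    + weights["series"] * series_vec
)

print(early_fused.shape)  # (12,) - all modalities kept side by side
print(late_fused)         # one blended vector in the shared space
```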

Decoders: Generating Outputs We Understand

Decoders “decode” the information from the latent space — aka the warehouse — and deliver it to us. They turn raw, abstract vectors into something we understand, such as an image of a “house.”
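In the retrieval setting described here, “decoding” can be as simple as mapping the closest vector in the warehouse back to the human-readable item it came from. A toy sketch, with all vectors and filenames invented:

```python
import numpy as np

# The "warehouse": latent vectors paired with the human-readable items they encode.
warehouse = {
    "house_photo.jpg":  np.array([0.90, 0.20, 0.10]),
    "boat_photo.jpg":   np.array([0.10, 0.80, 0.30]),
    "forest_photo.jpg": np.array([0.20, 0.10, 0.90]),
}

def decode(query_vector, store):
    """Return the stored item whose vector sits closest to the query."""
    def score(item):
        vec = store[item]
        return np.dot(query_vector, vec) / (np.linalg.norm(query_vector) * np.linalg.norm(vec))
    return max(store, key=score)

# Pretend this is the encoder's output for the text query "house".
house_query = np.array([0.85, 0.25, 0.15])

print(decode(house_query, warehouse))  # -> "house_photo.jpg"
```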

If you want to learn more about encoding, decoding, and reranking, join my eCornell Online Certificate course on “Designing and Building AI Solutions.” It’s a no-coding program that explores all aspects of AI solutions.

Transforming eCommerce with Multimodality

Let’s look at another example: eCommerce. Amazon’s interface hasn’t changed much in 25 years — you type a keyword, scroll through results, and hope to find what you need. Multimodality can transform this experience by letting you describe a product, upload a photo, or provide context to find your perfect match.

Fixing Search with Multimodal AI

At r2decide, a company a few Cornellians and I started, we’re using multimodality to merge Search, Browse, and Chat into one seamless flow. Our customers are eCommerce companies tired of losing revenue because their users can’t find what they need. At the core of our solution is multimodal AI.

For example, in an online jewelry store, a user searching for “green” would — in the past — only see green jewelry if the word “green” appeared in the product text. Since r2decide’s AI also encodes images into a shared latent space (e.g., warehouse), it finds “green” across all modalities. The items are then re-ranked based on the user’s past searches and clicks to ensure they receive the most relevant “green” options.
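Here is a hedged sketch of that re-ranking step, assuming candidate products have already been retrieved by latent-space similarity. The scoring formula, click weights, and product data are illustrative, not r2decide’s actual implementation.

```python
# Candidates retrieved from the latent space for the query "green",
# each with its raw similarity score and product category.
candidates = [
    {"name": "Emerald pendant",   "similarity": 0.82, "category": "pendants"},
    {"name": "Green enamel ring", "similarity": 0.79, "category": "rings"},
    {"name": "Jade bracelet",     "similarity": 0.75, "category": "bracelets"},
]

# The user's past clicks, aggregated into simple category preferences.
click_history = {"rings": 7, "pendants": 2, "bracelets": 0}
total_clicks = sum(click_history.values())

def rerank_score(item, alpha=0.7):
    """Blend semantic similarity with a personalization signal (alpha is an invented knob)."""
    preference = click_history.get(item["category"], 0) / max(total_clicks, 1)
    return alpha * item["similarity"] + (1 - alpha) * preference

for item in sorted(candidates, key=rerank_score, reverse=True):
    print(round(rerank_score(item), 3), item["name"])
```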

Users can also search for broader contexts, like “wedding,” “red dress,” or “gothic.” The AI encodes these inputs into the latent space, matches them with suitable products, and displays the most relevant results. This capability even extends to brand names like “Swarovski,” surfacing relevant items — even if the shop doesn’t officially carry Swarovski products.

AI-Generated Nudges to Give Chat-Like Advice

Alongside search results, r2decide also generates AI-driven nudges — contextual recommendations or prompts designed to enhance the user experience. These nudges are powered by AI agents, as I described in my post on agentic AI yesterday. Their purpose is to guide users effortlessly toward the most relevant options, making the search process intuitive, engaging, and effective.

Multimodality in 2025: Infinite Possibilities for Enterprises

Multimodality is transforming industries, from healthcare to eCommerce. And it doesn’t stop there. Startups like TC Labs use multimodal AI to streamline engineering workflows, boosting efficiency and quality, while Toyota uses it for interactive, personalized customer assistance.

2025 will be the year multimodal AI transforms how enterprises work. Follow me here on Forbes, or on LinkedIn for more of my 2025 AI predictions.
