Here we go again! Multimodal AI is the “next big thing” in artificial intelligence technology. But what exactly does multimodal mean, and how does it differ from the AI models we’ve all grown familiar with?
Let’s take a look.
Multimodal AI refers to artificial intelligence systems that can process multiple types of data inputs to produce more accurate, sophisticated outputs than unimodal systems.
An example of a multimodal AI system is OpenAI's GPT-4V(ision). The major difference between this "V" version and GPT-4? GPT-4V can process image inputs in addition to text. Other examples include Runway Gen-2 for video generation and Inworld AI for character creation in games and digital worlds.
Today, most of the excitement around multimodal AI centers on its potential, as we'll see below. Beware, though, that multimodal AI is far from fully figured out.
(Know other AIs: adaptive AI, generative AI & what generative AI means for cybersecurity.)
With regard to artificial intelligence, modality refers to data types. Data modalities include — but are not limited to — text, images, audio, and video.
Many generative artificial intelligence systems can process only one type of input, like text, and provide output only in that same data modality. This makes them unimodal.
Multimodal AI gives users the ability to provide inputs in multiple data modalities and to generate outputs in those modalities. For example, if you give a multimodal system both text and images, it can produce both text and images.
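To make that concrete, here is a minimal sketch of a single request that mixes text and an image, assuming the OpenAI Python SDK and a vision-capable model like the GPT-4V mentioned above. The model identifier and image URL are placeholders, and an API key is required.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable

# One request that mixes two modalities: text and an image.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder: any vision-capable model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)  # the model answers in text, grounded in the image
```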
Multimodal AI systems are trained to identify patterns between different types of data inputs. These systems have three primary elements:
Remember how we discussed modality? A multimodal AI system actually consists of many unimodal neural networks. These make up the input module, which receives multiple data types.
Then there’s the fusion module, which combines, aligns, and processes the data from each modality. Fusion employs various techniques, such as early fusion (concatenating raw data). Finally, the output module serves up results. These vary greatly depending on the original input.
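To illustrate how those three elements fit together, here is a toy PyTorch sketch, not taken from any production system: one small encoder per modality stands in for the input module, concatenating their features stands in for the fusion module, and a classifier head plays the role of the output module. All names and dimensions are made up for the example.

```python
import torch
import torch.nn as nn

class ToyMultimodalModel(nn.Module):
    """Illustrative only: one unimodal encoder per modality, a fusion step, one output head."""

    def __init__(self, text_dim=300, image_dim=512, hidden=128, num_classes=10):
        super().__init__()
        # Input module: one unimodal network per data type.
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, hidden), nn.ReLU())
        # Fusion module: feature-level fusion by concatenation. True early fusion
        # would concatenate the raw inputs before any encoding.
        self.fusion = nn.Sequential(nn.Linear(hidden * 2, hidden), nn.ReLU())
        # Output module: a task-specific head (a classifier in this sketch).
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, text_features, image_features):
        t = self.text_encoder(text_features)
        i = self.image_encoder(image_features)
        fused = self.fusion(torch.cat([t, i], dim=-1))
        return self.head(fused)

model = ToyMultimodalModel()
logits = model(torch.randn(4, 300), torch.randn(4, 512))  # a batch of 4 text/image feature pairs
print(logits.shape)  # torch.Size([4, 10])
```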
One of the major upsides of multimodal AI models is context. Because these systems can recognize patterns and connections between different types of data inputs, the output is more accurate, natural, intuitive, and informative. And, of course, it’s more human.
Multimodal AI can also solve a wider variety of problems than unimodal systems — more on the possibilities below.
As with any new technology, multimodal AI comes with several downsides, including…
Multimodal AI requires large amounts of diverse data to be trained effectively. Collecting and labeling that data is expensive and time-consuming.
Different modalities exhibit different kinds and intensities of noise at different times, and they aren't necessarily temporally aligned. The diverse nature of multimodal data also makes fusing many modalities effectively a difficult task.
It’s challenging to properly align relevant data representing the same time and space when diverse data types (modalities) are involved.
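To see one narrow slice of the alignment problem, the hypothetical NumPy sketch below resamples an audio feature stream onto a video frame timeline so the two modalities share a common time base. The sample rates and random values are invented for illustration.

```python
import numpy as np

# Hypothetical streams: an audio feature sampled at 100 Hz, a video feature at 25 Hz.
audio_t = np.arange(0, 10, 0.01)            # timestamps in seconds
audio_x = np.random.randn(audio_t.size)     # stand-in for a 1-D audio feature
video_t = np.arange(0, 10, 0.04)
video_x = np.random.randn(video_t.size)     # stand-in for a 1-D video feature

# Resample the audio feature onto the video timeline so each video frame
# has a temporally aligned audio value.
audio_on_video_t = np.interp(video_t, audio_t, audio_x)

aligned = np.stack([audio_on_video_t, video_x], axis=1)  # shape: (num_frames, 2 modalities)
print(aligned.shape)
```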
Translation of content across many modalities, either between distinct modalities or from one language to another, is a complex undertaking known as multimodal translation. An example of this translation is asking an AI system to create an image based on a text description.
One of the biggest challenges of multimodal translation is making sure the model can comprehend the semantic information and connections between text, audio, and images. It's also difficult to create representations that effectively capture such multimodal data.
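In practice, text-to-image translation usually leans on a pretrained model. As a minimal sketch, assuming the open-source Hugging Face diffusers library and a public Stable Diffusion checkpoint (neither is mentioned in this post), it can look like this:

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumption: an example Stable Diffusion checkpoint identifier and an available GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # drop torch_dtype and use "cpu" without a GPU (much slower)

prompt = "a watercolor painting of a lighthouse at sunset"
image = pipe(prompt).images[0]   # text in, image out: one form of multimodal translation
image.save("lighthouse.png")
```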
Managing varying noise levels, handling missing data, and merging data from many modalities are some of the difficulties that come with multimodal representation.
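One common workaround for missing data is to zero-fill the absent modality and pass along a presence mask so the downstream model knows which inputs were actually observed. A hypothetical sketch, with made-up dimensions:

```python
import numpy as np

def fuse_with_missing(text_vec, image_vec, text_dim=4, image_dim=4):
    """Combine two modality vectors, tolerating a missing one (None)."""
    t = np.zeros(text_dim) if text_vec is None else text_vec
    i = np.zeros(image_dim) if image_vec is None else image_vec
    mask = np.array([text_vec is not None, image_vec is not None], dtype=float)
    # The downstream model receives the (possibly zero-filled) features plus
    # a presence mask indicating which modalities were actually present.
    return np.concatenate([t, i, mask])

print(fuse_with_missing(np.ones(4), None))  # image modality missing
```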
As with all artificial intelligence technology, there are several legitimate concerns surrounding ethics and user privacy.
Because AI is created by people — people with biases — AI bias is a given. This may lead to discriminatory outputs related to gender, sexuality, religion, race, and more.
What’s more, AI relies on data to train its algorithms. This data can include sensitive, personal information. This raises legitimate concerns about the security of social security numbers, names, addresses, financial information, and more.
(Related reading: AI ethics, data privacy & AI governance.)
Multimodal AI is an exciting development, but it has a long way to go. Even so, the possibilities are nearly endless, from vision-capable chatbots to video generation and lifelike virtual characters.
Between the challenges of executing these complex tasks and the legitimate privacy and ethical concerns raised by experts, it may be quite some time before multimodal AI systems are incorporated into our daily lives — be on the lookout for further developments.