
The Emergence of Generalist Multimodal AI Models


Interest in multimodal large language models (MLLMs) has exploded over the last year or so thanks to their versatile capabilities in tackling tasks across multiple varieties of data — text, images and videos, as well as time series and graph data.

Because MLLMs are designed to learn, reason and adapt their behavior based on contextual information — much like how human intelligence works — some experts also believe that further developing multimodal AI is a crucial step toward artificial general intelligence (AGI).

It’s because of this potential downstream impact of multimodal AI that there is now more attention on building truly “generalist” multimodal AI models. Such generalist multimodal models (GMMs) would be able to learn easily across diverse modalities, and adapt and perform well when confronted with different types of tasks.

A number of current multimodal models already aim for this kind of generality.

Foundation Models Paving the Way

The current trajectory toward generalist multimodal models has its roots in the development of pre-trained, deep-learning foundation models for processing natural language, vision, time series and graph-structured data.

Most notably, the 2018 introduction of foundation language models (FLMs) like BERT (Bidirectional Encoder Representations from Transformers) was pivotal in establishing the groundwork for models that could be pre-trained on massive text-based datasets using an attention-based architecture. These transformer models eventually paved the way for later large language models, like OpenAI’s GPT series.
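The "attention-based architecture" underpinning BERT and later LLMs boils down to scaled dot-product attention: each query scores every key, and those scores become softmax weights over the values. The following is a minimal, illustrative sketch in plain Python (all names here are for illustration, not from any particular library):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over toy Python lists.

    queries, keys: lists of equal-length float vectors;
    values: one value vector per key. Returns one output vector
    per query: a softmax-weighted mix of the value vectors.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# One query attending over two key/value pairs; it matches the
# first key more strongly, so the output leans toward value 1.
out = attention([[1.0, 0.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[10.0, 0.0], [0.0, 10.0]])
```

Production transformers add multiple heads, learned projections and masking on top of this core operation, but the weighting mechanism is the same.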

Similarly, foundation vision models (FVMs) like the vision transformer (ViT) and vision-language alignment models like CLIP and LLaVA helped to push forward the cross-modal capabilities of multimodal AI models.

While foundation models in language and vision have progressed quickly, efforts in developing foundation time series models (FTMs) and foundation graph models (FGMs) have been progressing more slowly due to the specificity of such models and their limited transferability between distinct datasets.

Nevertheless, the capabilities of time series models like Informer and TimeGPT, plus graph neural networks (GNNs) like GROVER, could potentially be translated over into generalist multimodal models — thus allowing GMMs to easily make future predictions based on historical, time-stamped data (i.e., time series forecasting), or to analyze various sets of entities and their mutual interactions (i.e., graph data).
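The task shape of time series forecasting — history in, future values out — can be illustrated with a deliberately simple baseline. The sketch below is not how Informer or TimeGPT work internally; it just shows the interface such models generalize:

```python
def moving_average_forecast(series, window=3, steps=2):
    """Forecast future points as the mean of the trailing window.

    A naive baseline for illustration only: real foundation time
    series models learn far richer temporal patterns, but they
    solve the same problem shape (historical values -> forecasts).
    """
    history = list(series)
    forecasts = []
    for _ in range(steps):
        recent = history[-window:]
        pred = sum(recent) / len(recent)
        forecasts.append(pred)
        history.append(pred)  # roll the prediction forward
    return forecasts

preds = moving_average_forecast([10.0, 12.0, 11.0, 13.0],
                                window=3, steps=2)
# preds -> [12.0, 12.0]
```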

Typical Model Pipeline

According to a recent survey from the Pacific Northwest National Laboratory examining the development of GMMs, a multimodal model with generalist capabilities would typically have the following components:

  • Input data pre-processor;
  • Universal learning module (encoder, decoder); and
  • Output data post-processor.

Via Munikoti, et al., “Generalist Multimodal AI: A Review of Architectures, Challenges and Opportunities”
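The three-stage pipeline can be sketched as a simple chain of functions. Every name below is a hypothetical stand-in mirroring the survey's terminology, not code from any real GMM:

```python
# Schematic of the three-stage GMM pipeline: pre-processor ->
# universal learning module -> post-processor.

def input_preprocessor(raw):
    # Serialize raw input into a token sequence
    # (here, trivially: the bytes of a string).
    return list(raw.encode("utf-8"))

def universal_learning_module(tokens):
    # Stand-in for the encoder/decoder backbone: encode tokens
    # into (toy) embeddings, then decode them back to outputs.
    embeddings = [float(t) for t in tokens]   # "encoder"
    return [int(e) for e in embeddings]       # "decoder"

def output_postprocessor(outputs):
    # Map model outputs back into a human-readable form.
    return bytes(outputs).decode("utf-8")

result = output_postprocessor(
    universal_learning_module(input_preprocessor("GMM")))
# result -> "GMM"
```

In a real GMM each stage is a learned component; the point here is only the data flow between them.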

Raw data of different modalities is pre-processed with the input data pre-processor, converting it into a form that can then be used by the universal learning module. This can be achieved via serialization or tokenization, where text, audio or images are transformed into a numerical “token” format so that they can be fed into the encoder of the universal learning module — which functions like a “backbone” for learning and reasoning.
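Tokenization looks different per modality: text maps words (or subwords) to integer ids, while images are commonly sliced into flattened patches, ViT-style. A toy sketch of both, with a made-up vocabulary for illustration:

```python
def tokenize_text(text, vocab):
    # Serialize text into integer tokens via a toy vocabulary;
    # unknown words get the id one past the vocabulary size.
    return [vocab.get(w, len(vocab)) for w in text.lower().split()]

def patchify_image(image, patch=2):
    """'Tokenize' an image by slicing it into flattened patches,
    mirroring how ViT-style encoders serialize pixels.

    image: 2D list (H x W) of pixel values, H and W divisible
    by the patch size.
    """
    tokens = []
    for r in range(0, len(image), patch):
        for c in range(0, len(image[0]), patch):
            tokens.append([image[r + dr][c + dc]
                           for dr in range(patch)
                           for dc in range(patch)])
    return tokens

text_tokens = tokenize_text("a cat", {"a": 0, "cat": 1})
image_tokens = patchify_image([[1,  2,  3,  4],
                               [5,  6,  7,  8],
                               [9, 10, 11, 12],
                               [13, 14, 15, 16]])
# text_tokens -> [0, 1]; image_tokens[0] -> [1, 2, 5, 6]
```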

The encoder converts the input tokens into representational embeddings that are positioned in a high-dimensional semantic space for universal learning. For instance, text-based data could be handled by any LLM, while images could be encoded by models like CLIP, or various modalities by multimodal models like ImageBind.
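The key property of that semantic space is that related inputs land near one another, which is typically measured by cosine similarity. A toy sketch with a hand-made embedding table (the vectors and words are invented for illustration):

```python
import math

# Toy embedding table: each token maps to a point in a
# (here 3-dimensional) semantic space.
EMBEDDINGS = {
    "cat":    [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car":    [0.0, 0.1, 0.9],
}

def cosine(u, v):
    # Cosine similarity: dot product over the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Semantically related tokens land closer together
# than unrelated ones.
sim_related = cosine(EMBEDDINGS["cat"], EMBEDDINGS["kitten"])
sim_unrelated = cosine(EMBEDDINGS["cat"], EMBEDDINGS["car"])
```

Real encoders learn these coordinates from data rather than storing a lookup table, but the geometry is what makes cross-modal alignment (e.g., CLIP's image–text matching) possible.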

Additionally, a projector might be needed to transform or “project” representational embeddings from the encoder into something that can be understood by the universal learning module.
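In practice the projector is often just a learned linear map (sometimes a small MLP) that moves encoder embeddings into the dimensionality the backbone expects. A minimal sketch with a fixed, made-up weight matrix standing in for the learned one:

```python
def project(embedding, weight):
    """Apply a linear projection: weight is a rows x cols matrix,
    embedding a cols-length vector. Here it maps a 3-d 'vision'
    embedding into the 2-d space a toy language backbone expects.
    """
    return [sum(weight[r][c] * embedding[c]
                for c in range(len(embedding)))
            for r in range(len(weight))]

W = [[1.0, 0.0, 0.0],   # toy 2x3 projection matrix
     [0.0, 1.0, 1.0]]   # (learned during training in a real GMM)

projected = project([0.5, 0.25, 0.25], W)
# projected -> [0.5, 0.5]
```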

The decoder then transforms the multimodal representational embeddings into outputs that are relevant to the task, and are informed by the cross-modal context gleaned from the previous steps.
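One way to picture decoding is as mapping an embedding back to the nearest entry in an output vocabulary. Real decoders are learned networks that generate text, labels or other outputs, but the interface is the same: embedding in, task-relevant output out. A hypothetical nearest-neighbor sketch:

```python
def decode(embedding, output_vocab):
    """Toy decoder: return the output token whose prototype vector
    is closest (by squared distance) to the given embedding.
    """
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(output_vocab,
               key=lambda tok: sqdist(output_vocab[tok], embedding))

# Made-up prototype vectors for two output classes.
OUTPUT_VOCAB = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}

label = decode([0.1, 0.9], OUTPUT_VOCAB)
# label -> "cat"
```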

Challenges

While the field of generalist multimodal AI is continuing to expand, there are still some potential issues to consider.

These include the shortage of multimodal datasets relative to the abundance of unimodal, text-based and image-based datasets. This scarcity stems from legitimate concerns over data privacy, as well as the considerable computational and labor expense of building truly comprehensive multimodal datasets that accurately pair massive amounts of text data with audio and image data (for example).

Other hurdles include the lack of sufficiently complex benchmarks to evaluate GMMs, beyond the usual ones that are geared primarily toward text and images.

Another barrier is that current multimodal learning is heavily skewed toward cross-modal learning, which often favors images and text over other modalities. More research is needed into capturing under-represented modalities — like thermal information in infrared images — which could then be leveraged to further develop generalist multimodal AI models for medical applications.

Despite these challenges, the process of further developing truly generalist multimodal AI is a crucial undertaking, especially with the prospect of establishing the necessary groundwork for AGI.

