Multimodal models are AI systems that can process and understand multiple forms of data, such as text, images, audio, and video. This allows them to make more informed decisions and generate more accurate results than unimodal models, which can only process a single form of data.
Multimodal Model
leewayhertz.com/multimodal-model

In today’s technology-driven world, artificial intelligence (AI) has become a vital catalyst driving growth for numerous industries, revolutionizing the way we live and work. One of AI’s latest and most exciting advancements is multimodality – a new cognitive AI frontier that combines multiple forms of sensory input to make more informed decisions. Multimodal AI came into the picture in 2022, and its possibilities have been expanding ever since, with efforts to align text/NLP and vision in a shared embedding space to facilitate decision-making. AI models must possess a certain level of multimodal capability to handle tasks such as recognizing emotions and identifying objects in images, making multimodal systems the future of AI.

Although still in its early stages, multimodal AI has already surpassed human performance in many tests. This is significant because AI is already a part of our daily lives, and such advancements have implications for various industries and sectors.

Multimodal AI aims to replicate how the human brain functions, utilizing an encoder, an input/output mixer, and a decoder. This approach enables machine learning systems to handle tasks that involve images, text, or both. By connecting different sensory inputs with related concepts, these models can integrate multiple modalities, allowing for more comprehensive and nuanced problem-solving. Hence, the first crucial step in developing multimodal AI is aligning the model’s internal representation across all modalities.

By incorporating various forms of sensory input, including text, images, audio, and video, these systems can solve complex problems that were once difficult to tackle. As this technology continues to gain momentum, many organizations are adopting it to enhance
their operations. Furthermore, the cost of developing a multimodal model is not prohibitive, making it accessible to a wide range of businesses. This article will delve deep into what a multimodal model is and how it works.

What is a multimodal model?
Multimodal vs. Unimodal AI models
How does the multimodal model work?
Benefits of a multimodal model
Multimodal AI use cases
How to build a multimodal model?

What is a multimodal model?

A multimodal model is an AI system designed to simultaneously process multiple forms of sensory input, similar to how humans experience the world. Unlike traditional unimodal AI systems, which are trained to perform a specific task using a single type of data, multimodal models are trained to integrate and analyze data from various sources, including text, images, audio, and video. This approach allows for a more comprehensive and nuanced understanding of the data, as it incorporates the context and supporting information essential for making accurate predictions.

Learning in a multimodal model involves combining disjointed data collected from different sensors and data inputs into a single model, resulting in more dynamic predictions than in unimodal systems. Using multiple sensors to observe the same data provides a more complete picture, leading to more intelligent insights.

To achieve this, multimodal models rely on deep learning techniques that involve complex neural networks. The model’s encoder layer transforms raw sensory input into a high-level abstract representation that can be analyzed and compared across modalities. The input/output mixer layer combines information from the various sensory inputs to generate a comprehensive representation of the data, while the decoder layer generates an output that represents the predicted result based on the input.

Multimodal models can potentially revolutionize how AI systems operate by mimicking how the human brain integrates multiple forms of sensory input to understand the world. With the ability to process and analyze multiple data sources simultaneously, the multimodal model offers a more advanced and intelligent approach to problem-solving, making it a promising area for future development.

Multimodal vs. Unimodal AI models
The multimodal and unimodal models represent two different approaches to developing artificial intelligence systems. While the unimodal model focuses on training systems to perform a single task using a single data source, the multimodal model seeks to integrate multiple data sources to analyze a given problem comprehensively. Here is a detailed comparison of the two approaches:

Scope of data: Unimodal AI systems are designed to process a single data type, such as images, text, or audio. In contrast, multimodal AI systems are designed to integrate multiple data sources, including images, text, audio, and video.

Complexity: Unimodal AI systems are generally less complex than multimodal AI systems since they only need to process one type of data. Multimodal AI systems, on the other hand, require a more complex architecture to integrate and analyze multiple data sources simultaneously.

Context: Since unimodal AI systems focus on processing a single type of data, they lack the context and supporting information that can be crucial in making accurate predictions. Multimodal AI systems integrate data from multiple sources and can provide more context and supporting information, leading to more accurate predictions.

Performance: While unimodal AI systems can perform well on tasks within their specific domain, they may struggle with tasks that require a broader understanding of context. Multimodal AI systems integrate multiple data sources and can offer more comprehensive and nuanced analysis, leading to more accurate predictions.

Data requirements: Unimodal AI systems require large amounts of data to be trained effectively since they rely on a single type of data. In contrast, multimodal AI systems can be trained with smaller amounts of data, as they integrate data from multiple sources, resulting in a more robust and adaptable system.
Technical complexity: Multimodal AI systems require a more complex architecture to integrate and analyze multiple sources of data simultaneously. This added complexity requires more technical expertise and resources to develop and maintain than unimodal AI systems.

Here is a comparison table between multimodal and unimodal AI:

Metric | Multimodal AI | Unimodal AI
Applications | Well-suited for tasks that require an understanding of multiple types of input, such as image captioning and video captioning | Well-suited for tasks that involve a single type of input, such as sentiment analysis and speech recognition
Computational Resources | Typically requires more computational resources due to the increased complexity of processing multiple modalities | Generally requires fewer computational resources due to the simpler processing of a single modality
Data Input | Uses multiple types of data input (e.g., text, images, audio) | Uses a single type of data input (e.g., text only, images only)
Examples of AI Models | CLIP, DALL-E, METER, ALIGN, SwinBERT | BERT, ResNet, GPT-3, YOLO
Information Processing | Processes multiple modalities simultaneously, allowing for a richer understanding and improved accuracy | Processes a single modality, limiting the depth and accuracy of understanding
Natural Interaction | Enables natural interaction through multiple modes of communication, such as voice commands and gestures | Limits interaction to a single mode, such as text input or button clicks

How does the multimodal model work?
Multimodal AI combines multiple data sources from different modalities, such as text, images, audio, and video. The process starts with individual unimodal neural networks trained on their respective input data, for example convolutional neural networks for images and recurrent neural networks for text. The output of these networks is a set of features that capture the essential characteristics of the input.

To accomplish this, multimodal architectures are composed of three main components, starting with the unimodal encoders. These encoders are responsible for processing the input data from each modality separately. For instance, an image encoder would process an image, while a text encoder would process text.

Once the unimodal encoders have processed the input data, the next component of the architecture is the fusion network. The fusion network’s primary role is to combine the features extracted by the unimodal encoders from the different modalities into a single representation. This step is critical to the success of these models. Various fusion techniques, such as concatenation, attention mechanisms, and cross-modal interactions, have been proposed for this purpose.

Finally, the last component of the multimodal architecture is the classifier, which makes predictions based on the fused data. The classifier’s role is to classify the fused representation into a specific output category or make a prediction based on the input. The classifier is trained on the specific task and is responsible for making the final decision.

One of the benefits of multimodal architectures is their modularity, which allows for flexibility in combining different modalities and adapting to new tasks and inputs. By combining information from multiple modalities, a multimodal model offers more dynamic predictions and better performance compared to unimodal AI systems. For instance, a model that can process both audio and visual data can understand speech better than a model that only processes audio data.
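To make the fusion step concrete, the sketch below shows one way the attention-based fusion mentioned above could look in PyTorch. It is an illustrative sketch only: the module name, feature dimensions, and pooling choice are assumptions rather than a specific published architecture, and the concatenation-based fusion used later in this article would simply replace the attention layer with torch.cat.

import torch

class CrossAttentionFusion(torch.nn.Module):
    # Illustrative fusion block: text features attend over image features
    # before a linear classifier makes the final prediction.
    def __init__(self, text_dim=256, image_dim=256, num_heads=4, num_classes=2):
        super().__init__()
        self.image_proj = torch.nn.Linear(image_dim, text_dim)
        self.attn = torch.nn.MultiheadAttention(
            embed_dim=text_dim, num_heads=num_heads, batch_first=True
        )
        self.classifier = torch.nn.Linear(text_dim, num_classes)

    def forward(self, text_features, image_features):
        # text_features: (batch, text_len, text_dim) from a text encoder
        # image_features: (batch, num_regions, image_dim) from a vision encoder
        image_features = self.image_proj(image_features)
        fused, _ = self.attn(
            query=text_features, key=image_features, value=image_features
        )
        pooled = fused.mean(dim=1)  # average over text positions
        return self.classifier(pooled)

# usage with random stand-in features
fusion = CrossAttentionFusion()
logits = fusion(torch.randn(8, 20, 256), torch.randn(8, 49, 256))  # shape (8, 2)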
Now let’s discuss how multimodal AI models and architectures work for different types of inputs, with real-life examples:

Image description generation and text-to-image generation

OpenAI’s CLIP, DALL-E, and GLIDE are powerful computer programs that can help us describe images and generate images from text. They are like super-intelligent robots that can analyze pictures and words and then use that information to create something new.

CLIP utilizes separate image and text encoders, pre-trained on large datasets, to predict which images in a dataset are associated with different descriptions. What’s fascinating is that CLIP features multimodal neurons that activate when exposed to both a text description and the corresponding image, indicating a fused multimodal representation.

DALL-E, on the other hand, is a massive 12-billion-parameter variant of GPT-3 that generates a series of output images to match the input text and ranks them using CLIP. The generated images are remarkably accurate and detailed, pushing the limits of what was once thought possible with text-to-image generation. GLIDE, the successor to DALL-E, still relies on CLIP to rank the generated images, but the image generation process is done using a diffusion model, allowing even more creative and realistic images to be generated, with stunning results.

These models showcase the incredible potential of the multimodal model and demonstrate how it can be used to generate high-quality, coherent descriptions and images from text and images. They have set a new standard for image description generation and text-to-image generation, and it will be exciting to see what further advancements are made in this field.

Visual question answering

Visual question answering is a challenging task that requires a model to correctly answer a question based on an image. Microsoft Research has developed some of the top-performing approaches for this task. One such approach is METER, a general framework for training high-performing end-to-end vision-language transformers. This framework uses a variety of sub-architectures for the vision encoder, text encoder, multimodal fusion, and decoder modules.

Another approach is the Unified Vision-Language Pretrained Model (VLMo), which jointly learns a dual encoder and a fusion encoder using a modular transformer network. The network consists of blocks with modality-specific experts and a shared self-attention layer, providing significant flexibility for fine-tuning. These models demonstrate the power of multimodal AI in combining vision and language to answer complex questions. With continued research and development, the possibilities for these models are endless.
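To make CLIP’s image-text matching more tangible, here is a minimal sketch using the Hugging Face transformers implementation of CLIP. The checkpoint name, image file, and candidate captions are assumptions for the example, not details from the original article:

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example_meme.png")  # hypothetical local image
captions = ["a cat sitting on a couch", "a dog playing in the park"]

# encode both modalities and score every caption against the image
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # probability of each caption
print(dict(zip(captions, probs[0].tolist())))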
Text-to-image and image-to-text search

Multimodal learning has important applications in web search, as demonstrated by the WebQA dataset created by experts from Microsoft and Carnegie Mellon University. In this task, a model must identify image- and text-based sources that can aid in answering a query. For most questions, multiple sources are required to arrive at the correct answer. The model must then reason over these sources to produce a natural language answer to the query.

Google has also tackled the challenge of multimodal search with its A Large-scale ImaGe and Noisy-Text Embedding model (ALIGN). ALIGN leverages the easily accessible but noisy alt-text data attached to internet images to train separate visual (EfficientNet-L2) and text (BERT-Large) encoders. These encoders’ outputs are then fused using contrastive learning, resulting in a model with multimodal representations that can power cross-modal search without the need for further fine-tuning.

Video-language modeling

AI systems have historically faced challenges with video-based tasks due to their resource-intensive nature. However, recent advancements in video-related multimodal tasks are making significant progress. Microsoft’s Project Florence-VL is a major effort in this field, and its introduction of ClipBERT in mid-2021 marks a major breakthrough. ClipBERT uses a combination of CNN and transformer models operating on sparsely sampled frames, optimized in an end-to-end fashion to solve popular video-language tasks. Evolutions of ClipBERT, including VIOLET and SwinBERT, utilize masked visual-token modeling and sparse attention to achieve state-of-the-art results in video question answering, retrieval, and captioning.

Despite their differences, all of these models share the same transformer-based architecture, combined with parallel learning modules to extract data from different modalities and unify them into a single multimodal representation.

Here is a table comparing some of the popular multimodal models:

Model Name | Modality | Key Features | Applications
CLIP | Vision and Text | Pre-trained image and text encoders, multimodal neurons | Image captioning, text-to-image generation
DALL-E | Text and Image | 12B-parameter GPT-3 variant that generates images from text | Text-to-image generation, image manipulation
GLIDE | Text and Image | Uses a diffusion model for image generation, ranks images with CLIP | Text-to-image generation, image manipulation
METER | Vision and Text | A general framework for vision-language transformers | Visual question answering
VLMo | Vision and Text | Modular transformer network with shared self-attention | Visual question answering
ALIGN | Vision and Text | Exploits noisy alt-text data to train separate encoders | Multimodal search
ClipBERT | Video and Text | Combination of CNN and transformer models, trained end-to-end | Video-language tasks: question answering, retrieval, captioning
VIOLET | Video and Text | Masked visual-token modeling, sparse attention | Video question answering, video retrieval
SwinBERT | Video and Text | Uses the Swin Transformer architecture for video tasks | Video question answering, video retrieval, video captioning

Benefits of a multimodal model

Contextual understanding

One of the most significant benefits of multimodal models is their ability to achieve contextual understanding, which is the ability of a system to comprehend the meaning of a sentence or phrase based on the surrounding words and concepts. In natural language processing (NLP), understanding the context of a sentence is essential for accurate language comprehension and for generating appropriate responses. In the case of multimodal AI, the model can combine visual and linguistic information to gain a more comprehensive understanding of the context.

By integrating multiple modalities, multimodal models can consider both the visual and textual cues in a given context. For example, in image captioning, the model must be able to interpret the visual information presented in the image and combine it with the linguistic information in the caption to produce an accurate and meaningful description. In video captioning, the model needs to understand not just the visual information but also the temporal relationships between events, sounds, and dialogue.

Another example of the benefits of contextual understanding in multimodal AI is in the field of natural language dialogue systems. Multimodal models can use visual and linguistic cues to generate more human-like conversation responses. For instance, a
chatbot that understands the visual context of the conversation can generate more appropriate responses that consider not just the text of the conversation but also the images or videos being shared.

Natural interaction

Another key benefit of multimodal models is their ability to facilitate natural interaction between humans and machines. Traditional AI systems have been limited in interacting with humans naturally, since they typically rely on a single mode of input, such as text or speech. However, multimodal models can combine multiple input modes, such as speech, text, and visual cues, to comprehensively understand a user’s intentions and needs.

For example, a multimodal AI system in a virtual assistant could use speech recognition to understand a user’s voice commands but also incorporate information from the user’s facial expressions and gestures to determine their level of interest or engagement. By taking these additional cues into account, the system can provide a more tailored and engaging experience for the user.

Multimodal models can also enable more natural language processing, allowing users to interact with machines in a more conversational manner. For example, a chatbot could use natural language understanding to interpret a user’s text message and incorporate visual cues from emojis or images to better understand the user’s tone and emotions, resulting in a more nuanced and effective response from the chatbot.

Improved accuracy

Multimodal models offer a significant advantage in terms of improved accuracy by integrating various modalities such as text, speech, images, and video. These models have the ability to capture a more comprehensive and nuanced understanding of the input data, resulting in more accurate predictions and better performance across a wide range of tasks.

Multiple modalities allow multimodal models to generate more precise and descriptive captions in tasks like image captioning. They can also enhance natural language processing tasks such as sentiment analysis by incorporating speech or facial expressions to gain more accurate insights into the speaker’s emotional state. Additionally, multimodal models are more resilient to incomplete or noisy data, as they can fill in the gaps or rectify errors by utilizing information from multiple modalities. For instance, a multimodal model that incorporates lip movements can enhance speech recognition accuracy in noisy environments, enabling clarity where the audio input may be unclear.

Improved capabilities
The ability of multimodal models to enhance the overall capabilities of AI systems is a significant advantage. These models can leverage data from multiple modalities, including text, images, audio, and video, to better understand the world and the context in which they operate, enabling AI systems to perform a broader range of tasks with greater accuracy and effectiveness.

For instance, by combining speech and facial recognition, a multimodal model could create a more robust system for identifying individuals. Analyzing both visual and audio cues improves accuracy in differentiating between individuals with similar appearances or voices, while contextual information such as the environment or behavior provides a better understanding of the situation, leading to more informed decisions.

Multimodal models can also enable more natural and intuitive interactions with technology, making it easier for users to interact with AI systems. By combining modalities such as voice and gesture recognition, a system can understand and respond to more complex commands and queries, leading to more efficient and effective use of technology and improved user satisfaction.

Multimodal AI use cases

The potential of multimodal AI knows no bounds when it comes to revolutionizing industries. Forward-thinking companies and organizations have taken notice of its incredible capabilities and are incorporating it into their digital transformation agendas. Here are some of the use cases of multimodal AI:

Automotive industry

The automotive industry has been an early adopter of multimodal AI, leveraging this technology to enhance safety, convenience, and the overall driving experience. In recent years, the industry has made significant strides in integrating multimodal AI into driver assistance systems, HMI (human-machine interface) assistants, and driver monitoring systems.

Driver assistance systems in modern cars are no longer limited to basic cruise control and lane detection; multimodal AI has enabled more sophisticated features like automatic emergency braking, adaptive cruise control, and blind spot monitoring. These systems leverage modalities like visual recognition, radar, and ultrasonic sensors to detect and respond to different driving situations.

Multimodal AI has also improved the HMI (human-machine interface) of modern vehicles. By enabling voice and gesture recognition, drivers can interact with their vehicles more intuitively and naturally. For example, drivers can use voice commands to adjust the temperature, change the music, or make a phone call without taking their hands off the steering wheel.
Furthermore, driver monitoring systems powered by multimodal AI can detect driver fatigue, drowsiness, and inattention by using multiple modalities like facial recognition, eye tracking, and steering wheel movements. If a driver is detected as inattentive or drowsy, the system can issue a warning or even take control of the vehicle to avoid accidents.

The possibilities of multimodal interaction with vehicles are endless, and there will soon be a time when communicating with a car through voice, image, or action is the norm. With multimodal AI, we can look forward to a future where our vehicles are more than just machines that take us from point A to B. They will become trusted companions with whom we can interact naturally and intuitively, making driving safer, more comfortable, and more enjoyable.

Healthcare and pharma

Multimodal AI has enormous potential to transform the healthcare and pharmaceutical industries by enabling faster and more accurate diagnoses, personalized treatment plans, and better patient outcomes. With the ability to analyze multiple data modalities, including image data, symptoms, background, and patient histories, multimodal AI can help healthcare professionals make informed decisions quickly and efficiently.

In healthcare, multimodal AI can help physicians diagnose complex diseases by analyzing medical images such as MRI, CT scans, and X-rays. When combined with clinical data and patient histories, these modalities provide a more comprehensive understanding of the disease and help physicians make accurate diagnoses. For example, in the case of cancer treatment, multimodal AI can help identify the type of cancer, its stage, and the best course of treatment.

Multimodal AI can also help improve patient outcomes by providing personalized treatment plans based on a patient’s medical history, genetic data, and other health parameters, helping physicians tailor treatments to individual patients and resulting in better outcomes and reduced healthcare costs.

In the pharmaceutical industry, multimodal AI can help accelerate drug development by identifying new drug targets and predicting the efficacy of potential drug candidates. By analyzing large amounts of data from multiple sources, including clinical trials, electronic health records, and genetic data, multimodal AI can identify patterns and relationships that may not be apparent to human researchers, helping pharmaceutical companies identify promising drug candidates and bring new treatments to market more quickly.

Media and entertainment

Multimodal AI has revolutionized the media and entertainment industry by enabling personalized content recommendations, targeted advertising, and efficient remarketing. By analyzing multiple data modalities, including user preferences, viewing history, and behavioral data, multimodal AI can deliver tailored content and advertising experiences to each individual user.
One of the most significant applications of multimodal AI in the media and entertainment industry is recommendation systems. By analyzing user data, including viewing history and preferences, multimodal AI can provide personalized recommendations for movies, TV shows, and other content, helping users discover new content that they may not have otherwise found and leading to increased engagement and satisfaction.

Multimodal AI is also used to deliver personalized advertising experiences. By analyzing user data, including browsing history, search history, and social media interactions, multimodal AI can create targeted advertising campaigns that are more likely to resonate with individual users. This can lead to higher click-through rates and conversions for advertisers while providing a more relevant and engaging experience for users.

In addition, multimodal AI is used for efficient remarketing. By analyzing user behavior and preferences, multimodal AI can create personalized messaging and promotions that are more likely to convert users who have abandoned their shopping carts or canceled subscriptions. This can help media and entertainment companies recover lost revenue while providing a better user experience.

Retail

Multimodal AI has transformed the retail industry by enabling personalized customer experiences, improved supply chain management, and enhanced product recommendations. By analyzing multiple data modalities, including customer behavior, preferences, and purchase history, multimodal AI can provide retailers with valuable insights that can be used to optimize their operations and drive revenue growth.

One of the most significant retail applications of multimodal AI is customer profiling. By analyzing customer data from various sources, including online behavior, in-store purchases, and social media interactions, multimodal AI creates a detailed profile of each customer, including their preferences, purchase history, and shopping habits. This information can be used to create personalized product recommendations, targeted marketing campaigns, and customized promotions that are more likely to resonate with individual customers.

Multimodal AI is also used for supply chain optimization. By analyzing data from various sources, including production, transportation, and inventory management, multimodal AI can help retailers optimize their supply chain operations, reducing costs and improving efficiency, which leads to faster delivery times, reduced waste, and improved customer satisfaction.

In addition, multimodal AI is used for personalized product recommendations. By analyzing customer data, including purchase history, browsing behavior, and preferences, multimodal AI can provide customers with personalized product recommendations, making it easier for them to discover new products and make informed purchase decisions. This can lead to increased customer loyalty and higher revenue for retailers.
Manufacturing

Multimodal AI is increasingly being used in manufacturing to improve efficiency, safety, and quality control. By integrating various input and output modes, such as voice, vision, and movement, manufacturing companies can leverage the power of multimodal AI to optimize their production processes and deliver better products to their customers.

One of the key applications of multimodal AI in manufacturing is predictive maintenance, where machine learning algorithms are used to analyze sensor data from machines and equipment, helping manufacturers identify potential issues before they occur and take proactive measures to prevent downtime and costly repairs. This can help improve equipment reliability, increase productivity, and reduce maintenance costs.

Multimodal AI can also be used in quality control by integrating vision and image recognition technologies to identify product defects and anomalies during production. By leveraging computer vision algorithms and machine learning models, manufacturers can automatically detect defects and make real-time adjustments to their production processes to minimize waste and improve product quality.

In addition, multimodal AI can be used to optimize supply chain management by integrating data from various sources, such as transportation systems, logistics networks, and inventory management systems, helping manufacturers improve their production planning and scheduling, reduce lead times, and improve overall supply chain efficiency.

How to build a multimodal model?

Take an image and add some text, and you have got a meme. In this example, we will describe the basic steps to develop and train a multimodal model to detect a certain type of meme. To do this, we need to execute the following steps.

Step 1: Import libraries

To commence, we import commonly used data science libraries to load and manipulate data. Here is a sample code for the same.

%matplotlib inline

import json
import logging
from pathlib import Path
import random
import tarfile
import tempfile
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_path  # Path style access for pandas
from tqdm import tqdm

Furthermore, it’s imperative to introduce some of the tools provided by deep learning libraries, which will aid in data exploration before model development. Specifically, we will be utilizing PyTorch, an open-source deep learning framework from Facebook AI, along with a selection of other libraries within the PyTorch ecosystem to simplify the creation of a versatile multimodal model. This combination of tools promises to make the process of building a model easier than ever before.

Considering that memes involve multiple data modes, namely vision and language, it would be beneficial to have access to a variety of vision and language models. PyTorch’s torchvision is a must-have when working on computer vision tasks, as it provides access to a multitude of popular datasets, model architectures (with pretrained weights), and image transformations. On the other hand, Facebook AI’s fasttext is a useful tool for training embeddings on language data, making it a great starting point before delving into more advanced methods like transformers.

To construct our multimodal classifier for identifying memes, we will incorporate a torchvision vision model for extracting features from meme images and a fasttext model for extracting features from their corresponding text. The resulting language and vision features will then be fused together using torch. With this in mind, let’s import these essential tools.

import torch
import torchvision
import fasttext

Step 2: Loading and exploring the data

The next step is to load the image data from the repository.
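Since the article does not show the loading code itself, here is a minimal sketch, assuming the meme images have been extracted into a local img/ directory and that a Pandas DataFrame named data_samples holds the dataset’s json records with an "img" path column (both names are assumptions):

from PIL import Image

img_dir = Path("img")  # assumed location of the extracted meme images

# data_samples is a hypothetical DataFrame built from the dataset's json records
sample_frame = data_samples.sample(5, random_state=42)

images = [
    Image.open(img_dir / Path(row["img"]).name).convert("RGB")
    for _, row in sample_frame.iterrows()
]

The resulting images list is what the transformation code below operates on.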
In order to prepare the images for training a model, it is necessary to resize them into tensor minibatches. The torchvision.transforms module provides useful tools to accomplish this task. A sequence of transformations can be applied to the images by utilizing the Compose object. For instance, we can resize the images using the Resize function (which may distort images if interpolation is required) and convert them into PyTorch tensors via ToTensor. After resizing the images to a uniform size, we can merge them into a single tensor object with torch.stack. The torchvision.utils.make_grid function can also be employed to visualize the images using Matplotlib easily. Here is a code sample for the same:

# define a callable image_transform with Compose
image_transform = torchvision.transforms.Compose(
    [
        torchvision.transforms.Resize(size=(224, 224)),
        torchvision.transforms.ToTensor()
    ]
)

# convert the images and prepare for visualization
tensor_img = torch.stack(
    [image_transform(image) for image in images]
)
grid = torchvision.utils.make_grid(tensor_img)

# plot
plt.rcParams["figure.figsize"] = (20, 5)
plt.axis('off')
_ = plt.imshow(grid.permute(1, 2, 0))

Step 3: Building a multimodal model

Now that we have an understanding of the data processing steps, we can move on to building the model. As with any machine learning project, there are three main areas to focus on:

1. Dataset handling
2. Model architecture
3. Training logic

Each of these areas is important and can be complex in its own right, especially when dealing with multimodal problems like ours. We will touch on each of them and explore how they relate to our specific problem. Finally, we will combine all three areas in the model training phase.

Our model requires the separate processing of transformed images and encoded text inputs for each dataset sample. This is achieved by accessing the "image" and "text" data independently through the PyTorch Dataset class. Subclassing a Dataset requires defining its size by overriding __len__ and defining how it returns a sample by overriding __getitem__. The Pandas DataFrame of json records is used to load images, subsample data, balance the training set, and return torch.tensors for model input. Samples are returned as dictionaries with keys for "id", "image", "text", and "label" (if it exists); a sketch of such a Dataset appears below. Now that we have a way of processing and organizing the meme data, we'll be able to use the torch.utils.data.DataLoader to actually serve the data.

Creating a multimodal model

We're going to implement a design called mid-level concat fusion. In the mid-level fusion technique using concatenation, each input data mode goes through its own module to extract features, which are then concatenated together. The resulting multimodal features are then fed into a classifier.
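Before turning to the fusion model itself, here is a minimal sketch of the Dataset subclass described above. The class name, column names, and the fasttext-based text transform are illustrative assumptions rather than the article's exact implementation:

class MemesDataset(torch.utils.data.Dataset):
    # Wraps a DataFrame of json records and returns dictionaries with
    # "id", "image", "text", and (when present) "label" entries.
    def __init__(self, samples_frame, img_dir, image_transform, text_transform):
        self.samples_frame = samples_frame.reset_index(drop=True)
        self.img_dir = Path(img_dir)
        self.image_transform = image_transform
        self.text_transform = text_transform  # e.g. a fasttext model's get_sentence_vector

    def __len__(self):
        return len(self.samples_frame)

    def __getitem__(self, idx):
        row = self.samples_frame.iloc[idx]
        image = Image.open(self.img_dir / Path(row["img"]).name).convert("RGB")
        sample = {
            "id": row["id"],
            "image": self.image_transform(image),
            "text": torch.tensor(
                self.text_transform(row["text"]), dtype=torch.float32
            ),
        }
        if "label" in row:
            sample["label"] = torch.tensor(row["label"], dtype=torch.long)
        return sample

Wrapping an instance of this class in a torch.utils.data.DataLoader then yields batches in which the "image", "text", and "label" entries are stacked into tensors.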
A sample class called LanguageAndVisionConcat performs mid-level fusion by concatenating the feature representations from the image and language modules, which are passed in as parameters. The concatenated features are then passed through a fully connected layer for classification. The class does not modify the architectures of the language and vision modules, making it easy to swap them out. The forward method expects both text and image inputs. More sophisticated approaches for fusing data modes exist, but this simple concatenation approach will suffice for our purposes.

class LanguageAndVisionConcat(torch.nn.Module):
    def __init__(
        self,
        num_classes,
        loss_fn,
        language_module,
        vision_module,
        language_feature_dim,
        vision_feature_dim,
        fusion_output_size,
        dropout_p,
    ):
        super(LanguageAndVisionConcat, self).__init__()
        self.language_module = language_module
        self.vision_module = vision_module
        self.fusion = torch.nn.Linear(
            in_features=(language_feature_dim + vision_feature_dim),
            out_features=fusion_output_size
        )
        self.fc = torch.nn.Linear(
            in_features=fusion_output_size,
            out_features=num_classes
        )
        self.loss_fn = loss_fn
        self.dropout = torch.nn.Dropout(dropout_p)

    def forward(self, text, image, label=None):
        text_features = torch.nn.functional.relu(
            self.language_module(text)
        )
        image_features = torch.nn.functional.relu(
            self.vision_module(image)
        )
        combined = torch.cat(
            [text_features, image_features], dim=1
        )
        fused = self.dropout(
            torch.nn.functional.relu(
                self.fusion(combined)
            )
        )
        logits = self.fc(fused)
        pred = torch.nn.functional.softmax(logits, dim=1)
        # compute the loss on the raw logits (as CrossEntropyLoss expects),
        # not on the softmax probabilities
        loss = (
            self.loss_fn(logits, label)
            if label is not None else label
        )
        return (pred, loss)

Step 4: Training a multimodal model
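Before writing the loop, it helps to see how the two encoder modules passed to LanguageAndVisionConcat might be set up. The sketch below is an assumption for illustration: the text features arrive pre-embedded by fasttext inside the Dataset, and the vision encoder is a torchvision ResNet with its classification head replaced.

# feature sizes are illustrative choices
language_feature_dim = 300
vision_feature_dim = 300

# language module: each caption is already a 300-d fasttext sentence vector,
# so a single linear projection is enough here
language_module = torch.nn.Linear(in_features=300, out_features=language_feature_dim)

# vision module: a pretrained ResNet whose final classification layer is
# replaced by a projection to the chosen vision feature size
vision_module = torchvision.models.resnet152(pretrained=True)
vision_module.fc = torch.nn.Linear(in_features=2048, out_features=vision_feature_dim)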
Next, we need to define our training loop. We first define our loss function and optimizer, then set up our data loaders and model. We then train the model for a certain number of epochs, iterating through batches of data and computing the loss and gradients with respect to the model parameters. After each epoch, we evaluate the model on the validation set and print out the validation accuracy. Here's an example of what that might look like:

import torch.optim as optim
import torch.nn as nn
from torch.utils.data import DataLoader
from path.to.your.file import LanguageAndVisionConcat
# Make sure to replace path.to.your.file with the correct path to the file
# where the LanguageAndVisionConcat class defined above lives.

# define our loss function, model, and optimizer
criterion = nn.CrossEntropyLoss()
model = LanguageAndVisionConcat(
    num_classes=2,  # binary meme classification
    loss_fn=criterion,
    language_module=language_module,
    vision_module=vision_module,
    language_feature_dim=language_feature_dim,
    vision_feature_dim=vision_feature_dim,
    fusion_output_size=256,  # illustrative value
    dropout_p=0.1,           # illustrative value
)
optimizer = optim.Adam(model.parameters(), lr=0.001)

# set up our data loaders; train_dataset and val_dataset are instances of the
# Dataset class described in step 3
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# train the model
num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    for i, batch in enumerate(train_loader):
        images, texts, labels = batch["image"], batch["text"], batch["label"]
        optimizer.zero_grad()
        # the model returns (softmax predictions, loss) when labels are supplied
        preds, loss = model(text=texts, image=images, label=labels)
        loss.backward()
        optimizer.step()
        if i % 100 == 0:
            print(f"Epoch {epoch+1}, batch {i+1}: loss {loss.item():.4f}")

    # evaluate on the validation set
    model.eval()
    total_correct = 0
    total_samples = 0
    with torch.no_grad():
        for batch in val_loader:
            images, texts, labels = batch["image"], batch["text"], batch["label"]
            preds, _ = model(text=texts, image=images)
            predictions = torch.argmax(preds, dim=1)
            total_correct += (predictions == labels).sum().item()
            total_samples += len(labels)
    accuracy = total_correct / total_samples
    print(f"Epoch {epoch+1} validation accuracy: {accuracy:.4f}")

Endnote

The emergence of multimodal AI and the release of GPT-4 mark a significant turning point in the field of AI, enabling us to process and integrate inputs from multiple modalities. Multimodal AI has the potential to revolutionize industries and sectors in ways we have never seen before. Imagine a world where machines and technology can understand and respond to human input more naturally and intuitively. With the advent of multimodal AI, this world is not only possible but quickly becoming a reality. From healthcare and education to entertainment and gaming, the possibilities of multimodal AI applications are limitless.

Furthermore, the impact of multimodal AI on the future of work and productivity cannot be overstated. With its ability to streamline processes and optimize operations, businesses will be able to work more efficiently and effectively, achieving increased profitability and growth in the process. The potential of this technology is immense, and it is incumbent upon us to fully embrace and leverage it for the betterment of society. We have a timely opportunity to seize the potential of this technology and shape a brighter future for ourselves and future generations.