Meta does not want to give ground to Google and Microsoft in the race to create the best artificial intelligence. To show that its commitment is serious, Mark Zuckerberg’s company presented ImageBind, an AI model that seeks to learn in the same way as human beings. To do this, Meta engineers adopted a multi-sensory framework that involves images, text, video and audio, as well as depth, thermal and inertial data.
ImageBind is part of Meta’s initiative to create multimodal systems that can learn from different types of data. The AI model not only understands an element but is capable of linking it with other features. For example, it can determine the sound, shape, temperature, and motion of the objects in a photograph.
“In typical AI systems, there is a specific embedding (that is, vectors of numbers that can represent data and their relationships in machine learning) for each respective modality. ImageBind shows that it is possible to create a joint embedding space across multiple modalities without the need to train on data with every different combination of modalities.”
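To make the idea of a joint embedding space concrete, here is a minimal, hypothetical sketch (not Meta’s actual code, and the toy encoders are invented for illustration): two encoders for different modalities map their inputs into vectors of the same dimension, so any pair of embeddings can be compared directly, regardless of modality.

```python
import numpy as np

EMBED_DIM = 4  # toy dimension; real models use hundreds or thousands


def embed_text(text: str) -> np.ndarray:
    # Hypothetical text encoder: folds character codes into a
    # fixed-size vector, then normalizes it to unit length.
    vec = np.zeros(EMBED_DIM)
    for i, ch in enumerate(text):
        vec[i % EMBED_DIM] += ord(ch)
    return vec / np.linalg.norm(vec)


def embed_audio(samples: list[float]) -> np.ndarray:
    # Hypothetical audio encoder: folds raw samples into a vector
    # in the SAME space as the text encoder's output.
    vec = np.zeros(EMBED_DIM)
    for i, s in enumerate(samples):
        vec[i % EMBED_DIM] += s
    return vec / np.linalg.norm(vec)


def similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity of unit vectors: because text and audio land
    # in one shared space, they are directly comparable.
    return float(np.dot(a, b))
```

The point is only the shared output space: a trained system would replace these toy encoders with neural networks, but the comparison step would be the same dot product.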
The company claims that ImageBind outperforms models trained for a single modality. Unlike generative AIs such as ChatGPT or Midjourney, Meta’s alternative binds six data types into a shared multidimensional index. Researchers could use any of them as an input method, or cross-reference them.
One of ImageBind’s features is that it uses a learning approach similar to that of people. “When humans absorb information from the world, we innately use multiple senses,” Meta said. The company notes that humans are able to generate sensory experiences when viewing an image. For example, when looking at a photo of a Ferrari you might think about the sound of the engine or the speed at which it travels.
“ImageBind uses the binding property of images, which means they co-occur with a variety of modalities and can serve as a bridge to connect them, such as linking text to image using web data, or linking motion to video using video data captured from handheld cameras with IMU sensors.”
The research found that the Meta model improves with only a few training examples. Although the first results are promising, it will be a while before we see ChatGPT-style applications built on ImageBind. However, that hasn’t stopped the company from talking about the possibilities it would open up.
For example, ImageBind could generate a suitable audio track for a video of the sea that you recorded on vacation, or build a virtual reality experience that simulates traveling on a boat, adding the elements needed to make it immersive. Designers could create animated shorts from an image and a sound file.
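The “suitable audio track” use case reduces to a nearest-neighbour search in the shared embedding space: embed the video, then pick the audio clip whose embedding is most similar. A hypothetical sketch follows; the embedding vectors here are made up for illustration, where a real system would obtain them from trained encoders.

```python
import numpy as np


def best_match(query: np.ndarray, candidates: dict[str, np.ndarray]) -> str:
    # Return the name of the candidate whose embedding has the highest
    # cosine similarity to the query embedding.
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    return max(candidates, key=lambda name: cos(query, candidates[name]))


# Made-up embeddings standing in for a trained model's output.
video_of_sea = np.array([0.9, 0.1, 0.0])
audio_clips = {
    "waves.wav": np.array([0.8, 0.2, 0.1]),
    "traffic.wav": np.array([0.0, 0.9, 0.4]),
    "birds.wav": np.array([0.1, 0.3, 0.9]),
}

print(best_match(video_of_sea, audio_clips))  # → waves.wav
```

With real encoders the same three lines of retrieval logic would scale to any of the six modalities, since every embedding lives in one space.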
Meta announced that ImageBind will be open source, so those interested will be able to access the repository on GitHub. Unlike OpenAI, the technology giant confirmed that it will maintain its strategy of opening its code to everyone, so that anyone can improve it or detect bugs.