After the success of ChatGPT and its multimillion-dollar investment in OpenAI, Microsoft has made it clear that its future is tied to artificial intelligence. The latest evidence is Kosmos-1, a new AI model capable of analyzing images and answering IQ test questions. According to Microsoft, this multimodal model could pave the way toward artificial general intelligence.
An Ars Technica report notes that Microsoft published the first Kosmos-1 paper on arXiv, Cornell University's preprint service. Titled “Language Is Not All You Need: Aligning Perception with Language Models”, the study presents the results of a new multimodal large language model (MLLM). According to the researchers, the AI can perceive general modalities, learn in context, and follow instructions.
The first results show that Kosmos-1 performs impressively on language understanding and perception-language tasks, including image recognition and reading text in images, and can even answer IQ test questions. The model can analyze images and answer questions about them, recognize text within them, and caption them.
The difference between Kosmos-1 and ChatGPT
Unlike ChatGPT, Kosmos-1 considers input modes such as text, images, audio and video. Although LLMs such as the one developed by OpenAI have served as a general-purpose interface for various natural language tasks, they do have a downside.
An LLM-based interface can be tailored to a task as long as the input and output can be transformed into text. Despite successful applications in natural language processing, it remains a struggle to use LLMs natively for multimodal data such as images and audio.
According to the researchers, the AI natively supports language, perception-language, and vision tasks. “Kosmos-1 is a multimodal large language model (MLLM) that can perceive general modalities, follow instructions, learn in context, and generate results,” they write.
The AI was trained on excerpts from The Pile, an 825 GB open-source dataset intended for large language models. Microsoft also used Common Crawl, a gigantic repository of web data. After a training and tuning phase, the engineers ran a series of evaluations, and the results are promising.
In tests, Kosmos-1 was able to answer questions about images, such as an athlete's hairstyle, the reason a child was crying, or why a photo was funny. It also performed simple math operations and recognized text and numbers, such as the release date on a movie poster. In some cases, the AI provided more context and accurately answered follow-up questions.
Perhaps the most interesting part of the evaluation is the model's performance on Raven’s Progressive Matrices. The test consists of analyzing and completing a sequence of shapes and is used to measure human intelligence and abstract reasoning. On Raven’s test, Kosmos-1 answered correctly 22 percent of the time, above the 17 percent expected from random guessing.
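To put the 22 percent figure in context, the short Python sketch below compares it with the accuracy expected from guessing at random. It assumes each question offers six candidate answers, which is what a chance rate of roughly 17 percent implies; the names chance_accuracy and NUM_OPTIONS are illustrative and do not come from Microsoft's paper.

```python
def chance_accuracy(num_options: int) -> float:
    """Accuracy expected from picking uniformly at random among the options."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


# Assumption: six candidate answers per question, inferred from the
# roughly 17 percent random-chance figure quoted in the article.
NUM_OPTIONS = 6

# Accuracy reported for Kosmos-1 on the Raven test in the article.
REPORTED_ACCURACY = 0.22

baseline = chance_accuracy(NUM_OPTIONS)
margin_points = (REPORTED_ACCURACY - baseline) * 100
relative_gain = REPORTED_ACCURACY / baseline

print(f"chance baseline: {baseline:.1%}")
print(f"margin over chance: {margin_points:.1f} percentage points")
print(f"ratio to chance: {relative_gain:.2f}x")
```

Under that assumption, the model's margin over guessing is only a few percentage points, which is why the result is better read as an early signal than as human-level performance.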
The results indicate that the model is able to perceive abstract patterns in a non-verbal context. According to the scientists, this is the first time an AI model has taken the Raven test in a zero-shot setting. Although the score is still a long way from what an average adult achieves, Kosmos-1 shows that multimodal language models are key to developing artificial intelligence that surpasses humans.
Microsoft is taking its first steps toward general AI. It is worth noting that Kosmos-1 has no relation to ChatGPT. The engineers developed this model without OpenAI's involvement and plan to open it up to other developers through its GitHub page.