Microsoft and NVIDIA have just announced the Megatron-Turing Natural Language Generation model (MT-NLG), powered by Microsoft's DeepSpeed and NVIDIA's Megatron technologies. It is a monolithic transformer language model that, according to the two companies, stands out as "the largest and most powerful monolithic transformer language model trained to date".
NVIDIA and Microsoft achieved this training efficiency with their new language model. Its strengths include a training infrastructure accelerated by next-generation GPUs, combined with a distributed-learning software stack. In the following graph, the companies compare Megatron-Turing with other models, including the best-known model so far, GPT-3:
As the successor to Turing NLG 17B and Megatron-LM, MT-NLG has three times as many parameters as the largest existing model of this type, giving it greater accuracy across a broad set of natural language tasks. It can predict and complete words, and it handles reading comprehension, common-sense reasoning, natural language inference, and word-sense disambiguation.
NVIDIA explains that it remains to be seen how MT-NLG will shape the products of the future and motivate the community to push the limits of natural language processing (NLP). Language models with a large number of parameters, more data, and more training time acquire a richer and more nuanced understanding of language, for example gaining the ability to summarize books and even complete programming code.
The software behind the collaboration
According to NVIDIA, the collaboration brought together NVIDIA's Megatron-LM and Microsoft's DeepSpeed to create an efficient and scalable 3D-parallel system, able to combine data parallelism, pipeline parallelism, and tensor-slicing to solve these problems.
The system uses Megatron-LM's tensor-slicing to scale the model within a node and DeepSpeed's pipeline parallelism to scale it across nodes.
For example, for the 530-billion-parameter model, each model replica spans 280 NVIDIA A100 GPUs, with 8-way tensor-slicing within a node and 35-way pipeline parallelism between nodes.
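The arithmetic behind that layout can be sketched as follows. This is a minimal illustration of how the parallelism degrees multiply; the helper functions are hypothetical and not part of Megatron-LM or DeepSpeed:

```python
# Sketch of the 3D-parallel GPU layout described above.
# The function names here are hypothetical, for illustration only.

def gpus_per_replica(tensor_parallel: int, pipeline_parallel: int) -> int:
    """Each model replica spans tensor_parallel * pipeline_parallel GPUs."""
    return tensor_parallel * pipeline_parallel

def data_parallel_degree(total_gpus: int, tensor_parallel: int,
                         pipeline_parallel: int) -> int:
    """Any remaining GPUs replicate the model for data parallelism."""
    return total_gpus // gpus_per_replica(tensor_parallel, pipeline_parallel)

# MT-NLG figures from the article: 8-way tensor-slicing within a node,
# 35-way pipeline parallelism between nodes.
print(gpus_per_replica(tensor_parallel=8, pipeline_parallel=35))  # 280
```

The third axis, data parallelism, then replicates this 280-GPU group across however many GPUs the cluster provides.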
To train MT-NLG, Microsoft and NVIDIA say they built a training dataset of 270 billion tokens from English-language websites. Tokens, the smaller units into which natural language text is split, can be words, characters, or parts of words.
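As a rough illustration of what tokenization means, here is a naive whitespace-and-chunking sketch. Real models use learned subword vocabularies (such as BPE), so this is illustration only, not the tokenizer used for MT-NLG:

```python
# Naive tokenization sketch: split on whitespace, then chop long words
# into fixed-size pieces. This only illustrates the idea that tokens
# can be whole words or parts of words.

def naive_tokenize(text: str, max_piece: int = 4) -> list[str]:
    tokens = []
    for word in text.lower().split():
        # Break words longer than max_piece into subword-like chunks.
        for i in range(0, len(word), max_piece):
            tokens.append(word[i:i + max_piece])
    return tokens

print(naive_tokenize("Tokenization splits text"))
# ['toke', 'niza', 'tion', 'spli', 'ts', 'text']
```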
The dataset used for this work comes largely from The Pile, an 835GB collection of 22 smaller datasets created by the open-source AI research group EleutherAI. The Pile encompasses academic sources (such as arXiv and PubMed), communities (StackExchange, Wikipedia), code repositories (GitHub), and more.
Comparison with GPT-3
To get an idea of its power, the Megatron-Turing model (MT-NLG) includes 530 billion parameters, triple the number of the largest existing model to date, GPT-3. It is worth remembering that GPT-3 was created by OpenAI, the well-known artificial intelligence research organization co-founded by Elon Musk, originally as a non-profit, in which companies like Microsoft have invested hundreds of millions of dollars.
The GPT-3 language model is capable of programming, designing, and even discussing politics and economics. The tool was made available to the public through a commercial API.
At its launch last year, GPT-3 was the most powerful language model created to date. It is a machine learning model that analyzes text to predict the next word based on all the words that came before it, which is the basis of many natural language processing (NLP) applications.
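The "predict the next word from the previous words" idea can be sketched with a toy bigram model. Real language models like GPT-3 and MT-NLG use transformers trained on billions of tokens, not frequency counts; this sketch only conveys the basic principle:

```python
# Minimal next-word prediction sketch: a bigram model that predicts
# the most frequent follower of a word in a toy corpus. Illustration
# only; not how transformer language models actually work internally.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each other word.
followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the word most often observed after `word` in the corpus."""
    return followers[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat'
```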