Mo's Blog

An Optimistic Outlook from GPT-2

Today, I was playing with GPT-2, OpenAI's latest language model. GPT-2 takes a relatively simple approach: you take a deep neural network and train it to predict the continuation of a piece of text. Because the label occurs naturally right after the features in any text, we immediately have massive amounts of training data. And if we give it some thought, this approach is surprisingly potent.
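To make the self-supervised setup concrete, here is a minimal sketch (my own illustration, not GPT-2's actual tokenizer or data pipeline) of how (context, next-token) training pairs fall out of raw text for free:

```python
# Minimal sketch: self-supervised (context, next-token) pairs from raw text.
# Every position in the corpus yields a training example for free -- the
# "label" is simply the token that follows the context.

def make_training_pairs(text, context_size=3):
    tokens = text.split()  # naive whitespace tokenization, for illustration only
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        label = tokens[i]
        pairs.append((context, label))
    return pairs

pairs = make_training_pairs("the dog barks at the mailman every single day")
print(pairs[0])  # (('the', 'dog', 'barks'), 'at')
```

Real models use subword tokenization and much longer contexts, but the key point is the same: no human labeling is required.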

What such an artificial neural network can learn depends on the model's complexity. A simple model might only learn frequent combinations of letters; it might generate well-sounding but not necessarily existing words because it has learned the phonetic structure. By increasing the number of parameters, the model might grasp grammatical structures. For example, the fragment "The dog" is more likely to be followed by barks than by car. For the model, the token barks is only statistically different from car. In principle, we can express the grammar of a language with such distributions: the probability of any grammatically correct continuation is larger than zero, and for incorrect continuations it is zero (assuming grammatically correct training data).
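A crude way to see how grammar can emerge from continuation statistics is a bigram count model. This is a deliberately tiny sketch of my own, not what GPT-2 does internally, but it shows how "barks" beats "car" after "dog" purely from counts:

```python
from collections import Counter, defaultdict

# Tiny toy corpus; a real model would see millions of documents.
corpus = [
    "the dog barks at night",
    "the dog barks loudly",
    "the dog sleeps in the car",
]

# Count bigram frequencies: how often does each word follow a given word?
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

def p_next(prev, candidate):
    """Estimated probability that `candidate` follows `prev`."""
    total = sum(follows[prev].values())
    return follows[prev][candidate] / total if total else 0.0

print(p_next("dog", "barks"))  # 2/3: observed twice out of three continuations
print(p_next("dog", "car"))    # 0.0: never observed after "dog"
```

A continuation never seen after "dog" gets probability zero, which mirrors the claim above that ungrammatical continuations vanish under grammatically correct training data.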

In an extreme case, the model might memorize all possible continuations of its training texts, which would lead to results similar to a search engine. However, this is not what we see with GPT-2. I suppose we see these fascinating continuations because the model learns meaningful abstractions. Achieving this is somewhat of a black art: the engineers need to choose the number of parameters and the regularization carefully, so that the model performs well on evaluation data. One can readily think of cases where a model needs a true understanding of a text to predict meaningful continuations. For example, the following two sentences are very similar:

  1. The surgeon had to open the lab in order to...
  2. The scientist had to open the lab in order to...

However, one might expect completely different continuations:

  1. ...perform the life-saving operation on the little patient.
  2. ...unravel the mysteries of the universe.

It seems that penalizing errors on predicted continuations is enough to push a model toward meaningful abstractions. Many texts require a meticulous understanding of the beginning to make sense of what follows. Two examples: in mathematical papers, axioms are presented first, followed by the deduction of theorems; crime stories introduce evidence early on, prompting readers to analyze these clues to identify the culprit.

Currently, such language models seem to learn superficial statistics over words. But it doesn't look like we lack a magic ingredient to improve predictions: it's just more or better training data, more trainable parameters, and careful regularization.

I think the way Charles Darwin came up with the theory of evolution provides a good analogy for how humans build abstractions. Darwin went on a journey to various continents. During this voyage, he was exposed to many organisms, observed them, and took notes. When reviewing this material, he could see a common explanation, or a model. The theory of evolution allowed Darwin to explain phenomena he had never seen, simply because he understood the underlying mechanism. Darwin's theory can be described in a couple of pages, whereas a description of all forms of natural life would fill shelves.

Similarly, a language model needs to compress many facts into a coherent model to be able to predict as described.

It doesn't need to come up with something new. It's enough to ingest everything that's out there. No human can survey their entire field, but maybe such a model can, and, based solely on the existing literature, arrive at new insights.

The difference between making something new and assembling things that are out there is tiny.

I don't see why we cannot create a model that captures the universe of textual knowledge. And I guess this will be disruptive. The mechanical machines people have built since the industrial revolution obviously aren't as sophisticated as human workers, but machines nevertheless became critical for almost all manual labor. Similarly, I expect future versions of such models to become critical tools for knowledge workers.