Making Machines Understand Natural Language Better with BERT

BERT stands for Bidirectional Encoder Representations from Transformers. It is a substantial addition to the family of approaches that try to understand context in natural language the way humans perceive it. It achieves impressive results without major task-specific changes to the architecture: the pre-trained model is fine-tuned with just one additional output layer for a wide range of tasks, such as question answering, sentiment analysis and language inference.

The credit for this phenomenal work goes to Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. Their paper, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, describes how the model was constructed, implemented and tested on various Natural Language Processing applications. It reported commendable results, pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7% (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5 absolute improvement), outperforming human performance by 2.0.

Natural language is not just a series of words put together; it carries semantic meaning. It is very difficult to hand-code an architecture that can respond to every instance of a language, which is itself enormous and, most importantly, highly unstructured. Training a deep neural network on a large number of samples is the most practical way to capture its patterns effectively. RNN-based architectures have achieved significant results, but they under-perform when it comes to grasping context in long sequences.

So what makes BERT understand natural language better?

Attention!! Yes, exactly. It is the self-attention mechanism that gives BERT an edge over existing systems. Attention mechanisms have become an integral part of compelling sequence modelling and transduction models in various tasks, allowing dependencies to be modelled without regard to their distance in the input or output sequences.
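To make the idea a little more concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration only, not BERT's actual implementation: the toy vectors are made up, and the same vectors are reused as queries, keys and values, whereas a real Transformer derives the three from learned projections of the token embeddings and runs several attention heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over one sequence.

    queries, keys, values: lists of float vectors, one per token.
    Returns (outputs, weights); weights[i][j] is how strongly
    token i attends to token j.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this token's query with every token's key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        # Each output is a weighted average of all value vectors.
        out = [sum(w[j] * values[j][d] for j in range(len(values)))
               for d in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy 2-dimensional vectors for three tokens (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Note that every token attends to every other token in a single step, no matter how far apart they are in the sequence. That is what lets the model capture long-range dependencies.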

The key idea!!!

The intuition behind the new language model, BERT, is simple yet powerful. Researchers believe that a large enough deep neural network, trained on a large enough corpus, can capture the relationships within that corpus.

In the NLP domain, it is hard to get a large annotated corpus, so the researchers used a novel technique to obtain a lot of training data. Instead of having human beings label the corpus before feeding it into the neural network, they used large corpora that are freely available on the Internet: BookCorpus (Zhu, Kiros et al) and English Wikipedia (800M and 2,500M words respectively). Two approaches, each aimed at a different language task, are used to generate the labels for the language model[2].

  • Masked language model: To understand the relationship between words. The key idea is to mask some of the words in the sentence (around 15 percent) and use those masked words as labels to force the models to learn the relationship between words. For example, the original sentence would be:
The man went to the store. He bought a gallon of milk.

And the input/label pair to the language model is:

Input: The man went to the [MASK1]. He bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon
  • Sentence prediction task: To understand the relationships between sentences. This task asks the model to predict whether sentence B is likely to be the sentence that follows a given sentence A. Using the same example as above, we can generate training data like:
Sentence A: The man went to the store.
Sentence B: He bought a gallon of milk.
Label: IsNextSentence
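
The two labelling tricks above can be sketched in a few lines of Python. This is a simplified illustration, not the original pre-training code: whitespace splitting stands in for BERT's WordPiece tokenizer, a single `[MASK]` token replaces the numbered masks used in the example, the function names are hypothetical, and `NotNextSentence` is an assumed name for the negative label (the post only shows the positive one).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """Replace roughly mask_rate of the tokens with [MASK].

    Returns (masked_tokens, labels), where labels maps each masked
    position to the original token the model must predict.
    """
    rng = rng or random.Random()
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels

def next_sentence_pairs(sentences, rng=None):
    """Build (sentence_a, sentence_b, label) examples.

    About half the time B is the true next sentence; otherwise it is
    another sentence drawn at random. Needs at least three sentences.
    """
    rng = rng or random.Random()
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            examples.append((sentences[i], sentences[i + 1], "IsNextSentence"))
        else:
            # Any sentence except A itself and the true next sentence.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((sentences[i], rng.choice(candidates), "NotNextSentence"))
    return examples

rng = random.Random(0)  # fixed seed so the run is repeatable
tokens = "the man went to the store . he bought a gallon of milk .".split()
masked, labels = mask_tokens(tokens, rng=rng)
print(" ".join(masked))
print(labels)

corpus = [
    "The man went to the store.",
    "He bought a gallon of milk.",
    "Penguins are flightless birds.",
    "They live in the southern hemisphere.",
]
for example in next_sentence_pairs(corpus, rng=rng):
    print(example)
```

No human annotation is involved in either function: the labels come from the text itself, which is why such a large corpus can be used.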

Using BERT has two stages: Pre-training and fine-tuning.

Pre-training is fairly expensive (four days on 4 to 16 Cloud TPUs), but is a one-time procedure for each language (current models are English-only, but multilingual models will be released in the near future). Google has released a number of pre-trained models that are available for download. Most NLP researchers will never need to pre-train their own model from scratch.

Fine-tuning is inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. SQuAD, for example, can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%, which is the single system state-of-the-art[1].

Now that we know the fundamentals of BERT, let us explore it further!!

BERT can be fine-tuned for four basic types of NLP problems.

Task 1: Sentence pair classification: Generally, the goal is to determine whether two sentences are semantically equivalent, contradictory, neutral, etc.

Figure 1: Architecture for Sentence Pair classification task[3]

The pre-trained BERT model can be fine-tuned on data sets such as MNLI and QQP with no substantial effort.
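
For this task the two sentences are packed into a single input sequence: a `[CLS]` token, sentence A, a `[SEP]` token, sentence B and a final `[SEP]`, with segment ids marking which sentence each token belongs to. A rough sketch of that packing is shown below; the function name is hypothetical and whitespace splitting is only a stand-in for the real WordPiece tokenizer.

```python
def build_pair_input(tokens_a, tokens_b):
    """Pack two tokenized sentences into one BERT-style input.

    Returns (tokens, segment_ids): segment id 0 covers [CLS], sentence A
    and the first [SEP]; segment id 1 covers sentence B and the last [SEP].
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

sentence_a = "the man went to the store .".split()
sentence_b = "he bought a gallon of milk .".split()
pair_tokens, pair_segments = build_pair_input(sentence_a, sentence_b)
for token, segment in zip(pair_tokens, pair_segments):
    print(segment, token)
```

The final hidden state at the `[CLS]` position is then fed to the extra output layer, which predicts the relationship between the two sentences.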

Let us consider a use case scenario:

Consider a statement made by a political party's spokesperson. The opposing party would tend to respond to that statement. We can assess the relationship between the two statements and place it in one of several categories. Another use case is to assess speeches given by the same person on different occasions and check whether they contradict or agree with each other.

Task 2: Single sentence classification

Figure 2: Architecture for single sentence classification[3]

The model can be fine-tuned on the SST-2 and CoLA data sets for this task.

Simple use case: Analyzing tweet sentiment.

A tweet is a form of expression. Understanding the context behind a tweet is a fairly challenging task for a machine, because most of the time the sentiment cannot be quantified just by the weight of the individual words.
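
The one additional output layer for classification is conceptually small: a linear layer followed by a softmax over the sentence-level vector that BERT produces at the `[CLS]` position. The sketch below shows only that final step. All numbers are made up for illustration (a real BERT-Base vector has 768 dimensions, and the weights are learned during fine-tuning), and the function name and bias term are assumptions of this sketch.

```python
import math

def classify(cls_vector, weights, bias, labels):
    """Apply a linear layer plus softmax to a sentence vector.

    weights has one row per label. Returns (predicted_label, probabilities).
    """
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs

# Hypothetical 3-dimensional sentence vector and classifier parameters.
cls_vector = [0.2, -1.0, 0.5]
weights = [[1.0, 0.5, -0.3],    # row for "negative"
           [-0.8, -0.6, 0.9]]   # row for "positive"
bias = [0.0, 0.0]
sentiment, probs = classify(cls_vector, weights, bias, ["negative", "positive"])
print(sentiment, [round(p, 3) for p in probs])
```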

Task 3: Question Answering

Figure 3: Architecture for question answering tasks[3]

The BERT model can be fine-tuned on the SQuAD data set to act as a question answering system. It can be customized further by fine-tuning it on domain-specific data.

Sample input and output:

Input Question: Which country has the most powerful army?

Input Paragraph: The US Army size dwarfs that of most other nations and together with the country’s high expenditure on defense makes it a formidable force and arguably has the world’s most powerful army.

Output Answer: US Army

Applications: It can be used in building chat bots and other question answering systems.

Task 4: Single sentence tagging

Figure 4: Single sentence tagging task[3]

To evaluate performance on a token tagging task, the authors fine-tuned BERT on the CoNLL 2003 Named Entity Recognition (NER) data set. This data set consists of 200k training words which have been annotated as Person, Organization, Location, Miscellaneous, or Other (non-named entity).

Use case: Categorizing the words of a text into the relevant named-entity classes.
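
In a tagging task the model predicts one label per token, and the labels then have to be grouped back into entities. The sketch below assumes the common BIO convention (`B-` opens an entity, `I-` continues it, `O` means no entity) with the short type names PER, ORG and LOC; the sentence, the tags and the function name are made up for illustration.

```python
def extract_entities(tokens, tags):
    """Group per-token BIO tags into (entity_text, entity_type) pairs."""
    entities = []
    words, current_type = [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type)
        if starts_new:
            # Close any open entity, then open a new one.
            if words:
                entities.append((" ".join(words), current_type))
            words, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            words.append(token)  # continue the open entity
        else:
            # "O" tag: close any open entity.
            if words:
                entities.append((" ".join(words), current_type))
            words, current_type = [], None
    if words:
        entities.append((" ".join(words), current_type))
    return entities

ner_tokens = ["Alice", "Smith", "joined", "Acme", "Corp", "in", "Paris"]
ner_tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "B-LOC"]
entities = extract_entities(ner_tokens, ner_tags)
print(entities)
```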


This blog highlights how distinct natural language processing applications can be realized by taking a pre-trained language model and fine-tuning it appropriately on relevant data sets. BERT has exhibited state-of-the-art performance on the tasks above and could be extended to other domains, given suitable fine-tuning data.

All the beRt!!!


