The document outlines BERT (Bidirectional Encoder Representations from Transformers), a pretrained language model from Google that excels at a wide range of natural language processing tasks thanks to its bidirectional context learning and attention mechanisms. It covers BERT's architecture and its pretraining objectives, masked language modeling and next sentence prediction, as well as its WordPiece tokenizer, which handles out-of-vocabulary words by splitting them into subword units. The document also highlights BERT's effectiveness and its practical applications in areas such as text classification and sentiment analysis.
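
To make the tokenization and classification points concrete, here is a minimal sketch, assuming the Hugging Face transformers library, PyTorch, and the bert-base-uncased checkpoint; the two-label sentiment head and the example sentences are illustrative only, not part of the original document.

```python
# Minimal sketch (assumes the Hugging Face "transformers" package, PyTorch,
# and the "bert-base-uncased" checkpoint are available).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece splits an out-of-vocabulary word into known subword pieces
# instead of mapping it to [UNK].
print(tokenizer.tokenize("untrainable"))  # e.g. ['un', '##train', '##able']

# Hypothetical two-class sentiment head; the classification weights start
# randomly initialized and would need fine-tuning before the logits mean anything.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```

The same pattern, swapping in a fine-tuned checkpoint and task-specific labels, is the usual route to the text classification and sentiment analysis applications mentioned above.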