The pipeline for
State-of-the-Art NLP
Hugging Face
Agenda
Lysandre DEBUT
Machine Learning Engineer @ Hugging Face,
maintainer and core contributor of
huggingface/transformers
Anthony MOI
Technical Lead @ Hugging Face, maintainer and
core contributor of huggingface/tokenizers
Some slides were adapted from a previous Hugging Face talk by Thomas Wolf,
Victor Sanh and Morgan Funtowicz
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Hugging Face
Most popular open source NLP library
▪ 1,000+ research paper mentions
▪ Used in production by 1,000+ companies
Hugging Face
Today’s Menu
Subjects we’ll dive into today
● NLP: Transfer learning, transformer networks
● Tokenizers: from text to tokens
● Transformers: from tokens to predictions
Transfer Learning - Transformer networks
One big training to rule them all
NLP took a turn in 2018
Self-supervised Training &
Transfer Learning
Large Text Datasets
Compute Power
The arrival of the transformer architecture
Transfer learning
In a few diagrams
Sequential transfer learning
Learn on one task/dataset, transfer to another task/dataset
Pre-training (computationally intensive step → general-purpose model):
word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT, DistilBERT
Adaptation (downstream task/dataset):
Text classification, word labeling, question answering, ...
Transformer Networks
Very large models - State of the Art in several tasks
Transformer Networks
● Very large networks
● Can be trained on very big datasets
● Better than previous architectures at maintaining
long-term dependencies
● Require a lot of compute to be trained
Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. In NAACL, 2019.
Transformer Networks
Pre-training
Base model → Pre-trained language model
▪ Very large corpus
▪ $$$ in compute
▪ Days of training
Transformer Networks
Fine-tuning
Pre-trained language model → Fine-tuned language model
▪ Small dataset
▪ Training can be done on a single GPU
▪ Easily reproducible
Model Sharing
Reduced compute, cost, energy footprint
From 🏎 Smaller, faster, cheaper, lighter: Introducing DistilBERT, a
distilled version of BERT, by Victor Sanh
A deeper look at the inner mechanisms
Pipeline, pre-training, fine-tuning
Tokenizer → Pre-trained model → Adaptation head
Transfer Learning pipeline in NLP
From text to tokens, from tokens to prediction
Jim Henson was a puppeteer
→ Tokenization: Jim | Henson | was | a | puppet | ##eer
→ Convert to vocabulary indices: 11067 | 5567 | 245 | 120 | 7756 | 9908
→ Pre-trained model: tensor of hidden states (one vector per token)
→ Task-specific model: True 0.7886 / False 0.223
Pre-training
Many currently successful pre-training approaches are based on language
modeling: learning to predict Pθ(text) or Pθ(text | other text)
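Concretely (not on the original slide), for causal language modeling Pθ(text) factorizes with the chain rule as

P_\theta(x_1, \ldots, x_T) = \prod_{t=1}^{T} P_\theta(x_t \mid x_1, \ldots, x_{t-1})

while masked language modeling instead predicts P_\theta(x_t \mid x_1, \ldots, x_{t-1}, x_{t+1}, \ldots, x_T) for the masked positions.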
Advantages:
- Doesn’t require human annotation - self-supervised
- Many languages have enough text to learn high capacity models
- Versatile - can be used to learn both sentence and word representations with
a variety of objective functions
The rise of language modeling pre-training
Language Modeling
Objectives - MLM
The pipeline for State-of-the-Art Natural Language Processing
→ Tokenization:
['The', 'pipeline', 'for', 'State', '-', 'of', '-', 'the', '-', 'Art', 'Natural', 'Language', 'Process', '##ing']
→ Masking:
['The', 'pipeline', 'for', 'State', '-', 'of', '-', 'the', '-', 'Art', '[MASK]', 'Language', 'Process', '##ing']
→ Prediction for [MASK]: 'Natural', 'Artificial', 'Machine', 'Processing', 'Speech'
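The MLM objective can be poked at directly with the fill-mask pipeline from the transformers library; the sketch below is only an illustration, and the "bert-base-cased" checkpoint is an assumption (any masked-LM checkpoint works).

from transformers import pipeline

# Fill-mask pipeline with a BERT checkpoint; BERT marks the blank with [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-cased")

predictions = fill_mask("The pipeline for State-of-the-Art [MASK] Language Processing")
for prediction in predictions:
    # Each candidate carries the completed sequence and its probability score.
    print(prediction["sequence"], prediction["score"])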
Language Modeling
Objectives - CLM
The pipeline for State-of-the-Art Natural Language Processing
→ Tokenization:
['The', 'pipeline', 'for', 'State', '-', 'of', '-', 'the', '-', 'Art', 'Natural', 'Language', 'Process', '##ing']
→ Prediction (continuation, one token at a time):
['Process', '##ing', '(', 'NL', '##P', ')', 'software', 'which', 'will', 'allow', 'a', 'user', 'to', 'develop']
Tokenization
It doesn’t have to be slow
Tokenization
- Convert input strings to a sequence of numbers
- Goal: find the most meaningful and smallest possible representation
Its role in the pipeline:
Jim Henson was a puppeteer
→ Jim | Henson | was | a | puppet | ##eer
→ 11067 | 5567 | 245 | 120 | 7756 | 9908
Some examples
Let’s dive into the nitty-gritty
Word-based
Word by word tokenization
Split on spaces: Let's | do | tokenization!
Split on punctuation: Let | 's | do | tokenization | !
▪ Split on spaces, or following specific rules to obtain words
▪ What to do with punctuation?
▪ Requires large vocabularies: dog != dogs, run != running
▪ Out-of-vocabulary (aka <UNK>) tokens for unknown words
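A minimal sketch of these two word-level strategies in plain Python (a naive illustration, not the rule set an actual tokenizer would use):

import re

text = "Let's do tokenization!"

# Split on spaces only: punctuation stays attached to the words.
print(text.split())                      # ["Let's", 'do', 'tokenization!']

# Split on spaces and punctuation: punctuation becomes separate tokens.
print(re.findall(r"\w+|[^\w\s]", text))  # ['Let', "'", 's', 'do', 'tokenization', '!']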
Character
Character by character tokenization
▪ Split on characters individually
▪ Do we include spaces or not?
▪ Smaller vocabularies
▪ But less meaningful: individual characters don’t carry much meaning on their own
▪ End up with a very large number of tokens to be processed by the model
L e t ‘ s d o t o k e n i z a t i o n !
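For comparison, a character-level split is a one-liner (again just an illustration):

text = "Let's do tokenization!"

# One token per character; dropping spaces is a design choice.
print([char for char in text if not char.isspace()])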
Byte Pair Encoding
Welcome subword tokenization
▪ First introduced by Philip Gage in 1994, as a compression algorithm
▪ Applied to NLP by Rico Sennrich et al. in “Neural Machine Translation of Rare Words with
Subword Units”. ACL 2016.
Byte Pair Encoding
Welcome subword tokenization
Initial alphabet: A B C ... a b c ... ? ! ...
▪ Start with a base vocabulary using Unicode characters seen in the data
▪ Most frequent pairs get merged to a new token:
1. T + h => Th
2. Th + e => The
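A toy sketch of one BPE training iteration, counting pair frequencies and merging the most frequent pair. This is for intuition only, not the huggingface/tokenizers implementation.

from collections import Counter

# Corpus as word -> frequency, each word a tuple of current symbols.
vocab = {("T", "h", "e"): 5, ("T", "h", "i", "s"): 3, ("t", "h", "e", "n"): 2}

def most_frequent_pair(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(vocab, pair):
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])  # e.g. 'T' + 'h' -> 'Th'
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

pair = most_frequent_pair(vocab)  # ('T', 'h') is the most frequent pair here
vocab = merge(vocab, pair)        # 'Th' is now a single symbol
print(pair, vocab)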
Byte Pair Encoding
Welcome subword tokenization
▪ Fewer out-of-vocabulary tokens
▪ Smaller vocabularies
Let’s</w> do</w> token ization</w> !</w>
And a lot more
So many algorithms...
▪ Byte-level BPE as used in GPT-2 (Alec Radford et al. OpenAI)
▪ WordPiece as used in BERT (Jacob Devlin et al. Google)
▪ SentencePiece (Unigram model) (Taku Kudo et al. Google)
Tokenizers
Why did we build it?
▪ Performance
▪ One API for all the different tokenizers
▪ Easy to share and reproduce your work
▪ Easy to use any tokenizer, and re-train it on a new language/dataset
The tokenization pipeline
Inner workings
Normalization Pre-tokenization Tokenization Post-processing
The tokenization pipeline
Inner workings
Normalization Pre-tokenization Tokenization Post-processing
▪ Strip
▪ Lowercase
▪ Removing diacritics
▪ Deduplication
▪ Unicode normalization (NFD, NFC, NFKC, NFKD)
The tokenization pipeline
Inner workings
Normalization Pre-tokenization Tokenization Post-processing
▪ Set of rules to split:
- Whitespace use
- Punctuation use
- Something else?
The tokenization pipeline
Inner workings
Normalization Pre-tokenization Tokenization Post-processing
▪ Actual tokenization algorithm:
- BPE
- Unigram
- Word level
The tokenization pipeline
Inner workings
Normalization Pre-tokenization Tokenization Post-processing
▪ Add special tokens: for example [CLS], [SEP] with BERT
▪ Truncate to match the maximum length of the model
▪ Pad all sequences in a batch to the same length
▪ ...
Tokenizers
Let’s see some code!
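The exact code from these slides is not reproduced here; the following is a minimal sketch of the tokenizers Python API (recent versions of the library assumed) covering the pipeline stages above. The training file data.txt is a placeholder.

from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE
from tokenizers.normalizers import NFD, Lowercase, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Tokenization model: BPE with an unknown token for out-of-vocabulary symbols.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Normalization and pre-tokenization stages.
tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()

# Train on raw text files ("data.txt" is hypothetical).
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["data.txt"], trainer=trainer)

# Encode a sentence: tokens and vocabulary indices.
output = tokenizer.encode("The pipeline for State-of-the-Art Natural Language Processing")
print(output.tokens)
print(output.ids)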
Tokenizers
How to install it?
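The package is published on PyPI, so installation is a one-liner (the slide’s original content is not reproduced here):

pip install tokenizers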
Transformers
Using complex models shouldn’t be complicated
Transformers
An explosion of Transformer architectures
BERT
▪ WordPiece tokenization
▪ MLM & NSP
ALBERT
▪ SentencePiece tokenization
▪ MLM & SOP
▪ Repeating layers
GPT-2
▪ Byte-level BPE tokenization
▪ CLM
Same API
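"Same API" in practice: a hedged sketch using the Auto classes, where only the checkpoint name changes across architectures.

from transformers import AutoModel, AutoTokenizer

# The same two lines load BERT, ALBERT or GPT-2; the Auto classes pick the
# right architecture from the checkpoint's configuration.
for checkpoint in ["bert-base-cased", "albert-base-v2", "gpt2"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)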
Transformers
As flexible as possible
Runs and trains on:
▪ CPU
▪ GPU
▪ TPU
With optimizations:
▪ XLA
▪ TorchScript
▪ Half-precision
▪ Others
All models
BERT & RoBERTa
More to come!
Transformers
Tokenization to prediction
The pipeline for State-of-the-Art Natural Language Processing
→ transformers.PreTrainedTokenizer
→ [[464, 11523, 329, 1812, 12, ..., 15417, 28403]]
→ transformers.PreTrainedModel (base model)
→ Tensor(batch_size, sequence_length, hidden_size)
→ transformers.PreTrainedModel (with task-specific head)
→ Task-specific prediction
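A hedged sketch of that flow in PyTorch; "gpt2" is just one possible checkpoint and a recent transformers version is assumed.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # a PreTrainedTokenizer
model = AutoModel.from_pretrained("gpt2")           # a PreTrainedModel (base, no head)

# Text -> token ids.
inputs = tokenizer("The pipeline for State-of-the-Art Natural Language Processing",
                   return_tensors="pt")

# Token ids -> hidden states of shape (batch_size, sequence_length, hidden_size).
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)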
Transformers
Available pre-trained models
transformers.PreTrainedTokenizer
transformers.PreTrainedModel
▪ We publicly host pre-trained tokenizer vocabularies and
model weights
▪ 1611 model/tokenizer pairs at the time of writing
Transformers
Pipelines
transformers.Pipeline
▪ Pipelines handle both the tokenization and prediction
▪ Reasonable defaults
▪ SOTA models
▪ Customizable
A few use-cases
That’s where it gets interesting
Transformers
Sentiment analysis/Sequence classification (pipeline)
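The code on this slide is not reproduced; below is a minimal sketch of the sentiment-analysis pipeline (the default checkpoint is downloaded automatically).

from transformers import pipeline

classifier = pipeline("sentiment-analysis")

# Returns a list of {"label": ..., "score": ...} dicts, one per input sequence.
print(classifier("Using complex models shouldn't be complicated"))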
Transformers
Question Answering (pipeline)
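Likewise for question answering; the question and context below are illustrative, not taken from the slide.

from transformers import pipeline

question_answerer = pipeline("question-answering")

result = question_answerer(
    question="What does the tokenizers library provide?",
    context="Hugging Face maintains transformers and tokenizers, two libraries "
            "providing pre-trained models and fast tokenization for NLP.",
)
# result is a dict with "answer", "score", "start" and "end" keys.
print(result)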
Transformers
Causal language modeling/Text generation
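And for causal language modeling / text generation; "gpt2" is a commonly used checkpoint and the generation parameters are illustrative.

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Continues the prompt token by token, as in the CLM objective earlier.
print(generator("The pipeline for State-of-the-Art Natural Language Processing",
                max_length=30, num_return_sequences=1))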
Transformers
Sequence Classification - Under the hood
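The slide-by-slide code is not reproduced; here is a hedged sketch of the same steps done manually with a recent transformers version: tokenize, run a model with a classification head, softmax the logits. The checkpoint is one common sentiment model, used only as an example.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Tokenization: text -> vocabulary indices (plus attention mask), as tensors.
inputs = tokenizer("Using complex models shouldn't be complicated", return_tensors="pt")

# Prediction: the task-specific head returns one logit per class.
with torch.no_grad():
    logits = model(**inputs).logits

# Softmax turns logits into class probabilities, as in the earlier diagram.
probabilities = torch.softmax(logits, dim=-1)[0]
for label_id, probability in enumerate(probabilities):
    print(model.config.id2label[label_id], float(probability))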
Transformers
Training models
Example scripts (TensorFlow & PyTorch)
- Named Entity Recognition
- Sequence Classification
- Question Answering
- Language modeling (fine-tuning & from scratch)
- Multiple Choice
Trains on TPU, CPU, GPU
Example scripts for PyTorch Lightning
Transformers
Just grazed the surface
The transformers library covers a lot more ground:
- ELECTRA
- Reformer
- Longformer
- Encoder-decoder architectures
- Translation & Summarization
Transformers + Tokenizers
The full pipeline?
Data (🤗 nlp) → Tokenization (Tokenizers) → Prediction (Transformers) → Metrics (🤗 nlp)
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hugging Face Tools