Neural Architectures for Named
Entity Recognition
GUILLAUME LAMPLE, MIGUEL BALLESTEROS, SANDEEP SUBRAMANIAN,
KAZUYA KAWAKAMI AND CHRIS DYER
NAACL-HLT 2016
Named Entity Recognition
• The task of identifying proper names in text and classifying them into a set of predefined
categories of interest
• Three universally accepted categories:
- Person, location and organization
• Other common entity types:
- Dates, times, email addresses, measures, etc.
• Other domain-specific entities:
- Names of drugs, genes, bibliographic references, etc.
2
Named Entity Recognition
• Example:
Lady Gaga [Person] is playing a concert for the Bushes [Person] in Texas [Location] next September [Time]
• Why NER?
- Machine Translation
- Question Answering
- Information retrieval
- Text-to-speech
3
Challenges
• Small amounts of supervised training data
- Language-specific features and knowledge resources are required
- Costly to develop for new languages or domains
• Unsupervised learning offers an alternative
- Existing systems [1,2] rely on unsupervised features to augment hand-engineered features
4
Solution
• Neural architectures
- No language-specific resources or features
- Only a small amount of supervised training data plus unlabeled corpora
• Two models
1. A bidirectional LSTM with a sequential conditional random field (CRF) layer above it: LSTM-CRF
2. A transition-based chunking model using stack LSTMs: S-LSTM
5
Intuitions
• Names often consist of multiple tokens
- LSTM-CRF : Captures dependencies across labels
- S-LSTM : Constructs labeled chunks of the input sequence directly
• Evidence that a token is a name includes both orthographic evidence and
distributional evidence
- Orthographic representation – character-based word representation
- Distributional representation – word embedding
6
Word Representation
• Character-based representation + word embedding
• Character-based model
- Proposed by Ling et al. [3]
- Randomly initialized character embedding matrix
- A bidirectional LSTM captures the orthographic information of a word
• Word embedding
- Pretrained embeddings using skip-n-gram [4]
• Dropout encourages the model to depend on both representations (sketched below)
7
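A minimal sketch of this word representation (PyTorch; the class, dimensions and dropout rate are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    """Character-BiLSTM output concatenated with a word embedding, then dropout."""
    def __init__(self, n_chars, n_words, char_dim=25, word_dim=100, p_drop=0.5):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)   # randomly initialized
        self.char_lstm = nn.LSTM(char_dim, char_dim, bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(n_words, word_dim)   # would be loaded from pretrained skip-n-gram vectors
        self.dropout = nn.Dropout(p_drop)

    def forward(self, char_ids, word_id):
        # char_ids: (1, word_length), word_id: (1,)
        _, (h_n, _) = self.char_lstm(self.char_emb(char_ids))
        char_repr = torch.cat([h_n[0], h_n[1]], dim=-1)   # forward + backward final states: 25 + 25 = 50
        word_repr = self.word_emb(word_id)                # 100-dimensional pretrained embedding
        return self.dropout(torch.cat([char_repr, word_repr], dim=-1))  # 150-dimensional word representation
```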
Word Representation
8
[Figure: the final word representation is the 100-dimensional pretrained word embedding concatenated with the 50-dimensional (25 forward + 25 backward) character-BiLSTM vector, with dropout applied to the result]
Why not a CNN?
• Existing approaches [5, 6] use CNNs for character-based word representations
• CNNs are designed to discover position-invariant features
• Within a word, important information is position-dependent
- e.g. prefixes and suffixes
9
LSTM-CRF Model
• A bidirectional LSTM transforms each word representation into a context representation
10
i_t = σ(W_i [x_t, h_{t-1}, c_{t-1}] + b_i)
o_t = σ(W_o [x_t, h_{t-1}, c_{t-1}] + b_o)
c_t = (1 - i_t) ⊙ c_{t-1} + i_t ⊙ tanh(W_c [x_t, h_{t-1}, c_{t-1}] + b_c)
h_t = o_t ⊙ tanh(c_t)
h'_t = [→h_t ; ←h_t]   (concatenation of the forward and backward hidden states)
[Figure: LSTM memory cell, showing the input x_t, gates i_t and o_t, cell state c_{t-1} → c_t, and hidden state h_{t-1} → h_t]
LSTM-CRF Model
• The output of the bidirectional LSTM is projected onto a hidden layer:
P = W_p h'_t
• P is of size n x k
- n : number of words in the sentence
- k : number of distinct tags
- P_ij : score of the j-th tag for the i-th word
• P is the input to the next layer, the CRF (a shape sketch follows this slide)
11
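A minimal shape sketch of the BiLSTM encoder and the projection to tag scores (PyTorch; the sentence length, dimensions and tag count below are illustrative assumptions):

```python
import torch
import torch.nn as nn

n, word_dim, hidden_dim, k = 9, 150, 100, 9   # sentence length, input dim, LSTM size, number of tags
words = torch.randn(1, n, word_dim)           # word representations from the previous slides

bilstm = nn.LSTM(word_dim, hidden_dim, bidirectional=True, batch_first=True)
proj = nn.Linear(2 * hidden_dim, k)           # W_p

h_prime, _ = bilstm(words)                    # (1, n, 2*hidden_dim): [forward ; backward] per word
P = proj(h_prime).squeeze(0)                  # (n, k): P[i, j] = score of tag j for word i
```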
LSTM-CRF Model
12
LSTM-CRF Model
13
LSTM-CRF Model
• The best tag sequence y* and the sum of scores over all possible tag sequences are
computed using dynamic programming (the scoring function is written out below)
• The log-probability of the correct tag sequence is maximized during training
• Example:
14
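Concretely, following the paper's formulation (A is the matrix of tag-transition scores, P the score matrix from the previous slides, and y_0, y_{n+1} are special start and end tags):

s(X, y) = Σ_{i=0..n} A_{y_i, y_{i+1}} + Σ_{i=1..n} P_{i, y_i}
p(y | X) = exp(s(X, y)) / Σ_{y'} exp(s(X, y'))

Training maximizes log p(y | X); decoding returns y* = argmax_{y'} s(X, y').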
Tagging scheme
• General tagging scheme – IOB format
- B-label : Beginning of a named entity
- I-label : Inside a named entity
- O-label : Outside any named entity
• Dai et al. [7] showed that a more expressive tagging scheme can improve performance
• No significant performance improvement was observed with IOBES (see the example below)
- S : Singleton entities
- E : End of a named entity
15
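An illustrative comparison of the two tagging schemes on a short sentence (hypothetical snippet, not from the paper):

```python
tokens = ["Mark", "Watney", "visited", "Mars"]
iob    = ["B-PER", "I-PER", "O", "B-LOC"]   # IOB: Begin / Inside / Outside
iobes  = ["B-PER", "E-PER", "O", "S-LOC"]   # IOBES adds End and Singleton tags
for token, a, b in zip(tokens, iob, iobes):
    print(f"{token:10s} {a:8s} {b}")
```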
Stack LSTM
• Proposed by Dyer et al. [8] for dependency parsing
• An LSTM augmented with a stack pointer
- The output of the cell under the stack pointer gives a summary of the stack
- It is used to decide c_{t-1} and h_{t-1} for the new input
• Stack operations are simulated using the pointer (sketched below)
- Push : Adds a new input to the LSTM
- Pop : Moves the stack pointer back to the previous element
16
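A minimal stack-LSTM sketch (PyTorch; an assumed implementation for illustration, not the authors' code). Earlier states are never overwritten: push computes a new state from the state under the pointer, while pop only moves the pointer back:

```python
import torch
import torch.nn as nn

class StackLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim, hidden_dim)
        zero = torch.zeros(1, hidden_dim)
        self.states = [(zero, zero)]      # state of the empty stack
        self.ptr = 0                      # stack pointer

    def push(self, x):                    # x: (1, input_dim)
        h, c = self.cell(x, self.states[self.ptr])
        self.states.append((h, c))        # old states are kept, never overwritten
        self.ptr = len(self.states) - 1

    def pop(self):
        self.ptr -= 1                     # just move the pointer back

    def summary(self):
        return self.states[self.ptr][0]   # output of the cell under the pointer
```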
Stack LSTM
• Example:
• Contents are never overwritten
17
Transition-Based Chunking Model
• Directly constructs representations of multi-token names
• Uses three stack LSTMs :
1. output : Contains completed chunks
2. stack : Contains partially completed chunks
3. buffer : Keeps words that have yet to be processed
18
Transition-Based Chunking Model
• Three actions
1. SHIFT : Moves a word from the buffer to the stack
2. OUT : Moves a word from the buffer directly to the output
3. REDUCE(y) : Pops the entire stack content, labels it y, and moves the chunk to the output
• The algorithm stops when the buffer and the stack are both empty
19
Transition-Based Chunking Model
• Example: the sequence of operations required to process the sentence Mark
Watney visited Mars (reconstructed in the sketch below)
20
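A reconstruction of that operation sequence as a small runnable sketch (the helper functions are hypothetical; the action order follows the paper's Mark Watney example):

```python
buffer = ["Mark", "Watney", "visited", "Mars"]
stack, output = [], []

def shift():
    stack.append(buffer.pop(0))              # buffer -> stack

def out():
    output.append(buffer.pop(0))             # buffer -> output (word is not part of an entity)

def reduce_to(label):
    output.append((" ".join(stack), label))  # label the whole chunk currently on the stack
    stack.clear()

for action in (shift, shift, lambda: reduce_to("PER"), out, shift, lambda: reduce_to("LOC")):
    action()

print(output)   # [('Mark Watney', 'PER'), 'visited', ('Mars', 'LOC')]
```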
Transition-Based Chunking Model
• The probability distribution over possible actions at each time step is computed from:
1. The current contents of the stack, buffer and output
2. The history of actions taken (a schematic form is given below)
• The maximum-probability action is chosen greedily
• Greedy decoding is not guaranteed to find the global optimum
21
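Schematically (our notation, not necessarily the paper's exact parametrization), with s_t, b_t, o_t, a_t the stack-LSTM summaries of the stack, buffer, output and action history, and A(s_t, b_t) the set of valid actions:

p(z_t | z_{<t}, X) ∝ exp( g_{z_t} · [s_t ; b_t ; o_t ; a_t] + q_{z_t} ),  normalized over z_t ∈ A(s_t, b_t)

where g_z is a learned representation of action z and q_z its bias.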
Transition-Based Chunking Model
22
Transition-Based Chunking Model
23
Novelty
• LSTM-CRF
- The word representation proposed by Ling et al. [3] is applied to language-independent NER
• Stack-LSTM
- The stack LSTM of Dyer et al. [8] is applied to language-independent NER
24
Evaluations
• Two datasets:
- CoNLL-2002 and CoNLL-2003
• Four languages:
- English, Spanish, German and Dutch
• Four types of named entities:
- Person, Location, Organization and Miscellaneous
25
Evaluations
Model F1
Lin and Wu (2009) 83.78
Passos et al. (2014) 90.05
Chiu and Nichols (2015) 90.69
LSTM-CRF 90.94
S-LSTM 90.33
26
English NER results (CoNLL-2003 test set) compared with models
trained with no external labeled data
Evaluations
Model F1
Collobert et al. (2011) 89.59
Lin and Wu (2009) 90.90
Huang et al. (2015) 90.10
Luo et al. (2015) + gaz 89.9
Luo et al. (2015) + gaz + linking 91.2
Passos et al. (2014) 90.90
Chiu and Nichols (2015) 90.77
LSTM-CRF 90.94
S-LSTM 90.33
27
English NER results (CoNLL-2003 test set) compared with models trained
with external labeled data
Evaluations
Model F1
Florian et al. (2003)* 72.41
Ando and Zhang (2005a) 75.27
Qi et al. (2009) 75.72
Gillick et al. (2015) 72.08
Gillick et al. (2015) * 76.22
LSTM-CRF 78.76
S-LSTM 75.66
28
German NER results (CoNLL-2003 test set)
Evaluations
Model F1
Carreras et al. (2002) 77.05
Nothman et al. (2013) 78.6
Gillick et al. (2015) 78.08
Gillick et al. (2015) * 82.84
LSTM-CRF 81.74
S-LSTM 79.88
29
Dutch NER results (CoNLL-2002 test set)
Evaluations
Model F1
Carreras et al. (2002) 81.39
Santos and Guimaraes (2015) 82.21
Gillick et al. (2015) 81.83
Gillick et al. (2015) * 82.95
LSTM-CRF 85.75
S-LSTM 83.93
30
Spanish NER results (CoNLL-2002 test set)
Evaluations
31
CoNLL-2002 and CoNLL-2003 test set results
Evaluations
Model Variant F1
LSTM char + dropout + pretrain 89.15
LSTM-CRF char + dropout 83.63
LSTM-CRF pretrain 88.39
LSTM-CRF char + pretrain 89.77
LSTM-CRF dropout + pretrain 90.20
LSTM-CRF char + dropout + pretrain 90.94
32
English NER results for variants of the LSTM-CRF model
Evaluations
Model Variant F1
S-LSTM char + dropout 80.88
S-LSTM pretrain 86.67
S-LSTM char + pretrain 89.32
S-LSTM dropout + pretrain 87.96
S-LSTM char + dropout + pretrain 90.33
33
English NER results for variants of the S-LSTM model
Evaluations
• The LSTM-CRF model achieves state-of-the-art performance in German and Spanish
• The LSTM-CRF model outperforms all existing models that do not use any
external labeled data
• The Stack-LSTM model is more dependent on the character-based representation
than LSTM-CRF
• Dropout on the word representation layer significantly improves performance
34
Predictions of LSTM-CRF
• Some correct predictions
Brokers__O said__O blue__O chips__O like__O IDLC__B-ORG ,__O Bangladesh__B-ORG
Lamps__I-ORG ,__O Chittagong__B-ORG Cement__I-ORG and__O Atlas__B-ORG
Bangladesh__I-ORG were__O expected__O to__O rise.__O
Jones__B-ORG Medical__I-ORG completes__O acquisition__O .__O
The__O Dow__B-ORG Chemical__I-ORG Co__I-ORG of__O the__O United__B-LOC
States__I-LOC will__O invest__O $__O 4__O billion__O to__O build__O an__O
ethylene__O plant__O in__O Tianjin__B-LOC city__O in__O northern__O China__B-LOC
,__O the__O China__B-ORG Daily__I-ORG said__O on__O Saturday__O .__O
35
Predictions of LSTM-CRF
• Some bad predictions
Jordan__B-LOC defends__O his__O decision__O to__O make__O the__O film__O ,__O
whose__O screenplay__O he__O wrote__O himself__O after__O years__O of__O
research__O ,__O saying__O it__O was__O more__O about__O history__O than__O
any__O political__O statement.__O
Cofinec__B-ORG said__O Petofi__B-LOC general__O manager__O Laszlo__B-PER
Sebesvari__I-PER had__O submitted__O his__O resignation__O and__O will__O be__O
leaving__O Petofi__B-ORG but__O will__O remain__O on__O Petofi__B-ORG 's__O
board__O of__O directors.__O
36
Related Work
• Neural Architectures
1. CNN-CRF [1]
2. LSTM-CRF [9]
• Language independent NER
1. Bayesian approach [10]
• NER with character-based representation
1. Byte-level processing of strings [11]
2. CNN-LSTM [12]
37
Strengths
• Captures the language-independent nature of NER
• Achieves state-of-the-art results in multiple languages with a simple architecture
• The reasoning behind each design decision is explained clearly
38
Weaknesses
• No architectural improvements over the existing models
• Lack of explanation of the S-LSTM model
39
Future Work
• Extend the CRF layer to a higher-order CRF [13]
• Evaluate the models when jointly learning NER and other NLP tasks [14]
40
Reference
[1] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.
[2] Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proc. ACL.
[3] Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015b. Finding function in form: Compositional character models for open vocabulary word representation. In Proc. EMNLP.
[4] Wang Ling, Chu-Cheng Lin, Yulia Tsvetkov, Silvio Amir, Ramón Fernandez Astudillo, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015a. Not all contexts are created equal: Better word representations with variable attention. In Proc. EMNLP.
[5] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.
[6] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-aware neural language models. CoRR, abs/1508.06615.
[7] Hong-Jie Dai, Po-Ting Lai, Yung-Chun Chang, and Richard Tzong-Han Tsai. 2015. Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization. Journal of Cheminformatics, 7(Suppl 1):S14.
41
Reference
[8] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proc. ACL.
[9] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.
[10] Jacob Eisenstein, Tae Yano, William W. Cohen, Noah A. Smith, and Eric P. Xing. 2011. Structured databases of named entities from Bayesian nonparametrics. In Proceedings of the First Workshop on Unsupervised Learning in NLP, pages 2–12.
[11] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2015. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.
[12] Jason P. C. Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308.
[13] Sunita Sarawagi and William W. Cohen. 2004. Semi-Markov conditional random fields for information extraction. In NIPS.
[14] Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint named entity recognition and disambiguation. In Proc. EMNLP.
42
43
Q & A
44
Thank You
