SelQA: A New Benchmark for
Selection-based Question Answering
Tomasz Jurczyk*, Michael Zhai, Jinho D. Choi
https://github.com/emorynlp/question-answering/
ICTAI 2016
11/8/2016
Selection-based Question Answering
How many airports are in Vietnam?
- Vietnam operates 21 major civil airports, including three
international gateways: (...)
- Tan Son Nhat is the nation's largest airport, handling (...)
- According to a state-approved plan, Vietnam will have 10
international airports by 2015
- The planned Long Thanh International Airport will have an
annual service capacity of (...)
- (...)
● A ranking problem: select the correct answer sentence among the candidates
● A single question may have more than one correct answer.
Tasks in Selection-based Question Answering
Answer Sentence Selection
● The original task in question answering
● Formulated as a ranking problem
● At least one correct answer is guaranteed to be among the candidates
● Measured by MAP and MRR scores.
Answer Triggering
● A recently proposed, more advanced version of the answer sentence selection task
● The assumption of having at least one correct answer among the candidates no longer holds
● Thus, the task can no longer be treated as a pure ranking problem
● Significantly more complex and difficult
● Measured by precision and recall (a metric sketch follows below).
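To make the two evaluation settings concrete, here is a minimal Python sketch of the metrics named above; the candidate labels and model decisions are invented for illustration and are not taken from the paper. MAP and MRR are the means of the per-question scores over all questions.

```python
# Minimal sketch of the evaluation metrics above; the labels
# below are invented for illustration.

def average_precision(labels):
    """Average precision for one ranked candidate list (1 = correct).
    MAP is the mean of this value over all questions."""
    hits, total = 0, 0.0
    for rank, label in enumerate(labels, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def reciprocal_rank(labels):
    """1 / rank of the first correct candidate (0 if none).
    MRR is the mean of this value over all questions."""
    for rank, label in enumerate(labels, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0

# Answer sentence selection: candidates sorted by model score,
# at least one correct answer is guaranteed to exist.
ranked_labels = [0, 1, 0, 1]             # hypothetical ranking
print(average_precision(ranked_labels))  # (1/2 + 2/4) / 2 = 0.5
print(reciprocal_rank(ranked_labels))    # 1/2 = 0.5

# Answer triggering: no answer is guaranteed, so a hard yes/no
# decision per candidate is scored with precision and recall.
predicted = [0, 1, 1, 0]                 # hypothetical decisions
gold      = [0, 1, 0, 0]
tp = sum(1 for p, g in zip(predicted, gold) if p and g)
precision = tp / sum(predicted)          # 1/2
recall    = tp / sum(gold)               # 1/1
```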
SelQA - A New Benchmark for Question Answering Tasks
● A corpus based on documents covering a variety of topics drawn from Wikipedia
● An effective annotation scheme is proposed to create a large dataset
● Additional annotation of questions with respect to topics, types, and paraphrases is provided
● Two recent state-of-the-art systems, based on convolutional and recurrent
neural networks, are implemented to provide strong baselines
The Process of Creating The Data Set
Tasks 1 & 2
● (Task 1) Given a section, annotators are asked to generate a question
● (Task 2) Given the same section with the previously used sentences highlighted,
the annotators are asked to generate another question
● The annotators are provided with the instructions, the topic, the article title,
the section title, and the list of numbered sentences in the section
● The question should be supported by one or more sentences in the paragraph
Observation: annotators tend to generate questions with some lexical overlap
with the corresponding contexts.
Task 3
● Given the context and the previously generated questions, the annotators are
asked to paraphrase each question.
● A necessary step in creating a corpus that evaluates reading comprehension
rather than the ability to model word co-occurrences.
Observation: a significant drop in word co-occurrence between questions and their contexts
Task 4
Observation: despite the high quality of the questions constructed in Tasks 1-3,
some questions can be answered only with additional context
Example: “How were the initial reviews?”
● Elasticsearch is used to select such suspicious, context-dependent questions
● The selected questions are sent back to Amazon Mechanical Turk
The Process of Creating The Data Set
An example of a question created through Tasks 1-4
Task-wise analysis w.r.t. WikiQA
Answer Triggering Data Set - Task 5
● Automatically generated using the previously created questions
● Elasticsearch is used to index the entire Wikipedia (~14 million sections)
and to query each question against it
● For each question, the top 5 most relevant sections are selected, whether or
not they contain the answer (a retrieval sketch follows below)
● As a result, 40.76% of the questions have corresponding answer contexts,
compared to 39.25% in the WikiQA data set.
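A minimal sketch of this retrieval step, written against the current Elasticsearch Python client; the index name and field name are assumptions for illustration, not the authors' actual 2016 setup.

```python
# Hedged sketch of the Task 5 retrieval step: for each question,
# fetch the top-5 most relevant Wikipedia sections, whether or not
# they contain the answer. Index and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def top5_sections(question):
    resp = es.search(
        index="wikipedia_sections",           # assumed index of ~14M sections
        query={"match": {"text": question}},  # assumed full-text field
        size=5,
    )
    return [hit["_source"]["text"] for hit in resp["hits"]["hits"]]
```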
Neural Network approaches used for evaluation
● Two systems, one based on a convolutional neural network and one on a
recurrent neural network, are used for evaluation
● Additionally, we propose a subtree matching mechanism for measuring
contextual similarity between two sentences (applied with the ConvNet system)
1. Convolutional neural network model: a single convolution with max pooling,
used as a feature in a logistic regression model together with several lexical
features (including the subtree matching features); a sketch follows below
2. Recurrent neural network model: a GRU-based bidirectional RNN with
attention.
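A minimal PyTorch sketch of the shape of the first model: one convolution with max pooling encodes each sentence, a bilinear term scores the question/answer pair, and the score is fed together with lexical features into a logistic regression layer. The dimensions and the bilinear similarity here are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ConvFeature(nn.Module):
    """Single convolution + max pooling over word embeddings,
    producing one fixed-size vector per sentence."""
    def __init__(self, emb_dim=100, filters=100, width=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, filters, kernel_size=width, padding=1)

    def forward(self, emb):                   # emb: (batch, seq_len, emb_dim)
        h = torch.relu(self.conv(emb.transpose(1, 2)))
        return h.max(dim=2).values            # (batch, filters)

class CnnLogReg(nn.Module):
    """Bilinear similarity between the pooled question/answer vectors,
    concatenated with lexical features (e.g. word overlap, subtree
    matching) and scored by logistic regression."""
    def __init__(self, emb_dim=100, filters=100, n_lex=4):
        super().__init__()
        self.encoder = ConvFeature(emb_dim, filters)
        self.sim = nn.Bilinear(filters, filters, 1)
        self.logreg = nn.Linear(1 + n_lex, 1)

    def forward(self, q_emb, a_emb, lex):     # lex: (batch, n_lex)
        s = self.sim(self.encoder(q_emb), self.encoder(a_emb))
        return torch.sigmoid(self.logreg(torch.cat([s, lex], dim=1)))
```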
Convolutional Neural Network with Logistic Regression
The subtree matching mechanism
For every word w_i common to question q and answer sentence a, calculate a
similarity score based on the similarity of the word's parent, sibling, and
child nodes in the two syntactic trees (a hedged sketch follows below)
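The slide gives only the intuition, so here is a hedged Python sketch of one plausible reading: the score for a shared word averages cosine similarities between its parent/sibling/child tokens in the two dependency trees, role matched against role. The exact scoring in the paper may differ.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def relatives(head, i):
    """Split token i's tree context into roles, using a head array
    (head[i] = index of i's syntactic head, -1 for the root)."""
    parent = [head[i]] if head[i] >= 0 else []
    siblings = [j for j, h in enumerate(head)
                if head[i] >= 0 and h == head[i] and j != i]
    children = [j for j, h in enumerate(head) if h == i]
    return {"parent": parent, "siblings": siblings, "children": children}

def subtree_match(i_q, i_a, q_head, a_head, q_emb, a_emb):
    """Score for a word shared by question and answer: average cosine
    similarity between its parents/siblings/children in the two trees."""
    rq, ra = relatives(q_head, i_q), relatives(a_head, i_a)
    sims = [cos(q_emb[m], a_emb[n])
            for role in rq for m in rq[role] for n in ra[role]]
    return sum(sims) / len(sims) if sims else 0.0

# Toy usage with 3-token trees and random embeddings (illustration only)
rng = np.random.default_rng(0)
q_head, a_head = [-1, 0, 0], [1, -1, 1]
q_emb, a_emb = rng.normal(size=(3, 50)), rng.normal(size=(3, 50))
print(subtree_match(1, 0, q_head, a_head, q_emb, a_emb))
```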
Answer Sentence Selection on WikiQA
Answer Sentence Selection on SelQA
Answer Triggering on WikiQA
Answer Triggering on SelQA
Conclusion & Future Work
● A new benchmark for selection-based question answering is presented
● Several configurations of state-of-the-art neural network models are used
to analyze the introduced corpus
● Analysis of various aspects with respect to the different models is shown
● More research on context-aware QA systems is needed
● We plan to continue working on large-scale corpora for open-domain question
answering.
Thank you!
Questions?
