SelQA: A New Benchmark for
Selection-based Question Answering
Tomasz Jurczyk*, Michael Zhai, Jinho D. Choi
https://github.com/emorynlp/question-answering/
ICTAI 2016
11/8/2016
Selection-based Question Answering
How many airports are in Vietnam?
- Vietnam operates 21 major civil airports, including three
international gateways: (...)
- Tan Son Nhat is the nation's largest airport, handling (...)
- According to a state-approved plan, Vietnam will have 10
international airports by 2015
- The planned Long Thanh International Airport will have an
annual service capacity of (...)
- (...)
● A ranking problem: select the correct answer sentence among the candidates
● A single question may have more than one correct answer.
Tasks in Selection-based Question Answering
Answer Sentence Selection
● The original task in question answering
● Formulated as a ranking problem
● At least one correct answer is guaranteed to be among the candidates
● Measured by MAP and MRR scores.
Answer Triggering
● A recently proposed, more advanced version of the answer sentence selection task
● The assumption of having at least one correct answer among the candidates no longer holds
● Thus, the task can no longer be treated as a pure ranking problem
● Significantly more complex and difficult
● Measured by precision and recall (a metric sketch follows below).
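To make the two evaluation settings concrete, here is a minimal Python sketch of the metrics named above; the candidate labels and model decisions are invented for illustration and are not taken from the paper. MAP and MRR are the means of the per-question scores over all questions.

```python
# Minimal sketch of the evaluation metrics above; the labels
# below are invented for illustration.

def average_precision(labels):
    """Average precision for one ranked candidate list (1 = correct).
    MAP is the mean of this value over all questions."""
    hits, total = 0, 0.0
    for rank, label in enumerate(labels, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def reciprocal_rank(labels):
    """1 / rank of the first correct candidate (0 if none).
    MRR is the mean of this value over all questions."""
    for rank, label in enumerate(labels, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0

# Answer sentence selection: candidates sorted by model score,
# at least one correct answer is guaranteed to exist.
ranked_labels = [0, 1, 0, 1]             # hypothetical ranking
print(average_precision(ranked_labels))  # (1/2 + 2/4) / 2 = 0.5
print(reciprocal_rank(ranked_labels))    # 1/2 = 0.5

# Answer triggering: no answer is guaranteed, so a hard yes/no
# decision per candidate is scored with precision and recall.
predicted = [0, 1, 1, 0]                 # hypothetical decisions
gold      = [0, 1, 0, 0]
tp = sum(1 for p, g in zip(predicted, gold) if p and g)
precision = tp / sum(predicted)          # 1/2
recall    = tp / sum(gold)               # 1/1
```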
SelQA - A New Benchmark for Question Answering Tasks
● A corpus based on documents covering a variety of topics drawn from Wikipedia
● An effective annotation scheme is proposed to create a large dataset
● Additional annotation of questions with respect to topics, types, and paraphrases is provided
● Two recent state-of-the-art systems, based on convolutional and recurrent
neural networks, are implemented to provide strong baselines
The Process of Creating The Data Set
Tasks 1 & 2
● (Task 1) Given a section, annotators are asked to generate a question
● (Task 2) Given the same section with the previously used sentences highlighted,
the annotators are asked to generate another question
● The annotators are provided with the instructions, the topic, the article title,
the section title, and the list of numbered sentences in the section
● The question should be supported by one or more sentences in the paragraph
Observation: annotators tend to generate questions with some lexical overlap
with the corresponding contexts.
Task 3
● Given the context and the previously generated questions, the annotators are
asked to paraphrase each question.
● A necessary step in creating a corpus that evaluates reading comprehension
rather than the ability to model word co-occurrences.
Observation: a significant drop in word co-occurrence between questions and their contexts
Task 4
Observation: despite the high quality of the questions constructed in Tasks 1-3,
some questions can be answered only with additional context
Example: “How were the initial reviews?”
● Elasticsearch is used to select such suspicious, context-dependent questions
● The selected questions are sent back to Amazon Mechanical Turk
The Process of Creating The Data Set
An example of a question created through Tasks 1-4
Task-wise analysis w.r.t. WikiQA
Answer Triggering Data Set - Task 5
● Automatically generated using the previously created questions
● Elasticsearch is used to index the entire Wikipedia (~14 million sections)
and to query each question against it
● For each question, the top 5 most relevant sections are selected, whether or
not they contain the answer (a retrieval sketch follows below)
● As a result, 40.76% of the questions have corresponding answer contexts,
compared to 39.25% in the WikiQA data set.
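A minimal sketch of this retrieval step, written against the current Elasticsearch Python client; the index name and field name are assumptions for illustration, not the authors' actual 2016 setup.

```python
# Hedged sketch of the Task 5 retrieval step: for each question,
# fetch the top-5 most relevant Wikipedia sections, whether or not
# they contain the answer. Index and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def top5_sections(question):
    resp = es.search(
        index="wikipedia_sections",           # assumed index of ~14M sections
        query={"match": {"text": question}},  # assumed full-text field
        size=5,
    )
    return [hit["_source"]["text"] for hit in resp["hits"]["hits"]]
```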
Neural Network approaches used for evaluation
● Two systems, one based on a convolutional neural network and one on a
recurrent neural network, are used for evaluation
● Additionally, we propose a subtree matching mechanism for measuring
contextual similarity between two sentences (applied with the ConvNet system)
1. Convolutional neural network model: a single convolution with max pooling,
used as a feature in a logistic regression model together with several lexical
features (including the subtree matching features); a sketch follows below
2. Recurrent neural network model: a GRU-based bidirectional RNN with
attention.
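A minimal PyTorch sketch of the shape of the first model: one convolution with max pooling encodes each sentence, a bilinear term scores the question/answer pair, and the score is fed together with lexical features into a logistic regression layer. The dimensions and the bilinear similarity here are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ConvFeature(nn.Module):
    """Single convolution + max pooling over word embeddings,
    producing one fixed-size vector per sentence."""
    def __init__(self, emb_dim=100, filters=100, width=3):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, filters, kernel_size=width, padding=1)

    def forward(self, emb):                   # emb: (batch, seq_len, emb_dim)
        h = torch.relu(self.conv(emb.transpose(1, 2)))
        return h.max(dim=2).values            # (batch, filters)

class CnnLogReg(nn.Module):
    """Bilinear similarity between the pooled question/answer vectors,
    concatenated with lexical features (e.g. word overlap, subtree
    matching) and scored by logistic regression."""
    def __init__(self, emb_dim=100, filters=100, n_lex=4):
        super().__init__()
        self.encoder = ConvFeature(emb_dim, filters)
        self.sim = nn.Bilinear(filters, filters, 1)
        self.logreg = nn.Linear(1 + n_lex, 1)

    def forward(self, q_emb, a_emb, lex):     # lex: (batch, n_lex)
        s = self.sim(self.encoder(q_emb), self.encoder(a_emb))
        return torch.sigmoid(self.logreg(torch.cat([s, lex], dim=1)))
```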
Convolutional Neural Network with Logistic Regression
The subtree matching mechanism
For every word w_i common to question q and answer sentence a, calculate a
similarity score based on the similarity of the word's parent, sibling, and
child nodes in the two syntactic trees (a hedged sketch follows below)
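The slide gives only the intuition, so here is a hedged Python sketch of one plausible reading: the score for a shared word averages cosine similarities between its parent/sibling/child tokens in the two dependency trees, role matched against role. The exact scoring in the paper may differ.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def relatives(head, i):
    """Split token i's tree context into roles, using a head array
    (head[i] = index of i's syntactic head, -1 for the root)."""
    parent = [head[i]] if head[i] >= 0 else []
    siblings = [j for j, h in enumerate(head)
                if head[i] >= 0 and h == head[i] and j != i]
    children = [j for j, h in enumerate(head) if h == i]
    return {"parent": parent, "siblings": siblings, "children": children}

def subtree_match(i_q, i_a, q_head, a_head, q_emb, a_emb):
    """Score for a word shared by question and answer: average cosine
    similarity between its parents/siblings/children in the two trees."""
    rq, ra = relatives(q_head, i_q), relatives(a_head, i_a)
    sims = [cos(q_emb[m], a_emb[n])
            for role in rq for m in rq[role] for n in ra[role]]
    return sum(sims) / len(sims) if sims else 0.0

# Toy usage with 3-token trees and random embeddings (illustration only)
rng = np.random.default_rng(0)
q_head, a_head = [-1, 0, 0], [1, -1, 1]
q_emb, a_emb = rng.normal(size=(3, 50)), rng.normal(size=(3, 50))
print(subtree_match(1, 0, q_head, a_head, q_emb, a_emb))
```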
Answer Sentence Selection on WikiQA
Answer Sentence Selection on SelQA
Answer Triggering on WikiQA
Answer Triggering on SelQA
Conclusion & Future Work
● A new benchmark for selection-based question answering is presented
● Several configurations of state-of-the-art neural network models are used
to analyze the introduced corpus
● Analysis of various aspects with respect to the different models is shown
● More research on context-aware QA systems is needed
● We plan to continue working on large-scale corpora for open-domain question
answering.
Thank you!
Questions?
