Multihop Reasoning: Datasets, Models, and Leaderboards
Ashish Sabharwal, Allen Institute for AI
Workshop on Reasoning for Complex Question Answering (RCQA), AAAI 2019
Talk Outline
• Open Book Question Answering
  • OpenBookQA: Multi-hop reasoning over text with Partial Context
• QA via Entailment
  • MulTeE: multi-layer aggregation of textual entailment states
• Designing Leaderboards
  • Stakeholders, Design choices
Open Book Question Answering: A New Challenge for Multihop Reasoning with Partial Context
Data + Models
Leaderboard!
http://data.allenai.org/OpenBookQA
https://leaderboard.allenai.org
What’s an Open Book Exam?
Bounded piece of core knowledge + Common(sense) world knowledge
GOAL: Probe deeper understanding, not memorization skills
• Application of core principles to new situations
Open Book QA for Machines
Bounded piece of core knowledge + Common(sense) world knowledge
• 1,326 core science facts / principles, e.g., metals conduct electricity
  • Basis of questions in the dataset
  • Central to scientific explanations
  • Insufficient by themselves
GOAL: Probe multihop reasoning capability, with partial context
Example
[Figure: a sample question answered by combining a core fact (science) with common(sense) world knowledge]
Textual QA Spectrum
Reading Comprehension (full context):
• Key challenge: language understanding
• Limited multi-hop reasoning (coref, substitution, …)
Open-Ended QA (no context):
• Multiple challenges: knowledge acquisition, retrieval, reasoning
• Simple methods work annoyingly well
OpenBookQA (partial context):
• Multihop by design
• IR- & PMI-hard
What’s in the Dataset?
• 5,957 multiple-choice questions (4,957 train + 500 dev + 500 test)
• An “open book” of 1,326 core science facts
• Mapping from each (Train + Dev) question to its originating “book” fact
• Auxiliary (noisy) data: additional common knowledge facts provided by question authors
  • Note: NOT meant to be a substitute for common knowledge!
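For concreteness, below is a minimal sketch of how one might load the multiple-choice questions. The JSONL field names ("question", "stem", "choices", "answerKey") and the file name are assumptions about the released format, not confirmed by this talk; check the download at http://data.allenai.org/OpenBookQA for the exact schema.

```python
import json

def load_questions(path):
    """Yield (question_text, choice_texts, gold_label) from an OpenBookQA-style JSONL file.

    Assumed record shape (verify against the actual release):
      {"question": {"stem": ..., "choices": [{"text": ..., "label": "A"}, ...]},
       "answerKey": "A"}
    """
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            stem = record["question"]["stem"]
            choices = [c["text"] for c in record["question"]["choices"]]
            yield stem, choices, record["answerKey"]

# Example usage (assuming the training split has been downloaded locally):
# for stem, choices, answer in load_questions("train.jsonl"):
#     print(stem, choices, answer)
```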
Analysis: Knowledge Types
Need simple but diverse pieces of common(sense) knowledge!
• (belt buckle, madeOf, metal)
• (tree, isa, living thing)
• (squirrels, eat, nuts)
• (telescope, defined as, …)
• (adding lemon to milk, causes, milk to break down)
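As a tiny illustration, the examples above can be written directly as (subject, relation, object) triples in data form. This is just the slide's examples restated; it is not how the released auxiliary facts are actually encoded.

```python
# The slide's common(sense) knowledge examples as (subject, relation, object)
# triples. Illustrative only; not the dataset's actual encoding.
COMMONSENSE_TRIPLES = [
    ("belt buckle", "madeOf", "metal"),
    ("tree", "isa", "living thing"),
    ("squirrels", "eat", "nuts"),
    ("telescope", "defined as", "..."),
    ("adding lemon to milk", "causes", "milk to break down"),
]
```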
Great Start, But…
3 limitations of OpenBookQA v1:
• Size (5k training): respectable but could be larger
• Multihop nature not strictly enforced
  • Crowdsourced 2nd facts often incomplete, over-complete, …
• Baseline hardness w.r.t. pre-neural models
  • Strong naïve baselines, arrival of OFT, BERT, …
Upcoming! v2: 25k+ questions, IR- and LM-hard
• Targeted UI / UX
• Early feedback
• Corpus-guided choice of 2nd fact
Talk Outline
• Open Book Question Answering
  • OpenBookQA: Multi-hop reasoning over text with Partial Context
• QA via Entailment
  • MulTeE: multi-layer aggregation of textual entailment states
• Designing Leaderboards
  • Stakeholders, Design choices
Multi-Sentence Entailment for QA
Motivation:
• Entailment recognized as a core NLP (sub)task with many applications
• Yet, convincing application to an end-task lacking (at least for modern entailment methods)
MulTeE:
• Multihop QA via a neural “wrapper” for single-sentence entailment
• State-of-the-art** results on OpenBookQA and MultiRC
Why is Combining Entailment Info Difficult?
• Even contradictory as a single sentence!
• Irrelevant sentences often have significant overlap
Aggregating Entailments for QA
Hypothesis: Q + A
Premise: knowledge sentences
Single-sentence entailment model (e.g., ESIM)
• Max Baseline. Issue: doesn’t actually “combine” information
• Concatenation Baseline. Issue: task is different from model’s training data
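A minimal Python sketch of these two baselines, assuming access to some single-sentence entailment scorer. The `entail_prob` below is a toy stand-in (word overlap), not ESIM or any real model's API.

```python
def entail_prob(premise: str, hypothesis: str) -> float:
    """Toy stand-in for a trained entailment model: fraction of hypothesis
    tokens covered by the premise. Replace with a real scorer in practice."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / max(len(h), 1)

def max_baseline(premises, hypothesis):
    # Score each knowledge sentence independently and keep the best score.
    # Issue: never actually combines information across sentences.
    return max(entail_prob(p, hypothesis) for p in premises)

def concat_baseline(premises, hypothesis):
    # Glue all knowledge sentences into one long premise and score once.
    # Issue: this input looks nothing like the single-sentence pairs the
    # entailment model was trained on.
    return entail_prob(" ".join(premises), hypothesis)

premises = ["metals conduct electricity", "a belt buckle is made of metal"]
hypothesis = "a belt buckle conducts electricity"
print(max_baseline(premises, hypothesis), concat_baseline(premises, hypothesis))
```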
Aggregating Entailments for QA: MulTeE
Two key ideas:
• Separate relevance detection from information aggregation
• Aggregate information at multiple levels of abstraction
MulTeE: Aggregation Module
Choose aggregation operation (“join”) to match representation, e.g.:
• Final layer => weighted sum
• Cross-attention layer => normalized relevance-weighted cross-attention matrix
• Embedding layer => relevance-scaled vector concatenation
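Below is a minimal numpy sketch of the relevance-weighted "join" idea for two of these levels (final layer and embedding layer). It illustrates the aggregation scheme only; it is not the authors' implementation, and the function names and shapes are invented for the example.

```python
import numpy as np

def join_final_layer(sent_states, relevance):
    """Final layer join: relevance-weighted sum of per-sentence entailment states.

    sent_states: (num_sents, hidden_dim) array, one state per premise sentence
    relevance:   (num_sents,) array of relevance scores in [0, 1]
    """
    weights = relevance / (relevance.sum() + 1e-8)
    return weights @ sent_states                      # (hidden_dim,)

def join_embedding_layer(sent_embeddings, relevance):
    """Embedding layer join: relevance-scaled concatenation of sentence embeddings.

    sent_embeddings: list of (sent_len_i, emb_dim) arrays
    relevance:       (num_sents,) array of relevance scores
    """
    scaled = [r * e for r, e in zip(relevance, sent_embeddings)]
    return np.concatenate(scaled, axis=0)             # (total_len, emb_dim)

# Toy usage:
states = np.random.randn(3, 8)           # 3 premise sentences, hidden size 8
rel = np.array([0.9, 0.1, 0.4])          # relevance of each sentence
pooled = join_final_layer(states, rel)   # single aggregated entailment state
```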
OpenBookQA Results
[Bar chart of OpenBookQA accuracy by method family: simple baselines, embedding models, OpenAI Transformer, entailment-based QA, BERT + more]
Talk Outline
• Open Book Question Answering
  • OpenBookQA: Multi-hop reasoning over text with Partial Context
• QA via Entailment
  • MulTeE: multi-layer aggregation of textual entailment states
• Designing Leaderboards
  • Stakeholders, Design choices
Designing Leaderboards
Increasingly popular for tracking progress
• What result? When? Who*? How*?
• Alternative existing mechanism: good old papers / blogs / code!
Should leaderboards operate differently from the good old system?
Why leaderboards? Depends on the stakeholder:
• Host [h]: challenge popularity
• Submitters [s]: motivation / prestige
• Community [c]: consolidated view of results / techniques
Designing Leaderboards: How?
• Hidden test set?
• Allow hidden submitter identity?
• Allow temporarily hidden technique?
• Allow permanently hidden technique?
Tradeoffs along multiple competing axes:
• [h] simplicity of maintenance
• [s] barrier to entry
• [s] recognition
• [s] confidentiality of method
• [s/c] timeliness
• [c] broad scientific progress:
  • avoid overfitting to test set
  • share / build upon successful techniques
  • know it “exists” out there
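Purely as an illustration of the design space above, the four questions can be written out as a policy object, making explicit that every leaderboard picks one point in this space. The names here are invented for this sketch and do not describe any actual leaderboard's configuration.

```python
from dataclasses import dataclass

@dataclass
class LeaderboardPolicy:
    """Illustrative policy object for the four design questions on this slide."""
    hidden_test_set: bool              # submissions scored against labels no one can see
    allow_hidden_submitter: bool       # anonymous entries permitted?
    allow_temp_hidden_technique: bool  # method may stay hidden while under review
    allow_perm_hidden_technique: bool  # method never has to be disclosed

# e.g., a leaderboard that keeps the test set hidden, allows anonymity during
# review, but eventually requires the technique to be described:
policy = LeaderboardPolicy(
    hidden_test_set=True,
    allow_hidden_submitter=True,
    allow_temp_hidden_technique=True,
    allow_perm_hidden_technique=False,
)
```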
Leaderboards: Common Use Cases
• Can share paper / arXiv preprint / blog / code, just want to be listed!
  • Only (?) concern: timeliness
• Commercial interest in academic datasets
  • Must say Who, but cannot (ever?) reveal How
• Under anonymous review (multiple months, reject-resubmit cycle, …)
  • Would like a timestamp + result to put in a paper
  • But cannot yet reveal Who or How
• “No one” has validated method or results (unless code submitted)
Designing Leaderboards: Tradeoffs
• Hidden test set?
• Allow hidden submitter identity?
• Allow temporarily hidden technique?
• Allow permanently hidden technique?
Multiple competing axes:
• [h] simplicity of maintenance
• [s] barrier to entry
• [s] recognition
• [s] confidentiality of method
• [s/c] timeliness
• [c] broad scientific progress:
  • avoid overfitting to test set
  • share / build upon successful techniques
  • know it “exists” out there
Summary
• OpenBookQA dataset
  • Multihop reasoning with partial context
  • v2 coming soon!
• MulTeE: an effective model aggregating sentence-level entailment for QA
  • State-of-the-art results among non-heavy-LM methods (OpenBookQA + MultiRC)
• Leaderboards increasingly valuable
  • Merits thought and discussion among stakeholders