Multihop Reasoning: Datasets, Models, and Leaderboards
Ashish Sabharwal, Allen Institute for AI
Workshop on Reasoning for Complex Question Answering (RCQA), AAAI 2019
Talk Outline
• Open Book Question Answering
  • OpenBookQA: Multi-hop reasoning over text with Partial Context
• QA via Entailment
  • MulTeE: multi-layer aggregation of textual entailment states
• Designing Leaderboards
  • Stakeholders, Design choices
Open Book Question Answering: A New Challenge for Multihop Reasoning with Partial Context
Data + Models
Leaderboard!
http://data.allenai.org/OpenBookQA
https://leaderboard.allenai.org
What’s an Open Book Exam?
Bounded piece of core knowledge + Common(sense) world knowledge
GOAL: Probe deeper understanding, not memorization skills
• Application of core principles to new situations
Open Book QA for Machines
Bounded piece of core knowledge + Common(sense) world knowledge
• 1,326 core science facts / principles, e.g., metals conduct electricity
  • Basis of questions in the dataset
  • Central to scientific explanations
  • Insufficient by themselves
GOAL: Probe multihop reasoning capability, with partial context
Example
[Figure: a sample question answered by combining a core fact (science) with common(sense) world knowledge]
Textual QA Spectrum
Reading Comprehension (full context):
• Key challenge: language understanding
• Limited multi-hop reasoning (coref, substitution, …)
Open-Ended QA (no context):
• Multiple challenges: knowledge acquisition, retrieval, reasoning
• Simple methods work annoyingly well
OpenBookQA (partial context):
• Multihop by design
• IR- & PMI-hard
What’s in the Dataset?
• 5,957 multiple-choice questions (4,957 train + 500 dev + 500 test)
• An “open book” of 1,326 core science facts
• Mapping from each (Train + Dev) question to its originating “book” fact
• Auxiliary (noisy) data: additional common knowledge facts provided by question authors
  • Note: NOT meant to be a substitute for common knowledge!
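For concreteness, below is a minimal sketch of how one might load the multiple-choice questions. The JSONL field names ("question", "stem", "choices", "answerKey") and the file name are assumptions about the released format, not confirmed by this talk; check the download at http://data.allenai.org/OpenBookQA for the exact schema.

```python
import json

def load_questions(path):
    """Yield (question_text, choice_texts, gold_label) from an OpenBookQA-style JSONL file.

    Assumed record shape (verify against the actual release):
      {"question": {"stem": ..., "choices": [{"text": ..., "label": "A"}, ...]},
       "answerKey": "A"}
    """
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            stem = record["question"]["stem"]
            choices = [c["text"] for c in record["question"]["choices"]]
            yield stem, choices, record["answerKey"]

# Example usage (assuming the training split has been downloaded locally):
# for stem, choices, answer in load_questions("train.jsonl"):
#     print(stem, choices, answer)
```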
Analysis: Knowledge Types
Need simple but diverse pieces of common(sense) knowledge!
• (belt buckle, madeOf, metal)
• (tree, isa, living thing)
• (squirrels, eat, nuts)
• (telescope, defined as, …)
• (adding lemon to milk, causes, milk to break down)
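As a tiny illustration, the examples above can be written directly as (subject, relation, object) triples in data form. This is just the slide's examples restated; it is not how the released auxiliary facts are actually encoded.

```python
# The slide's common(sense) knowledge examples as (subject, relation, object)
# triples. Illustrative only; not the dataset's actual encoding.
COMMONSENSE_TRIPLES = [
    ("belt buckle", "madeOf", "metal"),
    ("tree", "isa", "living thing"),
    ("squirrels", "eat", "nuts"),
    ("telescope", "defined as", "..."),
    ("adding lemon to milk", "causes", "milk to break down"),
]
```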
Great Start, But…
3 limitations of OpenBookQA v1:
• Size (5k training): respectable but could be larger
• Multihop nature not strictly enforced
  • Crowdsourced 2nd facts often incomplete, over-complete, …
• Baseline hardness w.r.t. pre-neural models
  • Strong naïve baselines, arrival of OFT, BERT, …
Upcoming! v2: 25k+ questions, IR- and LM-hard
• Targeted UI / UX
• Early feedback
• Corpus-guided choice of 2nd fact
Talk Outline
• Open Book Question Answering
  • OpenBookQA: Multi-hop reasoning over text with Partial Context
• QA via Entailment
  • MulTeE: multi-layer aggregation of textual entailment states
• Designing Leaderboards
  • Stakeholders, Design choices
Multi-Sentence Entailment for QA
Motivation:
• Entailment recognized as a core NLP (sub)task with many applications
• Yet, convincing application to an end-task lacking (at least for modern entailment methods)
MulTeE:
• Multihop QA via a neural “wrapper” for single-sentence entailment
• State-of-the-art** results on OpenBookQA and MultiRC
Why is Combining Entailment Info Difficult?
• Even contradictory as a single sentence!
• Irrelevant sentences often have significant overlap
Aggregating Entailments for QA
Hypothesis: Q + A
Premise: knowledge sentences
Single-sentence entailment model (e.g., ESIM)
• Max Baseline. Issue: doesn’t actually “combine” information
• Concatenation Baseline. Issue: task is different from model’s training data
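A minimal Python sketch of these two baselines, assuming access to some single-sentence entailment scorer. The `entail_prob` below is a toy stand-in (word overlap), not ESIM or any real model's API.

```python
def entail_prob(premise: str, hypothesis: str) -> float:
    """Toy stand-in for a trained entailment model: fraction of hypothesis
    tokens covered by the premise. Replace with a real scorer in practice."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / max(len(h), 1)

def max_baseline(premises, hypothesis):
    # Score each knowledge sentence independently and keep the best score.
    # Issue: never actually combines information across sentences.
    return max(entail_prob(p, hypothesis) for p in premises)

def concat_baseline(premises, hypothesis):
    # Glue all knowledge sentences into one long premise and score once.
    # Issue: this input looks nothing like the single-sentence pairs the
    # entailment model was trained on.
    return entail_prob(" ".join(premises), hypothesis)

premises = ["metals conduct electricity", "a belt buckle is made of metal"]
hypothesis = "a belt buckle conducts electricity"
print(max_baseline(premises, hypothesis), concat_baseline(premises, hypothesis))
```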
Aggregating Entailments for QA: MulTeE
Two key ideas:
• Separate relevance detection from information aggregation
• Aggregate information at multiple levels of abstraction
MulTeE: Aggregation Module
Choose aggregation operation (“join”) to match representation, e.g.:
• Final layer => weighted sum
• Cross-attention layer => normalized relevance-weighted cross-attention matrix
• Embedding layer => relevance-scaled vector concatenation
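Below is a minimal numpy sketch of the relevance-weighted "join" idea for two of these levels (final layer and embedding layer). It illustrates the aggregation scheme only; it is not the authors' implementation, and the function names and shapes are invented for the example.

```python
import numpy as np

def join_final_layer(sent_states, relevance):
    """Final layer join: relevance-weighted sum of per-sentence entailment states.

    sent_states: (num_sents, hidden_dim) array, one state per premise sentence
    relevance:   (num_sents,) array of relevance scores in [0, 1]
    """
    weights = relevance / (relevance.sum() + 1e-8)
    return weights @ sent_states                      # (hidden_dim,)

def join_embedding_layer(sent_embeddings, relevance):
    """Embedding layer join: relevance-scaled concatenation of sentence embeddings.

    sent_embeddings: list of (sent_len_i, emb_dim) arrays
    relevance:       (num_sents,) array of relevance scores
    """
    scaled = [r * e for r, e in zip(relevance, sent_embeddings)]
    return np.concatenate(scaled, axis=0)             # (total_len, emb_dim)

# Toy usage:
states = np.random.randn(3, 8)           # 3 premise sentences, hidden size 8
rel = np.array([0.9, 0.1, 0.4])          # relevance of each sentence
pooled = join_final_layer(states, rel)   # single aggregated entailment state
```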
OpenBookQA Results
[Bar chart of OpenBookQA accuracy by method family: simple baselines, embedding models, OpenAI Transformer, entailment-based QA, BERT + more]
Talk Outline
• Open Book Question Answering
  • OpenBookQA: Multi-hop reasoning over text with Partial Context
• QA via Entailment
  • MulTeE: multi-layer aggregation of textual entailment states
• Designing Leaderboards
  • Stakeholders, Design choices
Designing Leaderboards
Increasingly popular for tracking progress
• What result? When? Who*? How*?
• Alternative existing mechanism: good old papers / blogs / code!
Should leaderboards operate differently from the good old system?
Why leaderboards? Depends on the stakeholder:
• Host [h]: challenge popularity
• Submitters [s]: motivation / prestige
• Community [c]: consolidated view of results / techniques
Designing Leaderboards: How?
• Hidden test set?
• Allow hidden submitter identity?
• Allow temporarily hidden technique?
• Allow permanently hidden technique?
Tradeoffs along multiple competing axes:
• [h] simplicity of maintenance
• [s] barrier to entry
• [s] recognition
• [s] confidentiality of method
• [s/c] timeliness
• [c] broad scientific progress:
  • avoid overfitting to test set
  • share / build upon successful techniques
  • know it “exists” out there
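Purely as an illustration of the design space above, the four questions can be written out as a policy object, making explicit that every leaderboard picks one point in this space. The names here are invented for this sketch and do not describe any actual leaderboard's configuration.

```python
from dataclasses import dataclass

@dataclass
class LeaderboardPolicy:
    """Illustrative policy object for the four design questions on this slide."""
    hidden_test_set: bool              # submissions scored against labels no one can see
    allow_hidden_submitter: bool       # anonymous entries permitted?
    allow_temp_hidden_technique: bool  # method may stay hidden while under review
    allow_perm_hidden_technique: bool  # method never has to be disclosed

# e.g., a leaderboard that keeps the test set hidden, allows anonymity during
# review, but eventually requires the technique to be described:
policy = LeaderboardPolicy(
    hidden_test_set=True,
    allow_hidden_submitter=True,
    allow_temp_hidden_technique=True,
    allow_perm_hidden_technique=False,
)
```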
Leaderboards: Common Use Cases
• Can share paper / arXiv preprint / blog / code, just want to be listed!
  • Only (?) concern: timeliness
• Commercial interest in academic datasets
  • Must say Who, but cannot (ever?) reveal How
• Under anonymous review (multiple months, reject-resubmit cycle, …)
  • Would like a timestamp + result to put in a paper
  • But cannot yet reveal Who or How
• “No one” has validated method or results (unless code submitted)
Designing Leaderboards: Tradeoffs
• Hidden test set?
• Allow hidden submitter identity?
• Allow temporarily hidden technique?
• Allow permanently hidden technique?
Multiple competing axes:
• [h] simplicity of maintenance
• [s] barrier to entry
• [s] recognition
• [s] confidentiality of method
• [s/c] timeliness
• [c] broad scientific progress:
  • avoid overfitting to test set
  • share / build upon successful techniques
  • know it “exists” out there
Summary
• OpenBookQA dataset
  • Multihop reasoning with partial context
  • v2 coming soon!
• MulTeE: an effective model aggregating sentence-level entailment for QA
  • State-of-the-art results among non-heavy-LM methods (OpenBookQA + MultiRC)
• Leaderboards increasingly valuable
  • Merits thought and discussion among stakeholders