
Multihop Reasoning : Datasets , Models, and Leaderboards


Presentation Transcript


  1. Multihop Reasoning: Datasets, Models, and Leaderboards. Ashish Sabharwal, Allen Institute for AI. Workshop on Reasoning for Complex Question Answering (RCQA), AAAI 2019

  2. Talk Outline • Open Book Question Answering • OpenBookQA: Multi-hop reasoning over text with Partial Context • QA via Entailment • MulTeE: multi-layer aggregation of textual entailment states • Designing Leaderboards • Stakeholders, Design choices

  3. Open Book Question Answering: A New Challenge for Multihop Reasoning with Partial Context • Data + Models: http://data.allenai.org/OpenBookQA • Leaderboard! https://leaderboard.allenai.org

  4. What’s an Open Book Exam? Bounded piece of core knowledge + Common(sense) world knowledge. GOAL: Probe deeper understanding, not memorization skills • Application of core principles to new situations

  5. Open Book QA for Machines • 1,326 core science facts / principles (e.g., metals conduct electricity) • Basis of questions in the dataset • Central to scientific explanations • Insufficient by themselves. Bounded piece of core knowledge + Common(sense) world knowledge. GOAL: Probe multihop reasoning capability, with partial context

  6. Example: Core fact (science) + Common(sense) world knowledge

  7. Textual QA Spectrum. Reading Comprehension: • Full context • Key challenge: language understanding • Limited multi-hop reasoning (coref, substitution, …). Open-Ended QA: • No context • Multiple challenges: knowledge acquisition, retrieval, reasoning • Simple methods work annoyingly well. In between (OpenBookQA): • Partial context • Multihop by design • IR- & PMI-hard

  8. What’s in the Dataset? • 5,957 multiple-choice questions (4,957 train + 500 dev + 500 test) • An “open book” of 1,326 core science facts • Mapping from each (train + dev) question to its originating “book” fact • Auxiliary (noisy) data: additional common knowledge facts provided by question authors (Note: NOT meant to be a substitute for common knowledge!)
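A minimal sketch of how a dataset release in this shape might be loaded, assuming a JSONL layout with an id, a question stem, labeled answer choices, an answerKey, and a fact1 field linking each question to its originating “book” fact; the field names and file format are assumptions for illustration, not the official schema.

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class OBQAQuestion:
    qid: str
    stem: str
    choices: List[str]   # four answer options
    answer_key: str      # "A".."D"
    book_fact: str       # originating core science fact (train/dev only)

def load_questions(path: str) -> List[OBQAQuestion]:
    """Load OpenBookQA-style questions from a JSONL file (assumed field names)."""
    questions = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            questions.append(OBQAQuestion(
                qid=rec["id"],
                stem=rec["question"]["stem"],
                choices=[c["text"] for c in rec["question"]["choices"]],
                answer_key=rec["answerKey"],
                book_fact=rec.get("fact1", ""),  # mapping to the "open book"
            ))
    return questions
```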

  9. Analysis: Reasoning Types

  10. Analysis: Knowledge Types. Need simple but diverse pieces of common(sense) knowledge! • (belt buckle, madeOf, metal) • (tree, isa, living thing) • (squirrels, eat, nuts) • (telescope, defined as, …) • (adding lemon to milk, causes, milk to break down)
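To make the slide’s examples concrete, here is a small illustrative sketch of representing such (subject, relation, object) facts and verbalizing them into sentences that a retrieval or entailment system could consume; the Fact type and templates are hypothetical, not part of the dataset release.

```python
from typing import NamedTuple

class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str

FACTS = [
    Fact("belt buckle", "madeOf", "metal"),
    Fact("tree", "isa", "living thing"),
    Fact("squirrels", "eat", "nuts"),
    Fact("adding lemon to milk", "causes", "milk to break down"),
]

# Illustrative templates for turning triples into retrievable text.
TEMPLATES = {
    "madeOf": "{s} is made of {o}.",
    "isa": "{s} is a {o}.",
    "eat": "{s} eat {o}.",
    "causes": "{s} causes {o}.",
}

def verbalize(fact: Fact) -> str:
    """Render a triple as a natural-language sentence using a simple template."""
    template = TEMPLATES.get(fact.relation, "{s} {r} {o}.")
    return template.format(s=fact.subject, r=fact.relation, o=fact.obj)

for fact in FACTS:
    print(verbalize(fact))
```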

  11. Great Start, But… Three limitations of OpenBookQA v1: • Size (5k training questions): respectable but could be larger • Multihop nature not strictly enforced: crowdsourced 2nd facts often incomplete, over-complete, … • Baseline hardness w.r.t. pre-neural models: strong naïve baselines, arrival of OFT, BERT, … Upcoming! v2: 25k+ questions • Targeted UI / UX • Early feedback • Corpus-guided choice of 2nd fact • IR- and LM-hard

  12. Talk Outline • Open Book Question Answering • OpenBookQA: Multi-hop reasoning over text with Partial Context • QA via Entailment • MulTeE: multi-layer aggregation of textual entailment states • Designing Leaderboards • Stakeholders, Design choices

  13. Multi-Sentence Entailment for QA. Motivation: Entailment is recognized as a core NLP (sub)task with many applications, yet a convincing application to an end task has been lacking (at least for modern entailment methods). MulTeE: Multihop QA via a neural “wrapper” for single-sentence entailment. State-of-the-art** results on OpenBookQA and MultiRC

  14. Why is Combining Entailment Info. Difficult? Even contradictory as a single sentence! Irrelevant sentences often have significant overlap

  15. Aggregating Entailments for QA. Hypothesis: Q + A. Premise: knowledge sentences. Single-sentence entailment model (e.g., ESIM). Max Baseline (Issue: doesn’t actually “combine” information). Concatenation Baseline (Issue: task is different from the model’s training data)
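The two baselines on this slide can be sketched as follows, treating the single-sentence entailment model (e.g., ESIM) as a black-box scoring function; the entailment_score interface and the answer_question helper are assumptions for illustration, not the exact setup used in the talk.

```python
from typing import Callable, List

# entailment_score(premise, hypothesis) -> probability that premise entails hypothesis.
# Stand-in for any single-sentence entailment model (e.g., ESIM); assumed interface.
EntailmentModel = Callable[[str, str], float]

def max_baseline(premises: List[str], hypothesis: str,
                 entailment_score: EntailmentModel) -> float:
    """Score each premise independently and keep the best one.
    Issue noted on the slide: information is never combined across sentences."""
    return max(entailment_score(p, hypothesis) for p in premises)

def concat_baseline(premises: List[str], hypothesis: str,
                    entailment_score: EntailmentModel) -> float:
    """Concatenate all premises into one long 'sentence'.
    Issue noted on the slide: this input looks nothing like the model's training data."""
    return entailment_score(" ".join(premises), hypothesis)

def answer_question(question: str, choices: List[str], premises: List[str],
                    entailment_score: EntailmentModel) -> str:
    """Pick the answer choice whose hypothesis (question + answer) is best entailed."""
    hypotheses = [f"{question} {c}" for c in choices]
    scores = [max_baseline(premises, h, entailment_score) for h in hypotheses]
    return choices[scores.index(max(scores))]
```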

  16. Aggregating Entailments for QA: MulTeE • Two key ideas: • Separate Relevance detection from information aggregation • Aggregate information at Multiple Levels of abstraction
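As a hedged sketch of the first idea, the snippet below shows a relevance module kept separate from aggregation: it scores each premise against the hypothesis and produces weights that the joins on the next slide can consume. It assumes premises and hypothesis are already encoded as fixed-size vectors; the module structure and names are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class RelevanceModule(nn.Module):
    """Scores how relevant each premise is to the hypothesis (question + answer),
    kept separate from information aggregation, per the first key idea."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, premise_vecs: torch.Tensor,
                hypothesis_vec: torch.Tensor) -> torch.Tensor:
        """premise_vecs: (num_premises, hidden); hypothesis_vec: (hidden,).
        Returns one relevance weight in [0, 1] per premise."""
        expanded = hypothesis_vec.unsqueeze(0).expand_as(premise_vecs)
        pair = torch.cat([premise_vecs, expanded], dim=-1)
        return torch.sigmoid(self.scorer(pair)).squeeze(-1)
```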

  17. MulTeE: Aggregation Module • Choose aggregation operation (“join”) to match representation, e.g.: • Final layer => weighted sum • Cross-attention layer => normalized relevance-wtd. cross-attn. matrix • Embedding layer => relevance-scaled vector concatenation
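Below is a minimal PyTorch-style sketch of what such relevance-weighted “joins” could look like at each level, under assumed tensor shapes (premises padded to a common length); these are illustrative reconstructions of the slide’s three bullets, not the authors’ exact code.

```python
import torch

def join_final_layer(premise_states: torch.Tensor,
                     relevance: torch.Tensor) -> torch.Tensor:
    """Final-layer join: relevance-weighted sum of per-premise representations.
    premise_states: (num_premises, hidden); relevance: (num_premises,)."""
    weights = relevance / relevance.sum().clamp_min(1e-8)
    return (weights.unsqueeze(-1) * premise_states).sum(dim=0)

def join_cross_attention(attn: torch.Tensor,
                         relevance: torch.Tensor) -> torch.Tensor:
    """Cross-attention join: attn is (num_premises, hyp_len, prem_len), the attention
    of each hypothesis token over each premise's tokens. Scale each premise's block
    by its relevance, then renormalize over all premise tokens."""
    scaled = relevance.view(-1, 1, 1) * attn                        # weight each premise
    flat = scaled.permute(1, 0, 2).reshape(attn.size(1), -1)        # (hyp_len, num_premises*prem_len)
    return flat / flat.sum(dim=-1, keepdim=True).clamp_min(1e-8)

def join_embeddings(premise_embeddings: torch.Tensor,
                    relevance: torch.Tensor) -> torch.Tensor:
    """Embedding-layer join: scale each premise's token embeddings by its relevance,
    then concatenate along the token axis.
    premise_embeddings: (num_premises, prem_len, emb)."""
    scaled = relevance.view(-1, 1, 1) * premise_embeddings
    return scaled.reshape(-1, premise_embeddings.size(-1))
```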

  18. OpenBookQA Results. [Chart of accuracy by method family: simple baselines, embedding models, OpenAI Transformer, entailment-based QA, BERT + more]

  19. Talk Outline • Open Book Question Answering • OpenBookQA: Multi-hop reasoning over text with Partial Context • QA via Entailment • MulTeE: multi-layer aggregation of textual entailment states • Designing Leaderboards • Stakeholders, Design choices

  20. Designing Leaderboards. Increasingly popular for tracking progress • What result? When? Who*? How*? • Alternative existing mechanism: good old papers / blogs / code! Should leaderboards operate differently from the good old system? Why leaderboards? Depends on the stakeholder: • Host [h]: challenge popularity • Submitters [s]: motivation / prestige • Community [c]: consolidated view of results / techniques

  21. Designing Leaderboards: How? Hidden test set? Allow hidden submitter identity? Allow temporarily hidden technique? Allow permanently hidden technique? Tradeoffs along multiple competing axes: • [h] simplicity of maintenance • [s] barrier to entry • [s] recognition • [s] confidentiality of method • [s/c] timeliness • [c] broad scientific progress: avoid overfitting to the test set; share / build upon successful techniques; know it “exists” out there

  22. Leaderboards: Common Use Cases • Can share paper / arXiv preprint / blog / code, just want to be listed! Only (?) concern: timeliness • Commercial interest in academic datasets: must say Who, but cannot (ever?) reveal How • Under anonymous review (multiple months, reject-resubmit cycle, …): would like a timestamp + result to put in a paper, but cannot yet reveal Who or How • “No one” has validated the method or results (unless code is submitted)

  23. Designing Leaderboards: Tradeoffs. Hidden test set? Allow hidden submitter identity? Allow temporarily hidden technique? Allow permanently hidden technique? Multiple competing axes: • [h] simplicity of maintenance • [s] barrier to entry • [s] recognition • [s] confidentiality of method • [s/c] timeliness • [c] broad scientific progress: avoid overfitting to the test set; share / build upon successful techniques; know it “exists” out there


  25. Summary • OpenBookQA dataset • Multihop reasoning with partial context • v2 coming soon! • MulTeE: an effective model aggregating sentence-level entailment for QA • State-of-the-art results among non-heavy-LM methods (OpenBookQA + MultiRC) • Leaderboards increasingly valuable • Merit thought and discussion among stakeholders

  26. EXTRA SLIDES
