QuASI: Question Answering using Statistics, Semantics, and Inference
Marti Hearst, Jerry Feldman, Chris Manning, Srini Narayanan
Univ. of California-Berkeley / ICSI / Stanford University
Outline • Project Overview • Three topics: • Assigning semantic relations via lexical hierarchies • From sentences to meanings via syntax • From text analysis to inference using conceptual schemas
Main Goals Support Question-Answering and NLP in general by: • Deepening our understanding of concepts that underlie all languages • Creating empirical approaches to identifying semantic relations from free text • Developing probabilistic inferencing algorithms
Two Main Thrusts • Text-based: • Use empirical corpus-based techniques to extract simple semantic relations • Combine these relations to perform simple inferences • “statistical semantic grammar” • Concept-based: • Determine language-universal conceptual principles • Determine how inferences are made among these
Relation Recognition • Abbreviation Definition Recognition • Semantic Relation Identification
UCB, Sept-Nov, 2002 • Abbreviation Definition Recognition • Developed and evaluated new algorithm • Better results than existing approaches • Simpler and faster as well • Semantic Relation Identification • Developed syntactic chunker • Analyzed sample relations • Began development of a new computational model • Incorporates syntax and semantic labels • Test example: identify “treatment for disease”
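The "treatment for disease" test case can be illustrated with a toy surface pattern. This is purely our own sketch for illustration (the pattern, function name, and examples are invented); the model under development combines syntax and semantic labels rather than raw regular expressions:

```python
import re

# Hypothetical surface pattern: "<treatment> for <disease>".
# A real system would use the syntactic chunker plus semantic labels.
PATTERN = re.compile(r"(\w[\w\s-]*?)\s+for\s+(\w[\w\s-]*)")

def treatment_relation(phrase):
    """Return a (treatment, disease) pair if the toy pattern matches."""
    m = PATTERN.search(phrase)
    return (m.group(1), m.group(2)) if m else None
```

Pattern-only extraction of this kind is brittle (any "X for Y" string matches), which is precisely why the slide's computational model incorporates syntactic and semantic information.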
Abbreviation Examples • “Heat-shock protein 40 (Hsp40) enables Hsp70 to play critical roles in a number of cellular processes, such as protein folding, assembly, degradation and translocation in vivo.” • “Glutathione S-transferase pull-down experiments showed the direct interaction of in vitro translated p110, p64, and p58 of the essential CBF3 kinetochore protein complex with Cbf1p, a basic region helix-loop-helix zipper protein (bHLHzip) that specifically binds to the CDEI region on the centromere DNA.” • “Hpa2 is a member of the Gcn5-related N-acetyltransferase (GNAT) superfamily, a family of enzymes with diverse substrates including histones, other proteins, arylalkylamines and aminoglycosides.”
Related Work • Pustejovsky et al. present a solution based on hand-built regular expressions and syntactic information, achieving 72% recall at 98% precision. • Chang et al. use linear regression on a pre-selected set of features, achieving 83% recall at 80%* precision, and 75% recall at 95% precision. • Park and Byrd present a rule-based algorithm for extracting abbreviation definitions in general text. • Yoshida et al. present the approach closest to ours, first matching characters on word and syllable boundaries. * Counting partial matches and abbreviations missing from the “gold standard,” their algorithm achieved 83% recall at 98% precision.
The Algorithm • Much simpler than other approaches. • Extracts abbreviation-definition candidates adjacent to parentheses. • Finds correct definitions by matching characters in the abbreviation to characters in the definition, starting from the right. • The first character in the abbreviation must match a character at the beginning of a word in the definition. • To increase precision, a few simple heuristics are applied to eliminate incorrect pairs. • Example: Heat shock transcription factor (HSF). The algorithm finds the correct definition, but not the correct alignment: the s of HSF aligns with the s inside transcription rather than the one beginning shock.
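A minimal sketch of the right-to-left matching step described above, assuming the published description; function and variable names are our own reconstruction, not the original implementation, and the precision heuristics are omitted:

```python
from typing import Optional

def find_definition(short_form: str, candidate: str) -> Optional[str]:
    """Scan both strings from the right, matching each character of the
    abbreviation somewhere in the candidate definition; the abbreviation's
    first character must match at the beginning of a word."""
    s_idx = len(short_form) - 1
    c_idx = len(candidate) - 1
    while s_idx >= 0:
        s_char = short_form[s_idx].lower()
        if not s_char.isalnum():          # skip punctuation in the short form
            s_idx -= 1
            continue
        # Move left through the candidate until this character matches
        # (and, for the first abbreviation character, starts a word).
        while c_idx >= 0 and (
            candidate[c_idx].lower() != s_char
            or (s_idx == 0 and c_idx > 0 and candidate[c_idx - 1].isalnum())
        ):
            c_idx -= 1
        if c_idx < 0:
            return None                   # no alignment exists
        s_idx -= 1
        c_idx -= 1
    return candidate[c_idx + 1:]
```

On the HSF example above, this sketch recovers the full definition “Heat shock transcription factor” even though the internal character alignment is not the intuitively correct one.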
Results • On the “gold standard” the algorithm achieved 83% recall at 96% precision.* • On a larger test collection the results were 82% recall at 95% precision. • These results show that a very simple algorithm produces results comparable to those of the existing, more complex algorithms. * Counting partial matches and abbreviations missing from the “gold standard,” our algorithm achieved 83% recall at 99% precision.
From sentences to meanings via syntax • Factored A* Parsing • Relational approaches to semantic relations • Learning aspectual distinctions
Factored A* Parsing • Goal: develop a lexicalized parser that is fast, accurate, and exact [finds the model’s best parse] • Technology exists to get any two, but not all three • Approximate parsing – fast but inexact • Beam or “best-first” parsing [Charniak, Collins, etc.] • Factored: represent tree and dependencies separately • Simple, modular, extensible design • Permits fast, high-accuracy, exact inference • A* estimates combined from product-of-experts model • Available from: http://nlp.stanford.edu/ [Java, src]
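The key idea — per-factor estimates summed into one admissible A* heuristic (a product of experts becomes a sum in negative log space) — can be sketched on a toy graph search. The graph, costs, and heuristic tables below are invented for illustration and are far simpler than the parser's actual search space:

```python
import heapq

# Toy search space: each edge carries two factored costs
# (think: negative log scores from a tree model and a dependency model).
EDGES = {
    "A": [("B", 1, 2), ("C", 2, 1)],
    "B": [("D", 2, 2)],
    "C": [("D", 1, 3)],
    "D": [("E", 1, 1)],
    "E": [],
}
# Per-factor lower bounds on remaining cost to the goal, computed
# independently for each factor; their sum is still admissible, because
# each factor's optimum underestimates that factor's cost on any path.
H_TREE = {"A": 4, "B": 3, "C": 2, "D": 1, "E": 0}
H_DEP = {"A": 5, "B": 3, "C": 4, "D": 1, "E": 0}

def astar(start="A", goal="E"):
    """A* with priority = cost so far + summed factored estimates."""
    frontier = [(H_TREE[start] + H_DEP[start], 0, start)]
    best_g = {start: 0}
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if node == goal:
            return g          # first goal pop is optimal: heuristic is admissible
        for nxt, tree_cost, dep_cost in EDGES[node]:
            g2 = g + tree_cost + dep_cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + H_TREE[nxt] + H_DEP[nxt], g2, nxt))
    return None
```

Because the summed heuristic never overestimates the combined cost, the search remains exact while exploring far fewer states than uninformed search — the same property the factored parser exploits.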
Factored A* Parsing: Work Done [chart: labeled bracketing accuracy (F1) vs. dependency accuracy for the T, L, and D model variants]
Learning Semantic Relations • FrameNet as starting point and training data • Constraint Resolution for Entire Relations • Logical Relations • Probabilistic Models • Combinations of the Two • Bootstrap to new domains • Building blocks for Q/A relevant tasks: • Semantic Roles in Text • Inference • Improved Syntactic Parsing
Learning Aspect: The Perfect • English perfect has experiential, relevant, and durative readings: have been to Bali vs. have just eaten lunch • Disambiguation is necessary for text understanding: John has traveled to Malta [now, or in the past?] • Siegel (2000) looked at inherent but not contextual aspect • Current status: annotation underway for training a statistical classifier [timeline diagram contrasting “John has lived in Miami for ten years now.” with “John has lived in Miami before.”, with Event and Ref-time markers]
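A crude cue-word heuristic makes the contextual disambiguation problem concrete. This is our own toy baseline, not Siegel's method or the planned statistical classifier, and the cue lists are invented:

```python
# Durative readings are often signaled by duration adverbials,
# experiential readings by adverbs like "before"/"ever" (toy lists).
DURATIVE_CUES = ("for ", "since ")
EXPERIENTIAL_CUES = ("before", "ever", "never", "once")

def perfect_reading(sentence: str) -> str:
    """Guess the reading of an English perfect from surface cues alone."""
    s = sentence.lower()
    if any(cue in s for cue in DURATIVE_CUES):
        return "durative"
    if any(cue in s for cue in EXPERIENTIAL_CUES):
        return "experiential"
    return "ambiguous"
```

The Malta example falls through to "ambiguous" — exactly the case where surface cues are absent and an annotated corpus plus a trained classifier is needed.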
Concept-based Analysis • From text analysis to inference using conceptual schemas • Probabilistic Relational Models • Open-domain conceptual relations
Inference and Conceptual Schemas • Hypothesis: linguistic input is converted into a mental simulation based on bodily-grounded structures. • Components: • Semantic schemas: image schemas and executing schemas are abstractions over neurally grounded perceptual and motor representations • Linguistic units: lexical and phrasal construction representations invoke schemas, in part through metaphor • Inference: links these structures and provides parameters for a simulation engine
Conceptual Schemas • Much is known about conceptual schemas, particularly image schemas • However, this understanding has not yet been formalized • We will develop such a formalism • They have also not been checked extensively against other languages • We will examine Chinese, Russian, and other languages in addition to English
Schema Formalism
SCHEMA <name>
  SUBCASE OF <schema>
  EVOKES <schema> AS <local name>
  ROLES
    <self role name>: <role restriction>
    <self role name> <-> <role name>
  CONSTRAINTS
    <role name> <- <value>
    <role name> <-> <role name>
    <setting name> :: <role name> <-> <role name>
    <setting name> :: <predicate> | <predicate>
A Simple Example
SCHEMA hypotenuse
  SUBCASE OF line-segment
  EVOKES right-triangle AS rt
  ROLES (inherited from line-segment)
  CONSTRAINTS
    SELF <-> rt.long-side
Source-Path-Goal
SCHEMA spg
  ROLES
    source: Place
    path: Directed Curve
    goal: Place
    trajector: Entity
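One way to render such a schema in code is as typed records, with role restrictions becoming types (and SUBCASE OF mapping naturally to inheritance). This is a hypothetical encoding of the spg schema for illustration, not the project's formalism:

```python
from dataclasses import dataclass

# Role restrictions rendered as plain types (our own toy encoding).
@dataclass
class Place:
    name: str

@dataclass
class DirectedCurve:
    length: float

@dataclass
class Entity:
    name: str

# The spg schema: four roles, each restricted to one of the types above.
@dataclass
class SourcePathGoal:
    source: Place
    path: DirectedCurve
    goal: Place
    trajector: Entity
```

Instantiating the record corresponds to binding the schema's roles for a particular utterance, e.g. a trajector "John" moving from home to the office.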
Extending Inferential Capabilities • Given the formalization of the conceptual schemas • How to use them for inferencing? • Earlier pilot systems • Used metaphor and Bayesian belief networks • Successfully construed certain inferences • But don’t scale • New approach • Probabilistic relational models • Support an open ontology
A Common Representation • Representation should support • Uncertainty, probability • Conflicts, contradictions • Current plan • Probabilistic Relational Models (Koller et al.) • DAML + OIL
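At its simplest, the probabilistic side reduces to Bayesian network inference, which PRMs lift from fixed variables to relational structures. A two-node toy network, with invented numbers and our own naming, shows the kind of uncertainty the representation must support:

```python
# Toy two-node Bayesian network: is a conceptual schema active,
# given an observed linguistic cue? All probabilities are invented.
P_SCHEMA = {True: 0.3, False: 0.7}            # prior P(schema active)
P_CUE = {True: {True: 0.8, False: 0.2},       # P(cue | schema active)
         False: {True: 0.1, False: 0.9}}      # P(cue | schema inactive)

def posterior_schema(cue_observed=True):
    """P(schema active | cue) by enumeration (Bayes' rule)."""
    joint = {s: P_SCHEMA[s] * P_CUE[s][cue_observed] for s in (True, False)}
    return joint[True] / sum(joint.values())
```

A PRM generalizes this by attaching such conditional distributions to classes of objects and their relations, so the same dependency model applies across every instance in an ontology rather than to one fixed pair of variables.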
Status of PRM for AQUAINT • Fall 2002 • Designed the basic PRM code base / infrastructure • Packages for BNs, OOBNs • Designed PRM inference algorithm • Spring-Summer 2003 • Implement the PRM inference algorithm • Design Dynamic Probabilistic Relational Models (DPRM) • Implement DPRM to replace the pilot system’s DBN • Test DPRM for QA • Related work • Probabilistic OWL (PrOWL) • Probabilistic FrameNet
An Open Ontology for Conceptual Relations • Build a formal markup language for conceptual schemas • We propose to use DAML+OIL/OWL as the base • Advantages of the approach • Common framework for extension and reuse • Closer ties to other efforts within AQUAINT as well as the larger Semantic Web research community • Some issues • Expressiveness of DAML+OIL • Representing probabilistic information • Extension to MetaNet to capture abstract concepts
Current Status • Summer/Fall 2002 • FrameNet-1 is available in DAML+OIL • http://www.icsi.berkeley.edu/~framenet • Image Schemas have been formalized and DAML+OIL representation designed • Initial set of Metaphors and an SQL Metaphor database is in place. • Spring 2003 • Populate Metaphor Database • Populate Image Schema Database • Summer 2003 • Test Inferencing with Image Schemas for QA.
Putting it all Together • We have proposed two different types of semantics • Universal conceptual schemas • Semantic relations • In Phase I they will remain separate • However, we are exploring using PRMs as a common representational format • In later Phases they will be combined