Natural Language Questions for the Web of Data
Mohamed Yahya, Klaus Berberich, Gerhard Weikum (Max Planck Institute for Informatics, Germany)
Shady Elbassuoni (Qatar Computing Research Institute)
Maya Ramanath (Dept. of CSE, IIT-Delhi, India)
Volker Tresp (Siemens AG, Corporate Technology, Munich, Germany)
EMNLP 2012
Introduction
• Natural language question q_NL: "Which female actor played in Casablanca and is married to a writer who was born in Rome?"
• Translation to a formal-language query q_FL (SPARQL 1.0):
  ?x hasGender female
  ?x isa actor
  ?x actedIn Casablanca_(film)
  ?x marriedTo ?w
  ?w isa writer
  ?w bornIn Rome
• But SPARQL is difficult for users
• Goal: automatically translate q_NL to q_FL
Knowledge Base - YAGO2
• YAGO2 is a huge semantic knowledge base derived from Wikipedia, WordNet, and GeoNames
• It contains relations, classes, and entities
• Special types of entities: strings, numbers, and dates
Framework
• DEANNA (DEep Answers for maNy Naturally Asked questions)
• First step: construct a disambiguation graph
Phrase Detection - Concept Detection
• Each detected phrase is a pair <phrase, {concept | relation}>, i.e., the phrase is tagged as denoting either a concept or a relation
• Ex: "Which female actor played in Casablanca and is married to a writer who was born in Rome?"
Phrase Detection - Relation Detection
• Relies on a relation detector based on ReVerb (Fader et al., 2011), extended with additional POS tag patterns
• Also uses patterns in dependency parses:
  • a verb and its arguments
  • adjectives and their arguments
  • prepositionally modified tokens and objects of prepositions
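A rough sketch of this kind of POS-pattern matching (not the authors' actual ReVerb-based detector): grab each verb plus an optional trailing particle or preposition. spaCy and its en_core_web_sm model are assumptions used only for illustration.

    import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

    nlp = spacy.load("en_core_web_sm")

    def relation_phrases(text):
        """Greedy verb(+particle/preposition) spans, e.g. 'played in', 'born in'."""
        doc = nlp(text)
        spans, i = [], 0
        while i < len(doc):
            if doc[i].tag_.startswith("VB"):              # any verb form
                j = i + 1
                if j < len(doc) and doc[j].tag_ in ("RP", "IN", "TO"):
                    j += 1                                 # absorb 'in', 'to', ...
                spans.append(doc[i:j].text)
                i = j
            else:
                i += 1
        return spans

    print(relation_phrases("Which female actor played in Casablanca and "
                           "is married to a writer who was born in Rome?"))
    # e.g. ['played in', 'is', 'married to', 'was', 'born in']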
Phrase Mapping - Concept Mapping
• Ex: "Which female actor played in Casablanca and is married to a writer who was born in Rome?"
Phrase Mapping - Relation Mapping
• Relation mapping uses a corpus of textual patterns
• Ex: "Which female actor played in Casablanca and is married to a writer who was born in Rome?"
After Phrase Mapping
• We now have the semantic items (s-nodes) that each phrase node (p-node) could map to
Dependency Parsing
• Identifies triples of tokens (triploids) <t_rel, t_arg1, t_arg2>, where t_rel, t_arg1, t_arg2 ∈ q_NL
• Ex: "which writer born in Rome?"
  det(writer-2, which-1)
  nsubj(born-3, writer-2)
  root(ROOT-0, born-3)
  prep_in(born-3, Rome-5)
  → <born, writer, Rome>
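The paper uses Stanford dependencies (hence labels like prep_in). As a hedged sketch of the same idea with spaCy's parser (an assumption, not the authors' setup), collect a verb's subject together with the objects of its prepositions:

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_triploids(question):
        """Candidate <t_rel, t_arg1, t_arg2> triples built around each verb."""
        doc = nlp(question)
        triploids = []
        for tok in doc:
            if tok.pos_ != "VERB":
                continue
            subjects = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
            # objects of prepositions hanging off the verb: born -> in -> Rome
            pobjs = [g for c in tok.children if c.dep_ == "prep"
                       for g in c.children if g.dep_ == "pobj"]
            for s in subjects:
                for o in pobjs:
                    triploids.append((tok.text, s.text, o.text))
        return triploids

    print(extract_triploids("which writer was born in Rome?"))
    # e.g. [('born', 'writer', 'Rome')]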
Q-Unit Generation
• Ex: triploid <born, writer, Rome>
• Detected phrase pairs from phrase detection: <born, relation>, <was born, relation>, <writer, concept>, <Rome, concept>
• Resulting q-unit: <{born, was born}, {writer}, {Rome}>
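q-unit construction is then mostly bookkeeping: for each token of a triploid, collect every detected phrase that covers it. A minimal sketch under that reading (the token-to-phrase index is a hypothetical input):

    def build_q_units(triploids, phrases_covering):
        """phrases_covering: token -> set of detected phrases containing it."""
        q_units = []
        for t_rel, t_arg1, t_arg2 in triploids:
            q_units.append((phrases_covering.get(t_rel, {t_rel}),
                            phrases_covering.get(t_arg1, {t_arg1}),
                            phrases_covering.get(t_arg2, {t_arg2})))
        return q_units

    covering = {"born": {"born", "was born"}, "writer": {"writer"}, "Rome": {"Rome"}}
    print(build_q_units([("born", "writer", "Rome")], covering))
    # e.g. [({'born', 'was born'}, {'writer'}, {'Rome'})]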
After Parsing & Q-Unit Generation
• We now have the q-nodes and all possible triploids
Framework
• DEANNA (DEep Answers for maNy Naturally Asked questions)
• Next step: choose the resulting subgraph
Joint Disambiguation
• Goal:
  • assign each phrase to at most one semantic item
  • resolve phrase boundary ambiguity (enforce that only non-overlapping phrases are mapped)
• A disambiguation graph DG = (V, E) represents this problem
Disambiguation Graph - Vertices
• V = Vs ∪ Vp ∪ Vq (the s-nodes, p-nodes, and q-nodes)
Disambiguation Graph - Edges
• E = Esim ∪ Ecoh ∪ Eq
• Because s-nodes divide into relations, entities, and classes, Esim and Ecoh are computed differently for each of the three kinds
Similarity Weights - Esim
• For entities: a normalized prior score based on how often a phrase refers to a certain entity in Wikipedia
• E.g., given a Wikipedia hyperlink whose anchor text is "Rome", what is the probability that it refers to Rome versus Sydne_Rome?
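A sketch of such a normalized prior, assuming precomputed anchor-text statistics from a Wikipedia dump (the counts below are invented for illustration):

    from collections import Counter

    # anchor text -> how often it links to each entity (hypothetical numbers)
    anchor_counts = {
        "Rome": Counter({"Rome": 9800, "Sydne_Rome": 15, "Rome,_Georgia": 185}),
    }

    def entity_prior(phrase, entity):
        """P(entity | anchor text = phrase), a normalized link-frequency prior."""
        counts = anchor_counts.get(phrase, Counter())
        total = sum(counts.values())
        return counts[entity] / total if total else 0.0

    print(entity_prior("Rome", "Rome"))        # 0.98
    print(entity_prior("Rome", "Sydne_Rome"))  # 0.0015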
Similarity Weights - Esim
• For classes: a normalized prior that reflects the number of members in the class
Similarity Weights - Esim
• For relations: the maximum n-gram similarity between the phrase and any of the relation's surface forms
• E.g., break "was born" into character n-grams {w, a, ..., wa, as, ..., was, asb, ..., wasborn}, do the same for "bornOnDate" and "bornIn", then use the Jaccard coefficient to measure their similarity
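A minimal sketch of this n-gram Jaccard similarity (whitespace/casing handling and the maximum n-gram length are assumptions):

    def char_ngrams(s, n_max=4):
        """All character n-grams of length 1..n_max, spaces removed."""
        s = s.replace(" ", "").lower()
        return {s[i:i + n] for n in range(1, n_max + 1)
                           for i in range(len(s) - n + 1)}

    def ngram_jaccard(a, b):
        ga, gb = char_ngrams(a), char_ngrams(b)
        return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

    # relation weight: maximum similarity over the relation's surface forms
    for rel in ("bornOnDate", "bornIn"):
        print(rel, round(ngram_jaccard("was born", rel), 3))
    # 'bornIn' scores higher than 'bornOnDate' for the phrase 'was born'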
Semantic Coherence - Ecoh
• Captures the semantic coherence between two semantic items as the Jaccard coefficient of their sets of inlinks from Wikipedia pages
• InLink is defined separately for entities (InLink(e)), classes (InLink(c)), and relations (InLink(r)) on the following slides; a combined sketch appears after them
Semantic Coherence - InLink
• InLink(e): the set of YAGO2 entities whose corresponding Wikipedia pages link to entity e
• Ex: InLink(Brad_Pitt) = {Al_Pacino, Seal_(musician), Jennifer_Aniston, Tom_Cruise, ...}
Semantic Coherence - InLink
• InLink(c) = ∪_{e ∈ c} InLink(e)
• Ex: InLink(wikicategory_21st-century_actors) = InLink(Joshua_Jackson) ∪ InLink(Dennis_Hopper) ∪ InLink(Drew_Barrymore) ∪ ...
Semantic Coherence - InLink
• InLink(r) = ∪_{(e1, e2) ∈ r} (InLink(e1) ∩ InLink(e2))
• Only relations that map entities to entities are considered
• Ex: InLink(actedIn) = (InLink(Brad_Pitt) ∩ InLink(The_Devil's_Own)) ∪ (InLink(Brad_Pitt) ∩ InLink(Fight_Club)) ∪ ...
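Putting the three InLink variants together, a minimal sketch of the coherence weight (the inlink sets would come from a precomputed Wikipedia link graph; inlinks_of is a hypothetical lookup):

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def inlinks_class(members, inlinks_of):
        """InLink(c) = union of InLink(e) over all entities e in the class."""
        out = set()
        for e in members:
            out |= inlinks_of(e)
        return out

    def inlinks_relation(pairs, inlinks_of):
        """InLink(r) = union over (e1, e2) in r of InLink(e1) ∩ InLink(e2)."""
        out = set()
        for e1, e2 in pairs:
            out |= inlinks_of(e1) & inlinks_of(e2)
        return out

    # Ecoh weight between two items is jaccard(InLink(a), InLink(b)), e.g.:
    pages = {"Brad_Pitt": {"Al_Pacino", "Jennifer_Aniston"},
             "Fight_Club": {"Al_Pacino", "Edward_Norton"}}
    print(jaccard(pages["Brad_Pitt"], pages["Fight_Club"]))  # 1/3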
Joint Disambiguation
• Disambiguation graph processing: the result of disambiguation is a subgraph of the disambiguation graph
• An ILP (integer linear program) is employed to select this subgraph
Joint Disambiguation - ILP
• Definitions (the decision-variable definitions appear as a figure in the original slides)
Joint Disambiguation - ILP
• Given the above definitions, the objective function combines the similarity, coherence, and q-edge terms (shown as a figure in the original slides); a toy sketch follows the constraint slides below
• The hyperparameters (α, β, γ) in the ILP objective are tuned using QALD-1 questions from the test set
Constraints
• A p-node can be assigned to at most one s-node
• If a p-s similarity edge is chosen, then the respective p-node must be chosen
Constraints (continued)
• If s-nodes k and l are chosen, then there must be p-nodes mapping to each of k and l
• No token can appear as part of two phrases
Constraints (continued)
• At most one q-edge is selected per dimension
• If a q-edge is chosen, then its p-node must be selected
Constraints (continued)
• Each semantic triple should include a relation
Constraints (continued)
• Each triple should have at least one class
Constraints (continued)
• Type constraints are respected (enforced through q-edges)
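Putting the objective and a few of these constraints together, a toy sketch with PuLP (the graph, the weights, and the subset of constraints shown are all simplifications of the paper's full ILP):

    from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

    # toy disambiguation graph (invented weights)
    sim_edges = {("Rome", "Rome_(city)"): 0.9, ("Rome", "Sydne_Rome"): 0.1,
                 ("born in", "bornIn"): 0.8}
    coh_edges = {("Rome_(city)", "bornIn"): 0.7, ("Sydne_Rome", "bornIn"): 0.05}
    alpha, beta = 1.0, 1.0  # two of the tuned hyperparameters

    prob = LpProblem("joint_disambiguation", LpMaximize)
    X = {e: LpVariable(f"sim_{i}", cat=LpBinary) for i, e in enumerate(sim_edges)}
    Y = {e: LpVariable(f"coh_{i}", cat=LpBinary) for i, e in enumerate(coh_edges)}

    # objective: reward selected similarity and coherence edges by their weights
    prob += alpha * lpSum(w * X[e] for e, w in sim_edges.items()) \
          + beta * lpSum(w * Y[e] for e, w in coh_edges.items())

    # constraint: each p-node is assigned to at most one s-node
    for p in {p for p, _ in sim_edges}:
        prob += lpSum(X[(pp, s)] for (pp, s) in sim_edges if pp == p) <= 1

    # constraint: a coherence edge may be chosen only if both of its s-nodes are
    # reached by some chosen similarity edge (simplified form of the coupling)
    for (s1, s2) in coh_edges:
        prob += Y[(s1, s2)] <= lpSum(X[e] for e in sim_edges if e[1] == s1)
        prob += Y[(s1, s2)] <= lpSum(X[e] for e in sim_edges if e[1] == s2)

    prob.solve()
    print([e for e in sim_edges if X[e].value() == 1])
    # expect Rome -> Rome_(city) and born in -> bornIn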
Query Generation
• E.g., replace each semantic class with a distinct type-constrained variable:
  arg1  rel        arg2
  ?x    type       writer
  ?y    type       person
  ?x    bornIn     Rome
  ?y    actedIn    Casablanca_(film)
  ?x    marriedTo  ?y
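A minimal sketch of this final rendering step from selected triples to a SPARQL string (the triple list is the running example; the serialization details are assumptions):

    def to_sparql(triples):
        """Render semantic triples as a SPARQL SELECT query."""
        variables = sorted({t for tr in triples for t in tr if t.startswith("?")})
        patterns = " . ".join(" ".join(tr) for tr in triples)
        return f"SELECT {' '.join(variables)} WHERE {{ {patterns} }}"

    print(to_sparql([("?x", "type", "writer"),
                     ("?y", "type", "person"),
                     ("?x", "bornIn", "Rome"),
                     ("?y", "actedIn", "Casablanca_(film)"),
                     ("?x", "marriedTo", "?y")]))
    # SELECT ?x ?y WHERE { ?x type writer . ?y type person . ... }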
Evaluation - Datasets
• Experiments are based on two collections of questions:
  • QALD-1: from the 1st Workshop on Question Answering over Linked Data
  • NAGA collection: questions from the context of the NAGA project
• Both are answered over linked data from the YAGO2 knowledge base

               QALD-1   NAGA
Training set       23     43
Test set           27     44
Total              50     87
Evaluation - Metrics
• Three stages are evaluated:
  • disambiguation of phrases
  • generation of the SPARQL query
  • obtaining answers from the underlying linked-data sources
• At each stage, the output was shown to two human assessors for judgment
Evaluation - Human Judgment
• Disambiguation stage: for each q-node/s-node, assessors judged whether the mapping was correct and whether any expected mappings were missing
Evaluation - Human Judgment (continued)
• Query-generation stage: assessors judged whether each triple pattern was meaningful for the question and whether any expected triple pattern was missing
Evaluation - Human Judgment (continued)
• Query-answering stage: assessors identified whether the result sets for the generated queries were satisfactory
Results of the Three Stages
• cov(q, s) = correct(q, s) / ideal(q)
• prec(q, s) = correct(q, s) / retrieved(q, s)
• where q is a question and s is an item set:
  • correct(q, s): the number of correct items in s
  • ideal(q): the size of the ideal item set
  • retrieved(q, s): the number of retrieved items
• E.g., if ideal(q) = 5, retrieved(q, s) = 4, and correct(q, s) = 3, then cov = 0.6 and prec = 0.75
Results
• Results are shown per question, with the generated query and sample answers
• Generated Query 3: ?x type actor . ?x bornIn Germany
• Its sample answer set is NONE because the relation bornIn relates persons to cities, not to countries such as Germany
Results - After Relaxing
• Queries are relaxed following (Elbassuoni et al., 2009), and the relaxed queries are ranked using the ranking model from the same work
• Generated Query 3 after relaxing: ?x type actor . ?x bornIn ?z[Germany]
Conclusions
• Presented a method for translating natural language questions into structured queries
• The core contribution is a framework for disambiguating phrases into semantic items
• Although the model has high combinatorial complexity, experiments showed very high precision and good coverage of the query translation