SOFIE: A Self-Organizing Framework for Information Extraction
Fabian M. Suchanek, Mauro Sozio, Gerhard Weikum (Max-Planck-Institute for Informatics, Saarbrücken, Germany)
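Ontologies
[Figure: an ontology graph. Singer and Country are connected to Entity by subclassOf edges, instances are attached by type edges, and a bornInPlace edge points to USA. Ontologies such as DBpedia, YAGO, KYLIN, ... are fed from Wikipedia (e.g. the attribute "birth-place: USA"). A "?" marks the step from Internet text such as "Elvis died in England" to the ontology.]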
Information Extraction
Goal: extract ontological information from natural-language documents.
Example: "Elvis died in England" yields a diedInPlace edge to England.
Previous approaches: Espresso, DIPRE, LEILA, Snowball, TextRunner, Alice, and many more.
Examples of extracted output: recoverWithout(most_people, medication), areUnder(0%, the_age_of_18), support(these_findings, the_notion)
Previous approaches
- may deliver non-canonical relations: died in, perished in, was killed in
- may deliver non-canonical entities: England, UK, Great Britain
- may deliver inconsistent facts: diedInPlace(Elvis, England), diedInPlace(Elvis, Germany)
SOFIE aims to solve these problems in a new unified framework.
Pitfalls of Information Extraction
Ontology: contains a diedInPlace edge to France.
Web page: "Elvis died in England. Louis XIV died in France."
Rule 1: If a pattern occurs with two entities that stand in a relation, then the pattern maps to the relation. Example: "died in" = diedInPlace.
Rule 2: If a meaningful pattern occurs with two entities, then the entities stand in the relation. Example: diedInPlace between "Elvis" and "England".
Applying these rules raises three problems:
- Pattern Matching Problem: "died in" = diedInPlace ?
- Disambiguation Problem: the words "Elvis" and "England" have to be mapped to entities of the ontology.
- Reasoning Problem: the new fact has to fit what the ontology already knows (on the slide: Elvis is a Taxidophobist).
Information Extraction as Formulas
Reasoning Problem:
type(Elvis,Taxidophobist).
type(X,Taxidophobist) & bornInPlace(X,Y) => diedInPlace(X,Z) [0.8]
Information Extraction as Formulas
Reasoning Problem: type(Elvis,Taxidophobist). type(X,Taxidophobist) & bornInPlace(X,Y) => diedInPlace(X,Z)
Pattern Matching Problem: "Elvis died in England. Louis XIV died in France." "died in" = diedInPlace ?
Disambiguation Problem
Information Extraction as Formulas
Disambiguation Problem. Assumptions:
- Within one document, the same word always has the same meaning.
- The ontology already knows all important meanings of proper names.
possibleMeaning(Elvis@D15, ElvisPresley). [0.7]
- Elvis@D15 is a word in context (wic). Here: the word "Elvis" in document D15.
- ElvisPresley is one possible meaning of "Elvis" as given by the ontology.
- 0.7 is a prior estimate of the likelihood of this meaning: |words(D15) ∩ rel(ElvisPresley)| / |words(D15)|
Rules:
possibleMeaning(X,Y) => means(X,Y)
means(X,Y) & Y≠Z => ¬means(X,Z)
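Note: the prior |words(D15) ∩ rel(ElvisPresley)| / |words(D15)| can be sketched in a few lines of Python. This is only an illustration under assumptions: rel(e) is treated here as a plain set of words associated with the entity, and the word lists below are made-up toy data, not taken from the slides.

```python
def disambiguation_prior(doc_words, related_words):
    """Fraction of the distinct document words that also occur in rel(entity)."""
    doc = set(doc_words)
    if not doc:
        return 0.0  # an empty document gives no evidence for any meaning
    return len(doc & set(related_words)) / len(doc)


# Hypothetical toy data (assumption): 10 distinct words in document D15,
# 7 of which are also associated with the entity ElvisPresley.
words_d15 = ["elvis", "died", "in", "england", "singer",
             "memphis", "guitar", "rock", "music", "king"]
rel_elvis_presley = {"elvis", "singer", "memphis", "guitar",
                     "rock", "music", "king", "graceland"}

prior = disambiguation_prior(words_d15, rel_elvis_presley)
print(f"possibleMeaning(Elvis@D15, ElvisPresley). [{prior:.1f}]")
```

With this toy data, 7 of the 10 distinct document words overlap, which reproduces the weight 0.7 used in the example fact.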
Information Extraction as Formulas
Reasoning Problem: type(Elvis,Taxidophobist). type(X,Taxidophobist) & bornInPlace(X,Y) => diedInPlace(X,Z)
Pattern Matching Problem: "Elvis died in England. Louis XIV died in France." "died in" = diedInPlace ?
Disambiguation Problem: possibleMeaning(Elvis@D15, ElvisPresley). [0.7]
Information Extraction as Formulas
Pattern Matching Problem. Text: "Elvis died in England. Louis XIV died in France." "died in" = diedInPlace ?
occurs("died in", Elvis@D15, England@D15). [14]
occurs(P,Wic1,Wic2) & means(Wic1,X) & means(Wic2,Y) & R(X,Y) => mapsTo(P,R)
occurs(P,Wic1,Wic2) & means(Wic1,X) & means(Wic2,Y) & mapsTo(P,R) => R(X,Y)
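Note: the two pattern rules can be illustrated with a small Python sketch. It applies the rules as hard, deterministic inferences, which is a simplification: in SOFIE they are weighted formulas whose truth values are found jointly. The function names, the document id D7, and the toy facts are assumptions made for this example.

```python
def learn_patterns(occurrences, means, ontology):
    """Rule 1: occurs(P,W1,W2) & means(W1,X) & means(W2,Y) & R(X,Y) => mapsTo(P,R)."""
    maps_to = set()
    for pattern, w1, w2 in occurrences:
        x, y = means.get(w1), means.get(w2)
        if x is None or y is None:
            continue  # at least one word has no known meaning
        for relation, a, b in ontology:
            if (a, b) == (x, y):
                maps_to.add((pattern, relation))
    return maps_to


def extract_facts(occurrences, means, maps_to):
    """Rule 2: occurs(P,W1,W2) & means(W1,X) & means(W2,Y) & mapsTo(P,R) => R(X,Y)."""
    facts = set()
    for pattern, w1, w2 in occurrences:
        x, y = means.get(w1), means.get(w2)
        if x is None or y is None:
            continue
        for known_pattern, relation in maps_to:
            if known_pattern == pattern:
                facts.add((relation, x, y))
    return facts


# Toy input (assumption): one fact already in the ontology, two pattern occurrences.
ontology = {("diedInPlace", "LouisXIV", "France")}
occurrences = [("died in", "LouisXIV@D7", "France@D7"),
               ("died in", "Elvis@D15", "England@D15")]
means = {"LouisXIV@D7": "LouisXIV", "France@D7": "France",
         "Elvis@D15": "ElvisPresley", "England@D15": "England"}

maps_to = learn_patterns(occurrences, means, ontology)
new_facts = extract_facts(occurrences, means, maps_to) - ontology
print(maps_to)
print(new_facts)
```

The known fact about Louis XIV teaches the pattern "died in", and the learned pattern then proposes a new fact for Elvis. Whether that fact is accepted is exactly what the weighted reasoning has to decide.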
Information Extraction as Formulas
Reasoning Problem: type(Elvis,Taxidophobist). type(X,Taxidophobist) & bornInPlace(X,Y) => diedInPlace(X,Z)
Pattern Matching Problem: occurs("died in", Elvis@D15, England@D15). [14]
Disambiguation Problem: possibleMeaning(Elvis@D15, ElvisPresley). [0.7]
Find truth assignments to the hypotheses so that the total weight of satisfied formulas is maximized. Hypotheses:
means(Elvis@D15, ElvisPresley) ?
mapsTo("died in", diedInPlace) ?
diedInPlace(ElvisPresley, England) ?
Weighted MAX SAT Problem
Find truth assignments to the hypotheses so that the total weight of satisfied formulas is maximized.
Structurally much simpler than MLNs (Markov Logic Networks). No need to model probabilities if we are only interested in the maximum.
Problems:
- The Weighted MAX SAT Problem is NP-hard.
- Our instance of the problem is huge.
- The most popular greedy approximation algorithm (Johnson's) does not work well with our type of formulas, e.g. bornInPlace(X,Y) => ¬bornInPlace(X,Z), which yields clauses such as ¬A v ¬B, ¬A v ¬C, ¬B v ¬C.
Johnson's algorithm has an upper bound of 2/3 on its approximation ratio.
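Note: the objective can be made concrete with a brute-force solver. This sketch enumerates all assignments, so it is exponential and only meant to show what "maximize the weight of satisfied formulas" means on a tiny instance; it is not the algorithm used by SOFIE. The clause encoding and all weights are assumptions chosen for the example: three mutually exclusive hypotheses A, B, C (as a functional relation would produce) plus one weak unit clause per hypothesis.

```python
from itertools import product


def weighted_max_sat(variables, clauses):
    """Exhaustively search for the assignment with maximal satisfied weight.

    Each clause is (literals, weight); a literal is (variable, polarity),
    where polarity False stands for a negated variable.
    """
    best_assignment, best_weight = None, float("-inf")
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = sum(
            w for literals, w in clauses
            if any(assignment[var] == polarity for var, polarity in literals)
        )
        if weight > best_weight:
            best_assignment, best_weight = assignment, weight
    return best_assignment, best_weight


# Hypothetical instance (assumption): at most one of A, B, C may be true.
clauses = [
    ((("A", False), ("B", False)), 10),  # not A or not B
    ((("A", False), ("C", False)), 10),  # not A or not C
    ((("B", False), ("C", False)), 10),  # not B or not C
    ((("A", True),), 3),                 # evidence for A
    ((("B", True),), 2),                 # evidence for B
    ((("C", True),), 1),                 # evidence for C
]

best_assignment, best_weight = weighted_max_sat(["A", "B", "C"], clauses)
print(best_assignment, best_weight)
```

On this instance the best assignment sets only A to true: all three exclusion clauses stay satisfied (weight 30) and the strongest unit clause adds 3.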
FMS Algorithm
The Functional MAX SAT (FMS) Algorithm considers only unit clauses.
[Figure: formulas A v B [w1], A v B [w2], B v C [w3], C [w4] over the hypotheses A, B, C, which are assigned true or false step by step.]
The Functional MAX SAT Algorithm propagates dominating unit clauses.
Example: ¬A v B [10], ¬A [10], A [30]. Since 30 > 10 + 10, A = true.
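Note: the idea of a dominating unit clause can be sketched as follows. This is not the full FMS algorithm, only one reading of the propagation step: a unit clause dominates when its weight exceeds the total weight of all clauses containing the opposite literal. The example assumes the slide's clauses are ¬A v B [10], ¬A [10], A [30]; the negation signs are an interpretation, and the function names are hypothetical.

```python
def find_dominating_units(clauses):
    """Return {variable: value} for every dominating unit clause.

    Each clause is (literals, weight); a literal is (variable, polarity).
    """
    forced = {}
    for literals, _ in clauses:
        if len(literals) != 1:
            continue
        var, polarity = literals[0]
        # Total weight of unit clauses asserting this literal.
        support = sum(w for lits, w in clauses
                      if len(lits) == 1 and lits[0] == (var, polarity))
        # Total weight of all clauses that contain the opposite literal.
        opposition = sum(w for lits, w in clauses
                         if (var, not polarity) in lits)
        if support > opposition:
            forced[var] = polarity
    return forced


def propagate(clauses, forced):
    """Drop satisfied clauses and remove literals that became false."""
    remaining = []
    for literals, weight in clauses:
        if any(forced.get(var) == polarity for var, polarity in literals):
            continue  # clause is satisfied by a forced value
        reduced = tuple(lit for lit in literals if lit[0] not in forced)
        if reduced:  # an empty clause is unsatisfiable and is dropped
            remaining.append((reduced, weight))
    return remaining


clauses = [
    ((("A", False), ("B", True)), 10),  # not A or B
    ((("A", False),), 10),              # not A
    ((("A", True),), 30),               # A
]
forced = find_dominating_units(clauses)
remaining = propagate(clauses, forced)
print(forced, remaining)
```

Setting A to true loses at most 10 + 10 but gains 30, so no later decision can make A = false preferable. After propagation only the unit clause B [10] remains, which is itself dominating in the next round.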
FMS Algorithm
- Polynomial time
- Approximation guarantee
- Experiments show better performance in practice than Johnson's algorithm in our setting.
[Figure: the FMS Algorithm drawn as a box of pseudocode, "FOR i=1 TO 42 ... NEXT i".]
FMS Algorithm
[Figure: the text "Elvis died in England" and rules of the form r(X,Y) & s(Y) => t(X,Y) are fed into the FMS Algorithm box ("FOR i=1 TO 42 ... NEXT i"). The resulting truth values are type(Elvis,Taxidophobist)=1, diedIn(Elvis,England)=0, means(Elvis@D15,Elvis)=0, means(Elvis@D15,...)=1. The output is a diedIn edge between St. Elvis and England.]
SOFIE
[Figure: the same pipeline, labeled SOFIE: rules r(X,Y) & s(Y) => t(X,Y) and the diedIn edge between St. Elvis and England.]
Other Experiments
(All experiments with the YAGO ontology)
Conclusion
SOFIE unifies the tasks of
- entity disambiguation
- pattern extraction
- semantic constraint reasoning
in a single framework, delivering
- canonicalized facts
- of high precision.
[Figure: rule s(Y) => t(X), with the caption "died in England... but is alive!"]
http://mpii.de/yago-naga
SOFIE rules!
R(X,Y) /\ R(X,Z) /\ type(R,function) => Y = Z
occurs(P,WX,WY) /\ refersTo(WX,X) /\ refersTo(WY,Y) /\ R(X,Y) => expresses(P,R)
occurs(P,WX,WY) /\ expresses(P,R) /\ refersTo(WX,X) /\ refersTo(WY,Y) /\ domain(R,D1) /\ range(R,D2) /\ type(X,D1) /\ type(Y,D2) => R(X,Y)
disambiguationPrior(W,X) => refersTo(W,X)
bornInYear(X,B) /\ diedInYear(X,D) => B<D
SOFIE: Experiments
SOFIE: Large-Scale Experiment
Corpus: 3700 biography documents downloaded from the Web
Goal: extract bornIn, bornOnDate, diedIn, diedOnDate, politicianOf
Runtime (summed over 5 batches):
Parsing                  7:05h
Hypothesis Generation    6:15h
Solving                  2:30h
Total                   15:50h
Results (precision in %):
[Chart: one bar per relation, labeled bornIn, bornOnD, diedIn, diedOnD, polOf. Numbers on the slide, in extraction order: 87, 87, 13, 98, 95, 90.]
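SOFIE: Relation to Markov Logic
Weighted formulas: r(x,y) /\ s(x,z) => t(x,z) [w], ...
A possible world X assigns true or false to each statement, e.g. bornIn(Nicholas, Patras). Its probability is
$$P(X) \sim \prod_i e^{\mathrm{sat}(i,X)\, w_i}$$
where sat(i,X) is the number of satisfied instances of the i-th formula and w_i is the weight of the i-th formula. The most likely world is found by
$$\max_X \prod_i e^{\mathrm{sat}(i,X)\, w_i}$$
$$\max_X \log\Big( \prod_i e^{\mathrm{sat}(i,X)\, w_i} \Big)$$
$$\max_X \sum_i \mathrm{sat}(i,X)\, w_i$$
~~~~> Weighted MAX SAT problem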
Grounding
Rule: r(X,Y) & s(Y) => t(X,Y), in clause form { ¬r(X,Y), ¬s(Y), t(X,Y) } [w]
r: immutable, complete facts (e.g. pattern occurrences)
With Entities = {a,b}, the rule has four groundings:
{ ¬r(a,a), ¬s(a), t(a,a) }
{ ¬r(a,b), ¬s(b), t(a,b) }
{ ¬r(b,a), ¬s(a), t(b,a) }
{ ¬r(b,b), ¬s(b), t(b,b) }
Because r is completely known, its literals can be evaluated right away. Of r(a,a), r(a,b), r(b,a), r(b,b), only r(a,a) holds, so the other three groundings are trivially satisfied and a single clause remains:
{ ¬s(a), t(a,a) } [w]
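Note: the grounding step for this one rule can be sketched in Python. The sketch assumes that r is completely known and that r(a,a) is its only true fact, as the slide suggests; the tuple encoding of literals and the function name are choices made for this example.

```python
from itertools import product


def ground_rule(entities, known_r, weight):
    """Ground r(X,Y) & s(Y) => t(X,Y) over all entity pairs.

    The clause form is {not r(X,Y), not s(Y), t(X,Y)}. Since r is completely
    known, groundings with a false r(X,Y) are trivially satisfied and are
    skipped; in the rest, the false literal "not r(X,Y)" is removed.
    A literal is (predicate, arguments, polarity).
    """
    clauses = []
    for x, y in product(entities, repeat=2):
        if (x, y) not in known_r:
            continue  # r(x,y) is false, so the implication holds
        clauses.append(((("s", (y,), False), ("t", (x, y), True)), weight))
    return clauses


# Entities = {a, b}; assumption: r(a,a) is the only true r fact.
grounded = ground_rule(["a", "b"], {("a", "a")}, 1.0)
for literals, w in grounded:
    print(literals, w)
```

Only the grounding for X = a, Y = a survives, giving the clause { ¬s(a), t(a,a) } with the rule's weight. Skipping groundings early like this keeps the MAX SAT instance much smaller than the full cross product of entities.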
Grounding
Grounded clauses: { ¬s(a), t(a,a) } [w1], {p(c,d), q(e), } [w2]
Find truth assignments to the hypotheses so that the total weight of satisfied formulas is maximized:
means(Elvis@D15, ElvisPresley) = true ?
mapsTo("died in", diedInPlace) = true ?
diedInPlace(ElvisPresley, England) = true ?