SOFIE: A Self-Organizing Framework for Information Extraction
Fabian M. Suchanek, Mauro Sozio, Gerhard Weikum (Max-Planck-Institute for Informatics, Saarbrücken, Germany)

Ontologies
[Diagram: an ontology graph. Singer and Country are each subclassOf Entity; instances are attached by type edges; a bornInPlace edge points to USA.]
Existing ontologies: DBpedia, YAGO, KYLIN, ...
Source so far: Wikipedia (e.g. birth-place: USA)
Open question: text from the Internet, e.g. "Elvis died in England" ?

Information Extraction
Goal: Extract ontological information from natural language documents
Example: "Elvis died in England" => diedInPlace(Elvis,England)
Previous approaches: Espresso, DIPRE, LEILA, Snowball, TextRunner, Alice, and many more
- May deliver non-canonical relations: died in, perished in, was killed in, ...
- May deliver non-canonical entities: England, UK, Great Britain, ...
- May deliver inconsistent facts: diedInPlace(Elvis,England), diedInPlace(Elvis,Germany)

Pitfalls of Information Extraction
Ontology: diedInPlace(Louis XIV, France); Taxidophobist
Web page: "Elvis died in England." "Louis XIV died in France."
Pattern Matching Problem:
- If a pattern occurs with two entities that stand in a relation, then the pattern maps to the relation. Example: "died in" = diedInPlace
- If a meaningful pattern occurs with two entities, then the entities stand in the relation. Example: diedInPlace("Elvis", "England")
Reasoning Problem: Taxidophobist ?
Disambiguation Problem: "Elvis", "England" ?

Information Extraction as Formulas
Reasoning Problem:
type(Elvis,Taxidophobist).
type(X,Taxidophobist) & bornInPlace(X,Y) => diedInPlace(X,Z) [0.8]
Pattern Matching Problem: Elvis died in England. Louis XIV died in France. "died in" = diedInPlace ?
Disambiguation Problem

Information Extraction as Formulas: Disambiguation Problem
Assumptions:
- Within one document, the same word always has the same meaning
- The ontology already knows all important meanings of proper names
possibleMeaning(Elvis@D15, ElvisPresley). [0.7]
- Elvis@D15: a word in context (wic), here the word "Elvis" in document D15
- ElvisPresley: one possible meaning of "Elvis" as given by the ontology
- [0.7]: prior estimate of the likelihood of this meaning, $\frac{|words(D15) \cap rel(ElvisPresley)|}{|words(D15)|}$
Rules:
possibleMeaning(X,Y) => means(X,Y)
means(X,Y) & Y≠Z => ¬means(X,Z)

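
The prior on the disambiguation slide is a word-overlap ratio, which can be sketched in a few lines of Python. This is only an illustration: the function name, the treatment of words(D) as a set of distinct tokens, and the toy word lists are assumptions, not taken from the slides.

```python
def disambiguation_prior(doc_words, related_words):
    """Fraction of the document's distinct words that are related to the entity.

    Mirrors |words(D) ∩ rel(E)| / |words(D)|, treating words(D) as a set
    (an assumption; the slides do not say whether duplicates count).
    """
    doc = set(doc_words)
    if not doc:
        return 0.0
    return len(doc & set(related_words)) / len(doc)


# Hypothetical toy data: ten distinct words of a document "D15".
words_d15 = ["elvis", "died", "in", "england", "saint",
             "wales", "bishop", "century", "monk", "church"]

# Hypothetical rel(E): words the ontology associates with each candidate entity.
rel = {
    "ElvisPresley": {"elvis", "singer", "memphis", "rock", "died"},
    "SaintElvis": {"elvis", "saint", "bishop", "wales", "monk", "church", "died"},
}

priors = {entity: disambiguation_prior(words_d15, words)
          for entity, words in rel.items()}
for entity, prior in priors.items():
    print(f"possibleMeaning(Elvis@D15, {entity}). [{prior:.1f}]")
```

Each candidate meaning gets its own weighted possibleMeaning fact; the choice between them is left to the later reasoning step.
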
Information Extraction as Formulas
Pattern Matching Problem: Elvis died in England. Louis XIV died in France. "died in" = diedInPlace ?
Reasoning Problem: type(Elvis,Taxidophobist). type(X,Taxidophobist) & bornInPlace(X,Y) => diedInPlace(X,Z)
Disambiguation Problem: possibleMeaning(Elvis@D15, ElvisPresley). [0.7]

Information Extraction as Formulas: Pattern Matching Problem
Elvis died in England. Louis XIV died in France. "died in" = diedInPlace ?
occurs("died in", Elvis@D15, England@D15). [14]
occurs(P,Wic1,Wic2) & means(Wic1,X) & means(Wic2,Y) & R(X,Y) => mapsTo(P,R)
occurs(P,Wic1,Wic2) & means(Wic1,X) & means(Wic2,Y) & mapsTo(P,R) => R(X,Y)

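
To make the two pattern rules concrete, here is a naive forward application of them in Python. It is a sketch under simplifying assumptions: words are taken as already disambiguated, weights are ignored, and the rules are applied one after the other, whereas SOFIE treats them as weighted formulas solved jointly. All names and toy facts are hypothetical.

```python
# Known ontology facts as (relation, subject, object); hypothetical toy data.
facts = {("diedInPlace", "LouisXIV", "France")}

# Pattern occurrences as (pattern, entity1, entity2), assumed already disambiguated.
occurrences = [
    ("died in", "LouisXIV", "France"),
    ("died in", "ElvisPresley", "England"),
]


def learn_mappings(occurrences, facts):
    """Rule 1: a pattern seen with a known related pair maps to that relation."""
    mappings = set()
    for pattern, x, y in occurrences:
        for relation, fact_x, fact_y in facts:
            if (fact_x, fact_y) == (x, y):
                mappings.add((pattern, relation))
    return mappings


def apply_mappings(occurrences, mappings):
    """Rule 2: a mapped pattern seen with a pair asserts the relation for it."""
    derived = set()
    for pattern, x, y in occurrences:
        for mapped_pattern, relation in mappings:
            if mapped_pattern == pattern:
                derived.add((relation, x, y))
    return derived


mappings = learn_mappings(occurrences, facts)
new_facts = apply_mappings(occurrences, mappings) - facts
print(mappings)   # pattern-to-relation pairs learned from known facts
print(new_facts)  # candidate facts not yet in the ontology
```

Applied blindly like this, the rules produce the questionable candidate about Elvis, which is why the deck combines them with disambiguation and consistency constraints.
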
Information Extraction as Formulas
Pattern Matching Problem: occurs("died in", Elvis@D15, England@D15). [14]
Reasoning Problem: type(Elvis,Taxidophobist). type(X,Taxidophobist) & bornInPlace(X,Y) => diedInPlace(X,Z)
Disambiguation Problem: possibleMeaning(Elvis@D15, ElvisPresley). [0.7]
Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized:
- means(Elvis@D15, ElvisPresley) ?
- mapsTo("died in", diedInPlace) ?
- diedInPlace(ElvisPresley, England) ?

Weighted MAX SAT Problem
Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized
Problems:
- The Weighted MAX SAT Problem is NP-hard
- Our instance of the problem is huge
- The most popular linear approximation algorithm (Johnson's) does not work well with our type of formulas
Example: bornInPlace(X,Y) => ¬bornInPlace(X,Z) gives clauses of the form ¬A v ¬B, ¬A v ¬C, ¬B v ¬C
Johnson's cannot approximate better than 2/3

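
The objective can be illustrated with an exhaustive solver on a tiny instance. This is a brute-force sketch, not the method from the slides: it enumerates all 2^n assignments, which is exactly what becomes infeasible on a huge instance. The clause encoding and the weights are assumptions chosen for illustration.

```python
from itertools import product


def satisfied_weight(clauses, assignment):
    """Sum the weights of clauses with at least one literal that holds."""
    return sum(
        weight
        for literals, weight in clauses
        if any(assignment[var] == polarity for var, polarity in literals)
    )


def brute_force_max_sat(clauses):
    """Try every truth assignment and keep the first one of maximal weight."""
    variables = sorted({var for literals, _ in clauses for var, _ in literals})
    best = None
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = satisfied_weight(clauses, assignment)
        if best is None or weight > best[0]:
            best = (weight, assignment)
    return best


# Hypothetical instance: a literal is (variable, polarity); polarity False means negated.
clauses = [
    ([("A", True)], 3),                # A        [3]
    ([("B", True)], 2),                # B        [2]
    ([("A", False), ("B", False)], 10),  # ¬A v ¬B  [10]
]

best_weight, best_assignment = brute_force_max_sat(clauses)
print(best_weight, best_assignment)
```

The heavy clause ¬A v ¬B plays the role of a functional constraint: at most one of the two hypotheses may hold, so the solver keeps the one with the larger unit weight.
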
FMS Algorithm
The Functional MAX SAT Algorithm considers only unit clauses.
Formulas: A v B [w1], A v B [w2], B v C [w3], C [w4]
Hypotheses: A, B, C, each set to true or false
The Functional MAX SAT Algorithm propagates Dominating Unit Clauses.
Example: ¬A v B [10], ¬A [10], A [30]. Since 30 > 10+10, set A = true.

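
The dominating-unit-clause step can be sketched as follows. The sketch assumes the slide's example reads ¬A v B [10], ¬A [10], A [30] (negation signs appear to have been lost in the transcript) and assumes this definition: a unit clause dominates when its weight exceeds the total weight of all clauses containing the opposite literal. Function names and the clause encoding are hypothetical, and this is one propagation step, not the full FMS Algorithm.

```python
def find_dominating_unit(clauses):
    """Return a literal (var, polarity) whose unit weight beats all opposition."""
    unit_weight = {}      # literal -> total weight of unit clauses asserting it
    opposing_weight = {}  # literal -> total weight of clauses containing its negation
    for literals, weight in clauses:
        if len(literals) == 1:
            literal = literals[0]
            unit_weight[literal] = unit_weight.get(literal, 0) + weight
        for var, polarity in literals:
            negated = (var, not polarity)
            opposing_weight[negated] = opposing_weight.get(negated, 0) + weight
    for literal, weight in unit_weight.items():
        if weight > opposing_weight.get(literal, 0):
            return literal
    return None


def propagate(clauses, var, value):
    """Fix var=value: drop satisfied clauses, delete falsified literals."""
    remaining = []
    for literals, weight in clauses:
        if (var, value) in literals:
            continue  # clause is satisfied by the new assignment
        reduced = [lit for lit in literals if lit[0] != var]
        if reduced:  # an emptied clause can no longer be satisfied
            remaining.append((reduced, weight))
    return remaining


# The slide's example under the assumed reading; polarity False means negated.
clauses = [
    ([("A", False), ("B", True)], 10),  # ¬A v B [10]
    ([("A", False)], 10),               # ¬A     [10]
    ([("A", True)], 30),                # A      [30]
]

var, value = find_dominating_unit(clauses)
print(var, value)                      # the dominating literal: 30 > 10 + 10
print(propagate(clauses, var, value))  # what is left after fixing it
```

After A is fixed, ¬A v B shrinks to the unit clause B, so propagation can create new unit clauses for the next round.
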
FMS Algorithm
[Figure: FMS Algorithm box (FOR i=1 TO 42 ... NEXT i)]
- Polynomial time
- Approximation guarantee
- Experiments show better performance in practice than Johnson's algorithm in our setting.

FMS Algorithm
[Figure: the text "Elvis died in England" and rules such as r(X,Y) & s(Y) => t(X,Y) are fed into the FMS Algorithm (FOR i=1 TO 42 ... NEXT i).]
Resulting truth assignment:
type(Elvis,Taxidophobist)=1
diedIn(Elvis,England)=0
means(Elvis@D15,Elvis)=0
means(Elvis@D15,...)=1
Resulting fact: diedIn(St. Elvis, England)

Other Experiments

Conclusion
SOFIE unifies the tasks of
- entity disambiguation
- pattern extraction
- semantic constraint reasoning
in a single framework, delivering
- canonicalized facts
- of high precision (experiments show 90% precision)
died in England... but is alive!

SOFIE rules!
R(X,Y) /\ R(X,Z) /\ type(R,function) => Y = Z
occurs(P,WX,WY) /\ refersTo(WX,X) /\ refersTo(WY,Y) /\ R(X,Y) => expresses(P,R)
occurs(P,WX,WY) /\ expresses(P,R) /\ refersTo(WX,X) /\ refersTo(WY,Y) /\ domain(R,D1) /\ range(R,D2) /\ type(X,D1) /\ type(Y,D2) => R(X,Y)
disambiguationPrior(W,X) => refersTo(W,X)
R(X,Y)
bornInYear(X,B) /\ diedInYear(X,D) => B<D

SOFIE: Experiments

SOFIE: Large-Scale Experiment
Corpus: 3700 biography documents downloaded from the Web
Goal: Extract bornIn, bornOnDate, diedIn, diedOnDate, politicianOf
Results (precision in %): [Chart: bars labeled bornIn, bornOnD, diedIn, diedOnD, polOf; extracted values 87, 87, 13, 98, 95, 90]
Runtime (summed over 5 batches):
Parsing                 7:05h
Hypothesis Generation   6:15h
Solving                 2:30h
Total                  15:50h

SOFIE: Relation to Markov Logic
r(x,y) /\ s(x,z) => t(x,z) [w] ...
sat(i,X): number of satisfied instances of the ith formula
w_i: weight of the ith formula
$P(X) \sim \prod_i e^{sat(i,X)\, w_i}$
The following are all maximized by the same X:
$\max_X \prod_i e^{sat(i,X)\, w_i}$
$\max_X \log\left(\prod_i e^{sat(i,X)\, w_i}\right)$
$\max_X \sum_i sat(i,X)\, w_i$
Example hypothesis: bornIn(Nicholas, Patras), true or false
=> Weighted MAX SAT problem

Grounding
Rule: r(X,Y) & s(Y) => t(X,Y)
Clause form: { ¬r(X,Y), ¬s(Y), t(X,Y) }
Entities = {a,b}
Grounded clauses:
{ ¬r(a,a), ¬s(a), t(a,a) }
{ ¬r(a,b), ¬s(b), t(a,b) }
{ ¬r(b,a), ¬s(a), t(b,a) }
{ ¬r(b,b), ¬s(b), t(b,b) }
Immutable, complete facts (e.g. pattern occurrences): r(a,a) [w], r(a,b), r(b,a), r(b,b)
Simplified: { ¬s(a), t(a,a) } [w]

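
Grounding and simplification can be sketched as below. The sketch assumes that r is the immutable, complete predicate and that r(a,a) is the only r-fact that holds, which is one reading of the slide. Clauses are written with explicit negations, as the implication r & s => t requires. Names and the clause encoding are hypothetical.

```python
from itertools import product

entities = ["a", "b"]
known_r = {("a", "a")}  # assumed: r(a,a) holds, every other r-fact is false


def ground_rule(entities):
    """Ground r(X,Y) & s(Y) => t(X,Y) as clauses {¬r(X,Y), ¬s(Y), t(X,Y)}."""
    clauses = []
    for x, y in product(entities, repeat=2):
        # A literal is (atom, polarity); polarity False means negated.
        clauses.append([(("r", x, y), False), (("s", y), False), (("t", x, y), True)])
    return clauses


def simplify(clauses, known_r):
    """Evaluate r-literals against the known facts and drop what is decided."""
    result = []
    for clause in clauses:
        reduced = []
        satisfied = False
        for atom, polarity in clause:
            if atom[0] == "r":
                holds = atom[1:] in known_r
                if holds == polarity:
                    satisfied = True  # literal is true, whole clause is satisfied
                    break
                # literal is false: it contributes nothing and is dropped
            else:
                reduced.append((atom, polarity))
        if not satisfied:
            result.append(reduced)
    return result


grounded = ground_rule(entities)
simplified = simplify(grounded, known_r)
print(len(grounded))  # one grounded clause per (X, Y) pair
print(simplified)     # only the clause for r(a,a) survives, without its r-literal
```

Only hypotheses whose truth value is still open remain in the clauses handed to the solver, which keeps the MAX SAT instance smaller.
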
Grounding
{ ¬s(a), t(a,a) } [w1]
{p(c,d), q(e), } [w2]
Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized:
- means(Elvis@D15, ElvisPresley) = true ?
- mapsTo("died in", diedInPlace) = true ?
- diedInPlace(ElvisPresley, England) = true ?