From Unstructured Information to Linked Data
Axel Ngonga, Head of SIMBA@AKSW, University of Leipzig
IASLOD, August 15/16th 2012
Motivation • Where does the LOD Cloud come from? • Structured data • Triplify, D2R • Semi-structured data • DBpedia • Unstructured data • ??? • Unstructured data makes up 80% of the Web • How do we extract Linked Data from unstructured data sources?
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion • NB: We will mainly be concerned with the newest developments.
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion
Problem Definition • Simple(?) problem: given a text fragment, automatically retrieve • all entities and • relations between these entities, plus • "ground" them in an ontology • Also coined Knowledge Extraction • John Petrucci was born in New York. → :John_Petrucci dbo:birthPlace :New_York . [Diagram: nodes :John_Petrucci and :New_York linked by a dbo:birthPlace edge]
Problems 1. Finding entities → Named Entity Recognition 2. Finding relation instances → Relation Extraction 3. Finding URIs → URI Disambiguation
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion
Named Entity Recognition • Problem definition: Given a set of classes, find all strings that are labels of instances of these classes within a text fragment • John Petrucci was born in New York. → [John Petrucci, PER] was born in [New York, LOC].
Named Entity Recognition • Problem definition: Given a set of classes, find all strings that are labels of instances of these classes within a text fragment • Common sets of classes • CoNLL03: Person, Location, Organization, Miscellaneous • ACE05: Facility, Geo-Political Entity, Location, Organisation, Person, Vehicle, Weapon • BioNLP2004: Protein, DNA, RNA, cell line, cell type • Several approaches • Direct solutions (single algorithms) • Ensemble Learning
NER: Overview of approaches • Dictionary-based • Hand-crafted Rules • Machine Learning • Hidden Markov Models (HMMs) • Conditional Random Fields (CRFs) • Neural Networks • k Nearest Neighbors (kNN) • Graph Clustering • Ensemble Learning • Veto-Based (Bagging, Boosting) • Neural Networks
NER: Dictionary-based • Simple Idea • Define mappings between words and classes, e.g., Paris → Location • Try to match each token of each sentence • Return the mapped entities • Time-efficient at runtime • Manual creation of gazetteers • Low precision (Paris = Person, Location) • Low recall (esp. on Persons and Organizations as the number of instances grows)
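A minimal sketch of the idea in Python (the gazetteer below is a toy example, not a real resource):

```python
# Minimal dictionary-based NER: look up each token in a hand-built gazetteer.
# The gazetteer is an illustrative toy example.
gazetteer = {
    "Paris": "LOC",      # ambiguous in reality: also a person name
    "Leipzig": "LOC",
    "Petrucci": "PER",
}

def tag(sentence):
    """Return (token, class) pairs; tokens absent from the gazetteer get 'O'."""
    return [(tok, gazetteer.get(tok, "O")) for tok in sentence.split()]

print(tag("John Petrucci was born in Paris"))
# [('John', 'O'), ('Petrucci', 'PER'), ..., ('Paris', 'LOC')]
```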
NER: Rule-based • Simple Idea • Define a set of rules to find entities, e.g., [PERSON] was born in [LOCATION]. • Try to match each sentence to one or several rules • Return the matched entities • High precision • Manual creation of rules is very tedious • Low recall (finite number of patterns)
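A toy illustration of one such rule as a regular expression (the pattern and its capitalization heuristic are illustrative; real systems match over POS-tagged text):

```python
import re

# One hand-crafted rule; capitalized word sequences stand in for noun phrases.
BORN_IN = re.compile(r"([A-Z]\w+(?: [A-Z]\w+)*) was born in ([A-Z]\w+(?: [A-Z]\w+)*)")

match = BORN_IN.search("John Petrucci was born in New York.")
if match:
    person, location = match.groups()
    print([(person, "PER"), (location, "LOC")])
# [('John Petrucci', 'PER'), ('New York', 'LOC')]
```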
NER: Markov Models • Stochastic process with the Markov Property: P(Xt+1 = Sj | Xt = Si, Xt-1, …, X0) = P(Xt+1 = Sj | Xt = Si) • Equivalent to a finite-state machine • Formally consists of • a set S of states S1, …, Sn • a matrix M such that mij = P(Xt+1 = Sj | Xt = Si)
NER: Hidden Markov Models • Extension of Markov Models • States are hidden and assigned an output function • Only the output is seen • Transitions are learned from training data • How do they work? • Input: Discrete sequence of features (e.g., POS tags, word stems, etc.) • Goal: Find the best sequence of states that represents the input • Output: hopefully the right classification of each token [Diagram: hidden states S0 … Sn emitting the tags PER, _, …, LOC]
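A compact Viterbi decoding sketch for such a tagger; the two-state model and all probabilities are made up for illustration:

```python
# Viterbi decoding for a toy two-state HMM tagger: find the most probable
# hidden tag sequence for an observed token sequence. All numbers are made up.
states = ["PER", "O"]
start = {"PER": 0.3, "O": 0.7}
trans = {"PER": {"PER": 0.6, "O": 0.4}, "O": {"PER": 0.2, "O": 0.8}}
emit = {
    "PER": {"John": 0.4, "Petrucci": 0.4, "was": 0.01, "born": 0.01},
    "O":   {"John": 0.05, "Petrucci": 0.05, "was": 0.4, "born": 0.4},
}

def viterbi(tokens):
    # column[s] = (probability of the best path ending in state s, that path)
    column = {s: (start[s] * emit[s].get(tokens[0], 1e-6), [s]) for s in states}
    for tok in tokens[1:]:
        column = {
            s: max(
                (p * trans[prev][s] * emit[s].get(tok, 1e-6), path + [s])
                for prev, (p, path) in column.items()
            )
            for s in states
        }
    prob, path = max(column.values())
    return path

print(viterbi(["John", "Petrucci", "was", "born"]))  # ['PER', 'PER', 'O', 'O']
```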
NER: k Nearest Neighbors • Idea • Describe each token q from a labelled training dataset with a set of features (e.g., left and right neighbors) • Each new token t is described with the same features • Assign t the class of its k nearest neighbors
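A toy sketch of kNN token classification; the features (left/right neighbor words), training examples, and k are illustrative:

```python
from collections import Counter

# Toy kNN token classifier: a token is described by its (left, right) neighbor
# words; the training examples and k are illustrative.
train = [
    (("john", "was"), "PER"),   # features of "Petrucci" in "john Petrucci was"
    (("in", "."), "LOC"),       # features of "York" in "in York ."
    (("the", "of"), "O"),
]

def overlap(f1, f2):
    # Number of matching feature positions; a stand-in for a real distance.
    return sum(a == b for a, b in zip(f1, f2))

def knn(features, k=1):
    neighbors = sorted(train, key=lambda ex: -overlap(features, ex[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

print(knn(("john", "was")))  # 'PER'
```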
NER: So far … • "Simple approaches" • Apply one algorithm to the NER problem • Bound to be limited by the assumptions of the model • Implemented by a large number of tools • Alchemy • Stanford NER • Illinois Tagger • Ontos NER Tagger • LingPipe • …
NER: Ensemble Learning • Intuition: Each algorithm has its strengths and weaknesses • Idea: Use ensemble learning to merge the results of different algorithms so as to create a meta-classifier of higher accuracy [Diagram: overlapping approaches, e.g., pattern-based approaches, dictionary-based approaches, Support Vector Machines, Conditional Random Fields]
NER: Ensemble Learning • Idea: Merge the results of several approaches to improve results, as sketched below • Simplest approaches: • Voting • Weighted voting [Diagram: Input → System 1, System 2, …, System n → Merger → Output]
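A minimal weighted-voting merger, assuming per-token tag sequences from n systems (system outputs and weights are illustrative):

```python
from collections import defaultdict

# Weighted voting over the per-token predictions of several NER systems.
# System outputs and weights below are illustrative.
def weighted_vote(predictions, weights):
    """predictions: {system: [tag per token]}; weights: {system: float}."""
    n_tokens = len(next(iter(predictions.values())))
    merged = []
    for i in range(n_tokens):
        scores = defaultdict(float)
        for system, tags in predictions.items():
            scores[tags[i]] += weights[system]
        merged.append(max(scores, key=scores.get))
    return merged

preds = {
    "sys1": ["PER", "PER", "O", "LOC"],
    "sys2": ["PER", "O",   "O", "LOC"],
    "sys3": ["O",   "PER", "O", "O"],
}
print(weighted_vote(preds, {"sys1": 0.5, "sys2": 0.3, "sys3": 0.2}))
# ['PER', 'PER', 'O', 'LOC']
```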
NER: Ensemble Learning • When does it work? • Accuracy • Need for the existing solutions to be "good" • Merging random results leads to random results • Given: current approaches reach 80% F-Score • Diversity • Need for the smallest possible amount of correlation between approaches • E.g., merging two HMM-based taggers won't help • Given: large number of approaches for NER
NER: FOX • Federated Knowledge Extraction Framework • Idea: Apply ensemble learning to NER • Classical approach: Voting • Does not make use of systematic error • Partly difficult to train • Use neural networks instead • Can make use of systematic error • Easy to train • Converge fast • http://fox.aksw.org
NER: FOX on Companies and Countries • No runtime issues (parallel implementation) • NN overhead is small • Overfitting
NER: Summary • Large number of approaches • Dictionaries • Hand-crafted rules • Machine Learning • Hybrid • … • Combining approaches leads to better results than single algorithms
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion
RE: Problem Definition • Find the relations between NEs if such relations exist • NEs are not always given a priori (open vs. closed RE) • John Petrucci was born in New York. • [John Petrucci, PER] was born in [New York, LOC]. • bornIn([John Petrucci, PER], [New York, LOC]).
RE: Approaches • Hand-crafted rules • Pattern Learning • Coupled Learning
RE: Pattern-based • Hearst patterns [Hearst: COLING'92] • POS-enhanced regular expression matching in natural-language text • NP0 {,} such as {NP1, NP2, … (and|or)} {,} NPn • NP0 {,} {NP1, NP2, … NPn-1} {,} or other NPn • "The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string." → isA("Bambara ndang", "bow lute") • Time-efficient at runtime • Very low recall • Not adaptable to other relations
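A crude surface-level approximation of the "such as" pattern (illustrative only; actual Hearst patterns operate on POS-tagged noun phrases):

```python
import re

# Surface-level "X, such as Y," Hearst pattern over raw text.
SUCH_AS = re.compile(r"(?:The |the )?([\w ]+?), such as (?:the )?([\w ]+?),")

text = ("The bow lute, such as the Bambara ndang, is plucked and has an "
        "individual curved neck for each string.")
m = SUCH_AS.search(text)
if m:
    hypernym, hyponym = m.group(1), m.group(2)
    print(f'isA("{hyponym}", "{hypernym}")')
# isA("Bambara ndang", "bow lute")
```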
RE: DIPRE • DIPRE = Dual Iterative Pattern Relation Extraction • Semi-supervised, iterative gathering of facts and patterns • Positive & negative examples as seeds for a given target relation • e.g. +(Hillary, Bill); +(Carla, Nicolas); –(Larry, Google) • Various tuning parameters for pruning low-confidence patterns and facts • Extended to SnowBall / QXtract [Diagram: seeds (Hillary, Bill), (Carla, Nicolas) yield patterns such as "X and her husband Y", "X and Y on their honeymoon"; the patterns yield new pairs (Angelina, Brad), (Victoria, David); candidate patterns such as "X and Y and their children", "X has been dating with Y", "X loves Y" are pruned using counter-seeds like (Larry, Google)]
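A schematic sketch of one DIPRE-style bootstrapping iteration over a toy corpus (corpus, seeds, and pattern generalization are heavily simplified):

```python
import re

# One DIPRE-style bootstrapping iteration: seeds -> patterns -> new facts.
# The corpus and seed pairs are toy examples.
corpus = [
    "Hillary and her husband Bill appeared together.",
    "Carla and her husband Nicolas visited Rome.",
    "Angelina and her husband Brad attended the premiere.",
]
seeds = {("Hillary", "Bill"), ("Carla", "Nicolas")}

def learn_patterns(corpus, seeds):
    patterns = set()
    for sentence in corpus:
        for x, y in seeds:
            if x in sentence and y in sentence:
                # Generalize the text between the two seed arguments.
                infix = sentence.split(x, 1)[1].split(y, 1)[0]
                patterns.add(re.escape(infix))
    return patterns

def apply_patterns(corpus, patterns):
    facts = set()
    for sentence in corpus:
        for p in patterns:
            m = re.search(r"(\w+)" + p + r"(\w+)", sentence)
            if m:
                facts.add((m.group(1), m.group(2)))
    return facts

patterns = learn_patterns(corpus, seeds)         # {' and her husband '}
print(apply_patterns(corpus, patterns) - seeds)  # {('Angelina', 'Brad')}
```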
RE: NELL • Never-Ending Language Learner (http://rtw.ml.cmu.edu/) • Open IE with an ontological backbone • Closed set of categories & typed relations • Seeds/counter-seeds (5-10) • Open set of predicate arguments (instances) • Coupled iterative learners • Constantly running over a large Web corpus since January 2010 (200 million pages) • Periodic human supervision • athletePlaysForTeam(Athlete, SportsTeam) • athletePlaysForTeam(Alex Rodriguez, Yankees) • athletePlaysForTeam(Alexander_Ovechkin, Penguins)
RE: NELL • Conservative strategy • Avoid Semantic Drift
RE: BOA • Bootstrapping Linked Data (http://boa.aksw.org) • Core idea: Use instance data in the Data Web to discover NL patterns and new instances
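A minimal sketch of the core idea, assuming known (subject, object) label pairs for one relation (the triples, sentences, and threshold are illustrative, not BOA's actual pipeline):

```python
# BOA-style pattern harvesting sketch: take (subject, object) label pairs for a
# known relation from a knowledge base and collect the strings between them.
# The pairs and sentences below are illustrative.
birth_place = [("John Petrucci", "New York"), ("Angela Merkel", "Hamburg")]
sentences = [
    "John Petrucci was born in New York.",
    "Angela Merkel was born in Hamburg and grew up elsewhere.",
]

patterns = {}
for subj, obj in birth_place:
    for s in sentences:
        if subj in s and obj in s:
            infix = s.split(subj, 1)[1].split(obj, 1)[0]
            patterns[infix] = patterns.get(infix, 0) + 1

# Keep only frequent patterns, mirroring BOA's threshold-based strategy.
print({p: c for p, c in patterns.items() if c >= 2})
# {' was born in ': 2}
```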
RE: BOA • Follows a conservative strategy • Only top pattern • Frequency threshold • Score threshold • [Chart: evaluation results]
RE: Summary • Several approaches • Hand-crafted rules • Machine Learning • Hybrid • Large number of instances available for many relations • Runtime problem → parallel implementations • Many new facts can be found • Semantic Drift • Long tail • → Entity Disambiguation
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion
ED: Problem Definition • Given (a) reference knowledge base(s), a text fragment, a list of NEs (incl. position), and a list of relations, find URIs for each of the NEs and relations • Very difficult problem • Ambiguity, e.g., Paris = Paris Hilton? Paris (France)? • Difficult even for humans, e.g., • Paris' mayor died yesterday • Several solutions • Indexing • Surface Forms • Graph-based
ED: Problem Definition • John Petrucci was born in New York. → :John_Petrucci dbo:birthPlace :New_York . • [John Petrucci, PER] was born in [New York, LOC]. • bornIn([John Petrucci, PER], [New York, LOC]).
ED: Indexing • More retrieval than disambiguation • Similar to dictionary-based approaches • Idea • Index all labels in the reference knowledge base • Given an input label, retrieve all entities with a similar label • Poor recall (unknown surface forms, e.g., "Mme Curie" for "Marie Curie") • Low precision (Paris = Paris Hilton, Paris (France), …)
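A minimal label-index lookup sketch (the label-to-URI table is an illustrative stand-in for a real knowledge-base index):

```python
# Label-index lookup: map normalized labels to candidate URIs.
label_index = {
    "paris": ["dbr:Paris", "dbr:Paris_Hilton", "dbr:Paris,_Ontario"],
    "marie curie": ["dbr:Marie_Curie"],
}

def lookup(surface_form):
    return label_index.get(surface_form.lower(), [])

print(lookup("Paris"))      # three candidates -> low precision
print(lookup("Mme Curie"))  # [] -> unknown surface form, poor recall
```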
ED: Type Disambiguation • Extension of indexing • Index all labels • Infer type information • Retrieve labels from entities of the given type • Same recall as the previous approach • Higher precision • Paris[LOC] != Paris[PER] • Still, Paris (France) vs. Paris (Ontario) • Need for context
ED: Spotlight • Known surface forms (http://dbpedia.org/spotlight) • Based on DBpedia + Wikipedia • Uses supplementary knowledge including disambiguation pages, redirects, wikilinks • Three main steps • Spotting: Finding possible mentions of DBpedia resources, e.g., [John Petrucci] was born in [New York]. • Candidate Selection: Find possible URIs, e.g., John Petrucci → :John_Petrucci; New York → :New_York, :New_York_County, … • Disambiguation: Map context to a vector for each resource, e.g., New York → :New_York
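A toy sketch of the disambiguation step as context-vector comparison (candidates and vectors are illustrative, not Spotlight's actual model):

```python
from collections import Counter
from math import sqrt

# Pick the candidate URI whose context vector is most similar (cosine) to the
# mention's context. All vectors are illustrative.
candidate_contexts = {
    ":New_York": Counter({"city": 3, "born": 2, "state": 2}),
    ":New_York_County": Counter({"county": 3, "manhattan": 2}),
}

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

mention_context = Counter("john petrucci was born in the city".split())
best = max(candidate_contexts,
           key=lambda uri: cosine(mention_context, candidate_contexts[uri]))
print(best)  # :New_York
```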
ED: YAGO2 • Joint Disambiguation • Example: ♬ Mississippi, one of Bob's later songs, was first recorded by Sheryl on her album.
ED: YAGO2 • [Diagram: mentions of entities (Mississippi, Bob, Sheryl) linked to entity candidates (Mississippi (Song), Mississippi (State), Bob Dylan, Sheryl Cruz, Sheryl Lee, Sheryl Crow); mention-entity edges weighted by prior(ml, ei) and sim(cxt(ml), cxt(ei)); entity-entity edges weighted by coh(ei, ej)] • Objective: Maximize an objective function (e.g., total edge weight) • Constraint: Keep at least one entity per mention
ED: FOX • Generic Approach • A-priori score (a): Popularity of URIs • Similarity score (s): Similarity of resource labels and text • Coherence score (z): Correlation between URIs • [Diagram: candidate nodes annotated with a|s, edges between candidates annotated with z]
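A sketch of combining the three scores into one candidate ranking (the linear combination, weights, and all score values are illustrative, not FOX's actual method):

```python
# Combine a-priori (a), similarity (s), and coherence (z) scores per candidate.
# Weights and scores are illustrative only.
def combined_score(apriori, similarity, coherence, weights=(0.3, 0.4, 0.3)):
    wa, ws, wz = weights
    return wa * apriori + ws * similarity + wz * coherence

candidates = {
    ":New_York": (0.9, 0.8, 0.7),        # popular, good label match, coherent
    ":New_York_County": (0.4, 0.8, 0.2),
}
best = max(candidates, key=lambda uri: combined_score(*candidates[uri]))
print(best)  # :New_York
```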
ED: FOX • Allows the use of several algorithms • HITS • PageRank • Apriori • Propagation Algorithms • …
ED: Summary • Difficult problem even for humans • Several approaches • Simple search • Search with restrictions • Known surface forms • Graph-based • Improved F-Score for DBpedia (70-80%) • Low F-Score for generic knowledge bases • Intrinsically difficult • Still a lot to do
Overview • Problem Definition • Named Entity Recognition • Algorithms • Ensemble Learning • Relation Extraction • General approaches • OpenIE approaches • Entity Disambiguation • URI Lookup • Disambiguation • Conclusion