T2LD – An automatic framework for extracting, interpreting and representing tables as linked data. Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim Finin June 29, 2010. Contribution - Tables to Linked Data. http://dbpedia.org/ontology/PopulatedPlace.
T2LD – An automatic framework for extracting, interpreting and representing tables as linked data
Varish Mulwad
Master’s Thesis Defense
Advisor: Dr. Tim Finin
June 29, 2010
Contribution - Tables to Linked Data
• Class for a column: http://dbpedia.org/ontology/PopulatedPlace
• Find relationships between columns: LargestCity
• Link cell value to an entity: http://dbpedia.org/resource/Baltimore
Contribution - Tables to Linked Data
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .
"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
… …
A thousand reasons why it’s important…
• Generate linked RDF for the Semantic Web
• Enrich facts and knowledge that already exist on the Semantic Web
• Add new facts and knowledge to the Semantic Web
• Possible use in completing “incomplete tables”
• Use in expanding the attributes / columns of a table
… and 995 other applications (or more) that will exploit this data
Overview
• Introduction
• Related Work & Motivation
• Tables to linked data
• Results
• Future Work
• Conclusion
The World Wide Web … … … … … … … Talk: abc By: xyz Venue: some location … … … … … … Introduction Related Work Tables to Linked Data Results Future Work Conclusion
The World Wide Web … Good for you and me … … not so good for machines Images from http://www.bbc.co.uk/blogs/radiolabs/s5/linked-data/s5.html
Web of Data – The Semantic Web Image – www.linkeddata.org
Linked Data
• Every resource has a URI, e.g. Baltimore: http://dbpedia.org/resource/Baltimore
• The principles of Linked Data outline the best practices to share and expose structured data on the World Wide Web.
Chicken? No – Egg … No – Chicken …
• More than a trillion documents on the Web
• ~ 14.1 billion tables, 154 million with high quality relational data (Cafarella et al. 2008)
• Where is structured data?
Automate the process
• Not practical for humans to encode all this into RDF manually
• We need systems that can generate data from existing sources
In Databases and Web Systems …
• Understanding tables for Data Integration (Ziegler & Dittrich 2004), (Pantel, Philpot, & Hovy 2005)
• Learning to index tables to improve search experience (Cafarella et al. 2008)
• Expanding attributes (columns) of web tables (Lin et al. 2010)
On the Semantic Web
• Database to Ontology mapping (Barrasa, Corcho, & Gómez-Pérez 2004), (Hu & Qu 2007), (Papapanagiotou et al. 2006), and (Lawrence 2004)
• W3C working group – RDB2RDF !!!
• First working draft – June 8, 2010
On the Semantic Web
• Mapping spreadsheets to RDF
• Systems like RDF123 (Han et al. 2008) allow users to convert spreadsheets to RDF
• Such systems are practical and helpful but …
• Require significant manual work
• Do not generate linked data
On the Semantic Web
• Han et al. (2009) addressed the problem of recommending a set of terms to use to describe the objects and relationships in the table
• Did not focus on the overall interpretation of a table
• Did not attempt to understand and link cell values
An overall interpretation
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .
"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
… …
T2LD Framework
Input: Table Headers and Rows
• Predict Class for Columns
• Linking the table cells
• Identify and Discover relations
Output: Linked Data Representation of a Table
An overview
Input: Table Headers and Rows
• Query Knowledge base
• Predict Class for Columns
• Re-query Knowledge base using the new evidence
• Link cell value to an entity using the new results obtained
• Identify Relationships between columns
Output: Linked Data
Querying the Knowledge Base
• Query, for every cell from the column: Cell Value + Column Header + Row Content
• Knowledge base: Wikitology (Type: Yago; Instance)
• Result: Top N entities, Their Types, Google Page Rank (We use N = 5)
Querying the Knowledge Base
String Baltimore:
1. Baltimore, Types, Page Rank
2. Baltimore County, Maryland, Types, Page Rank
3. John Baltimore, Types, Page Rank
String Boston:
1. Boston, Types, Page Rank
2. Boston_(band), Types, Page Rank
3. Boston_University, Types, Page Rank
String New York:
1. New_York_City, Types, Page Rank
2. New_York, Types, Page Rank
3. New_York_(album), Types, Page Rank
Set of Classes
• Types for Baltimore: {dbpedia-owl:Place, dbpedia-owl:Area}
• Types for Baltimore County: {dbpedia-owl:Place, dbpedia-owl:Area}
• Types for John Baltimore: {yago:AmericanConductors, yago:LivingPeople}
• Types for Boston: {dbpedia-owl:Place, dbpedia-owl:PopulatedPlace}
• Types for Boston_band: {dbpedia-owl:Band, dbpedia-owl:Organisation}
• . . .
Set of classes for a column: {dbpedia-owl:Place, dbpedia-owl:Area, yago:AmericanConductors, yago:LivingPeople, dbpedia-owl:PopulatedPlace, dbpedia-owl:Band, dbpedia-owl:Organisation, . . . }
Ranking the Classes
Create a pairing of all the class labels and strings in a column:
[Baltimore, dbpedia-owl:Place]
[Boston, dbpedia-owl:Place]
[New York, dbpedia-owl:Place]
[Baltimore, dbpedia-owl:PopulatedPlace]
[Boston, dbpedia-owl:PopulatedPlace]
… …
[Baltimore, dbpedia-owl:Band]
… …
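The two steps above (collect the set of classes for a column, then pair every class with every string) can be sketched in a few lines of Python. This is only an illustration: the function names and the `candidate_types` dictionary layout are assumptions made for this sketch, not part of T2LD.

```python
def column_class_set(candidate_types):
    """Union of all types seen for any candidate entity of any cell.

    candidate_types: dict mapping a cell string to a list of type sets,
    one set per candidate entity returned for that cell (hypothetical layout).
    """
    classes = set()
    for type_sets in candidate_types.values():
        for types in type_sets:
            classes.update(types)
    return classes


def string_class_pairs(candidate_types):
    """Pair every class label with every cell string in the column."""
    classes = sorted(column_class_set(candidate_types))
    return [(cell, cls) for cls in classes for cell in candidate_types]


# Types copied from the "Set of Classes" slide.
candidate_types = {
    "Baltimore": [
        {"dbpedia-owl:Place", "dbpedia-owl:Area"},          # Baltimore
        {"dbpedia-owl:Place", "dbpedia-owl:Area"},          # Baltimore County
        {"yago:AmericanConductors", "yago:LivingPeople"},   # John Baltimore
    ],
    "Boston": [
        {"dbpedia-owl:Place", "dbpedia-owl:PopulatedPlace"},  # Boston
        {"dbpedia-owl:Band", "dbpedia-owl:Organisation"},     # Boston_band
    ],
}

classes = column_class_set(candidate_types)
pairs = string_class_pairs(candidate_types)
```

Note that a pair such as [Baltimore, dbpedia-owl:Band] is generated even though no Baltimore candidate has that type; the scoring step is what assigns it a score of 0.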
Ranking the Classes
• Assign a score to every pair based on –
• The rank (R) of the entity that matches the class label
• Predicted Google Page Rank
• We use the following formula –
• Score = w × (1 / R) + (1 – w) × (Normalized Google Page Rank)
• We use w = 0.25
Ranking the Classes
E.g. Processing class “dbpedia-owl:Area” for the string Baltimore:
• (R = 1) Baltimore {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 6]
• (R = 2) Baltimore County {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 4]
• (R = 3) John Baltimore {yago:AmericanConductors, yago:LivingPeople} [PR = 5]
Score = w × (1 / R) + (1 – w) × (Normalized Page Rank)
[Baltimore, dbpedia-owl:Area] = (0.25 × 1 / 1) + (0.75 × 6 / 7) ≈ 0.893
E.g. Processing class “dbpedia-owl:Band” for the string Baltimore (same three candidate entities):
[Baltimore, dbpedia-owl:Band] = 0, since the class does not match any of the entities for Baltimore
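A minimal Python sketch of the pair score, reproducing the Baltimore example. Two points are assumptions of this sketch and are not stated on the slides: the first (highest-ranked) entity whose types contain the class is the one that is scored, and the page rank is normalized by dividing by 7 (the divisor that appears in the worked example). The function and parameter names are hypothetical.

```python
def pair_score(candidates, target_class, w=0.25, page_rank_norm=7.0):
    """Score one [string, class] pair.

    candidates: list of (types, page_rank) tuples, ordered best rank first.
    Returns w * (1 / R) + (1 - w) * (page_rank / page_rank_norm) for the
    first candidate whose types contain target_class, or 0.0 if none match.
    page_rank_norm = 7 is an assumption read off the worked example.
    """
    for rank, (types, page_rank) in enumerate(candidates, start=1):
        if target_class in types:
            return w * (1.0 / rank) + (1.0 - w) * (page_rank / page_rank_norm)
    return 0.0


# Candidate entities for the string "Baltimore", as listed on the slide.
baltimore = [
    ({"dbpedia-owl:Place", "dbpedia-owl:Area"}, 6),          # R = 1, Baltimore
    ({"dbpedia-owl:Place", "dbpedia-owl:Area"}, 4),          # R = 2, Baltimore County
    ({"yago:AmericanConductors", "yago:LivingPeople"}, 5),   # R = 3, John Baltimore
]

area_score = pair_score(baltimore, "dbpedia-owl:Area")   # about 0.893
band_score = pair_score(baltimore, "dbpedia-owl:Band")   # 0.0, no entity matches
```

With w = 0.25 the page rank term carries three quarters of the weight, so a popular entity at a lower rank can still outscore an obscure one at rank 1.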
Predicting the Classes
• Select the class that maximizes the sum of its scores over the entire column
• E.g. Sum for dbpedia-owl:Area: [Baltimore, dbpedia-owl:Area] + [Boston, dbpedia-owl:Area] + [New York, dbpedia-owl:Area] = 2.85
• Sum for dbpedia-owl:Band: [Baltimore, dbpedia-owl:Band] + [Boston, dbpedia-owl:Band] + [New York, dbpedia-owl:Band] = 0.25
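The selection step can be sketched as a sum per class followed by an arg-max. The slide gives only the column totals (2.85 and 0.25); the per-cell scores below are invented for illustration so that they add up to those totals, and the function name is hypothetical.

```python
from collections import defaultdict


def predict_column_class(pair_scores):
    """Pick the class with the largest summed score over the column.

    pair_scores: dict mapping (cell_string, class_label) -> score.
    Returns (best_class, totals); best_class is None for an empty input.
    """
    totals = defaultdict(float)
    for (_cell, class_label), score in pair_scores.items():
        totals[class_label] += score
    if not totals:
        return None, {}
    best_class = max(totals, key=totals.get)
    return best_class, dict(totals)


# Illustrative per-cell scores (assumed values, chosen to match the slide totals).
pair_scores = {
    ("Baltimore", "dbpedia-owl:Area"): 0.90,
    ("Boston", "dbpedia-owl:Area"): 1.00,
    ("New York", "dbpedia-owl:Area"): 0.95,
    ("Baltimore", "dbpedia-owl:Band"): 0.00,
    ("Boston", "dbpedia-owl:Band"): 0.25,
    ("New York", "dbpedia-owl:Band"): 0.00,
}

best_class, totals = predict_column_class(pair_scores)
```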
Predicting the Classes
• We predict classes from four vocabularies – DBpedia Ontology, Freebase, WordNet and Yago
• [City, dbpedia-owl:Area] = 1
• [City, dbpedia-owl:PopulatedPlace] = 0.9
• [City, dbpedia-owl:Band] = 0.2
• [City, yago:LivingPeople] = 0.23
Mapping Table to Wikipedia
Mapping Table to Wikipedia: Linked Concepts, Types, Property Values
Summary of the Query
Extracting Types from DBpedia
• Types for Annapolis: SPARQL Query
• Query redirects too … to avoid disparity in KBs
Approach
• Requery KB with predicted class labels as additional evidence: Table Cell + Column Header + Row Data + Column Type
• Generate a feature vector for the top N results of the query
• Classifier ranks the entities within the set of possible results
• Select the highest ranked entity
• Classifier decides whether to link or not
• Link to the top ranked instance, or link to “NIL”
Re-querying KB
• Use of predicted class labels as “additional evidence”: WordNet:City, http://dbpedia.org/ontology/City, Yago:CitiesinUnitedStates, Freebase:Location
• Class labels are mapped to typesRef field
• Restricts the types of the results returned to the predicted class labels
Summary of the re-query
Learning to Rank
• We trained an SVMrank classifier which learnt to rank entities within a given set
Feature Vector
• Similarity Measures: Levenshtein distance, Dice Score
• Wikitology Score
• Popularity Measures: PageRank, Page Length
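A hedged sketch of how such a feature vector could be assembled. The slides name the features but not how they are computed, so the following are assumptions: the Dice score is taken over lowercase character bigrams, the feature order is arbitrary, and the Wikitology score, PageRank and page length are simply passed in as numbers. All function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def dice_score(a, b):
    """Dice coefficient over sets of lowercase character bigrams (an assumption)."""
    def bigrams(s):
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings shorter than two characters: fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2.0 * len(x & y) / (len(x) + len(y))


def feature_vector(cell, entity_label, wikitology_score, page_rank, page_length):
    """Similarity features first, then the scores supplied by the knowledge base."""
    return [
        levenshtein(cell, entity_label),
        dice_score(cell, entity_label),
        wikitology_score,
        page_rank,
        page_length,
    ]


# Illustrative input values only; they do not come from the thesis.
example = feature_vector("Baltimore", "Baltimore County", 0.8, 4, 12000)
```

One vector like this would be produced per candidate entity and handed to the ranker; training and calling SVMrank itself is outside the scope of this sketch.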
“To Link or not to Link …”
• The highest ranked entity may not be the correct one to link to …
• Because the string we are querying may not be in the KB
• Top N results may not include the correct answer
• We trained an SVM classifier which would determine whether to link to the top one or not
“To Link or not to Link …”
• Feature vector included the feature vector of the top ranked entity and two additional features –
• The SVMrank score of the top ranked entity
• The difference in scores between the top two ranked entities
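The feature vector for the link / no-link classifier can be sketched as below. The input layout (a list of `(feature_vector, svmrank_score)` tuples sorted best first) and the function name are assumptions of this sketch, as is the choice to use 0.0 as the runner-up score when only one candidate exists; the slides do not say how that case is handled.

```python
def link_decision_features(ranked):
    """Build the input for the link / no-link classifier.

    ranked: list of (feature_vector, svmrank_score), best candidate first.
    Returns the top candidate's features, followed by its SVMrank score and
    the score gap between the top two candidates.
    """
    top_features, top_score = ranked[0]
    # Assumption: with a single candidate, compare against a score of 0.0.
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    return list(top_features) + [top_score, top_score - runner_up_score]


# Illustrative values only.
ranked = [
    ([0, 1.0, 0.9, 6, 15000], 2.5),   # top ranked entity
    ([7, 0.7, 0.8, 4, 12000], 1.0),   # second ranked entity
]
decision_input = link_decision_features(ranked)
```

A small gap between the top two scores signals an ambiguous cell, which is a reasonable cue for the classifier to prefer linking to “NIL”.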
Relation between columns
Relation between columns
• Maryland - Baltimore: dbonto:LargestCity
• Massachusetts - Boston: dbonto:LargestCity, dbonto:Capital
• New York - New York: dbonto:LargestCity
Candidate relations: dbonto:Capital, dbonto:LargestCity
Scoring the relations
Candidates: dbonto:Capital, dbonto:LargestCity
• Maryland - Baltimore: dbonto:LargestCity
• Massachusetts - Boston: dbonto:LargestCity, dbonto:Capital
• New York - New York: dbonto:LargestCity
dbonto:LargestCity Score: 3
dbonto:Capital Score: 1
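The scoring shown on this slide amounts to counting, for each candidate relation, the number of rows in which it holds. A minimal sketch follows; the function name is hypothetical, and the assignment of relations to rows is a reading of the slide's diagram (three rows with dbonto:LargestCity, one with dbonto:Capital).

```python
from collections import Counter


def score_relations(relations_per_row):
    """Count in how many rows each candidate relation was found.

    relations_per_row: one iterable of relation names per table row.
    Each relation is counted at most once per row.
    """
    scores = Counter()
    for relations in relations_per_row:
        scores.update(set(relations))
    return scores


relations_per_row = [
    {"dbonto:LargestCity"},                     # Maryland - Baltimore
    {"dbonto:LargestCity", "dbonto:Capital"},   # Massachusetts - Boston
    {"dbonto:LargestCity"},                     # New York - New York
]

scores = score_relations(relations_per_row)
best_relation, best_score = scores.most_common(1)[0]
```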
Relation between columns
Select * where { <http://dbpedia.org/resource/Maryland> ?relation <http://dbpedia.org/resource/Baltimore> }
Select * where { <http://dbpedia.org/resource/Maryland> ?relation "Baltimore"@en }
• Query the second column both as a URI and as a literal string
• Check all redirects when querying with URI
• Check all other common names when querying with literal string
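The pair of queries above can be generated per table row. The sketch below only builds the query strings; it does not contact a SPARQL endpoint, does not escape quotes inside the label, and does not cover the redirect and common-name lookups. The function name and parameters are hypothetical.

```python
def relation_queries(subject_uri, object_uri, object_label, lang="en"):
    """Return (query_by_uri, query_by_literal) for one row of the table."""
    by_uri = (
        f"SELECT * WHERE {{ <{subject_uri}> ?relation <{object_uri}> }}"
    )
    by_literal = (
        f'SELECT * WHERE {{ <{subject_uri}> ?relation "{object_label}"@{lang} }}'
    )
    return by_uri, by_literal


by_uri, by_literal = relation_queries(
    "http://dbpedia.org/resource/Maryland",
    "http://dbpedia.org/resource/Baltimore",
    "Baltimore",
)
```

Every binding of ?relation returned by either query would count as a candidate relation for that row.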
An example
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix dbpprop: <http://dbpedia.org/property/> .
"City"@en is rdfs:label of dbpedia-owl:City .
"State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion .
"Baltimore"@en is rdfs:label of dbpedia:Baltimore .
dbpedia:Baltimore a dbpedia-owl:City .
"MD"@en is rdfs:label of dbpedia:Maryland .
dbpedia:Maryland a dbpedia-owl:AdministrativeRegion .
dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion .
dbpprop:LargestCity rdfs:range dbpedia-owl:City .
• "City"@en is rdfs:label of dbpedia-owl:City .
• “City” is the common / human name for the class dbpedia-owl:City
• dbpedia:Baltimore a dbpedia-owl:City .
• dbpedia:Baltimore is an instance of (has type) dbpedia-owl:City
• dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion .
• The subjects of the triples using the property have to be instances of dbpedia-owl:AdministrativeRegion
• dbpprop:LargestCity rdfs:range dbpedia-owl:City .
• The objects of the triples using the property have to be instances of dbpedia-owl:City