
Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim Finin June 29, 2010

T2LD – An automatic framework for extracting, interpreting and representing tables as linked data.


Presentation Transcript


  1. T2LD – An automatic framework for extracting, interpreting and representing tables as linked data Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim Finin June 29, 2010

  2. Contribution - Tables to Linked Data • http://dbpedia.org/ontology/PopulatedPlace • Find relationships between columns: LargestCity • Link cell value to an entity: http://dbpedia.org/resource/Baltimore

  3. Contribution - Tables to Linked Data @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . @prefix dbpedia: <http://dbpedia.org/resource/> . @prefix dbpedia-owl: <http://dbpedia.org/ontology/> . “City”@en is rdfs:label of dbpedia-owl:City . “State”@en is rdfs:label of dbpedia-owl:AdministrativeRegion . “Baltimore”@en is rdfs:label of dbpedia:Baltimore . dbpedia:Baltimore a dbpedia-owl:City . … …

  4. A thousand reasons why it’s important… • Generate linked RDF for the Semantic Web • Enrich facts and knowledge that already exist on the Semantic Web • Add new facts and knowledge to the Semantic Web • Possible use in completing “incomplete tables” • Use in expanding the attributes / columns of a table … and 995 other applications (or more) that will exploit this data

  5. Overview • Introduction • Related Work & Motivation • Tables to linked data • Results • Future Work • Conclusion

  6. Introduction

  7. The World Wide Web … … … … … … … Talk: abc By: xyz Venue: some location … … … … … … Introduction Related Work Tables to Linked Data  Results  Future Work  Conclusion

  8. The World Wide Web … Good for you and me … … not so good for machines Images from http://www.bbc.co.uk/blogs/radiolabs/s5/linked-data/s5.html

  9. Web of Data – The Semantic Web Image – www.linkeddata.org

  10. Linked Data Every resource has a URI: Baltimore: http://dbpedia.org/resource/Baltimore The principles of Linked Data outline the best practices to share and expose structured data on the World Wide Web.

  11. Related Work and Motivation

  13. Chicken? No – Egg … No – Chicken … • More than a trillion documents on the Web • ~ 14.1 billion tables, 154 million with high quality relational data (Cafarella et al. 2008) • Where is the structured data?

  14. Automate the process • Not practical for humans to encode all this into RDF manually • We need systems that can generate data from existing sources

  15. In Databases and Web Systems … • Understanding tables for Data Integration (Ziegler & Dittrich 2004), (Pantel, Philpot, & Hovy 2005) • Learning to index tables to improve search experience (Cafarella et al. 2008) • Expanding attributes (columns) of web tables (Lin et al. 2010)

  16. On the Semantic Web • Database to Ontology mapping (Barrasa, Óscar Corcho, & Gómez-Pérez 2004), (Hu & Qu 2007), (Papapanagiotou et al. 2006), and (Lawrence 2004) • W3C working group – RDB2RDF !!! • First working draft – June 8, 2010

  17. On the Semantic Web • Mapping spreadsheets to RDF • Systems like RDF123 (Han et al. 2008) allow users to convert spreadsheets to RDF • Such systems are practical and helpful but … • Require significant manual work • Do not generate linked data

  18. On the Semantic Web • Han et al. 2009 addressed the problem of recommending a set of terms to use to describe the objects and relationships in the table • Did not focus on the overall interpretation of a table • Did not attempt to understand and link cell values

  19. An overall interpretation @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . @prefix dbpedia: <http://dbpedia.org/resource/> . @prefix dbpedia-owl: <http://dbpedia.org/ontology/> . “City”@en is rdfs:label of dbpedia-owl:City . “State”@en is rdfs:label of dbpedia-owl:AdministrativeRegion . “Baltimore”@en is rdfs:label of dbpedia:Baltimore . dbpedia:Baltimore a dbpedia-owl:City . … …

  20. Tables to Linked Data

  21. T2LD Framework • Input: Table Headers and Rows • Predict Class for Columns • Linking the table cells • Identify and Discover relations • Output: Linked Data Representation of a Table

  22. An overview • Input: Table Headers and Rows • Query Knowledge base • Predict Class for Columns • Re-query Knowledge base using the new evidence • Link cell value to an entity using the new results obtained • Identify Relationships between columns • Output: Linked Data

  23. T2LD Framework • Input: Table Headers and Rows • Predict Class for Columns • Linking the table cells • Identify and Discover relations • Output: Linked Data Representation of a Table

  24. Querying the Knowledge-Base • Diagram labels: Type, Yago, Wikitology, Instance • For every cell from the column: Cell Value + Column Header + Row Content • Top N entities, Their Types, Google Page Rank (We use N = 5)

  25. Querying the Knowledge-Base • 1. Baltimore, Types, Page Rank 2. Baltimore County, Maryland, Types, Page Rank 3. John Baltimore, Types, Page Rank • 1. Boston, Types, Page Rank 2. Boston_(band), Types, Page Rank 3. Boston_University, Types, Page Rank • 1. New_York_City, Types, Page Rank 2. New_York, Types, Page Rank 3. New_York_(album), Types, Page Rank

  26. Set of Classes • Types for Baltimore: {dbpedia-owl:Place, dbpedia-owl:Area} • Types for Baltimore County: {dbpedia-owl:Place, dbpedia-owl:Area} • Types for John Baltimore: {yago:AmericanConductors, yago:LivingPeople} • Types for Boston: {dbpedia-owl:Place, dbpedia-owl:PopulatedPlace} • Types for Boston_band: {dbpedia-owl:Band, dbpedia-owl:Organisation} . . . • Set of classes for a column: {dbpedia-owl:Place, dbpedia-owl:Area, yago:AmericanConductors, yago:LivingPeople, dbpedia-owl:PopulatedPlace, dbpedia-owl:Band, dbpedia-owl:Organisation, . . . }
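
As a rough illustration of the slide above, the candidate class set for a column can be built as the union of the types of every entity returned for every cell. The sketch below is not the thesis code: the `candidates` structure and the function name `column_class_set` are assumptions made for illustration, and the data is limited to the entities and types listed on the slide.

```python
# Candidate entities per cell with their types, copied from the slide.
# The dict layout is an assumption made for this sketch.
candidates = {
    "Baltimore": [
        ("Baltimore", {"dbpedia-owl:Place", "dbpedia-owl:Area"}),
        ("Baltimore County", {"dbpedia-owl:Place", "dbpedia-owl:Area"}),
        ("John Baltimore", {"yago:AmericanConductors", "yago:LivingPeople"}),
    ],
    "Boston": [
        ("Boston", {"dbpedia-owl:Place", "dbpedia-owl:PopulatedPlace"}),
        ("Boston_(band)", {"dbpedia-owl:Band", "dbpedia-owl:Organisation"}),
    ],
}


def column_class_set(candidates):
    """Return the union of all types seen for any candidate entity in the column."""
    classes = set()
    for entities in candidates.values():
        for _name, types in entities:
            classes |= types
    return classes


classes = column_class_set(candidates)
```

Every class in this set then becomes a candidate label for the column and is scored in the ranking step.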

  27. Ranking the Classes Create a pairing of all the class labels and strings in a column [Baltimore, dbpedia-owl:Place] [Boston, dbpedia-owl:Place] [New York, dbpedia-owl:Place] [Baltimore, dbpedia-owl:PopulatedPlace] [Boston, dbpedia-owl:PopulatedPlace] … … [Baltimore, dbpedia-owl:Band] … …

  28. Ranking the Classes • Assign a score to every pair based on: • The rank of the entity that matches the class label • Predicted Google Page Rank • We use the following formula: • Score = w x (1 / R) + (1 – w) x (Normalized Google Page Rank) • We use w = 0.25

  29. Ranking the Classes E.g. Processing class “dbpedia-owl:Area” String Baltimore: (R = 1) Baltimore {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 6] (R = 2) Baltimore County {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 4] (R = 3) John Baltimore {yago:AmericanConductors, yago:LivingPeople} [PR = 5] Score = w x (1 / R) + (1 – w) x (Normalized Page Rank) [Baltimore, dbpedia-owl:Area] = (0.25 x 1 / 1) + (0.75 x 6 / 7) ≈ 0.893 E.g. Processing class “dbpedia-owl:Band” String Baltimore: (R = 1) Baltimore {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 6] (R = 2) Baltimore County {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 4] (R = 3) John Baltimore {yago:AmericanConductors, yago:LivingPeople} [PR = 5] [Baltimore, dbpedia-owl:Band] = 0 [Since the class does not match any of the entities for Baltimore]
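
The worked example above can be reproduced with a few lines of Python. This is a minimal sketch rather than the thesis implementation. Two points are assumptions: the score is taken from the highest-ranked entity whose types contain the class, and the page rank is normalized by dividing by 7, a denominator read off the slide's own arithmetic (6 / 7). The names `pair_score` and `pr_max` are hypothetical.

```python
def pair_score(cls, ranked_entities, w=0.25, pr_max=7):
    """Score a (cell string, class) pair.

    ranked_entities: list of (types, page_rank) tuples, best rank first.
    Assumption: the first (highest-ranked) entity whose types include
    `cls` determines the score; if no entity matches, the score is 0.
    """
    for rank, (types, page_rank) in enumerate(ranked_entities, start=1):
        if cls in types:
            return w * (1 / rank) + (1 - w) * (page_rank / pr_max)
    return 0.0


# Ranked results for the string "Baltimore", with PR values from the slide.
baltimore = [
    ({"dbpedia-owl:Place", "dbpedia-owl:Area"}, 6),
    ({"dbpedia-owl:Place", "dbpedia-owl:Area"}, 4),
    ({"yago:AmericanConductors", "yago:LivingPeople"}, 5),
]

area_score = pair_score("dbpedia-owl:Area", baltimore)  # about 0.893
band_score = pair_score("dbpedia-owl:Band", baltimore)  # 0.0, no entity is a Band
```

With w = 0.25 the page-rank term dominates, so a popular entity in second place can outscore an obscure one in first place.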

  30. Predicting the Classes • Select the class that maximizes the sum of its scores over the entire column • E.g. Sum for dbpedia-owl:Area • [Baltimore, dbpedia-owl:Area] + [Boston, dbpedia-owl:Area] + [New York, dbpedia-owl:Area] = 2.85 • Sum for dbpedia-owl:Band • [Baltimore, dbpedia-owl:Band] + [Boston, dbpedia-owl:Band] + [New York, dbpedia-owl:Band] = 0.25
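
The selection rule on this slide amounts to summing the pair scores per class and taking the maximum. The sketch below shows that step only. The function name `predict_class` is hypothetical, and apart from the Baltimore scores the per-cell values are illustrative placeholders, not figures from the thesis, so the sums will not match the 2.85 and 0.25 quoted above.

```python
from collections import defaultdict


def predict_class(pair_scores):
    """Sum scores per class over the column and return (best_class, totals)."""
    totals = defaultdict(float)
    for (_cell, cls), score in pair_scores.items():
        totals[cls] += score
    best = max(totals, key=totals.get)
    return best, dict(totals)


# (cell string, class) -> score. Only the Baltimore entries follow the
# worked example; the others are made-up values for illustration.
pair_scores = {
    ("Baltimore", "dbpedia-owl:Area"): 0.893,
    ("Boston", "dbpedia-owl:Area"): 0.9,
    ("New York", "dbpedia-owl:Area"): 0.95,
    ("Baltimore", "dbpedia-owl:Band"): 0.0,
    ("Boston", "dbpedia-owl:Band"): 0.25,
    ("New York", "dbpedia-owl:Band"): 0.0,
}

best_class, totals = predict_class(pair_scores)
```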

  31. Predicting the Classes • We predict classes from four vocabularies: DBpedia Ontology, Freebase, WordNet and Yago [City, dbpedia-owl:Area] = 1 [City, dbpedia-owl:PopulatedPlace] = 0.9 [City, dbpedia-owl:Band] = 0.2 [City, yago:LivingPeople] = 0.23

  32. The underlying query process …

  33. Mapping Table to Wikipedia

  34. Mapping Table to Wikipedia Linked Concepts Types Property Values

  35. Summary of the Query

  36. Extracting Types from DBpedia • Types for Annapolis: SPARQL Query • Query redirects too … … to avoid disparity in KBs

  37. T2LD Framework • Input: Table Headers and Rows • Predict Class for Columns • Linking the table cells • Identify and Discover relations • Output: Linked Data Representation of a Table

  38. Approach • Requery KB with predicted class labels as additional evidence: Table Cell + Column Header + Row Data + Column Type • Generate a feature vector for the top N results of the query • Classifier ranks the entities within the set of possible results • Select the highest ranked entity • Classifier decides whether to link or not • Link to the top ranked instance • Link to “NIL”

  39. Re-querying KB • Use of predicted class labels as “additional evidence”: WordNet:City, http://dbpedia.org/ontology/City, Yago:CitiesinUnitedStates, Freebase:Location • Class labels are mapped to typesRef field • Restricts the types of the results returned to the predicted class labels

  40. Summary of the re-query

  41. Learning to Rank • We trained an SVMrank classifier which learnt to rank entities within a given set • Feature Vector • Similarity Measures: Levenshtein distance, Dice Score • Popularity Measures: Wikitology Score, PageRank, Page Length
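
For readers unfamiliar with the two similarity measures named above, here is a small self-contained sketch of each. It is not taken from the thesis. In particular, the slides do not say what units the Dice score is computed over, so comparing sets of lowercase character bigrams is an assumption of this sketch.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def dice(a, b):
    """Dice coefficient over sets of character bigrams (an assumed tokenization)."""
    def bigrams(s):
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


dist = levenshtein("Baltimore", "Baltimore County")  # 7 appended characters
sim = dice("Baltimore", "Baltimore")                 # identical strings
```

In a feature vector these would sit next to the popularity measures (Wikitology score, PageRank, page length), which come from the knowledge base and are not computed here.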

  42. “To Link or not to Link …” • The highest ranked entity may not be the correct one to link to … • Because the string we are querying may not be in the KB • Top N results may not include the correct answer • We trained an SVM classifier which would determine whether to link to the top one or not

  43. “To Link or not to Link …” • Feature vector included the feature vector of the top ranked entity and two additional features: • The SVMrank score of the top ranked entity • The difference in scores between the top two ranked entities
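
The feature vector for the link / no-link decision can be sketched as follows. This is a guess at the data layout, not the thesis code: the function name, the input format, and the choice to treat a missing runner-up as a score of 0.0 are all assumptions, and the numeric values are invented for illustration.

```python
def link_decision_features(ranked):
    """Build the link/no-link feature vector.

    ranked: list of (features, svmrank_score) tuples, best first.
    Returns the top entity's features plus two extra values: its SVMrank
    score and the score gap to the runner-up. Assumption: with a single
    candidate, the runner-up score is taken to be 0.0.
    """
    top_features, top_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    return list(top_features) + [top_score, top_score - runner_up_score]


# Hypothetical values: [levenshtein, dice, popularity] plus SVMrank score.
ranked = [
    ([0.0, 1.0, 0.8], 2.5),
    ([3.0, 0.4, 0.2], 1.0),
]
features = link_decision_features(ranked)
```

A small gap between the top two scores suggests the ranker was unsure, which is presumably why the difference is useful evidence for linking to “NIL” instead.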

  44. T2LD Framework • Input: Table Headers and Rows • Predict Class for Columns • Linking the table cells • Identify and Discover relations • Output: Linked Data Representation of a Table

  45. Relation between columns

  46. Relation between columns • Maryland - Baltimore • Massachusetts - Boston • New York - New York • Candidate relations: dbonto:Capital, dbonto:LargestCity

  47. Scoring the relations • Candidates: dbonto:Capital, dbonto:LargestCity • Maryland - Baltimore • Massachusetts - Boston • New York - New York • dbonto:LargestCity Score: 3 • dbonto:Capital Score: 1
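
One plausible reading of the scoring slide is a simple vote: each row contributes one point to every relation that holds between its two cells, and the relation with the most points wins. The sketch below implements that reading. The per-row assignment of relations is an assumption (the slides show only the totals, 3 and 1), and `score_relations` is a hypothetical name.

```python
from collections import Counter


def score_relations(row_relations):
    """Count, for each relation, the number of rows in which it was found."""
    counts = Counter()
    for relations in row_relations.values():
        counts.update(set(relations))  # one vote per relation per row
    return counts


# Relations assumed to be returned by the knowledge base for each row pair.
row_relations = {
    ("Maryland", "Baltimore"): ["dbonto:LargestCity"],
    ("Massachusetts", "Boston"): ["dbonto:Capital", "dbonto:LargestCity"],
    ("New York", "New York"): ["dbonto:LargestCity"],
}

scores = score_relations(row_relations)
best_relation = scores.most_common(1)[0][0]
```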

  48. Relation between columns Select * where {<http://dbpedia.org/resource/Maryland> ?relation <http://dbpedia.org/resource/Baltimore> } Select * where {<http://dbpedia.org/resource/Maryland> ?relation “Baltimore”@en } • Query the second column both as a URI and as a literal string • Check all redirects when querying with URI • Check all other common names when querying with literal string
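
The two query shapes above can be generated from a linked row pair with a small helper. This sketch only builds the query strings and does not contact any endpoint. The function name `relation_queries` is hypothetical, and no escaping of special characters in the label is attempted.

```python
def relation_queries(subject_uri, object_uri, object_label, lang="en"):
    """Return (query_by_uri, query_by_literal) for one pair of linked cells."""
    by_uri = (
        f"SELECT * WHERE {{ <{subject_uri}> ?relation <{object_uri}> }}"
    )
    by_literal = (
        f'SELECT * WHERE {{ <{subject_uri}> ?relation "{object_label}"@{lang} }}'
    )
    return by_uri, by_literal


by_uri, by_literal = relation_queries(
    "http://dbpedia.org/resource/Maryland",
    "http://dbpedia.org/resource/Baltimore",
    "Baltimore",
)
```

Following the slide, the same pair of queries would be repeated for each redirect of the object URI and for each alternative common name of the literal.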

  49. T2LD Framework • Input: Table Headers and Rows • Predict Class for Columns • Linking the table cells • Identify and Discover relations • Output: Linked Data Representation of a Table

  50. An example @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . @prefix dbpedia: <http://dbpedia.org/resource/> . @prefix dbpedia-owl: <http://dbpedia.org/ontology/> . @prefix dbpprop: <http://dbpedia.org/property/> . “City”@en is rdfs:label of dbpedia-owl:City . “State”@en is rdfs:label of dbpedia-owl:AdministrativeRegion . “Baltimore”@en is rdfs:label of dbpedia:Baltimore . dbpedia:Baltimore a dbpedia-owl:City . “MD”@en is rdfs:label of dbpedia:Maryland . dbpedia:Maryland a dbpedia-owl:AdministrativeRegion . dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion . dbpprop:LargestCity rdfs:range dbpedia-owl:City . • dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion . • The subjects of the triples using the property have to be instances of dbpedia-owl:AdministrativeRegion • dbpprop:LargestCity rdfs:range dbpedia-owl:City . • The objects of the triples using the property have to be instances of dbpedia-owl:City • “City”@en is rdfs:label of dbpedia-owl:City . • “City” is the common / human name for the class dbpedia-owl:City • dbpedia:Baltimore a dbpedia-owl:City . • dbpedia:Baltimore is of type (is an instance of) dbpedia-owl:City
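
The statements in this example can be emitted mechanically once a cell has been linked. The sketch below is an illustration only. The slide writes labels in the inverse “is rdfs:label of” form, which reads like N3 notation, whereas this sketch assumes the subject-first form used in plain Turtle. The function name `cell_triples` is hypothetical, and prefix declarations are assumed to be written separately.

```python
def cell_triples(label, entity, cls, lang="en"):
    """Return Turtle-style statements linking a cell string to an entity and class."""
    return [
        f'{entity} rdfs:label "{label}"@{lang} .',
        f"{entity} a {cls} .",
    ]


triples = cell_triples("Baltimore", "dbpedia:Baltimore", "dbpedia-owl:City")
triples += cell_triples("MD", "dbpedia:Maryland", "dbpedia-owl:AdministrativeRegion")
```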
