Using linked data to interpret tables. Varish Mulwad , Tim Finin , Zareen Syed and Anupam Joshi University of Maryland, Baltimore County November 8, 2010. Interpreting a table. http://dbpedia.org/class/yago/NationalBasketballAssociationTeams. dbprop:team.
Using linked data to interpret tables Varish Mulwad, Tim Finin, Zareen Syed and Anupam Joshi, University of Maryland, Baltimore County, November 8, 2010
Interpreting a table http://dbpedia.org/class/yago/NationalBasketballAssociationTeams dbprop:team http://dbpedia.org/resource/Allen_Iverson Map numbers as values of properties
Interpreting a table
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix yago: <http://dbpedia.org/class/yago/> .
"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer .
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .
"Michael Jordan"@en is rdfs:label of dbpedia:Michael_Jordan .
dbpedia:Michael_Jordan a dbpedia-owl:BasketballPlayer .
"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago_Bulls .
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .
Use Cases Intelligent querying over data Create a ‘Semantic’ knowledge-base
Use Cases
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix yago: <http://dbpedia.org/class/yago/> .
"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer .
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .
"Michael Jordan"@en is rdfs:label of dbpedia:Michael_Jordan .
dbpedia:Michael_Jordan a dbpedia-owl:BasketballPlayer .
"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago_Bulls .
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .
• Data Integration
• Search / Query over tables
Convert legacy data into Semantic Web formats
Confirm / verify existing knowledge
Add new knowledge to the LOD cloud
We are laying a strong foundation for the Semantic Web … … but an old problem haunts us …
Chicken ? Egg ? … No Chicken ? • ~ 14.1 billion tables, 154 million with high quality relational data (Cafarella et al. 2008) • 305,632 Datasets available as CSV or spreadsheets on Data.gov (US) + 7 Other nations establishing open data • Where is structured data ?
Automate the process • Not practical for humans to encode all this into RDF manually • We need systems that can generate data from existing sources
Related Work • Database to Ontology mapping (Barrasa, Corcho, & Gómez-Pérez 2004), (Hu & Qu 2007), (Papapanagiotou et al. 2006), and (Lawrence 2004) • Mapping Relational databases to RDF [W3C working group - RDB2RDF]
Related Work • Mapping spreadsheets to RDF [RDF123, XLWrap] • Practical and helpful systems but … • Require significant manual work • Do not generate linked data • Interpreting web tables to answer complex search queries over the web tables (Limaye et al. 2010)
T2LD Framework Predict Class for Columns Linking the table cells Identify and Discover relations
Predicting Class Labels for column [Figure: the instances in a column are mapped to candidate classes (Class 1, Class 2, Class 3, Class 4), from which the class for the column is chosen]
Knowledge Base • Yago • Wikitology: a hybrid knowledge base where structured data meets unstructured data (Wikitology was created as part of Zareen Syed's Ph.D. dissertation)
Querying the Knowledge-Base
Top three results for each cell string, followed by the set of types returned for those results:
Chicago: 1. Chicago Bulls 2. Chicago 3. Judy Chicago
Types: {dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams}
Philadelphia: 1. Philadelphia 2. Philadelphia 76ers 3. Philadelphia (film)
Types: {dbpedia-owl:Place, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film, yago:NationalBasketballAssociationTeams, …}
Houston: 1. Houston Rockets 2. Houston 3. Allan Houston
Types: {…}
Scoring the classes
Possible classes for the column: dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film, …
Each cell string is paired with each possible class: [Chicago, dbpedia-owl:City], [Philadelphia, dbpedia-owl:City], [Houston, dbpedia-owl:City], …, [Chicago, dbpedia-owl:Film], [Philadelphia, dbpedia-owl:Film], …
E.g. processing the pair [Chicago, yago:NationalBasketballAssociationTeams]. Results for the string "Chicago", with rank R and page rank PR:
(R = 1) Chicago Bulls {yago:NationalBasketballAssociationTeams} [PR = 6]
(R = 2) Chicago {dbpedia-owl:PopulatedPlace, dbpedia-owl:City} [PR = 5]
(R = 3) Judy Chicago {yago:WomenArtist, yago:LivingPeople} [PR = 4]
Score = w x (1 / R) + (1 - w) x (Normalized Page Rank)
[Chicago, yago:NationalBasketballAssociationTeams] = (0.25 x 1 / 1) + (0.75 x 6 / 7) ≈ 0.893
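The scoring formula on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names `pair_score` and `pr_max` are hypothetical, the divisor 7 is taken from the worked example only, and returning 0.0 when no result carries the class is an assumption the slides do not state.

```python
from typing import List, Set, Tuple

# One knowledge-base result: (rank R starting at 1, page rank PR, class labels).
Result = Tuple[int, int, Set[str]]


def pair_score(results: List[Result], cls: str,
               w: float = 0.25, pr_max: int = 7) -> float:
    """Score one [cell string, class] pair as w * (1/R) + (1 - w) * (PR / pr_max).

    Uses the highest-ranked result that carries the class (assumption).
    """
    for rank, page_rank, classes in sorted(results, key=lambda r: r[0]):
        if cls in classes:
            return w * (1.0 / rank) + (1.0 - w) * (page_rank / pr_max)
    return 0.0  # assumption: a class absent from every result scores zero


# The three results for the string "Chicago" from the slide.
chicago = [
    (1, 6, {"yago:NationalBasketballAssociationTeams"}),
    (2, 5, {"dbpedia-owl:PopulatedPlace", "dbpedia-owl:City"}),
    (3, 4, {"yago:WomenArtist", "yago:LivingPeople"}),
]

nba_score = pair_score(chicago, "yago:NationalBasketballAssociationTeams")
city_score = pair_score(chicago, "dbpedia-owl:City")
print(round(nba_score, 4), round(city_score, 4))
```

With w = 0.25 the page-rank term dominates, so a class attached to a popular but lower-ranked result can still score well.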
T2LD Framework Predict Class for Columns Linking the table cells Identify and Discover relations
Machine Learning based Approach
• Requery the KB with the predicted class labels as additional evidence (table cell + column header + row data + column type)
• Generate a feature vector for each of the top N results of the query
• A classifier ranks the entities within the set of possible results
• Select the highest ranked entity
• A second classifier decides whether to link or not: link to the top ranked instance, or link to "NIL"
Learning to Rank
• We trained an SVMrank classifier that learned to rank entities within a given set
Feature Vector
• Similarity measures: Levenshtein distance, Dice score
• Popularity measures: Wikitology score, PageRank, page length
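The two similarity measures named on this slide can be sketched as follows. This is an illustrative sketch only: the slides do not say whether the Dice score was computed over characters or words, so character bigrams are assumed here, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def dice_score(a: str, b: str) -> float:
    """Dice coefficient over sets of character bigrams (assumed tokenization)."""
    def bigrams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # assumption: two strings without bigrams count as identical
    return 2 * len(x & y) / (len(x) + len(y))


# Compare a table cell string with a candidate entity label.
print(levenshtein("Chicago", "Chicago Bulls"), dice_score("Chicago", "Chicago Bulls"))
```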
"To Link or not to Link …" • A second SVM classifier • Its feature vector included the feature vector of the top ranked entity and two additional features: • The SVMrank score of the top ranked entity • The difference in scores between the top two ranked entities
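The input to the second classifier described above could be assembled roughly like this. The function name and the data layout are hypothetical, and using a score gap of 0.0 when there is only one candidate is an assumption the slides do not cover.

```python
from typing import List, Sequence, Tuple

# One ranked candidate: (its feature vector, its SVMrank score).
Candidate = Tuple[Sequence[float], float]


def link_decision_features(ranked: List[Candidate]) -> List[float]:
    """Build the feature vector for the link / NIL classifier.

    `ranked` must be sorted best-first and contain at least one candidate.
    """
    top_features, top_score = ranked[0]
    if len(ranked) > 1:
        gap = top_score - ranked[1][1]  # margin between the top two entities
    else:
        gap = 0.0  # assumption: no runner-up means no margin
    return list(top_features) + [top_score, gap]


candidates = [([0.1, 0.2], 1.5), ([0.3, 0.4], 1.0)]
print(link_decision_features(candidates))
```

A large gap suggests the ranker was confident about its top choice, and a small gap suggests the cell may be ambiguous, which is presumably why the margin is useful evidence for the link-or-NIL decision.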
T2LD Framework Predict Class for Columns Linking the table cells Identify and Discover relations
Identify Relations [Figure: sets of candidate relations: Rel 'A'; Rel 'A'; Rel 'A', 'C'; Rel 'A', 'B', 'C'; Rel 'A', 'B']
Relation between columns
Row pairs: Michael Jordan - Chicago; Allen Iverson - Philadelphia; Yao Ming - Houston
Candidate relations: dbprop:team, dbprop:draftTeam
dbprop:team is found for every row pair; dbprop:draftTeam is found for one row pair
Scoring the relations
Row pairs: Michael Jordan - Chicago; Allen Iverson - Philadelphia; Yao Ming - Houston
Candidates: dbprop:team, dbprop:draftTeam
dbprop:team Score: 3
dbprop:draftTeam Score: 1
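The relation scores on this slide are consistent with simple vote counting: each row pair contributes one vote to every relation found between its two entities. A minimal sketch of that idea follows. Which row carries dbprop:draftTeam is not recoverable from the slide, so its placement below is an assumption, and `score_relations` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, Set


def score_relations(row_candidates: Iterable[Set[str]]) -> Counter:
    """Count, for each relation, the number of row pairs in which it holds."""
    votes = Counter()
    for candidates in row_candidates:
        votes.update(set(candidates))  # at most one vote per relation per row
    return votes


# Relations found in the knowledge base for each row pair (placement assumed).
rows = [
    {"dbprop:team"},                      # Michael Jordan - Chicago
    {"dbprop:team", "dbprop:draftTeam"},  # Allen Iverson - Philadelphia
    {"dbprop:team"},                      # Yao Ming - Houston
]

scores = score_relations(rows)
best_relation = scores.most_common(1)[0][0]
print(scores, best_relation)
```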
T2LD Framework Predict Class for Columns Linking the table cells Identify and Discover relations
Table as linked RDF
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix yago: <http://dbpedia.org/class/yago/> .
"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer .
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .
"Michael Jordan"@en is rdfs:label of dbpedia:Michael_Jordan .
dbpedia:Michael_Jordan a dbpedia-owl:BasketballPlayer .
"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago_Bulls .
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams .
Reading the statements:
"Team"@en is rdfs:label of dbpedia-owl:Team . means "Team" is the common / human name for the class dbpedia-owl:Team
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams . means dbpedia:Chicago_Bulls is of type (an instance of) yago:NationalBasketballAssociationTeams
Dataset summary * The number in the brackets indicates # excluding columns that contained numbers
Evaluation # 1 (MAP) • Compared the system’s ranked list of labels against a human ranked list of labels • Metric - Mean Average Precision (MAP) • Commonly used in the Information Retrieval domain to compare two ranked sets
Evaluation # 1 (MAP)
MAP: 80.76 %
System Ranked: 1. Person 2. Politician 3. President
Evaluator Ranked: 1. President 2. Politician 3. OfficeHolder
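For reference, the textbook definition of average precision can be sketched as below. The slides do not say which variant the authors used, so this is the standard information-retrieval form (MAP is the mean of this value over all columns), and the function name is hypothetical.

```python
from typing import List, Set


def average_precision(system_ranked: List[str], relevant: Set[str]) -> float:
    """Standard average precision of one ranked list against a relevant set."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, label in enumerate(system_ranked, start=1):
        if label in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this hit
    return precision_sum / len(relevant)


# The example lists from the slide.
system = ["Person", "Politician", "President"]
evaluator = {"President", "Politician", "OfficeHolder"}
ap = average_precision(system, evaluator)
print(round(ap, 4))
```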
Evaluation # 2 (Recall)
System Ranked: 1. Person 2. Politician 3. President
Evaluator Ranked: 1. President 2. Politician 3. OfficeHolder
Recall > 0.6 (75 %)
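Recall for a single column can be sketched as the fraction of the evaluator's labels that also appear in the system's list. This is the usual definition rather than a confirmed detail of the authors' evaluation, and the function name is hypothetical.

```python
from typing import Iterable


def recall(system_labels: Iterable[str], evaluator_labels: Iterable[str]) -> float:
    """Fraction of evaluator labels that the system also returned."""
    expected = set(evaluator_labels)
    if not expected:
        return 0.0
    return len(expected & set(system_labels)) / len(expected)


# The example lists from the slide: two of the three evaluator labels are found.
column_recall = recall(
    ["Person", "Politician", "President"],
    ["President", "Politician", "OfficeHolder"],
)
print(round(column_recall, 3))
```

For the example on the slide, two of three labels match, which puts this column above the 0.6 threshold.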
Evaluation # 3 (Correctness) • Evaluated whether our predicted class labels were "fair and correct" • A class label may not be the most accurate one, but may still be correct. • E.g. dbpedia-owl:PopulatedPlace is not the most accurate, but still a correct label for a column of cities • Three human judges evaluated our predicted class labels
Evaluation # 3 (Correctness)
Column: Nationality, Prediction: MilitaryConflict
Column: Birth Place, Prediction: PopulatedPlace
Overall Accuracy: 76.92 %
• A category-wise breakdown for class label correctness
Category-wise accuracy for linking table cells Overall Accuracy: 66.12 %
Relation between columns • Idea – Ask human evaluators to identify relations between columns in a given table • Pilot Experiment – Asked three evaluators to annotate five random tables from our dataset • Evaluators identified 20 relations • Our accuracy – 5 out of 20 (25 % ) were correct
Conclusion • We have demonstrated that it is possible to develop an automated framework for converting tables & spreadsheets to linked data • Extending and adapting this framework for Open government data • Discovery of new relations between entities
References
• Cafarella, M. J., Halevy, A., Wang, D. Z., Wu, E., Zhang, Y., 2008. WebTables: exploring the power of tables on the web. Proc. VLDB Endow. 1 (1), 538-549.
• Barrasa, J., Corcho, O., Gómez-Pérez, A., 2004. R2O, an extensible and semantically based database-to-ontology mapping language. In: Proceedings of the 2nd Workshop on Semantic Web and Databases (SWDB2004). Vol. 3372. pp. 1069-1070.
• Hu, W., Qu, Y., 2007. Discovering simple mappings between relational database schemas and ontologies. In: Aberer, K., Choi, K.-S., Noy, N. F., Allemang, D., Lee, K.-I., Nixon, L. J. B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudre-Mauroux, P. (eds.), ISWC/ASWC, volume 4825 of Lecture Notes in Computer Science, 225-238. Springer.
• Papapanagiotou, P., Katsiouli, P., Tsetsos, V., Anagnostopoulos, C., Hadjiefthymiades, S., 2006. Ronto: Relational to ontology schema matching. In: AIS SIGSEMIS Bulletin.
References
• Lawrence, E. D. R., 2004. Composing mappings between schemas using a reference ontology. In: Proceedings of International Conference on Ontologies, Databases and Application of Semantics (ODBASE), 783-800. Springer.
• Han, L., Finin, T., Parr, C., Sachs, J., Joshi, A., 2008. RDF123: from Spreadsheets to RDF. In: Seventh International Semantic Web Conference. Springer.
• Han, L., Finin, T., Yesha, Y., 2009. Finding semantic web ontology terms from words. In: Proceedings of the Eighth International Semantic Web Conference. Springer.
• Limaye, G., Sarawagi, S., Chakrabarti, S., 2010. Annotating and searching web tables using entities, types and relationships. In: Proc. of the 36th Int'l Conference on Very Large Databases (VLDB).