
Automatically Generating Linked Data from Tables

Varish Mulwad (@varish), University of Maryland, Baltimore County. November 15, 2011.


Presentation Transcript


  1. Automatically Generating Linked Data from Tables. Varish Mulwad (@varish), University of Maryland, Baltimore County, November 15, 2011

  2. What ?

  3. dbpedia-owl:state; http://dbpedia.org/class/AdministrativeRegion; http://dbpedia.org/resource/Arizona; Map literals as values of properties. Outline: Introduction | Related Work | Baseline | Results | Joint Inference | Conclusion

  4. Contribution
     @prefix dbpedia: <http://dbpedia.org/resource/>.
     @prefix dbpedia-owl: <http://dbpedia.org/ontology/>.
     @prefix dbpprop: <http://dbpedia.org/property/>.
     @prefix dgtwc: <http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#>.
     "State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion.
     [ a dgtwc:DataEntry;
       dbpedia-owl:state dbpedia:Alabama;
       dbpedia:FIPS county code 000;
       dbpedia:Federal Information Processing Standard state code 001;
       dbpedia-owl:ethnicGroup "Farm with women principal operators"@en;
       dbpedia-owl:number 6444 ].
     All this in a completely automated way!!

  5. Why ?

  6. Tables are everywhere!! … yet … The web: 154 million high-quality relational tables [1]

  7. Evidence-based medicine
     The idea behind evidence-based medicine is to judge the efficacy of treatments or tests by meta-analyses or reviews of clinical trials. Key information in such trials is encoded in tables. However, the rate at which meta-analyses are published remains very low, which hampers effective health care treatment.
     [Figure: number of clinical trials published in 2008 vs. number of meta-analyses published in 2008. Source: Evidence-Based Medicine - the Essential Role of Systematic Reviews, and the Need for Automated Text Mining Tools, IHI 2010]

  8. More than 400,000 raw and geospatial datasets; roughly < 1% in RDF

  9. Current Systems
     • Require users to have knowledge of the Semantic Web
     • Do not automatically link to existing classes and entities on the Semantic Web / Linked Data cloud
     • RDF data in some cases is as useless as raw data
     • The majority of the work has focused on relational data where a schema is available
     • Web table systems use ‘semantically poor knowledge bases’

  10. How ?

  11. Building a table interpretation framework
      • Preliminary work / Baseline system
      • Analysis and Evaluation of baseline
      • “Domain Independent” Framework grounded in graphical models and probabilistic reasoning

  12. The System’s Brain (Knowledgebase)
      • Yago
      • Wikitology¹: a hybrid knowledge base where structured data meets unstructured data
      Syed, Z., and Finin, T. 2011. Creating and Exploiting a Hybrid Knowledge Base for Linked Data, volume 129 of Revised Selected Papers Series: Communications in Computer and Information Science. Springer.
      ¹ Wikitology was created as part of Zareen Syed’s Ph.D. dissertation

  13. The Baseline System

  14. T2LD Framework: Predict Class for Columns → Linking the table cells → Identify and Discover relations

  15. Predicting Class Labels for column
      Instance (candidate entities for a cell): 1. Alabama 2. Alabama_(band) 3. Alabama_(people)
      Class (candidate sets, one per cell):
      • {dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, yago:StatesOfTheUnitedStates, dbpedia-owl:Band, yago:NativeAmericanTribes, …}
      • {dbpedia-owl:Place, yago:StatesOfTheUnitedStates, dbpedia-owl:Film, …}
      • {…}
      Union of candidate classes for the column: dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, yago:StatesOfTheUnitedStates, dbpedia-owl:Band, yago:NativeAmericanTribes, dbpedia-owl:Film, ...
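The class-prediction step above can be sketched as a simple vote. This is a minimal sketch, assuming a toy in-memory knowledge base: `TOY_KB`, `candidate_classes`, and `predict_column_classes` are hypothetical names, the `Arizona` entries are made-up illustration data, and the transcript does not say how the baseline actually scores classes. Here each cell contributes the classes of all its candidate entities, and classes are ranked by how many cells support them.

```python
from collections import Counter

# Hypothetical toy knowledge base: cell string -> candidate entity -> classes.
TOY_KB = {
    "Alabama": {
        "Alabama": ["dbpedia-owl:Place", "dbpedia-owl:AdministrativeRegion",
                    "yago:StatesOfTheUnitedStates"],
        "Alabama_(band)": ["dbpedia-owl:Band"],
        "Alabama_(people)": ["yago:NativeAmericanTribes"],
    },
    "Arizona": {
        "Arizona": ["dbpedia-owl:Place", "dbpedia-owl:AdministrativeRegion",
                    "yago:StatesOfTheUnitedStates"],
        "Arizona_(film)": ["dbpedia-owl:Film"],
    },
}


def candidate_classes(cell):
    """Return the set of classes of every candidate entity for one cell."""
    classes = set()
    for entity_classes in TOY_KB.get(cell, {}).values():
        classes.update(entity_classes)
    return classes


def predict_column_classes(cells):
    """Rank classes by the number of cells whose candidates carry them."""
    votes = Counter()
    for cell in cells:
        votes.update(candidate_classes(cell))
    # Sort by descending votes, then alphabetically for a stable order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = predict_column_classes(["Alabama", "Arizona"])
for label, count in ranking:
    print(count, label)
```

A plain vote like this has no notion of class specificity, so a general class such as dbpedia-owl:Place ties with more specific ones whenever every cell supports both.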

  16. Linking table cells to entities
      Candidate entities: 1. Macon County, Alabama 2. Macon County, Illinois
      Query: Macon + County + Alabama + 1 + 87 + Farms with Black or African American operators + ... + dbpedia-owl:AdministrativeRegion
      • Classifier 1: SVM Rank (ranks the set of entities)
      • Classifier 2: SVM (computes confidence)
      • Outcome: link to the top-ranked entity, or don’t link
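The rank-then-decide flow of this slide can be sketched as follows. This is not the system's implementation: a token-overlap score stands in for the SVM Rank classifier, and a score-and-margin threshold stands in for the second SVM that computes confidence. `overlap_score`, `link_cell`, `min_score`, and `min_margin` are hypothetical names and values chosen for illustration.

```python
def overlap_score(context_tokens, entity_name):
    """Stand-in ranker: fraction of entity-name tokens found in the row context."""
    name_tokens = entity_name.replace(",", " ").lower().split()
    context = {token.lower() for token in context_tokens}
    return sum(token in context for token in name_tokens) / len(name_tokens)


def link_cell(context_tokens, candidates, min_score=0.8, min_margin=0.1):
    """Rank candidates, then link to the top one only if the decision looks confident."""
    if not candidates:
        return None
    ranked = sorted(candidates,
                    key=lambda entity: overlap_score(context_tokens, entity),
                    reverse=True)
    top = overlap_score(context_tokens, ranked[0])
    runner_up = overlap_score(context_tokens, ranked[1]) if len(ranked) > 1 else 0.0
    # Stand-in for the confidence classifier: need a high score and a clear margin.
    if top >= min_score and top - runner_up >= min_margin:
        return ranked[0]
    return None  # "Don't link"


candidates = ["Macon County, Illinois", "Macon County, Alabama"]
linked = link_cell(["Macon", "County", "Alabama", "1", "87"], candidates)
ambiguous = link_cell(["Macon", "County"], candidates)
print(linked, ambiguous)
```

Returning `None` when the evidence is weak mirrors the "Don't link" branch: an unlinked cell can still be emitted as a literal rather than being tied to the wrong entity.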

  17. Identify Relations. Candidate relations per row (figure labels): Rel ‘A’; Rel ‘A’; Rel ‘A’, ‘C’; Rel ‘A’, ‘B’, ‘C’; Rel ‘A’, ‘B’
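One way to read the figure labels above is that each row proposes a set of candidate relations for a pair of columns, and the relation supported by the most rows is chosen. The sketch below assumes that majority-vote reading; the transcript does not state the actual scoring, and `pick_relation` is a hypothetical name.

```python
from collections import Counter


def pick_relation(candidate_sets):
    """Return (relation, votes) for the relation supported by the most rows."""
    votes = Counter()
    for relations in candidate_sets:
        votes.update(set(relations))  # each row votes once per relation
    if not votes:
        return None
    # Highest vote count wins; ties are broken alphabetically.
    return min(votes.items(), key=lambda kv: (-kv[1], kv[0]))


# Candidate relations per row, taken from the slide's figure labels.
rows = [{"A"}, {"A"}, {"A", "C"}, {"A", "B", "C"}, {"A", "B"}]
best = pick_relation(rows)
print(best)
```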

  18. Generating a linked RDF representation
      @prefix dbpedia: <http://dbpedia.org/resource/>.
      @prefix dbpedia-owl: <http://dbpedia.org/ontology/>.
      @prefix dbpprop: <http://dbpedia.org/property/>.
      @prefix dgtwc: <http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#>.
      "State"@en is rdfs:label of dbpedia-owl:AdministrativeRegion.
      [ a dgtwc:DataEntry;
        dbpedia-owl:state dbpedia:Alabama;
        dbpedia:FIPS county code 000;
        dbpedia:Federal Information Processing Standard state code 001;
        dbpedia-owl:ethnicGroup "Farm with women principal operators"@en;
        dbpedia-owl:number 6444 ].
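Once columns, cells, and relations have been interpreted, emitting a row in the shape shown on this slide is mostly string assembly. The sketch below is an illustration only: `turtle_literal` and `row_to_turtle` are hypothetical helpers, not part of the system. It omits the two FIPS properties because their names contain spaces on the slide and the intended identifiers are not recoverable from the transcript.

```python
def turtle_literal(value):
    """Serialize a Python value as a Turtle term: ints bare, strings quoted with @en."""
    if isinstance(value, int):
        return str(value)
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@en'


def row_to_turtle(row_type, properties):
    """Emit one table row as a blank node.

    `properties` is a list of (predicate, value, is_resource) tuples; resources
    are written as-is (prefixed names), everything else as a literal.
    """
    parts = [f"a {row_type}"]
    for predicate, value, is_resource in properties:
        term = value if is_resource else turtle_literal(value)
        parts.append(f"{predicate} {term}")
    return "[ " + ";\n  ".join(parts) + " ] ."


row = row_to_turtle("dgtwc:DataEntry", [
    ("dbpedia-owl:state", "dbpedia:Alabama", True),
    ("dbpedia-owl:ethnicGroup", "Farm with women principal operators", False),
    ("dbpedia-owl:number", 6444, False),
])
print(row)
```

For anything beyond a demonstration, an RDF library would be a safer choice than hand-built strings, since it handles escaping and prefix declarations.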

  19. Evaluation of the baseline system

  20. Dataset summary. * The number in brackets indicates the count excluding columns that contained numbers

  21. Evaluation #1 (MAP)
      • Compared the system’s ranked list of labels against a human-ranked list of labels
      • Metric: Average Precision (a.p.) [Mean Average Precision gives the mean over a set of queries]
      • Commonly used in the Information Retrieval domain to compare two ranked sets

  22. Evaluation #1 (MAP). System ranked: 1. Person 2. Politician 3. President. Evaluator ranked: 1. President 2. Politician 3. OfficeHolder. MAP = 0.411
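To make the metric concrete, here is a minimal sketch of average precision applied to the example lists on this slide. It assumes the textbook definition, in which the evaluator's labels form an unordered set of relevant items and the sum of precision values is divided by the size of that set; the exact variant used in the talk is not given. The 0.411 figure is the mean reported over the evaluation, not the score of this single example.

```python
def average_precision(system_ranked, relevant):
    """Average precision of a ranked list against a set of relevant labels."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(system_ranked, start=1):
        if label in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(queries):
    """Mean of average precision over (system_ranked, relevant) pairs."""
    return sum(average_precision(s, r) for s, r in queries) / len(queries)


system = ["Person", "Politician", "President"]
evaluator = ["President", "Politician", "OfficeHolder"]
ap = average_precision(system, evaluator)
print(round(ap, 3))
```

Under these assumptions the example column scores (1/2 + 2/3) / 3 = 7/18, because the system's top label is not among the evaluator's labels and OfficeHolder is never retrieved.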

  23. Accuracy for Entity Linking. Overall accuracy: 66.12%

  24. Lessons Learnt
      T2LD Framework: Predict Class for Columns → Linking the table cells → Identify and Discover relations
      • Sequential system: error percolated from one phase to the next
      • Current system favors general classes over specific ones (MAP score = 0.411)
      • Largely, a system driven by “heuristics”
      • Although we consider evidence, we don’t do assignment jointly

  25. A “Domain Independent” Framework
      Diagram labels: Domain Knowledge (Linked Data Cloud / Medical Domain / Open Govt. Domain); KB; Query KB; m,n,o,…; x,y,z,…; a,b,c,…; Probabilistic Graphical Model / Joint Inference Model; Linked Data

  26. Joint Inference over evidence in a table • Probabilistic Graphical Models

  27. Parameterized graphical model
      Diagram labels: variable nodes for column headers (C1, C2, C3); row values (R11, R12, R13, R21, R22, R23, R31, R32, R33); factor nodes
      • A function that captures the affinity between the column headers and row values
      • Captures interaction between column headers
      • Captures interaction between row values
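The idea of assigning column classes and cell entities jointly can be illustrated with a very small factor graph solved by brute force. This is a sketch under stated assumptions, not the model from the talk: the factor values, the `TYPES` table, and the names `joint_map`, `cell_factor`, and `col_factor` are all invented for illustration, the factor between row values is omitted for brevity, and exhaustive enumeration only works for tiny tables.

```python
from itertools import product


def joint_map(col_domains, cell_domains, cell_factor, col_factor):
    """Brute-force the highest-scoring joint assignment.

    col_domains:  {column: [class, ...]}
    cell_domains: {(column, row): [entity, ...]}
    cell_factor(cls, entity): affinity between a column class and a cell entity
    col_factor(cls_a, cls_b): affinity between the classes of two columns
    """
    cols = sorted(col_domains)
    cells = sorted(cell_domains)
    best, best_score = None, float("-inf")
    for col_vals in product(*(col_domains[c] for c in cols)):
        col_assign = dict(zip(cols, col_vals))
        for cell_vals in product(*(cell_domains[k] for k in cells)):
            score = 1.0
            for (col, _row), entity in zip(cells, cell_vals):
                score *= cell_factor(col_assign[col], entity)
            for i, a in enumerate(cols):
                for b in cols[i + 1:]:
                    score *= col_factor(col_assign[a], col_assign[b])
            if score > best_score:
                best_score = score
                best = (col_assign, dict(zip(cells, cell_vals)))
    return best, best_score


# Hypothetical entity types and factor values, for illustration only.
TYPES = {"Alabama": "State", "Alabama_(band)": "Band", "Montgomery": "City"}


def cell_factor(cls, entity):
    return 0.9 if TYPES[entity] == cls else 0.1


def col_factor(a, b):
    return 0.8 if {a, b} == {"State", "City"} else 0.2


col_domains = {"C1": ["State", "Band"], "C2": ["City", "Band"]}
cell_domains = {("C1", 1): ["Alabama", "Alabama_(band)"], ("C2", 1): ["Montgomery"]}
(best_cols, best_cells), score = joint_map(col_domains, cell_domains,
                                           cell_factor, col_factor)
print(best_cols, best_cells, round(score, 3))
```

In this toy table the second column pulls the first toward State rather than Band, which is the kind of cross-column evidence a purely sequential pipeline cannot use. A realistic model would replace the enumeration with approximate inference such as message passing.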

  28. Challenges

  29. Challenges - Literals. Population / Profit? Age / Percentage? Use evidence from the rest of the table to decide

  30. Challenges - Metadata

  31. More Challenges!
      • Sampling and Interpretation
      • Data set 1425 has > 400,000 rows!
      • Human in the Loop

  32. Conclusion
      • Presented a framework for inferring the semantics of tables and generating Linked Data
      • Evaluation of the baseline system shows the feasibility of tackling the problem
      • Work in progress on building a framework grounded in graphical models and probabilistic reasoning
      • Working on tackling challenges posed by tables from domains such as medical and open government data

  33. References
      [1] Cafarella, M. J.; Halevy, A. Y.; Wang, Z. D.; Wu, E.; and Zhang, Y. 2008. WebTables: exploring the power of tables on the web. PVLDB 1(1):538–549.
      [2] Hurst, M. 2006. Towards a theory of tables. IJDAR 8(2-3):123–131.
      [3] Embley, D. W.; Lopresti, D. P.; and Nagy, G. 2006. Notes on contemporary table recognition. In Document Analysis Systems, 164–175.
      [4] Wang, J.; Shao, B.; Wang, H.; and Zhu, K. Q. 2010. Understanding tables on the web. Technical report, Microsoft Research Asia.
      [5] Venetis, P.; Halevy, A.; Madhavan, J.; Pasca, M.; Shen, W.; Wu, F.; Miao, G.; and Wu, C. 2011. Recovering semantics of tables on the web. In Proc. of the 37th Int'l Conference on Very Large Databases (VLDB).
      [6] Limaye, G.; Sarawagi, S.; and Chakrabarti, S. 2010. Annotating and searching web tables using entities, types and relationships. In Proc. of the 36th Int'l Conference on Very Large Databases (VLDB).

  34. Thank You! Questions? varish1@cs.umbc.edu, @varish, http://ebiq.org/h/Varish/Mulwad. Project Page: http://ebiq.org/j/96. finin@cs.umbc.edu, joshi@cs.umbc.edu
