Historical Data Integration based on Collective Intelligence
Vladimir Zadorozhny
Graduate Information Science and Technology Program, School of Information Sciences, University of Pittsburgh
NADM Group
WHD Colloquium, March 27, 2012
Challenge
[Figure: the WHD Data Integration Infrastructure sits between diverse, heterogeneous, semi-structured data sources and consolidated structured information]
Web of Data?
• Linked Data: using the Web to create typed links between data from different sources
• Linked Data uses RDF (Resource Description Framework) to make typed statements (triples)
• Expected result: a Web of Data that extends the Web with a global data space connecting diverse domains (people, companies, publications, etc.)
• In general, the Web of Data has the potential (still questionable) to support loose data coupling, which may facilitate more efficient data utilization
• While WHD can utilize Linked Data (LD) and related Web mashup technologies to some extent, it would be premature to rely on the Linked Data infrastructure
Dataverse Network?
• An open source application to publish, share, reference, extract, and analyze research data; it facilitates making data available to others
• "Dataverse owners can upload any file type and format (excel, txt,pdf, doc, etc.), and the files will be stored and made available in the original format" (http://thedata.org/files/dataversehandout.pdf)
• Information consumers must then integrate the data sources themselves to perform analysis across multiple "dataverses"
• While WHD aims to be a part of the Dataverse Network, it would not encourage users to contribute data in ANY format. Instead, users integrate their data into the WHD repository as they submit it
• To summarize, the WHD infrastructure crowdsources the data integration task, not just the data contribution task
General WHD Architecture
[Figure: architecture diagram. Labels: Information Providers; Heterogeneous historical data sources; Wrapper Generation; Wrapper Registration; Wrapper, Wrapper, …, Wrapper; Data Submission System; Structured homogeneous historical data; Internal Data Reliability Assessment; External Data Reliability Assessment; Annotated historical data; Data Fusion; Fused historical data; Information Consumers]
Simple Scenario
Query against the WHD Infrastructure: select * from Population
Extendable Target Schema (relational is not mandatory):
Source | Location | From | To | Population
Mapping:
  Territories -> Location
  Population -> Population
  Data Aggregation -> Total
  Year -> From, To
Mapping:
  region -> Location
  Population -> Population
  Data Aggregation -> Total
  Year -> From, To
Wrappers: Keep Data Remotely; Materialize Data
Query result:
Source | Location    | From       | To         | Population
s2     | Liberia     | 01/01/1950 | 12/31/1950 | 824,000
s2     | Liberia     | 01/01/1960 | 12/31/1960 | 1,052,000
s2     | Ivory Coast | 01/01/1950 | 12/31/1950 | 2,505,000
s2     | Ivory Coast | 01/01/1960 | 12/31/1960 | 3,692,000
s1     | Mauritania  | 01/01/1950 | 12/31/1950 | 692,000
s1     | Mauritania  | 01/01/1960 | 12/31/1960 | 892,000
s1     | Senegal     | 01/01/1950 | 12/31/1950 | 2,543,000
s1     | Senegal     | 01/01/1960 | 12/31/1960 | 3,277,000
Data Source: s1 (xl)
Data Source: s2 (doc): According to the 2006 revision of the World Population Prospects the total population in the region of Liberia in 1950 was 824,000. The average population growth percent per year for the following ten years was 2.5. For Ivory Coast those numbers are 2,505,000 and 3.6 correspondingly.
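The mapping step in this scenario can be illustrated with a minimal Python sketch. It is not the WHD wrapper implementation (the prototype uses Pentaho Kettle transformations); the function name make_wrapper, the record layout, and the pairing of the "region" mapping with source s2 are assumptions made for illustration. The "Data Aggregation -> Total" mapping is omitted because the target schema shown has no matching column.

```python
from datetime import date

# Target schema from the slide: Source | Location | From | To | Population
TARGET_FIELDS = ("Source", "Location", "From", "To", "Population")

def make_wrapper(source_id, location_field, population_field, year_field):
    """Return a function mapping one source record to the target schema.

    Hypothetical helper: the field names are supplied by the data provider.
    """
    def wrap(record):
        year = int(record[year_field])
        population = int(str(record[population_field]).replace(",", ""))
        return {
            "Source": source_id,
            "Location": record[location_field],  # region -> Location
            "From": date(year, 1, 1),            # Year -> From
            "To": date(year, 12, 31),            # Year -> To
            "Population": population,            # Population -> Population
        }
    return wrap

# Assumption: the "region" mapping belongs to source s2, and the record
# below has already been extracted from the s2 document text.
s2_wrapper = make_wrapper("s2", "region", "Population", "Year")
row = s2_wrapper({"region": "Liberia", "Year": 1950, "Population": "824,000"})
print([row[field] for field in TARGET_FIELDS])
```

A second wrapper for the other source would differ only in its field names, which is what lets providers contribute data that is already integrated into the target schema.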
Big Picture: continuously growing infrastructure (a la Wikipedia)
[Figure: WHD Infrastructure with three activities: Data Collection, Data Curation, Data Utilization]
WHD Prototype
• Group of graduate IS students: special project in the Advanced Data Management class (INFSCI2711)
• Content Management → Pligg (open source content management system; Apache, PHP, and MySQL based)
• Data Integration Engine → Pentaho Kettle (open source data integration engine; Java-based GUI and command line tools; XML-based data transformation file)
• Data providers
  • download the wrapper-generating software
  • configure wrappers on their workstations (using preconfigured templates)
  • register wrappers on the WHD server
[Figure: wrapper example. Labels: Data Source; Data Transformation; Transformed Data; XML Wrapper]
Data Reliability Assessment and Data Fusion
• Systems based on crowdsourcing require mechanisms to ensure data quality.
• The WHD infrastructure will support efficient data curation strategies based on advanced data reliability assessment and data fusion methods.
• As the system continuously receives new historical reports, WHD estimates the reliability of these data, and the estimate evolves as new evidence arrives.
• WHD uses a measure of the inconsistency caused by a report to assess its internal reliability.
• WHD also allows users to submit subjective feedback on the reliability of data, which is used to assess external reliability.
• WHD uses subjective logic to combine the internal and external reliability assessments.
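The slide does not say which subjective logic operator WHD applies. As a hedged illustration only, the sketch below combines two opinions with the standard cumulative fusion (consensus) operator from subjective logic. Each opinion is a (belief, disbelief, uncertainty) triple that sums to 1; the Opinion class, the function name, and the numeric values are assumptions, not WHD data.

```python
from typing import NamedTuple

class Opinion(NamedTuple):
    """Subjective-logic opinion; belief + disbelief + uncertainty == 1."""
    belief: float
    disbelief: float
    uncertainty: float

def cumulative_fusion(a: Opinion, b: Opinion) -> Opinion:
    """Combine two opinions with the cumulative fusion (consensus) operator.

    Assumption: this operator is chosen for illustration; the source does
    not state which operator WHD uses.
    """
    k = a.uncertainty + b.uncertainty - a.uncertainty * b.uncertainty
    if k == 0:
        # Both opinions are fully certain; the limit case is not handled here.
        raise ValueError("cumulative fusion is undefined when both uncertainties are 0")
    return Opinion(
        belief=(a.belief * b.uncertainty + b.belief * a.uncertainty) / k,
        disbelief=(a.disbelief * b.uncertainty + b.disbelief * a.uncertainty) / k,
        uncertainty=(a.uncertainty * b.uncertainty) / k,
    )

# Hypothetical values: internal reliability derived from an inconsistency
# measure, external reliability derived from user feedback.
internal = Opinion(belief=0.6, disbelief=0.2, uncertainty=0.2)
external = Opinion(belief=0.7, disbelief=0.1, uncertainty=0.2)
combined = cumulative_fusion(internal, external)
print(combined)
```

With this operator the fused uncertainty is lower than either input's, which matches the idea that the reliability estimate sharpens as evidence accumulates.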
Historical Data: Redundancy
Temporal Overlaps
t1 | source_ref1 | Measles | NYC | 10/10/1900 | 10/10/1920 | 700
t2 | source_ref2 | Measles | NYC | 10/20/1910 | 10/30/1930 | 300
Total number of Measles cases in New York City from 1900 to 1930: 700 + 300 = 1000 ???
Temporal overlap between t1 and t2
Spatial Overlaps
t3 | source_ref1 | Smallpox | NY | 10/20/1900 | 10/20/1920 | 500
t4 | source_ref1 | Smallpox | NYC | 10/30/1920 | 10/30/1930 | 600
Total number of Smallpox cases in New York State from 1900 to 1930: 500 + 600 = 1100 ???
Spatial overlap between t3 and t4
Naming Overlaps
t5 | source_ref1 | Yellow fever | NY | 10/10/1900 | 10/10/1920 | 700
t6 | source_ref2 | Hepatitis | NY | 10/10/1900 | 10/10/1920 | 700
t7 | source_ref4 | Hepatitis B | NY | 10/20/1910 | 10/30/1930 | 300
Total number of Hepatitis cases in New York State from 1920 to 1930: 700 + 700 + 300 = 1700 ???
Naming overlap between t5, t6 and t7
[Figure: timelines from 1900 to 1930 showing the Measles reports (700, 300) and the Smallpox reports (500 for NY, 600 for NYC)]
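Why the naive sum for Measles is suspect can be shown with a short Python sketch that flags temporally overlapping reports. It is an illustration under assumptions, not the WHD algorithm: the Report type and function names are hypothetical, dates are read as MM/DD/YYYY, and two reports are compared only when their disease and location strings match exactly, so spatial overlaps (NY vs NYC) and naming overlaps are deliberately not detected here.

```python
from datetime import date
from itertools import combinations
from typing import NamedTuple

class Report(NamedTuple):
    tid: str
    source: str
    disease: str
    location: str
    start: date
    end: date
    cases: int

def overlap_days(a: Report, b: Report) -> int:
    """Number of days covered by both reports (0 if the periods are disjoint)."""
    latest_start = max(a.start, b.start)
    earliest_end = min(a.end, b.end)
    return max(0, (earliest_end - latest_start).days + 1)

def temporal_overlaps(reports):
    """Pairs of reports on the same disease and location whose periods intersect."""
    return [
        (a.tid, b.tid)
        for a, b in combinations(reports, 2)
        if a.disease == b.disease
        and a.location == b.location
        and overlap_days(a, b) > 0
    ]

# Tuples t1 to t4 from the slide.
reports = [
    Report("t1", "source_ref1", "Measles", "NYC", date(1900, 10, 10), date(1920, 10, 10), 700),
    Report("t2", "source_ref2", "Measles", "NYC", date(1910, 10, 20), date(1930, 10, 30), 300),
    Report("t3", "source_ref1", "Smallpox", "NY", date(1900, 10, 20), date(1920, 10, 20), 500),
    Report("t4", "source_ref1", "Smallpox", "NYC", date(1920, 10, 30), date(1930, 10, 30), 600),
]
flagged = temporal_overlaps(reports)
print(flagged)
```

Only t1 and t2 are flagged, so adding their counts may double-count cases from the shared period. Catching the t3/t4 case would additionally need knowledge that NYC lies within NY, and the t5 to t7 case would need a disease-name vocabulary.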
Historical Data: Inconsistency
[Figure: Measles reports in NYC along a time axis. R1: 700, 200; R2: 500, 400, 300; … The reports are redundant and inconsistent.]
ICTS: Motion Chart Animation
[Figure: motion chart animation; three panels labeled CV]
Conclusion
• We explore a novel approach to reliable, large-scale historical data integration based on collective intelligence
• We implement this approach in the WHD infrastructure for consolidating heterogeneous historical data
• Major challenge: how to engage a large community of researchers to share their data and collectively resolve data heterogeneities in a continuously growing, large-scale distributed historical repository?
  • contributions from CHAI members (only a small fraction of Wikipedia users contribute the information that ensures its growth)
  • as the infrastructure evolves, users may become interested in "embedding" their data in a larger context to perform global analysis and to utilize WHD tools
  • open development platform (extendable data transformation library and toolsets)
Acknowledgements
Doctoral Students: Ying-Feng Hsu, Julian Lee
Graduate IS Students (WHD system development team): Andrew Barnett (team leader), Andrew Entin, Thomas Junker, Jidapa Kraisangka, Han Liao, Eric Miller, Ye Peng, Evan Pulgino, Henry Quattrone, Mark Swartz, Miao Tan, Liu Yuchen, Lihong Zhang