Bootstrapping Pay-As-You-Go Data Integration Systems by Anish Das Sarma, Xin Dong, Alon Halevy, Proceedings of SIGMOD '08, Vancouver, British Columbia, Canada, June 2008 Presented by Andrew Zitzelberger
Data Integration • Offer a single-point interface to a set of data sources • Mediated schema • Semantic mappings • Queries are posed against the mediated schema • Pay-as-you-go • In many contexts the system is useful before full integration is achieved • The system starts with few (or inaccurate) semantic mappings • Mappings are improved over time • Problem • Traditional integration requires significant upfront and ongoing effort
Contributions • Self-configuring data integration system • Provides an advanced starting point for pay-as-you-go systems • Initial configuration provides good precision and recall • Algorithms • Mediated schema generation • Semantic mapping generation • Concept • Probabilistic mediated schema
Mediated Schema Generation • 1) Remove infrequent attributes • Ensures the mediated schema contains the most relevant attributes • 2) Construct a weighted graph • Nodes are the remaining attributes • Edges are weighted by a similarity measure s(ai, aj) • Cull edges below a threshold τ • 3) Cluster the nodes • Each cluster is a connected component of the graph (a code sketch follows below)
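The three steps translate directly into a thresholded-graph, connected-components computation. A minimal Python sketch, assuming a pairwise similarity function `sim` and a per-attribute source-frequency table `freq` (both hypothetical stand-ins for the paper's measures):

```python
import networkx as nx

def mediated_schema(attrs, freq, sim, min_freq, tau):
    """Sketch of deterministic mediated-schema generation.

    attrs:     source attribute names
    freq:      attribute -> number of sources containing it (assumed given)
    sim:       similarity function s(ai, aj) in [0, 1] (assumed given)
    min_freq:  frequency threshold for step 1
    tau:       edge threshold for step 2
    """
    # 1) Remove infrequent attributes.
    kept = [a for a in attrs if freq[a] >= min_freq]

    # 2) Weighted graph over the remaining attributes; cull edges below tau.
    g = nx.Graph()
    g.add_nodes_from(kept)
    for i, ai in enumerate(kept):
        for aj in kept[i + 1:]:
            if sim(ai, aj) >= tau:
                g.add_edge(ai, aj)

    # 3) Each connected component becomes one mediated-schema cluster.
    return [frozenset(c) for c in nx.connected_components(g)]
```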
Probabilistic Mediated Schema Generation • Allow for error є in the weighted graph • Certain edges: s(ai, aj) ≥ τ + є • Uncertain edges: τ − є ≤ s(ai, aj) < τ + є • Cull edges with s(ai, aj) < τ − є • Remove unnecessary uncertain edges • Create a mediated schema from every subset of the remaining uncertain edges
Probabilistic Mediated Schema Generation • Assign a probability to each resulting mediated schema (a sketch of the enumeration follows below)
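A hedged sketch of the enumeration step, reusing connected-component clustering. The paper's pruning of "unnecessary" uncertain edges and its probability assignment are not spelled out on these slides, so the deduplication below is only an assumed stand-in for that pruning; note the enumeration is exponential in the number of uncertain edges, which is why pruning matters:

```python
from itertools import combinations
import networkx as nx

def cluster(nodes, edges):
    """Connected components of the graph over `nodes` with `edges`."""
    g = nx.Graph()
    g.add_nodes_from(nodes)
    g.add_edges_from(edges)
    return frozenset(frozenset(c) for c in nx.connected_components(g))

def p_mediated_schemas(kept, sim, tau, eps):
    """Enumerate candidate mediated schemas from uncertain-edge subsets."""
    pairs = [(a, b) for i, a in enumerate(kept) for b in kept[i + 1:]]
    certain = [p for p in pairs if sim(*p) >= tau + eps]
    uncertain = [p for p in pairs
                 if tau - eps <= sim(*p) < tau + eps]
    # Edges below tau - eps are culled outright.

    schemas = set()
    for r in range(len(uncertain) + 1):
        for subset in combinations(uncertain, r):
            schemas.add(cluster(kept, certain + list(subset)))
    return schemas  # probabilities would then be assigned per schema
```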
Probabilistic Mapping Generation • Weighted correspondences between mediated-schema attributes and source attributes • Choose the consistent p-mapping with the maximum entropy
Probabilistic Mapping Generation • 1) Enumerate one-to-one mappings • Each mapping contains a subset of the correspondences • 2) Assign probabilities that maximize entropy • Solve the constraint maximization problem reconstructed below
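A reconstruction of that maximization problem from the slides' setup, where m_1, ..., m_l are the enumerated one-to-one mappings, p_k is the probability assigned to m_k, and w_{i,j} is the weight of the correspondence between mediated attribute a_i and source attribute b_j (the exact constraint set here is my reading, not quoted from the paper):

```latex
\max_{p_1,\dots,p_l} \; -\sum_{k=1}^{l} p_k \log p_k
\quad \text{subject to} \quad
\forall i,j:\ \sum_{k \,:\, (a_i, b_j) \in m_k} p_k = w_{i,j},
\qquad \sum_{k=1}^{l} p_k = 1, \qquad p_k \ge 0 .
```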
Probabilistic Mediated Schema Consolidation • Why? • The user expects a single deterministic schema • More efficient query answering • How? • Build the coarsest common refinement: two attributes share a cluster in the consolidated schema T only if every mediated schema clusters them together (see the example below)
Schema Consolidation Example • M = {M1, M2} • M1 contains {a1, a2, a3}, {a4}, and {a5, a6} • M2 contains {a2, a3, a4} and {a1, a5, a6} • T contains {a1}, {a2, a3}, {a4}, and {a5, a6}
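The refinement in the example is just a pairwise intersection of clusters across the mediated schemas. A small runnable sketch reproducing it (the representation of schemas as lists of attribute sets is my own):

```python
def consolidate_schemas(med_schemas):
    """Consolidate mediated schemas into one deterministic schema T.

    Each schema is a list of attribute clusters (sets). T keeps two
    attributes together only if every schema clusters them together,
    i.e. T is the common refinement of the input partitions.
    """
    t = [frozenset(c) for c in med_schemas[0]]
    for schema in med_schemas[1:]:
        t = [c1 & c2 for c1 in t for c2 in schema if c1 & c2]
    return t

m1 = [{'a1', 'a2', 'a3'}, {'a4'}, {'a5', 'a6'}]
m2 = [{'a2', 'a3', 'a4'}, {'a1', 'a5', 'a6'}]
print(consolidate_schemas([m1, m2]))
# clusters {a2, a3}, {a1}, {a4}, {a5, a6} -- matching T above
```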
Probabilistic Mapping Consolidation • Modify p-mappings • Update each mapping to match the new mediated schema • Modify probabilities • Scale each mapping's probability by Pr(Mi) • Consolidate • Add all modified mappings to a new set • If a mapping is already in the new set, add the probabilities (a sketch follows below)
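A minimal sketch of the consolidation step, assuming each mapping has already been rewritten against the consolidated schema and is represented as a hashable value (e.g. a frozenset of attribute correspondences); these representation choices are mine, not the paper's:

```python
from collections import defaultdict

def consolidate_mappings(pmappings, schema_probs):
    """Combine per-schema p-mappings into a single p-mapping.

    pmappings[i]:    list of (mapping, prob) pairs for mediated schema Mi
    schema_probs[i]: Pr(Mi)
    """
    combined = defaultdict(float)
    for mappings, pr_m in zip(pmappings, schema_probs):
        for mapping, prob in mappings:
            # Scale by Pr(Mi); duplicate mappings accumulate probability.
            combined[mapping] += prob * pr_m
    return list(combined.items())
```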
Experimental Setup • UDI – the data integration system • Accepts select-project queries (over a single table) • Source data – MySQL • Query processor – Java • Jaro-Winkler similarity computation – SecondString • Entropy maximization problem – Knitro • Operating System – Windows Vista • CPU – Intel Core 2 GHz • Memory – 2GB
Experimental Setup • τ = 0.85 • є = 0.02 • θ = 10%
Experiments • Domains: Movie, Car, People, Course, Bibliography • Gold standards • Manually created for People and Bibliography • Partially created for the others • 10 test queries • One to four attributes in the SELECT clause • Zero to three predicates in the WHERE clause
Results • Estimated actual recall between 0.8 and 0.85
Experiments • Compare to other methods: • MySQL keyword search engine • KEYWORDNAIVE • KEYWORDSTRUCT • KEYWORDSTRICT • SOURCE • Unions the results from each data source • TOPMAPPING • Only considers the p-mapping with the highest probability
Experiments • Compare against other query-answering methods: • SINGLEMED – a single deterministic mediated schema • UNIONALL – a single deterministic mediated schema with a singleton cluster for each frequent source attribute
Experiments and Results • Quality of the mediated schema • Tested against a manually created schema
Experiments and Results • Setup efficiency • 3.5 minutes for 817 data sources • Time grows roughly linearly with the number of data sources • The maximum-entropy problem is the most time-consuming step
Future Work • Different schema matcher • Dealing with multiple-table sources • Including multi-table schemas • Normalizing mediated schemas
Analysis • Positives • Lots of support (proofs and experiments) • Negatives • Sparse on detail in places • Few illustrative pictures