Mapping Maintenance for Data Integration Systems
Robert McCann, University of Illinois
Joint work with Bedoor AlShebli, Quoc Le, Hoa Nguyen, Long Vu, & AnHai Doan
VLDB 2005

Data Integration Systems
[Figure: the query “Find homes under $300K” is posed over the mediated schema, which maps to source schema 1, source schema 2, and source schema 3. Each source schema sits on a wrapper over one source: homeseekers.com, yahoo.com, windermere.com.]

Mapping Maintenance is a Key Bottleneck
• Constructing mappings has proven difficult… (see first speaker)
• …but maintenance often quickly dominates cost
  • E.g., Integrated Genome Database Project [Stein, 03]
    • 12 genomic databases, each remodeled data twice per year
    • System broke every two weeks, abandoned after 1 year
  • E.g., Integration Project at Illinois
    • Integrated 400 DB researcher homepages
    • 2 system administrators, stopped after 3 months
Reducing maintenance costs is now crucial!

Problem Definition
[Figure: homeseekers.com and its wrapper, shown before and 5 weeks later (source has changed). Mediated schema in both cases: cost | city | numbeds | numbaths. Is the mapping to the mediated schema still valid (?)]
Wrapper output, before:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2
Wrapper output, 5 weeks later:
price | location | beds | baths
$180,000 | 61801 | 2 | 2
$260,000 | 98195 | 3 | 2

Example 1: Change Source Schema or Data
• Update tuples
• Change units of price
[Figure: homeseekers.com and its wrapper before and after each change. Mediated schema: cost | city | numbeds | numbaths]
Wrapper output, original:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2
Wrapper output, tuples updated:
price | location | beds | baths
$180,000 | “Urbana, IL” | 2 | 2
$260,000 | “Seattle, WA” | 3 | 2
Wrapper output, units of price changed:
price | location | beds | baths
185 | “Urbana, IL” | 2 | 2
270 | “Seattle, WA” | 3 | 2

Example 2: Change Presentation Format
• Display location as zipcode
• Rearrange page layout
[Figure: homeseekers.com page and wrapper output before and after each change. Mediated schema: cost | city | numbeds | numbaths]
Wrapper output, original:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2
Wrapper output, location displayed as zipcode:
price | location | beds | baths
$185,000 | 61801 | 2 | 2
$270,000 | 98195 | 3 | 2
Wrapper output, page layout rearranged:
price | location | beds | baths
$185,000 | “Century 21” | 2 | 2
$270,000 | “RE/MAX” | 3 | 2
Page text shown on the slide:
$185,000 Urbana, IL 2bed/2bath Century 21
$185,000 61801 2bed/2bath Century 21
$185,000 - Urbana, IL 2bed/2bath Century 21

The MAVERIC Approach
Suppose administrator wants to maintain mappings for 1 year
1. For a short initial period (e.g., 5 weeks)
  • Administrator manually verifies each mapping
  • MAVERIC probes the source to learn data characteristics
2. For remaining time (e.g., 47 weeks)
  • MAVERIC probes the source to observe new data instances
  • MAVERIC outputs an alarm if characteristics differ
  • If an alarm, administrator repairs mappings

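The two phases can be sketched as a single loop over weekly probes. This is only an illustration under stated assumptions: `maintain`, `learn_price_range`, and `price_out_of_range` are hypothetical names, and the price-range profile with its 0.5x and 2x tolerances stands in for whatever characteristics the real system learns.

```python
def maintain(snapshots, training_weeks, learn, differs):
    """Return the week numbers (1-based) at which an alarm is raised.

    snapshots: one probe result per week, index 0 = week 1.
    learn / differs: pluggable profile learner and comparison (hypothetical).
    """
    # Phase 1: mappings are manually verified, so these snapshots are trusted.
    profile = learn(snapshots[:training_weeks])
    alarms = []
    # Phase 2: compare every later snapshot against the learned profile.
    for week, snap in enumerate(snapshots[training_weeks:], start=training_weeks + 1):
        if differs(profile, snap):
            alarms.append(week)
    return alarms


def learn_price_range(training_snapshots):
    """Toy profile: the smallest and largest price seen during training."""
    prices = [p for snap in training_snapshots for p in snap]
    return min(prices), max(prices)


def price_out_of_range(profile, snapshot):
    """Toy check with made-up tolerances: half the min, twice the max."""
    low, high = profile
    return any(p < 0.5 * low or p > 2 * high for p in snapshot)


# Two trusted weeks of prices, then a week where the price units changed.
weekly_prices = [[185000, 270000], [180000, 260000], [185, 270]]
alarm_weeks = maintain(weekly_prices, 2, learn_price_range, price_out_of_range)
```

In this toy run the alarm fires in week 3, the week in which prices are reported in thousands.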
Example
• Training phase
Wrapper output, homeseekers.com on week 1:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2
Wrapper output, homeseekers.com on week 5:
price | location | beds | baths
$132,000 | “Salem, OR” | 2 | 1
$365,000 | “Atlanta, GA” | 4 | 2
Learned data characteristics:
  • If beds < baths, output alarm
  • If average price < 100,000, output alarm
  • If layout of attributes changes, output alarm
• Verification phase
Wrapper output, homeseekers.com on week 6:
price | location | beds | baths
132 | “Century 21” | 1 | 2
365 | “RE/MAX” | 2 | 4

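The three learned characteristics can be written out as explicit checks. The sketch below applies them to the week 5 and week 6 rows from the slide; `check` is a hypothetical helper, and both layout tuples are invented for illustration, since the slide does not show the actual page layouts.

```python
from statistics import mean

PRICE_FLOOR = 100_000  # threshold taken from the slide's second rule


def check(rows, layout, trained_layout):
    """Return the list of rules that fire for one week's wrapper output."""
    reasons = []
    if any(r["beds"] < r["baths"] for r in rows):
        reasons.append("beds < baths")
    if mean(r["price"] for r in rows) < PRICE_FLOOR:
        reasons.append("average price < 100,000")
    if layout != trained_layout:
        reasons.append("layout of attributes changed")
    return reasons


# Layouts are hypothetical stand-ins for the order of attributes on the page.
trained_layout = ("price", "location", "beds/baths")
week6_layout = ("price", "beds/baths", "location")

week5 = [
    {"price": 132000, "location": "Salem, OR", "beds": 2, "baths": 1},
    {"price": 365000, "location": "Atlanta, GA", "beds": 4, "baths": 2},
]
week6 = [
    {"price": 132, "location": "Century 21", "beds": 1, "baths": 2},
    {"price": 365, "location": "RE/MAX", "beds": 2, "baths": 4},
]

week5_reasons = check(week5, trained_layout, trained_layout)
week6_reasons = check(week6, week6_layout, trained_layout)
```

Under these assumptions week 5 fires no rule, while week 6 fires all three.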
Contributions
• Develop core MAVERIC system
  • An ensemble of sensors that exploit multiple characteristics of data
  • A combiner that leverages the most effective sensors
• Significantly improve core system
  • Generate synthetic data to improve training
  • Leverage external data to improve training
  • Employ filters to reduce false alarms
• Extensive evaluation over 114 sources in 6 domains
  • Core MAVERIC outperforms related work, improving F-1 by 4-19%
  • Enhancements further improve F-1 by 2-13%

Training the Core MAVERIC System
• Sensors learn internal profiles of data characteristics
  • Example profile: layout of attributes in HTML pages (price location beds / baths)
  • Example profile: avg value of price
• Combiner learns weight for each sensor
  • Employ Winnow to learn weights
[Figure: wrapper output from homeseekers.com on week 1 and week 5 feeds sensors s1 … sm, whose outputs go to the combiner.]
Wrapper output, homeseekers.com on week 1:
price | location | beds | baths
$185,000 | “Urbana, IL” | 2 | 2
$270,000 | “Seattle, WA” | 3 | 2
Wrapper output, homeseekers.com on week 5:
price | location | beds | baths
$132,000 | “Salem, OR” | 2 | 1
$365,000 | “Atlanta, GA” | 4 | 2

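As a rough illustration of how Winnow can weight sensors, here is a minimal sketch of the standard Winnow2 update (multiplicative promotion and demotion). The slides do not say which Winnow variant MAVERIC uses, so the binary sensor votes, the threshold of n/2, the factor alpha = 2, and the name `train_winnow` are all assumptions.

```python
def train_winnow(examples, n_sensors, alpha=2.0):
    """Learn one weight per sensor with the Winnow2 update rule.

    examples: list of (votes, label). votes is a list of 0/1 sensor votes
    (1 = "this sensor thinks the mapping is broken"); label is 1 if the
    mapping really was broken, else 0.
    """
    weights = [1.0] * n_sensors
    theta = n_sensors / 2  # a common threshold choice, assumed here
    for votes, label in examples:
        score = sum(w * v for w, v in zip(weights, votes))
        predicted = 1 if score >= theta else 0
        if predicted == label:
            continue  # Winnow only updates on mistakes
        # Missed alarm: promote the sensors that voted. False alarm: demote them.
        factor = alpha if label == 1 else 1 / alpha
        weights = [w * factor if v else w for w, v in zip(weights, votes)]
    return weights, theta


# Toy data: sensor 0 is reliable, sensors 1 and 2 raise false alarms together.
examples = [
    ([0, 1, 1], 0),
    ([1, 0, 0], 1),
    ([1, 0, 0], 1),
    ([0, 1, 1], 0),
]
weights, theta = train_winnow(examples, n_sensors=3)
```

After these four examples the reliable sensor carries weight 2.0 and the two noisy ones 0.5 each, so a lone vote from sensor 0 clears the threshold while the noisy pair no longer does.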
Verifying with the Core MAVERIC System
• Sensors leverage internal profiles to output sensor scores
• Combiner combines scores based upon weights
  • Alarm if combined score ≥ θ
[Figure: wrapper output from homeseekers.com on week 6 feeds sensors s1 … sm, which send score1 … scorem to the combiner. Callouts: new avg price; layout of attributes has changed.]
Wrapper output, homeseekers.com on week 6:
price | location | beds | baths
132 | “Century 21” | 1 | 2
365 | “RE/MAX” | 2 | 4

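One plausible reading of "combines scores based upon weights" is a weighted average compared against θ. The slides do not give the exact combination formula, so the normalization, the [0, 1] score range, and the example numbers below are assumptions.

```python
def combined_score(scores, weights):
    """Weighted average of sensor scores (each assumed to lie in [0, 1])."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def should_alarm(scores, weights, theta):
    """Alarm if the combined score reaches the threshold theta."""
    return combined_score(scores, weights) >= theta


weights = [2.0, 0.5, 0.5]          # hypothetical learned sensor weights
trusted_fires = [0.9, 0.2, 0.1]    # the heavily weighted sensor is suspicious
noisy_fire = [0.1, 0.9, 0.9]       # only the lightly weighted sensors are

alarm_a = should_alarm(trusted_fires, weights, theta=0.5)
alarm_b = should_alarm(noisy_fire, weights, theta=0.5)
```

With these numbers the first case combines to 0.65 and raises an alarm, while the second combines to about 0.37 and does not.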
Improving Training via Perturbation
• Idea: expand training data by generating synthetic data
• Simulate natural source changes during training
  • Source data changes, e.g., insert and delete tuples
  • Presentation format changes, e.g., $29.99 becomes 29.99 USD
• Perturber: apply change, reapply wrapper, test results
[Figure: the wrapper produces query results from source S at t1 … tn (original results). The perturber turns them into perturbed results. Both form the training data for S, which feeds sensors s1 … sm and the combiner.]
System “practices ahead of time”

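The perturber's three steps (apply change, reapply wrapper, test results) can be sketched for the price-format example. Everything here is a toy under stated assumptions: the regex-based `wrapper` is a hypothetical stand-in for a real wrapper, and labeling an example as broken when the extracted value changes is one guess at what "test results" means.

```python
import re

# Matches a dollar amount such as $29.99 or $185,000; group 1 is the number.
PRICE_RE = re.compile(r"\$(\d[\d,]*(?:\.\d+)?)")


def wrapper(html):
    """Hypothetical wrapper: extract the first '$'-prefixed price, or None."""
    m = PRICE_RE.search(html)
    return m.group(0) if m else None


def reformat_price(html):
    """Apply change: rewrite '$29.99' as '29.99 USD'."""
    return PRICE_RE.sub(lambda m: m.group(1) + " USD", html)


def perturb(html):
    """Apply change, reapply wrapper, test results."""
    original = wrapper(html)
    perturbed_html = reformat_price(html)
    perturbed = wrapper(perturbed_html)
    broke = perturbed != original  # assumed labeling rule
    return perturbed_html, perturbed, broke


perturbed_html, perturbed_value, broke = perturb("<td>$29.99</td>")
```

Here the toy wrapper no longer finds a price after the change, so the perturbed page becomes a synthetic "broken" training example without waiting for the source to change on its own.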
Example: Reformatting Price
[Figure: the perturbation rewrites the homeseekers.com HTML, the wrapper is reapplied, and the original and perturbed results are compared (? =). The perturbed example joins the training data that feeds sensors s1 … sm and the combiner.]
Original HTML: $185,000 Urbana, IL 3bed/2bath…
Perturbed HTML: 185,000 USD Urbana, IL 3bed/2bath…
Original training example (original results):
price | location | beds | baths
$185,000 | “Urbana, IL” | 3 | 2
Perturbed training example (perturbed results):
price | location | beds | baths
185,000 USD | “Urbana, IL” | 3 | 2

Additional Improvements
• Improve training by borrowing data from other sources
[Figure: sources S and S’, each with a wrapper and a source schema mapped to the mediated schema (attributes: description, cost). Source attribute names shown: category, price, comments, amount. Example values shown: house, $185,000, “This…”, 185,000 USD.]
• Reduce false alarms via filtering
[Figure: the potentially corrupt attribute price, with value 185,000 USD, is checked by the filters below and judged valid (“price is valid”).]
  • Web Search Engines: “price is 185,000 USD”, “costs 185,000 USD”
  • Monetary Recognizers: $185,000, $185000.00
  • Other Sources: amount 210 K
(see paper for details)

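A monetary recognizer used as a false-alarm filter might look like the sketch below. The slides defer the details to the paper, so the three patterns and the rule "suppress the alarm only if every suspect value still looks like money" are assumptions, and `looks_like_money` and `filter_alarm` are hypothetical names.

```python
import re

# Hypothetical recognizer patterns covering the formats shown on the slide.
MONEY_PATTERNS = [
    re.compile(r"^\$\d{1,3}(,\d{3})*(\.\d{2})?$"),     # $185,000
    re.compile(r"^\$\d+(\.\d{2})?$"),                  # $185000.00
    re.compile(r"^\d{1,3}(,\d{3})*(\.\d{2})? USD$"),   # 185,000 USD
]


def looks_like_money(value):
    """True if the value matches any known monetary format."""
    return any(p.match(value) for p in MONEY_PATTERNS)


def filter_alarm(alarm, suspect_values):
    """Suppress an alarm when every suspect price value is still valid money."""
    if alarm and all(looks_like_money(v) for v in suspect_values):
        return False
    return alarm


kept = filter_alarm(True, ["Century 21", "RE/MAX"])
suppressed = filter_alarm(True, ["185,000 USD"])
```

In this toy run the alarm on “185,000 USD” is suppressed because the value is still recognizably a price, while the alarm on “Century 21” stands.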
Empirical Evaluation
• Test verification ability over 114 sources in 6 domains

Core MAVERIC Outperforms Prior Work
• Compare with recent system [Lerman et al, Journal of AI Research 03]
Achieve F-1 from 82-93%, an improvement of 4-19% in all domains

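For reference, F-1 is conventionally the harmonic mean of precision and recall. The sketch below assumes that definition applies here, with an alarm on a truly broken mapping counted as a true positive. The counts in the example are invented and are not results from the evaluation.

```python
def f1(tp, fp, fn):
    """F-1 from true positives, false alarms (fp), and missed alarms (fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 9 correct alarms, 1 false alarm, 2 missed alarms.
example_f1 = f1(tp=9, fp=1, fn=2)
```

With these counts precision is 0.9, recall is 9/11, and F-1 works out to 18/21.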
Enhancements Boost Performance
• Progressively enhanced versions of MAVERIC
[Figure: bar chart of F-1 (y-axis, 0.6 to 1) per domain (Flights, Books, Researchers, Real Estate, Inventory, Courses) for four versions: Sensor Ensemble; Sensor Ensemble + Perturbation; Sensor Ensemble + Perturbation + Multi-Src Train; Sensor Ensemble + Perturbation + Multi-Src Train + Filtering.]
Each enhancement improved F-1 in at least 4 domains

Reasons for Mistakes
• Unrecognized instance formats
  • E.g., trained over TIME with format 2:00 pm, source changed format to 1400, output false alarm
  • E.g., trained over DAYS with format M-W-F, source changed format to Mon Wed Fri, output false alarm
  • Train with additional perturbations? Leverage more sources?
• Attributes with similar values
  • E.g., trained with ORDER-DATE before SHIP-DATE, source reversed order, missed alarm on reversed values (ORDER-DATE = 7/13/2004, SHIP-DATE = 7/4/2004)
  • Include additional domain constraints?

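The kind of domain constraint suggested in the last bullet could be as simple as "an order cannot ship before it is placed". The sketch below applies that check to the slide's example values. The constraint itself, the month/day/year date format, and the name `violates_date_order` are assumptions, not part of the described system.

```python
from datetime import datetime


def parse_date(text):
    """Parse dates written like 7/13/2004 (month/day/year assumed)."""
    return datetime.strptime(text, "%m/%d/%Y")


def violates_date_order(row):
    """Hypothetical domain constraint: ORDER-DATE must not be after SHIP-DATE."""
    return parse_date(row["ORDER-DATE"]) > parse_date(row["SHIP-DATE"])


# Values from the slide, extracted after the source reversed the two columns.
reversed_row = {"ORDER-DATE": "7/13/2004", "SHIP-DATE": "7/4/2004"}
# The same values with the columns in their intended order.
correct_row = {"ORDER-DATE": "7/4/2004", "SHIP-DATE": "7/13/2004"}

reversed_flagged = violates_date_order(reversed_row)
correct_flagged = violates_date_order(correct_row)
```

A check like this flags the reversed row even though both values are individually well-formed dates, which is exactly the case the value-based sensors missed.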
Related Work
• Schema matching
  • [Dhamankar et al, 04], [He & Chang, 03], [Kang & Naughton, 03], [Rahm & Bernstein, 01], [Doan, 01]
  • Quantify semantics to compute matching scores
• Activity monitoring
  • [Shavlik & Shavlik, 04], [Lazarevic et al, 03], [Stolfo et al, 01], [Fawcett & Provost, 99], [Allan et al, 98]
  • Profile normal behavior to detect notable events (e.g., intrusions)
• Mapping and wrapper maintenance
  • Wrapper verification: [Lerman et al, 03], [Kushmerick, 00]
  • Mapping and wrapper repair: [Velegrakis et al, 03], [Meng et al, 03], [Chidlovskii, 01]

Conclusion & Future Work
• Developed MAVERIC to reduce maintenance costs
  • An ensemble of sensors that exploit multiple characteristics of data
• Significantly improved core system
  • Perturbation, multi-source training, and filtering
• Extensively evaluated over 114 sources in 6 domains
  • Core outperformed related work, improving F-1 by 4-19%
  • Enhancements further improved F-1 by 2-13%
• Future work
  • Further improve and evaluate MAVERIC
  • Develop a solution for repairing broken mappings