
Less is More: Selecting Sources Wisely for Integration

Xin Luna Dong (AT&T Labs  Google Inc.), Barna Saha, Divesh Srivastava (AT&T Labs-Research). VLDB’2013.


Presentation Transcript


  1. Less is More: Selecting Sources Wisely for Integration • Xin Luna Dong (AT&T Labs  Google Inc.), Barna Saha, Divesh Srivastava (AT&T Labs-Research) • VLDB’2013

  2. “The More, The Better” —for Men

  3. “The More, The Better” —for Women

  4. “The More, The Better” —for DBers

  5. But Data Come with A Cost • Lots of money

  6. But Data Come with A Cost • Lots of machines

  7. But Data Come with A Cost • Lots of people

  8. And The Gain Could Be Small • In total 894 sources, 1265 CS books • 1096 books from the largest source • 1213 books from the 2 largest sources • 1250 books from the 10 largest sources • 1260 books from the first 35 sources • All 1265 books from the first 537 sources • CS books from AbeBooks.com

  9. And The Gain Could Even Be Negative • All 100 books (gold standard) from the first 548 sources • 93 > 80 books w. correct authors after 583 sources (Vote) • 90 > 80 books w. correct authors after 579 sources (Accu) • 78 books w. correct authors for Vote • 80 books w. correct authors for Accu • CS books from AbeBooks.com

  10. Questions • Is it best to integrate all data? • How to spend the computing resources in a wise way? • How to wisely select sources before real integration to balance the gain and the cost? • Source selection is a prelude to data integration and lies outside the traditional integration tasks (schema mapping, entity resolution, data fusion) • Less Is More: Source Selection [VLDB’13]

  11. Maximize Quality Under Budget? • 17 books w. correct authors from 300 sources (budget) • 14 books (17.6% fewer) w. correct authors from the first 200 sources (33% fewer resources) • CS books from AbeBooks.com

  12. Minimize Cost w. Minimal Quality Requirement? • 65 books w. correct authors (quality requirement) from the first 520 sources • 81 books (25% more) w. correct authors from 526 sources (1% more) • CS books from AbeBooks.com

  13. Marginalism Principle in Economic Theory • The law of Diminishing Returns • Largest profit where marginal gain = marginal cost

  14. Marginalism for Source Selection • Marginal point with the largest profit in this ordering: 548 sources • Challenge 1. The Law of Diminishing Returns does not necessarily hold, so there can be multiple marginal points • Challenge 2. Each source is different in quality, so a different ordering leads to different marginal points: the best solution integrates 26 sources • Challenge 3. Estimating gain and cost w/o real integration • CS books from AbeBooks.com

  15. Insight I. Maximizing Profit • Input • S: a set of available sources • F: integration model • Output: subset Ŝ to maximize profit G_F(Ŝ) - C_F(Ŝ) • G_F(Ŝ): gain of integrating Ŝ using model F • C_F(Ŝ): cost of integrating Ŝ using model F • Gain and cost need to be in the same unit to be comparable; e.g., $
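
To make the objective on this slide concrete, here is a minimal Python sketch that evaluates gain minus cost for every subset of a tiny source set and keeps the most profitable one. Everything in it is a hypothetical illustration rather than material from the talk: the source names, the items each source covers, the gain of $10 per distinct item, and the per-source costs. Exhaustive search is used only because the toy input is tiny.

```python
from itertools import combinations

# Hypothetical toy data (not from the talk): items covered by each source
# and the cost in dollars of integrating each source.
COVERAGE = {"s1": {1, 2, 3, 4}, "s2": {3, 4, 5}, "s3": {5}, "s4": {1, 2}}
COST = {"s1": 15, "s2": 10, "s3": 8, "s4": 12}


def toy_gain(subset):
    """Assumed gain model: $10 for every distinct item covered."""
    covered = set()
    for source in subset:
        covered |= COVERAGE[source]
    return 10 * len(covered)


def toy_cost(subset):
    """Assumed cost model: the sum of the per-source costs."""
    return sum(COST[source] for source in subset)


def best_subset(sources, gain, cost):
    """Try every subset and return the one with the largest gain - cost."""
    best = frozenset()
    best_profit = gain(best) - cost(best)
    for size in range(1, len(sources) + 1):
        for combo in combinations(sources, size):
            subset = frozenset(combo)
            profit = gain(subset) - cost(subset)
            if profit > best_profit:
                best, best_profit = subset, profit
    return best, best_profit


if __name__ == "__main__":
    subset, profit = best_subset(list(COVERAGE), toy_gain, toy_cost)
    print(sorted(subset), profit)
```

On this toy input the most profitable subset is {s2, s4} with a profit of 28, while integrating all four sources yields a profit of only 5.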

  16. Insight II. Yes, It Is A HARD Problem • Theorem I (NP-Completeness). Under the arbitrary cost model (i.e., different sources have different costs), Marginalism is NP-complete. • Theorem II (A greedy solution can obtain arbitrarily bad results). Let d_opt be the optimal profit and d be the profit of a greedy solution. For any θ > 0, there exists an input set of sources and a gain model s.t. d/d_opt < θ.
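
The claim in Theorem II can be illustrated with a small constructed instance. The sketch below is not the construction from the paper. It assumes a greedy rule that repeatedly adds the source with the largest resulting profit and stops when no addition helps, and it uses hypothetical sources x, y, z with a made-up profit function in which y and z pay off only when integrated together.

```python
from itertools import combinations


def make_trap_profit(k):
    """Hypothetical profit function (gain minus cost) over sources x, y, z."""
    def profit(subset):
        value = 0
        if "x" in subset:
            value += 1          # x alone: gain 2, cost 1
        if {"y", "z"} <= subset:
            value += k          # y and z together: gain k + 2, cost 2
        elif "y" in subset or "z" in subset:
            value -= 1          # y or z alone: cost 1, no gain
        return value
    return profit


def greedy_select(sources, profit):
    """Assumed greedy rule: add the best single source while profit rises."""
    selected = frozenset()
    current = profit(selected)
    while True:
        candidates = [(profit(selected | {s}), s)
                      for s in sources if s not in selected]
        if not candidates:
            return selected, current
        best_profit, best_source = max(candidates)
        if best_profit <= current:
            return selected, current
        selected, current = selected | {best_source}, best_profit


def optimal_profit(sources, profit):
    """Exhaustive optimum, feasible only for tiny inputs."""
    return max(profit(frozenset(combo))
               for size in range(len(sources) + 1)
               for combo in combinations(sources, size))


if __name__ == "__main__":
    trap = make_trap_profit(99)
    chosen, d = greedy_select(["x", "y", "z"], trap)
    d_opt = optimal_profit(["x", "y", "z"], trap)
    print(sorted(chosen), d, d_opt, d / d_opt)
```

With k = 99 the greedy rule stops at {x} with d = 1, while the optimum {x, y, z} has d_opt = 100, so d/d_opt = 0.01. Raising k pushes the ratio below any chosen θ > 0.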

  17. Insight III. An Efficient Algorithm: GRASP Solution • Improvement I. Randomly select from Top-k solutions • Improvement II. Hill climbing to improve the initial solution • Improvement III. Repeat r times and choose the best solution
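
The three improvements listed on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hill-climbing neighborhood (add, remove, or swap one source), the stopping rules, the function names, and the toy coverage and cost data are all hypothetical.

```python
import random

# Hypothetical toy data (not from the talk).
COVERAGE = {"s1": {1, 2, 3, 4}, "s2": {3, 4, 5}, "s3": {5}, "s4": {1, 2}}
COST = {"s1": 15, "s2": 10, "s3": 8, "s4": 12}


def profit(subset):
    """Assumed profit: $10 per distinct item covered minus source costs."""
    covered = set()
    for source in subset:
        covered |= COVERAGE[source]
    return 10 * len(covered) - sum(COST[source] for source in subset)


def construct(sources, profit_fn, k, rng):
    """Improvement I: pick at random among the top-k profitable additions."""
    selected = frozenset()
    while True:
        current = profit_fn(selected)
        ranked = sorted(((profit_fn(selected | {s}), s)
                         for s in sources if s not in selected), reverse=True)
        top = [s for p, s in ranked[:k] if p > current]
        if not top:
            return selected
        selected = selected | {rng.choice(top)}


def hill_climb(sources, profit_fn, start):
    """Improvement II: move to the best neighbor until none is better."""
    current = frozenset(start)
    current_profit = profit_fn(current)
    while True:
        outside = [s for s in sources if s not in current]
        neighbors = [current | {s} for s in outside]
        neighbors += [current - {s} for s in current]
        neighbors += [(current - {a}) | {b} for a in current for b in outside]
        if not neighbors:
            return current
        best = max(neighbors, key=profit_fn)
        if profit_fn(best) <= current_profit:
            return current
        current, current_profit = best, profit_fn(best)


def grasp(sources, profit_fn, k=2, r=20, seed=0):
    """Improvement III: repeat r times and keep the best solution found."""
    rng = random.Random(seed)
    best = frozenset()
    best_profit = profit_fn(best)
    for _ in range(r):
        candidate = hill_climb(sources, profit_fn,
                               construct(sources, profit_fn, k, rng))
        candidate_profit = profit_fn(candidate)
        if candidate_profit > best_profit:
            best, best_profit = candidate, candidate_profit
    return best, best_profit


if __name__ == "__main__":
    print(grasp(list(COVERAGE), profit))
```

On this toy input a purely greedy construction (k = 1) ends at {s1, s3} with profit 27, and hill climbing cannot improve it. With k = 2, each restart ends at profit 27 or at the optimum {s2, s4} with profit 28, so repeating the procedure makes finding the optimum very likely.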

  18. Side Contributions • Side contributions on data fusion • The PopAccu model: monotonicity (adding a source should never decrease fusion quality) • Algorithms to estimate fusion quality: dynamic programming

  19. Experimental Setup • Book data set: CS books at AbeBooks.com in 2007 (894 sources, 1265 books, 24364 records) • Flight data set: Deep-Web sources for “flight status” in 2011 (38 sources, 1200 flights, 27469 records)

  20. Maximizing Fusion Quality • Marginalism selects 165 sources, reaching the highest quality • 228 sources provide books in gold standard • PopAccu outperforms Vote and Accu, and is nearly monotonic for “good” sources

  21. Source Selection: The Goal • Marginalism has higher profit than MaxGLimitC and MinCLimitG most of the time

  22. Source Selection: The Approach • Greedy solution often cannot find the optimal solution • GRASP (top-10, repeating 320 times) obtains nearly optimal results

  23. Future Work • Full-fledged source selection for data integration • Other quality measures: e.g., freshness, consistency, redundancy; correlations, copying relationships between sources • Complex cost and gain models • Selecting subsets of data from each source • Other components of data integration: schema mapping, entity resolution

  24. The More the Better? OR Less is More?
