Probabilistic Information Integration
Maurice van Keulen, Ander de Keijzer
Dagstuhl Seminar 08421 - Probabilistic Information Integration
Automatic data integration
Why does data integration take so long? Why not automate it?
• The schema mismatch problem
• The data conversion/mapping problem
• The overlapping data problem (entity resolution / record linkage / data cleaning)
• The proverbial 90% of cases is straightforward and can be handled with little development effort
• The proverbial 10% of cases is hard and takes most of the development time
Let's simply not solve those 10% right away! Let's go for an initial integration that can readily be used:
• "Good is good enough" for many applications
• Let it improve over time during use
How to deal with the remaining 10%
• A conflict between sources is not an inconsistency, but a set of (independent) observations
• Data conflicts and partial/ambiguous matchings are symptoms of semantic uncertainty
Our approach to data integration:
• Define a few rules that resolve only the proverbial 90% of cases
• Store the initial integration result as uncertain data
• Start using the integrated data (time-to-market 10× earlier)
  • Queries will return uncertain answers
  • But the integrated data can already be meaningfully used
• Feedback during use gradually improves the integration (e.g., feedback on query answers)
Data integration process
1. (Semi-)automatic data integration with an external source — allows early meaningful use of the integrated data (DB state: uncertain)
2. Query — returns an uncertain query answer
3. Feedback (a statement about a query answer), via user interaction — solves remaining semantic uncertainty during use (DB state: less uncertain, eventually certain)
What we built
• Demo GUI
• Probabilistic integration functionality: IMPrECISE
• Probabilistic XML database (αML) on top of an XML DBMS: MonetDB/XQuery
Focus of this talk:
• Differences / correspondences between probabilistic XML and relational DBs
• The probabilistic integration algorithm
• What would defy my purpose?
  • What is quality? (metrics)
  • When is it good enough? (experiments)
Data representation
A probabilistic XML tree represents all possible worlds in one tree: probabilistic nodes (choice points), possibilities, and ordinary XML nodes with tag names.
Possible worlds of the example tree:
• Movie list with 1 movie (King Kong/1933), probability .4 × .2 = 8%
• Movie list with 1 movie (King Kong/1976), probability .4 × .8 = 32%
• Movie list with 2 movies (King Kong/1933 and King Kong/1976), probability 60%
Can express uncertainty about existence, and both dependent and independent choices.
[Figure: probabilistic XML tree — a root choice point (.4/.6) under 'movies' between one 'movie' node (whose 'yr' is itself a .2/.8 choice between 1933 and 1976) and two 'movie' nodes (King Kong/1933 and King Kong/1976)]
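The possible-world semantics of such a tree can be sketched in a few lines. The nested-tuple/dict encoding below is a hypothetical illustration, not the αML/IMPrECISE data model: a choice point contributes exactly one of its alternatives per world, and probabilities multiply along independent choices.

```python
from itertools import product

# Hypothetical encoding (not the actual αML model):
#   choice point : {"choice": [(prob, [subtrees]), ...]}
#   element node : ("tag", [children]);  text leaf: a string
def worlds_of_list(nodes):
    """Enumerate (probability, flat list of plain subtrees) for a sibling list."""
    per_node = []
    for n in nodes:
        if isinstance(n, dict):          # choice point: one alternative per world
            alts = []
            for p, subtrees in n["choice"]:
                for q, flat in worlds_of_list(subtrees):
                    alts.append((p * q, flat))
            per_node.append(alts)
        elif isinstance(n, tuple):       # ordinary element node
            tag, children = n
            alts = [(q, [(tag, flat)]) for q, flat in worlds_of_list(children)]
            per_node.append(alts)
        else:                            # text leaf
            per_node.append([(1.0, [n])])
    out = []
    for combo in product(*per_node):     # independent siblings: multiply probs
        p, flat = 1.0, []
        for q, part in combo:
            p *= q
            flat.extend(part)
        out.append((p, flat))
    return out

# The King Kong tree from the slide.
movie33 = ("movie", [("tl", ["King Kong"]), ("yr", ["1933"])])
movie76 = ("movie", [("tl", ["King Kong"]), ("yr", ["1976"])])
one_movie = ("movie", [("tl", ["King Kong"]),
                       {"choice": [(0.2, [("yr", ["1933"])]),
                                   (0.8, [("yr", ["1976"])])]}])
tree = ("movies", [{"choice": [(0.4, [one_movie]),
                               (0.6, [movie33, movie76])]}])

for p, world in worlds_of_list([tree]):
    tag, movies = world[0]
    # prints the 8% / 32% / 60% worlds from the slide
    print(f"{p:.0%}: movie list with {len(movies)} movie(s)")
```

This brute-force enumeration is exponential in the number of choice points; the point of the compact representation is precisely to avoid materializing it.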
Differences / correspondences: XML vs. relational
What to say about our probabilistic XML DBMS?
Representation:
• Choice point = variable / x-tuple; alternative = possible variable assignment / alternative
• Dependencies expressed via ancestor/descendant relationships = event expression / lineage formula
Querying:
• In XPath/XQuery vs. SQL
• Semantics of querying according to possible-world theory; scalable implementation by working directly on the compact/succinct representation
Motivating example
Scenario of the demo at ICDE, April 2008:
• Portal with daily recommendations of movies on TV
• Source 1: TV guide (e.g., www.tvguide.com)
• Enriched with information from Source 2: IMDB
• Combined: 18 'attributes', of which 6 overlap
• Entity resolution problem with movies and actors
Uncertainty concerning entity resolution
Merging "King Kong" (1933, rating 8.0) from one source with "King Kong" (1976, rating 5.5) from the other yields several possible interpretations:
• Same movie; for conflicting fields, both values are correct: Title: King Kong; Year: 1933, 1976; Rating: 8.0, 5.5 (the schema may exclude this possibility)
• Different movies: (King Kong, 1933, 8.0) and (King Kong, 1976, 5.5)
• Same movie; for conflicting fields, one value is correct: (King Kong, 1933, 8.0), (King Kong, 1976, 8.0), (King Kong, 1933, 5.5), or (King Kong, 1976, 5.5)
Integration functionality
Integration algorithm = XML tree merge (in recursive-descent fashion):
• Similarity matching (in Christoph's words: repair-key)
• Select worlds that satisfy background knowledge
  • Rules / constraints
  • Thresholds
Strict separation of concerns:
• Integration mechanism: enumeration of possibilities + XML tree merge
• Integration intelligence: background knowledge + similarity matching
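The separation of concerns can be illustrated with a minimal sketch: the mechanism enumerates interpretations, while the intelligence (matcher plus thresholds) decides which interpretations survive and with what weight. The function names and the similarity-weighted probabilities below are illustrative assumptions, not the IMPrECISE algorithm itself; the edit-distance-style matcher mirrors the deliberately simple fallback mentioned later in the talk.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Deliberately simple edit-distance-like fallback matcher (hypothetical).
    return SequenceMatcher(None, a, b).ratio()

def merge(n1, n2, threshold=0.5):
    """Merge two (tag, text) nodes into weighted interpretations.

    Returns a list of (probability, interpretation) pairs — the 'enumeration
    of possibilities' part. The threshold is the 'intelligence' part.
    """
    sim = similarity(n1[1], n2[1])
    if sim >= 1.0:                       # identical: certainly the same entity
        return [(1.0, ("same", n1))]
    if sim < threshold:                  # too different: certainly distinct
        return [(1.0, ("distinct", n1, n2))]
    # Ambiguous: keep both interpretations as a choice point,
    # weighted by the similarity score (an illustrative assumption).
    return [(sim, ("same", n1)), (1.0 - sim, ("distinct", n1, n2))]

print(merge(("tl", "King Kong"), ("tl", "King Kong")))
print(merge(("tl", "King Kong"), ("tl", "King Kong (1976)")))
```

In the real algorithm this decision is taken recursively at every level of the two XML trees, and background-knowledge rules further prune the worlds that the matcher leaves open.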
Result
• A compact/succinct representation of all possible merged XML trees
Why in this way?
• The result need not be perfect: an integration result of 'good enough' quality suffices, so semantic issues in data integration are not an obstacle
Knowledge needed for meaningful use:
• Schema info (e.g., movies have exactly 1 year child)
• Some thresholds (e.g., less than a 50% match on titles means not the same movie title)
• A few domain-specific rules (e.g., for (possibly) the same movie, if actors agree on role and the role is unique in the movie, then decide they are the same actor regardless of the difference in name)
• Automatic fallback: edit-distance similarity (should be something better; I use a bad similarity matcher on purpose)
What would defy my purpose?
The purpose is to significantly reduce the software development effort for obtaining an integration of sources that is good enough (reduce time-to-market).
• What is good enough?
  • Useful metrics for data quality
  • A threshold on the metric for when it is good enough
• I do not reduce anything if
  • many rules need to be manually defined and fine-tuned, or
  • thresholds need to be fine-tuned for sufficient accuracy
• Feedback should be able to effectively improve quality
Metrics for integration result
• Metrics for uncertainty
  • # possible worlds
  • Uncertainty density = average number of alternatives per choice point
• Metrics for probability assignment
  • Answer decisiveness: two 50/50 alternatives are less decisive than 90/10
[Figure: three example trees with Density .25, .17, .22 and Decisiveness .83, .89, .72]
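The slide gives intuitions rather than formulas, so the formalizations below are assumptions for illustration only: uncertainty density as the average number of alternatives per choice point, and answer decisiveness as the average top-alternative probability (which makes a 90/10 choice point more decisive than a 50/50 one, as required).

```python
# Assumed formalizations of the two metrics; a choice point is represented
# simply as its list of alternative probabilities.

def uncertainty_density(choice_points):
    """Average number of alternatives per choice point (assumed definition)."""
    return sum(len(cp) for cp in choice_points) / len(choice_points)

def answer_decisiveness(choice_points):
    """Average top-alternative probability (assumed definition)."""
    return sum(max(cp) for cp in choice_points) / len(choice_points)

fifty_fifty = [[0.5, 0.5]]
ninety_ten = [[0.9, 0.1]]
# 90/10 must come out as more decisive than 50/50
assert answer_decisiveness(ninety_ten) > answer_decisiveness(fifty_fifty)
print(uncertainty_density(ninety_ten), answer_decisiveness(ninety_ten))
```

Note that the example values on the slide appear to be normalized to [0, 1]; the raw average-alternatives count above would start at 1, so the actual metric likely applies a normalization not shown here.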
How to measure answer quality?
Year of movie "King Kong"? //yr[../tl="King Kong"]
• (1933): 40% × 20% = 8%; (1976): 40% × 80% = 32%; (1933, 1976): 60%
• Ranking by probability: 1976 at 92%, 1933 at 68% (better: 1× 1976 at 92%, 0× 1976 at 8%; 1× 1933 at 68%, 0× 1933 at 32%)
• This suggests IR-like precision and recall, but
  • query answers are possibly not distinct
  • a correct answer with a high probability is better than one with a low probability (and vice versa for incorrect answers)
• Approach
  • An answer only exists for as much as its probability
  • Expected value of precision and recall
Answer quality measurement
Year of movie "King Kong"? //yr[../tl="King Kong"]
• (1933): 40% × 20% = 8%; (1976): 40% × 80% = 32%; (1933, 1976): 60%
• Ranking by probability: 1976 at 92%, 1933 at 68%
• Suppose 1976 is a correct answer and 1933 is not
• EXP(precision) = EXP(correct) / EXP(all answers) = 0.92 / 1.6 = 57.5%
• EXP(recall) = EXP(correct) / |Human| = 0.92 / 1 = 92%
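The computation above is a direct expectation over answer probabilities and can be reproduced in a few lines. The helper name is illustrative; `human_count` is the size of the human-judged correct answer set (|Human|) from the slide.

```python
# Expected precision/recall over probabilistic query answers:
# an answer "exists" only for as much as its probability.

def expected_precision_recall(answers, correct, human_count):
    """answers: list of (answer, probability); correct: set of true answers."""
    exp_correct = sum(p for a, p in answers if a in correct)   # EXP(correct)
    exp_all = sum(p for _, p in answers)                       # EXP(all answers)
    return exp_correct / exp_all, exp_correct / human_count

# King Kong example: 1976 at 92% (correct), 1933 at 68% (incorrect).
answers = [("1976", 0.92), ("1933", 0.68)]
prec, rec = expected_precision_recall(answers, {"1976"}, human_count=1)
# prints EXP(precision) = 57.5%, EXP(recall) = 92.0%, matching the slide
print(f"EXP(precision) = {prec:.1%}, EXP(recall) = {rec:.1%}")
```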
Too many rules?
• Isn't development effort simply shifted to rule definition and threshold tuning?
• Data: a few "Today's picks" from the TV guide, enriched with an IMDB source of 243,000 movies; 18 attributes in total, 6 overlapping
• Queries: 43 XPath queries
• Rules: DTD info + 1 'rough' rule per entity suffices
• Thresholds: quality is insensitive to 'safe' thresholds
Advice to the developer: don't worry about perfecting the rules and thresholds. Strive for an initial integration result that can be queried with about 90% of entities resolved. For the 10% hard cases, just make sure that you don't miss the one correct match (user feedback cannot invent matches).
User feedback
• User feedback = a statement about a query answer
• Usually, user feedback can be naturally embedded in user interaction
• Example:
  • The contacts application on your mobile phone, integrated/synchronized with the company phone list, the PC at home, and other people's phones (community), with the aim of automatically picking up changes
  • The phone application ranks possible phone numbers according to likelihood for dialing
  • The phone application can automatically give feedback
    • A dialed number gave the error 'invalid number'
    • Both "End call" and "Wrong number" buttons
  • No significant additional interaction needed
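Under the possible-world model, such a feedback statement can be read as evidence that prunes worlds and renormalizes the rest. The sketch below is an illustrative assumption about the mechanism, not the IMPrECISE implementation; worlds are simplified to sets of answers.

```python
# Conditioning uncertain data on feedback: a statement about a query answer
# removes the worlds it contradicts and renormalizes the remaining ones.

def apply_feedback(worlds, holds):
    """worlds: list of (probability, world); holds(world): does the
    feedback statement hold in this world?"""
    kept = [(p, w) for p, w in worlds if holds(w)]
    total = sum(p for p, _ in kept)
    return [(p / total, w) for p, w in kept]

# King Kong worlds from the data-representation slide, as answer sets.
worlds = [(0.08, frozenset({"1933"})),
          (0.32, frozenset({"1976"})),
          (0.60, frozenset({"1933", "1976"}))]

# Negative feedback: "the answer is not just 1933" removes the 8% world.
updated = apply_feedback(worlds, lambda w: w != frozenset({"1933"}))
print([(round(p, 3), sorted(w)) for p, w in updated])
```

Each feedback step can only redistribute probability mass over surviving worlds, which is why the advice on the experiments slide warns that feedback cannot invent matches that were pruned away during integration.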
Is user feedback effective?
• Is user feedback effective enough to quickly and effectively improve integration quality?
• Data: integration result @ margin 4, threshold 0.8
• Queries: 43 XPath queries
• Feedback: several series of 40 consecutive feedbacks, each randomly chosen from the possible ones
[Figure: precision and recall curves for negative, positive, and mixed feedback]
Conclusions
• Many correspondences between probabilistic XML and relational databases
• A simple model for uncertainty in data with well-understood semantics suffices: the possible-world model with discrete choices
• Seems appropriate for schema and data integration for many applications (e.g., portals): early meaningful use of integrated data, improving during use with feedback
• Regarding my worries:
  • First proposals for some quality metrics
  • A few rules and safe thresholds suffice
  • Mixed, targeted user feedback is effective in quickly improving integration quality
Opportunities
• Put a probabilistic relational DBMS underneath
• Techniques for deriving (imperfect) (conditional) functional dependencies may be used to automate rule definition
• Since rules need not be perfect nor handle all cases, tool support for non-expert users becomes possible?
• User feedback may also be used to learn new rules; work is needed to handle wrong user feedback; answer explanation may help in targeting user feedback
• Recent works on probabilistic schema matching/mapping
• More distant future:
  • Autonomous applications that rely only on their own data and metadata for automatic data exchange/integration: a "community of co-operating applications"
  • We need a way to let applications automatically learn how to disambiguate things