
Statistical Schema Matching across Web Query Interfaces

Statistical Schema Matching across Web Query Interfaces. SIGMOD 2003. Bin He, Kevin Chen-Chuan Chang.


Presentation Transcript


  1. Statistical Schema Matching across Web Query Interfaces SIGMOD 2003 Bin He, Kevin Chen-Chuan Chang

  2. Background: Large-Scale Integration of the Deep Web (figure: Query / Result flow over the Deep Web)

  3. Challenge: matching query interfaces (QIs) (figure: example query interfaces from the Book and Music domains)

  4. Traditional approaches of schema matching: pairwise attribute correspondence • Scale is a challenge: pairwise matching works only at small scale, but large scale is a must for our task • Scale is an opportunity: it provides useful context • Example: S1: author, title, subject, ISBN; S2: writer, title, category, format; S3: name, title, keyword, binding • Pairwise attribute correspondences: S1.author ↔ S3.name, S1.subject ↔ S2.category

  5. Deep Web Observation • Proliferating sources • Converging vocabularies

  6. A hidden schema model exists? • Our view (hypothesis): a statistical model M over a finite vocabulary generates QIs with different probabilities; the instantiation probability of interface QI1 is P(QI1|M)

  7. A hidden schema model exists? • Our view (hypothesis): a statistical model M over a finite vocabulary generates QIs with different probabilities, with instantiation probability P(QI1|M) • Now the problem is: given the observed QIs, can we discover M?
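The generative view on this slide can be sketched in code. A minimal toy sampler (the α/β numbers and the attribute vocabulary below are hypothetical, not from the paper): each concept of a hidden model M is instantiated independently with probability α, and an instantiated concept emits exactly one of its synonym attributes according to its β distribution.

```python
import random

# Hypothetical two-level model M: (alpha = P(C|M), {attribute: beta = P(a|C)})
MODEL = [
    (0.9, {"author": 0.6, "writer": 0.3, "name": 0.1}),
    (0.6, {"subject": 0.7, "category": 0.3}),
]

def sample_qi(model, rng=random):
    """Generate one query interface: each concept appears with probability
    alpha; if it appears, exactly one synonym is drawn according to beta."""
    qi = set()
    for alpha, attrs in model:
        if rng.random() < alpha:
            names, betas = zip(*attrs.items())
            qi.add(rng.choices(names, weights=betas)[0])
    return qi
```

Under this view, different sites independently draw their interfaces from the same M, which is why converging vocabularies make M statistically recoverable.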

  8. MGS framework & Goal • Hypothesis modeling • Hypothesis generation • Hypothesis selection • Goal: verify the observed phenomena; validate MGSsd with two metrics

  9. Comparison with Related Work

  10. Outline • MGS • MGSsd: Hypothesis Modeling, Generation, Selection • Deal with Real World Data • Final Algorithm • Case Study • Metrics • Experimental Results • Conclusion and Future Issues • My Assessment

  11. Towards hidden model discovery: statistical schema matching (MGS) 1. Define the abstract model structure M to solve a target question: P(QI|M) = … 2. Given QIs, generate the model candidates M1, M2, … with P(QIs|M) > 0 3. Select the candidate with the highest confidence: what is the confidence of M given the QIs?

  12. MGSsd: Specialize MGS for Synonym Discovery • MGS is generally applicable to a wide range of schema matching tasks, e.g., attribute grouping • Focus: discover synonym attributes, e.g., Author – Writer, Subject – Category • No hierarchical matching: query interface as flat schema • No complex matching: (LastName, FirstName) – Author

  13. Hypothesis Modeling: Structure • Goal: capture synonym relationships • Two-level model structure: concepts are mutually independent; attributes within a concept are mutually exclusive; no overlapping concepts • Possible schemas: I1={author, title, subject, ISBN}, I2={title, category, ISBN}

  14. Hypothesis Modeling: Formula • Definition and formula: probability that M can generate schema I: P(I|M) = ∏ over instantiated concepts C of P(C|M)·P(a|C) × ∏ over uninstantiated concepts C of (1 − P(C|M))

  15. Hypothesis Modeling: Instantiation probability 1. Observing an attribute: P(author|M) = P(C1|M) · P(author|C1) = α1 · β1 2. Observing a schema: P({author, ISBN, subject}|M) = P(author|M) · P(ISBN|M) · P(subject|M) · (1 − P(C2|M)) 3. Observing a schema set: P(QIs|M) = ∏ P(QIi|M)
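The three instantiation probabilities on this slide compose mechanically. A sketch (the α/β values and concept grouping below are toy assumptions):

```python
import math

# Hypothetical model in the slide's two-level shape:
# (alpha = P(C|M), {attribute: beta = P(a|C)})
MODEL = [
    (0.9, {"author": 0.6, "writer": 0.4}),     # C1
    (0.6, {"subject": 0.7, "category": 0.3}),  # C2
    (0.5, {"ISBN": 1.0}),                      # C3
]

def p_schema(qi, model):
    """P(I|M): each instantiated concept contributes alpha * beta of its
    observed attribute; each absent concept contributes (1 - alpha)."""
    p = 1.0
    for alpha, attrs in model:
        present = [a for a in qi if a in attrs]
        if len(present) > 1:   # violates mutual exclusion within a concept
            return 0.0
        p *= alpha * attrs[present[0]] if present else (1.0 - alpha)
    return p

def p_schema_set(qis, model):
    """P(QIs|M) = product over independent schema observations."""
    return math.prod(p_schema(qi, model) for qi in qis)
```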

  16. Consistency check • A set of schemas I as the schema observations • <Ii,Bi>: number of occurrences Bi for each schema Ii • M is consistent if P(I|M) > 0 • Find consistent models as candidates

  17. Hypothesis Generation • Two sub-steps: 1. Consistent concept construction 2. Build hypothesis space

  18. Hypothesis Generation: Space pruning • Prune the space of model candidates: generate M such that P(QI|M) > 0 for any observed QI (mutual exclusion assumption), using a co-occurrence graph • Example: observations QI1 = {author, subject} and QI2 = {author, category} • Space of models: any set partition of {author, subject, category} (figure: the five candidate partitions M1–M5 over concepts C1–C3)

  19. Hypothesis Generation • Prune the space of model candidates: generate M such that P(QI|M) > 0 for any observed QI (mutual exclusion assumption) • Example: observations QI1 = {author, subject} and QI2 = {author, category}; space of models: any set partition of {author, subject, category} • Model candidates after pruning: those in which author does not share a concept with subject or category, namely {(author), (subject), (category)} and {(author), (subject, category)}
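The pruning step can be sketched by enumerating set partitions and discarding any partition that puts two co-occurring attributes into one concept (a brute-force illustration; the paper's co-occurrence-graph construction avoids enumerating the full Bell-number space):

```python
from itertools import combinations

def partitions(items):
    """Yield every set partition of a list of attributes."""
    if not items:
        yield []
        return
    head, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [part[i] | {head}] + part[i + 1:]
        yield part + [{head}]

def consistent_models(qis, vocab):
    """Keep only partitions in which no observed co-occurring pair of
    attributes falls into the same concept (mutual exclusion)."""
    cooccur = {frozenset(p) for qi in qis for p in combinations(sorted(qi), 2)}
    for model in partitions(list(vocab)):
        if all(not any(pair <= concept for pair in cooccur) for concept in model):
            yield model
```

On the slide's example (QI1 = {author, subject}, QI2 = {author, category}), two of the five partitions survive.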

  20. Hypothesis Generation (Cont.) • Build probability functions • Maximum likelihood estimation: estimate the αi and βj that maximize P(I|M)
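For the two-level model of the earlier slides, the maximization has a simple closed form once the partition is fixed (a sketch under the simplifying assumption that each schema contains at most one attribute per concept): α is the fraction of observed schemas in which the concept appears, and β is the relative frequency of each synonym inside its concept. The concept grouping and counts used below are illustrative.

```python
def mle_parameters(qis, concepts):
    """Closed-form maximum likelihood estimates for a fixed concept
    partition: alpha = concept frequency across schemas, beta = relative
    frequency of each synonym inside its concept."""
    n = len(qis)
    fitted = []
    for concept in concepts:
        counts = {a: sum(a in qi for qi in qis) for a in concept}
        total = sum(counts.values())  # = number of schemas containing the
                                      # concept, assuming mutual exclusion
        alpha = total / n
        betas = {a: (c / total if total else 0.0) for a, c in counts.items()}
        fitted.append((alpha, betas))
    return fitted
```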

  21. Hypothesis Selection • Rank the model candidates • Select the model that generates the closest distribution to the observations • Approach: hypothesis testing • Example: select the schema model at significance level 0.05 • χ² = 3.93 < 7.815: accept • χ² = 20.20 > 14.067: reject
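The selection step compares the schema distribution a candidate implies with the observed counts via a chi-square statistic. A minimal sketch (the cell counts below are hypothetical; 7.815 is the 0.05 critical value for 3 degrees of freedom, as on the slide):

```python
def chi_square(observed, expected):
    """Pearson's chi-square statistic over schema-frequency cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# Hypothetical observed vs. model-expected counts for three schemas.
stat = chi_square([6, 3, 1], [5.4, 3.2, 1.4])
accept = stat < 7.815  # accept the candidate at significance level 0.05
```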

  22. Dealing with Real-World Data • Head-often, tail-rare distribution • Attribute selection: systematically remove rare attributes • Rare schema smoothing: aggregate infrequent schemas into a conceptual event I(rare) • Consensus projection: following the concept mutual-independence assumption, extract and aggregate new input schemas with re-estimated parameters
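The attribute-selection step above can be sketched as a simple frequency filter (f = 10% as in the experiments; the schemas below are made up):

```python
def select_attributes(qis, f=0.10):
    """Drop attributes occurring in fewer than a fraction f of the observed
    interfaces (prunes the rare tail of the head-often, tail-rare curve)."""
    n = len(qis)
    freq = {}
    for qi in qis:
        for a in qi:
            freq[a] = freq.get(a, 0) + 1
    keep = {a for a, c in freq.items() if c / n >= f}
    return [set(qi) & keep for qi in qis]
```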

  23. Final Algorithm • Two phases: 1. Build the initial hypothesis space: attribute selection, combine rare interfaces, hypothesis generation 2. Discover the hidden model: extract the common parts of the model candidates of the last iteration, hypothesis selection

  24. Experiment Setup in Case Studies • Over 200 sources in four domains • Threshold f = 10% • Significance level: 0.05 • These parameters can be specified by users

  25. Example of the MGSsd Algorithm M1={(ti), (is), (kw), (pr), (fm), (pd), (pu), (su,cg), (au,ln), (fn)} M2={(ti), (is), (kw), (pr), (fm), (pd), (pu), (su,cg), (au,fn), (ln)}

  26. Metrics • 1. How close it is to the correct schema model • Precision: • Recall: • 2. How well it can answer the target question • Precision: • Recall:

  27. Examples on Metrics • I={<I1,6>, <I2,3>, <I3,1>} • I1={author, subject}, I2={author, category}, I3={subject} • M1={(author:1):0.6, (subject:0.7,category:0.3):1} • M2={(author:1):0.6, (subject:1):0.7, (category:1):0.3} • Metric 1: • Pm(M2,Mc)=0.196+0.036+0.249+0.054=0.58 • Rm(M2,Mc)=0.28+0.12+0.42+0.18=1 • Metric 2:
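The slide's exact metric formulas did not survive extraction, but one concrete reading of the target-question metric is precision and recall over the synonym pairs a model groups together (a hedged sketch, not necessarily the paper's exact definition; the models below are illustrative):

```python
from itertools import combinations

def synonym_pairs(concepts):
    """All unordered attribute pairs a model declares synonymous."""
    return {frozenset(p) for c in concepts for p in combinations(sorted(c), 2)}

def pair_precision_recall(discovered, correct):
    """Precision/recall of discovered synonym pairs against the gold pairs."""
    d, c = synonym_pairs(discovered), synonym_pairs(correct)
    tp = len(d & c)
    precision = tp / len(d) if d else 1.0
    recall = tp / len(c) if c else 1.0
    return precision, recall
```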

  28. Experimental Results • The discovered synonyms are all correct ones; can generate all correct instances • This approach can identify most concepts correctly • Incorrect matchings due to a small number of observations • Do need two suites of metrics • Time complexity is exponential

  29. Holistic Model Discovery vs. Pairwise Attribute Correspondence • Schemas: S1: author, title, subject, ISBN; S2: writer, title, category, format; S3: name, title, keyword, binding • Pairwise attribute correspondence: S1.author ↔ S3.name, S1.subject ↔ S2.category • Holistic model discovery: concepts such as (author, writer, name) and (subject, category) across all schemas • Advantages • Scalability: large-scale matching • Solvability: exploit statistical information • Generality

  30. Conclusions & Future Work • Holistic statistical schema matching over massive sources • MGS framework to find synonym attributes • Discovers hidden models • Suited for large-scale databases • Results verify the observed phenomena and show accuracy and effectiveness • Future issues • Complex matching: (Last Name, First Name) – Author • More efficient approximation algorithms • Incorporating other matching techniques

  31. My Assessments • Promise • Use minimal “light-weight” information: attribute name • Effective with sufficient instances • Leverage challenge as opportunity • Limitation • Need sufficient observations • Simple Assumptions • Exponential time complexity • Homonyms

  32. Questions
