1 / 24

Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems

Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems. Chen Li Information and Computer Science University of California, Irvine. Constraints in Data Integration. Sources of Orange County (OC) housing information. Mediator. house(street,zip,price,sqft,year).

derika
Download Presentation

Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems Chen Li Information and Computer Science University of California, Irvine

  2. Constraints in Data Integration • Sources of Orange County (OC) housing information Mediator house(street,zip,price,sqft,year) s2(street, price, year) s1(street, zip, price, sqft) • Constraints: • s1 price at least $250K • s2 price at least $280K • Query: • “Find Orange County houses cheaper than $200K” • Answer: empty, without checking the source table instances

  3. Another Example Mediator house(street,zip,price,sqft,year) s2(street, price, year) s1(street, zip, price, sqft) • Constraint: OC houses have a unique street • Query: • “Find sqft and year of OC houses” • Plan: join two source relations on the “street” attributes • The plan is invalid if the constraint is not true.

  4. Importance of Constraints • They express a rich amount of information about sources • They can be utilized to help query answering

  5. Our contributions • There has been a lot of work on how to use constraints in query optimization, e.g.: • [Hsu, Knoblock, 2000] • [Godfrey, Gryz, Zuzarte, 2001] • … • To make this optimization possible, we primarily study: • how to describe constraints; and • how to manipulate constraints between “local” and “global”

  6. Outline • Constraints motivation • Local constraints • Global constraints • Conclusions and open problems

  7. Describing various constraints house(street,zip,price,year,sqft) • Examples: • C1: s1.price >= $250K • C2: s1.street is unique • C3: s2.year < 1995, s2.price >= $280K • C4: street is unique for all OC houses • C5: all houses in the system are at least $250K • Where to put them? • Some constraints can be described at sources “locally” (C1,C2,C3) • Others are more suitable to be described “globally” (C4,C5) s1(street, zip, price, sqft) s2(street, price, year)

  8. Local constraints: described at sources • Designed by individual sources • Common local constraints: • Range constraints • “price >= $250K” • Enumeration constraints • “state in {CA, NV, AZ, OR}” • Functional dependencies • “(street, zip)  (price, year)” • Key constraints • “(street) is a primary key of house predicate” • Foreign-key constraints among tables at a source • Inclusions • …

  9. Other ways to describe local constraints • Could be described using traditional source-view definitions • E.g., in the LAV (local-as-view) approach to data integration: “C1: s1.price >= $250K” can be described as s1(street, zip, price, sqft) :- house(street, zip, price, sqft, year),price >= 250K

  10. Then why not use view/query languages to describe local constraints? • View/query languages might not be expressive enough • E.g., functional-dependency constraints • Query answering could become complicated • “Conjunctive-query rewriting” is already NP • Arithmetic comparisons even require recursive queries • Describing constraints separately have advantages: • More expressive: can capture common knowledge • We care more about those simple, common constraints • Easier to understand and do reasoning

  11. Limitations of local constraints house(street,zip,price,year,sqft) • C4: street is unique for all OC houses • Cannot describe C4 using two local constraints: • “street is a key at s1” & “street is a key at s1” • Wrong! Since they cannot restrict two relations together! • In particular, the following is still allowed • S1: (main, 92697, $300K, 2100) • S2: (main, $380K, 1993) s1(street, zip, price, sqft) s2(street, price, year)

  12. Global constraints (GC) • Some constraints are more suitable to be described “globally” • Example: • C4: street is unique for all OC houses • C5: all houses in the system are at least $250K

  13. Describing global constraints • Might need more expressive languages: • “street” is unique for houses at s1 and s2 • Need to formally define such a “global key” • Consider special cases: • We have source schemas and mediator schema • Mappings exist between them: e.g., LAV, GAV, GLAV, or more general mapping language • In this case, global constraints could be easier to describe

  14. Other advantages of GC • They “summarize” source contents • Mediators can check queries against these conditions, before checking individual sources (thus could avoid unnecessary source checking) • Users get an overview of the data (easier to ask queries) • Give “outside world” a view of the source contents • Especially useful when the mediation system is used as a component of a larger system • E.g., peer-based mediation systems (Raccoon, Piazza, Hyperion, PeerDB, …) or hierarchies of mediators

  15. Two kinds of global constraints • General global constraints • Source-derived global constraints

  16. General global constraints • Conditions that should be satisfied by any database instance of a global predicate: • e.g., C4: street is unique for all OC houses • It can be represented as the general global constraint on the house predicate • (street) forms a key of house predicate • Introduced during the system design to capture the semantics of the application domain • Future new sources expected to satisfy this constraint • Thus: may need to check if it is satisfied by existing and new-coming sources

  17. Source-derived global constraints (Example) • Local constraints: • C1: s1.price >= $250K • C3: s2.year < 1995, s2.price >= $200K  Global constraint C5: house.price >= $200K • C5 is true only when the system has the two sources • In general, there could be a house (not at s1 and s2) that is less than $250K • We don’t care about these houses, pretending they didn’t exist • When future sources come in, we need to check and update this constraint

  18. Source-derived global constraints • It is a condition on global predicates • It must be satisfied by any derived database D of any source view instances satisfying those local constraints. • Derived database D: certain tuples that can be decided based on the view definitions • Depends on the view definition (LAV, GAV, …) • s1(street, zip, price, sqft) :- house(street, zip, price, sqft, year), price >= 250K

  19. Computing Source-derived GC • Input: • Sources with their local constraints • Mappings between source schema and the mediator schema • Output: source-derived GC on mediator schema • We have preliminary results for the LAV approach

  20. Comparisons

  21. Conclusions • Describing constraints in data-integration system is important • We classified different types of constraints: • Local constraints • Global constraints: • General • Source-derived • We showed their advantages and limitations • We studied how to manipulate these constraints (e.g., compute source-derived global constraints from local constraints)

  22. Open problems • Expressive languages to describe cross-source constraints • “tuple-generating dependencies”? • Other simpler but powerful languages • Manipulating the constraints in general • LAV, GAV • more expressive mappings between sources and mediators • Efficient techniques for testing and re-computing of global constraints as sources change

  23. Work conducted in The RACCOON Project on Peer-based Data Integration and sharing,UC Irvine

  24. Acknowledgements We thank Jia Li for her help on the preparation of these slides.

More Related