240 likes | 350 Views
Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems. Chen Li Information and Computer Science University of California, Irvine. Constraints in Data Integration. Sources of Orange County (OC) housing information. Mediator. house(street,zip,price,sqft,year).
E N D
Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems Chen Li Information and Computer Science University of California, Irvine
Constraints in Data Integration • Sources of Orange County (OC) housing information Mediator house(street,zip,price,sqft,year) s2(street, price, year) s1(street, zip, price, sqft) • Constraints: • s1 price at least $250K • s2 price at least $280K • Query: • “Find Orange County houses cheaper than $200K” • Answer: empty, without checking the source table instances
Another Example Mediator house(street,zip,price,sqft,year) s2(street, price, year) s1(street, zip, price, sqft) • Constraint: OC houses have a unique street • Query: • “Find sqft and year of OC houses” • Plan: join two source relations on the “street” attributes • The plan is invalid if the constraint is not true.
Importance of Constraints • They express a rich amount of information about sources • They can be utilized to help query answering
Our contributions • There has been a lot of work on how to use constraints in query optimization, e.g.: • [Hsu, Knoblock, 2000] • [Godfrey, Gryz, Zuzarte, 2001] • … • To make this optimization possible, we primarily study: • how to describe constraints; and • how to manipulate constraints between “local” and “global”
Outline • Constraints motivation • Local constraints • Global constraints • Conclusions and open problems
Describing various constraints house(street,zip,price,year,sqft) • Examples: • C1: s1.price >= $250K • C2: s1.street is unique • C3: s2.year < 1995, s2.price >= $280K • C4: street is unique for all OC houses • C5: all houses in the system are at least $250K • Where to put them? • Some constraints can be described at sources “locally” (C1,C2,C3) • Others are more suitable to be described “globally” (C4,C5) s1(street, zip, price, sqft) s2(street, price, year)
Local constraints: described at sources • Designed by individual sources • Common local constraints: • Range constraints • “price >= $250K” • Enumeration constraints • “state in {CA, NV, AZ, OR}” • Functional dependencies • “(street, zip) (price, year)” • Key constraints • “(street) is a primary key of house predicate” • Foreign-key constraints among tables at a source • Inclusions • …
Other ways to describe local constraints • Could be described using traditional source-view definitions • E.g., in the LAV (local-as-view) approach to data integration: “C1: s1.price >= $250K” can be described as s1(street, zip, price, sqft) :- house(street, zip, price, sqft, year),price >= 250K
Then why not use view/query languages to describe local constraints? • View/query languages might not be expressive enough • E.g., functional-dependency constraints • Query answering could become complicated • “Conjunctive-query rewriting” is already NP • Arithmetic comparisons even require recursive queries • Describing constraints separately have advantages: • More expressive: can capture common knowledge • We care more about those simple, common constraints • Easier to understand and do reasoning
Limitations of local constraints house(street,zip,price,year,sqft) • C4: street is unique for all OC houses • Cannot describe C4 using two local constraints: • “street is a key at s1” & “street is a key at s1” • Wrong! Since they cannot restrict two relations together! • In particular, the following is still allowed • S1: (main, 92697, $300K, 2100) • S2: (main, $380K, 1993) s1(street, zip, price, sqft) s2(street, price, year)
Global constraints (GC) • Some constraints are more suitable to be described “globally” • Example: • C4: street is unique for all OC houses • C5: all houses in the system are at least $250K
Describing global constraints • Might need more expressive languages: • “street” is unique for houses at s1 and s2 • Need to formally define such a “global key” • Consider special cases: • We have source schemas and mediator schema • Mappings exist between them: e.g., LAV, GAV, GLAV, or more general mapping language • In this case, global constraints could be easier to describe
Other advantages of GC • They “summarize” source contents • Mediators can check queries against these conditions, before checking individual sources (thus could avoid unnecessary source checking) • Users get an overview of the data (easier to ask queries) • Give “outside world” a view of the source contents • Especially useful when the mediation system is used as a component of a larger system • E.g., peer-based mediation systems (Raccoon, Piazza, Hyperion, PeerDB, …) or hierarchies of mediators
Two kinds of global constraints • General global constraints • Source-derived global constraints
General global constraints • Conditions that should be satisfied by any database instance of a global predicate: • e.g., C4: street is unique for all OC houses • It can be represented as the general global constraint on the house predicate • (street) forms a key of house predicate • Introduced during the system design to capture the semantics of the application domain • Future new sources expected to satisfy this constraint • Thus: may need to check if it is satisfied by existing and new-coming sources
Source-derived global constraints (Example) • Local constraints: • C1: s1.price >= $250K • C3: s2.year < 1995, s2.price >= $200K Global constraint C5: house.price >= $200K • C5 is true only when the system has the two sources • In general, there could be a house (not at s1 and s2) that is less than $250K • We don’t care about these houses, pretending they didn’t exist • When future sources come in, we need to check and update this constraint
Source-derived global constraints • It is a condition on global predicates • It must be satisfied by any derived database D of any source view instances satisfying those local constraints. • Derived database D: certain tuples that can be decided based on the view definitions • Depends on the view definition (LAV, GAV, …) • s1(street, zip, price, sqft) :- house(street, zip, price, sqft, year), price >= 250K
Computing Source-derived GC • Input: • Sources with their local constraints • Mappings between source schema and the mediator schema • Output: source-derived GC on mediator schema • We have preliminary results for the LAV approach
Conclusions • Describing constraints in data-integration system is important • We classified different types of constraints: • Local constraints • Global constraints: • General • Source-derived • We showed their advantages and limitations • We studied how to manipulate these constraints (e.g., compute source-derived global constraints from local constraints)
Open problems • Expressive languages to describe cross-source constraints • “tuple-generating dependencies”? • Other simpler but powerful languages • Manipulating the constraints in general • LAV, GAV • more expressive mappings between sources and mediators • Efficient techniques for testing and re-computing of global constraints as sources change
Work conducted in The RACCOON Project on Peer-based Data Integration and sharing,UC Irvine
Acknowledgements We thank Jia Li for her help on the preparation of these slides.