310 likes | 431 Views
Interactive Query Formulation over Web Service-Accessed Sources. SIGMOD 2006 Best Paper Runner-Up. Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou. CSE 636 Data Integration, March 2008. Large-Scale Data Integration Systems. . Web Domain. Web Forms & Reports.
E N D
Interactive Query Formulationover Web Service-Accessed Sources SIGMOD 2006 Best Paper Runner-Up Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou CSE 636 Data Integration, March 2008
Large-Scale Data Integration Systems Web Domain Web Forms & Reports • CNET’s Top Combinations • CNET’s Search Desktops • PCWorld’s Product Finder End User Application Domain • CNET Computer • PCWorld Portals Developer Application Application Integration Domain Mediator Integrated Schema Compatible Combinations of Computers, Routers and Printers Integration Engineer Source Domain • Dell Computers by CPU • Cisco Routers by Rate • HP Printers by Speed Web Service Web Service Web Service … Source Owner Data Source Data Source • Dell Computers • Cisco Routers • HP Printers Source Schema Source Schema …
Large-Scale Data Integration Systems Web Domain Web Forms & Reports End User Application Domain What queries can the mediator answer for me? CLIDE Developer Application Application Integration Domain Mediator Integrated Schema Integration Engineer Source Domain Web Service Web Service Web Service … Source Owner Data Source Data Source Source Schema Source Schema …
Running Example Parameterized Views Schema Computers(cid, cpu, ram, price) NetCards(cid, rate, standard, interface) Views V1ComByCpu(cpu) (Computer)* SELECT DISTINCT Com1.* FROM Computers Com1 WHERE Com1.cpu=cpu V2 ComNetByCpuRate(cpu, rate) (Computer, NetCard)* SELECT DISTINCT Com1.*, Net1.* FROM Computers Com1, Network Net1 WHERE Com1.cid=Net1.cid AND Com1.cpu=cpu AND Net1.rate=rate Dell Cisco Schema Routers(rate, standard, price, type) Views V3 RouWired() (Router)* SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wired' V4 RouWireless() (Router)* SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wireless' Wired Routers Computers for a givencpu Wireless Routers Computers & NetCards for a givencpu & rate • Conjunctive Queries CQ • Equality & Comparison Conditions • Parameters
Running Example Integrated Schema Developer Application Mediator Integrated Schema • Integrated schema puts togetherthe Dell and Cisco schemas Attribute Associations • (Computers.cid, NetCards.cid) • (NetCards.rate, Routers.rate) • (NetCards.standard, Routers.standard) V1 V2 V3 V4 Dell Cisco
Sophisticated Mediators MakeFeasible Queries Hard to Predict Feasible Queries FQ • Equivalent CQ query rewritings using the views • Might involve more than one views • Order might matter Query: Get all ‘P4’ Computers, together with their NetCards and their compatible ‘Wireless’ Routers Query: Get all Computers Feasible Infeasible E Mediator B D Mediator A C RouWireless() ComNetByCpuRate(‘P4’, ‘10’) V1 ComNetByCpuRate(‘P4’, ‘54’) V4 V2
Problem • Large number of sources • Large number of views (web-services) • Mediator capabilities Developer formulates an application query • Is an application query feasible? • If not, how do I know which ones are feasible? Previous options: • The developer had to browse the view definitions and somehow formulate a feasible query • Or formulate queries until a feasible one is found(trial-and-error) No system-provided guidance
The CLIDE Solution CLIDE • A query formulation interface, which interactivelyguides the developer toward feasible queries by employing a coloring scheme Developer Application Mediator Integrated Schema V1 V2 V3 V4 Dell Cisco
QBE-Like Interfaces Microsoft SQL-Server
CLIDE Interface Last/Next Step Table Alias Selection Boxes • Table, selection, projection and join actions • Feasibility Flag • Color-based suggestions Feasibility Flag Table Boxes Projection Box
Example Interaction Snapshot 1 Yellow required action • All feasible queries require this action White optional action • Feasible queries can be formulatedw/ or w/o these actions
Example Interaction Snapshot 2 • Blue required choice of action • At least one feasible query cannot be formulated unless this action is performed C Mediator A ComByCpu(‘P4’) B V1
Example Interaction Snapshot 3 Join Lines: • Only yellow and blue are displayed • Must appear in Attribute Associations
Example Interaction Snapshot 4 • * any other constant • Red prohibited action • Does not appear in any feasible query • Lead to “Dead End” state
Example Interaction Snapshot 5 F Mediator A D RouWireless() ComNetByCpuRate(‘P4’, rate) E B V4 V2
CLIDE Properties • Completeness of Suggestions • Every feasible query can be formulated by performing yellow and blue actions at every step • Summarization of Suggestions • At every step, only a minimal number of actions is suggested, i.e., the ones that are needed to preserve completeness • Rapid Convergence By Following Suggestions • The shortest sequence of actions from a query to any feasible query consists of suggested actions
Interaction Graph Selection Action Table Action Join Action Com1 Com1.ram Com1.price Com1.cpu=‘P4’ Net1 Com1.cid=Net1.cid Rou1 … … … … … … … … … … • Nodes are queries: One for each qCQ • Edges are actions: Table, selection, projection and join actions • Green nodes are feasible queries • Infinitely big structure • All CQ queries • All possible combinations of actions formulating them
Interaction Graph: Colorable Actions Com1.cid Current Node … Com1.cpu … Com1.cid=* … • Colorable actionsAC labeloutgoing edges of the current node Com1.cpu=* … Com1.ram=* … Com1.price=* … Net1 … Rou1 … Com2 …
Interaction Graph: Colors • Yellow action • Every path from current node n to a feasible node contains • Blue action • At least one feasible query cannot be formulated unless this action is performed (summarization) • Red action • No path to a feasible node contains Current Node Com1.cid … Com1.cpu … Current Node Com1.cid=* … … Com1.cpu=* Com1.cpu=* Net1 Com1.cid=Net1.cid Com1.cid=Net1.cid Net1.rate=’54Mbps’ … … … … Com1.ram=* Com1.cid=Net1.cid … … Net1.rate=’54Mbps’ Net1.rate=’54Mbps’ Com1.price=* … … … Com1.cpu=* Rou1 Rou1 Com1.cid=Net1.cid Net1.rate=Rou1.rate Net1 … … … … … … Rou1 Com2 Com2 … Com2 Com2.cid=Net1.cid Com2.cpu=‘P4’ Net1.rate=‘54Mbps’ … … … … …
CLIDE Architecture Actions • Back-End invoked every time the user performs an action • i.e., the user arrives at a new node in the interactions graph Front-End User Current Query Colored Actions + Feasibility Flag Back-End Color Algorithm Seed Queries SQ Parameters Algorithm Closest Feasible Queries FQC Closest Feasible Queries Algorithm Aliases Collapse Rule Minimal Feasible Extension Queries Maximally-Contained Rewriter Schemas Views Column Associations
Color DeterminedBy a Finite Set of Feasible Queries Challenge: Infinitely Many Feasible Queries ? … Radius … Closest Feasible Queries FQC … n … … … … Solution: Closest Feasible Queries FQC • FQC is sufficient to color actions in AC • Theorem: Set of Closest Feasible Queries is Finite Challenge: How far can the Closest Feasible Queries FQC be? Solution: Based on Maximally Contained Queries FQMC
Maximally Contained Queries FQMC Not Maximally Contained Query: Q3 Get all Computers with a given cpu & ram Maximally Contained Query Query: Q4 Get all Computers with a given ram Maximally Contained Query Query: Q2 Get all Computerswith a given cpu Query: Q1 Get all Computers • Assuming fixed SELECT clause (projection list) • Covered extensively in literature • MiniCon, Bucket, InverseRules Algorithms • FQMC is finite
Closest Feasible Queries FQC Algorithm Challenge: How far can the Closest Feasible Queries FQC be? Solution: Maximally Contained Queries FQMC pLRadius … Maximally Contained Queries FQMC Closest Feasible Queries FQC … … n … … … • Compute maximally contained queries FQMC • Theorem: All FQC queries are reachable via a path of length p pL • The radius pL is the longest path to a maximally contained query …
Closest Feasible Queries FQC Algorithm Challenge: Find the Closest Feasible Queries Maximally Contained Feasible Queries FQMC … Closest Feasible Queries FQC … n … … … … More feasible nodes • Theorem: All queries in FQMC are in FQC • But not all queries in FQC are in FQMC
Closest Feasible Queries FQC Algorithm Solution: Collapse Aliases Maximally Contained Feasible Queries FQMC … Closest Feasible Queries FQC … n … … … … • Collapse Aliases to compute FQC \ FQMC • Check satisfiability
Color Algorithm Yellow and Blue • An action is colored based on which closest feasible queries it appear in • Yellow, if appears in all queries in FQC • Blue, if appears in at least one (but not all) query in FQC White and Red • Attach Maximum Projection Lists to Closest Feasible Queries • Projections that can be added to a feasible query, without compromising feasibility • Projection is white if in the maximum projection list • Color selections based on projections
CLIDE Implementation & Optimizations Front-End Maximally-Contained Rewriter Current Query Colored Actions + Feasibility Flag Minimal Feasible Extension Queries FQME + Maximum Projection Lists Back-End Redundant Actions Removal Color Algorithm Maximally-Contained Feasible Extension Queries + Maximum Projection Lists Seed Queries SQ Redundant Queries Removal Parameters Algorithm Feasible Extension Queries + Maximum Projection Lists Closest Feasible Queries FQC Closest Feasible Queries Algorithm Views Expansion Aliases Collapse Rule Maximally-Contained Feasible Queries over Views + Containment Mappings Minimal Feasible Extension Queries Maximally-Contained Rewriter MiniCon Containment Mappings Logging Schemas Views Column Associations • Views expansion introduce redundancy • Affects CLIDE’s rapid convergence and summarization • Efficient containment test crucial to redundancy removal
CLIDE Performance • Views • Schema C1 C1 … … B1 B1 Ci Ci1 B1 C1 … … … … … CiM CL CL A A Bi Bi1 B2 C1 … … … … BiM BK BK Chains of Stars – No Parameters • Queries A-span = 7 B-span = 3 Selections = 4,6,8,10 A
CLIDE Performance • Views • Schema C1 C1 … … B1 B1 Ci Ci1 B1 C1 … … … … … CiM CL CL A A Bi Bi1 B2 C1 … … … … BiM BK BK Chains of Stars – No Parameters • Queries A-span = 7 B-span = 3 Selections = 4,6,8,10 A
CLIDE Performance • Views • Schema C1 C1 … … B1 B1 Ci Ci1 B1 C1 … … … … … CiM CL CL A A Bi Bi1 B2 C1 … … … … BiM BK BK Chains of Stars – With Parameters • Queries A-span = 7 B-span = 3 Selections = 4,6,8,10 A
CLIDE Summary First interactive query formulation interface based on source and mediator capabilities Applicability • Service-Oriented Architectures • Privacy-Preserving Services Contributions • Interaction Guarantees: Rapid Convergence, Completeness, Summarization of Suggestions • Interaction Graph • Back-End Algorithms • Closest Feasible Queries, Colors, Parameters • Modular, Customizable Architecture http://www.clide.info