Planning for the Web II Execution & Service Integration. Dan Weld University of Washington June, 2003. Acknowledgements. Oren Etzioni Yolanda Gil Keith Golden Alon Halevy Zack Ives Tal Shaked. Caveat. Outline. Execution for Data Integration Coping with incomplete statistics, latency
Planning for the Web II: Execution & Service Integration Dan Weld University of Washington June, 2003
Acknowledgements • Oren Etzioni • Yolanda Gil • Keith Golden • Alon Halevy • Zack Ives • Tal Shaked Caveat
Outline • Execution for Data Integration • Coping with incomplete statistics, latency • Interleaved planning & execution • Convergent query processing • Service Integration • Web service composition • Background • Representational issues • Planning algorithms • Automated data analysis
Optimization and Execution • Problem: • Few and unreliable statistics about the data. • Unexpected (possibly bursty) network transfer rates. • Generally, unpredictable environment. • General solution: (research area) • Adaptive query processing. • Interleave optimization and execution. As you get to know more about your data, you can improve your plan.
Adaptivity & Incremental Processing Query Performance Evaluated within the Tukwila system [Ives PhD]
Query Optimization: Model Query Plans’ Execution & Choose the Best • From source sizes and stats, estimate result sizes and costs • Estimates, assumptions introduce error: • Exponential increase in estimation error with each join [Ioannidis & Christodoulakis 91] [Antoshenkov 93,96] • Worse if no detailed statistics [Figure: two candidate operator trees for joining Restock (R, 100 tuples), Orders (O, 50 tuples), and Shipping (S, 90 tuples). Both produce ROS with ~270 tuples; the intermediate results are OS (~15 tuples) and RO (~30 tuples); the estimated costs are 50 sec and 30 sec.]
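The compounding effect of per-join estimation error can be illustrated with a deliberately simplified model. The sketch below assumes every join's selectivity estimate is off by the same relative factor and that the errors multiply; this is an illustrative assumption, not the analysis from the cited papers, and the function name `compounded_error` is hypothetical.

```python
def compounded_error(per_join_error: float, num_joins: int) -> float:
    """Error factor of a cardinality estimate after num_joins joins.

    Assumption (not from the slides): each join multiplies the running
    estimate by (1 + per_join_error), so the errors compound.
    """
    return (1.0 + per_join_error) ** num_joins


if __name__ == "__main__":
    # A 50% error per join grows quickly with plan depth.
    for joins in (1, 2, 4, 8):
        print(joins, round(compounded_error(0.5, joins), 2))
```

Under this model the error factor grows geometrically with the number of joins, which is why a plan that looks best on paper can be far from best once real cardinalities are seen.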
Why Does Data Integration Make Optimization Harder? Query optimization estimates costs using knowledge about environment and data: • Data source sizes (“cardinalities”): often unavailable or not meaningful in data integration • Histograms: too expensive to maintain in data integration • I/O costs: network I/O costs fluctuate • Need a way to gain this sort of knowledge!
Some Solutions • Adaptive operators • Mid-query reoptimization • Convergent query processing • Query scrambling [Franklin et al.] • Eddies [Hellerstein et al.]
Tukwila Data Integration System Novel components: • Event handler • Optimization-execution loop • Adaptive operators
Double Pipelined Join • Hybrid Hash Join • No output until build relation read • Asymmetric (build vs. probe): optimization requires source behavior knowledge • Double Pipelined Hash Join • Outputs data immediately • Symmetric: requires less source knowledge to optimize • Threads overlap I/O, computation
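The symmetric behavior described above can be sketched in a few lines. This is a minimal single-threaded illustration, not the Tukwila implementation: it does not model the threads, queues, or overflow handling mentioned on the slides, and the function name and the `("L" | "R", row)` arrival format are assumptions made for the example.

```python
from collections import defaultdict


def double_pipelined_join(arrivals, left_key, right_key):
    """Symmetric hash join over tuples arriving from either input.

    arrivals: iterable of (side, row) pairs in arrival order,
              where side is "L" or "R" (hypothetical format).
    Each row is inserted into its own side's hash table and immediately
    probed against the other side's table, so output starts as soon as
    the first matching pair has arrived.
    """
    left_table = defaultdict(list)
    right_table = defaultdict(list)
    for side, row in arrivals:
        if side == "L":
            key = left_key(row)
            left_table[key].append(row)
            for match in right_table.get(key, ()):
                yield (row, match)
        else:
            key = right_key(row)
            right_table[key].append(row)
            for match in left_table.get(key, ()):
                yield (match, row)


if __name__ == "__main__":
    stream = [("L", (1, "a")), ("R", (1, "x")), ("R", (2, "y")), ("L", (2, "b"))]
    for pair in double_pipelined_join(stream, lambda r: r[0], lambda r: r[0]):
        print(pair)
```

Because neither input is designated as the build side, the operator produces results regardless of which source happens to be faster.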
Performance on Networked Data • Join of 3 TPC-H tables (Lineitem, Supplier, Order) sent via JDBC over 10Mb Ethernet [Chart: tuples output (1000s) vs. time (sec)]
Double Pipelined Join in Summary Benefits: • Easier to optimize (symmetric) • Sub-operations scheduled flexibly • Allows overlap of I/O and computation Incurs some overhead: • Threading, queues • Required extensions to intelligently handle overflow: • Same hash function, number of buckets for each side • Approaches: flush buckets on left side or flush symmetrically
Some Solutions • Adaptive operators • Mid-query reoptimization • Interleaved planning and execution • Convergent query processing • Query scrambling • Eddies
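Mid-query reoptimization [Kabra & DeWitt] • Materialization point: write AB to disk • If actual statistics differ from predicted statistics, replan [Figure: join plan over A, B, C, D with a materialization point after the join of A and B]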
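The replanning check at a materialization point can be sketched as follows. This is an illustrative outline only: the ratio test, the `tolerance` default, and the names `should_replan` and `run_with_checkpoint` are assumptions for the example, not details taken from Kabra & DeWitt.

```python
def should_replan(predicted_rows: int, actual_rows: int, tolerance: float = 2.0) -> bool:
    """True when the observed cardinality is off by more than `tolerance`x.

    The ratio test and the default threshold are assumptions for this sketch.
    """
    high = max(predicted_rows, actual_rows)
    low = max(1, min(predicted_rows, actual_rows))
    return high / low > tolerance


def run_with_checkpoint(prefix, predicted_rows, plan_rest, replan, tolerance=2.0):
    """Run the plan prefix, materialize its result, then maybe replan.

    prefix:    callable returning an iterable of rows (e.g. the join AB)
    plan_rest: callable taking the materialized rows (original remainder)
    replan:    callable taking the actual row count, returning a new remainder
    """
    materialized = list(prefix())  # materialization point
    if should_replan(predicted_rows, len(materialized), tolerance):
        plan_rest = replan(len(materialized))
    return plan_rest(materialized)
```

The key design point is that statistics are only trusted once they have been observed on a fully materialized intermediate result.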
Some Solutions • Adaptive operators • Mid-query reoptimization • Convergent query processing • Query scrambling • Eddies
Convergent Query Processing • Instead of adapting remainder of plan • after executing all data on plan prefix • Adapt whole plan • after executing whole plan on part of data • Can better gather information this way…
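Convergent Query Processing in Action: Changing Join Plans in Mid-Stream • Join Restock, Orders, Shipping (R, O, S) [Figure: three phases, each joining one subset of every table (R0 O0 S0; R1 O1 S1; R2 O2 S2), possibly with a different join order per phase, followed by a “cleanup” query plan that draws on intermediate results such as R2O2 and O1S1]
Breaking a Join into Phases: One Subset per Table, Each Phase [Figure: Restock (R) split into R0, R1 and Orders (O) split into O0, O1; regions labeled Phase 0, Phase 1, and Cleanup Phase]
The Cleanup Plan Reuses Previous Work Where Possible • Exclude R0S0O0, R1S1O1, R2S2O2 (already produced by the phases) • Exclude R2O2 (already computed as an intermediate result) [Figure: Restock, Orders, Shipping subsets R0 O0 S0, R1 O1 S1, R2 O2 S2 with intermediate results R2O2 and O1S1 feeding the cleanup plan]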
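The phase-plus-cleanup decomposition can be checked with a small sketch. It assumes each table has been split into the same number of subsets and that all three tables join on a shared key in position 0; it uses a naive nested-loop join and does not model the reuse of intermediate results (such as R2O2) that the cleanup plan exploits. The names `join3` and `phased_join` are hypothetical.

```python
from itertools import product


def join3(rs, os_, ss):
    """Naive three-way join on the key stored in position 0 of each row."""
    return [(r, o, s) for r in rs for o in os_ for s in ss if r[0] == o[0] == s[0]]


def phased_join(r_parts, o_parts, s_parts):
    """Join phase by phase, then run a cleanup over the remaining combinations.

    Phase i joins subset i of every table. The cleanup covers every
    combination (i, j, k) except i == j == k, which the phases already
    produced, so the union equals the full join.
    """
    n = len(r_parts)
    results = []
    for i in range(n):
        results += join3(r_parts[i], o_parts[i], s_parts[i])
    for i, j, k in product(range(n), repeat=3):
        if not (i == j == k):
            results += join3(r_parts[i], o_parts[j], s_parts[k])
    return results
```

Since every subset combination is covered exactly once, no result is lost or duplicated, whatever join order each phase happens to use.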
CQP on a 100Mbps LAN: Nearly “Optimal” Performance • 866MHz P-III, 256MB buffer pool, re-optimization every 10sec [Chart annotation: cost to parse XML]
Slow WAN, Faster CPU: CQP Reduces Work 1GHz P-III, 256MB, re-optimization every 10sec. 1Mbps network, RTT ~50msec
Outline • Execution for Data Integration • Coping with incomplete statistics, latency • Interleaved planning & execution • Convergent query processing • Service Integration • Web service composition • Background • Representational issues • Planning algorithms • Automated data analysis
What is a Web Service? • A web service is a network-accessible interface to application functionality, built using standard Internet protocols (TCP/IP, XML, SOAP, WSDL, …) • Clients of a web service do NOT need to know how it is implemented. • Why interesting? • Increased automation [Figure: an application client reaches application code through a web service over the network]
Case Study: Amazon • Services Exported • Product details (short, long, images, samples) • Purchase functionality • Ratings, reviews, collaborative filtering data, lists, … • Examples • Store builder tools • Amazon Browser – visualization tool • Windows desktop interfaces – drag-n-drop… • MP3 Piranha • Games • Automatic review writer??
Case Study: Google • Services Exported • Search interface • Limits on items returned, queries / day • Examples • Metacrawler functionality • Geosearch ‘nearby Thai restaurants’ • TIGER, FIPS -> lat, long of pages • Robust hyperlinks • Creates a signature for destination pages & tracks with query
Case Study: Fed Express • Shipment tracking • Proof of delivery • Invoice reviewed, adjusted, settled • Schedule pickup time, location • Outgoing or returns • Order supplies (airbills, envelopes, boxes) • Review shipping history • Rate requests • Location, package size • International trade • Required documents, duties, taxes
Case Study: Hailstorm / MyServices • Web Services • MyDocuments • MyAddressbook • MyWallet • MyNotifications …. • Scenario • Wallet keeps receipts, arranges product return • Expedia uses notifications to warn of canceled flight • Reality • Ebay, AmEx, Groove, …
Case Study: OAA • Common schema for travel industry • Reservations • Flights, trains, rental cars, hotels • Time & distances • Payment, deposits, vouchers • Vacation Packages
Web Service Technology Stack • Discovery: UDDI • Description: WSDL • Packaging: SOAP • Transport: Network [Figure: a client asks UDDI for a “shopping web service?” and receives WSDL URIs; it fetches the WSDL description, then its proxy exchanges SOAP request and response packages with the web service]
SOAP (Simple Object Access Protocol) • SOAP Messages • XML Payload • Using SOAP as RPC (Remote Procedure Call) [Figure: SOAP client sends a request message to the SOAP server, which returns a response message]
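A SOAP RPC request is an XML envelope whose body names the operation and carries its parameters. The sketch below builds such a message with Python's standard library; it uses the SOAP 1.1 envelope namespace, while the service namespace `http://example.org/simple`, the function name `build_rpc_request`, and the choice of unqualified parameter elements are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/simple"  # hypothetical service namespace


def build_rpc_request(operation: str, params: dict) -> str:
    """Return a SOAP 1.1 envelope that calls `operation` with `params`."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The operation element lives in the service's namespace.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        # Parameters are emitted as unqualified child elements (an assumption).
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    print(build_rpc_request("foo", {"arg": 7}))
```

A real client would send this payload over a transport such as HTTP POST and parse a matching response envelope; that step is omitted here.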
If a WS were a Phone Call… • XML • represents the conversation, • SOAP • describes the rules for how to call someone • UDDI • is the phone book. • WSDL • describes what the phone call is about and how you can participate.
WSDL for int foo(int arg);
<types>
  <schema targetNamespace="http://tempuri.org/xsd"
          xmlns="http://www.w3.org/2001/XMLSchema"
          xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/"
          xmlns:wsdl="http://schemas...l/"
          elementFormDefault="qualified">
  </schema>
</types>
<message name="Simple.foo">
  <part name="arg" type="xsd:int"/>
</message>
<message name="Simple.fooResponse">
  <part name="result" type="xsd:int"/>
</message>
<portType name="SimplePortType">
  <operation name="foo" parameterOrder="arg">
    <input message="wsdlns:Simple.foo"/>
    <output message="wsdlns:Simple.fooResponse"/>
  </operation>
</portType>
DISCO • If you know the URL for a service • DISCO lets you query it • And get back a WSDL description • But what if you don’t know the right URL?
UDDI • Hosted Registries • Microsoft, IBM, HP, SAP, NTT, BEA • Entries defined with • Business information • Name, contacts, descriptions, identifier, yellow pages category • Service information • Entities, each of which describes a family of related services which together implement a business process • Binding information • How to invoke: URI, required parameters, options, & Tmodel • Service specifications (Tmodel) • As a symbol – fingerprint to recognize a known service • Decomposable to find WSDL description
Acronyms (W3C, MSFT, IBM) • UDDI • Discover, describe, register services • SOAP-based service for locating WSDL-formatted service descriptions • DISCO • Discover / retrieve SCL+SDL descriptions • SDL / NASSL • SOAP description language: gets params / types • SCL • SOAP contract language: extends SDL with orchestration of messages • WSDL • Describes abstract interface and protocol bindings of arbitrary network services (extends SCL) • XLANG / WSFL / BPEL4WS • Language for business processes used in BizTalk • Business process execution language for web services • MSFT, IBM, BEA proposal [Figure: SDL, NASSL, and SCL lead to WSDL; WSFL and XLANG lead to BPEL4WS]
RDF (Resource Description Framework) • Way to describe resources via metadata • Makes no assumptions about a particular application domain • Based on XML • Another one? Standard for semantic web • Restricts resource descriptions to triples (subject, predicate, object) • Provides a lightweight ontology system • Subproperty, Subclass, Domain & Range
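The triple model and the lightweight ontology features can be illustrated with plain Python tuples. This sketch covers only subclass reasoning, uses no RDF library, and every resource name in it (for example `ex:Hotel`) is hypothetical; the function names are likewise assumptions.

```python
def subclass_closure(triples):
    """All (subclass, superclass) pairs implied by rdfs:subClassOf transitivity."""
    direct = {(s, o) for s, p, o in triples if p == "rdfs:subClassOf"}
    closure = set(direct)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in direct:
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def instances_of(triples, cls):
    """Resources typed as `cls` directly or through a subclass of `cls`."""
    supers = subclass_closure(triples)
    return {
        s for s, p, o in triples
        if p == "rdf:type" and (o == cls or (o, cls) in supers)
    }


if __name__ == "__main__":
    kb = [
        ("ex:Hotel", "rdfs:subClassOf", "ex:Lodging"),
        ("ex:Lodging", "rdfs:subClassOf", "ex:TravelService"),
        ("ex:edgewater", "rdf:type", "ex:Hotel"),
    ]
    print(instances_of(kb, "ex:TravelService"))
```

Every statement is a (subject, predicate, object) tuple, and the subclass relation lets a query for a general class find resources described only with a more specific one.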
DAML+OIL (www.daml.org) • DAML extends RDF and RDFS with richer modeling primitives. • disjointWith, intersectionOf, oneOf, cardinality • Able to provide properties of properties • uniqueness, transitivity, etc.
DAML-S • DAML+OIL ontology describing Web Services • Complements low-level descriptions like WSDL • Describes what a service does and why, not just how to communicate with it • Goals: Discovery, Invocation, Composition, Verification, Execution Monitoring (mapping to WSDL)
Outline • Execution for Data Integration • Coping with incomplete statistics, latency • Interleaved planning & execution • Convergent query processing • Service Integration • Web service composition • Background • Representational issues • Planning algorithms • Automated data analysis
Partial Survey of Planners • UW Internet Softbot • Planners: SENSp / XII / PUCCINI • Repr. languages: UWL / SADL ; LCW • PKS • Planning at the knowledge level • McDermott • Forward-chaining search w/ GRG guidance • McIlraith et al. • ConGolog (procs, loops, conditionals, w/ nondeterminism) • Papazoglou, Traverso et al. • Stratified service arch; XSRL language; MBP • Finin; Srivastava; Knoblock; Ambite; Nau…
Planning for image processing tasks • Many fielded systems • Lansky’s COLLAGE, Chien et al. MVP/ASIP, Golden ADLIM, Blythe GRID… • Spatial representations important [Figure: data-flow pipeline in four stages (Inputs, Filters, Models, Visualization). MODIS FPAR and MODIS LAI each pass through Composite, Reproject, and Mosaic steps (Daily, 8-day, LAZEA); other labels include GOES, RUC2, WGRIB, GRIB bin, soil radiation statistics, soil moisture, min/max temperature, mean precipitation, mean wind, snow cover, stream flow, topography, land surface models, NPP, phenology, false color, and drill-down.]
Motivating Scenarios • Planning a trip • Yahoo maps -> driving time -> travel prefs • Automatic expense form filing • Purchasing a group of items • Aggregation from multiple vendors • Select for: payment types, stock level, delivery • Local & 3rd party reputation services (BBB) • Monitoring marketplace • Auction sites • Events (check calendar / notification service)
UW Internet Softbot • Software robot • Effectors mv, ftp, chmod, cd, lpr, rm, ... • Sensors ls, finger, INSPEC, netfind, wc, ... • Say what we want, not how to do it • Find phone numbers, fetch/print online papers, … • Integrate multiple resources
Motivation/Contributions • Represent actions like ls, finger • Represent goals such as • “Rename paper.tex to kr.tex” • “Print all files in directory papers.” (even with incomplete information) • No previous system could express these
The Middle Ground 1. Action Representation 2. Knowledge Representation
Softbot Architecture Task Manager SADL Actions LCW Knowledge PUCCINI Planner Sensors Effectors UNIX shell & WWW
SADL Family Tree • STRIPS [Fikes & Nilsson, 71] • ADL [Pednault, 89]: ∀, conditional effects • UWL [Etzioni et al, 92]: incomplete info, noise-free sensors • SADL [Golden & Weld, 96]: represents ls, “Rename”, finger...
SADL/UWL Annotations • Goal annotations: • satisfy = achieve by any means • hands-off = don’t change (maintenance) • Effect annotations: • cause = change world • observe = change agent’s knowledge • “Delete the file named junk”: satisfy(name(f, junk)) ∧ satisfy(deleted(f))
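The distinction between the annotations can be made concrete with a small data-structure sketch. This is not SADL syntax or its full semantics: the `Goal` and `Effect` types, the string encoding of propositions, and the `violates_hands_off` check are all assumptions introduced for illustration.

```python
from typing import NamedTuple


class Goal(NamedTuple):
    annotation: str   # "satisfy" or "hands-off"
    proposition: str


class Effect(NamedTuple):
    annotation: str   # "cause" (changes world) or "observe" (changes knowledge)
    proposition: str


def violates_hands_off(goals, effects):
    """True if any effect changes the world on a proposition a goal protects.

    Observe effects only update the agent's knowledge, so they never
    violate a hands-off goal in this simplified model.
    """
    protected = {g.proposition for g in goals if g.annotation == "hands-off"}
    return any(
        e.annotation == "cause" and e.proposition in protected for e in effects
    )


if __name__ == "__main__":
    goals = [Goal("hands-off", "name(f, junk)")]
    rename = [Effect("cause", "name(f, junk)")]    # mv-like action
    listing = [Effect("observe", "name(f, junk)")]  # ls-like action
    print(violates_hands_off(goals, rename), violates_hands_off(goals, listing))
```

Under this model a sensing action such as ls is always compatible with a maintenance goal, while an effector such as mv may not be.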
Information Goals are Temporal • Two time points • When proposition sampled • When reply given • “Tell me now who was President in 1883” • “Tell me tomorrow who is President now” • “Identify (ASAP) the file now named `junk’”