Rapidly Constructing Integrated Applications from Online Sources
Craig A. Knoblock, Information Sciences Institute, University of Southern California
Motivating Example
[Figure: sources shown are BiddingForTravel.com, Priceline, Orbitz, and Map; the integrated application is marked “?”]
Outline • Extracting data from unstructured and ungrammatical sources • Automatically discovering models of sources • Dynamically building integration plans • Efficiently executing the integration plans
Ungrammatical & Unstructured Text (for simplicity, “posts”)
Goal: <hotelArea>univ. ctr.</hotelArea> <price>$25</price> <hotelName>holiday inn sel.</hotelName>
• Wrapper-based IE does not apply (e.g., Stalker, RoadRunner)
• NLP-based IE does not apply (e.g., Rapier)
Reference Sets
IE infused with outside knowledge: “Reference Sets”
• Collections of known entities and the associated attributes
• Online (offline) set of docs, e.g., CIA World Fact Book
• Online (offline) database, e.g., Comics Price Guide, Edmunds, etc.
Our Record Linkage Problem
• Posts are not yet decomposed into attributes.
• Posts contain extra tokens that match nothing in the reference set.
Post: “$25 winning bid at holiday inn sel. univ. ctr.” (contains a hotel name and a hotel area)
Reference set attributes: hotel name, hotel area
Our Record Linkage Solution
P = “$25 winning bid at holiday inn sel. univ. ctr.”
Record-level similarity + field-level similarities:
V_RL = < RL_scores(P, “Hyatt Regency Downtown”), RL_scores(P, “Hyatt Regency”), RL_scores(P, “Downtown”) >
V_RL → binary rescoring → SVM → best matching member of the reference set for the post
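The score vector above can be illustrated with a minimal Python sketch. It is not the method from the talk: token-set Jaccard similarity stands in for RL_scores, a plain sum stands in for the binary rescoring and SVM step, and the second reference record (“Holiday Inn Select”, “University Center”) is a hypothetical entry added only for illustration.

```python
def tokens(text):
    """Lowercase, split on whitespace, and strip periods from token ends."""
    return {t.strip(".") for t in text.lower().split() if t.strip(".")}


def jaccard(a, b):
    """Token-set Jaccard similarity; an assumed stand-in for RL_scores."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def score_vector(post, record):
    """V_RL: one record-level score followed by one score per field."""
    whole_record = " ".join(record.values())
    return [jaccard(post, whole_record)] + [jaccard(post, v) for v in record.values()]


def best_match(post, reference_set):
    """Pick the record with the highest summed score (stand-in for the SVM)."""
    return max(reference_set, key=lambda rec: sum(score_vector(post, rec)))


# The first record comes from the slide; the second is hypothetical.
REFERENCE_SET = [
    {"hotel_name": "Hyatt Regency", "hotel_area": "Downtown"},
    {"hotel_name": "Holiday Inn Select", "hotel_area": "University Center"},
]
POST = "$25 winning bid at holiday inn sel. univ. ctr."

match = best_match(POST, REFERENCE_SET)
print(match["hotel_name"])
```

Note that plain Jaccard cannot relate abbreviations such as “sel.” and “univ.” to “Select” and “University”; a realistic score vector would presumably include similarity measures that tolerate abbreviations.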
Extraction Algorithm
Post: $25 winning bid at holiday inn sel. univ. ctr.
• Generate V_IE for each token: V_IE = < common_scores(token), IE_scores(token, attr1), IE_scores(token, attr2), … >
• Multiclass SVM labels the tokens: price = “$25”; hotel name = “holiday inn sel.”; hotel area = “univ. ctr.”
• Clean whole attribute
Experimental Data Sets Hotels • Posts • 1125 posts from www.biddingfortravel.com • Pittsburgh, Sacramento, San Diego • Star rating, hotel area, hotel name, price, date booked • Reference Set • 132 records • Special posts on BFT site. • Per area – list any hotels ever bid on in that area • Star rating, hotel area, hotel name
Comparison to Existing Systems Record Linkage • WHIRL • RL allows non-decomposed attributes Information Extraction • Simple Tagger (CRF) • State-of-the-art IE • Amilcare • NLP based IE
Record linkage results 10 trials – 30% train, 70% test
Token level Extraction results: Hotel domain Not Significant
Outline • Extracting data from unstructured and ungrammatical sources • Automatically discovering models of sources • Dynamically building integration plans • Efficiently executing the integration plans
Discovering Models of Sources Required for Integration
• Provide uniform access to heterogeneous sources
• Source definitions are used to reformulate queries
• New service, no source model, no integration!
• Can we discover models automatically?
[Figure: the query SELECT MIN(price) FROM flight WHERE depart=“MXP” AND arrive=“PIT” goes to the Mediator, which holds source definitions for United, Lufthansa, and Qantas and sends reformulated queries such as lowestFare(“MXP”,“PIT”) and calcPrice(“MXP”,“PIT”,“economy”) to those web services. The new service, Alitalia, has no source definition (“?”).]
Inducing Source Definitions: A Simple Example
• Step 1: use metadata to classify input types
• Step 2: invoke service and classify output types
Known source: LatestRates($country1,$country2,rate) :- exchange(country1,country2,rate)
New source: RateFinder($fromCountry,$toCountry,val) :- ?
Mediator semantic types: currency {USD, EUR, AUD}; rate {1936.2, 1.3058, 0.53177}
Predicates: exchange(currency,currency,rate) {<EUR,USD,1.30799>,<USD,EUR,0.764526>,…}
Inducing Source Definitions: A Simple Example
• Step 3: generate plausible source definitions
• Step 4: reformulate in terms of other sources
• Step 5: invoke service and compare output
New source: RateFinder($fromCountry,$toCountry,val) :- ?
Predicates: exchange(currency,currency,rate)
Plausible definitions:
def_1($from, $to, val) :- exchange(from,to,val)
def_2($from, $to, val) :- exchange(to,from,val)
Reformulated by the mediator in terms of the known source:
def_1($from, $to, val) :- LatestRates(from,to,val)
def_2($from, $to, val) :- LatestRates(to,from,val)
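Steps 3 to 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: `rate_finder` is a hypothetical stand-in for the new service, the 0.7658 rate and the 1% tolerance are made up, and the two known rates are the example tuples from the previous slide.

```python
# Tuples returned by the known source LatestRates (values from the slide).
KNOWN_RATES = {("EUR", "USD"): 1.30799, ("USD", "EUR"): 0.764526}


def latest_rates(country1, country2):
    """Known source: LatestRates($country1, $country2, rate)."""
    return KNOWN_RATES[(country1, country2)]


def rate_finder(from_country, to_country):
    """Hypothetical stand-in for the new service; its rates are assumed
    to differ slightly from the known source, as live rates would."""
    return {("EUR", "USD"): 1.3058, ("USD", "EUR"): 0.7658}[(from_country, to_country)]


# Step 4: candidate definitions reformulated in terms of the known source.
CANDIDATES = {
    "def_1": lambda frm, to: latest_rates(frm, to),
    "def_2": lambda frm, to: latest_rates(to, frm),
}


def score(candidate, inputs, tolerance=0.01):
    """Step 5: fraction of inputs on which the candidate's output is within
    a relative tolerance of what the new service actually returns."""
    hits = 0
    for frm, to in inputs:
        observed = rate_finder(frm, to)
        predicted = candidate(frm, to)
        if abs(observed - predicted) <= tolerance * abs(observed):
            hits += 1
    return hits / len(inputs)


INPUTS = [("EUR", "USD"), ("USD", "EUR")]
scores = {name: score(fn, INPUTS) for name, fn in CANDIDATES.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

The approximate comparison is a design choice of this sketch: two independent rate services will rarely return identical numbers, so exact equality would reject even the correct definition.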
The Framework
Intuition: Services often have similar semantics, so we should be able to use what we know to induce that which we don’t.
Two-phase algorithm. For each operation provided by the new service:
1. Classify its input/output data types
• Classify inputs based on metadata similarity
• Invoke operation & classify outputs based on data
2. Induce a source definition
• Generate candidates via Inductive Logic Programming
• Test individual candidates by reformulating them
Use Case: Zip Code Data
• Single real zip-code service with multiple operations
• The first operation is defined as:
getDistanceBetweenZipCodes($zip1, $zip2, distance) :- centroid(zip1, lat1, long1), centroid(zip2, lat2, long2), distanceInMiles(lat1, long1, lat2, long2, distance).
• Goal is to induce a definition for a second operation:
getZipCodesWithin($zip1, $distance1, zip2, distance2) :- centroid(zip1, lat1, long1), centroid(zip2, lat2, long2), distanceInMiles(lat1, long1, lat2, long2, distance2), (distance2 ≤ distance1), (distance1 ≤ 300).
• Same service, so no need to classify inputs/outputs or match constants!
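To make the two definitions concrete, here is a small Python sketch that reads them as executable functions. The centroid table is hypothetical (three zip codes with approximate coordinates), and the haversine formula is only an assumed reading of distanceInMiles; the real service's distance computation is not given in the talk.

```python
import math

# Hypothetical centroid relation: zip -> (lat, long), approximate values.
CENTROIDS = {
    "90292": (33.98, -118.45),
    "90089": (34.02, -118.29),
    "15213": (40.44, -79.95),
}


def distance_in_miles(lat1, long1, lat2, long2):
    """Great-circle distance (haversine); assumed semantics of distanceInMiles."""
    earth_radius_miles = 3958.8
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(long2 - long1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))


def get_distance_between_zip_codes(zip1, zip2):
    """centroid(zip1, ...), centroid(zip2, ...), distanceInMiles(...)."""
    lat1, long1 = CENTROIDS[zip1]
    lat2, long2 = CENTROIDS[zip2]
    return distance_in_miles(lat1, long1, lat2, long2)


def get_zip_codes_within(zip1, distance1):
    """Same body as above plus (distance2 <= distance1) and (distance1 <= 300)."""
    if distance1 > 300:
        return []  # the constraint distance1 <= 300 fails, so no tuples
    results = []
    for zip2 in CENTROIDS:
        distance2 = get_distance_between_zip_codes(zip1, zip2)
        if distance2 <= distance1:
            results.append((zip2, distance2))
    return results


nearby = get_zip_codes_within("90292", 50)
print(nearby)
```

As in the Datalog definition, nothing excludes zip2 = zip1, so the input zip code itself appears in the result with distance 0.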
Generating definitions: ILP
• Want to induce source definition for: getZipCodesWithin($zip1, $distance1, zip2, distance2)
• Predicates available for generating definitions: {centroid, distanceInMiles, ≤, =}
• New type signature contains that of known source
• Use known definition as starting point for local search:
getDistanceBetweenZipCodes($zip1, $zip2, distance) :- centroid(zip1, lat1, long1), centroid(zip2, lat2, long2), distanceInMiles(lat1, long1, lat2, long2, distance).
[Figure annotations on the candidate definitions (the candidates themselves are not preserved): “INVALID: d2 unbound!”; “#d is a constant”; “UNCHECKABLE: lt1 inaccessible!”; “contained in defs 2 & 4”]
Preliminary Results
Settings:
• Number of zip code constants initially available: 6
• Number of samples performed per trial: 20
• Number of candidate definitions in search space: 5
Results:
• Converged on “almost correct” definition!
• Number of iterations to convergence: 12
getZipCodesWithin($zip1, $distance1, zip2, distance2) :- centroid(zip1, lat1, long1), centroid(zip2, lat2, long2), distanceInMiles(lat1, long1, lat2, long2, distance2), (distance2 ≤ distance1), (distance1 ≤ 243).
Related Work
• Classifying Web Services (Hess & Kushmerick 2003), (Johnston & Kushmerick 2004)
• Classify input/output/services using metadata/data
• We learn semantic relationships between inputs & outputs
• Category Translation (Perkowitz & Etzioni 1995)
• Learn functions describing operations available on internet
• We concentrate on a relational modeling of services
• CLIO (Yan et al. 2001)
• Helps users define complex mappings between schemas
• They do not automate the process of discovering mappings
• iMAP (Dhamankar et al. 2004)
• Automates discovery of certain complex mappings
• Our approach is more general (ILP) & tailored to web sources
• We must deal with the problem of generating valid input tuples
Outline • Extracting data from unstructured and ungrammatical sources • Automatically discovering models of sources • Dynamically building integration plans • Efficiently executing the integration plans
Dynamically Building Integration Plans
Traditional Data Integration Techniques
[Figure: the query “Find information about all proteins that participate in Transcription process” is sent to the Mediator, which returns results such as (1) SwissProtein: P36246, (2) GeneBank: AAS60665.1, …]
Dynamically Building Integration Plans (Cont’d)
Problem Solved Here
[Figure: the request “Create a web service that accepts a name of a biological process, <bname>, and returns information about proteins that participate in it” is sent to the Mediator, which produces a new web service.]
Problem Statement (Cont’d) • Assumption • Information-producing web service operations • Applicability • Biological data web services • Geospatial services (WMS, WFS) • Other applications that do not focus on transactions
Query-based Web Service Composition • Query-based approach • View web service operations as source relations with binding restrictions • Can be inferred from WSDL • Create domain ontology • Describe source relations in terms of domain relations • Combined Global-as-View / Local-as-View approach • Use data integration system to answer user queries
Template-based Web Service Composition • Our goal is to compose new web services • We need to answer template queries, not specific queries • Template-based Query Approach • Generate plans that take into account general parameter values, i.e., a Universal Plan [Schoppers et al.] • Easy to generate a universal plan • Plans answer a template query as opposed to a specific query • But plans can be very inefficient • Need to generate optimized “universal integration plans”
Example Scenario
• Sources
Protein:
HSProtein($id, name, location, function, seq, pubmedid)
MMProtein($id, name, location, function, seq, pubmedid)
TranducerProtein($id, name, location, taxonid, seq, pubmedid)
MembraneProtein($id, name, location, taxonid, seq, pubmedid)
DipProtein($id, name, location, taxonid, function)
Protein-Protein Interactions:
MMProteinInteractions($fromid, toid, source, verified)
HSProteinInteractions($fromid, toid, source, verified)
Example Rules and Query
ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :- HSProteinInteractions(fromid, toid, source, verified), (taxonid=9606)
ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :- MMProteinInteractions(fromid, toid, source, verified), (taxonid=10090)
ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :- ProteinProteinInteractions(fromid, itoid, taxonid, source, verified), ProteinProteinInteractions(itoid, toid, taxonid, source, verified)
Q(fromid, toid, taxonid, source, verified) :- fromid = !fromid, taxonid = !taxonid, ProteinProteinInteractions(fromid, toid, taxonid, source, verified)
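The meaning of the recursive rule can be illustrated with a naive bottom-up evaluation in Python. This is a sketch of the rule semantics only, not the plan the mediator generates: the interaction tuples are hypothetical, and the query is answered by computing the full fixpoint and then filtering, whereas the talk's plans bind the inputs first.

```python
# Hypothetical source tuples: (fromid, toid, source, verified).
HS_INTERACTIONS = {("p1", "p2", "srcA", True), ("p2", "p3", "srcA", True)}
MM_INTERACTIONS = {("m1", "m2", "srcB", False)}


def protein_protein_interactions():
    """Evaluate the three rules bottom-up until no new facts are derived."""
    # Rules 1 and 2: tag each source tuple with its taxonid.
    facts = {(f, t, 9606, s, v) for f, t, s, v in HS_INTERACTIONS}
    facts |= {(f, t, 10090, s, v) for f, t, s, v in MM_INTERACTIONS}
    while True:
        # Rule 3: join on the intermediate id; taxonid, source, and
        # verified must agree because the rule shares those variables.
        derived = {
            (f, t, tax, s, v)
            for (f, i, tax, s, v) in facts
            for (i2, t, tax2, s2, v2) in facts
            if i == i2 and (tax, s, v) == (tax2, s2, v2)
        }
        if derived <= facts:  # fixpoint reached
            return facts
        facts |= derived


def query(fromid, taxonid):
    """Q with the template parameters !fromid and !taxonid bound."""
    return {
        fact for fact in protein_protein_interactions()
        if fact[0] == fromid and fact[2] == taxonid
    }


print(sorted(query("p1", 9606)))
```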
Optimized Plan • Exploit constraints in source description to filter queries to sources
Example Scenario
• Q1(fromid, fromname, fromseq, frompubid, toid, toname, toseq, topubid) :- fromid = !fromproteinid, Protein(fromid, fromname, loc1, f1, fromseq, frompubid, taxonid1), ProteinProteinInteractions(fromid, toid, taxonid, source, verified), Protein(toid, toname, loc2, f2, toseq, topubid, taxonid2)
[Figure: ComposedPlan. Input: Fromproteinid. Protein yields Fromproteinid, fromseq; Protein-Protein Interactions yields Fromproteinid, Toproteinid; a second Protein lookup yields Fromproteinid, Toproteinid, toseq; a Join produces the output Fromproteinid, fromseq, Toproteinid, toseq.]
Adding Sensing Operations for Tuple-level Filtering • Compute original plan for a template query • For each constraint on the sources • Introduce constraint into the query • Rerun inverse rules algorithm • Compare cost of new plan to original plan • Save plan with lowest cost
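The loop described above can be sketched in Python. Everything concrete here is an assumption for illustration: `generate_plan` is a toy stand-in for the inverse rules algorithm, `cost` simply counts source calls, and each constraint is tried on its own against the best plan found so far (the slide does not say whether constraints are combined).

```python
# Hypothetical source descriptions: source name -> taxonid it covers.
SOURCES = {"HSProteinInteractions": 9606, "MMProteinInteractions": 10090}


def generate_plan(template_query, constraints):
    """Toy stand-in for the inverse rules algorithm: keep only the
    sources whose description is consistent with the constraints."""
    wanted = {value for name, value in constraints if name == "taxonid"}
    return [src for src, taxon in SOURCES.items() if not wanted or taxon in wanted]


def cost(plan):
    """Assumed cost model: one unit per source call."""
    return len(plan)


def optimize(template_query, constraints):
    """Compute the original plan, then try each constraint and keep
    whichever plan has the lowest cost."""
    best_plan = generate_plan(template_query, frozenset())
    best_cost = cost(best_plan)
    for constraint in constraints:
        candidate = generate_plan(template_query, frozenset([constraint]))
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:
            best_plan, best_cost = candidate, candidate_cost
    return best_plan, best_cost


plan, plan_cost = optimize("Q", [("taxonid", 9606)])
print(plan, plan_cost)
```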
Outline • Extracting data from unstructured and ungrammatical sources • Automatically discovering models of sources • Dynamically building integration plans • Efficiently executing the integration plans
Dataflow-style, Streaming Execution • Map datalog plans into streaming, dataflow execution system (e.g., network query engine) • We use the Theseus execution system since it supports recursion • Key challenges • Mapping non-recursive plans • Mapping recursive plans • Data processing • Loop detection • Query results update • Termination check • Recursive callback
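The challenges listed for recursive plans (loop detection, termination check, recursive callback) can be illustrated with a small streaming sketch built on Python generators. It does not use or imitate the Theseus API; `lookup_interactions` and the interaction graph are hypothetical stand-ins for a web-service call.

```python
from collections import deque

# Hypothetical interaction graph; note the cycle p1 -> p2 -> p1.
EDGES = {"p1": ["p2"], "p2": ["p3", "p1"], "p3": []}


def lookup_interactions(fromid):
    """Stand-in for a web-service call that streams interaction partners."""
    yield from EDGES.get(fromid, [])


def reachable(start):
    """Stream every protein reachable from start, emitting each result
    as soon as it is discovered rather than after the plan finishes."""
    seen = set()            # loop detection: never re-expand a known id
    queue = deque([start])  # recursive callback: results become new inputs
    while queue:            # termination check: stop when no inputs remain
        current = queue.popleft()
        for toid in lookup_interactions(current):
            if toid not in seen:
                seen.add(toid)
                queue.append(toid)
                yield toid


stream = reachable("p1")
first = next(stream)   # available before the traversal has finished
rest = list(stream)
print(first, rest)
```

Because of the cycle, p1 is itself reachable from p1 and appears in the output once; the `seen` set is what keeps the recursion from looping forever.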
Example Translation
ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :- HSProteinInteractions(fromid, toid, source, verified), (taxonid=9606)
ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :- MMProteinInteractions(fromid, toid, source, verified), (taxonid=10090)
ProteinProteinInteractions(fromid, toid, taxonid, source, verified) :- ProteinProteinInteractions(fromid, itoid, taxonid, source, verified), ProteinProteinInteractions(itoid, toid, taxonid, source, verified)
Q(fromid, toid, taxonid, source, verified) :- ProteinProteinInteractions(fromid, toid, taxonid, source, verified), (fromid = !fromproteinid), (taxonid = !taxonid)
Bio-informatics Domain Results • Experiments in Bio-informatics domain where we have 60 real web services provided by NCI • We varied number of domain relations in a query from 1-30 and report composition time with execution time
Tuple-level Filtering • Tuple-level filtering can improve the execution time of the generated integration plan by up to 53.8%
Improvement due to Theseus • Theseus can improve the execution time of the generated web service with complex plans by up to 33.6%
Discussion • Huge number of sources available • Need tools and systems that support the dynamic integration of these sources • In this talk, I described techniques for: • Extracting data from unstructured and ungrammatical sources • Discovering models of online sources required for integration • Dynamic and efficient integration of web sources • Efficient execution of integration plans • Much work still left to be done…
More information…
• http://www.isi.edu/~knoblock
• Matthew Michelson and Craig A. Knoblock. Semantic Annotation of Unstructured and Ungrammatical Text. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, Scotland, 2005.
• Mark James Carman and Craig A. Knoblock. Inducing source descriptions for automated web service composition. In Proceedings of the AAAI 2005 Workshop on Exploring Planning and Scheduling for Web Services, Grid, and Autonomic Computing, 2005.
• Snehal Thakkar, Jose Luis Ambite, and Craig A. Knoblock. Composing, optimizing, and executing plans for bioinformatics web services. VLDB Journal, Special Issue on Data Management, Analysis and Mining for Life Sciences, 14(3):330-353, Sep 2005.