Constraint-based Information Integration. Steven Minton Fetch Technologies Joint work with Craig Knoblock and Jose Luis Ambite (USC/ISI). Geocoder. Tiger Map Server. Integration System. LA County Restaurant Health Ratings. Zagat Restaurants Guide. Example Application. Outline.
Constraint-based Information Integration Steven Minton Fetch Technologies Joint work with Craig Knoblock and Jose Luis Ambite (USC/ISI)
Example Application [Diagram: an Integration System connected to four sources: Geocoder, Tiger Map Server, LA County Restaurant Health Ratings, Zagat Restaurants Guide]
Outline • Agents that access information sources on the web • AgentBuilder – learning from examples • ActiveAtlas -- standardizing data from multiple sources • Constraint-based Integration • Heracles – putting it all together
Information Agents [Diagram: an Information Agent connecting Decision Support and Application Programs to The Web, Databases, Knowledge Bases, and Computer Programs]
Web Agents • Web agents provide uniform query language for data access: “Wrapping a web site”
AgentBuilder • Supervised learning: Extraction rules created from examples • High precision • High reliability
Extraction technology • Expressive extraction rule language: • Extraction rule = sequence of landmarks • Describes how to find the beginning and end of each field
A Sequential Covering Algorithm for “Wrapper Induction”
Training Examples:
Name: Del Taco <p> Phone (toll free) : <b> ( 800 ) 123-4567 </b><p>Cuisine ...
Name: Burger King <p> Phone : ( 310 ) 987-9876 <p> Cuisine: …
• Initial candidate: SkipTo( ( )
• Refinements of the initial candidate: SkipTo( <b> ( ) ... SkipTo(Phone) SkipTo( ( ) ... SkipTo(:) SkipTo( ( )
• Further refinement: … SkipTo(Phone) SkipTo(:) SkipTo( ( ) ...
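The effect of the SkipTo rules on the two training examples can be checked with a small sketch. This is not the AgentBuilder implementation: the tokenizer, the function names `tokenize`, `skip_to`, and `apply_rule`, and the representation of a rule as a list of landmarks are all assumptions made for illustration. It only shows why the initial candidate fails on the Del Taco example and why the refined rule covers both.

```python
import re

def tokenize(text):
    # Assumed tokenizer: HTML tags, words (hyphenated numbers kept whole), single symbols.
    return re.findall(r"<[^>]+>|\w+(?:-\w+)*|[^\w\s]", text)

def skip_to(tokens, start, landmark):
    # Return the position just past the first occurrence of the landmark
    # (a list of consecutive tokens) at or after `start`, or None if absent.
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i + n
    return None

def apply_rule(tokens, rule):
    # A rule is a sequence of landmarks; each SkipTo starts where the last one ended.
    pos = 0
    for landmark in rule:
        pos = skip_to(tokens, pos, landmark)
        if pos is None:
            return None  # the rule does not apply to this page
    return pos

ex1 = tokenize("Name: Del Taco <p> Phone (toll free) : <b> ( 800 ) 123-4567 </b><p>Cuisine ...")
ex2 = tokenize("Name: Burger King <p> Phone : ( 310 ) 987-9876 <p> Cuisine: ...")

initial = [["("]]                      # SkipTo( ( )
tagged = [["<b>", "("]]                # SkipTo( <b> ( )
refined = [["Phone"], [":"], ["("]]    # SkipTo(Phone) SkipTo(:) SkipTo( ( )

for name, rule in [("initial", initial), ("tagged", tagged), ("refined", refined)]:
    for ex in (ex1, ex2):
        pos = apply_rule(ex, rule)
        print(name, None if pos is None else ex[pos])
```

The initial candidate stops at the parenthesis in "(toll free)" on the first example, the `<b>` variant does not match the second example at all, and the refined rule lands on the area code in both.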
Outline • Agents that access information sources on the web • AgentBuilder – learning from examples • ActiveAtlas -- standardizing data from multiple sources • Constraint-based Integration • Heracles – putting it all together
The Problem: Multi-Source Inconsistency
How can the same objects be identified when they are stored in inconsistent text formats?
Zagat’s Restaurant Guide: Art’s Deli • California Pizza Kitchen • Campanile • Citrus • Grill, The • Philippe The Original • Spago
Health Dept Restaurant Listings: Art’s Delicatessen • Ca’ Brea • CPK • The Grill • Patina • Philippe’s The Original • The Tillerman
The Solution: Record Linkage
Zagat’s Restaurants (Name | Street | Phone)
Art’s Deli | 12224 Ventura Boulevard | 818-756-4124
Teresa's | 80 Montague St. | 718-520-2910
Steakhouse The | 128 Fremont St. | 702-382-1600
Les Celebrites | 155 W. 58th St. | 212-484-5113
Dept. of Health (Name | Street | Phone)
Art’s Delicatessen | 12224 Ventura Blvd. | 818/755-4100
Teresa's | 103 1st Ave. between 6th and 7th Sts. | 212/228-0604
Binion's Coffee Shop | 128 Fremont St. | 702/382-1600
Les Celebrites | 160 Central Park S | 212/484-5113
[Diagram: a Query is sent to the Record Linkage component, which is connected to the Zagat’s Agent and the Dept. of Health Agent; each agent returns a table with columns Name, Street, Phone]
Zagat’s (Name | Street | Phone)
Art’s Deli | 12224 Ventura Boulevard | 818-756-4124
Teresa’s | 80 Montague St. | 718-520-2910
Steakhouse The | 128 Fremont St. | 702-382-1600
Les Celebrites | 155 W. 58th St. | 212-484-5113
Dept of Health (Name | Street | Phone)
Art’s Delicatessen | 12224 Ventura Blvd. | 818/755-4100
Teresa’s | 103 1st Ave. between 6th and 7th Sts. | 212/228-0604
Binion’s Coffee Shop | 128 Fremont St. | 702/382-1600
Les Celebrites | 5432 Sunset Blvd | 212/484-5113
Approach to Record Linkage • Learning attribute weighting rules • Learning general transformation rules (attributes compared: Name, Street, Phone)
Active Learning to Determine Matched Records [Tejada, Knoblock, Minton ’01, ’02] • Learn importance of attributes for matching records
Source | Name | Street | Phone
Zagat’s | Art’s Deli | 12224 Ventura Boulevard | 818-756-4124
Dept of Health | Art’s Delicatessen | 12224 Ventura Blvd. | 818/755-4100
Mapping rules:
Name > .9 & Street > .87 => mapped
Name > .95 & Phone > .96 => mapped
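The two mapping rules above can be read as a disjunction of threshold tests on per-attribute similarity scores. The sketch below assumes the scores have already been computed by some similarity measure (not specified on the slide) and uses hypothetical names `MAPPING_RULES` and `is_mapped`; the example scores are invented for illustration.

```python
# Each rule is a conjunction of "attribute similarity > threshold" tests;
# a record pair is mapped if any one rule fires.
MAPPING_RULES = [
    {"name": 0.90, "street": 0.87},   # Name > .9 & Street > .87 => mapped
    {"name": 0.95, "phone": 0.96},    # Name > .95 & Phone > .96 => mapped
]

def is_mapped(scores, rules=MAPPING_RULES):
    # `scores` maps attribute name -> similarity in [0, 1] for one record pair.
    return any(
        all(scores.get(attr, 0.0) > threshold for attr, threshold in rule.items())
        for rule in rules
    )

# Hypothetical similarity scores for three record pairs.
print(is_mapped({"name": 0.92, "street": 0.90, "phone": 0.50}))  # first rule fires
print(is_mapped({"name": 0.96, "street": 0.10, "phone": 0.97}))  # second rule fires
print(is_mapped({"name": 0.92, "street": 0.50, "phone": 0.99}))  # neither rule fires
```

Keeping the rules as data rather than code matches the idea that they are learned: a learner can add or adjust thresholds without changing the matcher.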
Active Atlas Mapping Rule Learner [Flowchart: choose initial examples, which the USER labels; generate a committee of learners; each learner learns rules and classifies examples; the learners' votes are used to choose the next example for the USER to label; the output is the set of mapped objects]
Committee Disagreement • Chooses an example based on the disagreement of the query committee • CPK, California Pizza Kitchen is the most informative example
Examples | Committee votes: M1 | M2 | M3
Art’s Deli, Art’s Delicatessen | Yes | Yes | Yes
CPK, California Pizza Kitchen | Yes | No | Yes
Ca’Brea, La Brea Bakery | No | No | No
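One common way to quantify committee disagreement is vote entropy. The slide does not say which measure Active Atlas uses, so the choice of entropy here is an assumption, as are the names `vote_entropy` and `most_informative`. The votes are the ones in the table above.

```python
import math

def vote_entropy(votes):
    # Entropy of the label distribution among committee members:
    # 0 when all members agree, larger when the votes are split.
    n = len(votes)
    entropy = 0.0
    for label in set(votes):
        p = votes.count(label) / n
        entropy -= p * math.log2(p)
    return entropy

committee_votes = {
    ("Art's Deli", "Art's Delicatessen"): ["Yes", "Yes", "Yes"],
    ("CPK", "California Pizza Kitchen"): ["Yes", "No", "Yes"],
    ("Ca'Brea", "La Brea Bakery"): ["No", "No", "No"],
}

def most_informative(votes_by_example):
    # Pick the candidate pair on which the committee disagrees the most.
    return max(votes_by_example, key=lambda pair: vote_entropy(votes_by_example[pair]))

print(most_informative(committee_votes))
```

Only the CPK pair splits the committee, so it is the pair the user would be asked to label next.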
Outline • Agents that access information sources on the web • AgentBuilder – learning from examples • ActiveAtlas -- standardizing data from multiple sources • Constraint-based Integration • Heracles – putting it all together
Constraint-based Integration • Integrating data from multiple sources often involves reasoning about the information • Constraints provide an approach to expressing relationships and filtering data
Heracles • Framework for building integrated applications • Interleaves planning and information gathering • Uses a constraint reasoner to decide what sources to query and to integrate the results
Dynamically Updates Slots as Information Becomes Available [Figure: slots color-coded BLACK, GREEN, BLUE, and RED]
Constraint Networks for Managing Information • Constraint reasoning system • Propagates information • Decides when to launch information requests • Evaluates constraints • Computes preferences • All run as asynchronous processes to support the user • Components: • Representation of the variables • Representation of constraints • Hierarchical templates • Constraint propagation
Constraint Networks for Integrating Information • Components: • Representation of the variables • Representation of constraints • Hierarchical template representation • Constraint propagation and cycle detection
Constraint Variables • Constraint network consists of a set of variables such as: • MeetingStartTime • MeetingLocation • Variables are related by constraints that determine the possible values of a solution
Constraint Networks for Integrating Information • Components: • Representation of the variables • Representation of constraints • Hierarchical template representation • Constraint propagation and cycle detection
Constraint Representation • Constraints are computable components: • Local calculations (e.g., XQuery) • MeetingStartTime + MeetingDuration --> MeetingEndTime • Web and Database Wrappers • ITN: DepartureAirport, ArrivalAirport, Date --> Flights • Yahoo Weather: City, Date --> Weather prediction • External Programs (Outlook, Planners, etc.) • Outlook Calendar: Date --> Meetings • Results cached in tables
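A constraint of this kind can be sketched as a named function from input variables to an output variable, with results cached in a table keyed by the input values. The `Constraint` class, its fields, and the call counter below are hypothetical and only illustrate the idea; Heracles' actual representation is not shown on the slide.

```python
from datetime import datetime, timedelta

class Constraint:
    """A computable component: inputs --> output, with cached results."""

    def __init__(self, name, inputs, output, fn):
        self.name = name
        self.inputs = inputs      # names of the input variables
        self.output = output      # name of the output variable
        self.fn = fn              # local calculation, wrapper call, or external program
        self.cache = {}           # results table keyed by input values
        self.calls = 0            # how many times fn was actually invoked

    def evaluate(self, values):
        key = tuple(values[name] for name in self.inputs)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.fn(*key)
        return self.cache[key]

# Local calculation: MeetingStartTime + MeetingDuration --> MeetingEndTime
end_time = Constraint(
    "ComputeEndTime",
    ("MeetingStartTime", "MeetingDuration"),
    "MeetingEndTime",
    lambda start, duration: start + duration,
)

# Example values chosen for illustration only.
values = {
    "MeetingStartTime": datetime(2000, 9, 30, 9, 0),
    "MeetingDuration": timedelta(hours=1),
}
first = end_time.evaluate(values)
second = end_time.evaluate(values)   # served from the cache
print(first, end_time.calls)
```

Caching matters most for the wrapper and external-program constraints, where each evaluation may be a network request.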
Drive or Take a Taxi? [Constraint network figure. Variables and values: OriginAddress; DestinationAddress; DepartureDate = Sep 30, 2000; ReturnDate = Oct 2, 2000; Distance = 15.1 miles; DepartureAirport = LAX; Duration = 3 days; ParkingRate = $7.00/day; ParkingTotal = $21.00; TaxiFare = $23.00; ModeToAirport = Drive. Constraints: GetDistance, FindClosestAirport, GetTaxiFare, computeDuration, getParkingRate, multiply, SelectModeToAirport]
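The numbers in this example can be reproduced with a short worked calculation. Two points are assumptions: the duration is counted inclusively (Sep 30 through Oct 2 gives the 3 days shown), and the mode is selected by simply comparing the parking total with the taxi fare. The function names are illustrative, not taken from Heracles.

```python
from datetime import date

def compute_duration(departure, return_date):
    # Assumption: days are counted inclusively, which yields the 3 days on the slide.
    return (return_date - departure).days + 1

def select_mode_to_airport(parking_total, taxi_fare):
    # Assumption: choose whichever option costs less.
    return "Drive" if parking_total <= taxi_fare else "Taxi"

duration = compute_duration(date(2000, 9, 30), date(2000, 10, 2))   # 3 days
parking_total = duration * 7.00                                      # 3 x $7.00/day
mode = select_mode_to_airport(parking_total, 23.00)                  # taxi fare $23.00
print(duration, parking_total, mode)
```

With a $21.00 parking total against a $23.00 taxi fare, the comparison selects Drive, which matches the value shown for ModeToAirport.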
Constraint Networks for Integrating Information • Components: • Representation of the variables • Representation of constraints • Hierarchical template representation • Constraint propagation and cycle detection
Hierarchically-Partitioned Constraint Networks • Template: • Groups related variables and constraints • Organizes information for computation and presentation to user • Templates organized hierarchically • Template decomposed into subtemplates • Choose among alternative subtemplates
Template Structure Template • Arguments: input and output variables • Variables: name, type, default values • Constraints • Expansions: alternative subtemplate calls • GUI specification
Partitioned Constraint Network [Figure: network partitioned into groups of variables, including Who, Company, Subject, Dest. Weather, Origin Weather, Dest. Addr., Origin Addr., Starting Time, Ending Time, Distance, Travel Mode, Depart Time, Arrival Time, Depart Airport, Arrival Airport, Dist. to Airport, Mode to Airport, Parking Lot, Parking Rate, Taxi Fare, Flight Num]
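Template Hierarchy for the Travel Assistant [Figure: AND/OR tree]
• Trip (AND of three numbered subtemplates): ModeToDestination, ModeHotel, ModeNext
• ModeToDestination (OR): Drive, Fly, Taxi
• ModeHotel (OR): Hotel, NoOvernight
• ModeNext (OR): Trip (Return Home), Trip (Return Office), Trip (New Leg), End Trip
• Fly (AND): 1 ModeToAirport, 2 FlightDetail, 3 ModeFromAirport
• ModeToAirport (OR): Drive, Taxi
• ModeFromAirport (OR): Drive, Taxi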
Dynamic Networks: Generalization of Constraint Networks • Variables can be active or inactive • Normal constraints: x1 = k1 ^ … ^ xm = km --> xn = kn • Activity constraints: x1 = k1 ^ … ^ xm = km --> active(xn) • Inactive variables do not participate in the network, i.e., do not propagate constraints
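Activity constraints can be sketched as (condition, variable) pairs: when every variable in the condition is active and holds the required value, the target variable becomes active. The function `active_variables`, the always-active `core` set, and the specific constraints below are assumptions for illustration, loosely modeled on the travel example.

```python
def active_variables(assignment, core, activity_constraints):
    # Start from the always-active core and apply activity constraints
    # until no further variable can be activated.
    active = set(core)
    changed = True
    while changed:
        changed = False
        for condition, target in activity_constraints:
            fires = all(
                var in active and assignment.get(var) == value
                for var, value in condition.items()
            )
            if fires and target not in active:
                active.add(target)
                changed = True
    return active

# Hypothetical activity constraints: condition --> active(target)
activity_constraints = [
    ({"ModeToDestination": "Fly"}, "ModeToAirport"),
    ({"ModeToDestination": "Fly"}, "FlightDetail"),
    ({"ModeToAirport": "Drive"}, "ParkingRate"),
    ({"ModeToAirport": "Taxi"}, "TaxiFare"),
]
core = {"ModeToDestination"}

flying = active_variables(
    {"ModeToDestination": "Fly", "ModeToAirport": "Drive"}, core, activity_constraints
)
driving = active_variables(
    {"ModeToDestination": "Drive", "ModeToAirport": "Drive"}, core, activity_constraints
)
print(sorted(flying))
print(sorted(driving))
```

In the second call, ModeToAirport still holds a value but is inactive, so it cannot activate ParkingRate: inactive variables do not take part in the network.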
Heracles: Template Selection • Core network • Computes values of template selection vars • Always active • Template selection variables • Inputs to activity constraints: determine the choice of subtemplates, i.e., which additional variables are active
Constraint Networks for Integrating Information • Components: • Representation of the variables • Representation of constraints • Hierarchical template representation • Constraint propagation
Constraint Propagation • Approach • When a variable is assigned a value, re-compute the value sets and assigned values of all dependent variables • Proceeds recursively until no values are changed or a cycle is detected • Core network • Propagates all variables through the core network • Remaining variables are computed when a template is opened • Does not perform full CSP • Less costly • Does not require all information in advance • Makes choices locally, so may fail to find optimal assignment
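The recursive propagation described above can be sketched as follows. Constraints are modeled as (inputs, output, function) triples, and a cycle is detected by tracking the chain of variables currently being propagated. The names `assign` and `propagate`, the integer day numbers, and this particular cycle check are assumptions, not the Heracles implementation.

```python
def propagate(values, constraints, changed_var, path=()):
    # Stop if this variable is already on the current propagation chain (cycle).
    if changed_var in path:
        return
    for inputs, output, fn in constraints:
        if changed_var in inputs and all(name in values for name in inputs):
            new_value = fn(*(values[name] for name in inputs))
            if values.get(output) != new_value:
                values[output] = new_value
                # Recurse only when the dependent value actually changed.
                propagate(values, constraints, output, path + (changed_var,))

def assign(values, constraints, var, value):
    values[var] = value
    propagate(values, constraints, var)

# Hypothetical network; days are plain integers to keep the sketch short.
constraints = [
    (("DepartureDay", "ReturnDay"), "Duration", lambda d, r: r - d + 1),
    (("Duration", "ParkingRate"), "ParkingTotal", lambda days, rate: days * rate),
]

values = {"DepartureDay": 0, "ParkingRate": 7}
assign(values, constraints, "ReturnDay", 2)
print(values["Duration"], values["ParkingTotal"])
assign(values, constraints, "ReturnDay", 4)
print(values["Duration"], values["ParkingTotal"])
```

Each assignment touches only the variables downstream of the change, which is why this is cheaper than solving a full CSP but can miss a globally optimal assignment.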
Discussion • General framework for interleaving planning and information gathering • Retrieves information as needed • Gathers and integrates data in a uniform framework • Evaluates tradeoffs and selects among alternatives • Allows the users to explore alternatives • Supports a wide variety of information types: databases, web pages, images, video, etc.
SmartClients [Torrens et al., 2002] • Cast an integration problem as a Constraint Satisfaction Problem (CSP) • Given a request, the server retrieves the required data and sends the data and the CSP to the client • Client solves the CSP locally • Large complex problem transmitted in a small amount of space • Provides fine-grained user interaction with the data
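The client-side step can be illustrated with a toy CSP: the server sends the domains (retrieved data) and the constraints, and the client enumerates consistent assignments. Exhaustive enumeration stands in for a real CSP solver here, and the flight data, tuple layout, and function name `solve_csp` are all hypothetical; this is not the SmartClients code.

```python
from itertools import product

def solve_csp(domains, constraints):
    # Enumerate every combination of domain values and keep those that
    # satisfy all constraints (fine for small problems only).
    names = list(domains)
    solutions = []
    for combo in product(*(domains[name] for name in names)):
        assignment = dict(zip(names, combo))
        if all(check(assignment) for check in constraints):
            solutions.append(assignment)
    return solutions

# Hypothetical data "sent by the server": (flight id, day, price)
domains = {
    "outbound": [("OUT1", 1, 200), ("OUT2", 2, 150)],
    "return": [("RET1", 2, 180), ("RET2", 4, 220)],
}
constraints = [
    lambda a: a["return"][1] > a["outbound"][1],              # return after outbound
    lambda a: a["outbound"][2] + a["return"][2] <= 400,       # total price within budget
]

solutions = solve_csp(domains, constraints)
for s in solutions:
    print(s["outbound"][0], s["return"][0])
```

Because the client holds both the data and the constraints, the user can change the budget or dates and re-solve without another round trip to the server.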
SmartClients: Pros and Cons • Pros • Elegant approach that exploits past work on CSPs • Minimizes data retrieval and supports complex reasoning and integration of the data • Cons • Assumes that all data can be retrieved before any reasoning about the data • In travel planning, assumes that prices are the same on any date and that there are no issues with flight availability
Summary • Our approach for creating “web assistants”: • Agents for accessing web data • Record linkage for mapping between sources • Constraint-based integration provides the glue