1 / 47

Constraint-based Information Integration

Constraint-based Information Integration. Steven Minton Fetch Technologies Joint work with Craig Knoblock and Jose Luis Ambite (USC/ISI). Geocoder. Tiger Map Server. Integration System. LA County Restaurant Health Ratings. Zagat Restaurants Guide. Example Application. Outline.

marci
Download Presentation

Constraint-based Information Integration

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Constraint-based Information Integration Steven Minton Fetch Technologies Joint work with Craig Knoblock and Jose Luis Ambite (USC/ISI)

  2. Geocoder Tiger Map Server Integration System LA County Restaurant Health Ratings Zagat Restaurants Guide Example Application

  3. Outline • Agents that access information sources on the web • AgentBuilder – learning from examples • ActiveAtlas -- standardizing data from multiple sources • Constraint-based Integration • Heracles – putting it all together

  4. Decision Support Application Programs Information Agent The Web Databases Knowledge Bases Computer Programs Information Agents

  5. Web Agents • Web agents provide uniform query language for data access: “Wrapping a web site”

  6. AgentBuilder • Supervised learning: Extraction rules created from examples • High precision • High reliability

  7. Extraction technology • Expressive extraction rule language: • Extraction rule = sequence of landmarks • Describes how to find the beginning and end of each field

  8. A Sequential Covering Algorithm for “Wrapper Induction” Training Examples: Name: Del Taco <p> Phone (toll free) : <b> ( 800 ) 123-4567 </b><p>Cuisine ... Name: Burger King <p> Phone : ( 310 ) 987-9876 <p> Cuisine: …

  9. A Sequential Covering Wrapper Induction Algorithm Training Examples: Name: Del Taco <p> Phone (toll free) : <b> ( 800 ) 123-4567 </b><p>Cuisine ... Name: Burger King <p> Phone : ( 310 ) 987-9876 <p> Cuisine: … Initial candidate:SkipTo( ( )

  10. A Sequential Covering Wrapper Induction Algorithm Training Examples: Name: Del Taco <p> Phone (toll free) : <b> ( 800 ) 123-4567 </b><p>Cuisine ... Name: Burger King <p> Phone : ( 310 ) 987-9876 <p> Cuisine: … SkipTo( <b> ()...SkipTo(Phone) SkipTo( () ...SkipTo(:) SkipTo(() Initial candidate:SkipTo( ( )

  11. A Sequential Covering Wrapper Induction Algorithm Training Examples: Name: Del Taco <p> Phone (toll free) : <b> ( 800 ) 123-4567 </b><p>Cuisine ... Name: Burger King <p> Phone : ( 310 ) 987-9876 <p> Cuisine: … SkipTo( <b> ()...SkipTo(Phone) SkipTo( () ...SkipTo(:) SkipTo(() Initial candidate:SkipTo( ( ) … SkipTo(Phone) SkipTo(:) SkipTo( () ...

  12. Outline • Agents that access information sources on the web • AgentBuilder – learning from examples • Atlas -- standardizing data from multiple sources • Constraint-based Integration • Heracles – putting it all together

  13. Zagat’s Restaurant Guide Health Dept Restaurant Listings Art’s Deli California Pizza Kitchen Campanile Citrus Grill, The Philippe The Original Spago Art’s Delicatessen Ca’ Brea CPK The Grill Patina Philippe’s The Original The Tillerman How can the same objects be identified when they are stored in inconsistent text formats? The Problem: Multi-Source Inconsistency

  14. The Solution: Record Linkage Zagat’s Restaurants Dept. of Health Name Street Phone Art’s Delicatessen 12224 Ventura Blvd. 818/755-4100 Teresa's103 1st Ave. between 6th and 7th Sts. 212/228-0604 Binion's Coffee Shop 128 Fremont St. 702/382-1600 Les Celebrites 160 Central Park S 212/484-5113 Name Street Phone Art’s Deli 12224 Ventura Boulevard 818-756-4124 Teresa's 80 Montague St. 718-520-2910 Steakhouse The 128 Fremont St. 702-382-1600 Les Celebrites 155 W. 58th St. 212-484-5113

  15. Art’s Deli 12224 Ventura Boulevard 818-756-4124 Teresa’s 80 Montague St. 718-520-2910 Steakhouse The 128 Fremont St. 702-382-1600 Les Celebrites 155 W. 58th St. 212-484-5113 Art’s Delicatessen 12224 Ventura Blvd. 818/755-4100 Teresa’s 103 1st Ave. between 6th and 7th Sts. 212/228-0604 Binion’s Coffee Shop 128 Fremont St. 702/382-1600 Les Celebrites 5432 Sunset Blvd 212/484-5113 Query Record Linkage Zagat’s Agent Dept. of Health Agent Zagat’s Dept of Health Name Street Phone Name Street Phone

  16. Approach to Record Linkage • Learning attribute weighting rules • Learning general transformation rules Name Street Phone

  17. Active Learning to Determine Matched Records[Tejada, Knoblock, Minton ’01,’02] • Learn importance of attributes for matching records Name Street Phone Zagat’s Art’s Deli 12224 Ventura Boulevard 818-756-4124 Dept of Health Art’s Delicatessen 12224 Ventura Blvd. 818/755-4100 Mapping rules: Name > .9 & Street > .87 => mapped Name > .95 & Phone > .96 => mapped

  18. Active AtlasMapping Rule Learner Label Choose initial examples Generate committee of learners USER Learn Rules Learn Rules Learn Rules Classify Examples Classify Examples Classify Examples Votes Votes Votes Choose Example Label Set of MappedObjects

  19. Committee Disagreement • Chooses an example based on the disagreement of the query committee • CPK, California Pizza Kitchen is the most informative example Committee Examples M1 M2 M3 Art’s Deli, Art’s Delicatessen CPK, California Pizza Kitchen Ca’Brea, La Brea Bakery Yes Yes Yes Yes No Yes No No No

  20. Outline • Agents that access information sources on the web • AgentBuilder – learning from examples • ActiveAtlas -- standardizing data from multiple sources • Constraint-basedIntegration • Heracles – putting it all together

  21. Constraint-based Integration • Integrating data from multiple sources often involves reasoning about the information • Constraints provide a approach to expressing relationships and filtering data

  22. Heracles • Framework for building integrated applications • Interleaves planning and information gathering • Uses a constraint reasoner to decide what sources to query and to integrate the results

  23. The Travel Assistant

  24. BLACK BLACK GREEN GREEN GREEN GREEN BLUE GREEN GREEN GREEN GREEN BLUE RED RED RED RED GREEN GREEN RED GREEN GREEN RED RED RED RED GREEN GREEN RED Dynamically Updates Slots as Information Becomes Available

  25. Supports Informed Choices

  26. Changes Propagate Throughout

  27. User Can Specify High-Level Preferences

  28. Constraint Networks for Managing Information • Constraint reasoning system • Propagates information • Decides when to launch information requests • Evaluate constraints • Computes preferences • All run as asynchronous processes to support the user • Components: • Representation of the variables • Representation of constraints • Hierarchical templates • Constraint propagation

  29. Constraint Networks for Integrating Information • Components: • Representation of the variables • Representation of constraints • Hierarchical template representation • Constraint propagation and cycle detection

  30. Constraint Variables • Constraint network consists of a set of variables such as: • MeetingStartTime • MeetingLocation • Variables are related by constraints that determine the possible values of a solution

  31. Constraint Networks for Integrating Information • Components: • Representation of the variables • Representation of constraints • Hierarchical template representation • Constraint propagation and cycle detection

  32. Constraint Representation • Constraints are computable components: • Local calculations (e.g., Xquery) • MeetingStartTime + MeetingDuration --> MeetingEndTime • Web and Database Wrappers • ITN: DepartureAirport, ArrivalAirport, Date --> Flights • Yahoo Weather: City, Date --> Weather predication • External Programs (Outlook, Planners, etc) • Outlook Calendar: Date --> Meetings • Results cached in tables

  33. OriginAddress Sep 30, 2000 GetDistance DepartureDate DestinationAddress 15.1 miles Oct 2, 2000 Distance ReturnDate FindClosestAirport LAX GetTaxiFare computeDuration DepartureAirport 3 days $21.00 $23.00 getParkingRate Duration ParkingTotal TaxiFare ParkingRate multiply SelectModeToAirport $7.00/day ModeToAirport Drive Drive or Take a Taxi?

  34. Constraint Networks for Integrating Information • Components: • Representation of the variables • Representation of constraints • Hierarchical template representation • Constraint propagation and cycle detection

  35. Hierarchically-Partitioned Constraint Networks • Template: • Groups related variables and constraints • Organizes information for computation and presentation to user • Templates organized hierarchically • Template decomposed into subtemplates • Choose among alternative subtemplates

  36. Template Structure Template • Arguments: input and output variables • Variables: name, type, default values • Constraints • Expansions: alternative subtemplate calls • GUI specification

  37. Partitioned Constraint Network Who Company Dest Weather OriginWeather Subject Dest. Addr. Origin Addr. Starting Time Distance Ending Time Travel Mode Depart Time Depart Airport Dist. toAirport Arrival Time Parking Lot Taxi Fare Flight Num Parking Rate Mode toAirport Arrival Airport

  38. Trip AND 1 3 2 ModeToDestination ModeHotel ModeNext OR OR OR Hotel NoOvernight Drive Fly Taxi AND Trip (Return Home) Trip (Return Office) Trip (New Leg) End Trip 1 2 3 ModeToAirport FlightDetail ModeFromAirport OR OR Drive Taxi Drive Taxi Template Hierarchy for the Travel Assistant

  39. Dynamic Networks Generalization of Constraint Networks • Variables can be active or inactive • Normal Constraints x1 = k1 ^ … ^ xm = km  xn = kn • Activity constraints: x1 = k1 ^ … ^ xm = km  active(xn) • Inactive variables do not participate in the network, i.e., do not propagate constraints

  40. Heracles: Template Selection • Core network • Computes values of template selection vars • Always active • Template selection variables • Inputs to activity constraints: determine the choice of subtemplates, i.e., which additional variables are active

  41. Constraint Networks for Integrating Information • Components: • Representation of the variables • Representation of constraints • Hierarchical template representation • Constraint propagation

  42. Constraint Propagation • Approach • When a variable is assigned a value, re-compute the value sets and assigned values of all dependent variables • Proceeds recursively until no values are changed or a cycle is detected • Core network • Propagates all variables through the core network • Remaining variables are computing when a template is opened • Does not perform full CSP • Less costly • Does not require all information in advance • Makes choices locally, so may fail to find optimal assignment

  43. Discussion • General framework for interleaving planning and information gathering • Retrieves information as needed • Gathers and integrates data in a uniform framework • Evaluates tradeoffs and selects among alternatives • Allows the users to explore alternatives • Supports a wide variety of information types: databases, web pages, images, video, etc.

  44. SmartClients [Torrens et al, 2002] • Cast an integration problem as a Constraint Satisfaction Problem (CSP) • Given a request, the server retrieves the required data and sends the data and the CSP to the client • Client solves the CSP locally • Large complex problem transmitted in small amount of space • Provides fine-grained user interaction with the data

  45. Architecture for SmartClients

  46. SmartClients: Pros and Cons • Pros • Elegant approach that exploits past work on CSPs • Minimizes the data retrieval and supports complex reasoning and integration of the data • Cons • Assumes that all data can be retrieved before any reasoning about the data • In the travel planning, assumes that prices are the same on any date and there are no issues with flight availability

  47. Summary • Our approach for creating “web assistants”: • Agents for accessing web data • Record linkage for mapping between sources • Constraint-based integration provides the glue

More Related