Explore how wrappers extract information from various sources, including table and list displays, for effective comparison shopping.
A Semi-Universal E-Commerce Agent
International Conference on Enterprise Information Systems 2002, Ciudad Real, Spain
Aleksander Pivk
Department of Intelligent Systems, Jozef Stefan Institute, Ljubljana, Slovenia
3 April 2002
What is an (intelligent) agent?
• An intelligent agent is a computer system capable of flexible, autonomous action in some environment.
• Examples:
  • Environment: internet agent, OS agent, desktop agent, WWW agent, etc.
  • Task: information agent, shopping agent, interface agent, email agent, notification agent, etc.
ICEIS 2002
Information agents
• Task:
  • access and integrate information from a variety of data sources
• Types:
  • Information Retrieval Agents
    • search engines
  • Information Filtering Agents
    • mail agents, news-delivery agents
  • Information Extraction Agents
    • wrappers
  • Information Integration Agents
    • meta-search engines, comparison-shopping agents
Information Extraction
• IE is the task of identifying the specific fragments of a single document that constitute its core semantic content.
• Examples:
  a) from a weather report, identify locations, dates, and temperatures (high and low);
  b) from online stores, get product names, their images, and prices.
• Example of extracted fields:
  NAME    Casablanca Restaurant
  STREET  220 Lincoln Boulevard
  CITY    Venice
  PHONE   (310) 392-5751
Wrappers
• A wrapper is …
  • a procedure or a rule that explains how to extract information from an information source
  • tailored to a particular document collection
  • appropriate for semi-structured information sources
• Why use wrappers?
  • heterogeneous information sources
  • different styles of user interface and different output display formats
Wrapper Learning
• Why learning?
  • ad hoc formatting conventions used at one site are rarely relevant elsewhere
  • sites often change their formatting
  • scalability is the major challenge to IE
• Automatic wrapper construction
  • a site’s wrapper is constructed from a set of example pages
  • wrapper induction
Implemented Systems
• EMA – Employment Agent
  • memory-based approach
  • hand-coded wrappers
  • depends on the profession ontology (domain knowledge)
• ShinA – Customized Comparison Shopping Agent
  • simple heuristic-based approach
  • little domain knowledge used
ShinA – Shopping Assistant
Our focus
• Wrapper learning in real time
  • to realize a customized comparison shopper
• Little use of domain knowledge
  • rather, use simple heuristics
  • exploit the characteristics of semi-structured documents
• Flexible and practical
  • handle both table-type and list-type displays
  • handle noisy product descriptions (missing attributes)
  • handle a single product description that spans multiple lines
Learning Query Scheme Templates
  <form site="amazon.com">
    <name>searchform</name>
    <method>post</method>
    <action>www.amazon.com/exec/obidos/search-handle-form</action>
    <input type="text" name="field-keywords" size="15" />
    <input type="image" name="Go" />
    <select name="index">
      <option value="all products" selected />
      <option value="books" />
      <option value="…" />
    </select>
  </form>
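A minimal sketch of how a learned query template like the one above might be turned into a search request. This is an illustration only, not the ShinA implementation: the function name `build_query` is hypothetical, the bare `selected` attribute is written as `selected="selected"` so that a standard XML parser accepts it, and the elided `…` option is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Well-formed variant of the template (assumption: selected="selected").
TEMPLATE = """
<form site="amazon.com">
  <name>searchform</name>
  <method>post</method>
  <action>www.amazon.com/exec/obidos/search-handle-form</action>
  <input type="text" name="field-keywords" size="15" />
  <input type="image" name="Go" />
  <select name="index">
    <option value="all products" selected="selected" />
    <option value="books" />
  </select>
</form>
"""

def build_query(template_xml, keywords):
    """Fill the template's text field with the keywords (hypothetical helper)."""
    form = ET.fromstring(template_xml.strip())
    params = {}
    # Every text input receives the user's query string.
    for field in form.findall("input"):
        if field.get("type") == "text":
            params[field.get("name")] = keywords
    # Every select contributes its pre-selected option (or the first one).
    for select in form.findall("select"):
        options = select.findall("option")
        if not options:
            continue
        chosen = next((o for o in options if o.get("selected") is not None),
                      options[0])
        params[select.get("name")] = chosen.get("value")
    return {
        "method": form.findtext("method").upper(),
        "action": form.findtext("action"),
        "body": urlencode(params),
    }

query = build_query(TEMPLATE, "intelligent agent")
print(query["method"], query["action"])
print(query["body"])  # field-keywords=intelligent+agent&index=all+products
```

The sketch only assembles the request data; actually submitting it to a store is left out on purpose.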
Learning product descriptions
• Table-type display of 5 different PDUs (PDU: Product Description Unit)
• Task
  • recognize each PDU
  • recognize attributes within a PDU
  • learn rules to extract attributes
PDU Pattern Learning: Algorithm
• First phase
  • remove irrelevant parts of the HTML source (header, advertisements, footer)
  • break the remaining HTML source into logical lines
• Second phase
  • categorize each logical line
  • 9 different categories (PRICE, TITLE, IMAGE, URL_LINK, TTAG, LBTAG, etc.)
• Third phase
  • find the most frequent pattern(s) for PDU(s) in the sequence of logical-line categories
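The third phase can be sketched roughly as follows. The slides do not say how the most frequent pattern is selected, so everything here is an assumption: the category sequence is encoded as a string of one-character codes, candidates are substrings that contain both a price code ("0") and a title code ("1"), and the winner is the candidate with the most non-overlapping occurrences, with ties going to the longest one. The function name and its parameters are hypothetical.

```python
def most_frequent_pattern(categories, min_len=2, max_len=12, required="01"):
    """Return (pattern, count) for the most frequent candidate, or None.

    categories: string of one-character category codes, one per logical line.
    required:   codes every candidate must contain (assumed: price and title).
    """
    candidates = set()
    for start in range(len(categories)):
        for length in range(min_len, max_len + 1):
            piece = categories[start:start + length]
            # Skip truncated slices at the end and pieces missing a required code.
            if len(piece) == length and all(code in piece for code in required):
                candidates.add(piece)
    if not candidates:
        return None
    # str.count counts non-overlapping occurrences; prefer higher count,
    # then longer pattern (the pattern string itself breaks remaining ties).
    count, _, best = max((categories.count(p), len(p), p) for p in candidates)
    return best, count

# Three identical PDUs surrounded by a little noise.
sequence = "9" + "244531950" * 3 + "45"
print(most_frequent_pattern(sequence))  # ('244531950', 3)
```

Requiring a price and a title in every candidate keeps trivially short, very frequent substrings (such as runs of table tags) from winning.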
PDU Pattern Learning: Example
A fragment of the HTML source of the search result for the query “intelligent agent” submitted to the Amazon bookstore. The category of each logical line is shown after “--”.
  <img src="http://g-images.amazon.com/images/G/01/v9/130668.jpg" width="80" height="80" vspace="2" alt="">   --2
  </td>   --4
  <td>   --4
  <p>   --5
  <a href="http://www.amazon.com/book.asp?id=010101&book=130668">   --3
  Intelligent Internet Agents: Agent-Based Information Discovery on the Internet   --1
  </a>   --9
  <br>   --5
  $59.95   --0
Categories: { 0: price; 1: title; 2: image; 3: link; 4: table tag; 5: line tag; 9: other tag }
Extracted PDU pattern: 244531950
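The categorization in this example can be reproduced with a small sketch. The category codes come from the legend above; the membership of the table-tag and line-tag classes, the regular expressions, and the function name `categorize` are assumptions made for illustration, and text that matches no rule is marked "?" because the slides do not list the remaining categories.

```python
import re

TABLE_TAGS = {"table", "tr", "td", "th"}  # assumed members of the "table tag" class
LINE_TAGS = {"p", "br"}                   # assumed members of the "line tag" class
TAG_RE = re.compile(r"^<\s*/?\s*([A-Za-z0-9]+)")
PRICE_RE = re.compile(r"[$€]\s*\d")       # deliberately minimal price check

def categorize(line, query_words):
    """Map one logical line to a one-character category code."""
    text = line.strip()
    tag_match = TAG_RE.match(text)
    if tag_match:
        tag = tag_match.group(1).lower()
        if tag == "img":
            return "2"                    # image
        if tag == "a" and not text.startswith("</"):
            return "3"                    # link (opening anchor only)
        if tag in TABLE_TAGS:
            return "4"                    # table tag
        if tag in LINE_TAGS:
            return "5"                    # line tag
        return "9"                        # other tag
    if PRICE_RE.search(text):
        return "0"                        # price
    if any(word.lower() in text.lower() for word in query_words):
        return "1"                        # title
    return "?"                            # unclassified text (not in the slides)

FRAGMENT = [
    '<img src="http://g-images.amazon.com/images/G/01/v9/130668.jpg" '
    'width="80" height="80" vspace="2" alt="">',
    "</td>", "<td>", "<p>",
    '<a href="http://www.amazon.com/book.asp?id=010101&book=130668">',
    "Intelligent Internet Agents: Agent-Based Information Discovery on the Internet",
    "</a>", "<br>", "$59.95",
]

pattern = "".join(categorize(line, ["intelligent", "agent"]) for line in FRAGMENT)
print(pattern)  # 244531950
```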
Simple Heuristics
• Recognizing a title
  • contains at least one query word
  • text line that corresponds to the title in the pre-determined pattern
• Recognizing a price
  • contains a currency symbol ($, €)
  • contains a currency token (EUR, SIT)
  • contains digit(s) with relevant delimiters (‘,’ and ‘.’)
• Recognizing an image
  • unique image URL within the pattern
• Able to recognize attributes with heuristic rules
  • examples: ISBN numbers, dates, discount rates
• Unable to recognize other attributes
  • authors, review comments, recommendation status
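A rough sketch of the title and price heuristics listed above. The regular expression is one possible reading of "currency symbol or token plus digits with delimiters", not the rule set used in ShinA, and the function names are hypothetical.

```python
import re

CURRENCY_SYMBOLS = "$€"
CURRENCY_TOKENS = ("EUR", "SIT")

AMOUNT = r"\d+(?:[.,]\d+)*"               # digits with ',' or '.' delimiters
TOKENS = "|".join(CURRENCY_TOKENS)
PRICE_RE = re.compile(
    rf"[{re.escape(CURRENCY_SYMBOLS)}]\s*{AMOUNT}"   # symbol before the amount
    rf"|{AMOUNT}\s*(?:{TOKENS})\b"                   # token after the amount
    rf"|\b(?:{TOKENS})\s*{AMOUNT}"                   # token before the amount
)

def looks_like_price(text):
    """True if the text contains an amount next to a currency symbol or token."""
    return PRICE_RE.search(text) is not None

def looks_like_title(text, query_words):
    """True if the text contains at least one query word (case-insensitive)."""
    lowered = text.lower()
    return any(word.lower() in lowered for word in query_words)

print(looks_like_price("$59.95"))          # True
print(looks_like_price("1.200,00 SIT"))    # True
print(looks_like_price("Published 2001"))  # False
print(looks_like_title("Intelligent Internet Agents", ["agent"]))  # True
```

A bare number such as a publication year is rejected because neither a currency symbol nor a currency token accompanies it.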
Conclusion
• Limitations
  • a query search box must exist
  • price information must exist
  • extracts only a few attributes (title, price, image, link)
• Future work
  • more use of domain knowledge (ontologies)
  • extract other non-price attributes
  • use of XML-based wrappers
  • applications to other domains