Information Extraction from the WWW using Machine Learning Techniques
Lee McCluskey, Dept of Informatics
email: lee@hud.ac.uk
Department of Informatics, University of Huddersfield
Motivation
General: The WWW is a virtually limitless mass of information aimed mainly at human consumption. It is desirable to make this information generally available for use by computer programs in order to provide higher levels of service to people. This supports the new area of “Semantic Technologies”, apparently the new “billion dollar” market.
NOW: Desk Top + Client-Server Technologies
COMING: Distributed Intelligent Services
Specific: This work is related to a Knowledge Transfer Partnership just starting with a local company called View Based Systems.
Overview of Talk
We will investigate Information Extraction: the process of extracting “meaningful” data from raw or semi-structured text.
We will investigate techniques from ‘similarity-based’ Machine Learning to learn/extract meaning from traditional web page content.
Also, Information Agents: programs that can retrieve information from web sites using database-like queries and can integrate information from several web sites to solve complex queries.
Information Extraction from the WWW - WHY?
Problem: You’re on eBay and you want a toilet cistern and a wash basin that have a combined width of under 90cm.
Solution: waste all Sunday afternoon going through 673 entries for “toilet” looking for widths, and cross-checking them with 923 entries for “wash basin”!
• Need a universally-recognised query language
• Need to avoid the problems of identity (!) with universally-accessible vocabularies
• Need to be able to reason with acquired knowledge
Information Extraction from the WWW - WHY?
Our (KTP) interest: extract data from the WWW related to a “theme” or subculture, e.g. bee-keeping, role playing games, Northern Soul music.
We want to populate and maintain a central database with this information.
Information Extraction from The Web
• Information extraction is the process of extracting “meaningful” data from raw or semi-structured text
• IE tasks form a spectrum from easier to harder:
  EASIER: “Feature Extraction” - extract a particular piece of data from a semi-structured or unstructured document and give it an XML markup, e.g. extract an address from an HTML web page.
  HARDER: “Natural Language Understanding” - take raw (English) text from a web page and turn it into some logic representing its meaning.
Information Extraction from The Web
[Diagram: WEB PAGES -> WRAPPERS -> STRUCTURED DATA]
Information Extraction
• The Web’s HTML content makes it difficult to retrieve and integrate data from multiple sources.
• An agent can use a wrapper to extract the information from a collection of similar-looking Web pages.
• The wrapper ~ grammar of the data in the web site + code to utilize the grammar
• This is similar to turning the HTML => XML + grammar (DTD)
Example of Automated Extraction
Source: HTML ======> wrapper ======> Destination: XML

Source (HTML):
  <h1> Residential Housing </h1>
  <ul>House For Sale
  <li> location: Hebden Bridge
  <li> agent-phone: 01422 843222
  <li> listed-price: £350,000
  <li> comments: Bijou residence on the edge of this popular little town...
  </ul>
  <hr>
  <ul> House For Sale ... </ul>
  ...

Destination (XML):
  <residential>
    <house>
      <location>
        <city> Hebden Bridge </city>
        <county> West Yorkshire </county>
        <country> UK </country>
      </location>
      <agent-phone> 01422 843222 </agent-phone>
      <listed-price> £350,000 </listed-price>
      <comments> Bijou residence on the edge of this popular little town... </comments>
    </house>
    ...
  </residential>

NB: XML + schema + recognised names
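To make the idea of a wrapper concrete, here is a minimal hand-written sketch in Python for the listing format on this slide. It is only an illustration under assumptions: the function name `wrap` and the regular expressions are hypothetical, it treats every `<ul>` block as one house, and it copies `location` as a flat field rather than splitting it into city/county/country (that step needs background knowledge the page does not contain).

```python
import re
import xml.etree.ElementTree as ET

# Sample page fragment, copied from the slide's HTML source.
HTML = """<h1> Residential Housing </h1>
<ul>House For Sale
<li> location: Hebden Bridge
<li> agent-phone: 01422 843222
<li> listed-price: £350,000
<li> comments: Bijou residence on the edge of this popular little town...
</ul>"""

# One "<li> name: value" item per match; the value runs up to the next tag.
ITEM = re.compile(r"<li>\s*([\w-]+):\s*([^<]*)")


def wrap(html: str) -> ET.Element:
    """Turn each <ul> listing into a <house> element with one child per field."""
    root = ET.Element("residential")
    for block in re.findall(r"<ul>(.*?)</ul>", html, flags=re.S):
        house = ET.SubElement(root, "house")
        for name, value in ITEM.findall(block):
            ET.SubElement(house, name).text = value.strip()
    return root


root = wrap(HTML)
# Collect the extracted fields of the first house as a plain dictionary.
fields = {child.tag: child.text for child in root[0]}
print(ET.tostring(root, encoding="unicode"))
```

The patterns are tied to this one page layout, which is exactly the weakness the following slides discuss: a different site, or a redesign of this one, needs a different wrapper.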
Information Extraction
How can we create wrappers to ‘extract meaningful data’ from the current Web?
?? Write a wrapper to extract data... BUT we would have to write a tool for every type of data / every type of web page, e.g. a C program to process every eBay page on toilets and output widths. No - this is far too specific!
?? Write a tool to learn wrappers by inducing the format of web pages and/or particular fields... this is more general and maintainable.
Using ‘Rule Induction’ to learn wrappers for HTML pages
• The user is given or acquires ‘typical examples’ of the web pages containing the content to be learned
• The user points out fields to be learned to the agent.
• The agent builds up a characterization of the formats from the examples and transforms this into a wrapper in the form of a set of rules
• The wrapper is used by the agent to recognize and extract data from similar web pages
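The four steps above can be sketched for the simplest kind of wrapper: a pair of left/right delimiter strings around one field. This is a toy illustration, not the algorithm used by any particular system; the names `learn_lr` and `apply_lr` are hypothetical, and the second training page and the unseen page are invented for the example (only the Hebden Bridge listing comes from the earlier slide).

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def learn_lr(examples):
    """Induce (left, right) delimiters from (page, marked_field) pairs.

    The left delimiter is the longest text that precedes the field in every
    example; the right delimiter is the longest text that follows it.
    """
    lefts, rights = [], []
    for page, field in examples:
        i = page.index(field)          # the user has pointed out this field
        lefts.append(page[:i])
        rights.append(page[i + len(field):])
    return common_suffix(lefts), os.path.commonprefix(rights)


def apply_lr(page, left, right):
    """Use the learned delimiters as a wrapper on a new page."""
    i = page.find(left)
    if i == -1:
        return None
    start = i + len(left)
    j = page.find(right, start)
    return None if j == -1 else page[start:j]


# Steps 1 and 2: typical pages, each with the listed-price field pointed out.
TRAINING = [
    ("<ul>House For Sale <li> location: Hebden Bridge <li> listed-price: "
     "£350,000 <li> comments: Bijou residence </ul>", "£350,000"),
    ("<ul>House For Sale <li> location: Todmorden <li> listed-price: "
     "£199,950 <li> comments: Stone terrace </ul>", "£199,950"),
]

# Step 3: build the characterization (here, two delimiter strings).
left, right = learn_lr(TRAINING)

# Step 4: extract from a similar page that was not in the training set.
UNSEEN = ("<ul>House For Sale <li> location: Mytholmroyd <li> listed-price: "
          "£275,000 <li> comments: Canal-side cottage </ul>")
price = apply_lr(UNSEEN, left, right)
```

With too few or too similar examples the learned delimiters can be over-specific (or empty), so a real learner needs more examples and a richer rule language than plain strings.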
Rule Induction is an area of Machine Learning
Machine Learning
  Symbolic Learning
    Similarity-Based Learning
      Learning from Examples
        Rule Induction
      Learning by Observation
    Explanation-Based Learning
  Sub-symbolic Learning
    Genetic Approaches
    Neural Networks
Rule Induction from Examples
Roughly, the algorithm is as follows:
Input: a (large) number of +ve instances (examples) of concept C, plus (possibly) a number of -ve instances of C
Output: a characterization H of the examples, forming the rule H => C
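One simple way to realise this input/output contract is a Find-S style generalisation over attribute-value examples: keep only the attribute tests that every +ve instance satisfies, then check that no -ve instance satisfies them. This is a swapped-in textbook technique used for illustration, not necessarily the method meant on the slide; the attributes (`label`, `first_char`, `has_digit`) and the example data are hypothetical, with the concept C taken to be "this field is a price".

```python
def covers(hypothesis, example):
    """True if the example satisfies every attribute test in the hypothesis."""
    return all(example.get(attr) == value for attr, value in hypothesis.items())


def induce(positives, negatives):
    """Return (H, consistent): the most specific conjunction covering all
    positives, and whether it excludes every negative instance."""
    hypothesis = dict(positives[0])
    for example in positives[1:]:
        # Drop any attribute test that this positive instance violates.
        hypothesis = {a: v for a, v in hypothesis.items() if example.get(a) == v}
    consistent = not any(covers(hypothesis, n) for n in negatives)
    return hypothesis, consistent


# +ve instances of C ("is a price field"), described by simple features.
positives = [
    {"label": "listed-price", "first_char": "£", "has_digit": True},
    {"label": "price", "first_char": "£", "has_digit": True},
]
# -ve instances of C.
negatives = [
    {"label": "agent-phone", "first_char": "0", "has_digit": True},
    {"label": "location", "first_char": "H", "has_digit": False},
]

H, consistent = induce(positives, negatives)
# The induced rule reads: first_char = "£" and has_digit = True  =>  C
```

The `label` test is dropped because the two +ve instances disagree on it, which is the generalisation step; the -ve instances are only used here to check that H has not become too general.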
Actual IE Example: University of Southern California’s Information Sciences Institute (ISI) “Information Agent”
SPECIFIC PROBLEM: travel planning using the Web as an information source. There are a huge number of travel sites, with different types of information:
- hotel and flight information
- airports that are closest to your destination
- directions to your hotel
- weather in the destination city
... etc.
Information Agents are capable of retrieving and integrating information from web sites to solve complex queries or tasks, e.g. “book my travel for my business trip next week”.
See the Heracles project (http://www.isi.edu/info-agents/)
Heracles’ Stalker inductive algorithm
• This generates wrappers: in this case, rules that identify the start and end of an item within a web page.
• It uses
  • EXAMPLES
  • A HIERARCHICAL MODEL (ONTOLOGY) OF WHAT TO EXPECT IN A WEB PAGE
Example of training examples
Stalker is given examples of the ‘items’ it has to learn the wrapper for, e.g. examples of the item (or concept) “area code” of a telephone number:
E1: 513 Pixco, <b>Venice</b>, Phone: 1-<b> 800 </b>-555-1515
E2: 90 Colfax, <b> Palms </b>, Phone: ( 818 ) 508-1570
E3: 523 1st St., <b> LA </b>, Phone: 1-<b> 888 </b>-578-2293
E4: 403 La Tijera, <b> Watts </b>, Phone: ( 310 ) 798-0008
Stalker learns wrappers that detect the begin/end patterns of fields so that they can be used to ‘mine’ data in unseen web pages.
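To show what begin/end patterns look like when applied, the sketch below runs a hand-written pair of rules over the four examples E1 to E4. It only applies rules; it does not learn them. The rules are written in the style of "skip to the first landmark that occurs", with alternatives tried in order, and the names `START_RULE`, `END_RULE`, `skip_to` and `extract_area_code` are hypothetical. Real Stalker rules work on tokens and token classes rather than raw substrings.

```python
# The four training examples from the slide, with the expected area code.
EXAMPLES = {
    "513 Pixco, <b>Venice</b>, Phone: 1-<b> 800 </b>-555-1515": "800",
    "90 Colfax, <b> Palms </b>, Phone: ( 818 ) 508-1570": "818",
    "523 1st St., <b> LA </b>, Phone: 1-<b> 888 </b>-578-2293": "888",
    "403 La Tijera, <b> Watts </b>, Phone: ( 310 ) 798-0008": "310",
}

# Disjunctive rules: try each landmark in order and use the first that occurs.
START_RULE = ["(", "-<b>"]   # the area code begins just after one of these
END_RULE = [")", "</b>"]     # the area code ends just before one of these


def skip_to(text, landmarks, start=0):
    """Return (begin, end) of the first listed landmark found at or after
    `start`, or None if no landmark occurs."""
    for landmark in landmarks:
        i = text.find(landmark, start)
        if i != -1:
            return i, i + len(landmark)
    return None


def extract_area_code(text):
    """Apply the start rule, then the end rule, and return the text between."""
    begin = skip_to(text, START_RULE)
    if begin is None:
        return None
    end = skip_to(text, END_RULE, begin[1])
    if end is None:
        return None
    return text[begin[1]:end[0]].strip()


results = {text: extract_area_code(text) for text in EXAMPLES}
```

Note that a single landmark is not enough here: "(" covers E2 and E4 but not E1 and E3, and plain "<b>" would stop at the city name, which is why the start rule needs a second, more specific alternative.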
Problems with Wrapper Induction
ISI report some success with their travel Information Agent and its IE process, BUT:
• Wrapper brittleness: a website's format may change, so maintenance is costly
• Background knowledge (token hierarchy) not strong
• Unsupervised wrapper induction would be better
Summary
• Information Extraction is the process of extracting “meaningful” data from raw or semi-structured text
• Wrappers are programs (rules) which are attached to web pages to extract data
• Machine Learning techniques can be used to create wrappers
• There are still many problems with these methods, especially in the learning and maintaining of wrappers
Extra Reading
• http://www.isi.edu/info-agents/
• M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery. “Learning to Extract Symbolic Knowledge from the World Wide Web”. AAAI-98. January 1998.
• Ion Muslea, Steven Minton, Craig A. Knoblock. “Hierarchical Wrapper Induction for Semi-structured Information Sources”. Kluwer, 1999.
• See the Kushmerick references; apparently he invented wrapper induction
Related Legal / Ethical / Professional / Methodological Issues
• Is it legal and/or ethical to automatically ‘harvest’ data from the WWW and re-use or sell it? In what cases is it illegal?
• How does one automate checking the veracity of WWW data?
• Will website owners conceal their data if the practice becomes widespread?
• Future: do we really want distributed web intelligence?