An overview of web information integration, retrieval, and mining, covering traditional IR systems, data extraction, and web usage analysis, along with search engines, classification models, pattern mining, and applications in various fields.
Web Information Retrieval and Extraction Chia-Hui Chang, Associate Professor National Central University, Taiwan chia@csie.ncu.edu.tw Sep. 16, 2005
Course Content • Web Information Integration • Web Information Retrieval • Traditional IR systems • Web Mining
Topic I: Web Information Integration • Search Interface Integration • Web page collection • Web data extraction • Search result integration • Web Service
Web Page Collection • Metacrawler http://www.metacrawler.com/ • Google · Yahoo · Ask Jeeves About · LookSmart · Overture · FindWhat • eBay http://www.ebay.com/ • Information asymmetry between buyers and sellers • Technology • Program generators • WNDL, W4F, XWrap, Robomaker
Web Data Extraction • Example • Technology • Information Extraction Systems • WIEN, Softmealy, Stalker, IEPAD, DeLA, OLERA, Roadrunner, EXALG, XWrap, W4F, etc. • Data Annotation • Wrapper induction is an excellent exercise of machine learning technologies
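To make the idea of a wrapper concrete, here is a minimal sketch of a delimiter-based extractor in the spirit of the left/right (LR) wrappers studied in the wrapper induction literature. The function name `lr_extract`, the sample page, and the hand-written delimiters are all assumptions for illustration; an induction system would learn the delimiters from labeled example pages rather than take them as input.

```python
def lr_extract(page, delimiters):
    """Extract records from `page` using one (left, right) delimiter pair per attribute.

    Hypothetical helper: scans left to right, and for each attribute takes the
    text between the next occurrence of `left` and the following `right`.
    """
    records = []
    if not delimiters:
        return records
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records  # no further records; drop any partial one
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))


# Illustrative page: each record has a name in <b>…</b> and a code in <i>…</i>.
page = "<li><b>Congo</b> <i>242</i></li><li><b>Egypt</b> <i>20</i></li>"
records = lr_extract(page, [("<b>", "</b>"), ("<i>", "</i>")])
```

The learning problem is then to find delimiter pairs that are consistent with every labeled example, which is what makes wrapper induction a natural machine learning exercise.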
Topic II: Web Information Retrieval • From User Perspective • Browsing via categories • Searching via search engines • Query answering • From System Perspective • Web crawling • Indexing and querying • Link-based ranking • Query answering • Semantic Web, XML retrieval, etc.
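As an illustration of link-based ranking, the sketch below runs a PageRank-style power iteration over a small link graph. The slides do not name a specific algorithm, so PageRank is used here only as one well-known example; the function name, the damping factor of 0.85, and the three-page graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """PageRank-style scores for a graph given as {page: [pages it links to]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives a baseline share from the random-jump term.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: a links to b and c, b links to c, c links back to a.
rank = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this toy graph, page c collects links from both a and b and ends up with the highest score, while b, which is linked only from a, ends up with the lowest.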
Web Categories • Yahoo http://www.yahoo.com • Fourteen categories and ninety subcategories • Categorization by humans • Technology • Document classification • Pros and Cons • Overview of the content in the database • Browsing without specific targets
Search Engines • Google http://www.google.com • Search by keyword matching • Business model • Technology • Web Crawling • Indexing for fast search • Ranking for good results • Pros and Cons • Search engines locate the documents not the answers
Question Answering • Ask Jeeves http://www.ask.com • Input a question or keywords • Relevance feedback from users to clarify the targets • ExtrAns (Molla et al., 2003) • Technology • Text information extraction • Natural Language Processing
Topic III: Techniques from Traditional IR • Text Operations • Lexical analysis of the text • Elimination of stop words • Index term selection • Indexing and Searching • Inverted files • Suffix trees and suffix arrays • Signature files • IR Model and Ranking Technique • Query Operations • Relevance feedback • Query expansion
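The text operations and the inverted file listed above can be sketched together in a few lines. This is a minimal illustration, not a production indexer: the stop-word list is deliberately tiny, the tokenizer is a simple regular expression, and the names `tokenize`, `build_inverted_index`, and `search_and` are assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "from"}


def tokenize(text):
    """Lexical analysis: lowercase, split into alphanumeric tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Return ids of documents that contain every (non-stop-word) query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "The retrieval of web information",
    2: "Web mining and usage analysis",
    3: "Information extraction from the web",
}
index = build_inverted_index(docs)
hits = search_and(index, "web information")
```

Looking up a term costs one dictionary access regardless of collection size, which is why inverted files are a common choice for keyword search.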
Topic IV: Web Mining • Usage Analysis • Focused Crawling • Clustering of Web search results • Text classification
Available Techniques • Artificial Intelligence • Search and Logic programming • Machine Learning • Supervised learning (classification) • Unsupervised learning (clustering) • Database and Warehousing • OLAP and Iceberg queries • Data Mining • Pattern mining from large data sets • Other Disciplines • Statistics, neural network, genetic algorithms, etc.
Classical Tasks • Classification • Artificial Intelligence, Machine Learning • Clustering • Pattern recognition, neural network • Pattern Mining • Association rules, sequential patterns, episodes mining, periodic patterns, frequent continuities, etc.
Classification Methods • Supervised Learning (Concept Learning) • General-to-specific ordering • Decision tree learning • Bayesian learning • Instance-based learning • Sequential covering algorithms • Artificial neural networks • Genetic algorithms • Reference: Mitchell, 1997
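Of the methods listed, instance-based learning is the easiest to show in a few lines. The sketch below is a k-nearest-neighbor classifier with Euclidean distance and majority voting; the function name, the choice of k = 3, and the toy training set are assumptions for illustration.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training examples.

    `train` is a list of (feature_tuple, label) pairs.
    """
    neighbors = sorted(train, key=lambda example: math.dist(example[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training set with two well-separated classes.
train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((9, 8), "b"), ((8, 9), "b"),
]
label_near_a = knn_classify(train, (2, 2))
label_near_b = knn_classify(train, (8.5, 8.5))
```

No model is built in advance: all the work happens at query time, which is the defining trait of instance-based methods.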
Clustering Algorithms • Unsupervised learning (comparative analysis) • Partition Methods • Hierarchical Methods • Model-based Clustering Methods • Density-based Methods • Grid-based Methods • Reference: Han and Kamber (Chapter 8)
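As one example of a partition method, here is a minimal k-means sketch. To keep the run deterministic it seeds the centroids with the first k points, which is an assumption made for illustration; practical implementations usually choose initial centroids randomly or with a smarter seeding scheme.

```python
def kmeans(points, k, iterations=20):
    """Partition `points` (tuples of floats) into k clusters by iterative refinement."""
    centroids = [points[i] for i in range(k)]  # deterministic seeding (assumption)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

On these four points the two seeds start in the same corner, yet the procedure still separates the two groups after a couple of iterations.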
Pattern Mining • Various kinds of patterns • Association Rules • Closed itemsets, maximal itemsets, non-redundant rules, etc. • Sequential patterns • Episodes mining • Periodic patterns • Frequent continuities
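For association rules, the two basic measures are support (the fraction of transactions containing an itemset) and confidence (the support of the whole rule divided by the support of its left-hand side). The sketch below enumerates itemsets by brute force rather than with a level-wise algorithm such as Apriori, which is acceptable only for tiny inputs; the function names, thresholds, and the shopping-basket data are assumptions.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force search for itemsets whose support is at least `min_support`."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            s = support(combo, transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


def association_rules(frequent, min_confidence):
    """Derive rules lhs -> rhs from frequent itemsets, filtered by confidence."""
    rules = []
    for itemset, s in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already stored in `frequent`.
                confidence = s / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.9)
```

Here the only rule that clears the 0.9 confidence threshold is butter -> bread, since every basket containing butter also contains bread.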
Applications • Relational Data • E.g. Northern Group Retail (Business Intelligence) • Banking, Insurance, Health, others • Web Information Retrieval and Extraction • Bioinformatics • Multimedia Mining • Spatial Data Mining • Time-series Data Mining
Course Schedule • Web Data Extraction (3 weeks) • Web Interface Integration (1 week) • Web Page Collection (1 week) • Techniques from Traditional IR (2 weeks) • Query Answering (1 week) • Link Based Analysis (1 week) • Focused Crawling (1 week) • Web Usage Mining (1 week) • Clustering Search Results (1 week) • Text Classification (1 week)
Grading • Project I: 30% • Implementation of the chosen paper (W10) • Project II: 30% • Topic can be chosen freely (W16) • Paper reading: 20% • Presentation • Homework: 10% • Involvement in the Class: 10%
References • Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval, Addison Wesley. • Han, J. and Kamber, M. 2001. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers. • Mitchell, T. M. 1997. Machine Learning, McGraw-Hill. • Molla, D., Schwitter, R., Rinaldi, F., Dowdall, J. and Hess, M. 2003. ExtrAns: Extracting Answers from Technical Texts. IEEE Intelligent Systems, July/August 2003, 12-17.