Gain insights into web information retrieval, data extraction techniques, query answering, integration methods, and classification technologies in this comprehensive course.
Web Information Retrieval and Extraction • Chia-Hui Chang, Associate Professor, National Central University, Taiwan • chia@csie.ncu.edu.tw
Course Content • Web Information Retrieval • Browsing via categories • Searching via search engines • Query answering • Web Information Integration • Web page collection • Data extraction from semi-structured Web pages • Data integration
Web Categories • Yahoo http://www.yahoo.com • Fourteen categories and ninety subcategories • Categorization by humans • Technology • Document classification • Pros and Cons • Overview of the content in the database • Browsing without specific targets
Search Engines • Google http://www.google.com • Search by keyword matching • Business model • Technology • Web Crawling • Indexing for fast search • Ranking for good results • Pros and Cons • Search engines locate the documents not the answers
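The "ranking for good results" step can be illustrated with a small sketch. It assumes a plain TF-IDF score over whitespace-separated tokens, which is a textbook weighting and not the method of any particular search engine; the function name `tf_idf_rank` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_rank(docs, query):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                score += tf[term] * math.log(n_docs / df[term])
        scores[doc_id] = score
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy collection.
docs = {
    "a": "web search engines crawl the web",
    "b": "search engines rank documents",
    "c": "data integration",
}
ranked = tf_idf_rank(docs, "web search")
print([doc_id for doc_id, _ in ranked])
```

Document "a" comes first because it contains the rarer term "web" twice. Document "c" matches nothing and scores zero, which reflects the slide's point that the engine locates documents rather than answers.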
Question Answering • Ask Jeeves http://www.ask.com • Input a question or keywords • Relevance feedback from users to clarify the targets • ExtrAns (Molla et al., 2003) • Technology • Text information extraction • Natural Language Processing
Web Page Collection • Metacrawler http://www.metacrawler.com/ • Google · Yahoo · Ask Jeeves About · LookSmart · Overture · FindWhat • Ebay http://www.ebay.com/ • Information asymmetry between buyers and sellers • Technology • Program generators • WNDL, W4F, XWrap, Robomaker
Data Extraction from Semi-structured Documents • Example • Technology • Information Extraction Systems • WIEN, Softmealy, Stalker, IEPAD, DeLA, OLERA, Roadrunner, EXALG, XWrap, W4F, etc. • Data Annotation • Wrapper induction is an excellent exercise of machine learning technologies
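To make the idea of a wrapper concrete, here is a minimal sketch of a left-right delimiter wrapper, the kind of rule that a wrapper induction system such as WIEN learns from labeled pages. The delimiters below are written by hand rather than induced, and the function name `lr_extract` and the sample page are illustrative assumptions.

```python
def lr_extract(page, delimiters):
    """Extract records using one (left, right) delimiter pair per attribute.

    The page is scanned left to right; a record is complete once every
    attribute has been found, and scanning stops when a delimiter is missing.
    """
    if not delimiters:
        return []
    records = []
    pos = 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))

# Hypothetical semi-structured page: country names in <b>, codes in <i>.
page = "<html><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br></html>"
rows = lr_extract(page, [("<b>", "</b>"), ("<i>", "</i>")])
print(rows)  # [('Congo', '242'), ('Egypt', '20')]
```

The machine learning task in wrapper induction is to find such delimiter pairs automatically from example pages, which this sketch does not attempt.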
Data Integration • Technology • Template based interface design • Microsoft • Visual Programming tools
Available Techniques • Artificial Intelligence • Search and Logic programming • Machine Learning • Supervised learning (classification) • Unsupervised learning (clustering) • Database and Warehousing • OLAP and Iceberg queries • Data Mining • Pattern mining from large data sets • Other Disciplines • Statistics, neural network, genetic algorithms, etc.
Classical Tasks • Classification • Artificial Intelligence, Machine Learning • Clustering • Pattern recognition, neural network • Pattern Mining • Association rules, sequential patterns, episodes mining, periodic patterns, frequent continuities, etc.
Classification Methods • Supervised Learning (Concept Learning) • General-to-specific ordering • Decision tree learning • Bayesian learning • Instance-based learning • Sequential covering algorithms • Artificial neural networks • Genetic algorithms • Reference: Mitchell, 1997
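As one example from the list above, instance-based learning can be sketched with a k-nearest-neighbor classifier. This is a minimal illustration assuming numeric feature vectors and squared Euclidean distance; the function name `knn_classify`, the labels, and the training data are hypothetical.

```python
from collections import Counter

def knn_classify(training, query, k=3):
    """Label a query point by majority vote among its k nearest training examples."""
    def squared_distance(example):
        features, _label = example
        return sum((a - b) ** 2 for a, b in zip(features, query))

    nearest = sorted(training, key=squared_distance)[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled examples: (feature vector, class label).
training = [
    ((1, 1), "spam"), ((1, 2), "spam"), ((2, 1), "spam"),
    ((8, 8), "ham"), ((9, 8), "ham"), ((8, 9), "ham"),
]
print(knn_classify(training, (2, 2)))  # spam
print(knn_classify(training, (7, 9)))  # ham
```

No model is built in advance: the training examples are stored as they are and all the work happens at query time, which is what distinguishes instance-based methods from the other entries on the slide.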
Clustering Algorithms • Unsupervised learning (comparative analysis) • Partition Methods • Hierarchical Methods • Model-based Clustering Methods • Density-based Methods • Grid-based Methods • Reference: Han and Kamber (Chapter 8)
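A partition method can be sketched with k-means (Lloyd's algorithm). The version below assumes 2-D points and takes the initial centroids from the caller so that the run is deterministic; the function name `kmeans` and the sample points are illustrative.

```python
def kmeans(points, centroids, max_iter=100):
    """Cluster 2-D points with k-means, starting from caller-supplied centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                n = len(cluster)
                new_centroids.append((sum(p[0] for p in cluster) / n,
                                      sum(p[1] for p in cluster) / n))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data with two visually separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
print(clusters)
```

The result depends on the initial centroids, so practical implementations usually try several starting configurations and keep the best one.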
Pattern Mining • Various kinds of patterns • Association Rules • Closed itemsets, maximal itemsets, non-redundant rules, etc. • Sequential patterns • Episodes mining • Periodic patterns • Frequent continuities
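The two measures behind association rules, support and confidence, can be sketched as follows. The itemset search here is a brute-force enumeration for illustration only, not Apriori or any other optimized miner; the function names and the market-basket transactions are hypothetical.

```python
from itertools import combinations

def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support, max_size=3):
    """Enumerate all itemsets up to max_size whose support reaches min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(transactions, candidate)
            if s >= min_support:
                result[candidate] = s
    return result

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
print(sorted(frequent))
print(confidence(transactions, {"butter"}, {"bread"}))  # 1.0
```

Every transaction containing butter also contains bread, so the rule butter -> bread has confidence 1.0, while the pair {butter, milk} appears in only one of four transactions and is not frequent at the 0.5 threshold.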
Applications • Relational Data • E.g. Northern Group Retail (Business Intelligence) • Banking, Insurance, Health, others • Web Information Retrieval and Extraction • Bioinformatics • Multimedia Mining • Spatial Data Mining • Time-series Data Mining
Techniques from Information Retrieval (IR) • Text Operations • Lexical analysis of the text • Elimination of stop words • Index term selection • Indexing and Searching • Inverted files • Suffix trees and suffix arrays • Signature files • Ranking Models • Query Operations • Relevance feedback • Query expansion
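The text operations and the inverted file listed above fit together as a short pipeline, sketched here under simple assumptions: tokens are lowercase alphanumeric runs, the stop-word list is a tiny hypothetical one, and queries are conjunctive (AND). All function names are illustrative.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is"}

def tokenize(text):
    """Lexical analysis: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_terms(text):
    """Eliminate stop words; the remaining tokens serve as index terms."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each index term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search(index, query):
    """Return ids of documents that contain every index term of the query."""
    terms = index_terms(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Hypothetical document collection.
docs = {
    1: "The retrieval of information in the Web",
    2: "Data extraction and data integration",
    3: "Information extraction from Web pages",
}
index = build_inverted_index(docs)
print(search(index, "information web"))  # [1, 3]
print(search(index, "extraction"))       # [2, 3]
```

The index is built once and each query only touches the posting lists of its own terms, which is why inverted files support fast keyword search over large collections.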
Course Schedule • Techniques from Information Retrieval • Text Operations • Indexing and Searching • Ranking Models • Query Operations • Text Information Extraction for Query answering • AutoSlog, SRV, Rapier, etc. • Data extraction from semi-structured Web pages • WIEN, Softmealy, Stalker, IEPAD, DeLA, Roadrunner, EXALG, OLERA, etc. • Web page collection • XWrap, W4F, Robomaker, etc.
Grading • Two projects (by groups): 50% • Chosen from the topics covered in the course • Presentation and reports • Paper reading (by yourself): 20% • Presentation • Information Integration Projects: 30% • Chosen freely • Presentation and reports
References • Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley. • Han, J. and Kamber, M. 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. • Mitchell, T. M. 1997. Machine Learning. McGraw-Hill. • Molla, D., Schwitter, R., Rinaldi, F., Dowdall, J. and Hess, M. 2003. ExtrAns: Extracting Answers from Technical Texts. IEEE Intelligent Systems, July/August 2003, 12-17.