Web Information Retrieval and Extraction

An overview of web information integration, retrieval, and mining, together with techniques from traditional IR systems, web data extraction, and web usage analysis. Topics include search engines, classification methods, pattern mining, and applications in various fields.

Presentation Transcript


  1. Web Information Retrieval and Extraction • Chia-Hui Chang, Associate Professor • National Central University, Taiwan • chia@csie.ncu.edu.tw • Sep. 16, 2005

  2. Course Content • Web Information Integration • Web Information Retrieval • Traditional IR systems • Web Mining

  3. Topic I: Web Information Integration • Search Interface Integration • Web page collection • Web data extraction • Search result integration • Web Service

  4. Web Page Collection • Metacrawler http://www.metacrawler.com/ • Google · Yahoo · Ask Jeeves · About · LookSmart · Overture · FindWhat • eBay http://www.ebay.com/ • Information asymmetry between buyers and sellers • Technology • Program generators • WNDL, W4F, XWrap, Robomaker

  5. Web Data Extraction • Example • Technology • Information Extraction Systems • WIEN, Softmealy, Stalker, IEPAD, DeLA, OLERA, Roadrunner, EXALG, XWrap, W4F, etc. • Data Annotation • Wrapper induction is an excellent exercise of machine learning technologies
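
To make the idea of a wrapper concrete, here is a minimal Python sketch of a hand-written extraction rule based on left and right delimiters. It is not the algorithm of any system listed on the slide (those systems learn such rules automatically); the sample page, the markup, and the field names `title` and `year` are assumptions made up for illustration.

```python
import re

# Hypothetical result page; markup and field names are invented for this sketch.
PAGE = """
<ul>
<li><b>Modern Information Retrieval</b> <i>1999</i></li>
<li><b>Machine Learning</b> <i>1997</i></li>
</ul>
"""

# A hand-written extraction rule: each field is located by the
# delimiters that surround it in the page template.
RECORD = re.compile(r"<li><b>(?P<title>.*?)</b>\s*<i>(?P<year>\d{4})</i></li>")


def extract(page):
    """Return one dict per record matched by the delimiter-based rule."""
    return [m.groupdict() for m in RECORD.finditer(page)]


records = extract(PAGE)
for record in records:
    print(record["title"], record["year"])
```

A wrapper induction system would aim to infer a rule like `RECORD` from example pages rather than have a person write it.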

  6. Topic II: Web Information Retrieval • From User Perspective • Browsing via categories • Searching via search engines • Query answering • From System Perspective • Web crawling • Indexing and querying • Link-based ranking • Query answering • Semantic Web, XML retrieval, etc.
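
The slide does not name a specific link-based ranking algorithm. As one well-known example, the sketch below runs a PageRank-style power iteration in Python; the toy graph, the damping factor of 0.85, and the iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to its outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same base share from random jumps.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical three-page web graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph, page `c` is linked from both `a` and `b`, so it ends up ranked above `b`, which is linked only from `a`.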

  7. Web Categories • Yahoo http://www.yahoo.com • Fourteen categories and ninety subcategories • Categorization by humans • Technology • Document classification • Pros and Cons • Overview of the content in the database • Browsing without specific targets

  8. Search Engines • Google http://www.google.com • Search by keyword matching • Business model • Technology • Web Crawling • Indexing for fast search • Ranking for good results • Pros and Cons • Search engines locate the documents not the answers

  9. Question Answering • Ask Jeeves http://www.ask.com • Input a question or keywords • Relevance feedback from users to clarify the targets • ExtrAns (Molla et al., 2003) • Technology • Text information extraction • Natural Language Processing

  10. Topic III: Techniques from Traditional IR • Text Operations • Lexical analysis of the text • Elimination of stop words • Index term selection • Indexing and Searching • Inverted files • Suffix trees and suffix arrays • Signature files • IR Model and Ranking Technique • Query Operations • Relevance feedback • Query expansion
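
The text operations and the inverted file listed above can be sketched together in a few lines of Python. This is a minimal illustration, assuming a tiny hand-picked stop-word list, a simple alphanumeric tokenizer, and Boolean AND queries; the three sample documents are invented.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "to"}


def tokenize(text):
    """Lexical analysis: lowercase, split into terms, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_index(docs):
    """Inverted file: map each index term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Boolean AND query: return documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Indexing and searching of the Web",
    2: "Web crawling for search engines",
    3: "Suffix trees and suffix arrays",
}
index = build_index(docs)
print(search(index, "the web"))       # stop word "the" is ignored
print(search(index, "web crawling"))
```

Ranking models and query operations such as relevance feedback would be layered on top of a structure like `index`; they are left out here to keep the sketch short.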

  11. Topic IV: Web Mining • Usage Analysis • Focused Crawling • Clustering of Web search results • Text classification

  12. Available Techniques • Artificial Intelligence • Search and Logic programming • Machine Learning • Supervised learning (classification) • Unsupervised learning (clustering) • Database and Warehousing • OLAP and Iceberg queries • Data Mining • Pattern mining from large data sets • Other Disciplines • Statistics, neural networks, genetic algorithms, etc.

  13. Classical Tasks • Classification • Artificial Intelligence, Machine Learning • Clustering • Pattern recognition, neural network • Pattern Mining • Association rules, sequential patterns, episodes mining, periodic patterns, frequent continuities, etc.

  14. Classification Methods • Supervised Learning (Concept Learning) • General-to-specific ordering • Decision tree learning • Bayesian learning • Instance-based learning • Sequential covering algorithms • Artificial neural networks • Genetic algorithms • Reference: Mitchell, 1997
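
As a small example of one method from this list, instance-based learning, here is a k-nearest-neighbour sketch in Python. The training points, the labels `"A"` and `"B"`, and the choice of `k=3` with squared Euclidean distance are assumptions for illustration only.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples."""
    def squared_distance(example):
        features, _label = example
        return sum((a - b) ** 2 for a, b in zip(features, query))

    neighbours = sorted(train, key=squared_distance)[:k]
    votes = Counter(label for _features, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labelled examples: (feature vector, class label).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((8.0, 8.0), "B"), ((9.0, 9.0), "B"), ((8.0, 9.0), "B"),
]
print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (8.5, 8.5)))
```

No model is built in advance: the training examples themselves are stored and consulted at query time, which is what distinguishes instance-based learning from the other methods listed.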

  15. Clustering Algorithms • Unsupervised learning (comparative analysis) • Partition Methods • Hierarchical Methods • Model-based Clustering Methods • Density-based Methods • Grid-based Methods • Reference: Han and Kamber (Chapter 8)
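
For the partition methods, k-means is a common representative. The Python sketch below is a bare-bones version, assuming fixed starting centroids (so the run is deterministic), squared Euclidean distance, and a fixed number of iterations; the sample points are invented.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

In practice the starting centroids are usually chosen at random and the loop stops once assignments no longer change; both are simplified away here.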

  16. Pattern Mining • Various kinds of patterns • Association Rules • Closed itemsets, maximal itemsets, non-redundant rules, etc. • Sequential patterns • Episodes mining • Periodic patterns • Frequent continuities
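
Association rules are usually evaluated by support and confidence. The following Python sketch computes both on a toy set of transactions and lists the frequent item pairs by brute force; the transactions and the 0.5 support threshold are made-up assumptions, and a real miner such as Apriori would prune candidates instead of enumerating every pair.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_pairs(transactions, min_support):
    """Brute-force enumeration of item pairs meeting the support threshold."""
    items = sorted(set().union(*transactions))
    return [
        pair for pair in combinations(items, 2)
        if support(pair, transactions) >= min_support
    ]


print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
print(frequent_pairs(transactions, 0.5))
```

Here the rule bread -> milk has support 2/4 and confidence 2/3, since bread appears in three transactions and two of them also contain milk.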

  17. Applications • Relational Data • E.g. Northern Group Retail (Business Intelligence) • Banking, Insurance, Health, others • Web Information Retrieval and Extraction • Bioinformatics • Multimedia Mining • Spatial Data Mining • Time-series Data Mining

  18. Course Schedule • Web Data Extraction (3 weeks) • Web Interface Integration (1 week) • Web Page Collection (1 week) • Techniques from Traditional IR (2 weeks) • Query Answering (1 week) • Link-Based Analysis (1 week) • Focused Crawling (1 week) • Web Usage Mining (1 week) • Clustering Search Results (1 week) • Text Classification (1 week)

  19. Grading • Project I: 30% • Implementation of the chosen paper (W10) • Project II: 30% • Topic can be chosen freely (W16) • Paper reading: 20% • Presentation • Homework: 10% • Involvement in the Class: 10%

  20. References • Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison Wesley. • Han, J. and Kamber, M. 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. • Mitchell, T. M. 1997. Machine Learning. McGraw-Hill. • Molla, D., Schwitter, R., Rinaldi, F., Dowdall, J. and Hess, M. 2003. ExtrAns: Extracting Answers from Technical Texts. IEEE Intelligent Systems, July/August 2003, 12-17.
