Web Mining

horace
Presentation Transcript


  1. Web Mining

  2. Why IR?

  3. Why IR?

  4. Research & Fun: http://duilian.msra.cn

  5. Overview of Search Engine

  6. Flow Chart of SE

  7. Text Processing (1) - Indexing
     • An index is a list of terms with relevant information
       • Frequency of terms
       • Location of terms
       • Etc.
     • Index terms represent document content and distinguish documents from one another
       • “economy” vs “computer” in a news article in the Financial Times
     • To build the index
       • Extraction of index terms
       • Computation of their weights

  8. Text Processing (2) - Extraction
     • Extraction of index terms
       • Word or phrase level
       • Morphological analysis (stemming in English)
         • “information”, “informed”, “informs”, “informative” → “inform”
       • Removal of stop words
         • “a”, “an”, “the”, “is”, “are”, “am”, …
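The extraction steps above can be sketched in a few lines of Python. This is a toy illustration only: the stop-word set and the suffix list are hypothetical and chosen to reproduce the slide's “inform” example; a real system would use an established stemmer (e.g. Porter's) and a fuller stop list.

```python
import re

# Hypothetical, deliberately tiny stop-word and suffix lists (not from the slides).
STOP_WORDS = {"a", "an", "the", "is", "are", "am"}
SUFFIXES = ("ation", "ative", "ed", "s")  # checked in this order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text: str) -> list[str]:
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


terms = extract_terms("The information is informative")
print(terms)
```

Stop words are removed before stemming so that short function words such as “is” are never passed to the suffix stripper.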

  9. Text Processing (3) - Term Weight
     • Calculation of term weights
       • Statistical weights using frequency information
       • Importance of a term in a document
     • E.g. TF*IDF
       • TF: total frequency of a term k in a document
       • IDF: inverse document frequency of a term k in a collection
       • DF: the number of documents in which the term appears
       • High TF, low DF: a good word to represent the text
       • High TF, high DF: a bad word
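As a minimal sketch of the TF*IDF idea, the function below weights each term by its raw count in a document times log(N / DF). The slide does not fix a formula, so the logarithmic IDF variant, the function name `tf_idf`, and the sample documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Return {doc_id: {term: tf * log(N / df)}} for tokenized documents."""
    n_docs = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))  # count each term once per document
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)  # raw term frequency in this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


# Hypothetical two-document collection.
docs = {
    "d1": ["economy", "market", "economy", "news"],
    "d2": ["computer", "news"],
}
weights = tf_idf(docs)
```

Here “news” appears in every document, so its IDF is log(1) = 0 and it gets no weight, while “economy” (high TF, low DF) gets the largest weight in `d1`.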

  10. An Example (figure: Document 1 and Document 2)

  11. Text Processing (4) - Storing indexing results (figure: Document 1 and Document 2 mapped to an index with “Word” and “Word Info.” columns; example words include “University” and “Arizona”)

  12. Text Processing (2) - Storing indexing result

  13. Text Processing (3) - Inverted File

  14. Matching & Ranking (2)
      • Ranking
        • Retrieval model
          • Boolean (exact) => Fuzzy Set (inexact)
          • Vector Space
          • Probabilistic
          • Inference Net ...
        • Weighting schemes
          • Index terms, query terms
          • Document characteristics

  15. Vector Space Model
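The slide body is not in the transcript. As a hedged illustration of the vector space model, documents and queries can be treated as sparse term-weight vectors and compared by cosine similarity; cosine is a common choice but an assumption here, as are the names and weights below.

```python
import math


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_u * norm_v)


# Hypothetical query and document vectors.
query = {"economy": 1.0}
doc_vectors = {
    "doc_a": {"economy": 2.0, "market": 1.0},
    "doc_b": {"computer": 3.0},
}
ranked = sorted(doc_vectors, key=lambda d: cosine(query, doc_vectors[d]), reverse=True)
```

Documents that share no terms with the query score 0, and the rest are ordered by the angle between their vector and the query vector rather than by an exact Boolean match.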

  16. Matching & Ranking (2)
      • Techniques for efficiency
        • New storage structures, esp. for new document types
        • Use of accumulators for efficient generation of ranked output
        • Compression/decompression of indexes
      • Techniques for Web search engines
        • Use of hyperlinks
          • Inlinks & outlinks (PageRank)
          • Authority vs hub pages (HITS)
        • In conjunction with directory services (e.g. Yahoo)
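One way to read “use of accumulators” is term-at-a-time scoring: each query term's postings list is scanned once, partial scores are added into a per-document accumulator, and the top k accumulators are returned. The sketch below assumes that reading; the postings layout, weights, and function name are hypothetical.

```python
import heapq
from collections import defaultdict


def rank_with_accumulators(query_terms, postings, k=10):
    """Sum per-term weights into one accumulator per document; return the top k."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in postings.get(term, []):
            accumulators[doc_id] += weight
    # nlargest avoids sorting every accumulator when only k results are needed.
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])


# Hypothetical postings: term -> list of (doc_id, precomputed term weight).
postings = {
    "web": [("d1", 0.5), ("d2", 1.5)],
    "mining": [("d1", 2.0), ("d3", 1.0)],
}
top = rank_with_accumulators(["web", "mining"], postings, k=2)
print(top)
```

Only documents that contain at least one query term ever receive an accumulator, which is what keeps ranked output cheap relative to scoring the whole collection.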

  17. PageRank Algorithm
      • Basic idea: more links to a page implies a better page
        • But not all links are created equal
        • Links from a more important page should count more than links from a weaker page
      • Basic PageRank R(A) for page A: $R(A) = \sum_{B \to A} R(B) / \mathrm{outDegree}(B)$, summed over the pages B that link to A
        • outDegree(B) = number of edges leaving page B = hyperlinks on page B
        • Page B distributes its rank boost over all the pages it points to
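The rank-distribution idea can be sketched as a simple iterative computation. Note the assumptions: the damping factor (0.85), the even redistribution of rank from pages with no outlinks, the fixed iteration count, and the sample graph go beyond the basic description on the slide and are there only so the iteration converges on arbitrary graphs.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Iteratively distribute each page's rank over the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Page B passes R(B) / outDegree(B) to every page it points to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outlinks spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

In this graph C is linked from A, B, and D and ends up with the highest rank, while D, which has no inlinks, keeps only the baseline share.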

  18. Readings
      • Gregory Grefenstette (1998). “The Problem of Cross-Language Information Retrieval.” In Cross-Language Information Retrieval (ed: Grefenstette), Kluwer Academic Publishers.
      • Doug Oard et al. (1999). “Multilingual Information Discovery and AccesS (MIDAS).” D-Lib Magazine, 5 (10), Oct.
      • Sung Hyon Myaeng et al. (1998). “A Flexible Model for Retrieval of SGML Documents.” Proc. of the 21st ACM SIGIR Conference, Australia.
      • James Allan (2002). “Introduction to Topic Detection and Tracking.” In Topic Detection and Tracking: Event-based Information Organization (ed: Allan), Kluwer Academic Publishers.
      • Paul Resnick & Hal Varian (1997). “Recommender Systems.” CACM 40 (3), March, pp 56-58.
      • Badrul Sarwar et al. (2001). “Item-based Collaborative Filtering Recommendation Algorithms.” http://citeseer.nj.nec.com/sarwar01itembased.html
      • Karen Sparck Jones (1999). “Automatic summarizing: factors and directions.” In Advances in Automatic Text Summarization (eds: Mani & Maybury), MIT Press.
      • Ellen Voorhees (2000). “Overview of TREC-9 Question Answering Track.”
      • Ralph Grishman (1997). “Information Extraction: Techniques and Challenges.” In Information Extraction - International Summer School SCIE-97 (ed: Maria Teresa Pazienza), Springer-Verlag. (See http://nlp.cs.nyu.edu/publication/index.shtml)
