Web Mining. Why IR? Overview of Search Engine. Flow Chart of SE.
Research & Fun http://duilian.msra.cn
Text Processing (1) - Indexing
• An index is a list of terms with relevant information
  • Frequency of terms
  • Location of terms
  • Etc.
• Index terms represent document content and distinguish documents from one another
  • “economy” vs “computer” in a news article of the Financial Times
• To build the index
  • Extraction of index terms
  • Computation of their weights
Text Processing (2) - Extraction
• Extraction of index terms
  • Word or phrase level
  • Morphological analysis (stemming in English)
    • “information”, “informed”, “informs”, “informative” → “inform”
  • Removal of stop words
    • “a”, “an”, “the”, “is”, “are”, “am”, …
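The extraction steps on this slide can be sketched in a few lines of Python. This is a toy illustration, not a real stemmer: the suffix list `SUFFIXES` is a hypothetical one chosen only to cover the slide's “inform” example, and `STOP_WORDS` holds just the six words the slide lists. A production system would more likely use an established algorithm such as Porter stemming and a much longer stop list.

```python
import re

# Illustrative stop-word list taken from the slide's examples; real lists are longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "am"}

# Hypothetical suffix list, checked in order, chosen only to cover the slide's example.
SUFFIXES = ("ation", "ative", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """Lowercase, tokenize on runs of letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


print([stem(w) for w in ["information", "informed", "informs", "informative"]])
print(extract_terms("The informed reader is informative"))
```

All four word forms from the slide reduce to the same index term, which is the point of stemming: a query for one form can match documents containing any of the others.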
Text Processing (3) - Term Weight
• Calculation of term weights
  • Statistical weights using frequency information
  • Importance of a term in a document
• E.g. TF*IDF
  • TF: term frequency, the number of times term k occurs in a document
  • IDF: inverse document frequency of term k in a collection
  • DF: document frequency, the number of documents in which the term appears
• High TF with low DF marks a good term for representing the text
• High TF with high DF marks a poor term
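A minimal sketch of TF*IDF weighting follows. The slide does not give an exact formula, so this assumes one common variant, weight = TF × log(N / DF), where N is the number of documents. The function name `tf_idf` and the three tiny documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(docs)
    # DF: number of documents in which each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # TF: raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical, already-tokenized documents.
docs = [
    ["economy", "market", "economy", "news"],
    ["computer", "market", "news"],
    ["computer", "chip", "news"],
]
weights = tf_idf(docs)
for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {w:.3f}")
```

In the first document, “economy” has high TF and low DF and gets the largest weight, while “news” appears in every document and gets weight zero, matching the slide's good-word and bad-word cases.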
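An Example
[Figure: term counts for Document 1 and Document 2; the terms shown include “University” and “Arizona”. The count table is not recoverable from the transcript.]
Text Processing (4) - Storing indexing results
[Figure: Document 1 and Document 2 feed an index in which each entry pairs an index word with its word info.]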
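The “Storing indexing results” slide pairs each index word with its word info. One common way to store this is an inverted index; the sketch below assumes that structure and records, for each term, the documents it occurs in and its positions there, so that frequency and location are both available. The function `build_index` and the two sample documents are hypothetical and do not reproduce the slide's figure.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]}; term frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


# Hypothetical, already-tokenized documents keyed by document id.
docs = {
    1: ["university", "arizona", "university"],
    2: ["arizona", "library"],
}
index = build_index(docs)
for term in sorted(index):
    print(term, index[term])
```

Looking up a query term then costs one dictionary access instead of a scan over every document.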
Matching & Ranking (2)
• Ranking
  • Retrieval Model
    • Boolean (exact) => Fuzzy Set (inexact)
    • Vector Space
    • Probabilistic
    • Inference Net ...
  • Weighting Schemes
    • Index terms, query terms
    • Document characteristics
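Of the retrieval models listed, the vector space model is the easiest to sketch: the query and each document become term-weight vectors, and documents are ranked by the cosine of the angle to the query. The example below assumes raw term counts as weights for brevity (TF*IDF weights would slot in the same way); the names `cosine` and `rank` and the sample documents are illustrative.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing cosine similarity."""
    q = Counter(query)
    scores = [(doc_id, cosine(q, Counter(tokens))) for doc_id, tokens in docs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical, already-tokenized documents.
docs = {
    "d1": ["economy", "market", "economy"],
    "d2": ["computer", "market"],
    "d3": ["computer", "chip"],
}
ranking = rank(["economy", "market"], docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Unlike an exact Boolean AND, which would return only d1, this ranking also gives the partial match d2 a nonzero score.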
Matching & Ranking (2)
• Techniques for efficiency
  • New storage structures, esp. for new document types
  • Use of accumulators for efficient generation of ranked output
  • Compression/decompression of indexes
• Techniques for Web search engines
  • Use of hyperlinks
    • Inlinks & outlinks (PageRank)
    • Authority vs hub pages (HITS)
  • In conjunction with Directory Services (e.g. Yahoo)
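The authority-versus-hub distinction can be made concrete with a small sketch of the HITS iteration: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of pages it links to. The slide names HITS but gives no details, so the normalization, the fixed iteration count, and the toy link graph below are all assumptions for illustration.

```python
import math


def hits(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: "portal" links to three pages, "blog" to one.
links = {"portal": ["a", "b", "c"], "blog": ["a"], "a": [], "b": [], "c": []}
hub, auth = hits(links)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

Here "portal" emerges as the strongest hub because it points to every authority, and "a" as the strongest authority because both hubs point to it.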
PageRank Algorithm
• Basic idea: more links to a page imply a better page
  • But not all links are created equal
  • Links from a more important page should count more than links from a weaker page
• Basic PageRank R(A) for page A: $R(A) = \sum_{B \to A} R(B) / \mathrm{outDegree}(B)$, summing over the pages B that link to A
  • outDegree(B) = number of edges leaving page B = hyperlinks on page B
  • Page B distributes its rank boost over all the pages it points to
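A rough sketch of PageRank as an iterative computation follows: on each pass, every page B hands R(B) / outDegree(B) to each page it links to. The `damping` parameter is an addition that the slide does not mention; 0.85 is a commonly used value, and `damping=1.0` gives the basic form described on the slide. The sketch assumes every page has at least one outlink, and the three-page graph is made up.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the pages it links to; every page needs an outlink."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Each page starts the pass with the same small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for b, targets in links.items():
            share = rank[b] / len(targets)  # R(B) / outDegree(B)
            for a in targets:
                new_rank[a] += damping * share
        rank = new_rank
    return rank


# Hypothetical link graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
basic = pagerank(links, damping=1.0)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page C ranks highest because it receives all of B's rank and half of A's, while B receives only half of A's rank.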
Readings
• Gregory Grefenstette (1998). “The Problem of Cross-Language Information Retrieval.” In Cross-Language Information Retrieval (ed: Grefenstette), Kluwer Academic Publishers.
• Doug Oard et al. (1999). “Multilingual Information Discovery and AccesS (MIDAS).” D-Lib Magazine, 5 (10), Oct.
• Sung Hyon Myaeng et al. (1998). “A Flexible Model for Retrieval of SGML Documents.” Proc. of the 21st ACM SIGIR Conference, Australia.
• James Allan (2002). “Introduction to Topic Detection and Tracking.” In Topic Detection and Tracking: Event-based Information Organization (ed: Allan), Kluwer Academic Publishers.
• Paul Resnick & Hal Varian (1997). “Recommender Systems.” CACM 40 (3), March, pp 56-58.
• Badrul Sarwar et al. (2001). “Item-based Collaborative Filtering Recommendation Algorithms.” http://citeseer.nj.nec.com/sarwar01itembased.html
• Karen Sparck Jones (1999). “Automatic summarizing: factors and directions.” In Advances in Automatic Text Summarization (eds: Mani & Maybury), MIT Press.
• Ellen Voorhees (2000). “Overview of TREC-9 Question Answering Track.”
• Ralph Grishman (1997). “Information Extraction: Techniques and Challenges.” In Information Extraction - International Summer School SCIE-97 (ed: Maria Teresa Pazienza), Springer-Verlag. (See http://nlp.cs.nyu.edu/publication/index.shtml)