INFORMATION RETRIEVAL AND WEB SEARCH CC437 (Based on original material by Udo Kruschwitz)
INFORMATION RETRIEVAL • GOAL: Find the documents most relevant to a certain QUERY • Latest development: WEB SEARCH • Use the Web as the collection of documents • Related: • QUESTION-ANSWERING • DOCUMENT CLASSIFICATION
INFORMATION RETRIEVAL: SUBTASKS • INDEX the documents in the collection (offline) • PROCESS the query • EVALUATE SIMILARITY and RANK the results • Find the documents most closely matching the query • DISPLAY results / enter a DIALOGUE • E.g., the user may refine the query
DOCUMENTS AS BAGS OF WORDS • DOCUMENT: broad tech stock rally may signal trend - traders. technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums. • INDEX: broad, may, rally, rallied, signal, stock, stocks, tech, technology, traders, trend
SUBTASKS I: INDEXING • PREPROCESSING • Deletion of STOPWORDS • STEMMING • Selection of INDEX TERMS
INDEXING I: PREPROCESSING • PUNCTUATION REMOVAL (Crestani et al) • CASE FOLDING • London → london • LONDON → london • DIGIT REMOVAL • But: SPARCStation 5
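A minimal preprocessing sketch of the steps above (case folding, punctuation removal, digit removal) in Python; the regular expressions and the sample sentence are illustrative assumptions, not from the original slides.

```python
import re

def preprocess(text):
    # Case folding: London, LONDON -> london
    text = text.lower()
    # Punctuation removal
    text = re.sub(r"[^\w\s]", " ", text)
    # Digit removal (note the caveat above: this also damages terms like "SPARCStation 5")
    text = re.sub(r"\d+", " ", text)
    return text.split()

print(preprocess("Technology stocks rallied on Tuesday, with gains scored broadly."))
# ['technology', 'stocks', 'rallied', 'on', 'tuesday', 'with', 'gains', 'scored', 'broadly']
```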
INDEXING II: STOPWORD REMOVAL • Very frequent words are not good discriminators • Many of these are CLOSED CLASS words • INQUERY’s list of stop words beginning with letter “a”: • a, about, above, according, across, after, afterwards, again, against, albeit, all, almost, alone, already, also, although, always, among, amongst, am, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anywhere, apart, are, around, as, at • Domain-specific stopwords • search, webmaster, copyright, www
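A minimal stopword-filtering sketch; the stopword set below is only an illustrative fragment (a few entries of the kind listed above plus some function words), not a real system's list.

```python
# Illustrative fragment of a stopword list (closed-class, very frequent words)
STOPWORDS = {"a", "about", "across", "amid", "on", "with", "what", "some", "from", "many"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

tokens = ["technology", "stocks", "rallied", "on", "tuesday", "with", "gains"]
print(remove_stopwords(tokens))  # ['technology', 'stocks', 'rallied', 'tuesday', 'gains']
```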
INDEXING III: STEMMING • Simplest: suffix stripping • PORTER STEMMER: inflectional & derivational morphology • develop → develop • developing → develop • development → develop • developments → develop • BUT: photography → photographi • The effectiveness of stemming: • For English: the increase in recall doesn't compensate for the loss in precision • For other languages: necessary • E.g., Abdul Goweder's dissertation
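A sketch of Porter stemming using NLTK's implementation (assuming the nltk package is installed); it reproduces the examples above, including the over-stemming of "photography".

```python
from nltk.stem import PorterStemmer  # requires: pip install nltk

stemmer = PorterStemmer()
for word in ["develop", "developing", "development", "developments", "photography"]:
    # Prints e.g. "developments -> develop" and "photography -> photographi"
    print(word, "->", stemmer.stem(word))
```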
STORAGE • Requirements • Huge amounts of data • Lots of redundancy • Quick random access necessary • Indexing techniques: • Inverted index files • Suffix trees / suffix arrays • Signature files
STORAGE TECHNIQUES: INVERTED INDEX • DOCUMENT 1: broad tech stock rally may signal trend - traders. • DOCUMENT 2: technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums. • INVERTED INDEX: broad {1}, gain {2}, rally {1,2}, score {2}, signal {1}, stock {1,2}, tech {1}, technology {2}, traders {1,2}, trend {1}, tuesday {2}
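A minimal sketch of building an inverted index (term → set of document IDs) for the two example documents; tokenization here is plain whitespace splitting with no stemming, so unlike the slide, "stocks" and "stock" remain separate entries.

```python
from collections import defaultdict

docs = {
    1: "broad tech stock rally may signal trend traders",
    2: "technology stocks rallied on tuesday with gains scored broadly",
}

inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["traders"]))  # [1]
print(sorted(inverted_index["tuesday"]))  # [2]
```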
SIMILARITY MODELS • Boolean model • Probabilistic model • Vector-space model
THE BOOLEAN MODEL • Each index term is either present or absent • Documents are either RELEVANT or NOT RELEVANT (no grading of results) • Advantages • Clean formalism, simple to implement • Disadvantages • Exact matching only • All index terms equal weight
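A minimal sketch of Boolean retrieval over an inverted index like the one above: AND is set intersection, OR is set union; results are unranked. The index contents are illustrative.

```python
# Term -> set of document IDs, as produced by the indexing step
index = {
    "stock":   {1, 2},
    "rally":   {1, 2},
    "tuesday": {2},
}

# "stock AND tuesday": exact matching, every result counts as equally relevant
print(index["stock"] & index["tuesday"])  # {2}
# "rally OR tuesday"
print(index["rally"] | index["tuesday"])  # {1, 2}
```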
THE VECTOR SPACE MODEL • Query and documents are represented as vectors of index terms, assigned non-binary WEIGHTS • Similarity calculated using vector algebra: COSINE (cf. lexical similarity models) • RANKED similarity • Most popular of all models (cf. Salton and Lesk's SMART)
SIMILARITY IN VECTOR SPACE MODELS: THE COSINE MEASURE • (Figure: document vector dj and query vector qk, separated by angle θ; similarity is the cosine of θ)
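A minimal sketch of the cosine measure over term-weight vectors (sim(d, q) = cos θ = d·q / (|d| |q|)); the example weights are illustrative assumptions.

```python
import math

def cosine_similarity(d, q):
    # sim(d, q) = cos(theta) = (d . q) / (|d| * |q|)
    dot = sum(w * q.get(term, 0.0) for term, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

doc = {"stock": 0.8, "rally": 0.5, "tech": 0.3}   # term -> weight (e.g. tf.idf)
query = {"tech": 1.0, "stock": 1.0}
print(cosine_similarity(doc, query))  # ~0.79
```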
TERM WEIGHTING IN VECTOR SPACE MODELS: THE TF.IDF MEASURE • wik = tfik · log(N / ni) • tfik: frequency of term i in document k • ni: number of documents with term i • N: total number of documents in the collection
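A minimal sketch of the tf.idf weight in the formulation above; the numbers in the example are illustrative.

```python
import math

def tf_idf(tf, df, n_docs):
    # tf: frequency of term i in document k
    # df: number of documents containing term i
    # n_docs: total number of documents in the collection
    return tf * math.log(n_docs / df)

# A term occurring 3 times in a document and appearing in 10 of 1000 documents
print(tf_idf(tf=3, df=10, n_docs=1000))  # 3 * log(100) ~= 13.8
```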
EVALUATION • One of the most important contributions of IR to NLE (Natural Language Engineering) has been the development of better ways of evaluating systems than simple accuracy
Simplest quantitative evaluation metrics • ACCURACY: percentage correct (against some gold standard) - e.g., a tagger gets 96.7% of tags correct when evaluated using the Penn Treebank • Problem with accuracy: only really useful when the classes are of approximately equal size (not the case in IR) • ERROR: percentage wrong - ERROR REDUCTION is the most typical metric in ASR
A more general form of evaluation: precision & recall • (Figure: a collection of documents)
Positives and negatives • Selected and relevant: TRUE POSITIVES (TP) • Selected but not relevant: FALSE POSITIVES (FP) • Relevant but not selected: FALSE NEGATIVES • Neither selected nor relevant: TRUE NEGATIVES
Precision and recall • PRECISION: proportion of correct items AMONG SELECTED ITEMS = TP / (TP + FP) • RECALL: proportion of correct items that were selected = TP / (TP + FN)
The tradeoff between precision and recall • Easy to get high precision: return almost nothing (only the items you are most confident about) • Easy to get high recall: return everything • Really need to report BOTH, or the F-measure (see the sketch below)
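A minimal sketch of precision, recall and the F-measure (here F1, the harmonic mean of the two); the counts in the example are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: proportion of selected items that are correct
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: proportion of correct items that were selected
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 balances the two, so "return everything" no longer looks good
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A system that returns everything: perfect recall, very low precision
print(precision_recall_f1(tp=50, fp=950, fn=0))  # (0.05, 1.0, ~0.095)
```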
WEB SEARCH • In many senses, just a form of IR • But: • Additional information has to be taken into account • Markup • Hyperlinks • Meta tags • Extra problems • Documents are highly heterogeneous • Multimedia • Quality of data
GOOGLE • Key aspects of Google's search algorithm (as far as we know!) • Analyze link structure: PAGERANK • Exploit visual presentation • PageRank is used to rank retrieved documents in addition to similarity measures • PageRank motivations: • The most important papers are those cited most often • Not all sources of citations are equally reliable
PAGE RANK • PR(p) = q + (1 − q) · Σi PR(pi) / C(pi), summing over the pages pi pointing to p • q: probability of randomly jumping to that page • C(pi): number of outgoing links on page pi
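A minimal power-iteration sketch of PageRank in the formulation above (q is the random-jump probability); the toy link graph and parameter values are illustrative assumptions.

```python
def pagerank(links, q=0.15, iterations=50):
    # links: page -> list of pages it points to (every target is also a key)
    pages = list(links)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: q for p in pages}          # random-jump contribution
        for p, outgoing in links.items():
            for target in outgoing:
                # p shares its rank equally among the pages it points to
                new_rank[target] += (1 - q) * rank[p] / len(outgoing)
        rank = new_rank
    return rank

# Toy web: a and b both point to c; c points back to a
print(pagerank({"a": ["c"], "b": ["c"], "c": ["a"]}))
```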
READINGS AND REFERENCES • Jurafsky and Martin, chapter 10.1-10.4 • Other references • Brin, S. and Page, L., 1998, "The anatomy of a large-scale hypertextual Web search engine", in Proceedings of the 7th World Wide Web Conference (WWW7), Brisbane • Crestani, F., et al., 1998, "Is this document relevant? … probably", ACM Computing Surveys, 30(4):528-552 • Goweder, A., 2004, The role of stemming in IR: the case of Arabic, PhD dissertation, University of Essex • Porter, M.F., 1980, "An algorithm for suffix stripping", Program, 14(3):130-137 • Salton, G. and Lesk, M.E., 1968, "Computer evaluation of indexing and text processing", Journal of the ACM, 15(1):8-36