INFORMATION RETRIEVAL AND WEB SEARCH CC437 (Based on original material by Udo Kruschwitz)
INFORMATION RETRIEVAL • GOAL: Find the documents most relevant to a certain QUERY • Latest development: WEB SEARCH • Use the Web as the collection of documents • Related: • QUESTION-ANSWERING • DOCUMENT CLASSIFICATION
INFORMATION RETRIEVAL: SUBTASKS • INDEX the documents in the collection (offline) • PROCESS the query • EVALUATE SIMILARITY and RANK the results • Find the documents most closely matching the query • DISPLAY results / enter a DIALOGUE • E.g., the user may refine the query
DOCUMENTS AS BAGS OF WORDS • DOCUMENT: broad tech stock rally may signal trend - traders. technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums. • INDEX: broad, may, rally, rallied, signal, stock, stocks, tech, technology, traders, trend
SUBTASKS I: INDEXING • PREPROCESSING • Deletion of STOPWORDS • STEMMING • Selection of INDEX TERMS
INDEXING I: PREPROCESSING • PUNCTUATION REMOVAL (Crestani et al) • CASE FOLDING • London → london • LONDON → london • DIGIT REMOVAL • But: SPARCStation 5
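A minimal preprocessing sketch of the steps above (case folding, punctuation removal, digit removal) in Python; the regular expressions and the sample sentence are illustrative assumptions, not from the original slides.

```python
import re

def preprocess(text):
    # Case folding: London, LONDON -> london
    text = text.lower()
    # Punctuation removal
    text = re.sub(r"[^\w\s]", " ", text)
    # Digit removal (note the caveat above: this also damages terms like "SPARCStation 5")
    text = re.sub(r"\d+", " ", text)
    return text.split()

print(preprocess("Technology stocks rallied on Tuesday, with gains scored broadly."))
# ['technology', 'stocks', 'rallied', 'on', 'tuesday', 'with', 'gains', 'scored', 'broadly']
```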
INDEXING II: STOPWORD REMOVAL • Very frequent words are not good discriminators • Many of these are CLOSED CLASS words • INQUERY’s list of stop words beginning with letter “a”: • a, about, above, according, across, after, afterwards, again, against, albeit, all, almost, alone, already, also, although, always, among, amongst, am, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anywhere, apart, are, around, as, at • Domain-specific stopwords • search, webmaster, copyright, www
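A minimal stopword-filtering sketch; the stopword set below is only an illustrative fragment (a few entries of the kind listed above plus some function words), not a real system's list.

```python
# Illustrative fragment of a stopword list (closed-class, very frequent words)
STOPWORDS = {"a", "about", "across", "amid", "on", "with", "what", "some", "from", "many"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

tokens = ["technology", "stocks", "rallied", "on", "tuesday", "with", "gains"]
print(remove_stopwords(tokens))  # ['technology', 'stocks', 'rallied', 'tuesday', 'gains']
```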
INDEXING III: STEMMING • Simplest: suffix stripping • PORTER STEMMER: inflectional & derivational morphology • develop → develop • developing → develop • development → develop • developments → develop • BUT: photography → photographi • The effectiveness of stemming: • For English: the increase in recall doesn't compensate for the loss in precision • For other languages: necessary • E.g., Abdul Goweder's dissertation
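A sketch of Porter stemming using NLTK's implementation (assuming the nltk package is installed); it reproduces the examples above, including the over-stemming of "photography".

```python
from nltk.stem import PorterStemmer  # requires: pip install nltk

stemmer = PorterStemmer()
for word in ["develop", "developing", "development", "developments", "photography"]:
    # Prints e.g. "developments -> develop" and "photography -> photographi"
    print(word, "->", stemmer.stem(word))
```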
STORAGE • Requirements • Huge amounts of data • Lots of redundancy • Quick random access necessary • Indexing techniques: • Inverted index files • Suffix trees / suffix arrays • Signature files
STORAGE TECHNIQUES: INVERTED INDEX • DOCUMENT 1: broad tech stock rally may signal trend - traders. • DOCUMENT 2: technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums. • INVERTED INDEX: broad {1}, gain {2}, rally {1,2}, score {2}, signal {1}, stock {1,2}, tech {1}, technology {2}, traders {1,2}, trend {1}, tuesday {2}
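A minimal sketch of building an inverted index (term → set of document IDs) for the two example documents; tokenization here is plain whitespace splitting with no stemming, so unlike the slide, "stocks" and "stock" remain separate entries.

```python
from collections import defaultdict

docs = {
    1: "broad tech stock rally may signal trend traders",
    2: "technology stocks rallied on tuesday with gains scored broadly",
}

inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["traders"]))  # [1]
print(sorted(inverted_index["tuesday"]))  # [2]
```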
SIMILARITY MODELS • Boolean model • Probabilistic model • Vector-space model
THE BOOLEAN MODEL • Each index term is either present or absent • Documents are either RELEVANT or NOT RELEVANT (no grading of results) • Advantages • Clean formalism, simple to implement • Disadvantages • Exact matching only • All index terms equal weight
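A minimal sketch of Boolean retrieval over an inverted index like the one above: AND is set intersection, OR is set union; results are unranked. The index contents are illustrative.

```python
# Term -> set of document IDs, as produced by the indexing step
index = {
    "stock":   {1, 2},
    "rally":   {1, 2},
    "tuesday": {2},
}

# "stock AND tuesday": exact matching, every result counts as equally relevant
print(index["stock"] & index["tuesday"])  # {2}
# "rally OR tuesday"
print(index["rally"] | index["tuesday"])  # {1, 2}
```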
THE VECTOR SPACE MODEL • Query and documents are represented as vectors of index terms, assigned non-binary WEIGHTS • Similarity calculated using vector algebra: COSINE (cf. lexical similarity models) • RANKED similarity • Most popular of all models (cf. Salton and Lesk's SMART)
SIMILARITY IN VECTOR SPACE MODELS: THE COSINE MEASURE • (Figure: document vector dj and query vector qk, separated by angle θ; similarity is the cosine of θ)
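A minimal sketch of the cosine measure over term-weight vectors (sim(d, q) = cos θ = d·q / (|d| |q|)); the example weights are illustrative assumptions.

```python
import math

def cosine_similarity(d, q):
    # sim(d, q) = cos(theta) = (d . q) / (|d| * |q|)
    dot = sum(w * q.get(term, 0.0) for term, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

doc = {"stock": 0.8, "rally": 0.5, "tech": 0.3}   # term -> weight (e.g. tf.idf)
query = {"tech": 1.0, "stock": 1.0}
print(cosine_similarity(doc, query))  # ~0.79
```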
TERM WEIGHTING IN VECTOR SPACE MODELS: THE TF.IDF MEASURE • wik = tfik · log(N / ni) • tfik: frequency of term i in document k • ni: number of documents with term i • N: total number of documents in the collection
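A minimal sketch of the tf.idf weight in the formulation above; the numbers in the example are illustrative.

```python
import math

def tf_idf(tf, df, n_docs):
    # tf: frequency of term i in document k
    # df: number of documents containing term i
    # n_docs: total number of documents in the collection
    return tf * math.log(n_docs / df)

# A term occurring 3 times in a document and appearing in 10 of 1000 documents
print(tf_idf(tf=3, df=10, n_docs=1000))  # 3 * log(100) ~= 13.8
```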
EVALUATION • One of the most important contributions of IR to NLE (Natural Language Engineering) has been the development of better ways of evaluating systems than simple accuracy
Simplest quantitative evaluation metrics • ACCURACY: percentage correct (against some gold standard) - e.g., a tagger gets 96.7% of tags correct when evaluated using the Penn Treebank • Problem with accuracy: only really useful when the classes are of approximately equal size (not the case in IR) • ERROR: percentage wrong - ERROR REDUCTION is the most typical metric in ASR
A more general form of evaluation: precision & recall • (Figure: a collection of documents)
Positives and negatives • Selected and relevant: TRUE POSITIVES (TP) • Selected but not relevant: FALSE POSITIVES (FP) • Relevant but not selected: FALSE NEGATIVES • Neither selected nor relevant: TRUE NEGATIVES
Precision and recall • PRECISION: proportion of correct items AMONG SELECTED ITEMS = TP / (TP + FP) • RECALL: proportion of correct items that were selected = TP / (TP + FN)
The tradeoff between precision and recall • Easy to get high precision: return almost nothing (only the items you are most confident about) • Easy to get high recall: return everything • Really need to report BOTH, or the F-measure (see the sketch below)
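A minimal sketch of precision, recall and the F-measure (here F1, the harmonic mean of the two); the counts in the example are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: proportion of selected items that are correct
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: proportion of correct items that were selected
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 balances the two, so "return everything" no longer looks good
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A system that returns everything: perfect recall, very low precision
print(precision_recall_f1(tp=50, fp=950, fn=0))  # (0.05, 1.0, ~0.095)
```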
WEB SEARCH • In many senses, just a form of IR • But: • Additional information has to be taken into account • Markup • Hyperlinks • Meta tags • Extra problems • Documents are highly heterogeneous • Multimedia • Quality of data
GOOGLE • Key aspects of Google's search algorithm (as far as we know!) • Analyze link structure: PAGERANK • Exploit visual presentation • PageRank is used to rank retrieved documents in addition to similarity measures • PageRank motivations: • The most important papers are those cited most often • Not all sources of citations are equally reliable
PAGE RANK • PR(p) = q + (1 − q) · Σi PR(pi) / C(pi), summing over the pages pi pointing to p • q: probability of randomly jumping to that page • C(pi): number of outgoing links on page pi
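A minimal power-iteration sketch of PageRank in the formulation above (q is the random-jump probability); the toy link graph and parameter values are illustrative assumptions.

```python
def pagerank(links, q=0.15, iterations=50):
    # links: page -> list of pages it points to (every target is also a key)
    pages = list(links)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: q for p in pages}          # random-jump contribution
        for p, outgoing in links.items():
            for target in outgoing:
                # p shares its rank equally among the pages it points to
                new_rank[target] += (1 - q) * rank[p] / len(outgoing)
        rank = new_rank
    return rank

# Toy web: a and b both point to c; c points back to a
print(pagerank({"a": ["c"], "b": ["c"], "c": ["a"]}))
```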
READINGS AND REFERENCES • Jurafsky and Martin, chapter 10.1-10.4 • Other references • Brin, S. and Page, L., 1998, "The anatomy of a large-scale hypertextual Web search engine", in Proceedings of the 7th World Wide Web Conference (WWW7), Brisbane • Crestani, F., et al., 1998, "Is this document relevant? … probably", ACM Computing Surveys, 30(4):528-552 • Goweder, A., 2004, The role of stemming in IR: the case of Arabic, PhD dissertation, University of Essex • Porter, M.F., 1980, "An algorithm for suffix stripping", Program, 14(3):130-137 • Salton, G. and Lesk, M.E., 1968, "Computer evaluation of indexing and text processing", Journal of the ACM, 15(1):8-36