Information Retrieval and Vector Space Model
Presented by Jun Miao, York University
What is Information Retrieval (IR)?
• IR: retrieve information that is relevant to your need
• Search Engine
• Question Answering
• Information Extraction
• Information Filtering
• Information Recommendation
In the old days…
• The term "information retrieval" may have been coined by Calvin Mooers
• Early IR applications were used in libraries
• Set-based retrieval: the system partitions the corpus into two subsets of documents: those it considers relevant to the search query, and those it does not
Nowadays
• Ranked retrieval: the system responds to a search query by ranking all documents in the corpus based on its estimate of their relevance to the query
• A free-form query expresses the user's information need
• Documents are ranked by decreasing likelihood of relevance
• Many studies show ranked retrieval is superior to set-based retrieval
An Information Retrieval Process (borrowed from Prof. Nie's slides)
[Diagram: an information need is expressed as a query; the IR system matches the query against the document collection and returns an answer list]
Lexical Analysis • What counts as a word or token in the indexing scheme? • A big topic
Stop List
• Function words do not bear useful information for IR: of, not, to, or, in, about, with, I, be, …
• A stop list contains stop words, which are not used as index terms:
• Prepositions
• Articles
• Pronouns
• Some adverbs and adjectives
• Some frequent words (e.g. "document")
• Removing stop words usually improves IR effectiveness
• A few "standard" stop lists are commonly used
Stemming
• Reason: different word forms may bear similar meaning (e.g. search, searching); create a "standard" representation for them
• Stemming: removing certain word endings, e.g. dancer, dancers, dance, danced, dancing → dance
Stemming (Cont'd)
• Two main methods:
• Linguistic/dictionary-based stemming: high stemming accuracy and higher coverage, but high implementation and processing costs
• Porter-style stemming: lower stemming accuracy and coverage, but lower implementation and processing costs; usually sufficient for IR
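As a rough illustration of suffix stripping, here is a toy sketch. The rule list is an assumption made up for this example; it is far cruder than the real Porter algorithm.

```python
# Toy suffix-stripping stemmer, for illustration only: it removes a few
# common English endings so that related word forms map to one index term.
SUFFIXES = ["ing", "ers", "er", "ed", "es", "e", "s"]  # longest first

def crude_stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["dancer", "dancers", "dance", "danced", "dancing"]])
# -> ['danc', 'danc', 'danc', 'danc', 'danc']
```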
Flat file indexing • Each document is represented by a set of weighted keywords (terms): D1 {(t1, w1), (t2,w2), …} e.g. D1 {(comput, 0.2), (architect, 0.3), …} D2 {(comput, 0.1), (network, 0.5), …}
Query Analysis
• Parse the query
• Remove stopwords
• Stemming
• Get terms
• Adjacency operations: connect related terms together
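A minimal sketch of such a query-analysis pipeline; the stop list and the one-rule stemmer below are toy placeholders, not those of any particular system.

```python
# Hypothetical query-analysis pipeline: lexical analysis, stop-word removal, stemming.
import re

STOPWORDS = {"of", "in", "a", "the", "to", "and"}

def stem(word: str) -> str:
    # placeholder stemmer: strip a final "s" only
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def analyze(query: str) -> list[str]:
    tokens = re.findall(r"[a-z0-9]+", query.lower())    # lexical analysis
    tokens = [t for t in tokens if t not in STOPWORDS]  # stop list
    return [stem(t) for t in tokens]                    # stemming

print(analyze("Shipments of gold delivered in fires"))
# -> ['shipment', 'gold', 'delivered', 'fire']
```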
Models
• Matching score model
• Document D = a set of weighted keywords
• Query Q = a set of non-weighted keywords
• R(D, Q) = Σi w(ti, D), where ti is in Q
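A minimal sketch of the matching-score model, reusing the flat-file representation from the indexing slide (the weights are illustrative):

```python
# Matching-score model: R(D, Q) is the sum of the document weights of the query terms.
def matching_score(doc_weights: dict[str, float], query_terms: list[str]) -> float:
    return sum(doc_weights.get(t, 0.0) for t in query_terms)

D1 = {"comput": 0.2, "architect": 0.3}
D2 = {"comput": 0.1, "network": 0.5}
Q = ["comput", "network"]
print(matching_score(D1, Q), matching_score(D2, Q))  # -> 0.2 0.6
```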
Models(Cont’d) • Boolean Model • Vector Space Model • Probability Model • Language Model • Neural Network Model • Fuzzy Set Model • ……
tf*idf weighting scheme
• tf = term frequency: frequency of a term/keyword in a document; the higher the tf, the higher the importance (weight) of the term for that document
• df = document frequency: number of documents containing the term; reflects the distribution of the term
• idf = inverse document frequency: captures the unevenness of the term's distribution in the corpus, i.e. the specificity of the term to a document
• idf = log(d / df), where d = total number of documents
• The more evenly a term is distributed, the less specific it is to any one document
• weight(t, D) = tf(t, D) * idf(t)
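A small sketch of this weighting scheme, assuming documents are given as lists of tokens and that idf uses log base 10, as in the worked example later in the deck:

```python
# tf*idf weighting over a toy corpus of tokenized documents.
import math

def idf(term: str, docs: list[list[str]]) -> float:
    df = sum(1 for d in docs if term in d)      # document frequency
    return math.log10(len(docs) / df) if df else 0.0

def tfidf(term: str, doc: list[str], docs: list[list[str]]) -> float:
    tf = doc.count(term)                        # term frequency in this doc
    return tf * idf(term, docs)

docs = [["gold", "fire"], ["silver", "silver", "truck"], ["gold", "truck"]]
print(round(tfidf("silver", docs[1], docs), 3))  # 2 * log10(3/1) = 0.954
```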
Evaluation
• Given the result list returned for a query, what is its performance?
[Diagram: Venn diagram of the retrieved set and the relevant set; their overlap is the set of retrieved relevant documents]
Metrics often used (together): • Precision = retrieved relevant docs / retrieved docs • Recall = retrieved relevant docs / relevant docs
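A tiny sketch of both metrics, using made-up document ids:

```python
# Precision and recall from sets of document ids.
retrieved = {1, 2, 3, 4, 5}
relevant = {2, 4, 6, 8}

hits = retrieved & relevant              # retrieved relevant docs
precision = len(hits) / len(retrieved)   # 2 / 5 = 0.4
recall = len(hits) / len(relevant)       # 2 / 4 = 0.5
print(precision, recall)
```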
Precision-Recall Trade-off
• Usually, higher precision means lower recall, and higher recall means lower precision
• Extreme case: return all documents → recall = 1, but precision is very low
For Ranked Lists
• Consider the result lists produced by two IR systems, S1 and S2, for the same query
[Figure: the two ranked lists, with the relevant documents marked]
• Which one is better?
Average Precision
• AP = sum( R(xi) / P(xi) ) / n, where
• xi ∈ the set of retrieved relevant documents
• P(xi): rank of xi in the retrieved list
• R(xi): rank of xi among the retrieved relevant documents
• n: number of retrieved relevant documents
• List 1: AP1 = ((1/1)+(2/3)+(3/6)+(4/9)+(5/10)) / 5 = 0.622
Average Precision (Cont'd)
• List 2: AP2 = ((1/1)+(2/2)+(3/3)+(4/5)+(5/6)) / 5 = 0.927
• S2 is better than S1
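A sketch that reproduces both AP values above, given the ranks at which the relevant documents were retrieved (the ranks are inferred from the slide's arithmetic):

```python
# Average precision as defined above: the i-th relevant document retrieved
# at rank p contributes i / p; the sum is divided by n, the number of
# retrieved relevant documents.
def average_precision(relevant_ranks: list[int]) -> float:
    return sum(i / p for i, p in enumerate(relevant_ranks, start=1)) / len(relevant_ranks)

ap1 = average_precision([1, 3, 6, 9, 10])  # List 1 (system S1)
ap2 = average_precision([1, 2, 3, 5, 6])   # List 2 (system S2)
print(round(ap1, 3), round(ap2, 3))        # -> 0.622 0.927
```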
Evaluating over Multiple Queries
• Mean Average Precision (MAP): the arithmetic mean of the average precisions over all queries
• Example: 5 queries (topics) and 2 IR systems [table of per-query AP values not reproduced]; in that example, S1 is better than S2
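Since the slide's per-query AP table is not reproduced here, the sketch below just shows the computation with illustrative AP values:

```python
# Mean average precision: the arithmetic mean of the per-query APs.
aps_s1 = [0.62, 0.51, 0.70, 0.45, 0.55]   # hypothetical APs for 5 queries
map_s1 = sum(aps_s1) / len(aps_s1)
print(round(map_s1, 3))                   # -> 0.566
```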
Other Measurements
• Precision@N
• R-Precision
• F-measure
• E-measure
• …
Problem
• The document collection is often very large, so it is impractical to judge every document and hence hard to compute the recall rate
Pooling
• Step 1: take the top N documents from the results of the IR systems to form a document pool
• Step 2: experts examine the pool and label each document as relevant or non-relevant for each topic
Difficulties in text IR
• Vocabulary mismatch
• Synonymy: e.g. car vs. automobile
• Polysemy: e.g. "table" (furniture vs. data table)
• Queries are ambiguous; they are only a partial specification of the user's need
• Content representation may be inadequate and incomplete
• The user is the ultimate judge, but we don't know how the user judges…
• The notion of relevance is imprecise, context- and user-dependent
Difficulties in web IR • No stable document collection (spider, crawler) • Invalid document, duplication, etc. • Huge number of documents (partial collection) • Multimedia documents • Great variation of document quality • Multilingual problem • …
NLP in IR • Simple methods: stop word, stemming • Higher-level processing: chunking, parsing, word sense disambiguation • Research about using NLP in IR needs more attention
Popular systems • SMART http://ftp.cs.cornell.edu/pub/smart/ • Terrier http://ir.dcs.gla.ac.uk/terrier/ • Okapi http://www.soi.city.ac.uk/~andym/OKAPIPACK/index.html • Lemur http://www-2.cs.cmu.edu/~lemur/ etc…
Conference and Journal
• Conferences: SIGIR, TREC, CLEF, WWW, ECIR, …
• Journals: ACM Transactions on Information Systems (TOIS), ACM Transactions on Asian Language Information Processing (TALIP), Information Processing & Management (IP&M), Information Retrieval
Idea
• Convert documents and queries into vectors and use a Similarity Coefficient (SC) to measure their similarity
• Presented by Gerard Salton et al. in 1975; implemented in the SMART IR system
• Premise: all terms are independent
Construct Vector Each dimension corresponds to a separate term. Wi,j = weight of term j in document or query i
Doc-Term Matrix • N documents and M terms
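A sketch of building such an N x M matrix of raw term counts; the toy documents are made up and assumed to be already tokenized.

```python
# Document-term matrix: one row per document, one column per term.
docs = [["gold", "fire", "gold"], ["silver", "truck"], ["gold", "truck"]]
vocab = sorted({t for d in docs for t in d})          # the M terms
matrix = [[d.count(t) for t in vocab] for d in docs]  # N rows, M columns
print(vocab)   # ['fire', 'gold', 'silver', 'truck']
print(matrix)  # [[1, 2, 0, 0], [0, 0, 1, 1], [0, 1, 0, 1]]
```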
Three Key Problems
1. Term selection
2. Term weighting
3. Similarity coefficient calculation
Term Selection • Terms represent the content of documents • Term purification • Stemming • Stoplist • Only choose Nouns
Term Weight
• Boolean weight: 1 if the term appears in the document, 0 otherwise
• Term frequency variants: tf; 1 + log(tf); 1 + log(1 + log(tf))
• Inverse document frequency
• tf*idf
Term Weight (Cont'd)
• Document length: two opposing considerations:
• Longer documents simply contain more terms
• Longer documents genuinely carry more information
• So: penalize long documents and compensate short documents
• Pivoted normalization factor: (1 - b) + b * doclen / avgdoclen, with b in (0, 1)
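A sketch of the pivoted length normalization factor from this slide; b = 0.75 is only an assumed value, the slide just requires b in (0, 1).

```python
# Term weights are divided by this factor, so longer-than-average documents
# are penalized and shorter ones boosted.
def pivot_factor(doclen: float, avgdoclen: float, b: float = 0.75) -> float:
    return (1 - b) + b * doclen / avgdoclen

print(pivot_factor(200, 100))  # 1.75  -> weights of a long document shrink
print(pivot_factor(50, 100))   # 0.625 -> weights of a short document grow
```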
Similarity Coefficient Calculation
• Dot product
• Cosine
• Dice
• Jaccard
[Figure: document D and query Q drawn as vectors in a two-dimensional term space with axes t1 and t2]
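Sketches of the four coefficients over plain term-weight vectors, following their standard definitions; the example vectors q and d are made up.

```python
import math

def dot(q, d):
    return sum(a * b for a, b in zip(q, d))

def cosine(q, d):
    return dot(q, d) / (math.sqrt(dot(q, q)) * math.sqrt(dot(d, d)))

def dice(q, d):
    return 2 * dot(q, d) / (dot(q, q) + dot(d, d))

def jaccard(q, d):
    return dot(q, d) / (dot(q, q) + dot(d, d) - dot(q, d))

q = [0.0, 0.2, 0.5]
d = [0.3, 0.1, 0.4]
print(round(dot(q, d), 2), round(cosine(q, d), 3))  # -> 0.22 0.801
```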
Example
• Q: "gold silver truck"
• D1: "Shipment of gold delivered in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"
• dfj = document frequency of the j-th term
• idfj = log10(n / dfj), where n is the number of documents
• tf*idf is used as the term weight here
Example (Cont'd)
• With tf*idf weights and the dot-product SC:
• SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) = 0.031
• SC(Q, D2) = 0.486
• SC(Q, D3) = 0.062
• The ranking is D2, D3, D1
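The whole example can be reproduced with a short script (same query and documents as above, tf*idf weights with log10 idf, dot-product SC):

```python
import math

docs = {
    "D1": "shipment of gold delivered in a fire".split(),
    "D2": "delivery of silver arrived in a silver truck".split(),
    "D3": "shipment of gold arrived in a truck".split(),
}
query = "gold silver truck".split()
vocab = sorted({t for d in docs.values() for t in d} | set(query))

def idf(term: str) -> float:
    df = sum(1 for d in docs.values() if term in d)  # document frequency
    return math.log10(len(docs) / df) if df else 0.0

def weights(tokens: list[str]) -> dict[str, float]:
    return {t: tokens.count(t) * idf(t) for t in vocab}

qw = weights(query)
for name, d in docs.items():
    dw = weights(d)
    sc = sum(qw[t] * dw[t] for t in vocab)           # dot product
    print(name, round(sc, 3))
# -> D1 0.031, D2 0.486, D3 0.062, hence the ranking D2, D3, D1
```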
Advantages of VSM
• Fairly cheap to compute
• Yields decent retrieval effectiveness
• Very popular: SMART is one of the most commonly used academic prototypes
Disadvantages of VSM • No theoretical foundation • Weights in the vectors are very arbitrary • Assumes term independence • Sparse Matrix