

  1. COMP791A: Statistical Language Processing Information Retrieval [M&S] 15.1-15.2 [J&M] 17.3

  2. The problem The standard information retrieval (IR) scenario • The user has an information need • The user types a query that describes the information need • The IR system retrieves a set of documents from a document collection that it believes to be relevant • The documents are ranked according to their likelihood of being relevant • input: • a (large) set/collection of documents • a user query • output: • a (ranked) list of relevant documents

  3. Example of IR

  4. IR within NLP • IR needs to process large volumes of online text • and (traditionally) NLP methods were not robust enough to work on thousands of real-world texts • so IR: • is not based on NLP tools (ex. syntactic/semantic analysis) • uses (mostly) simple, shallow techniques • based mostly on word frequencies • in IR, the meaning of a document: • is the composition of the meanings of its individual words • ordering & constituency of words are not taken into account • bag-of-words approach: “I see what I eat.” and “I eat what I see.” get the same representation, hence the same meaning (see the sketch below)
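A quick illustrative sketch (hypothetical Python, not part of the slides) of why a bag-of-words representation assigns the two example sentences the same meaning:

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase, strip the final period, split on whitespace: word order is discarded.
    return Counter(text.lower().replace(".", "").split())

# Both sentences yield the same multiset of words, hence the "same meaning"
# under a bag-of-words model.
print(bag_of_words("I see what I eat.") == bag_of_words("I eat what I see."))  # True
```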

  5. 2 major topics • Indexing • representing the document collection using words/terms • for fast access to documents • Retrieval methods • matching a user query to indexed documents • 3 major models: • boolean model • vector-space model • probabilistic model

  6. Indexing • Most IR systems use an inverted file to represent the texts in the collection • Inverted file = a table of terms, each with the list of texts that contain that term • ex. assassination → {d1,d4,d95,d5,d90…} murder → {d3,d7,d95…} Kennedy → {d24,d7,d44…} conspiracy → {d3,d55,d90,d98…}
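A minimal sketch of building such an inverted file in Python, assuming the documents are already tokenized (the toy collection and helper name are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc id -> list of terms (already tokenized)."""
    index = defaultdict(set)          # term -> set of documents containing it
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

docs = {"d1": ["assassination", "kennedy"],
        "d3": ["murder", "conspiracy"],
        "d4": ["assassination"]}
index = build_inverted_index(docs)
print(sorted(index["assassination"]))  # ['d1', 'd4']
```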

  7. Example of an inverted file • For each term: • DocCnt: how many documents the term occurs in (used to compute IDF) • FreqCnt: how many times the term occurs in all documents • For each document: • Freq: how many times the term occurs in this doc • WordPosition: the offsets where these occurrences are found in the document • useful to search for terms within n words of each other • to approximate phrases (ex. “car insurance”) • but… primitive notion of phrases… just word/byte position in document, so “car insurance” is not distinguished from “insurance for car” • to generate word-in-context snippets • to highlight terms in the retrieved document • …
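A sketch of the richer inverted file described on this slide, with per-term document/frequency counts and per-document word positions, plus a crude within-n-words test that approximates phrase search (all names and data are illustrative):

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: dict doc_id -> list of terms.  Returns term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def doc_cnt(index, term):
    return len(index[term])                           # DocCnt: documents containing the term

def freq_cnt(index, term):
    return sum(len(p) for p in index[term].values())  # FreqCnt: total occurrences

def near(index, t1, t2, doc_id, n=1):
    """True if t1 and t2 occur within n word positions of each other in doc_id."""
    ps1, ps2 = index[t1].get(doc_id, []), index[t2].get(doc_id, [])
    return any(abs(p1 - p2) <= n for p1 in ps1 for p2 in ps2)

docs = {"d1": ["car", "insurance", "rates"],
        "d2": ["insurance", "for", "car", "owners"]}
idx = build_positional_index(docs)
print(doc_cnt(idx, "car"), freq_cnt(idx, "car"))      # 2 2
# Both documents pass a within-2-words test, so "car insurance" and
# "insurance for car" are not told apart -- the primitive notion of phrase.
print(near(idx, "car", "insurance", "d1", n=2), near(idx, "car", "insurance", "d2", n=2))  # True True
```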

  8. Basic Concept of a Retrieval Model • documents and queries are represented by vectors of <term, value> pairs • term: all possible terms that occur in the query/document • value: presence or absence of the term in the query/document • the value can be • binary (0 if the term is absent; 1 if the term is present) • some weight (term frequency, tf.idf, or other)
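A small sketch of the binary variant, assuming a fixed vocabulary (the vocabulary and documents are illustrative):

```python
vocabulary = ["speech", "language", "processing", "retrieval"]

def binary_vector(terms, vocabulary):
    # 1 if the term is present, 0 if it is absent.
    present = set(terms)
    return [1 if term in present else 0 for term in vocabulary]

print(binary_vector(["language", "processing", "language"], vocabulary))  # [0, 1, 1, 0]
print(binary_vector(["speech", "language"], vocabulary))                  # [1, 1, 0, 0]
```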

  9. Vector-Space Model • binary values do not tell us whether one term is more important than another • so we should weight the terms by importance • the weight of a term (in a document & in the query) can be its raw frequency or some other measure

  10. Term-by-document matrix • the collection of documents is represented by a matrix of weights called a term-by-document matrix • 1 column = representation of one document • 1 row = representation of 1 term across all documents • cell w_ij = weight of term i in document j • note: the matrix is sparse !!!
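A sketch of building a term-by-document matrix of raw term frequencies (in a real system a sparse data structure would be used; the helper name and toy data are illustrative):

```python
from collections import Counter

def term_by_document_matrix(docs):
    """docs: dict doc_id -> list of terms.  Returns (terms, doc_ids, matrix)."""
    doc_ids = sorted(docs)
    counts = {d: Counter(docs[d]) for d in doc_ids}
    terms = sorted({t for c in counts.values() for t in c})
    # matrix[i][j] = raw frequency of term i in document j (most cells are 0 -> sparse)
    matrix = [[counts[d][t] for d in doc_ids] for t in terms]
    return terms, doc_ids, matrix

terms, doc_ids, M = term_by_document_matrix(
    {"d1": ["speech", "language", "language"], "d2": ["speech", "speech"]})
for t, row in zip(terms, M):
    print(t, row)   # language [2, 0]   /   speech [1, 2]
```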

  11. An example • The collection: • d1 = {introduction knowledge in speech and language processing ambiguity models and algorithms language thought and understanding the state of the art and the near-term future some brief history summary} • d2 = {hmms and speech recognition speech recognition architecture overview of the hidden markov models the viterbi algorithm revisited advanced methods in decoding acoustic processing of speech computing acoustic probabilities training a speech recognizer waveform generation for speech synthesis human speech recognition summary} • d3 = {language and complexity the chomsky hierarchy how to tell if a language isn’t regular the pumping lemma are English and other languages regular languages ? is natural language context-free complexity and human processing summary} • The query: Q = {speech language processing}

  12. An example (con’t) • The collection: • d1 = {introduction knowledge in speech and language processing ambiguity models and algorithms language thought and understanding the state of the art and the near-term future some brief history summary} • d2 = {hmms and speech recognition speech recognition architecture overview of the hidden markov models the viterbi algorithm revisited advanced methods in decoding acoustic processing of speech computing acoustic probabilities training a speech recognizer waveform generation for speech synthesis human speech recognition summary} • d3 = {language and complexity the chomsky hierarchy how to tell if a language isn’t regular the pumping lemma are English and other language regular language ? is natural language context-free complexity and human processing summary} • The query: Q = {speech language processing}

  13. An example (con’t) • using raw term frequencies • the vectors for the documents and the query can be seen as points in a multi-dimensional space • where each dimension is a term from the query • [figure: 3-D plot with axes Term 1 (speech), Term 2 (language), Term 3 (processing) and points d1 (1,2,1), d2 (6,0,1), d3 (0,5,1), q (1,1,1)]
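A sketch of how such vectors can be computed by counting only the query terms; the commented values are the ones given on the slide for the collection above:

```python
from collections import Counter

query_terms = ["speech", "language", "processing"]

def tf_vector(text, query_terms):
    # Raw frequency of each query term in the text (one dimension per query term).
    counts = Counter(text.lower().split())
    return [counts[t] for t in query_terms]

# On the collection above this yields d1 -> [1, 2, 1], d2 -> [6, 0, 1],
# d3 -> [0, 5, 1] and, for the query itself:
print(tf_vector("speech language processing", query_terms))  # [1, 1, 1]
```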

  14. Document similarity • The longer the document, the more likely it is to be retrieved: • this makes sense, because it may contain many of the query's terms • but then again, it may also contain lots of non-pertinent terms… • we want to treat vector (1, 2, 1) and vector (2, 4, 2) as equivalent (same distribution of words) • we can normalize raw term frequencies to convert all vectors to a standard length (ex. 1)
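A sketch of this length normalization, dividing each component by the Euclidean length of the vector (illustrative helper name):

```python
import math

def normalize(vector):
    # Divide every component by the Euclidean length of the vector.
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else vector

# (1, 2, 1) and (2, 4, 2) have the same distribution of words,
# so they map to the same unit-length vector.
print(normalize([1, 2, 1]))
print(normalize([2, 4, 2]))
```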

  15. Example • Query = {speech language} • original representation: [figure: 2-D plot with axes speech and language and points d1 (1, 2), d2 (6, 0), d3 (0, 5), q (1, 1)] • Normalization: the length of a vector does not matter, its angle does.

  16. The cosine measure • similarity between two documents (or between a document & a query) is actually the cosine of the angle (in N dimensions) between the 2 vectors • if 2 document-vectors are identical, they will have a cosine of 1 • if 2 document-vectors are orthogonal (i.e. share no common term), they will have a cosine of 0 • [figure: document (D) and query (Q) vectors at various angles]

  17. The cosine measure (con’t) • The cosine of 2 vectors (in N dimensions), also known as the normalized inner product: sim(D,Q) = cos θ = (D · Q) / (|D| × |Q|) = Σ_{i=1..N} d_i×q_i / (√(Σ_{i=1..N} d_i²) × √(Σ_{i=1..N} q_i²)) • the numerator is the inner product, the denominator is the product of the lengths of the vectors
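A direct sketch of the normalized inner product (illustrative helper name):

```python
import math

def cosine(v1, v2):
    # inner product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(v1, v2))
    len1 = math.sqrt(sum(x * x for x in v1))
    len2 = math.sqrt(sum(y * y for y in v2))
    return dot / (len1 * len2) if len1 and len2 else 0.0

print(cosine([1, 2, 1], [1, 2, 1]))  # ~1.0 (identical vectors)
print(cosine([1, 0, 0], [0, 1, 0]))  # 0.0  (orthogonal: no shared term)
```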

  18. If you want proof… in 2-D space (can be skipped) • to get a vector of length 1 (a normalized vector), divide all of its components by the length of the vector • in 2-dimensional space: for D = (x, y), the length is |D| = √(x² + y²) and the normalized vector is D’ = (x/|D|, y/|D|)

  19. Normalized vectors (can be skipped) • Query = {speech language} • Q(1,1) → normalized Q’(0.71, 0.71) • d1(1,2) → normalized d1’(0.45, 0.89) • d2(6,0) → normalized d2’(1, 0) • d3(0,5) → normalized d3’(0, 1) • [figure: 2-D plot (axes speech and language, ticks at 1) showing the normalized vectors]

  20. Similarity between 2 vectors (2-D) (can be skipped) • In 2-D (i.e. N = 2; nb of terms = 2) • with the original vectors Q = (X_q, Y_q) and D = (X_d, Y_d): sim(D,Q) = (X_d×X_q + Y_d×Y_q) / (√(X_d² + Y_d²) × √(X_q² + Y_q²)) • with the normalized vectors: sim(D,Q) = X_d×X_q + Y_d×Y_q (the lengths are 1)

  21. Similarity in the general case (N-D) (can be skipped) • in the general case of N dimensions (N terms): sim(D,Q) = Σ_{i=1..N} d_i×q_i / (√(Σ_{i=1..N} d_i²) × √(Σ_{i=1..N} q_i²)) • which is the cosine of the angle between vector D and vector Q in N dimensions • but for normalized vectors: sim(D,Q) = Σ_{i=1..N} d_i×q_i

  22. The example again • Q = {speech language processing} → query (1,1,1) • d1 (1,2,1) • d2 (6,0,1) • d3 (0,5,1)
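A sketch that ranks the example documents by cosine similarity to the query (the cosine helper is repeated so the snippet is self-contained; scores are rounded):

```python
import math

def cosine(v1, v2):
    dot = sum(x * y for x, y in zip(v1, v2))
    return dot / (math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2)))

q = [1, 1, 1]                                   # (speech, language, processing)
docs = {"d1": [1, 2, 1], "d2": [6, 0, 1], "d3": [0, 5, 1]}

for doc_id, vec in sorted(docs.items(), key=lambda kv: -cosine(q, kv[1])):
    print(doc_id, round(cosine(q, vec), 3))
# d1 0.943, d3 0.679, d2 0.664 -> d1 is ranked first
```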

  23. Term weights • so far, we have used raw term frequency as the weights • core of most weighting functions: • tf_ij (term frequency): frequency of term i in document j • if a term appears often in a document, then it describes the document contents well • intra-document characterization • df_i (document frequency): number of documents in the collection containing term i • if a term appears in many documents, then it is not useful for distinguishing a document • inter-document characterization • used to compute idf

  24. tf.idf weighting functions • most widely used family of weighting functions • let: • M = number of documents in the collection • Inverse Document Frequency for term i (measures the weight of term i for the query): idf_i = log10(M / df_i) • intuitively, if M = 1000: • if df_i = 1000 → log(1) = 0 → term i is ignored! (it appears in all docs) • if df_i = 10 → log(100) = 2 → term i has a weight of 2 in the query • if df_i = 1 → log(1000) = 3 → term i has a weight of 3 in the query • weight of term i in document d: w_id = tf_id × idf_i • other members of the tf.idf family normalize tf_id by the frequency of the most frequent term j in document d, ex. w_id = (tf_id / max_j tf_jd) × log10(M / df_i)
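A sketch of the plain tf.idf variant (w_id = tf_id × log10(M / df_i)); the helper name and toy collection are illustrative:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: dict doc_id -> list of terms.  Returns doc_id -> {term: tf.idf weight}."""
    M = len(docs)
    tfs = {d: Counter(terms) for d, terms in docs.items()}
    df = Counter()                                   # document frequency of each term
    for counts in tfs.values():
        df.update(counts.keys())
    idf = {t: math.log10(M / df[t]) for t in df}
    return {d: {t: tf * idf[t] for t, tf in counts.items()} for d, counts in tfs.items()}

weights = tf_idf_vectors({"d1": ["speech", "speech", "common"],
                          "d2": ["language", "common"]})
print(weights["d1"])   # "common" occurs in every document, so its idf (and weight) is 0
```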

  25. Evaluation: Precision & Recall • Recall and precision measure how good a set of retrieved documents is compared with an ideal set of relevant documents • let A = relevant documents that were retrieved, B = non-relevant documents that were retrieved, C = relevant documents that were not retrieved • Recall: What proportion of the relevant documents is actually retrieved? Recall = A / (A + C) = (pertinent docs that were retrieved) / (all pertinent docs that should have been retrieved) • Precision: What proportion of the retrieved documents is really relevant? Precision = A / (A + B) = (pertinent docs that were retrieved) / (all docs that were retrieved)

  26. Evaluation: Example of P&R • Relevant: d3 d5 d9 d25 d39 d44 d56 d71 d123 d389 • system1: d123 d84 d56 • Precision : ?? • Recall : ?? • system2: d123 d84 d56 d6 d8 d9 • Precision : ?? • Recall : ??

  27. Evaluation: Example of P&R • Relevant: d3 d5 d9 d25 d39 d44 d56 d71 d123 d389 • system1: d123 d84 d56 • Precision: 66% (2/3) • Recall: 20% (2/10) • system2: d123 d84 d56 d6 d8 d9 • Precision: 50% (3/6) • Recall: 30% (3/10)
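A sketch that recomputes these precision and recall figures from the sets of retrieved and relevant documents (illustrative helper name):

```python
def precision_recall(retrieved, relevant):
    hits = len(set(retrieved) & set(relevant))   # pertinent docs that were retrieved
    return hits / len(retrieved), hits / len(relevant)

relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d123", "d389"}
print(precision_recall(["d123", "d84", "d56"], relevant))                    # P = 2/3, R = 2/10
print(precision_recall(["d123", "d84", "d56", "d6", "d8", "d9"], relevant))  # P = 3/6, R = 3/10
```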

  28. Evaluation: Problems with P&R • P&R do not evaluate the ranking • ex. the rankings “d123 d84” and “d84 d123” have the same P&R • so other measures are often used: • document cutoff levels • P&R curves • ...

  29. Evaluation: Document cutoff levels • fix the number of documents retrieved at several levels • ex. top 5, top 10, top 20, top 100, top 500… • measure precision at each of these levels (precision in the top 5, in the top 10, …)
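A sketch of precision at fixed cutoff levels (often written P@k), assuming a ranked list of retrieved documents; the lists below reuse the earlier P&R example:

```python
def precision_at_k(ranked, relevant, k):
    # Precision computed over only the top-k retrieved documents.
    return sum(1 for d in ranked[:k] if d in relevant) / k

ranked = ["d123", "d84", "d56", "d6", "d8", "d9"]
relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d123", "d389"}
for k in (1, 3, 5):
    print(f"P@{k} = {precision_at_k(ranked, relevant, k):.2f}")   # 1.00, 0.67, 0.40
```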

  30. Evaluation: P&R curve • measure precision at different levels of recall • usually, precision at 11 recall levels (0%, 10%, 20%, …, 100%) • [figure: precision (y-axis, 0–100%) plotted against recall (x-axis, 0–100%)]
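A sketch of computing such an 11-point curve, assuming the usual interpolation (precision at recall level r = the maximum precision observed at any recall ≥ r); names and data are illustrative:

```python
def eleven_point_curve(ranked, relevant):
    # precision/recall measured after each retrieved document
    points, hits = [], 0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))          # (recall, precision)
    # interpolated precision at recall levels 0.0, 0.1, ..., 1.0
    return [max([p for r, p in points if r >= level], default=0.0)
            for level in (i / 10 for i in range(11))]

print(eleven_point_curve(["d123", "d84", "d56"], {"d123", "d56", "d9"}))
```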

  31. Which system performs better? • [figure: the P&R curves of the systems being compared, precision (y-axis) vs. recall (x-axis)]

  32. Evaluation: A Single Value Measure • cannot take the arithmetic mean of P&R: • if R = 50%, P = 50% → mean = 50% • if R = 100%, P = 10% → mean = 55% (not fair) • take the harmonic mean instead: HM = 2PR / (P + R) • HM is high only when both P&R are high: • if R = 50% and P = 50%, HM = 50% • if R = 100% and P = 10%, HM = 18.2% • more generally, take a weighted harmonic mean with weight w_P on precision and w_R on recall • let β² = w_P / w_R, which gives F = (β² + 1)PR / (β²R + P) … which is called the F-measure

  33. Evaluation: the F-measure • A weighted combination of precision and recall • β represents the relative importance of precision and recall • when β = 1, precision & recall have the same importance • when β > 1, precision is favored • when β < 1, recall is favored
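A sketch of the F-measure using the slides' convention, where β² is the relative weight of precision over recall (note that many references define β the other way round, as the relative weight of recall):

```python
def f_measure(precision, recall, beta=1.0):
    # Slides' convention: beta^2 = weight of precision / weight of recall,
    # so F = (beta^2 + 1) * P * R / (beta^2 * R + P); beta = 1 gives the harmonic mean.
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * recall + precision)

print(round(f_measure(0.5, 0.5), 3))   # 0.5   -- balanced P and R
print(round(f_measure(0.1, 1.0), 3))   # 0.182 -- stays low when precision is low
```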

  34. Evaluation: How to evaluate • Need a test collection: • a document collection (a few thousand to a few million documents) • a set of queries • a set of relevance judgements • must humans check all documents? • no: use pooling (as in TREC) • take the top 100 from every submission/system • remove duplicates • manually assess only these

  35. Evaluation: TREC • Text REtrieval Conference/Competition • run by NIST (National Institute of Standards and Technology) • 13th edition in 2004 • Collection: about 3 gigabytes, > 1 million documents • newswire & text news (AP, WSJ, …) • Queries + relevance judgments • queries devised and judged by annotators • Participants • various research and commercial groups compete • Tracks • Cross-Lingual Track, Filtering Track, Genome Track, Video Track, Web Track, QA Track, ...
