Day 13 Information Retrieval Project 1 – Overview/Recap HW#4
HW#4 • Build 4 Portuguese LMs • Over different sets of the afp data (4 LMs total) • Smooth them • Use Add-1 (Laplace) • Use Good-Turing • Measure perplexity against an independent test set
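A minimal sketch (not the assignment code) of the smooth-then-measure-perplexity pipeline, here with a unigram Add-1 model; the file names, whitespace tokenization, and unigram order are illustrative assumptions:

```python
# Hedged sketch: Add-1 (Laplace) smoothed unigram LM, evaluated by perplexity.
# File names and tokenization are assumptions for illustration only.
import math
from collections import Counter

def train_add1(tokens, vocab):
    counts = Counter(tokens)
    total, V = len(tokens), len(vocab)
    # Add-1: every vocabulary item gets one pseudo-count
    return lambda w: (counts[w] + 1) / (total + V)

def perplexity(prob, test_tokens):
    # PP = exp(-(1/N) * sum(log P(w)))
    log_sum = sum(math.log(prob(w)) for w in test_tokens)
    return math.exp(-log_sum / len(test_tokens))

train = open("afp_96.tok").read().split()    # hypothetical training file
test  = open("afp_test.tok").read().split()  # hypothetical held-out test file
vocab = set(train) | {"<UNK>"}
test  = [w if w in vocab else "<UNK>" for w in test]
print(perplexity(train_add1(train, vocab), test))
```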
HW#4 • Output, perplexity scores:
LM 1 – 96, Add-1: n.nn
LM 2 – all, Add-1: n.nn
LM 3 – 96, Good-Turing: n.nn
LM 4 – all, Good-Turing: n.nn
• For GT, frequency-of-frequency tables, e.g., LM 3 – Good-Turing:
c  Nc  c*
0  nnnnnn.nn
1  nnnnnn.nn
3  nnnnnn.nn
4  nnnnnn.nn
5  nnnnnn.nn
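Roughly where the frequency-of-frequency table comes from: the sketch below builds Nc (how many word types occur exactly c times) and the Good-Turing adjusted counts c* = (c+1) × Nc+1 / Nc for small c; the toy sentence and the cutoff of 5 are assumptions, not HW requirements:

```python
# Hedged sketch of a Good-Turing frequency-of-frequency table (not HW code).
# Nc = number of word types seen exactly c times; c* = (c+1) * N_{c+1} / N_c.
# The c = 0 row (mass reserved for unseen events) is handled separately, e.g. N1/N.
from collections import Counter

def gt_table(tokens, max_c=5):
    word_counts = Counter(tokens)
    nc = Counter(word_counts.values())             # frequency of frequencies
    rows = []
    for c in range(1, max_c + 1):
        if nc[c] > 0:                              # c* is undefined when Nc == 0
            rows.append((c, nc[c], (c + 1) * nc[c + 1] / nc[c]))
    return rows

tokens = "the cat sat on the mat and the cat slept".split()
for c, n_c, c_star in gt_table(tokens):
    print(f"{c}\t{n_c}\t{c_star:.2f}")
```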
Information Retrieval • Definition (from Wikipedia): Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, …or hypertext networked databases such as the Internet or intranets, for text, sound, images or data.
Information Retrieval • Getting information from document repositories • Normally text (but other data types are possible: sound, images, etc.) • Traditionally a separate field from NLP, and always very empirically based • There is much scope for greater profitable interaction between IR and Statistical NLP
Information Retrieval • In IR, we’re generally looking for documents that contain the information we want • Googling or Binging is performing a kind of IR: • Salma Hayek • Charlie Chaplin • Computational Linguistics University of Washington • Google and Bing return documents that might have the information we’re interested in • Results usually returned as ranked list (most relevant at top)
IR Tasks • Ad hoc retrieval • User queries • System returns relevant documents • Document categorization (e.g., by topic) • Filtering: binary categorization as to relevance (e.g., spam filtering) • Routing: categorization before documents are made available • Document clustering
IR: Ambiguity • One problem with IR is ambiguity, especially at the lexical level • “Bars Seattle area” • “ERG language” • “bass columbia river area” • We may get a lot of false hits • Of course, ambiguity affects a lot of CompLing tasks!
IR: Sparsity • IR systems index everything • But, usually not… • Common words (what to do?) • Morphologically inflected forms (what to do?) • IR systems also often assign term weights
IR Stop Words • Common English stop words: e.g., the, a, an, of, to, in, and, that, is, for, on, with
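A hedged sketch of index-time normalization (stop-word removal plus crude suffix stripping standing in for a real stemmer such as Porter); the stop list and suffix rules below are toy assumptions:

```python
# Illustrative sketch: drop stop words and apply crude suffix stripping
# before indexing. The stop list and suffix rules are toy assumptions.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on"}

def crude_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(index_terms("The bands playing in Seattle bars on Fridays"))
# -> ['band', 'play', 'seattle', 'bar', 'friday']
```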
Information Retrieval • IR fundamentally addresses this problem: given a query W1 and a document W2, attempt to decide the relevance of W2 to W1, where relevance is meant to be computed with respect to their hidden meanings M1 and M2 • How do we implement an IR system?
IR: Vector Space Models • Often used in IR and search tasks • Essentially: represent each source item (text document, Web page, e-mail, etc.) as a vector • Vectors composed of counts/frequencies of particular words (esp. certain content words) or other objects of interest • 'Search' vector compared against 'target' vectors • Return the most closely related vectors
Vector Space Models • Terms and documents represented as vectors in k dimensions, (usually) based on the bag of words they contain, where each vector component is a term weight: d = The man said a space age man appeared d’ = Those men appeared to say their age
VSMs and Queries • Queries against VSM usually cast as vectors in k-dimensions as well, e.g., • Documents as vectors of (non-stopped) terms: • Doc 1: [bass, guitar, player, folk, festival, Columbia, river] • Doc 2: [bass, lake, stream, fishing, Columbia, river] • Queries as vectors: • Query 1: [bass columbia river area] • Query 2: [bass fishing columbia river area] • Query 3: [bass player columbia river area] • Query 4: [guitar player folk music] • Query 5: [bass fish cooking class eel river]
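A small sketch of how such term lists become comparable vectors: documents and queries are mapped onto the same vocabulary dimensions (the example terms are taken from the slide; the counting scheme is an assumption):

```python
# Sketch: turning bags of terms into count vectors over a shared vocabulary,
# so documents and queries live in the same k-dimensional space.
from collections import Counter

doc1  = "bass guitar player folk festival columbia river".split()
query = "bass player columbia river area".split()

vocab = sorted(set(doc1) | set(query))           # shared dimensions

def to_vector(terms):
    counts = Counter(terms)
    return [counts[t] for t in vocab]            # one component per vocabulary term

print(vocab)
print(to_vector(doc1))
print(to_vector(query))
```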
VSMs and Similarity • How do we gauge similarity between vectors? • Cosine metric: measure the angle θ between the two vectors • similarity(q, d) = cos θ [figure: query vector q and document vector d separated by angle θ]
VSMs and Similarity • Vector dot product – how much the vectors have in common • Cosine distance is equivalent (when vector dimensions are normalized): sim(qk, dj) = qk · dj = Σ(i=1..N) wi,k × wi,j = cos θ • cos θ = 1 means identical vectors (e.g., q = d) • cos θ = 0 means orthogonal (no terms shared)
VSMs and Similarity • Vector length • If vectors are not of the same length, normalize by dividing each component by the vector's length (its Euclidean norm): sim(q, d) = (q · d) / (|q| × |d|) = Σ qi di / (√(Σ qi²) × √(Σ di²))
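An illustrative implementation of that normalized cosine similarity; the example weight vectors are made up:

```python
# Sketch of the normalized cosine similarity: sim(q, d) = (q . d) / (|q| |d|).
import math

def cosine_sim(q, d):
    dot    = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

print(cosine_sim([1, 0, 2, 1], [2, 1, 1, 0]))    # toy vectors of term weights
```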
VSMs: Term Weighting • In a given document, what are the most important words? • Those that occur most frequently? • Weight all terms in vectors just by frequency of occurrence in a particular doc? • In a document collection, how do we “filter out” words common across the collection versus words important in a particular document (e.g., indicative of the “meaning” of this document)?
VSMs: Term Weighting • Simplest term (vector component) weightings are: • count of the number of times a word occurs in the document • binary: the word does or doesn't occur in the document • However, general experience is that a document is a better match if a word occurs three times rather than once • but not a three-times-better match! • Weighting functions to “dampen” term weight: 1 + log(x) for x > 0 (else 0), or √x • Good, but not perfect: • Captures the importance of a term in a document • But not whether it is more important to this document than to other documents
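A quick numeric look at those damping functions (x = raw term count; values rounded):

```python
# Sketch of the damped weighting functions: a term seen 10x is weighted
# more than one seen once, but nowhere near 10x more.
import math

def damp_log(x):
    return 1 + math.log(x) if x > 0 else 0.0

def damp_sqrt(x):
    return math.sqrt(x)

for count in (1, 3, 10):
    print(count, round(damp_log(count), 2), round(damp_sqrt(count), 2))
```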
tf.idf • tf = term frequency • Term frequency within a document • Intuition: a term that is more frequent in a document is more likely to reflect the document's meaning
tf.idf • idf = inverse document frequency • General importance of the term (across the collection) • Squashed using a log: idf = log(N/ni) • N = number of docs in the collection • ni = number of docs the term is found in
tf.idf • Weighting a term: • tf.idf weight for a given term y: tf(y) x idf(y) • Example: • Doc d with 100 words has “fish” 3 times: tf(d, fish) = 3/100 = 0.03 • Assume 10M docs, with fish occurring in 1K docs: idf(fish) = log(10,000,000 / 1,000) = 4 • tf.idf(d, fish) = 0.03 x 4 = 0.12
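The same arithmetic in code, assuming the base-10 log used on the slide:

```python
# Tiny check of the tf.idf example above.
import math

tf  = 3 / 100                            # "fish" occurs 3 times in a 100-word doc
idf = math.log10(10_000_000 / 1_000)     # 10M docs, 1K contain "fish" -> 4.0
print(tf * idf)                          # 0.12
```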
tf.idf • What does it do? • High tf.idf weight for a term that is common in a document but less common in the collection • Low tf.idf weight for a term that is rare in a document and more common in the collection • Common terms tend to get lower weight • Can be done with limited supervision.
VSM: doing a query • Return documents ranked on closeness of their vectors to the query • Doc 1: [bass, guitar, player, folk, festival, Columbia, river] • Doc 2: [bass, lake, stream, fishing, Columbia, river] • Doc 3: [fish, cooking, class, Seattle, area] • Query 1: [bass fishing columbia river area] • Query 2: [bass fish cooking class eel river] [figure: documents d1–d3 and queries q1, q2 plotted as vectors in term space]
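A compact end-to-end sketch: rank the example docs against Query 1 by cosine similarity over binary term vectors (terms lowercased; purely illustrative, not a full IR system):

```python
# Sketch: cosine over binary term vectors reduces to
# |shared terms| / (sqrt(|doc terms|) * sqrt(|query terms|)).
import math

docs = {
    "Doc 1": {"bass", "guitar", "player", "folk", "festival", "columbia", "river"},
    "Doc 2": {"bass", "lake", "stream", "fishing", "columbia", "river"},
    "Doc 3": {"fish", "cooking", "class", "seattle", "area"},
}
query = {"bass", "fishing", "columbia", "river", "area"}

def cosine(a, b):
    return len(a & b) / (math.sqrt(len(a)) * math.sqrt(len(b)))

for name, terms in sorted(docs.items(), key=lambda kv: -cosine(query, kv[1])):
    print(name, round(cosine(query, terms), 3))
```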
VSMs: doing a query • But, • Query vectors need to have the same dimensions as the document vectors • Unlike documents, though, a query's dimensions are sparsely populated • This can lead to some problems • Also, • What to do with OOVs/UNKs?
Evaluating IR Systems • Precision, Recall: T = docs returned by the request; R = relevant docs in T (relevant ∩ retrieved); N = irrelevant docs in T; U = relevant docs in the collection (all relevant) • Then: Precision = |R| / |T|, Recall = |R| / |U| • Problems? P&R are well defined for sets, not so great for ranked lists
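The same definitions in code, with made-up document ids:

```python
# Sketch: precision and recall for a returned set T against the full
# relevant set U (toy document ids, purely illustrative).
retrieved = {"d1", "d2", "d3", "d7"}        # T: docs returned by the request
relevant  = {"d1", "d3", "d5", "d9"}        # U: all relevant docs in the collection

hits = retrieved & relevant                 # R: relevant ∩ retrieved
precision = len(hits) / len(retrieved)      # |R| / |T|
recall    = len(hits) / len(relevant)       # |R| / |U|
print(precision, recall)                    # 0.5 0.5
```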
Interpolated Precision • Eleven-point interpolated precision • Interpolated precision measured at 11 recall levels (0.0, 0.1, 0.2, …, 1.0) • At recall level r, interpolated precision = the maximum precision observed at any recall ≥ r
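A sketch of the 11-point computation; the (precision, recall) points are made up for illustration:

```python
# Sketch of 11-point interpolated precision: at each recall level r in
# {0.0, 0.1, ..., 1.0}, take the maximum precision at any recall >= r.
pr_points = [(1.0, 0.1), (0.67, 0.2), (0.5, 0.3), (0.44, 0.4), (0.5, 0.5)]

def interpolated(points, levels=11):
    out = []
    for i in range(levels):
        r = i / 10
        candidates = [p for p, rec in points if rec >= r]
        out.append(max(candidates) if candidates else 0.0)
    return out

print(interpolated(pr_points))
```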
Mean Average Precision (MAP) • For the set Rr of relevant docs at or above rank r, average precision is the mean of the precision values measured at the rank of each doc in Rr: AP = (1/|Rr|) × Σ over d in Rr of Precision(rank of d) • MAP typically calculated over an ensemble of queries (i.e., we average the average precisions) • Tends to favor systems that return relevant docs at high ranks (arguably what we want!) • But…also, since recall is ignored, it can favor systems that return small sets (dependent on the value of r) • For competitions, e.g., TREC, MAP is calculated over ensembles of 50-100 queries
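A sketch of average precision and MAP over a toy ensemble of queries (rankings and relevance sets are made up):

```python
# Sketch: AP averages precision measured at the rank of each relevant doc;
# MAP averages AP over the query ensemble.
def average_precision(ranked, relevant):
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)          # precision at this rank
    return sum(precisions) / len(relevant) if relevant else 0.0

queries = [
    (["d3", "d1", "d9", "d2"], {"d1", "d2"}),    # (ranked list, relevant set)
    (["d5", "d4", "d8"],       {"d5"}),
]
ap_scores = [average_precision(r, rel) for r, rel in queries]
print(sum(ap_scores) / len(ap_scores))           # MAP
```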
Beyond IR • We can envision retrieving the • Passage in the document that contains the answer • Or extracting the answer itself • Question Answering (QA) • We can also envision IR across languages • Cross-Language IR (CLIR)