Search A Basic Overview. Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata October 20, 2014. Back in those days. We had access to much smaller amount of information Had to find information manually.
Search: A Basic Overview
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
October 20, 2014
Back in those days
• We had access to a much smaller amount of information
• Had to find information manually
Once upon a time in the world, there were days without search engines
Search engine
• User needs some information
• Assumption: the required information is present somewhere
• A search engine tries to bridge this gap
How:
• User “expresses” the information need: query
• Engine returns: list of documents, or by some better means
Search engine: simplest model
• User submits query: a set of words (terms)
• Search engine returns documents “matching” the query
• Assumption: matching the query would satisfy the information need
• Modern search has come a long way from the simple model, but the fundamentals are still required
Basic approach
• Documents contain terms
• Documents are represented by terms present in them
• Match queries and documents by terms
• For simplicity: ignore positions, consider documents as “bag-of-words”
• There may be many matching documents, so we need to rank them
Example documents:
• This is in Indian Statistical Institute, Kolkata, India
• Diwali is a huge festival in India
• Statistically flying is the safest mode of journey
• This is autumn
• Thank god it is a holiday
• India’s population is huge
• There is no end of learning
Query: india statistics
Vector space model
• Each term represents a dimension
• Documents are vectors in the term-space
• Term-document matrix: a very sparse matrix
• Query is also a vector in the term-space
• Similarity of each document d with the query q is measured by the cosine similarity (dot product normalized by norms of the vectors)
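The cosine similarity described on this slide can be sketched in a few lines. This is a minimal illustration, assuming raw term counts as vector weights and whitespace tokenization; the helper names `to_vector` and `cosine` are made up for this example and are not from the slides.

```python
import math
from collections import Counter

def to_vector(text):
    """Bag-of-words vector: term -> raw count (lower-cased, split on whitespace)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Dot product of two sparse vectors divided by the product of their norms."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_u * norm_v)

docs = [
    "diwali is a huge festival in india",
    "this is autumn",
]
query = to_vector("india festival")
scores = [cosine(to_vector(d), query) for d in docs]
```

Only terms shared by both vectors contribute to the dot product, which is why the sparse dictionary representation is enough here.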
Scoring function: TF.iDF
• How important is a term t in a document d?
• Approach: take two factors into account
  • With what significance does t occur in d? [term frequency]
  • Does t occur in many other documents also? [document frequency]
• Called TF.iDF: TF × iDF; has many variants for TF and iDF
• Variants for TF(t, d)
  • Number of times t occurs in d: freq(t, d)
  • Logarithmically scaled frequency: 1 + log(freq(t, d))
  • Augmented frequency (avoids bias towards longer documents): $\mathrm{TF}(t, d) = 0.5 + 0.5 \cdot \frac{\mathrm{freq}(t, d)}{\max_{t'} \mathrm{freq}(t', d)}$. Half the score is for just being present, the rest is a function of frequency
  • The scaled variants apply for all t in d; 0 otherwise
• Inverse document frequency of t: $\mathrm{iDF}(t) = \log \frac{N}{\mathrm{DF}(t)}$
  • where N = total number of documents
  • DF(t) = number of documents in which t occurs
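The TF variants and iDF listed above can be tried out on a toy collection. The sketch below assumes the common forms: augmented TF as 0.5 + 0.5 · freq / max freq, and iDF as log(N / DF(t)). The function names and the three tiny documents are illustrative only.

```python
import math
from collections import Counter

def tf_log(freq):
    """Logarithmically scaled frequency; 0 if the term is absent."""
    return 1 + math.log(freq) if freq > 0 else 0.0

def tf_augmented(freq, max_freq):
    """Half the score for being present, the rest proportional to frequency."""
    return 0.5 + 0.5 * freq / max_freq if freq > 0 else 0.0

def idf(term, docs):
    """log(N / DF(t)); returns 0 for a term that occurs in no document."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    """One TF.iDF variant: log-scaled TF times iDF."""
    return tf_log(doc[term]) * idf(term, docs)

docs = [Counter(text.lower().split()) for text in [
    "india india statistics",
    "india festival",
    "autumn holiday",
]]
weight = tf_idf("india", docs[0], docs)
```

A term that occurs in every document gets iDF = log(1) = 0, so it contributes nothing to the score regardless of its frequency.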
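BM25
• Okapi IR system: Okapi BM25
• If the query q = {q1, … , qn}, where the qi’s are the words in the query, then
$$\mathrm{score}(d, q) = \sum_{i=1}^{n} \mathrm{iDF}(q_i) \cdot \frac{\mathrm{freq}(q_i, d)\,(k_1 + 1)}{\mathrm{freq}(q_i, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$
  • where N = total number of documents (used in iDF)
  • |d| = length of document d
  • avgdl = average length of documents
  • k1 and b are optimized parameters, usually b = 0.75 and 1.2 ≤ k1 ≤ 2.0
• BM25 consistently exhibited better performance than TF.iDF in TREC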
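A small sketch of BM25 scoring may help make the roles of k1, b and avgdl concrete. It assumes the widely used Okapi form, with iDF(t) = log((N − DF(t) + 0.5) / (DF(t) + 0.5)); that iDF variant is an assumption here and not taken from the slides, and the names `bm25_idf` and `bm25_score` are illustrative.

```python
import math

def bm25_idf(term, docs):
    """Assumed iDF variant: log((N - DF + 0.5) / (DF + 0.5))."""
    n_docs = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Sum over query terms of iDF times a length-normalized, saturating TF."""
    avgdl = sum(len(d) for d in docs) / len(docs)
    score = 0.0
    for term in query:
        freq = doc.count(term)
        if freq == 0:
            continue  # absent terms contribute nothing
        length_norm = k1 * (1 - b + b * len(doc) / avgdl)
        score += bm25_idf(term, docs) * freq * (k1 + 1) / (freq + length_norm)
    return score

# Documents as token lists (toy data for this example).
docs = [
    ["india", "statistics"],
    ["india", "festival"],
    ["autumn"],
    ["holiday"],
    ["learning"],
]
query = ["india", "statistics"]
scores = [bm25_score(query, d, docs) for d in docs]
```

With this iDF variant, a term that occurs in more than half of the documents gets a negative weight; practical systems often clamp or shift it.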
Relevance
• Simple IR model: query, documents, returned results
• Relevant document: a document that satisfies the information need expressed by the query
  • Merely matching query terms does not make a document relevant
  • Relevance is human perception, not a mathematical statement
  • User may want some statistics on the population of India by the query “india statistics”
  • The document “Indian Statistical Institute” matches the query terms, but is not relevant
• To evaluate the effectiveness of a system, we need for each query
  • First: given a result, an assessment of whether it is relevant
  • Second: the set of all relevant results assessed (pre-validated)
  • If the second is available, it serves the purpose of the first as well
• Measures: precision, recall, F-measure (harmonic mean of precision and recall)
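Given the pre-validated set of relevant results for a query, the three measures can be computed directly. The sketch below uses the standard definitions (precision = relevant returned / returned, recall = relevant returned / relevant); the function name and the doc ids are made up for illustration.

```python
def precision_recall_f(returned, relevant):
    """Precision, recall and F-measure for one query."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure

# The system returned docs 1, 2, 3, 5; the assessed relevant set is {2, 5, 6}.
p, r, f = precision_recall_f([1, 2, 3, 5], [2, 5, 6])
```

Here two of the four returned documents are relevant, and two of the three relevant documents were found.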
Inverted index
• Standard representation: document → terms
• Inverted index: term → documents
• For each term t, store the list of the documents in which t occurs
• Scores? Each entry can also carry a score
[Figure: the example documents, numbered 1 to 7, with their inverted index]
Note: The scores shown are dummy, not by any formula
Positional index
• Storing just documents and scores follows the bag-of-words model
• Cannot perform proximity search or phrase query search
• Positional inverted index: also store the position of each occurrence of term t in each document d where t occurs
[Figure: the example documents, numbered 1 to 7, with their positional index]
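To see why positions make phrase queries possible, here is a minimal sketch of a positional index and a phrase lookup. It assumes whitespace tokenization and exact adjacency; `build_positional_index` and `phrase_query` are hypothetical helper names, and the two documents are toy data.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id -> [positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Doc ids in which the terms of the phrase occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    matches = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # The i-th term must appear exactly i positions after the first one.
            if all(start + i in index[t].get(doc_id, ())
                   for i, t in enumerate(terms)):
                matches.append(doc_id)
                break
    return sorted(matches)

docs = {
    1: "indian statistical institute kolkata",
    2: "statistical methods from an indian institute",
}
index = build_positional_index(docs)
result = phrase_query(index, "indian statistical")
```

Both documents contain both words, so a bag-of-words index cannot tell them apart for this phrase; the positions can.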
Pre-processing
• Removal of stopwords: of, the, and, …
  • Modern search does not completely remove stopwords
  • Such words add meaning to sentences as well as queries
• Stemming: words → stem (root) of words
  • Statistics, statistically, statistical → statistic (same root)
  • Loss of slight information (the form of the word also matters)
  • But unifies differently expressed queries on the same topic
• Lemmatization: doing this properly with morphological analysis of words
• Normalization: unify equivalent words as much as possible
  • U.S.A, USA
  • Windows, windows
• Stemming, lemmatization, normalization and synonym finding are all important subfields on their own!
Creating an inverted index
• For each document, write out pairs (term, docid)
• Sort by term
• Group, compute DF
[Figure: the example documents, numbered 1 to 7]
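The three steps above (write out pairs, sort, group) map almost one to one onto code. This is a minimal in-memory sketch assuming whitespace tokenization; real indexers do the sort on disk, and the name `build_inverted_index` is illustrative.

```python
from itertools import groupby

def build_inverted_index(docs):
    """term -> (DF, sorted list of doc ids), built by sort-and-group."""
    # Step 1: write out (term, docid) pairs; the set drops repeated pairs.
    pairs = {(term, doc_id)
             for doc_id, text in docs.items()
             for term in text.lower().split()}
    # Step 2: sort by term (and by doc id within a term).
    ordered = sorted(pairs)
    # Step 3: group by term; DF is the length of each group.
    index = {}
    for term, group in groupby(ordered, key=lambda pair: pair[0]):
        postings = [doc_id for _, doc_id in group]
        index[term] = (len(postings), postings)
    return index

docs = {
    1: "this is autumn",
    2: "india is huge",
    3: "huge festival in india",
}
index = build_inverted_index(docs)
```

Because the pairs are sorted before grouping, every postings list comes out sorted by doc id, which is what the merge step in query processing relies on.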
Traditional architecture
[Figure: architecture diagram. Components: User; Query handler (query parsing); Results handler (displaying results); Core query processing (accessing index, ranking); Index; Indexing; Analysis (stemming, normalization, …); Basic format conversion, parsing; Different types of documents. Edge labels: Query, Results.]
Query processing: Merge
• List 1, List 2, List 3: lists sorted by doc id
• One pointer in each list
• Pick the smallest doc id
Merge
• List 1, List 2, List 3: lists sorted by doc id, one pointer in each list
• Merged list: still sorted by doc id
• (Partial) sort of the merged list gives the top-2
• Complexity? k log n
Merge
• Simple and efficient, minimal overhead
• Lists sorted by doc id are merged into one merged list
• But, have to scan the lists fully!
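The merge can be sketched with a heap holding one pointer per list. This example assumes each list entry is a (doc id, score) pair and that the scores of the same document are summed across lists; the summation and the function names are assumptions made for illustration.

```python
import heapq

def merge_postings(lists):
    """Merge lists sorted by doc id into one list sorted by doc id, summing scores."""
    merged = []
    # heapq.merge keeps one pointer per list and always yields the smallest entry.
    for doc_id, score in heapq.merge(*lists):
        if merged and merged[-1][0] == doc_id:
            merged[-1] = (doc_id, merged[-1][1] + score)
        else:
            merged.append((doc_id, score))
    return merged

def top_k(merged, k):
    """Partial sort: the k entries with the highest total score."""
    return heapq.nlargest(k, merged, key=lambda entry: entry[1])

lists = [
    [(1, 0.5), (3, 0.25)],
    [(1, 0.25), (2, 0.5)],
    [(2, 0.5), (3, 0.25), (4, 0.125)],
]
merged = merge_postings(lists)
best = top_k(merged, 2)
```

Every entry of every list is read once, which is exactly the full scan that the top-k algorithms try to avoid.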
Top-k algorithms
• If there are millions of documents in the lists
  • Can the ranking be done without accessing the lists fully?
• Exact top-k algorithms (used more in databases)
  • Family of threshold algorithms (Ronald Fagin et al.)
    • Threshold algorithm (TA)
    • No random access algorithm (NRA) [we will discuss, as an example]
    • Combined algorithm (CA)
  • Other follow-up works
• Inexact top-k algorithms
  • Exact top-k not required; the scores are only a “crude” approximation of “relevance” (human perception)
  • Several heuristics
• Further reading: IR book by Manning, Raghavan and Schuetze, Ch. 7
NRA (No Random Access) Algorithm
• Fagin’s NRA Algorithm: List 1, List 2, List 3 are lists sorted by score
• In every round: read one doc from every list
• Each candidate has a current score and a best-score

Fagin’s NRA Algorithm: round 1
• Maximum score for unseen docs: 0.6 + 0.6 + 0.9 = 2.1
• Min top-2 score: 0.6
• min-top-2 < best-score of candidates

Fagin’s NRA Algorithm: round 2
• Maximum score for unseen docs: 0.5 + 0.6 + 0.7 = 1.8
• Min top-2 score: 0.9
• min-top-2 < best-score of candidates

Fagin’s NRA Algorithm: round 3
• Maximum score for unseen docs: 0.4 + 0.6 + 0.3 = 1.3
• Min top-2 score: 1.3
• No more new docs can get into top-2
• But, extra candidates left in queue: min-top-2 < best-score of candidates

Fagin’s NRA Algorithm: round 4
• Maximum score for unseen docs: 0.3 + 0.6 + 0.2 = 1.1
• Min top-2 score: 1.3
• No more new docs can get into top-2
• But, extra candidates left in queue: min-top-2 < best-score of candidates
Fagin’s NRA Algorithm: round 5
• Maximum score for unseen docs: 0.2 + 0.5 + 0.1 = 0.8
• Min top-2 score: 1.6
• No extra candidate in queue
• Done!

More approaches:
• Periodically also perform random accesses on documents to reduce uncertainty (CA)
• Sophisticated scheduling on lists
• Crude approximation: NRA may take a lot of time to stop. Just stop after a while with an approximate top-k. Who cares if the results are perfect according to the scores?
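The rounds above can be reproduced with a short sketch of the NRA idea. It is a simplified illustration, not Fagin's exact formulation: it assumes the total score is the sum of the per-list scores, that scores are non-negative, and that a document missing from a list scores 0 there. The name `nra_top_k` and the toy lists are made up for this example.

```python
def nra_top_k(lists, k):
    """Top-k doc ids using sorted access only; returns (top_k_ids, rounds_read).

    lists: one list per term of (doc_id, score), sorted by descending score.
    """
    m = len(lists)
    seen = {}   # doc_id -> {list index -> score read so far}
    worst = {}  # doc_id -> current score (lower bound on the total)
    max_depth = max((len(lst) for lst in lists), default=0)
    for depth in range(max_depth):
        last = []  # last score read in each list this round
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc_id, score = lst[depth]
                seen.setdefault(doc_id, {})[i] = score
                last.append(score)
            else:
                last.append(0.0)  # exhausted list: nothing more to gain there
        worst = {d: sum(s.values()) for d, s in seen.items()}
        # best-score: current score plus the last-read score of every missing list
        best = {d: worst[d] + sum(last[i] for i in range(m) if i not in s)
                for d, s in seen.items()}
        top = sorted(worst, key=lambda d: (-worst[d], d))[:k]
        if len(top) < k:
            continue
        min_top = worst[top[-1]]
        unseen_max = sum(last)  # maximum score for unseen docs
        rest_best = max((best[d] for d in seen if d not in top), default=0.0)
        if min_top >= unseen_max and min_top >= rest_best:
            return top, depth + 1  # nothing outside the top-k can overtake it
    top = sorted(worst, key=lambda d: (-worst[d], d))[:k]
    return top, max_depth

lists = [
    [("a", 1.0), ("b", 0.5), ("c", 0.25), ("d", 0.125)],
    [("b", 0.75), ("a", 0.5), ("d", 0.25), ("c", 0.125)],
    [("a", 0.75), ("c", 0.5), ("b", 0.25), ("d", 0.125)],
]
top, rounds = nra_top_k(lists, 2)
```

The sketch guarantees the top-k set; the order within it follows the current scores, which may still be lower bounds when the algorithm stops.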
References • Primarily: IR Book by Manning, Raghavan and Schuetze: http://nlp.stanford.edu/IR-book/