
Algorithms for Large Data Sets



Presentation Transcript


  1. Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 2 March 26, 2006 http://www.ee.technion.ac.il/courses/049011

  2. Information Retrieval

  3. Information Retrieval Setting [diagram: a user interacting with an IR system over a document collection]
  • “Information Need”: I want information about Michael Jordan, the machine learning expert
  • User issues a query; the IR System searches the Document Collection and returns a ranked list of retrieved documents: Michael I. Jordan’s homepage • NBA.com • Michael Jordan on TV
  • User gives feedback: “No. 1 is good, rest are bad”; refined query: +”Michael Jordan” –basketball
  • System returns a revised ranked list of retrieved documents: Michael I. Jordan’s homepage • M.I. Jordan’s pubs • Graphical Models

  4. Information Retrieval vs. Data Retrieval • Information Retrieval System: a system that allows a user to retrieve documents that match her “information need” from a large corpus. • Ex: Get documents about Michael Jordan, the machine learning expert. • Data Retrieval System: a system that allows a user to retrieve all documents that match her query from a large corpus. • Ex: SELECT doc FROM corpus WHERE (doc.text CONTAINS “Michael Jordan”) AND NOT (doc.text CONTAINS “basketball”).

  5. Information Retrieval vs. Data Retrieval

  6. Information Retrieval Systems [diagram: data flow through an IR system]
  • Indexing path: Corpus → raw docs → text processor → tokenized docs → indexer → postings → index
  • Query path: User → user query → query processor → system query → index → retrieved docs → ranking procedure → ranked retrieved docs → User

  7. Search Engines [diagram: the IR-system architecture of slide 6, adapted to the Web]
  • Indexing path: Web → crawler → repository → text processor → tokenized docs → indexer → postings → index
  • A global analyzer operates on the repository alongside the indexer
  • Query path: User → user query → query processor → system query → index → retrieved docs → ranking procedure → ranked retrieved docs → User

  8. Classical IR vs. Web IR

  9. Outline • Abstract formulation • Models for relevance ranking • Retrieval evaluation • Query languages • Text processing • Indexing and searching

  10. Abstract Formulation • Ingredients: • D: document collection • Q: query space • f: D × Q → R: relevance scoring function • For every q in Q, f induces a ranking (partial order) ≼q on D • Functions of an IR system: • Preprocess D and create an index I • Given q in Q, use I to produce a permutation π on D • Goals: • Accuracy: π should be “close” to ≼q • Compactness: index should be compact • Response time: answers should be given quickly

  11. Document Representation • T = { t1,…, tk }: a “token space” • (a.k.a. “feature space” or “term space”) • Ex: all words in English • Ex: phrases, URLs, … • A document: a real vector d in R^k • di: “weight” of token ti in d • Ex: di = normalized # of occurrences of ti in d

  12. Classic IR (Relevance) Models • The Boolean model • The Vector Space Model (VSM)

  13. The Boolean Model • A document: a boolean vector d in {0,1}^k • di = 1 iff ti belongs to d • A query: a boolean formula q over tokens • q: {0,1}^k → {0,1} • Ex: “Michael Jordan” AND (NOT basketball) • Ex: +“Michael Jordan” –basketball • Relevance scoring function: f(d,q) = q(d)
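A minimal sketch of Boolean-model scoring, treating a document as the set of tokens it contains and a query as an arbitrary boolean function over those tokens; the sample documents and query below are illustrative, not from the lecture:

```python
# Boolean model: f(d, q) = q(d), where q maps a document's token set
# to 0 or 1. Here phrases count as single tokens (the token space may
# include phrases, per slide 11).

def boolean_score(doc_tokens: set, query) -> int:
    """Return 1 iff the document satisfies the boolean query."""
    return 1 if query(doc_tokens) else 0

# Query: +"michael jordan" -basketball
query = lambda d: ("michael jordan" in d) and ("basketball" not in d)

d1 = {"michael jordan", "graphical models", "berkeley"}
d2 = {"michael jordan", "basketball", "nba"}
print(boolean_score(d1, query))  # 1 -- matches
print(boolean_score(d2, query))  # 0 -- excluded by the negation
```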

  14. The Boolean Model: Pros & Cons • Advantages: • Simplicity for users • Disadvantages: • Relevance scoring is too coarse

  15. The Vector Space Model (VSM) • A document: a real vector d in R^k • di = weight of ti in d (usually TF-IDF score) • A query: a real vector q in R^k • qi = weight of ti in q • Relevance scoring function: f(d,q) = sim(d,q) • “similarity” between d and q

  16. Popular Similarity Measures [diagram: vectors d and q, the difference d − q, and the angle θ between d and q]
  • L1 or L2 distance: ||d − q|| • d, q are first normalized to have unit norm
  • Cosine similarity: sim(d,q) = cos(θ) = (d · q) / (||d|| ||q||)
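A minimal sketch of both measures, computed directly from their definitions on plain lists of term weights (the example vectors are illustrative):

```python
import math

def cosine_sim(d, q):
    """cos(theta) = (d . q) / (||d|| ||q||)."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_d * norm_q)

def l2_distance_normalized(d, q):
    """L2 distance after normalizing both vectors to unit norm."""
    nd = math.sqrt(sum(x * x for x in d))
    nq = math.sqrt(sum(x * x for x in q))
    return math.sqrt(sum((x / nd - y / nq) ** 2 for x, y in zip(d, q)))

d, q = [1.0, 2.0, 0.0], [1.0, 1.0, 1.0]
print(cosine_sim(d, q))              # ~0.775
print(l2_distance_normalized(d, q))  # small when the angle is small
```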

  17. TF-IDF Score: Motivation • Motivating principle: • A term ti is relevant to a document d if: • ti occurs many times in d relative to other terms that occur in d • ti occurs many times in d relative to its number of occurrences in other documents • Examples • 10 out of 100 terms in d are “java” • 10 out of 10,000 terms in d are “java” • 10 out of 100 terms in d are “the”

  18. TF-IDF Score: Definition • n(d,ti) = # of occurrences of ti in d • N = Σi n(d,ti) (# of tokens in d) • Di = # of documents containing ti • D = # of documents in the collection • TF(d,ti): “Term Frequency” • Ex: TF(d,ti) = n(d,ti) / N • Ex: TF(d,ti) = n(d,ti) / (maxj { n(d,tj) }) • IDF(ti): “Inverse Document Frequency” • Ex: IDF(ti) = log (D/Di) • TFIDF(d,ti) = TF(d,ti) × IDF(ti)
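A minimal sketch of the score using the first TF variant, TF = n(d,t)/N, and IDF = log(D/Di); the toy corpus is illustrative:

```python
import math
from collections import Counter

def tfidf(term, doc, corpus):
    counts = Counter(doc)
    tf = counts[term] / len(doc)                       # n(d,t) / N
    d_t = sum(1 for d in corpus if term in d)          # D_i: docs containing t
    idf = math.log(len(corpus) / d_t) if d_t else 0.0  # log(D / D_i)
    return tf * idf

corpus = [
    ["michael", "jordan", "graphical", "models"],
    ["michael", "jordan", "nba", "legend"],
]
print(tfidf("graphical", corpus[0], corpus))  # positive: term is rare
print(tfidf("michael", corpus[0], corpus))    # 0: term is in every doc
```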

  19. VSM: Pros & Cons • Advantages: • Better granularity in relevance scoring • Good performance in practice • Efficient implementations • Disadvantages: • Assumes term independence

  20. Retrieval Evaluation [diagram: Venn diagram of Dq and Lq inside D]
  • Notations: • D: document collection • Dq: documents in D that are “relevant” to query q • Ex: f(d,q) is above some threshold • Lq: list of results on query q
  • Recall: |Dq ∩ Lq| / |Dq|
  • Precision: |Dq ∩ Lq| / |Lq|

  21. Recall & Precision: Example List A List B Relevant docs: d123, d56, d9, d25, d3 • Recall(A) = 80% • Precision(A) = 40% • d123 • d84 • d56 • d6 • d8 • d9 • d511 • d129 • d187 • d25 • d81 • d74 • d56 • d123 • d511 • d25 • d9 • d129 • d3 • d5 • Recall(B) = 100% • Precision(B) = 50%

  22. Precision@k and Recall@k • Notations: • Dq: documents in D that are “relevant” to q • Lq,k: top k results on the list • Recall@k: |Dq ∩ Lq,k| / |Dq| • Precision@k: |Dq ∩ Lq,k| / |Lq,k|
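A minimal sketch of both measures, reusing List A and the relevant set from the example on slide 21:

```python
def precision_at_k(ranked, relevant, k):
    """|Dq intersect Lq,k| / k."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """|Dq intersect Lq,k| / |Dq|."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

list_a = ["d123", "d84", "d56", "d6", "d8",
          "d9", "d511", "d129", "d187", "d25"]
relevant = {"d123", "d56", "d9", "d25", "d3"}

print(precision_at_k(list_a, relevant, 5))   # 2/5  = 0.4
print(recall_at_k(list_a, relevant, 5))      # 2/5  = 0.4
print(precision_at_k(list_a, relevant, 10))  # 4/10 = 0.4
print(recall_at_k(list_a, relevant, 10))     # 4/5  = 0.8
```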

  23. Precision@k: Example [plot: precision@k as a function of k for the two lists of slide 21]
  • List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25
  • List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5

  24. Recall@k: Example [plot: recall@k as a function of k for the two lists of slide 21]
  • List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25
  • List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5

  25. “Interpolated” Precision • Notations: • Dq: documents in D that are “relevant” to q • r: a recall level (e.g., 20%) • k(r): first k so that recall@k >= r • Interpolated precision at recall level r = max { precision@k : k >= k(r) }
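A minimal sketch of interpolated precision: scan the ranked list once and take the maximum precision over all cutoffs k at which recall@k >= r (since recall is non-decreasing, this equals the max over k >= k(r)). List B and the relevant set are from slide 21:

```python
def interpolated_precision(ranked, relevant, r):
    best, hits = 0.0, 0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        if hits / len(relevant) >= r:          # recall@k >= r
            best = max(best, hits / k)         # precision@k
    return best

list_b = ["d81", "d74", "d56", "d123", "d511",
          "d25", "d9", "d129", "d3", "d5"]
relevant = {"d123", "d56", "d9", "d25", "d3"}
print(interpolated_precision(list_b, relevant, 0.2))  # 4/7 ~ 0.571
```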

  26. Precision vs. Recall: Example [plot: precision as a function of recall for the two lists of slide 21]
  • List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25
  • List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5

  27. Query Languages: Keyword-Based • Single-word queries • Ex: Michael Jordan machine learning • Context queries • Phrases. Ex: “Michael Jordan” “machine learning” • Proximity. Ex: “Michael Jordan” at distance of at most 10 words from “machine learning” • Boolean queries • Ex: +”Michael Jordan” –basketball • Natural language queries • Ex: “Get me pages about Michael Jordan, the machine learning expert.”

  28. Query Languages: Pattern Matching • Prefixes • Ex: prefix:comput • Suffixes • Ex: suffix:net • Regular Expressions • Ex: [0-9]+th world-wide web conference

  29. Text Processing • Lexical analysis & tokenization • Split text into words, downcase letters, filter out punctuation marks, digits, hyphens • Stopword elimination • Better retrieval accuracy, more compact index • Ex: “to be or not to be” • Stemming • Ex: “computer”, “computing”, “computation” → comput • Index term selection • Keywords vs. full text
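A minimal sketch of this pipeline; the stopword list and the suffix-stripping "stemmer" are illustrative stand-ins (a real system would use something like Porter's stemmer):

```python
import re

STOPWORDS = {"to", "be", "or", "not", "the", "a", "of", "and", "is", "at"}

def stem(token):
    # Toy rule: strip a few common suffixes (real stemmers are subtler).
    for suffix in ("ation", "ing", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def process(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # lexical analysis
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword elimination
    return [stem(t) for t in tokens]                    # stemming

print(process("Computer science: computing and computation"))
# ['comput', 'science', 'comput', 'comput']
```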

  30. Inverted Index
  • d1: Michael(1) Jordan(2), the(3) author(4) of(5) “graphical(6) models(7)”, is(8) a(9) professor(10) at(11) U.C.(12) Berkeley(13).
  • d2: The(1) famous(2) NBA(3) legend(4) Michael(5) Jordan(6) liked(7) to(8) date(9) models(10).
  • Vocabulary → Postings (doc, position): • author: (d1,4) • berkeley: (d1,13) • date: (d2,9) • famous: (d2,2) • graphical: (d1,6) • jordan: (d1,2), (d2,6) • legend: (d2,4) • like: (d2,7) • michael: (d1,1), (d2,5) • model: (d1,7), (d2,10) • nba: (d2,3) • professor: (d1,10) • uc: (d1,12)

  31. Inverted Index Structure [diagram: each term in the vocabulary file points to its postings list in the postings file]
  • Vocabulary file (term1, term2, …): usually fits in main memory
  • Postings file (postings list 1, postings list 2, …): stored on disk

  32. Searching an Inverted Index • Given: • t1, t2: query terms • L1, L2: corresponding posting lists • Need to get ranked list of docs in intersection of L1, L2 • Solution 1: If L1, L2 are comparable in size, “merge” L1 and L2 to find docs in their intersection, and then order them by rank. (running time: O(|L1| + |L2|)) • Solution 2: If L1 is considerably shorter than L2, binary search each posting of L1 in L2 to find the intersection, and then order them by rank. (running time: O(|L1| × log(|L2|)))
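Minimal sketches of both solutions over sorted lists of doc ids (positions and the final rank ordering are omitted; the sample lists are illustrative):

```python
from bisect import bisect_left

def merge_intersect(l1, l2):
    """Solution 1: linear merge, O(|L1| + |L2|)."""
    out, i, j = [], 0, 0
    while i < len(l1) and j < len(l2):
        if l1[i] == l2[j]:
            out.append(l1[i]); i += 1; j += 1
        elif l1[i] < l2[j]:
            i += 1
        else:
            j += 1
    return out

def binary_search_intersect(short, long):
    """Solution 2: binary-search each posting, O(|L1| log |L2|)."""
    out = []
    for doc in short:
        k = bisect_left(long, doc)
        if k < len(long) and long[k] == doc:
            out.append(doc)
    return out

l1, l2 = [2, 5, 9], [1, 2, 3, 5, 8, 9, 13, 21]
print(merge_intersect(l1, l2))          # [2, 5, 9]
print(binary_search_intersect(l1, l2))  # [2, 5, 9]
```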

  33. Search Optimization • Improvement: Order docs in posting lists by static rank (e.g., PageRank). • Then, can output top matches, without scanning the whole lists.

  34. Index Construction • Given a stream of documents, store (did,tid,pos) triplets in a file • Sort and group file by tid • Extract posting lists
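A minimal in-memory sketch of this sort-based construction, with term strings standing in for term ids and lists standing in for the triplet file:

```python
from itertools import groupby
from operator import itemgetter

def build_index(docs):
    """docs: {docid: [token, ...]} -> {term: [(docid, pos), ...]}."""
    # Step 1: emit (term, docid, pos) triplets for the document stream.
    triples = [(term, did, pos)
               for did, tokens in docs.items()
               for pos, term in enumerate(tokens, start=1)]
    # Step 2: sort, which groups all postings of a term together.
    triples.sort()
    # Step 3: extract one posting list per term.
    return {term: [(did, pos) for _, did, pos in group]
            for term, group in groupby(triples, key=itemgetter(0))}

docs = {
    "d1": ["michael", "jordan", "author", "graphical", "model"],
    "d2": ["famous", "nba", "legend", "michael", "jordan"],
}
print(build_index(docs)["jordan"])  # [('d1', 2), ('d2', 5)]
```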

  35. Index Maintenance • Naïve updates of inverted index can be very costly • Require random access • A single change may cause many insertions/deletions • Batch updates • Two indices • Main index (created in batch, large, compressed) • “Stop-press” index (incremental, small, uncompressed)

  36. Index Maintenance • If a page d is inserted/deleted, the “signed” postings (did,tid,pos,I/D) are added to the stop-press index. • Given a query term t, fetch its list Lt from the main index, and two lists Lt,+ (insertions) and Lt,− (deletions) from the stop-press index. • Result is: (Lt ∪ Lt,+) \ Lt,− • When the stop-press index grows too large, it is merged into the main index.
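A minimal sketch of the lookup rule, with posting lists reduced to sets of doc ids (the sample sets are illustrative):

```python
def lookup(main, stop_plus, stop_minus):
    """Result = (L_t union L_t,+) minus L_t,- ."""
    return (main | stop_plus) - stop_minus

main = {"d1", "d2", "d3"}   # postings for t in the main index
stop_plus = {"d5"}          # insertions recorded in the stop-press index
stop_minus = {"d2"}         # deletions recorded in the stop-press index
print(lookup(main, stop_plus, stop_minus))  # {'d1', 'd3', 'd5'}
```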

  37. Index Compression • Delta compression • Saves a lot for popular terms • Doesn’t save much for rare terms (but these don’t take much space anyway)
  • Before: michael: (1000007,5), (1000009,12), (1000013,77), (1000035,88), …
  • After (gaps between consecutive doc ids): michael: (1000007,5), (2,12), (4,77), (22,88), …
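A minimal sketch of delta (gap) encoding and decoding for the doc ids of a posting list, using the numbers from the slide:

```python
def delta_encode(doc_ids):
    """Store the first doc id, then gaps between consecutive ids."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(gaps):
    """Invert delta_encode by running prefix sums."""
    out = [gaps[0]]
    for g in gaps[1:]:
        out.append(out[-1] + g)
    return out

ids = [1000007, 1000009, 1000013, 1000035]
gaps = delta_encode(ids)
print(gaps)                       # [1000007, 2, 4, 22]
assert delta_decode(gaps) == ids  # round-trips exactly
```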

  38. Variable Length Encodings • How to encode gaps succinctly? • Option 1: Fixed-length binary encoding. • Effective when all gap lengths are equally likely • No savings over storing doc ids. • Option 2: Unary encoding. • Gap x is encoded by x−1 1’s followed by a 0 • Effective when large gaps are very rare (Pr(x) = 1/2^x) • Option 3: Gamma encoding. • Gap x is encoded by (λx, σx), where σx is the binary encoding of x and λx is the length of σx, encoded in unary. • Encoding length: about 2·log(x).
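A minimal sketch of the gamma code, keeping the slide's unary convention (1's terminated by a 0) and keeping the leading bit of the binary part; the total length is 2·len(σx) − 1, i.e., about 2·log(x):

```python
def gamma_encode(x):
    """Encode gap x >= 1 as unary(len(sigma_x)) followed by sigma_x."""
    binary = bin(x)[2:]                    # sigma_x
    unary = "1" * (len(binary) - 1) + "0"  # lambda_x in unary
    return unary + binary

def gamma_decode(code):
    """Read the unary length, then that many bits of binary."""
    length = code.index("0") + 1
    return int(code[length:length + length], 2)

print(gamma_encode(13))                       # '1110' + '1101'
assert gamma_decode(gamma_encode(13)) == 13   # round-trips exactly
```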

  39. End of Lecture 2
