640 likes | 651 Views
Algoritmi per IR. Ranking. The big fight: find the best ranking. Ranking: Google vs Google.cn. Algoritmi per IR. Text-based Ranking (1° generation). Similarity between binary vectors. Document is binary vector X,Y in {0,1} D Score: overlap measure. What’s wrong ?. Normalization.
E N D
Algoritmi per IR Ranking
Algoritmi per IR Text-based Ranking (1° generation)
Similarity between binary vectors • Document is binary vector X,Y in {0,1}D • Score: overlap measure What’s wrong ?
Normalization • Dice coefficient (wrt avg #terms): • Jaccard coefficient (wrt possible terms): NO, triangular OK, triangular
What’s wrong in doc-similarity ? Overlap matching doesn’t consider: • Term frequency in a document • Talks more of t ? Then t should be weighted more. • Term scarcity in collection • of commoner than baby bed • Length of documents • score should be normalized
æ ö n = ç ÷ idf log where nt = #docs containing term t n = #docs in the indexed collection è ø t n t A famous “weight”: tf-idf = tf Frequency of term t in doc d = #occt / |d| t,d Vector Space model
Sec. 6.3 Why distance is a bad idea
Postulate: Documents that are “close together” in the vector space talk about the same things. Euclidean distance sensible to vector length !! Easy to Spam A graphical example t3 cos(a)= v w / ||v|| * ||w|| d2 d3 Sophisticated algos to find top-k docs for a query Q d1 a t1 d5 t2 d4 The user query is a very short doc
Sec. 6.3 cosine(query,document) Dot product qi is the tf-idf weight of term i in the query di is the tf-idf weight of term i in the document cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
Cos for length-normalized vectors • For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): for q, d length-normalized.
Sec. 6.3 Cosine similarity amongst 3 documents How similar are the novels SaS: Sense and Sensibility PaP: Pride and Prejudice, and WH: Wuthering Heights? Term frequencies (counts) Note: To simplify this example, we don’t do idf weighting.
Sec. 6.3 3 documents example contd. Log frequency weighting After length normalization cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94 cos(SaS,WH) ≈ 0.79 cos(PaP,WH) ≈ 0.69 Why do we have cos(SaS,PaP) > cos(SAS,WH)?
Vector spaces and other operators • Vector space OK for bag-of-words queries • Clean metaphor for similar-document queries • Not a good combination with operators: Boolean, wild-card, positional, proximity • First generation of search engines • Invented before “spamming” web search
Algoritmi per IR Top-k retrieval
Sec. 7.1.1 Speed-up top-k retrieval • Costly is the computation of the cos • Find a set A of contenders, with K < |A| << N • A does not necessarily contain the top K, but has many docs from among the top K • Return the top K docs in A, according to the score • The same approach is also used for other (non-cosine) scoring functions • Will look at several schemes following this approach
Sec. 7.1.2 Index elimination • Consider docs containing at least one query term. Hence this means… • Take this further: • Only consider high-idf query terms • Only consider docs containing many query terms
Sec. 7.1.2 High-idf query terms only • For a query such as catcher in the rye • Only accumulate scores from catcher and rye • Intuition: in and the contribute little to the scores and so don’t alter rank-ordering much • Benefit: • Postings of low-idf terms have many docs these (many) docs get eliminated from set A of contenders
Sec. 7.1.2 Docs containing many query terms • For multi-term queries, compute scores for docs containing several of the query terms • Say, at least 3 out of 4 • Imposes a “soft conjunction” on queries seen on web search engines (early Google) • Easy to implement in postings traversal
Sec. 7.1.2 3 2 4 4 8 8 16 16 32 32 64 64 128 128 1 2 3 5 8 13 21 34 3 of 4 query terms Antony Brutus Caesar Calpurnia 13 16 32 Scores only computed for docs 8, 16 and 32.
Champion Lists • Preprocess: Assign to each term, its m best documents • Search: • If |Q| = q terms, merge their preferred lists ( mq answers). • Compute COS between Q and these docs, and choose the top k. Need to pick m>k to work well empirically. Now SE use tf-idf PLUS PageRank (PLUS other weights)
Sec. 7.1.4 Complex scores • Consider a simple total score combining cosine relevance and authority • net-score(q,d) = PR(d) + cosine(q,d) • Can use some other linear combination than an equal weighting • Now we seek the top K docs by net score
Advanced: Fancy-hits heuristic • Preprocess: • Assign docID by decreasing PR weight • Define FH(t) =m docs for t with highest tf-idf weight • Define IL(t) = the rest (i.e. incr docID = decr PR weight) • Idea: a document that scores high should be in FH or in the front of IL • Search for a t-term query: • First FH: Take the common docs of their FH • compute the score of these docs and keep the top-k docs. • Then IL: scan ILs and check the common docs • Compute the score and possibly insert them into the top-k. • Stop when M docs have been checked or the PR score becomes smaller than some threshold.
Algoritmi per IR Speed-up querying by clustering
Sec. 7.1.6 Visualization Query Leader Follower
Sec. 7.1.6 Cluster pruning: preprocessing • Pick N docs at random: call these leaders • For every other doc, pre-compute nearest leader • Docs attached to a leader: its followers; • Likely: each leader has ~ N followers.
Sec. 7.1.6 Cluster pruning: query processing • Process a query as follows: • Given query Q, find its nearest leader L. • Seek K nearest docs from among L’s followers.
Sec. 7.1.6 Why use random sampling • Fast • Leaders reflect data distribution
Sec. 7.1.6 General variants • Have each follower attached to b1=3 (say) nearest leaders. • From query, find b2=4 (say) nearest leaders and their followers. • Can recur on leader/follower construction.
Algoritmi per IR Relevance feedback
Sec. 9.1 Relevance Feedback • Relevance feedback: user feedback on relevance of docs in initial set of results • User issues a (short, simple) query • The user marks some results as relevant or non-relevant. • The system computes a better representation of the information need based on feedback. • Relevance feedback can go through one or more iterations.
Sec. 9.1.1 Rocchio’s Algorithm • The Rocchio algorithm uses the vector space model to pick a relevance feed-back query • Rocchio seeks the query qopt that maximizes
Sec. 9.1.1 Rocchio (SMART) • Used in practice: • Dr =set of known relevant doc vectors • Dnr = set of known irrelevant doc vectors • qm = modified query vector; q0 = original query vector; α,β,γ: weights (hand-chosen or set empirically) • New query moves toward relevant documents and away from irrelevant documents
Relevance Feedback: Problems • Users are often reluctant to provide explicit feedback • It’s often harder to understand why a particular document was retrieved after applying relevance feedback • There is no clear evidence that relevance feedback is the “best use” of the user’s time.
Sec. 9.1.4 Relevance Feedback on the Web • Some search engines offer a similar/related pages feature (this is a trivial form of relevance feedback) • Google (link-based) • Altavista • Stanford WebBase • Some don’t because it’s hard to explain to users: • Alltheweb • bing • Yahoo • Excite initially had true relevance feedback, but abandoned it due to lack of use. α/β/γ ??
Sec. 9.1.6 Pseudo relevance feedback • Pseudo-relevance feedback automates the “manual” part of true relevance feedback. • Retrieve a list of hits for the user’s query • Assume that the top k are relevant. • Do relevance feedback (e.g., Rocchio) • Works very well on average • But can go horribly wrong for some queries. • Several iterations can cause query drift.
Sec. 9.2.2 Query Expansion • In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents • In query expansion, users give additional input (good/bad search term) on words or phrases
Sec. 9.2.2 How augment the user query? • Manual thesaurus (costly to generate) • E.g. MedLine: physician, syn: doc, doctor, MD • Global Analysis(static; all docs in collection) • Automatically derived thesaurus • (co-occurrence statistics) • Refinements based on query-log mining • Common on the web • Local Analysis (dynamic) • Analysis of documents in result set
Query assist Would you expect such a feature to increase the query volume at a search engine?
Algoritmi per IR Zone indexes
Sec. 6.1 Parametric and zone indexes • Thus far, a doc has been a term sequence • But documents have multiple parts: • Author • Title • Date of publication • Language • Format • etc. • These are the metadata about a document
Sec. 6.1 Zone • A zone is a region of the doc that can contain an arbitrary amount of text e.g., • Title • Abstract • References … • Build inverted indexes on fields AND zones to permit querying • E.g., “find docs with merchant in the title zone and matching the query gentle rain”
Sec. 6.1 Example zone indexes Encode zones in dictionary vs. postings.
Sec. 7.2.1 Tiered indexes • Break postings up into a hierarchy of lists • Most important • … • Least important • Inverted index thus broken up into tiers of decreasing importance • At query time use top tier unless it fails to yield K docs • If so drop to lower tiers
Sec. 7.2.1 Example tiered index
Sec. 7.2.2 Query term proximity • Free text queries: just a set of terms typed into the query box – common on the web • Users prefer docs in which query terms occur within close proximity of each other • Would like scoring function to take this into account – how?
Sec. 7.2.3 Query parsers • E.g. query rising interest rates • Run the query as a phrase query • If <K docs contain the phrase rising interest rates, run the two phrase queries rising interest and interest rates • If we still have <K docs, run the vector space query rising interest rates • Rank matching docs by vector space scoring
Algoritmi per IR Quality of a Search Engine
Is it good ? • How fast does it index • Number of documents/hour • (Average document size) • How fast does it search • Latency as a function of index size • Expressiveness of the query language