Discussion Class 6 Ranking Algorithms
Discussion Classes
Format:
  Question
  Ask a member of the class to answer
  Provide opportunity for others to comment
When answering:
  Give your name. Make sure that the TA hears it.
  Stand up
  Speak clearly so that all the class can hear
Question 1: Inverted Document Frequency (IDF)
In class, I first introduced Salton's original term weighting, known as Inverted Document Frequency:
  $w_{ik} = f_{ik} / d_k$
The reading gives Sparck Jones's term weighting, Inverted Document Frequency (IDF):
  $IDF_i = \log_2(N / n_i) + 1$  or  $IDF_i = \log_2(maxn / n_i) + 1$
What is the relationship between these alternatives?
Q1 (continued): Definitions of Terms
  $w_{ik}$  weight given to term k in document i
  $f_{ik}$  frequency with which term k appears in document i
  $d_k$  number of documents that contain term k
  $N$  number of documents in the collection
  $n_i$  total number of occurrences of term i in the collection
  $maxn$  maximum frequency of any term in the collection
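To make the comparison concrete, the three weightings can be evaluated side by side. This is a minimal sketch that follows the definitions on the slide; the function names and the sample numbers are hypothetical and chosen only so the arithmetic is easy to check by hand.

```python
import math


def salton_weight(f_ik: int, d_k: int) -> float:
    """Salton's original weighting: term frequency in the document
    divided by the number of documents containing the term."""
    return f_ik / d_k


def idf_collection_size(N: int, n_i: int) -> float:
    """Sparck Jones's IDF using the collection size N."""
    return math.log2(N / n_i) + 1


def idf_max_frequency(maxn: int, n_i: int) -> float:
    """Variant of IDF using the maximum term frequency maxn."""
    return math.log2(maxn / n_i) + 1


# Hypothetical numbers, not taken from any real collection.
print(salton_weight(f_ik=3, d_k=4))           # 3 / 4
print(idf_collection_size(N=1024, n_i=16))    # log2(64) + 1
print(idf_max_frequency(maxn=256, n_i=16))    # log2(16) + 1
```

Note that Salton's formula already combines a within-document frequency with a collection-level count, whereas the two IDF variants depend on the collection only, and the logarithm damps the effect of very rare terms.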
Question 2: Within-Document Frequency
(a) Why does term weighting using within-document frequency improve ranking?
(b) Why is it necessary to normalize within-document frequency?
(c) Explain Croft's normalization:
  $cfreq_{ij} = K + (1 - K) \, freq_{ij} / maxfreq_j$
(d) How does Salton and Buckley's recommended term weighting fit with Croft's normalization?
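As an aid to part (c), here is a small sketch of Croft's normalization as written above. The function name and the default K = 0.5 are assumptions made for illustration; the slide does not fix a value of K.

```python
def croft_normalized_freq(freq_ij: int, maxfreq_j: int, K: float = 0.5) -> float:
    """cfreq_ij = K + (1 - K) * freq_ij / maxfreq_j.

    freq_ij   : frequency of term i in document j
    maxfreq_j : maximum frequency of any term in document j
    K         : constant in [0, 1]; the default 0.5 is an assumption
    """
    if maxfreq_j <= 0:
        raise ValueError("maxfreq_j must be positive")
    return K + (1 - K) * freq_ij / maxfreq_j


# K controls how much the within-document frequency matters:
for K in (0.0, 0.5, 1.0):
    values = [croft_normalized_freq(f, maxfreq_j=4, K=K) for f in (1, 2, 4)]
    print(K, values)
```

With K = 1 every term that occurs gets the same weight, with K = 0 the weight is the plain normalized frequency, and intermediate values interpolate between the two. The 0.5 + 0.5 form in the Salton and Buckley query weight has the same shape with K = 0.5.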
Question 3: Salton/Buckley Recommendation
  $$similarity(Q, D) = \frac{\sum_{i=1}^{t} (w_{iq} \times w_{ij})}{\sqrt{\sum_{i=1}^{t} w_{iq}^2 \times \sum_{i=1}^{t} w_{ij}^2}}$$
where
  $$w_{iq} = \left(0.5 + \frac{0.5 \, freq_{iq}}{maxfreq_q}\right) \times IDF_i$$
and
  $$w_{ij} = freq_{ij} \times IDF_i$$
  $freq_{iq}$ = frequency of term i in query q
  $maxfreq_q$ = maximum frequency of any term in query q
  $IDF_i$ = IDF of term i in entire collection
  $freq_{ij}$ = frequency of term i in document j
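The similarity measure above is a cosine between a query vector and a document vector. The sketch below is one possible reading of it, under two assumptions that are not stated on the slide: the document weight uses the IDF of term i, and a term that does not occur in the query (or document) gets weight 0. All function and variable names are hypothetical.

```python
import math


def query_weight(freq_iq: int, maxfreq_q: int, idf_i: float) -> float:
    """w_iq = (0.5 + 0.5 * freq_iq / maxfreq_q) * IDF_i."""
    return (0.5 + 0.5 * freq_iq / maxfreq_q) * idf_i


def document_weight(freq_ij: int, idf_i: float) -> float:
    """w_ij = freq_ij * IDF_i."""
    return freq_ij * idf_i


def similarity(query_freqs: dict, doc_freqs: dict, idf: dict) -> float:
    """Cosine similarity between weighted query and document vectors.

    query_freqs / doc_freqs map term -> raw frequency;
    idf maps term -> IDF value for the collection.
    Terms missing from a vector are treated as weight 0 (assumption).
    """
    if not query_freqs or not doc_freqs:
        return 0.0
    maxfreq_q = max(query_freqs.values())
    wq = {t: query_weight(f, maxfreq_q, idf[t]) for t, f in query_freqs.items()}
    wd = {t: document_weight(f, idf[t]) for t, f in doc_freqs.items()}
    numerator = sum(wq[t] * wd[t] for t in wq if t in wd)
    norm = math.sqrt(sum(w * w for w in wq.values()) *
                     sum(w * w for w in wd.values()))
    return numerator / norm if norm else 0.0


# Hypothetical toy data: all IDF values set to 1 for easy checking.
idf = {"ranking": 1.0, "algorithm": 1.0}
print(similarity({"ranking": 1, "algorithm": 1}, {"ranking": 1}, idf))
```

Because of the normalization in the denominator, a document that matches the query direction exactly scores 1, and a document sharing no terms with the query scores 0.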
Question 4: Zipf's Law
"... significant performance improvement using ... the inverted document frequency ... that is based on Zipf's distribution ..."
What has Zipf's law to do with IDF?
Question 5: Probabilistic Models
The section on probabilistic models is rather unsatisfactory because it relies on a mathematical foundation that has been left out. Can you summarize the basic ideas?
Question 6: TF.IDF compared with Google PageRank
(a) TF.IDF and PageRank are based on fundamentally different considerations. What are the fundamental differences?
(b) Under which circumstances would you expect each to excel?
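One way to see the contrast is that TF.IDF scores a document from its own text relative to a query, while PageRank scores a page from the link structure alone, independent of any query. The following is a simplified power-iteration sketch of the PageRank idea, not Google's actual implementation. The function name, the damping value 0.85, the fixed iteration count, and the toy link graph are all assumptions for illustration.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Every link target is assumed to appear as a key in links.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a base share from the random-jump term.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: a and b both link to c, and c links back to a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

Nothing in this computation looks at the words on a page, which is the point of part (a): the score reflects how other pages link to it, not how well its text matches a query.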