Rank Aggregation Methods II: Experiments (CS728 Lecture 12)

This lecture discusses experiments on distance measures in the rank aggregation problem, including Spearman footrule distance, Kendall tau distance, induced footrule distance, and scaled footrule distance.
Recall the Rank Aggregation Problem • m candidates (a.k.a. "alternatives") • M = {1,…,m}: set of candidates • n voters (a.k.a. "agents" or "judges") • N = {1,…,n}: set of voters • Each voter i has a ranking σi on M • σi(a) < σi(b) means the i-th voter prefers a to b • A ranking may be a total or partial order • The rank aggregation problem: combine σ1,…,σn into a single ranking σ on M, which represents the "social choice" of the voters • Rank aggregation function: f(σ1,…,σn) = σ • σ may be a total or partial order
Experiments: Distance Measures. Goal: Quantitatively compare different rank aggregation methods. Performance measures:
(1) The Spearman footrule distance is the sum of pointwise rank differences, F(s, t) = Σi |s(i) − t(i)|. It is normalized by dividing by its maximum value (1/2)|S|², giving a value between 0 and 1.
(2) The Kendall tau distance counts the number of pairwise disagreements. Dividing by the maximum possible value (1/2)|S|(|S| − 1) gives a normalized version, again between 0 and 1.
(3) The induced footrule distance is obtained by projecting the full list s onto the elements of each partial list and computing the footrule distance on the projection. The induced Kendall tau distance is defined in the same manner.
(4) The scaled footrule distance weights the contribution of each element by the lengths of the lists it appears in. If s is a full list and t is a partial list, then SF(s, t) = Σi | s(i)/|s| − t(i)/|t| |, where the sum runs over the elements i of t. SF is normalized by dividing by |t|/2.
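As a concrete reference, here is a minimal Python sketch of these distance computations (the function names and the representation of rankings as Python sequences are my own choices, not from the lecture):

```python
from itertools import combinations

def footrule(s, t):
    """Normalized Spearman footrule distance between two full lists
    over the same set of items (0 = identical, 1 = maximally far apart)."""
    pos_s = {x: i + 1 for i, x in enumerate(s)}
    pos_t = {x: i + 1 for i, x in enumerate(t)}
    raw = sum(abs(pos_s[x] - pos_t[x]) for x in s)
    return raw / (0.5 * len(s) ** 2)

def kendall_tau(s, t):
    """Normalized Kendall tau distance: the fraction of pairs on which
    the two full lists disagree."""
    pos_s = {x: i for i, x in enumerate(s)}
    pos_t = {x: i for i, x in enumerate(t)}
    disagreements = sum(
        1 for a, b in combinations(s, 2)
        if (pos_s[a] < pos_s[b]) != (pos_t[a] < pos_t[b])
    )
    return disagreements / (0.5 * len(s) * (len(s) - 1))

def scaled_footrule(s, t):
    """Scaled footrule distance from a full list s to a partial list t:
    ranks are scaled by the respective list lengths and only the
    elements of t contribute; normalized by |t| / 2."""
    pos_s = {x: i + 1 for i, x in enumerate(s)}
    pos_t = {x: i + 1 for i, x in enumerate(t)}
    raw = sum(abs(pos_s[x] / len(s) - pos_t[x] / len(t)) for x in t)
    return raw / (len(t) / 2)
```

The induced footrule (or Kendall tau) distance is then just footrule (or kendall_tau) applied to the projection of the full list onto the elements of each partial list.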
Experiments: Distance Measures • For each aggregation method and each distance measure we get a vector of values, each component representing the distance from the aggregation to one voter's list • Simplest is to take the average (or 1-norm) • Other norms are also interesting • Mean square distance (2-norm) • Max distance (∞-norm)
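A small sketch of collapsing that vector into a single score under the three norms mentioned above (illustrative only; the function and parameter names are not from the lecture):

```python
def combine_distances(dists, norm="avg"):
    """Collapse a vector of per-voter distances into one number."""
    if norm == "avg":                                   # 1-norm, averaged
        return sum(dists) / len(dists)
    if norm == "l2":                                    # mean-square / 2-norm
        return (sum(d * d for d in dists) / len(dists)) ** 0.5
    if norm == "max":                                   # infinity-norm
        return max(dists)
    raise ValueError("unknown norm: %s" % norm)
```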
Experiments: Minimizing the Average. Search engines: Altavista (AV), Alltheweb (AW), Excite (EX), Google (GG), Hotbot (HB), Lycos (LY), and Northernlight (NL). K = Kendall distance, SF = scaled footrule distance, IF = induced footrule distance, LK = local Kemenization.
Experiments in Spam Filtering • Define spam to be web pages that are low-ranked by majority opinion (machine and human – a simplifying assumption), although they may be highly ranked by some search engines • Intuition: if a page spams most search engines for a particular query, then no combination of these search engines can filter the spam: garbage in, garbage out • Spam pages are the Condorcet losers, and will occupy the bottom of any ranking that satisfies the extended Condorcet criterion • Similarly, good pages will be among the Condorcet winners, and will rank above the losers.
Condorcet Criteria • Condorcet criterion: a candidate in M that beats every other candidate in pairwise simple majority voting should be ranked first. • Extended Condorcet Criterion (XCC): • Version 1: If a majority of voters prefer candidate a to candidate b (i.e., the number of i such that σi(a) < σi(b) is at least n/2), then σ should also prefer a to b (i.e., σ(a) < σ(b)). • Version 2: If there is a partition (W, L) of M such that for every x in W and y in L the majority prefers x to y, then every x in W must be ranked above every y in L. W is called the set of Condorcet winners and L the set of Condorcet losers.
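For reference, a sketch of the pairwise majority test that both versions rely on. Note this is an interpretation: the lecture counts "at least n/2" of all voters, while the sketch below uses a strict majority among the voters who rank both candidates, which is one reasonable reading when lists are partial:

```python
def majority_prefers(a, b, voter_lists):
    """True if strictly more voters rank a above b than b above a
    (voters who rank only one or neither of them abstain)."""
    a_over_b = b_over_a = 0
    for t in voter_lists:
        pos = {x: i for i, x in enumerate(t)}
        if a in pos and b in pos:
            if pos[a] < pos[b]:
                a_over_b += 1
            else:
                b_over_a += 1
    return a_over_b > b_over_a
```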
XCC(2) and Spam Filtering • Note that XCC(1) => XCC(2), so Version 1 is stronger • But XCC(1) is not always realizable • As we will see, XCC(2) is always realizable via local Kemenization • Hence using rank aggregation with XCC(2) should assist in spam filtering, since Condorcet losers will be ranked lowest • Let us look at where spam pages (human-determined) are ranked by good aggregation methods.
Experiment: Word Association • Different search engines and portals have different (default) semantics for handling a multi-word query. • Some use OR semantics (documents contain at least one of the query terms), while Google uses AND semantics (all the query words must appear). Both are inconvenient in many situations. • Consider searching for a software engineering job in an on-line job database. The user lists a number of skills and potential keywords from the job description, for example, "Silicon Valley C++ Java CORBA TCP-IP algorithms start-up pre-IPO stock options". The "AND" rule may well produce no documents (or only spam), and the "OR" rule is equally disastrous. • Experiment: issue multiple queries based on small subsets of the terms and rank-aggregate the results, as sketched below.
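As an illustration of that last point, the sub-queries could be generated with something like the following (the helper name and the choice of subset size k are hypothetical, not from the lecture):

```python
from itertools import combinations

def subset_queries(terms, k=2):
    """All k-term sub-queries of a long multi-word query; each is sent
    to the search engine(s) and the resulting rankings are aggregated."""
    return [" ".join(subset) for subset in combinations(terms, k)]

# Example: subset_queries("madras madurai coimbatore vellore".split(), k=2)
# -> ['madras madurai', 'madras coimbatore', ..., 'coimbatore vellore']
```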
Results for query: madras madurai coimbatore vellore (cities in the state of Tamil Nadu, India)
• Google:
www.mssrf.org/Fris9809/location-tamilnadu.html
www.indiaplus.com/Info/schools.html
www.focustamilnadu.com/tamilnadu/Policy%20Note ...Forests.html
www.tn.gov.in/policy/environ.htm
www.indiacolleges.com/Tamil_Nadu.htm
• SFO with LK:
www.madurai.com
www.ozemail.com.au/clday/locations.htm
www.utoledo.edu/homepages/speelam/coimbatore.html
www.ozemail.com.au/clday/madras.htm
www.madurai.com/around.htm
www.indiatraveltimes.com/tamilnadu/tamil1.html
• MC4 with LK:
www.madurai.com
www.surfindia.com/omsakthi/tourism.htm
www.indiatraveltimes.com/tamilnadu/tamil1.html
www.indiatraveltimes.com/tamilnadu/tamil2.html
www.indiatravels.com/forts/vellore_fort.htm
www.india-tourism.de/english/south/tamil_nadu.html
Locally Kemeny Optimal Aggregation and XCC(2) • Many existing aggregation methods satisfy neither XCC(1) nor XCC(2). • It is possible to use your favorite aggregation method to obtain a full list, then apply local Kemenization to realize XCC(2), which filters the Condorcet losers to the bottom of the list.
Locally Kemeny Optimal • Recall that computing a Kemeny optimal aggregation is NP-hard • Definition of locally optimal: a permutation p is a locally Kemeny optimal aggregation of partial lists t1, t2, ..., tk if there is no permutation p' that can be obtained from p by a single transposition of an adjacent pair of elements and for which the total Kendall distance K(p', t1, t2, ..., tk) < K(p, t1, t2, ..., tk). In other words, it is impossible to reduce the total distance to the t's by flipping an adjacent pair.
Example of LKO but not KO • Example 1: t1 = (1,2), t2 = (2,3), t3 = t4 = t5 = (3,1). • p = (1,2,3) satisfies the definition of LKO, with K(p, t1, t2, ..., t5) = 3, but the (non-adjacent) transposition of 1 and 3, giving (3,2,1), decreases the sum to 2, so p is not Kemeny optimal.
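A brute-force sketch of checking the definition directly on this example (assuming full and partial lists are given as Python sequences; the function names are my own):

```python
from itertools import combinations

def total_kendall(p, voter_lists):
    """Sum, over all voter lists, of the pairs on which p and the voter
    list disagree (pairs not ranked by a partial list are skipped)."""
    pos_p = {x: i for i, x in enumerate(p)}
    total = 0
    for t in voter_lists:
        pos_t = {x: i for i, x in enumerate(t)}
        for a, b in combinations(p, 2):
            if a in pos_t and b in pos_t and \
               (pos_p[a] < pos_p[b]) != (pos_t[a] < pos_t[b]):
                total += 1
    return total

def is_locally_kemeny_optimal(p, voter_lists):
    """True if no single adjacent transposition lowers the total Kendall distance."""
    base = total_kendall(p, voter_lists)
    for i in range(len(p) - 1):
        q = list(p)
        q[i], q[i + 1] = q[i + 1], q[i]
        if total_kendall(q, voter_lists) < base:
            return False
    return True

ts = [(1, 2), (2, 3), (3, 1), (3, 1), (3, 1)]
print(total_kendall((1, 2, 3), ts))              # 3
print(is_locally_kemeny_optimal((1, 2, 3), ts))  # True: locally optimal
print(total_kendall((3, 2, 1), ts))              # 2, so (1, 2, 3) is not Kemeny optimal
```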
LKO satisfies XCC(2) • Proof by contradiction: if the result is false, then there exist partial lists t1, t2, ..., tk, an LKO aggregation p, and a partition (W, L) that violates XCC(2); that is, there is some pair c in W and d in L such that p(d) < p(c). Let (c, d) be such a pair with c and d closest together in p. • Consider the immediate successor of d in p, call it e. If e = c, then c is adjacent to d in p, and transposing this adjacent pair (which a majority endorses, since c is in W and d in L) produces a p' such that K(p', t1, t2, ..., tk) < K(p, t1, t2, ..., tk), contradicting the local optimality of p. • If e ≠ c, then either e is in W, in which case (e, d) is a closer violating pair in p than (c, d), or e is in L, in which case (c, e) is a closer violating pair than (c, d). Both cases contradict the choice of (c, d).
Local Kemenization Procedure • A local Kemenization of a full list μ with respect to the preference lists computes a locally Kemeny optimal aggregation that is maximally consistent with μ. This approach: (1) preserves the strengths of the initial aggregation, (2) ranks non-spam above spam, (3) gives a result that disagrees with μ on a pair (i, j) only if a majority endorses this disagreement, and (4) for every d, 1 ≤ d ≤ |μ|, the restriction of the output to the top d elements of μ is a local Kemenization of those elements.
Local Kemenization Procedure • A simple inductive construction. • Assume inductively that we have constructed p, a local Kemenization of the projection of the t's onto the first l−1 elements of μ. • Insert the next element x (the l-th element of μ) into the lowest-ranked "permissible" position in p: just below the lowest-ranked element y in p such that (a) no majority among the (original) t's prefers x to y, and (b) for all successors z of y in p there is a majority that prefers x to z. • In other words, we try to insert x at the end (bottom) of the list p and bubble it up toward the top of the list as long as a majority of the t's insists that we do, as in the sketch below.
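A compact sketch of this insertion-and-bubble procedure (assuming the initial aggregation mu is a full list and the t's are partial lists; as before, "majority" here means a strict majority among voters ranking both elements, which is an interpretation rather than the lecture's exact wording):

```python
def local_kemenize(mu, voter_lists):
    """Build a locally Kemeny optimal list that is maximally consistent
    with the initial aggregation mu: insert mu's elements one at a time,
    bubbling each new element upward while a majority of the voter lists
    prefers it to the element directly above it."""
    positions = [{x: i for i, x in enumerate(t)} for t in voter_lists]

    def majority_prefers(a, b):
        # strict majority among voters that rank both a and b
        a_wins = b_wins = 0
        for pos in positions:
            if a in pos and b in pos:
                if pos[a] < pos[b]:
                    a_wins += 1
                else:
                    b_wins += 1
        return a_wins > b_wins

    p = []
    for x in mu:
        p.append(x)                               # start at the bottom
        i = len(p) - 1
        while i > 0 and majority_prefers(x, p[i - 1]):
            p[i], p[i - 1] = p[i - 1], p[i]       # bubble x up one slot
            i -= 1
    return p
```

By construction, the output disagrees with mu on a pair only when a majority of the t's endorses the swap, matching property (3) on the previous slide. On the worked example of the next slide, local_kemenize("BADCEF", ["ABFECD", "BCAEFD", "ACFDEB", "BFDCAE", "CABFED"]) returns the list A, B, C, F, E, D.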
Example of the Local Kemenization Procedure • Voter lists: t1 = (A, B, F, E, C, D), t2 = (B, C, A, E, F, D), t3 = (A, C, F, D, E, B), t4 = (B, F, D, C, A, E), t5 = (C, A, B, F, E, D); initial aggregation μ = (B, A, D, C, E, F). • Majority tallies decide the bubbling: A>B: 3 vs. A<B: 2, so A moves above B; B>D: 4 vs. B<D: 1, so D stays below B. • Inserting μ's elements one at a time gives (B) → (A, B) → (A, B, D) → (A, B, C, D) → (A, B, C, E, D) → final aggregation (A, B, C, F, E, D).
RA and Searching the Workplace Web • Axiom 1: Intranet documents are not spam • Axiom 2: Queries usually have unique answers (not broad, topic-based) • Axiom 3: Intranet docs are not search-engine friendly (docs are accessed through portals and database queries) • Rank aggregation allows us to combine a number of heuristic alternatives: static and dynamic, query-dependent and query-independent.