Crowdsourcing Preference Judgments for Evaluation of Music Similarity Tasks
Julián Urbano, Jorge Morato, Mónica Marrero and Diego Martín
jurbano@inf.uc3m.es
SIGIR CSE 2010 · Geneva, Switzerland · July 23rd
Outline • Introduction • Motivation • Alternative Methodology • Crowdsourcing Preferences • Results • Conclusions and Future Work
Evaluation Experiments • Essential for Information Retrieval [Voorhees, 2002] • Traditionally followed the Cranfield paradigm • Relevance judgments are the most important part of test collections (and the most expensive) • In the music domain, evaluation was not taken seriously until very recently • MIREX appeared in 2005 [Downie et al., 2010] • Additional problems with the construction and maintenance of test collections [Downie, 2004]
Music Similarity Tasks • Given a music piece (i.e. the query) return a ranked list of other pieces similar to it • Actual music contents, forget the metadata! • It comes in two flavors • Symbolic Melodic Similarity (SMS) • Audio Music Similarity (AMS) • It is inherently more complex to evaluate • Relevance judgments are very problematic
Relevance (Similarity) Judgments • Relevance is usually considered on a fixed scale • Relevant, not relevant, very relevant… • For music similarity tasks relevance is rather continuous [Selfridge-Field, 1998][Typke et al., 2005][Jones et al., 2007] • Single melodic changes are not perceived to change the overall melody • Move a note up or down in pitch, shorten it, etc. • But the similarity is weaker as more changes apply • Where is the line between relevance levels?
Partially Ordered Lists • The relevance of a document is implied by its position in a partially ordered list [Typke et al., 2005] • Does not need any prefixed relevance scale • Ordered groups of equally relevant documents • The order of the groups must be kept • Permutations within the same group are allowed • Assessors only need to be sure that any pair of documents is ordered properly (see the sketch below)
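As a concrete illustration (ours, not from the paper), a partially ordered list can be represented as an ordered sequence of groups, and checking whether a pair of documents is ordered properly reduces to comparing group indices. The function names below are our own.

```python
# A minimal sketch (ours, not the authors' code): a partially ordered list as
# an ordered sequence of groups, where documents in the same group count as
# equally relevant and may be freely permuted.

def group_index(pol, doc):
    """Index of the group in `pol` (a list of sets) that contains `doc`."""
    for i, group in enumerate(pol):
        if doc in group:
            return i
    raise KeyError(doc)

def pair_ordered_correctly(pol, first, second):
    """True if ranking `first` before `second` respects the ground truth:
    `first` is in an earlier (more relevant) group, or both share a group."""
    return group_index(pol, first) <= group_index(pol, second)

pol = [{"A", "B", "C"}, {"D", "E"}]               # two groups, most similar first
assert pair_ordered_correctly(pol, "A", "D")      # correct group order
assert pair_ordered_correctly(pol, "B", "A")      # same group: any permutation is fine
assert not pair_ordered_correctly(pol, "E", "C")  # violates the group order
```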
Partially Ordered Lists (and III) • Used in the first edition of MIREX in 2005 [Downie et al., 2005] • Widely accepted by the MIR community to report new developments [Urbano et al., 2010a][Pinto et al., 2008][Hanna et al., 2007][Grachten et al., 2006] • MIREX was forced to move to traditional level-based relevance in 2006 • Partially ordered lists are expensive • And have some inconsistencies
Expensiveness • The ground truth for just 11 queries required 35 music experts working for 2 hours [Typke et al., 2005] • Only 11 of them had time to work on all 11 queries • This exceeds MIREX’s resources for a single task • MIREX had to move to level-based relevance • BROAD: Not Similar, Somewhat Similar, Very Similar • FINE: numerical, from 0 to 10 with one decimal digit • Problems with assessor consistency came up
Issues with Assessor Consistency • The line between levels is certainly unclear[Jones et al., 2007][Downie et al., 2010]
Original Methodology • Go back to partially ordered lists • Filter the collection • Have the experts rank the candidates • Arrange the candidates by rank • Aggregate candidates whose ranks are not significantly different (Mann-Whitney U; sketched below) • There are known odd results and inconsistencies [Typke et al., 2005][Hanna et al., 2007][Urbano et al., 2010b] • Disregard changes that do not alter the actual perception, such as changes of clef, key signature or time signature • Something like changing the language of a text and using synonyms [Urbano et al., 2010a]
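A hedged sketch of that aggregation step, as we read it: candidates are ordered by their average expert rank, and adjacent candidates are merged into the same group whenever a Mann-Whitney U test cannot tell their rank samples apart. The greedy adjacent-merge rule and the 0.05 significance level are our assumptions, not necessarily the exact procedure of Typke et al. (2005).

```python
# Hedged sketch of building a partially ordered list from expert ranks.
from statistics import mean
from scipy.stats import mannwhitneyu

def build_partially_ordered_list(rank_samples, alpha=0.05):
    """rank_samples: dict mapping candidate -> list of ranks given by experts."""
    # Order candidates by mean rank (lower rank = judged more similar).
    ordered = sorted(rank_samples, key=lambda c: mean(rank_samples[c]))
    groups = [[ordered[0]]]
    for cand in ordered[1:]:
        prev = groups[-1][-1]
        # Two-sided test: are the two rank samples significantly different?
        _, p = mannwhitneyu(rank_samples[prev], rank_samples[cand],
                            alternative="two-sided")
        if p < alpha:
            groups.append([cand])    # significantly different: start a new group
        else:
            groups[-1].append(cand)  # indistinguishable: same group
    return groups

ranks = {
    "doc1": [1, 1, 2, 1, 2],   # consistently ranked near the top
    "doc2": [2, 3, 1, 2, 1],
    "doc3": [5, 4, 5, 5, 4],   # clearly ranked lower
}
print(build_partially_ordered_list(ranks))  # e.g. [['doc1', 'doc2'], ['doc3']]
```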
Alternative Methodology • Minimize inconsistencies [Urbano et al., 2010b] • Make the whole process cheaper • Reasonable Person hypothesis [Downie, 2004] • With crowdsourcing (finally) • Use Amazon Mechanical Turk • Get rid of experts [Alonso et al., 2008][Alonso et al., 2009] • Work with “reasonable turkers” • Explore other domains to apply crowdsourcing
Equally Relevant Documents • Experts were forced to give totally ordered lists • One would expect ranks to randomly average out • Half the experts prefer one document • Half the experts prefer the other one • That is hardly the case • Do not expect similar ranks if the experts cannot give similar ranks in the first place
Give Audio instead of Images • Experts may be guided by the images, not the music • Some irrelevant changes in the image can deceive them • No music expertise should be needed • Reasonable person/turker hypothesis
Preference Judgments • In their heads, experts actually make preference judgments • Similar to a binary search • Accelerates assessor fatigue as the list grows • Already noted for level-based relevance • Go back and re-judge [Downie et al., 2010][Jones et al., 2007] • Overlap between BROAD and FINE scores • Change the relevance assessment question • Which is more similar to Q: A or B? [Carterette et al., 2008]
Preference Judgments (II) • Better than traditional level-based relevance • Inter-assessor agreement • Time to answer • In our case, three-point preferences • A < B (A is more similar) • A = B (they are equally similar/dissimilar) • A > B (B is more similar)
Preference Judgments (and III) • Use a modified QuickSort algorithm to sort documents in a partially ordered list • Do not need all O(n²) judgments, but O(n·log n) • (Figure omitted: sorting example; legend: X is the current pivot of its segment, X has already been a pivot)
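Below is a hedged sketch (our reading, not the authors' exact implementation) of such a QuickSort variant driven by three-point preferences: documents judged equal to the pivot fall into the pivot's group, so the output is a partially ordered list after roughly O(n·log n) pairwise judgments. The `prefer` callback stands in for the aggregated crowd judgment of a pair and is our own abstraction.

```python
# `prefer(a, b)` returns -1 if a is more similar to the query, +1 if b is,
# and 0 if they are judged equally similar; in practice it would aggregate
# the crowdsourced judgments for that pair.

def preference_quicksort(docs, prefer):
    """Return a partially ordered list (a list of groups, most similar first)
    using ~O(n log n) pairwise preference judgments."""
    if not docs:
        return []
    pivot, rest = docs[0], docs[1:]
    more, equal, less = [], [pivot], []
    for doc in rest:
        outcome = prefer(doc, pivot)
        if outcome < 0:      # doc preferred over the pivot
            more.append(doc)
        elif outcome > 0:    # pivot preferred over doc
            less.append(doc)
        else:                # judged equally similar: same group as the pivot
            equal.append(doc)
    return preference_quicksort(more, prefer) + [equal] + preference_quicksort(less, prefer)

# Toy usage with a known ordering standing in for human judgments.
true_rank = {"A": 0, "B": 0, "C": 1, "D": 2, "E": 2}
prefer = lambda a, b: (true_rank[a] > true_rank[b]) - (true_rank[a] < true_rank[b])
print(preference_quicksort(list("DBACE"), prefer))  # [['B', 'A'], ['C'], ['D', 'E']]
```

Taking the first document of each segment as the pivot is only for simplicity here; as the Future Work slide notes, the choice of pivots matters (the query itself, for instance, would provide no information).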
How Many Assessors? • Ranks are given to each document in a pair • +1 if it is preferred over the other one • -1 if the other one is preferred • 0 if they were judged equally similar/dissimilar • Test for signed differences in the samples (see the sketch below) • In the original lists 35 experts were used • Ranks of a document ranged from 1 to more than 20 • Our rank sample is less (and equally) variable • rank(A) = -rank(B) ⇒ var(A) = var(B) • Effect size is larger, so statistical power increases • Fewer assessors are needed overall
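As one concrete (assumed) instantiation of that test, a simple sign test on the non-tied votes decides whether one document of a pair is significantly preferred. The slides do not spell out the exact test, so `binomtest` and the 0.05 level below are our choices.

```python
# Hedged illustration: decide from a handful of workers whether one document
# in a pair is significantly preferred. Requires SciPy >= 1.7 for binomtest.
from scipy.stats import binomtest

def pair_decision(votes, alpha=0.05):
    """votes: iterable of +1 / -1 / 0 judgments for a pair (A, B).
    Returns 'A', 'B' or 'tie' depending on whether the preference is significant."""
    a_wins = sum(1 for v in votes if v > 0)
    b_wins = sum(1 for v in votes if v < 0)
    n = a_wins + b_wins
    if n == 0:
        return "tie"                    # everyone judged them equal
    p = binomtest(a_wins, n, p=0.5, alternative="two-sided").pvalue
    if p < alpha:
        return "A" if a_wins > b_wins else "B"
    return "tie"

print(pair_decision([+1, +1, +1, +1, +1, +1, 0, +1, -1, +1]))  # -> 'A'
```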
Crowdsourcing Preferences • Crowdsourcing seems very appropriate • Reasonable person hypothesis • Audio instead of images • Preference judgments • QuickSort for partially ordered lists • The task can be split into very small assignments • It should be much cheaper and more consistent • No experts needed • Workers are not deceived, which increases consistency • Easier and faster to judge • Fewer judgments and judges needed
New Domain of Application • Crowdsourcing has been used mainly to evaluate text documents in English • How about other languages? • Spanish [Alonso et al., 2010] • How about multimedia? • Image tagging? [Nowak et al., 2010] • Music similarity?
Data • MIREX 2005 Evaluation collection • ~550 musical incipits in MIDI format • 11 queries also in MIDI format • 4 to 23 candidates per query • Convert to MP3 as it is easier to play in browsers • Trim the leading and trailing silence • From 1-57 secs. (mean 6) down to 1-26 secs. (mean 4) • 4 to 24 secs. (mean 13) to listen to all 3 incipits • Uploaded all MP3 files and a Flash player to a private server to stream data on the fly
HIT Design • 2 yummy cents of dollar
Threats to Validity • Basically had to randomize everything • Initial order of candidates in the first segment • Alternate between queries • Alternate between pivots of the same query • Alternate pivots as variations A and B • Let the workers know about this randomization • In first trials some documents were judged more similar to the query than the query itself! • Require at least 95% acceptance rate • Ask for 10 different workers per HIT [Alonso et al., 2009] • Beware of bots (always judged equal in 8 secs.)
Summary of Submissions • The 11 lists account for 119 candidates to judge • Sent 8 batches (QuickSort iterations) to MTurk • Had to judge 281 pairs (38%) = 2810 judgments • 79 unique workers over about a day and a half • A total cost (excluding trials) of $70.25
Feedback and Music Background • 23 of the 79 workers gave us feedback • 4 very positive comments: very relaxing music • 1 greedy worker: give me more money • 2 technical problems loading the audio in 2 HITs • Not reported by any of the other 9 workers • 5 reported no music background • 6 had formal music education • 9 professional practitioners for several years • 9 play an instrument, mainly piano • 6 performers in choir
Agreement between Workers • Forget about Fleiss’ Kappa • Does not account for the size of the disagreement • A<B and A=B is not as bad as A<B and B<A • Look at all 45 pairs of judgments per pair • +2 if total agreement (e.g. A<B and A<B) • +1 if partial agreement (e.g. A<B and A=B) • 0 if no agreement (i.e. A<B and B<A) • Divide by 90 (all pairs with total agreement) • Average agreement score per pair was 0.664 • From 0.506 (iteration 8) to 0.822 (iteration 2)
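A small sketch of that agreement score for a single document pair: with 10 workers there are 45 worker-worker combinations, each scored 2, 1 or 0 and normalised by the maximum of 90. The example judgments are made up.

```python
# Agreement score for one document pair, as described on the slide above.
from itertools import combinations
from math import comb

def agreement_score(judgments):
    """judgments: list of '<', '=', '>' labels, one per worker (10 in the paper)."""
    score = 0
    for a, b in combinations(judgments, 2):
        if a == b:
            score += 2            # total agreement (e.g. A<B and A<B)
        elif "=" in (a, b):
            score += 1            # partial agreement (e.g. A<B and A=B)
        # opposite preferences (A<B vs B<A) add nothing
    return score / (2 * comb(len(judgments), 2))   # divide by 90 for 10 workers

print(round(agreement_score(["<", "<", "<", "=", "<", ">", "<", "<", "=", "<"]), 3))  # 0.667
```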
Agreement Workers-Experts • Those 10 judgments were actually aggregated • (Table omitted: worker vs. expert judgments; percentages per row total) • 155 (55%) total agreement • 102 (36%) partial agreement • 23 (8%) no agreement • Total agreement score = 0.735 • Supports the reasonable person hypothesis
Agreement (Summary) • Very similar judgments overall • The reasonable person hypothesis still stands • Crowdsourcing seems a doable alternative • No music expertise seems necessary • We could use just one assessor per pair • If we could keep him/her throughout the query
Ground Truth Similarity • Do high agreement scores translate into highly similar ground truth lists? • Consider the original lists (All-2) as ground truth • And the crowdsourced lists as a system’s result • Compute the Average Dynamic Recall [Typke et al., 2006] • And then the other way around • Also compare with the (more consistent) original lists aggregated in Any-1 form [Urbano et al., 2010b]
Ground Truth Similarity (II) • The result depends on the initial ordering • Ground truth = (A, B, C), (D, E) • Results1 = (A, B), (D, E, C) • ADR score = 0.933 • Results2 = (A, B), (C, D, E) • ADR score = 1 • Results1 and Results2 are identical as partially ordered lists • Generate 1000 versions by randomly permuting the documents within each group and average the ADR (sketched below)
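A hedged sketch of this comparison: Average Dynamic Recall as we read Typke et al. (2006) (at result position i, the relevant set is the smallest prefix of ground-truth groups holding at least i documents), plus the within-group permutation trick from this slide. The toy example reproduces the 0.933 and 1.0 scores above.

```python
# Average Dynamic Recall (our reading of Typke et al., 2006) and averaging
# over random within-group permutations of the result list.
import random

def adr(ground_truth_groups, result):
    n = sum(len(g) for g in ground_truth_groups)
    total, allowed, g = 0.0, set(), 0
    for i in range(1, n + 1):
        while len(allowed) < i:                 # grow the allowed set group by group
            allowed |= set(ground_truth_groups[g])
            g += 1
        total += len(set(result[:i]) & allowed) / i
    return total / n

def mean_adr_over_permutations(ground_truth_groups, result_groups, trials=1000, seed=0):
    """Average ADR over random permutations within each result group, so the
    score does not depend on the arbitrary initial ordering inside a group."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        flat = []
        for group in result_groups:
            group = list(group)
            rng.shuffle(group)
            flat.extend(group)
        scores.append(adr(ground_truth_groups, flat))
    return sum(scores) / len(scores)

gt = [["A", "B", "C"], ["D", "E"]]
print(adr(gt, ["A", "B", "D", "E", "C"]))   # 0.933... (Results1 from the slide)
print(adr(gt, ["A", "B", "C", "D", "E"]))   # 1.0      (Results2 from the slide)
print(mean_adr_over_permutations(gt, [["A", "B"], ["C", "D", "E"]]))
```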
Ground Truth Similarity (and III) • (Table omitted: ADR scores, with min. and max. between square brackets) • Very similar to the original All-2 lists • Like the Any-1 version, also more restrictive • More consistent (workers were not deceived)
MIREX 2005 Revisited • Would the evaluation have been affected? • Re-evaluated the 7 systems that participated • Included our Splines system [Urbano et al., 2010a] • All systems perform significantly worse • ADR score drops between 9-15% • But their ranking is just the same • Kendall’s τ = 1
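For illustration only, Kendall's τ between the system rankings under the two sets of lists can be computed as below; the scores are made-up placeholders, not the MIREX 2005 figures.

```python
# Ranking comparison: Kendall's tau between system orderings under the
# original and the crowdsourced ground truth (scores below are fabricated).
from scipy.stats import kendalltau

original_adr     = {"sys1": 0.80, "sys2": 0.75, "sys3": 0.70, "sys4": 0.60}
crowdsourced_adr = {"sys1": 0.71, "sys2": 0.66, "sys3": 0.62, "sys4": 0.54}  # ~10% lower

systems = sorted(original_adr)
tau, _ = kendalltau([original_adr[s] for s in systems],
                    [crowdsourced_adr[s] for s in systems])
print(tau)  # 1.0: every system drops, but the ranking is unchanged
```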
Conclusions • Partially ordered lists should come back • We proposed an alternative methodology • Asked for three-point preference judgments • Used Amazon Mechanical Turk • Crowdsourcing can be used for music-related tasks • Provided empirical evidence supporting the reasonable person hypothesis • What for? • More affordable and large-scale evaluations
Conclusions (and II) • We need fewer assessors • More queries with the same man-power • Preferences are easier and faster to judge • Fewer judgments are required • Sorting algorithm • Avoid inconsistencies (A=B option) • Using audio instead of images gets rid of experts • From 70 expert hours to 35 hours for $70
Future Work • Choice of pivots in the sorting algorithm • e.g. the query itself would not provide information • Study the collections for Audio Tasks • They have more data • Inaccessible • But no partially ordered list (yet) • Use our methodology with one real expert judging preferences for the same query • Try crowdsourcing too with one single worker
Future Work (and II) • Experimental study on the characteristics of music similarity perception by humans • Is it transitive? • We assumed it is • Is it symmetrical? • If these properties do not hold we have problems • If they do, we can start thinking about Minimal and Incremental Test Collections [Carterette et al., 2005]
And That’s It! Picture by 姒儿喵喵