Evaluating the Use of Clustering for Automatically Organising Digital Library Collections
Mark M. Hall, Mark Stevenson, Paul D. Clough
TPDL 2012, Cyprus, 24-27 September 2012
Opening Up Digital Cultural Heritage
http://www.flickr.com/photos/brokenthoughts/122096903/
Carl Collins http://www.flickr.com/photos/carlcollins/199792939/
http://www.flickr.com/photos/usnationalarchives/4069633668/
Exploring Collections
• Exploring / Browsing as an alternative to Search (where applicable)
• Requires some kind of structuring of the data
• Manual structuring ideal
  • Expensive to generate
  • Integration of collections problematic
• Alternative: Automatic structuring via clustering
Test Collection
• 28,133 photographs provided by the University of St Andrews Library
• 85% pre-1940
• 89% black and white
• Majority UK
• Title and description tend to be short
Example photograph: Ottery St Mary Church
Tested Clustering Strategies
• Latent Dirichlet Allocation (LDA)
  • 300 & 900 topics
  • With and without Pointwise Mutual Information (PMI) filtering
• K-Means
  • 900 clusters
  • TFIDF vectors & LDA topic vectors
• OPTICS
  • 900 clusters
  • TFIDF vectors & LDA topic vectors
Processing Time
Evaluation Metrics
• Cluster cohesion
  • Items in a cluster should be similar to each other
  • Items in a cluster should be different from items in other clusters
• How to test this?
• “Intruder” test
  • If you insert an intruder into a cluster, can people find it?
Intruder Test
• Randomly select one topic
• Randomly select four items from the topic
• Randomly select a second topic – the “intruder” topic
• Randomly select one item from the second topic – the “intruder” item
• Scramble the five items and let the user choose which one is the “intruder”
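The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the function name `build_intruder_unit`, the `topics` dictionary layout, and the rule that the intruder item must not also belong to the first topic are all assumptions made for the example.

```python
import random

def build_intruder_unit(topics, n_items=4, rng=None):
    """Build one intruder-test unit.

    topics: dict mapping a topic id to the list of item ids in that topic
            (hypothetical layout, chosen for this sketch).
    Returns a dict with the shuffled items and the id of the intruder.
    """
    rng = rng or random.Random()

    # Step 1: randomly select one topic that has enough items.
    eligible = [t for t, items in topics.items() if len(items) >= n_items]
    if not eligible:
        raise ValueError("no topic has enough items")
    main_topic = rng.choice(eligible)

    # Step 2: randomly select four items from that topic.
    main_items = rng.sample(topics[main_topic], n_items)

    # Step 3: randomly select a second topic, the "intruder" topic.
    # Assumption: items that also occur in the first topic are excluded.
    main_set = set(topics[main_topic])
    candidates = {
        t: [i for i in items if i not in main_set]
        for t, items in topics.items() if t != main_topic
    }
    candidates = {t: items for t, items in candidates.items() if items}
    if not candidates:
        raise ValueError("no usable intruder topic")
    intruder_topic = rng.choice(list(candidates))

    # Step 4: randomly select one item from the intruder topic.
    intruder = rng.choice(candidates[intruder_topic])

    # Step 5: scramble the five items before showing them to the user.
    unit = main_items + [intruder]
    rng.shuffle(unit)
    return {"items": unit, "intruder": intruder,
            "topic": main_topic, "intruder_topic": intruder_topic}
```

Passing a seeded `random.Random` instance as `rng` makes the generated units reproducible.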
Cluster Cohesion – Cohesive
Cluster Cohesion – Not Cohesive
Evaluation Metrics
• Cohesive
  • “Intruder” is chosen significantly more frequently than by chance
  • Choice distribution is significantly different from the uniform distribution
• Borderline cohesive
  • Two out of five items make up > 95% of the answers
  • “Intruder” is one of those two
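One way these criteria could be turned into a decision rule is sketched below. The slides do not say which significance tests were used, so the one-sided binomial test (intruder chosen more often than the 1-in-5 chance rate), the chi-square goodness-of-fit test against the uniform distribution, and the 0.05 threshold are all assumptions, as is the function name `classify_unit`.

```python
from math import comb, exp

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def chi2_sf_df4(x):
    """Upper tail of the chi-square distribution with 4 degrees of freedom
    (closed form, valid only for df = 4, i.e. five answer options)."""
    return exp(-x / 2) * (1 + x / 2)

def classify_unit(counts, intruder_index, alpha=0.05):
    """Classify one unit from its answer counts.

    counts: number of times each of the five items was picked.
    intruder_index: position of the true intruder in counts.
    alpha: significance level (assumed, not taken from the slides).
    """
    if len(counts) != 5:
        raise ValueError("this sketch assumes exactly five items")
    n = sum(counts)
    if n == 0:
        raise ValueError("no ratings")

    # Chi-square goodness of fit against the uniform distribution.
    expected = n / 5
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    non_uniform = chi2_sf_df4(chi2) < alpha

    # One-sided binomial test: intruder picked more often than chance (1/5).
    intruder_found = binomial_tail(counts[intruder_index], n, 1 / 5) < alpha

    if non_uniform and intruder_found:
        return "cohesive"

    # Borderline: the two most-picked items cover > 95% of the answers
    # and the intruder is one of them.
    top_two = sorted(range(5), key=lambda i: counts[i], reverse=True)[:2]
    if intruder_index in top_two and sum(counts[i] for i in top_two) > 0.95 * n:
        return "borderline"
    return "not cohesive"
```

For example, 27 ratings split `[1, 1, 23, 1, 1]` with the intruder at index 2 would be classified as cohesive under these assumptions, while an even split such as `[5, 6, 5, 6, 5]` would not.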
Evaluation Bounds
• Upper bound
  • Manual annotation
  • 936 topics
• Lower bound
  • 3 cohesive topics
  • <5% likelihood of seeing that number of cohesive topics by chance
• Control data
  • 10 “really, totally, completely obvious” intruders used to filter participants who randomly select answers
Experiment
• Crowd-sourced using staff & students at Sheffield University
  • 700 participants
  • 9 clustering strategies
  • 30 units per strategy – total of 270 units
• Results
  • 8840 ratings
  • 21 – 30 ratings per unit (median 27 ratings)
Results
Conclusions
• K-means almost as good as the human classification
• LDA is very fast and approximately two thirds of the topics are acceptably cohesive
• Future work:
  • Make it hierarchical
  • Create hybrid algorithms
Thank you for listening
m.mhall@sheffield.ac.uk
Find out more about the project: http://www.paths-project.eu
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 270082. We acknowledge the contribution of all project partners involved in PATHS (see: http://www.paths-project.eu).