Evaluating the Use of Clustering for Automatically Organising Digital Library Collections

Mark M. Hall, Mark Stevenson, Paul D. Clough
TPDL 2012, Cyprus, 24-27 September 2012

Opening Up Digital Cultural Heritage
http://www.flickr.com/photos/brokenthoughts/122096903/
Carl Collins http://www.flickr.com/photos/carlcollins/199792939/
http://www.flickr.com/photos/usnationalarchives/4069633668/

Exploring Collections
• Exploring / Browsing as an alternative to Search (where applicable)
• Requires some kind of structuring of the data
• Manual structuring ideal
  • Expensive to generate
  • Integration of collections problematic
• Alternative: Automatic structuring via clustering

Test Collection
• 28133 photographs provided by the University of St Andrews Library
• 85% pre 1940
• 89% black and white
• Majority UK
• Title and description tend to be short
  • Example: Ottery St Mary Church

Tested Clustering Strategies
• Latent Dirichlet Allocation (LDA)
  • 300 & 900 topics
  • With and without Pairwise Mutual Information (PMI) filtering
• K-Means
  • 900 clusters
  • TFIDF vectors & LDA topic vectors
• OPTICS
  • 900 clusters
  • TFIDF vectors & LDA topic vectors
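As a rough illustration of the TFIDF vectors that K-Means and OPTICS were run on, the sketch below builds sparse TF-IDF vectors from short titles and compares them with cosine similarity. It is not the authors' pipeline: the whitespace tokenisation, the exact weighting formula, the function names, and the second and third example titles are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Assumption: lower-cased whitespace tokenisation and
    weight = (term frequency / document length) * log(N / document frequency).
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical short titles; only the first appears on the slides.
docs = [
    "Ottery St Mary Church",
    "St Andrews Cathedral ruins",
    "Fishing boats in the harbour",
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

With titles this short, two items are similar only if they happen to share a term, which is one reason clustering short metadata is hard.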

Processing Time

Evaluation Metrics
• Cluster cohesion
  • Items in a cluster should be similar to each other
  • Items in a cluster should be different from items in other clusters
• How to test this?
  • “Intruder” test
  • If you insert an intruder into a cluster, can people find it?

Intruder Test
• Randomly select one topic
• Randomly select four items from the topic
• Randomly select a second topic – the “intruder” topic
• Randomly select one item from the second topic – the “intruder” item
• Scramble the five items and let the user choose which one is the “intruder”
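The steps above can be sketched as a small function that builds one test unit. This is a minimal sketch, not the project's code: the function name, the dict-of-lists data layout, and the assumption that every item belongs to exactly one topic are all hypothetical.

```python
import random


def make_intruder_unit(topics, rng=random):
    """Build one intruder-test unit.

    topics: dict mapping a topic id to the list of item ids assigned to it
            (assumed hard assignment: each item is in exactly one topic).
    rng:    the random module or a random.Random instance.
    """
    # Only topics with at least four items can supply the four regular items.
    eligible = [t for t, items in topics.items() if len(items) >= 4]
    topic = rng.choice(eligible)
    members = rng.sample(topics[topic], 4)

    # The intruder comes from a different, non-empty topic.
    others = [t for t, items in topics.items() if t != topic and items]
    intruder_topic = rng.choice(others)
    intruder = rng.choice(topics[intruder_topic])

    # Scramble so the position gives no hint about which item is the intruder.
    items = members + [intruder]
    rng.shuffle(items)
    return {
        "topic": topic,
        "intruder_topic": intruder_topic,
        "items": items,
        "intruder": intruder,
    }


# Hypothetical toy data: three topics of photograph ids.
toy_topics = {
    "churches": ["c1", "c2", "c3", "c4", "c5"],
    "harbours": ["h1", "h2", "h3", "h4"],
    "portraits": ["p1", "p2"],
}
unit = make_intruder_unit(toy_topics, random.Random(0))
print(unit["items"])
```

Passing a seeded `random.Random` makes the generated units reproducible, which helps when the same units have to be shown to many participants.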

Cluster Cohesion – Cohesive

Cluster Cohesion – Not Cohesive

Evaluation Metrics
• Cohesive
  • “Intruder” is chosen significantly more frequently than by chance
  • Choice distribution is significantly different from the uniform distribution
• Borderline cohesive
  • Two out of five items make up > 95% of the answers
  • “Intruder” is one of those two
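One possible reading of these criteria is sketched below. The slides do not name the statistical tests, so everything specific here is an assumption: a one-sided binomial test against a chance rate of 1/5 for the intruder, a chi-square goodness-of-fit test against the uniform distribution, a 0.05 significance level, requiring both cohesive criteria together, and checking "cohesive" before "borderline".

```python
import math


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def chi_square_uniform_p(counts):
    """p-value of a chi-square goodness-of-fit test against the uniform
    distribution, for exactly five answer options (4 degrees of freedom)."""
    if len(counts) != 5 or sum(counts) == 0:
        raise ValueError("expected five counts with at least one answer")
    expected = sum(counts) / 5
    stat = sum((c - expected) ** 2 / expected for c in counts)
    # Closed-form survival function of the chi-square distribution with 4 df.
    return math.exp(-stat / 2) * (1 + stat / 2)


def classify_unit(counts, intruder_index, alpha=0.05):
    """counts[i] = number of participants who picked item i as the intruder."""
    n = sum(counts)
    intruder_votes = counts[intruder_index]

    # Cohesive: intruder picked more often than chance AND answers non-uniform.
    above_chance = binomial_tail(intruder_votes, n, 1 / 5) < alpha
    non_uniform = chi_square_uniform_p(counts) < alpha
    if above_chance and non_uniform:
        return "cohesive"

    # Borderline: the two most-chosen items cover > 95% of the answers and
    # the intruder is one of them (the > 0 guard for ties is an added assumption).
    top_two = sorted(counts, reverse=True)[:2]
    if sum(top_two) > 0.95 * n and intruder_votes >= top_two[1] and intruder_votes > 0:
        return "borderline cohesive"
    return "not cohesive"


# Hypothetical answer counts for three units with 27 ratings each;
# the intruder is the item at index 2.
print(classify_unit([1, 1, 23, 1, 1], 2))
print(classify_unit([18, 0, 9, 0, 0], 2))
print(classify_unit([6, 5, 5, 6, 5], 2))
```

In the second unit participants are split between the intruder and one regular item, so the intruder is not picked significantly more often than chance, yet the two items together account for all answers.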

Evaluation Bounds
• Upper bound
  • Manual annotation
  • 936 topics
• Lower bound
  • 3 cohesive topics
  • <5% likelihood of seeing that number of cohesive topics by chance
• Control data
  • 10 “really, totally, completely obvious” intruders used to filter participants who randomly select answers

Experiment
• Crowd-sourced using staff & students at Sheffield University
• 700 participants
• 9 clustering strategies
• 30 units per strategy – total of 270 units
• Results
  • 8840 ratings
  • 21 – 30 ratings per unit (median 27 ratings)

Results

Conclusions
• K-means almost as good as the human classification
• LDA is very fast and approximately two thirds of the topics are acceptably cohesive
• Future work:
  • Make it hierarchical
  • Create hybrid algorithms

Thank you for listening
m.mhall@sheffield.ac.uk
Find out more about the project: http://www.paths-project.eu
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 270082. We acknowledge the contribution of all project partners involved in PATHS (see: http://www.paths-project.eu).
