How (Not) to Use a Semi-automated Clustering Tool

Kat Hagedorn, University of Michigan, April 11, 2006

Update on UM’s efforts
• Built three research portals
    • DLF <http://www.hti.umich.edu/cgi/b/bib-idx?c=imls>
    • MODS <http://www.hti.umich.edu/m/mods>
    • Aquifer <http://www.hti.umich.edu/a/aquifer>
• Improvements for search / display
    • Integration of MODS format records
    • Simple vs. advanced searching
    • Inclusion of thumbnails

The need to cluster
• Want to offer more than search within a generic, large corpus of data
• How to partition the data?
• Emory’s MetaCombine tool promising as a topical clustering agent
• (Also interested in clustering by format, access restriction, OAI software used, etc.)

Clustering vs. classification
• Clustering is main focus
    • Huge amount of data
    • Needed a tool to “find the topic”
    • Preferably a disjunctive tool (placing files under more than one topic)
• Classification is secondary focus
    • Have potential classification (UM’s browse)
    • Marrying to current system nigh on impossible
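To make "disjunctive" concrete, here is a minimal sketch in which a record lands under every topic whose score clears a cutoff. The scores, the threshold and the function name are all hypothetical; this is not how MetaCombine assigns records.

```python
from collections import defaultdict


def assign_disjunctively(scores, threshold=0.3):
    """Place each record under every topic whose score meets the threshold.

    `scores` maps a record identifier to {topic: score}. Both the scores and
    the default threshold are made-up values for illustration.
    """
    clusters = defaultdict(list)
    for record_id, topic_scores in scores.items():
        for topic, score in topic_scores.items():
            if score >= threshold:
                # A record may be appended under several topics.
                clusters[topic].append(record_id)
    return dict(clusters)


# Hypothetical scores for two records against three topics.
example_scores = {
    "rec-1": {"europe": 0.8, "architecture": 0.5, "mechanical": 0.1},
    "rec-2": {"europe": 0.2, "architecture": 0.1, "mechanical": 0.9},
}
disjunctive_clusters = assign_disjunctively(example_scores)
```

In this example "rec-1" appears under both "europe" and "architecture", which a strictly partitioning tool would not allow.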

Results: duration
• First tried with small repository of ~5500 records (amnh)
    • Took around 25 minutes
• Multiple tries with larger repository of ~270K records (dlps)
    • Took around 12 hours
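As a rough back-of-the-envelope check, the two runs can be turned into throughput figures. The record counts and durations are the approximate values from the slide, so the results are only indicative.

```python
def records_per_minute(record_count, minutes):
    """Average throughput for one clustering run."""
    return record_count / minutes


# Approximate figures from the slide ("~5500", "~270K", "around").
small_rate = records_per_minute(5_500, 25)          # amnh: 220 records/minute
large_rate = records_per_minute(270_000, 12 * 60)   # dlps: 375 records/minute

print(f"amnh: {small_rate:.0f} records/minute")
print(f"dlps: {large_rate:.0f} records/minute")
```

The per-record rate is in the same range for both runs, so the 12 hours appears to reflect the size of the repository rather than a collapse in throughput on larger inputs.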

Results: cluster names
• Examples of set names from clustering UM’s metadata
    • Good: “europe”, “mechanical”, “architecture”
    • Not so good: “general”, “michigan”, “build”
    • Favorite: “southern literari literature fine messenger”
• Granted…
    • Only asked for 20 clusters
    • Didn’t cluster hierarchically

Caveats
• Metadata will always be difficult to cluster
• Using a tool developed as a Web service, with obvious benefits
• Expect necessity of mapping set names to real topical cluster names
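The mapping from generated set names to real topical names could be as simple as a hand-curated lookup table. The sketch below assumes such a table; the display labels on the right are illustrative guesses, not names UM actually chose.

```python
# Hypothetical hand-curated mapping from generated set names to display names.
# Keys are set names quoted on the slides; values are illustrative only.
SET_NAME_TO_TOPIC = {
    "europe": "Europe",
    "mechanical": "Mechanical Engineering",
    "architecture": "Architecture",
}


def display_name(set_name, mapping=SET_NAME_TO_TOPIC):
    """Return the curated topic name, or None if the set still needs review."""
    return mapping.get(set_name)


# Set names with no curated label are collected for a person to look at.
generated = ["europe", "general", "build", "architecture"]
needs_review = [name for name in generated if display_name(name) is None]
```

Returning None for unmapped names keeps vague sets such as "general" visible to an editor rather than silently publishing them.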

What we need
• Running the tool locally, with a local WSDL instance, would save lots (and lots) of time
• Better set names…does this mean a better algorithm?
• Ability to cluster by any criteria, not just topic, i.e., a post-processing module
• Disjunctive clustering, meaning (so as not to hog storage) filename (not file) clustering
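The last two wishes can be sketched together: a post-processing step that groups records by any metadata field and stores only identifiers under each cluster, so a record can sit in several clusters without being copied. The field names ("identifier", "format", "topic") and the sample records are assumptions for illustration, not the actual record structure.

```python
from collections import defaultdict


def cluster_identifiers(records, criterion):
    """Group record identifiers by the value(s) of one metadata field.

    Only identifiers are stored, never the records themselves, so a record
    listed under several clusters costs one short string per cluster.
    """
    clusters = defaultdict(set)
    for record in records:
        values = record.get(criterion, [])
        if isinstance(values, str):
            values = [values]  # treat a single value like a one-item list
        for value in values:
            clusters[value].add(record["identifier"])
    return {value: sorted(ids) for value, ids in clusters.items()}


# Hypothetical records; the field names are assumed.
records = [
    {"identifier": "oai:example:1", "format": "image",
     "topic": ["europe", "architecture"]},
    {"identifier": "oai:example:2", "format": "text", "topic": ["europe"]},
]
by_topic = cluster_identifiers(records, "topic")
by_format = cluster_identifiers(records, "format")
```

The same function serves topic, format, access restriction or any other field, which is the sense in which it acts as a post-processing module rather than a clustering algorithm.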
