
How (Not) to Use a Semi-automated Clustering Tool


Presentation Transcript


  1. How (Not) to Use a Semi-automated Clustering Tool
     Kat Hagedorn, University of Michigan, April 11, 2006

  2. Update on UM’s efforts
     • Built three research portals
       • DLF <http://www.hti.umich.edu/cgi/b/bib-idx?c=imls>
       • MODS <http://www.hti.umich.edu/m/mods>
       • Aquifer <http://www.hti.umich.edu/a/aquifer>
     • Improvements for search / display
       • Integration of MODS format records
       • Simple vs. advanced searching
       • Inclusion of thumbnails

  3. The need to cluster
     • Want to offer more than search within a generic, large corpus of data
     • How to partition the data?
     • Emory’s MetaCombine tool promising as a topical clustering agent
     • (Also interested in clustering by format, access restriction, OAI software used, etc.)
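
  As a rough illustration of the kind of topical clustering wanted here (not MetaCombine’s actual interface), a minimal sketch assuming scikit-learn, with invented record text standing in for harvested metadata:

     # Illustration only: cluster harvested metadata records by topic.
     # The record texts are hypothetical stand-ins for Dublin Core
     # titles/descriptions pulled from an OAI harvest.
     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.cluster import KMeans

     records = [
         "Photographs of southern architecture and building methods",
         "Southern Literary Messenger, volume 12, fine literature",
         "Mechanical engineering drawings, Michigan, 1905",
     ]

     vectorizer = TfidfVectorizer(stop_words="english")
     X = vectorizer.fit_transform(records)

     # Flat (non-hierarchical) clustering; 20 clusters were requested in the
     # actual experiment, capped here so the toy example still runs.
     km = KMeans(n_clusters=min(20, X.shape[0]), random_state=0, n_init=10)
     labels = km.fit_predict(X)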

  4. Clustering vs. classification
     • Clustering is the main focus
       • Huge amount of data
       • Needed a tool to “find the topic”
       • Preferably a disjunctive tool (placing files under more than one topic)
     • Classification is a secondary focus
       • Have a potential classification (UM’s browse)
       • Marrying it to the current system is nigh on impossible
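
  Continuing the sketch above (X and km come from there), one hedged way to get disjunctive assignment is to attach a record to every cluster whose centroid it is sufficiently similar to; the 0.2 cosine threshold is an arbitrary example value:

     import numpy as np
     from sklearn.metrics.pairwise import cosine_similarity

     # Similarity of every record to every cluster centroid (records x clusters).
     sims = cosine_similarity(X, km.cluster_centers_)

     # A record joins every cluster it is close enough to, so it can appear
     # under more than one topic (disjunctive clustering).
     memberships = [np.flatnonzero(row >= 0.2) for row in sims]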

  5. Results: duration
     • First tried with small repository of ~5500 records (amnh)
       • Took around 25 minutes
     • Multiple tries with larger repository of ~270K records (dlps)
       • Took around 12 hours

  6. Results: cluster names
     • Examples of set names from clustering UM’s metadata
       • Good: “europe”, “mechanical”, “architecture”
       • Not so good: “general”, “michigan”, “build”
       • Favorite: “southern literari literature fine messenger”
     • Granted…
       • Only asked for 20 clusters
       • Didn’t cluster hierarchically
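
  Set names like these presumably come from the highest-weighted terms in each cluster. A sketch of that naming step, continuing the earlier example (stems such as “literari” or “build” only appear when tokens are stemmed before vectorizing, which this sketch skips):

     # Continuing the earlier sketch: label each cluster with its top terms.
     terms = vectorizer.get_feature_names_out()
     order = km.cluster_centers_.argsort(axis=1)[:, ::-1]
     for k, row in enumerate(order):
         # Five highest-weighted terms, e.g. "southern literary messenger ..."
         print(f"cluster {k}:", " ".join(terms[i] for i in row[:5]))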

  7. Caveats
     • Metadata will always be difficult to cluster
     • Using a tool developed as a Web service, with obvious benefits
     • Expect necessity of mapping set names to real topical cluster names

  8. What we need
     • Running the tool locally, with a local WSDL instance, would save lots (and lots) of time
     • Better set names… does this mean a better algorithm?
     • Ability to cluster by any criterion, not just topic, i.e., a post-processing module
     • Disjunctive clustering, meaning filename (not file) clustering so as not to hog storage
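
  A minimal sketch of the post-processing idea above, in Python: group record identifiers (rather than the files themselves) under whatever criterion is at hand, so overlapping or repeated groupings cost almost no storage. The record fields and identifiers are hypothetical.

     from collections import defaultdict

     # Hypothetical harvested records reduced to an identifier plus a few facets.
     records = [
         {"id": "oai:dlps:123", "format": "image/jpeg", "restricted": False},
         {"id": "oai:amnh:456", "format": "text/xml", "restricted": True},
     ]

     def group_by(records, key):
         """Map each key value to the record identifiers (not files) that share it."""
         groups = defaultdict(list)
         for r in records:
             groups[key(r)].append(r["id"])
         return dict(groups)

     by_format = group_by(records, lambda r: r["format"])
     by_access = group_by(records, lambda r: r["restricted"])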
