meow::06
Clustering, Classification, and Metadata Enhancement Techniques on OAI Records
Kat Hagedorn, David Newman, Bill Landis (ex officio)
July 24, 2006
Clustering, Classification, and Metadata Enhancement Techniques on OAI Records
• Preprocessing and Topic Modeling
• The “Browser”
• Lessons Learned and Next Steps
Goals
• Evaluate topical/subject-based metadata enhancement
• Experiment on a testbed of multiple OAI repositories
• Discuss lessons learned and refine testing
• Propose products and services
Preprocessing & Topic Modeling > What We Did
[Diagram] Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics
Preprocessing & Topic Modeling > What We Did
[Diagram] Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics
[Diagram] Classify: OAI records → preprocess (vocabulary) → topic model (classify) → 1. topics in records, 2. records in topics
Preprocessing & Topic Modeling > What We Did
[Diagram] Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics
• Clustering is learning the topics
[Diagram] Classify: OAI records → preprocess (vocabulary) → topic model (classify) → 1. topics in records, 2. records in topics
• Classification is using the learned topics
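The cluster/classify split can be sketched in code. The slides do not name the topic model or its inference method, so everything below is an assumption: an LDA-style model fitted by collapsed Gibbs sampling, with hypothetical function names (`learn_topics`, `classify`) and hypothetical smoothing values (`alpha`, `beta`). Learning updates both the topic-word counts and the per-record counts; classification holds the learned topic-word counts fixed and only samples the topics of the new record.

```python
import random

def _sample_topic(rng, doc_counts, topic_word, topic_total, w, num_words, alpha, beta):
    """Draw a topic for word id w from the collapsed Gibbs conditional."""
    weights = [
        (doc_counts[k] + alpha) * (topic_word[k][w] + beta)
        / (topic_total[k] + num_words * beta)
        for k in range(len(topic_word))
    ]
    return rng.choices(range(len(weights)), weights=weights)[0]

def learn_topics(docs, num_words, num_topics, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Clustering: learn topic-word counts from records given as lists of word ids."""
    rng = random.Random(seed)
    topic_word = [[0] * num_words for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [[0] * num_topics for _ in docs]
    # Random initial topic assignment for every token.
    assign = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assign[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assign[d][i]
                # Remove the token from the counts, then resample its topic.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                t = _sample_topic(rng, doc_topic[d], topic_word, topic_total,
                                  w, num_words, alpha, beta)
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return topic_word, topic_total

def classify(doc, topic_word, topic_total, iters=40, alpha=0.1, beta=0.01, seed=0):
    """Classification: topic shares for one record, with the learned topics held fixed."""
    rng = random.Random(seed)
    num_topics, num_words = len(topic_word), len(topic_word[0])
    assign = [rng.randrange(num_topics) for _ in doc]
    counts = [0] * num_topics
    for t in assign:
        counts[t] += 1
    for _ in range(iters):
        for i, w in enumerate(doc):
            counts[assign[i]] -= 1
            assign[i] = _sample_topic(rng, counts, topic_word, topic_total,
                                      w, num_words, alpha, beta)
            counts[assign[i]] += 1
    denom = len(doc) + num_topics * alpha
    return [(c + alpha) / denom for c in counts]

# Toy data: four records over a four-word vocabulary (word ids 0..3).
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0], [3, 2, 3, 2]]
topic_word, topic_total = learn_topics(docs, num_words=4, num_topics=2)
shares = classify([0, 1, 0], topic_word, topic_total)
```

Because `classify` never modifies `topic_word`, new records can be classified without re-clustering, which is the distinction the slide draws.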
Preprocessing & Topic Modeling > Repository Selection
• Mix of cultural heritage repositories?
  • UMich, Library of Congress, CDL, State Lib of Victoria (Aust), …
  • Average of 15 words per record (excl. stopwords)
  • Topics often specific to collection (e.g., State Lib of Victoria)
  • Experience with CDL’s American West project
• Mix of scientific/research repositories?
  • CiteSeer, arXiv, PubMed, …
  • <description> is a reasonably reliable 200-word abstract
  • Average of 75 words per record
  • Topics more likely to span repositories
• For purposes of evaluation, used (mostly) English-language repositories
Preprocessing & Topic Modeling > Selected Repositories*
*Repositories harvested by UMich/OAIster, June 7, 2006.
Preprocessing & Topic Modeling > Usage of Dublin Core Fields
• Decided to use words from <title>, <description>, <subject> for clustering
• Idiosyncrasies
  • CiteSeer: repeats <author> and <title> in <subject>
  • CiteSeer: puts citations to other IDs in <description>
  • arXiv: puts, e.g., “Comment: 12 pages PostScript” in <description>
  • RePEc: no <subject>, repeats ID in <description>
  • etc.
• Approach: process all repositories identically, no special treatment
Preprocessing & Topic Modeling > Preprocessing Example
Input record:
<ID=oai:CiteSeerPSU:44072>
<title>Reinforcement Learning: A Survey
<description>This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." …
<subject>Leslie Pack Kaelbling, Michael Littman, Andrew Moore. Reinforcement Learning: A Survey
After preprocessing (with vocabulary):
<ID=oai:CiteSeerPSU:44072> reinforcement learning survey survey field reinforcement learning computer science perspective written accessible researcher familiar machine learning historical basis field broad selection current summarized reinforcement learning faced agent learn behavior trial error interaction dynamic environment resemblance psychology differ considerably detail word reinforcement … leslie pack kaelbling littman andrew moore reinforcement learning survey
Preprocessing & Topic Modeling > Stopwords and Stemming
• Standard: and, the, …
• Research-related: research, paper, data, system, method, result, …
• Repository-specific: cern, citeseer, repec, Smith, …
• All tokens starting with a digit: 1996, 401k, …
• Produced a stopword list of 500 words
• Applied very simple stemming (cars → car)
• Note: replacing collocations improves the interpretability of topics, but not their quality (los angeles → los_angeles)
• No need to find and exclude all stopwords, because the topic model helps find them (e.g., des, les, une, …); suppress after the fact
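A minimal sketch of preprocessing rules of this kind, under stated assumptions: the stopword set is a tiny illustrative subset (the actual 500-word list is not given in the slides), and `simple_stem` is a hypothetical plural-stripping rule standing in for the "very simple stemming", whose exact rules are not specified.

```python
import re

# Illustrative subset only; the real list had about 500 entries.
STOPWORDS = {"and", "the", "this", "of", "from", "a", "in",
             "research", "paper", "citeseer"}

def simple_stem(token):
    """Hypothetical stemming rule: strip one trailing 's' (cars -> car)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(text):
    """Lowercase, tokenize, drop digit-leading tokens and stopwords, then stem."""
    kept = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token[0].isdigit():        # e.g. 1996, 401k
            continue
        if token in STOPWORDS:
            continue
        token = simple_stem(token)
        if token not in STOPWORDS:    # the stemmed form may itself be a stopword
            kept.append(token)
    return kept

first_sentence = preprocess(
    "This paper surveys the field of reinforcement learning "
    "from a computer-science perspective."
)
digits_dropped = preprocess("401k cars in 1996")
```

On the first sentence of the CiteSeer abstract this yields `survey field reinforcement learning computer science perspective`, which agrees with the preprocessed text shown on the example slide.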
Preprocessing & Topic Modeling > Building Vocabulary
• Preprocessed (sampled) repositories, excluded stopwords
• Only kept words that occurred in more than 10 records
• Result: a final vocabulary of ~90,000 words
• Most frequent words: cell, high, energy, protein, function, algorithm, field, theory, physics, …
• Resulting discussion point: When do we need to re-create the vocabulary? (When classifying, new documents will be filtered through the existing vocabulary)
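The vocabulary rule and the filtering of new documents can be sketched as follows. The function names are hypothetical, and the toy records are invented purely to exercise the "more than 10 records" threshold.

```python
from collections import Counter

def build_vocabulary(records, min_records=10):
    """Keep words that occur in MORE than min_records preprocessed records."""
    record_freq = Counter()
    for tokens in records:
        record_freq.update(set(tokens))   # count each word once per record
    return {word for word, n in record_freq.items() if n > min_records}

def filter_through_vocabulary(tokens, vocabulary):
    """When classifying, words outside the existing vocabulary are dropped."""
    return [t for t in tokens if t in vocabulary]

# Toy data: "cell" is in 21 records, "protein" in 11, "rare" in only 10.
records = [["cell", "protein"]] * 11 + [["cell", "rare"]] * 10
vocabulary = build_vocabulary(records)
new_record = filter_through_vocabulary(["cell", "rare", "unseen"], vocabulary)
```

A word that first appears in a newly harvested repository is silently dropped by `filter_through_vocabulary`, which is why the slide raises the question of when the vocabulary has to be re-created.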
Preprocessing & Topic Modeling >
• Average of 75 words per record
• Distribution of record lengths is bimodal because the sample included records with abstracts and records without abstracts
• Topic model isn’t adversely affected by very short records
Preprocessing & Topic Modeling > Computation
• Clustering (Learning)
  D = 750,000 records
  W = 90,000-word vocabulary
  L = 75 words per record
  T = 500 topics
  iter = 500 iterations
  memory = 3DL + T(D+W) = 3 GByte
  time = D L T iter = 3 days (3 GHz Xeon)
• Classification
  D = 3,000,000 records total
  iter = 40 iterations
  max memory = 2 GByte
  max time = 5 hours (but repositories can run in parallel)
Decision point: How many topics?
Decision point: How many iterations?
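The clustering figures can be checked with a few lines of arithmetic. The 4-byte size per stored entry is an assumption (the slide gives only the formula and the rounded totals), and the throughput figure is simply the slide's operation count divided by its 3-day run time.

```python
# Values from the slide.
D = 750_000      # records
W = 90_000       # vocabulary size
L = 75           # words per record
T = 500          # topics
ITER = 500       # iterations

BYTES_PER_ENTRY = 4   # assumption: 32-bit integers

# memory = 3DL + T(D+W): three values per token, plus topic count tables.
entries = 3 * D * L + T * (D + W)
memory_gb = entries * BYTES_PER_ENTRY / 1e9

# time = D * L * T * iter: one per-topic computation per token per iteration.
operations = D * L * T * ITER
seconds_in_3_days = 3 * 24 * 3600
ops_per_second = operations / seconds_in_3_days

print(f"{entries:,} entries, about {memory_gb:.2f} GB at 4 bytes each")
print(f"{operations:.3e} operations, about {ops_per_second:.2e} per second over 3 days")
```

Under that assumption the formula gives about 2.4 GB, the same order as the slide's 3 GByte, and the run time corresponds to roughly 5 × 10^7 per-topic computations per second. The cost is linear in T and in the iteration count, which is why both appear as decision points.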
Preprocessing & Topic Modeling > Broad Topical Categories
• 500 topics too many to look at
• Need to organize topics under broad topical categories
• Cluster the clusters (automatic)
• Use pre-defined categories
• Classify group of keywords (manual + automatic)
• Create hierarchy by hand (manual)
Preprocessing & Topic Modeling > Broad Topical Categories
[Diagram] Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics
[Diagram] Cluster the clusters: topics → topic model (cluster/learn) → broad topical categories
Preprocessing & Topic Modeling > Broad Topical Categories
[Diagram] Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics
[Diagram] Cluster the clusters: topics → topic model (cluster/learn) → broad topical categories
[Diagram] Classify group of keywords: group of keywords → preprocess (vocabulary) → topic model (classify) → topics organized under broad topical categories
Clustering, Classification, and Metadata Enhancement Techniques on OAI Records
• Preprocessing and Topic Modeling
• The “Browser”
• Lessons Learned and Next Steps
The Browser > The “Browser”
• PHP/MySQL browser of 3 million OAI records*
• Preserving transparency for this audience
• Browser not meant for end users
• No search, no information architecture, etc.
• http://yarra.calit2.uci.edu/meow/
*Based on 750,000 sampled records from 9 repositories, 500 topics
The Browser > The “Browser”: http://yarra.calit2.uci.edu/meow/
The Browser > Selected Topics: Useful
• [ t201 ] learning machine training learn algorithm task examples reinforcement inductive learned learner supervised unsupervised
• [ t482 ] labor worker employment wage market labour job unemployment wages earning panel find evidence individual participation
• [ t381 ] algebraic geometry mathematic conjecture varieties projective variety theory cohomology moduli curves prove genus rational give math
• [ t097 ] dark matter universe astrophysic cosmological cosmic background density inflation spectrum power scale cmb halo cosmology gravitational
• [ t027 ] hiv virus human immunodeficiency type envelope infection viral cd4 infected gag replication reverse aid tat gp120
• [ t365 ] waste radioactive wastes tank nuclear facilities management hanford disposal fuel storage material processing facility site level
> show all 500 sub-topics (to see all 500 topics)
The Browser > Selected Topics: Less Useful
• [ t255 ] journal author chapter vol notes editor publication issue special bibliography reader references appendix literature submitted topic
• [ t328 ] paul mark thank andrew scott stephen alan steven miller george martin obituaries thesis daniel prof ian
• [ t384 ] supported part grant author foundation partially contract science national nsf support advanced ccr provided center agency
• [ t112 ] look people difficult thing need want fact reason help understand think say alway try easy bad
• [ t496 ] increase increased increases decrease increasing decreased decreases observed change decreasing significant caused decline
• [ t012 ] des les dan une est par sur pour qui nous sont aux ces analyse pay cette
But junk topics alleviate the need to find stopwords exhaustively: many useless words cluster into topics that can be suppressed, and such topics are very useful for filtering out French records.
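Suppression after the fact might look like the sketch below. The function name, the per-record dictionary of topic shares, and the 0.5 flagging threshold are all assumptions for illustration; the slides state only that junk topics can be suppressed and used to filter out French records.

```python
def suppress_junk(topic_shares, junk_topics, flag_threshold=0.5):
    """Drop junk topics from a record and flag records dominated by them.

    topic_shares: dict mapping topic id -> share of the record (hypothetical format).
    Returns (remaining topic shares, True if junk mass >= flag_threshold).
    """
    junk_mass = sum(s for t, s in topic_shares.items() if t in junk_topics)
    kept = {t: s for t, s in topic_shares.items() if t not in junk_topics}
    return kept, junk_mass >= flag_threshold

# Hand-labelled junk topics, using ids from the slide; the shares are invented.
JUNK = {"t012", "t255"}
french_record = suppress_junk({"t012": 0.7, "t381": 0.3}, JUNK)
math_record = suppress_junk({"t381": 0.9, "t255": 0.1}, JUNK)
```

A record dominated by the French-stopword topic t012 is flagged, while a record that merely touches the bibliographic topic t255 keeps its useful topic and is not flagged.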
The Browser > Broad Topical Categories (BTCs)
• By clustering the clusters
  • worked well
  • mathematics, global energy resources, …
  • can choose the desired number of broad topical categories (e.g., 25) and the thresholding
• By classifying groups of keywords
  • worked well too
• Then review and manually edit
  • include or exclude any subtopic
The Browser > BTCs: Clustering the clusters
The Browser > BTCs: Classifying group of keywords
Domain expert specifies a list of relevant keywords and (importance)
>>> Aerospace Engineering
stars (15) space (18) aeronautics (20) astronautics (20) rocket (12) shuttle (12) exploration (15) lander (3) planets (7) black holes (7) quasars (7) pulsars (7) observatories (10) air traffic (10) aircraft (15) aerospace (20) airplanes (10) airports (10) heliports (10) helicopters (10) aviation (18) FAA (7) airlines (12) flight (18) comets (10) meteorites (12) spacecraft (15) air force (7) pilots (7) jets (7) air travel (15) flying (18)
The Browser > BTCs: Classifying group of keywords
>>> Aerospace Engineering
[t192] (69%) vehicle flight vehicles engine car road speed nasa aircraft air
[t352] (13%) star solar planet mass astrophysic binary dwarf orbital sun companion
[t191] (8%) space spaces hilbert subspace dimensional subspaces defined exploration linear point
>>> Dermatology
[t388] (83%) infection skin disease tract respiratory fever burgdorferi caused wound arthritis
[t157] (8%) cancer tumor p53 breast carcinoma survival human tumour malignant prostate
[t071] (7%) growth tuberculosis mycobacterium growing grow igf factor bcg avium
>>> Geology and Earth Sciences
[t121] (73%) geothermal rock seismic energy mountain drilling fluid survey spring yucca
[t268] (12%) sea atmospheric climate ice ocean atmosphere cloud global wind aerosol
>>> Molecular, Cellular and Developmental Biology
[t276] (31%) molecular biological sciences molecules biology molecule quantitative biochemistry basic
[t417] (15%) cell apoptosis cellular death cultured bcl lines hela transfected mediated
[t355] (12%) brain neuron neuronal cortex synaptic cortical rat nervous cerebral dopamine
[t418] (9%) genes genome gene repeat chromosome sequences dna genomic sequence region
[t319] (7%) mice development mouse drosophila expression transgenic cell embryonic embryos gene
>>> Transportation
[t192] (85%) vehicle flight vehicles engine car road speed nasa aircraft air
Slide annotations:
• in review, would delete this topic from this BTC
• just found 1 topic relevant to transportation
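One simplified way to turn a weighted keyword group into a ranked list of topics with percentage shares is sketched below. This is not the procedure from the slides, which run the keyword group through the preprocessor and the topic model classifier; here each topic is scored by the importance-weighted sum of its word probabilities, as an assumed stand-in. The topic ids and probabilities are invented toy values.

```python
def rank_topics(keywords, topic_word_prob):
    """Rank topics for a keyword group.

    keywords: dict of preprocessed keyword -> importance weight.
    topic_word_prob: dict of topic id -> dict of word -> probability (toy values).
    Returns a list of (topic id, share) sorted by descending share.
    """
    scores = {
        topic: sum(weight * probs.get(word, 0.0) for word, weight in keywords.items())
        for topic, probs in topic_word_prob.items()
    }
    total = sum(scores.values())
    if total == 0:
        return []
    ranked = sorted(((s / total, t) for t, s in scores.items() if s > 0), reverse=True)
    return [(topic, share) for share, topic in ranked]

# Keyword weights taken from the Aerospace Engineering example (already stemmed).
aerospace = {"flight": 18, "aircraft": 15, "space": 18, "star": 15}

# Invented topic-word probabilities, for illustration only.
toy_topics = {
    "topic_flight": {"flight": 0.05, "aircraft": 0.04},
    "topic_star": {"star": 0.06},
    "topic_space": {"space": 0.08},
    "topic_labor": {"wage": 0.07},
}

ranking = rank_topics(aerospace, toy_topics)
```

Topics with no keyword overlap drop out of the ranking. Topics that match only an ambiguous keyword such as "space" still receive a share, which is the kind of result the manual review step is meant to catch.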
The Browser > Browse Records in a Topic
• Can navigate back to multiple BTCs
• Nice mix of repositories
The Browser > Browse Records in a Topic: From one repository
• Display records just from Library of Congress
The Browser > Sample Record
Murphy's Law in algebraic geometry: Badly-behaved deformation spaces
oai:arXiv.org:math/0411469 (link to actual OAI record)
> preprocessed text
murphy law algebraic geometry badly behaved deformation spaces consider question bad deformation space object answer priori reason deformation space bad moduli spaces precisely singularity finite type smooth parameter hilbert scheme curves projective space moduli spaces smooth projective type surfaces higher dimensional varieties plane curves nodes cusp stable sheaves isolated threefold singularities object pathological fact nice curves smooth surfaces ample canonical bundle stable sheaves torsion free rank singularities normal cohen macaulay justifies mumford philosophy moduli spaces behaved object arbitrarily bad priori reason construct smooth curve projective space deformation space component singularity type reduced behavior subschemes similarly give surface f_p lift course hold holomorphic category difficult compute deformation spaces directly obstruction theories circumvent relating tractable deformation spaces smooth morphism essential starting point mnev universality theorem mathematic algebraic geometry mathematic complex variables
> top topics (topics for this record)
[ t381 ] algebraic geometry mathematic conjecture varieties projective variety theory cohomology moduli curves prove genus rational give math
[ t191 ] space spaces
The Browser > Repository-specific Browsers
• Library of Congress (http://yarra.calit2.uci.edu/oai/loc/)
• University of Michigan (http://yarra.calit2.uci.edu/oai/umich/)
• University of Washington (http://yarra.calit2.uci.edu/oai/uwash/)
• African Journals Online (http://yarra.calit2.uci.edu/oai/africa/)
• and many more…
Clustering, Classification, and Metadata Enhancement Techniques on OAI Records
• Preprocessing and Topic Modeling
• The “Browser”
• Lessons Learned and Next Steps
Lessons Learned & Next Steps > Evaluation
• Topic modeling worked well
  • Most topics were useful
  • Drain on computer resources was reasonable
  • Human effort was relatively small
  • All repositories processed identically, no special treatment
• Strategy worked well
  • Clustering, then
  • Classification, and
  • Broad Topical Categories creation
Lessons Learned & Next Steps > Further Evaluation
• Current processing only for
  • English-language repositories
  • Science/research-based repositories
• Need to test cultural heritage repositories and foreign-language records
  • Less consistent descriptive language and length
  • “On-the-horse” problem more prevalent
  • Greater need to individually process repositories
• Also need usability testing to evaluate further
  • Depends on criteria: who are the users?
    • Librarians?
    • End-users?
  • Depends on products and services desired by users
Lessons Learned & Next Steps > Discussion Point: When to Re-cluster?
[Timeline diagram: a sequence of classify steps with occasional cluster steps]
• Need to re-cluster
  • when collection changes significantly
  • if there is a “hole” in topics
  • but NOT just because you have another repository
• If you re-cluster
  • all topics will be different
  • have to discard hand-labeling
  • Broad Topical Categories might be different
• However, classification is
  • “cheap” and easy
  • e.g., for OAIster, could re-classify every harvest… until spring clean
Lessons Learned & Next Steps > Products and Services
• Depending on users…
  • What kind of service is useful?
  • What should the interface to topics look/act like?
  • What kind of use should we envision?
• We have some ideas…
Lessons Learned & Next Steps > Archive of Topics
• Are the topics we created useful to anyone else?
• Scenario: librarian uses topics/classifier for local resources
• To use locally you need:
  • the preprocessor (i.e., the preprocessing rules)
  • the vocabulary (file of 90,000 words)
  • the topic model classifier
Lessons Learned & Next Steps > Subject Search/Browse for OAIster
• Integrate topics into OAIster
  • add to records so can perform canned topic search
  • add to interface so can browse BTCs to records
• Additionally, can allow users to find records similar to those retrieved
  • e.g., retrieved records on cosmology and can find similar records on astrophysics, relativity, …
• How to do this?
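The slide leaves "How to do this?" open. One possible approach, offered only as an assumption, is to compare records by the cosine similarity of their topic-share vectors. The record ids and three-topic vectors below are invented toy data, and `most_similar` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length topic-share vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(query_id, record_topics, k=2):
    """Return the ids of the k records whose topic shares are closest to the query."""
    query = record_topics[query_id]
    scored = [
        (cosine(query, shares), record_id)
        for record_id, shares in record_topics.items()
        if record_id != query_id
    ]
    scored.sort(reverse=True)
    return [record_id for _, record_id in scored[:k]]

# Toy topic shares over three topics: (cosmology-like, relativity-like, labor-like).
record_topics = {
    "cosmology_1": [0.8, 0.2, 0.0],
    "astrophysics_1": [0.7, 0.3, 0.0],
    "relativity_1": [0.6, 0.4, 0.0],
    "labor_1": [0.0, 0.1, 0.9],
}

similar = most_similar("cosmology_1", record_topics)
```

With 3 million records an exhaustive comparison per query would be costly, so a production service would likely restrict candidates first, for example to records sharing the query's top topic.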
How To Reach Us
• David Newman: University of California, Irvine <newman@uci.edu>
• Kat Hagedorn: University of Michigan <khage@umich.edu>
• Bill Landis: California Digital Library <bill.landis@ucop.edu>