Document Classification via Term Distribution Similarity Xin Lu, Angela Zoss CSCI B651 Final Project Presentation December 16, 2010
Classification of Scientific Disciplines • Goals • Classification using Map of Science clusters • Program • Matrix population • Matrix transformation • Vector comparison • Evaluation • By subdiscipline • By discipline
Classification Using Map of Science • Map of Science can be generated from documents automatically (citation analysis, LSA) • Resulting clusters are up-to-date, interlinked • The 554 clusters identified offer a more fine-grained classification system than many others (e.g., ISI uses 173 Subject Categories)
UCSD Map of Science • 7.2 million papers • 16,000 (serial) publication sources • 554 clusters of sources (subdisciplines) • 13 top-level disciplines Articles can be classified to clusters using either the 16k journal names or the 72k keywords that were extracted from article titles during the clustering process.
Article Mapping by Journal [Figure: papers mapped to clusters through their journals]
Article Mapping by Keyword [Figure: article keyword vectors matched to cluster keyword vectors via cosine similarity]
Program: Matrix Population • Obtained 200k articles from PubMed Central • Selected a subset of 8k articles that covered 214 map clusters, by journal name association • Used the ~30k keywords matched to those 214 clusters to generate a word frequency matrix for the 8k papers • A matrix with identical structure was created for the 214 clusters, but the values used were the match percentages assigned by the MoS
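The population step above can be sketched in Python. This is a minimal illustration, not the project's actual code; the inputs `articles` (article ID to title/abstract text) and `keywords` (the cluster keyword list) are hypothetical names.

```python
import re
from collections import Counter

def populate_matrix(articles, keywords):
    """Build a word-frequency matrix: one row per article,
    one column per cluster keyword."""
    matrix = []
    for text in articles.values():
        # crude tokenization; the project could use a more flexible regex
        counts = Counter(re.findall(r"[a-z']+", text.lower()))
        matrix.append([counts.get(kw, 0) for kw in keywords])
    return matrix

# tiny illustration with two articles and three keywords
freqs = populate_matrix(
    {"a1": "gene expression in gene networks", "a2": "protein folding"},
    ["gene", "protein", "network"],
)
```

The cluster matrix would share the same column structure, but its cells would hold the MoS match percentages rather than raw counts.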
Program: Matrix Transformation • Selecting terms based on inverse document frequency
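One way to realize this selection step, sketched in Python under the assumption that terms are kept when their IDF falls inside a chosen range (function and parameter names are illustrative):

```python
import math

def select_terms_by_idf(matrix, terms, low, high):
    """Keep terms whose inverse document frequency lies in [low, high].
    `matrix` is an article-by-term frequency matrix (list of rows)."""
    n_docs = len(matrix)
    kept = []
    for j, term in enumerate(terms):
        df = sum(1 for row in matrix if row[j] > 0)  # document frequency
        if df == 0:
            continue  # term never occurs; nothing to keep
        idf = math.log(n_docs / df)
        if low <= idf <= high:
            kept.append(term)
    return kept
```

Terms with very low IDF appear in nearly every article and carry little discriminating power; very high IDF terms are too rare to generalize, which is why a bounded range is plausible here.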
Program: Vector Comparison • Cosine similarity computed in Matlab (http://en.wikipedia.org/wiki/Cosine_similarity)
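The deck computed this in Matlab; an equivalent Python sketch of the same cosine comparison, with a hypothetical `best_cluster` helper showing how an article vector would be assigned to its most similar cluster:

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|); 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_cluster(article_vec, cluster_vecs):
    """Assign an article to the cluster whose keyword vector is most similar."""
    return max(cluster_vecs,
               key=lambda c: cosine_similarity(article_vec, cluster_vecs[c]))
```

Because cosine similarity normalizes by vector length, a long article and a short cluster profile can still match well when their term distributions agree.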
Evaluation by Subdiscipline • Overall recall: 20%
Evaluation by Discipline • Overall recall: 49%
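The recall figures above presumably compare the predicted cluster against the journal-assigned one; a minimal sketch of that scoring, assuming a mapping from subdiscipline to discipline (all names hypothetical):

```python
def recall(predicted, actual):
    """Fraction of articles whose predicted label matches the known label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

def to_discipline(labels, subdisc_to_disc):
    """Collapse subdiscipline labels to their parent discipline."""
    return [subdisc_to_disc[s] for s in labels]
```

Collapsing the 554 subdiscipline labels to the 13 disciplines before scoring is what makes the coarser evaluation more forgiving: a near-miss within the same discipline counts as correct.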
Possible Areas for Improvement • Reduce matrix sparseness with more flexible regular expressions and synonymy/hypernymy expansion • SVD to smooth vectors • Experiment with different TF-IDF cutoff ranges