Clustering More than Two Million Biomedical Publications Comparing the Accuracies of Nine Text-Based Similarity Approaches Boyack et al. (2011). PLoS ONE 6(3): e18029
Motivation • Compare different text-based similarity measurements • Make use of a large biomedical data set • Process a corpus of more than two million documents
Procedures • Define a corpus of documents • Extract and pre-process the relevant textual information from the corpus • Calculate pairwise document-document similarities using nine different similarity approaches • Create similarity matrices, keeping only the top-n similarities per document • Cluster the documents based on each similarity matrix • Assess each cluster solution using coherence and concentration metrics
Data • Built a corpus with titles, abstracts, MeSH terms, and reference lists by matching and combining data from the MEDLINE and Scopus (Elsevier) databases • The resulting set was limited to documents published from 2004 to 2008 that contained abstracts, at least five MeSH terms, and at least five references in their bibliographies • This yielded a corpus of 2,153,769 unique scientific documents • Base matrix: word-document co-occurrence matrix
tf-idf • The tf–idf weight (term frequency–inverse document frequency) • A statistical measure used to evaluate how important a word is to a document in a collection or corpus • The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus (see the sketch below).
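As an illustration of the weighting idea (the paper does not specify its exact tf–idf variant; the raw-count tf and log N/df idf used here are common choices, not the authors' settings):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute tf-idf weights for a list of tokenized documents.

    tf  = raw count of the term in the document
    idf = log(N / df), where df is the number of documents
          containing the term (a common variant).
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

docs = [["clustering", "biomedical", "publications"],
        ["biomedical", "text", "similarity"],
        ["text", "clustering"]]
print(tf_idf(docs)[0])
```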
LSA • Latent semantic analysis • Applies a low-rank approximation of the weighted word–document matrix via truncated singular value decomposition (SVD); documents are then compared in the reduced latent space
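A minimal sketch of this reduction, assuming a tf-idf document-term matrix as input; the rank k = 50 and the use of scikit-learn are illustrative choices, not the paper's settings:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# X: documents x terms weighted matrix (toy random stand-in here).
rng = np.random.default_rng(0)
X = rng.random((100, 500))

# Project documents into a k-dimensional latent space via truncated SVD;
# k = 50 is an arbitrary choice for this sketch.
svd = TruncatedSVD(n_components=50, random_state=0)
X_latent = svd.fit_transform(X)

# Document-document similarities are computed in the latent space.
sims = cosine_similarity(X_latent)
print(sims.shape)  # (100, 100)
```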
BM25 • Okapi BM25 • A ranking function that is widely used by search engines to rank matching documents according to their relevance to a query
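A self-contained BM25 scoring sketch; the parameter values k1 = 1.2 and b = 0.75 are common defaults rather than values reported in the paper:

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with Okapi BM25."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in query:
        # Document frequency of the term across the corpus.
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        # A common smoothed idf variant (kept non-negative by the +1).
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        # Term frequency saturated by k1 and normalized by document length.
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score

corpus = [["gene", "expression", "cancer"],
          ["protein", "folding"],
          ["cancer", "therapy", "gene"]]
print(bm25_score(["gene", "cancer"], corpus[0], corpus))
```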
SOM • Self-organizing map • A form of artificial neural network that generates a low-dimensional geometric model from high-dimensional data • A SOM may be considered a nonlinear generalization of principal component analysis (PCA).
SOM • Training algorithm (a runnable sketch follows below):
1. Randomize the map's nodes' weight vectors
2. Grab an input vector
3. Traverse each node in the map: use the Euclidean distance to find the similarity between the input vector and the node's weight vector, and track the node that produces the smallest distance (this node is the best matching unit, BMU)
4. Update the nodes in the neighbourhood of the BMU by pulling them closer to the input vector: Wv(t + 1) = Wv(t) + Θ(t) α(t) (D(t) − Wv(t))
5. Increase t and repeat from step 2 while t < λ
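A compact implementation of the loop above; the Gaussian neighbourhood function Θ and the linearly decaying learning rate α are illustrative choices (the paper does not report these schedules):

```python
import numpy as np

def train_som(data, grid_shape=(10, 10), n_steps=1000, seed=0):
    """Minimal SOM training loop following the update rule above."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    dim = data.shape[1]
    # Step 1: randomize the weight vectors of the map's nodes.
    weights = rng.random((rows, cols, dim))
    # Grid coordinates, used to compute neighbourhood distances.
    coords = np.stack(np.mgrid[0:rows, 0:cols], axis=-1)

    for t in range(n_steps):
        # Step 2: grab an input vector D(t).
        x = data[rng.integers(len(data))]
        # Step 3: find the best matching unit (smallest Euclidean distance).
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Step 4: pull the BMU's neighbourhood toward the input.
        alpha = 0.5 * (1 - t / n_steps)                 # decaying learning rate
        sigma = max(rows, cols) / 2 * (1 - t / n_steps) + 1e-9
        grid_dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
        theta = np.exp(-grid_dist2 / (2 * sigma ** 2))  # Gaussian neighbourhood
        weights += (theta * alpha)[..., None] * (x - weights)
    return weights

# Usage: map 200 random 5-dimensional vectors onto a 10x10 grid.
som = train_som(np.random.default_rng(1).random((200, 5)))
print(som.shape)  # (10, 10, 5)
```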
Topic modeling • Three separate Gibbs-sampled topic models were learned at the following topic resolutions: T = 500, T = 1000, and T = 2000 topics. • Dirichlet prior hyperparameter settings of β = 0.01 and α = 0.05 N/(D·T) were used, where N is the total number of word tokens, D is the number of documents, and T is the number of topics.
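A didactic collapsed Gibbs sampler for LDA, showing where the α and β priors above enter the sampling equation; this is not the paper's implementation, which must scale to millions of documents:

```python
import numpy as np

def lda_gibbs(docs, V, T, n_iter=200, beta=0.01, alpha_scale=0.05, seed=0):
    """Collapsed Gibbs sampling for LDA on integer-encoded documents.

    docs: list of lists of word ids in [0, V). alpha is set as on the
    slide above: alpha = alpha_scale * N / (D * T).
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    N = sum(len(d) for d in docs)
    alpha = alpha_scale * N / (D * T)

    # Count matrices maintained by the sampler.
    ndt = np.zeros((D, T))          # doc-topic counts
    ntw = np.zeros((T, V))          # topic-word counts
    nt = np.zeros(T)                # total tokens per topic
    z = [rng.integers(T, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current assignment from the counts.
                ndt[d, t] -= 1; ntw[t, w] -= 1; nt[t] -= 1
                # Full conditional P(z = t | rest), up to a constant:
                # alpha smooths doc-topic counts, beta smooths topic-word counts.
                p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + V * beta)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t
                ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1
    return ndt, ntw
```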
PMRA • The PMRA (PubMed Related Articles) ranking measure is used to calculate the 'Related Articles' links in the PubMed interface • The de facto standard for biomedical document relatedness • Serves as a proxy benchmark against which the other approaches can be compared
Similarity filtering • Reduce matrix size • Generate a top-n similarity file from each of the larger similarity matrices (see the sketch below) • n = 15; each document thus contributes between 5 and 15 edges to the similarity file
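A sketch of top-n filtering over a dense similarity matrix; at the paper's scale this would run over sparse, streamed similarity data, but numpy makes the idea clear:

```python
import numpy as np

def top_n_edges(sim, n=15):
    """Keep only each document's top-n most similar neighbours.

    sim: square document-document similarity matrix.
    Returns a list of (doc_i, doc_j, similarity) edges.
    """
    np.fill_diagonal(sim, -np.inf)      # exclude self-similarity
    edges = []
    for i, row in enumerate(sim):
        # Indices of the n largest similarities in this row.
        top = np.argpartition(row, -n)[-n:]
        edges.extend((i, int(j), float(row[j])) for j in top)
    return edges

rng = np.random.default_rng(0)
sim = rng.random((100, 100))
sim = (sim + sim.T) / 2                 # make it symmetric
print(len(top_n_edges(sim, n=15)))      # 100 * 15 = 1500 edges
```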
Clustering • DrL (now called OpenOrd) • A force-directed graph layout algorithm that calculates an (x, y) position for each document in a collection using an input set of weighted edges • Available as a layout plugin in Gephi: http://gephi.org/
Evaluation • Textual coherence of each cluster solution, measured using the Jensen-Shannon divergence between word probability distributions (see the sketch below)
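A minimal Jensen-Shannon divergence between two word distributions (base-2 logs, so values fall in [0, 1]); how the paper aggregates per-document divergences into a cluster-level coherence score is not reproduced here:

```python
import numpy as np

def jensen_shannon(p, q):
    """JSD between two discrete probability distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = (p + q) / 2                       # mixture distribution
    def kl(a, b):
        mask = a > 0                      # 0 * log(0) is taken as 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jensen_shannon([0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))  # 0.5
```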
Evaluation • Concentration: a metric based on grant acknowledgements from MEDLINE, using a grant-to-article linkage dataset from a previous study; the intuition is that articles acknowledging the same grant should be concentrated in a small number of clusters