A Measure of Similarity Between Pairs of Papers
Susan Biancani
Stanford University School of Education
Introduction
• Long-term goal:
  • Understand changes in scholarly ideas over time
  • Develop a person-person similarity measure, to reflect similarity in bodies of work
• Short-term goal:
  • Develop a measure of paper-paper similarity
  • 9 features, including metadata and content
  • Train on 120 papers, rated by experts on a 1-7 scale
Data
• 66,000 papers written by professors at Stanford, from the ISI database
• Features for each pair of papers:
  • Cosine similarity of abstract tf-idf vectors
  • Cosine similarity of title tf-idf vectors
  • Cosine similarity of LDA vectors (3 versions)
  • Count of common references
  • Count of journals referenced in common
  • Count of authors referenced in common
  • Dummy indicating whether the two papers were published in the same journal
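As an illustration of how pairwise features like these could be computed, here is a minimal pure-Python sketch. It is not the pipeline used in the study: the toy papers, the `(first_author, journal, year)` reference format, the smoothed idf formula, and the function names `tfidf_vectors`, `cosine`, and `pair_features` are all assumptions made for the example. The title and LDA features are omitted; they would follow the same cosine pattern.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumption): log((1 + n) / (1 + df)) + 1
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def pair_features(a, b, vec_a, vec_b):
    """Content and metadata features for one pair of papers."""
    return {
        "tfidfAbstract": cosine(vec_a, vec_b),
        "cites": len(a["refs"] & b["refs"]),
        "citeJournals": len({r[1] for r in a["refs"]} & {r[1] for r in b["refs"]}),
        "citeAuthors": len({r[0] for r in a["refs"]} & {r[0] for r in b["refs"]}),
        "sameJournal": int(a["journal"] == b["journal"]),
    }

# Hypothetical toy records; references are (first_author, journal, year) tuples.
papers = [
    {"abstract": "network diffusion of ideas among scholars", "journal": "J1",
     "refs": {("Kuhn", "Book", 1962), ("Blei", "JMLR", 2003)}},
    {"abstract": "topic models of scholarly ideas", "journal": "J1",
     "refs": {("Blei", "JMLR", 2003), ("Kuhn", "Other", 1970)}},
    {"abstract": "protein folding in yeast cells", "journal": "J2",
     "refs": {("Smith", "Cell", 1999)}},
]

vectors = tfidf_vectors([p["abstract"].lower().split() for p in papers])
features_01 = pair_features(papers[0], papers[1], vectors[0], vectors[1])
features_02 = pair_features(papers[0], papers[2], vectors[0], vectors[2])
```

In this toy data the first two papers share one full reference, one cited journal, and two cited authors, while the first and third share no vocabulary or references, so every feature for that pair is zero.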
Gold Standard Data
• 31 papers from 8 professors in Sociology
• 44 papers from 7 professors in Biology
• 45 papers from 7 professors in CS
Rating Scale:
Training & Validation
• Regression model:
  rating = β₁·tfidfAbstract + β₂·tfidfTitle + β₃·lda50 + β₄·lda100 + β₅·lda200 + β₆·cites + β₇·citeJournals + β₈·citeAuthors + β₉·sameJournal
• Ordinal logistic regression to learn optimal weights for the features
• Ten-fold cross-validation (comparing predicted ratings to actual ratings)
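To make the two ideas on this slide concrete, the sketch below shows how a proportional-odds (ordinal logistic) model turns a weighted feature sum into probabilities over the seven rating levels, and how indices can be split into ten folds for cross-validation. It does not fit the model: the weights and cutpoints are made-up values, and the names `rating_probabilities`, `predict_rating`, and `k_fold_indices` are hypothetical helpers, not part of the original work.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def rating_probabilities(features, weights, cutpoints):
    """P(rating = k) for k = 1..len(cutpoints)+1 under a proportional-odds model.

    The model assumes P(rating <= k) = sigmoid(cutpoint_k - score), where
    score is the weighted sum of the features and cutpoints are increasing.
    """
    score = sum(weights[name] * value for name, value in features.items())
    cumulative = [sigmoid(c - score) for c in cutpoints] + [1.0]
    return [cumulative[0]] + [cumulative[k] - cumulative[k - 1]
                              for k in range(1, len(cumulative))]

def predict_rating(features, weights, cutpoints):
    """Most probable rating on the 1-based scale."""
    probs = rating_probabilities(features, weights, cutpoints)
    return 1 + max(range(len(probs)), key=probs.__getitem__)

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

# Hypothetical weights and cutpoints, chosen only for illustration.
weights = {"tfidfAbstract": 4.0, "sameJournal": 0.5}
cutpoints = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]  # six cutpoints give seven rating levels

dissimilar = {"tfidfAbstract": 0.0, "sameJournal": 0}
similar = {"tfidfAbstract": 1.0, "sameJournal": 1}
low = predict_rating(dissimilar, weights, cutpoints)
high = predict_rating(similar, weights, cutpoints)
splits = list(k_fold_indices(25, k=10))
```

In a full workflow, the weights and cutpoints would be estimated on each training fold (for example by maximum likelihood) and the predicted ratings on the held-out fold compared with the expert ratings.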
Future Directions
• Improve ratings set:
  • Add more disciplines
  • Confirm ratings with more experts
• Develop a person-person distance measure, treating each person as the cluster of their papers
• Apply this measure to the study of paradigm shifts / scientific-intellectual movements
• Explore the role of organizational structure in these movements
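One possible reading of "treating each person as the cluster of their papers" is an average-linkage score: the mean paper-paper similarity over all cross pairs of two people's papers. The sketch below illustrates that reading only; the slides do not specify the measure, and `person_similarity`, the `jaccard` stand-in similarity, and the toy keyword sets are hypothetical.

```python
from itertools import product

def person_similarity(papers_a, papers_b, paper_similarity):
    """Mean paper-paper similarity over all cross pairs (average linkage)."""
    pairs = list(product(papers_a, papers_b))
    if not pairs:
        raise ValueError("both people need at least one paper")
    return sum(paper_similarity(p, q) for p, q in pairs) / len(pairs)

def jaccard(p, q):
    """Stand-in paper similarity on keyword sets; a learned measure would go here."""
    union = p | q
    return len(p & q) / len(union) if union else 0.0

# Hypothetical toy data: each paper is a set of keywords.
alice = [frozenset({"lda", "topics"}), frozenset({"tfidf", "topics"})]
bob = [frozenset({"lda", "topics"})]
score = person_similarity(alice, bob, jaccard)
```

A distance can then be taken as one minus this score when the paper similarity is scaled to the range 0 to 1.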