A Comparison of On-line Computer Science Citation Databases

A Comparison of On-line Computer Science Citation Databases Vaclav Petricek, Ingemar J. Cox, Hui Han, Isaac G. Councill, C. Lee Giles v.petricek@cs.ucl.ac.uk http://www.cs.ucl.ac.uk/staff/V.Petricek

Motivation • Autonomous (automatically constructed) databases have advantages over manually constructed ones • Easier maintenance • Lower cost • Is an autonomous database really an equivalent solution that is just cheaper? • Does the automated acquisition introduce any bias? 2

Talk Overview • Datasets • Acquisition bias and models • CS Citation Distribution • Conclusions • Future Work 3

Datasets - DBLP • DBLP has been operated by Michael Ley since 1994 [8]. It currently contains over 550,000 computer science references from around 368,000 authors. • Each entry is inserted manually by a group of volunteers and occasionally hired students. The entries are obtained from conference proceedings and journals. 4

Datasets - CiteSeer • CiteSeer was created by Steve Lawrence and C. Lee Giles in 1997. It currently contains over 716,797 documents. • In contrast, each entry in CiteSeer is automatically entered from an analysis of documents found on the Web. 5

Datasets – Publication year • CiteSeer • DBLP • Declining CiteSeer maintenance • Increased DBLP funding 6

Author bias • CiteSeer papers have a higher average number of authors • Both databases show growing team sizes 7

Author bias • Crossover at a low number of authors • CiteSeer has a higher proportion of multi-author papers than DBLP (for number of authors <4) 8

Author bias • “Papers with a higher number of authors are more likely to be included in CiteSeer” • Hypothesis: crawler suffers from acquisition bias due to • Submission • Crawling 9

Models - CiteSeer • CiteSeer submission model • Probability of a document being submitted grows with number of authors • Each coauthor submits the publication with probability β • Probabilities independent for coauthors • citeseer_s(i) = (1-(1- β )^i) * all(i), where all(i) is the number of all publications with i authors 10
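The submission model above can be illustrated with a short sketch. It assumes each of the i coauthors independently submits with probability β, so a paper is included if at least one of them does. The function names and the value β = 0.2 are illustrative assumptions, not figures from the talk.

```python
def inclusion_probability(num_authors, beta):
    """Probability that at least one of num_authors independent coauthors submits."""
    return 1.0 - (1.0 - beta) ** num_authors


def expected_citeseer_counts(all_counts, beta):
    """Scale all(i), given as {i: count}, by the inclusion probability for i authors."""
    return {i: inclusion_probability(i, beta) * n for i, n in all_counts.items()}


if __name__ == "__main__":
    beta = 0.2  # hypothetical value, chosen only to show the shape of the bias
    for i in range(1, 6):
        print(i, round(inclusion_probability(i, beta), 4))
```

The inclusion probability rises with every additional author, which is the bias against papers with few authors that the model is meant to capture.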

Models - CiteSeer • CiteSeer crawler model • Probability of crawling a document grows with number of its online copies • Probability of a document being online grows with number of authors • Probabilities independent between authors • Each author publishes the publication online with probability δ • Each online copy is found by the crawler with probability γ • citeseer_c(i) = (1-(1- γδ)^i) * all(i) • Both models result in an equivalent type of bias 11
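A small Monte Carlo sketch can check that the crawler model has the same functional form as the submission model, with γδ playing the role of β. It assumes each author independently puts a copy online with probability δ and that each online copy is independently found with probability γ. The parameter values and function names are hypothetical.

```python
import random


def crawler_closed_form(num_authors, gamma, delta):
    """Closed form 1 - (1 - gamma*delta)^i from the slide."""
    return 1.0 - (1.0 - gamma * delta) ** num_authors


def simulate_crawler(num_authors, gamma, delta, trials, rng):
    """Estimate the fraction of papers for which at least one author's copy is found."""
    hits = 0
    for _ in range(trials):
        # A copy counts only if it is online (delta) AND the crawler finds it (gamma).
        found = any(
            rng.random() < delta and rng.random() < gamma
            for _ in range(num_authors)
        )
        hits += found
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)      # fixed seed so runs are repeatable
    gamma, delta = 0.5, 0.4     # hypothetical values for illustration
    for i in range(1, 6):
        estimate = simulate_crawler(i, gamma, delta, 20000, rng)
        print(i, round(estimate, 3), round(crawler_closed_form(i, gamma, delta), 3))
```

The simulated and closed-form columns should agree up to sampling noise, which is why the two models cannot be told apart from the author-count data alone.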

Coverage • Can we estimate the coverage of DBLP? • Can we estimate the coverage of CiteSeer? • Can we estimate the coverage of CS literature? • We need a model of DBLP acquisition method 12

Models - DBLP • DBLP model • Publication included in DBLP with probability α • α is a parameter reflecting DBLP “coverage” of CS literature dblp(i) = α * all(i) 13

Coverage • citeseer(i) = (1-(1- β )^i) * all(i) • dblp(i) = α * all(i) • r(i) = dblp(i) / citeseer(i) = α / (1-(1- β )^i) 14
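Because all(i) cancels in the ratio r(i), α and β can in principle be estimated from the observed DBLP/CiteSeer ratio alone. The sketch below shows one way this could be done, with a plain least-squares grid search. This is an assumption for illustration and not necessarily the fitting procedure used by the authors. The input is synthetic, generated from the model itself with α = 0.3 and a hypothetical β = 0.5, so it only demonstrates that the parameters are recoverable.

```python
def ratio(num_authors, alpha, beta):
    """Model ratio r(i) = alpha / (1 - (1 - beta)^i)."""
    return alpha / (1.0 - (1.0 - beta) ** num_authors)


def fit_alpha_beta(observed, steps=100):
    """Grid-search alpha and beta in (0, 1) minimising squared error against observed r(i).

    observed maps number of authors i to the measured ratio dblp(i) / citeseer(i).
    """
    best = None
    for a in range(1, steps):
        for b in range(1, steps):
            alpha, beta = a / steps, b / steps
            err = sum((ratio(i, alpha, beta) - r) ** 2 for i, r in observed.items())
            if best is None or err < best[0]:
                best = (err, alpha, beta)
    return best[1], best[2]


if __name__ == "__main__":
    # Synthetic ratios produced by the model, NOT measured data.
    synthetic = {i: ratio(i, 0.3, 0.5) for i in range(1, 7)}
    print(fit_alpha_beta(synthetic))
```

As i grows, r(i) tends to α, so the tail of the ratio curve pins down α while its steepness for small i pins down β.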

Results • r(i) = α / (1-(1- β )^i) • α ≈ 0.3 • DBLP covers approx. 30% of CS literature • CiteSeer covers approx. 40% • CS literature ≈ 2M publications 15
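The reported figures are consistent with a simple back-of-the-envelope calculation. The sketch below combines α ≈ 0.3 with the database sizes quoted on the Datasets slides. Treating those sizes as the relevant totals is an assumption made here, since the talk does not spell out the derivation.

```python
# Sizes quoted on the Datasets slides.
dblp_entries = 550_000
citeseer_documents = 716_797

# DBLP coverage parameter reported on the Results slide.
alpha = 0.3

# If DBLP holds a fraction alpha of all CS papers, the total follows by division.
total_cs_papers = dblp_entries / alpha

# CiteSeer coverage is then its size relative to that estimated total.
citeseer_coverage = citeseer_documents / total_cs_papers

print(f"estimated CS literature size: {total_cs_papers:,.0f}")
print(f"estimated CiteSeer coverage: {citeseer_coverage:.1%}")
```

This gives roughly 1.8 million papers and a CiteSeer coverage of about 39%, in line with the rounded values of 2M and 40% on the slide.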

Citation distribution

Citation distribution • Studied before • Follows a power law • Redner, Laherrere et al., Lehmann and others • Mostly physics community • We use a subset of CiteSeer and DBLP papers that have citation information 17

Citation distribution • Power law • Sparse data for high number of citations 18

Citation distribution Exponential binning • Data aggregated in exponentially increasing ‘bins’ • Equivalent to constant bins on a logarithmic scale • Easier interpolation 19
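Exponential binning as described above can be sketched in a few lines. The version below uses bins [1, 2), [2, 4), [4, 8), ... and divides each count by the bin width so that wide bins remain comparable to narrow ones. The width normalisation, the base of 2, and the function name are assumptions for illustration, and the sample input is made up.

```python
from collections import Counter


def exponential_bins(citation_counts, base=2):
    """Aggregate counts into bins [base^k, base^(k+1)) and return (low, high, density).

    Papers with zero citations are skipped because they have no place on a
    logarithmic axis. Density is the number of papers divided by the bin width.
    """
    per_bin = Counter()
    for c in citation_counts:
        if c < 1:
            continue
        k = 0
        while base ** (k + 1) <= c:
            k += 1
        per_bin[k] += 1
    result = []
    for k in sorted(per_bin):
        low, high = base ** k, base ** (k + 1)
        result.append((low, high, per_bin[k] / (high - low)))
    return result


if __name__ == "__main__":
    sample = [1, 1, 1, 2, 3, 3, 5, 9, 17]  # made-up citation counts
    for low, high, density in exponential_bins(sample):
        print(f"[{low}, {high}): {density}")
```

On a log-log plot the bins have equal visual width, which smooths the sparse tail where only a handful of papers have very high citation counts.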

Citation distribution • Distribution of citations more uneven in CS than in Physics • Significant differences between DBLP and CiteSeer 20

Citation distribution • CiteSeer contains fewer low-cited papers than DBLP • No model yet • Lawrence: “Online or invisible?” 21

Conclusions - authors • CiteSeer and DBLP have very different acquisition methods • Significant bias against papers with a low number of authors (fewer than 4) in CiteSeer • Single-author papers appear to be disadvantaged by the CiteSeer acquisition method • Two probabilistic models for paper acquisition in CiteSeer result in the same type of bias • Crawler model • Submission model 22

Conclusions - coverage • Simple model of DBLP coverage predicts coverage of approx 30% of the entire Computer Science literature. • This gives us CiteSeer coverage of approx 40% and total number of CS papers around 2M 23

Conclusions - citations • CiteSeer and DBLP citation distributions are different • Both indicate that highly cited papers in Computer Science receive a larger citation share than in Physics • CiteSeer contains fewer low-cited papers 24

Future Work • Repeat experiments on the most recent CiteSeer data • Other methods to estimate Computer Science literature size and trends • Overlap of CiteSeer and DBLP • Bias introduced by bibliography parsing • Collaborative network analysis • Connection to internet surveys? 25

Thank you
