A Comparison of On-line Computer Science Citation Databases
Vaclav Petricek, Ingemar J. Cox, Hui Han, Isaac G. Councill, C. Lee Giles
v.petricek@cs.ucl.ac.uk
http://www.cs.ucl.ac.uk/staff/V.Petricek
Motivation
• Autonomous databases have advantages over manually constructed ones
  • Easier maintenance
  • Lower cost
• Is an autonomous database really an equivalent solution that is just cheaper?
• Does automated acquisition introduce any bias?
Talk Overview
• Datasets
• Acquisition bias and models
• CS Citation Distribution
• Conclusions
• Future Work
Datasets - DBLP
• DBLP has been operated by Michael Ley since 1994 [8]. It currently contains over 550,000 computer science references by around 368,000 authors.
• Each entry is inserted manually by a group of volunteers and occasionally hired students. The entries are obtained from conference proceedings and journals.
Datasets - CiteSeer
• CiteSeer was created by Steve Lawrence and C. Lee Giles in 1997. It currently contains over 716,797 documents.
• In contrast to DBLP, each entry in CiteSeer is entered automatically from an analysis of documents found on the Web.
Datasets - Publication year
[Figure: papers by publication year; series: CiteSeer, DBLP]
• Declining CiteSeer maintenance
• Increased DBLP funding
Author bias
• CiteSeer papers have a higher average number of authors than DBLP papers
• Both databases show growing team sizes
Author bias
• Crossover at a low number of authors
• CiteSeer has a higher proportion of multi-author papers than DBLP; papers with fewer than 4 authors are relatively under-represented in CiteSeer
Author bias
Hypothesis: “Papers with a higher number of authors are more likely to be included in CiteSeer”
Crawler suffers from acquisition bias due to
• Submission
• Crawling
Models - CiteSeer
• CiteSeer submission model
  • Probability of a document being submitted grows with its number of authors
  • Each author submits the publication with probability β
  • Probabilities are independent between coauthors
citeseer_s(i) = (1-(1-β)^i) * all(i)
where i is the number of authors and all(i) is the number of all papers with i authors
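The submission model above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and β = 0.2 is an arbitrary example value, not an estimate from the talk.

```python
def submission_inclusion_probability(num_authors: int, beta: float) -> float:
    """Probability that at least one of `num_authors` independent
    coauthors submits the paper, each with probability `beta`."""
    if num_authors < 1:
        raise ValueError("num_authors must be at least 1")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be a probability in [0, 1]")
    return 1.0 - (1.0 - beta) ** num_authors


def expected_citeseer_counts(all_counts: dict, beta: float) -> dict:
    """Apply citeseer_s(i) = (1-(1-beta)^i) * all(i) to every author count i.

    `all_counts` maps a number of authors i to all(i), the number of
    all papers with i authors (hypothetical input).
    """
    return {
        i: submission_inclusion_probability(i, beta) * n
        for i, n in all_counts.items()
    }


if __name__ == "__main__":
    beta = 0.2  # illustrative value only
    for i in range(1, 6):
        print(i, round(submission_inclusion_probability(i, beta), 3))
```

The inclusion probability rises with every additional coauthor, which is the bias the hypothesis describes.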
Models - CiteSeer
• CiteSeer crawler model
  • Probability of crawling a document grows with the number of its online copies
  • Probability of a document being online grows with its number of authors
  • Probabilities are independent between authors
  • Each author publishes the publication online with probability δ
  • The crawler finds an online copy with probability γ
citeseer_c(i) = (1-(1-γδ)^i) * all(i)
• Both models result in an equivalent type of bias
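One way to check the crawler formula is a small Monte Carlo simulation. The sketch below is an illustration under stated assumptions: each author independently puts one copy online with probability δ, the crawler finds each copy independently with probability γ, and the values γ = 0.5, δ = 0.6 are arbitrary. The function names are hypothetical.

```python
import random


def crawler_inclusion_probability(num_authors: int, gamma: float, delta: float) -> float:
    """Analytic form from the slide: 1 - (1 - gamma*delta)^i."""
    return 1.0 - (1.0 - gamma * delta) ** num_authors


def simulate_crawler(num_authors: int, gamma: float, delta: float,
                     trials: int = 20000, seed: int = 0) -> float:
    """Estimate the fraction of papers for which at least one copy is found."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    found = 0
    for _ in range(trials):
        for _ in range(num_authors):
            online = rng.random() < delta          # author posts a copy
            if online and rng.random() < gamma:    # crawler finds that copy
                found += 1
                break
    return found / trials


if __name__ == "__main__":
    gamma, delta = 0.5, 0.6  # illustrative values only
    for i in (1, 3, 5):
        print(i,
              round(crawler_inclusion_probability(i, gamma, delta), 3),
              round(simulate_crawler(i, gamma, delta), 3))
```

Setting β = γδ in the submission model gives the same expression, which is why both models produce the same type of bias.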
Coverage
• Can we estimate the coverage of DBLP?
• Can we estimate the coverage of CiteSeer?
• Can we estimate the size of the CS literature?
• We need a model of the DBLP acquisition method
Models - DBLP
• DBLP model
  • Publication included in DBLP with probability α
  • α is a parameter reflecting DBLP “coverage” of CS literature
dblp(i) = α * all(i)
Coverage
citeseer(i) = (1-(1-β)^i) * all(i)
dblp(i) = α * all(i)
r(i) = dblp(i) / citeseer(i)
r(i) = α / (1-(1-β)^i)
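Because all(i) cancels in the ratio, α and β can in principle be estimated from the observed DBLP/CiteSeer ratio alone. The slides do not say which fitting procedure was used; the sketch below swaps in a plain least-squares grid search, and the function names and synthetic data are hypothetical.

```python
def coverage_ratio(num_authors: int, alpha: float, beta: float) -> float:
    """r(i) = alpha / (1 - (1 - beta)^i); tends to alpha as i grows."""
    return alpha / (1.0 - (1.0 - beta) ** num_authors)


def fit_alpha_beta(ratios: dict, steps: int = 100) -> tuple:
    """Least-squares grid search over alpha, beta in {1/steps, ..., 1}.

    `ratios` maps a number of authors i to an observed ratio r(i).
    Grid search is an assumption made for this sketch, chosen because
    it needs no external libraries.
    """
    best = None
    for a in range(1, steps + 1):
        for b in range(1, steps + 1):
            alpha, beta = a / steps, b / steps
            err = sum((coverage_ratio(i, alpha, beta) - r) ** 2
                      for i, r in ratios.items())
            if best is None or err < best[0]:
                best = (err, alpha, beta)
    return best[1], best[2]


if __name__ == "__main__":
    # Synthetic ratios generated from known parameters, not real data.
    synthetic = {i: coverage_ratio(i, 0.3, 0.25) for i in range(1, 9)}
    print(fit_alpha_beta(synthetic))
```

For large i the denominator approaches 1, so the tail of r(i) reads off α directly; the small-i values pin down β.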
Results
r(i) = α / (1-(1-β)^i)
• α ≈ 0.3
• DBLP covers approx. 30% of CS literature
• CiteSeer covers approx. 40% of CS literature
• CS literature ≈ 2M publications
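A back-of-the-envelope check shows how these figures fit together. The calculation below is an assumption about how the numbers relate, not a procedure stated on the slide: it combines α ≈ 0.3 with the database sizes quoted on the dataset slides and ignores duplicates and non-paper entries.

```python
DBLP_ENTRIES = 550_000        # "over 550,000" references (dataset slide)
CITESEER_DOCUMENTS = 716_797  # documents in CiteSeer (dataset slide)
ALPHA = 0.3                   # estimated DBLP coverage from the fit


def estimate_total_literature(dblp_entries: int, alpha: float) -> float:
    """If DBLP holds a fraction alpha of all papers, total = entries / alpha."""
    return dblp_entries / alpha


def estimate_coverage(database_size: int, total: float) -> float:
    """Fraction of the estimated literature held by a database."""
    return database_size / total


if __name__ == "__main__":
    total = estimate_total_literature(DBLP_ENTRIES, ALPHA)
    coverage = estimate_coverage(CITESEER_DOCUMENTS, total)
    print(f"estimated CS literature: {total:,.0f} papers")  # roughly 1.8 million
    print(f"estimated CiteSeer coverage: {coverage:.2f}")    # roughly 0.39
```

Both values round to the slide's figures of about 2M publications and about 40% coverage.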
Citation distribution
• Citation distributions have been studied before
  • They follow a power law
  • Redner, Laherrere et al., Lehmann and others
  • Mostly in the physics community
• We use a subset of CiteSeer and DBLP papers that have citation information
Citation distribution
• Power law
• Sparse data for high numbers of citations
Citation distribution
Exponential binning
• Data aggregated in exponentially increasing ‘bins’
• Equivalent to constant bins on a logarithmic scale
• Easier interpolation
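Exponential binning can be sketched as follows. This is a minimal illustration with assumptions the slide does not state: the bin edges are powers of an integer base (2 here), papers with zero citations are dropped because they cannot be placed on a logarithmic scale, and each bin's count is divided by its width to make bins comparable. The function name is hypothetical.

```python
def exponential_binning(citation_counts, base: int = 2):
    """Aggregate citation counts into bins [base^k, base^(k+1)).

    Returns a list of (lower, upper, papers, density) tuples, where
    density = papers / (upper - lower), i.e. papers per integer
    citation value inside the bin.
    """
    if base < 2:
        raise ValueError("base must be an integer >= 2")
    positive = [c for c in citation_counts if c >= 1]  # zero-cited papers dropped
    if not positive:
        return []
    largest = max(positive)
    bins = []
    lower = 1
    while lower <= largest:
        upper = lower * base  # integer edges avoid floating-point log errors
        papers = sum(1 for c in positive if lower <= c < upper)
        bins.append((lower, upper, papers, papers / (upper - lower)))
        lower = upper
    return bins


if __name__ == "__main__":
    sample = [0, 1, 1, 2, 3, 4, 7, 8, 100]  # made-up counts for illustration
    for lower, upper, papers, density in exponential_binning(sample):
        print(f"[{lower}, {upper}): {papers} papers, density {density:.4f}")
```

Wide bins at the high-citation end pool the sparse tail, which is what makes the power-law slope easier to read off a log-log plot.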
Citation distribution
• Distribution of citations is more uneven in CS than in Physics
• Significant differences between DBLP and CiteSeer
Citation distribution
• CiteSeer contains fewer low-cited papers than DBLP
• No model yet
• Lawrence, “Online or invisible?”
Conclusions - authors
• CiteSeer and DBLP have very different acquisition methods
• Significant bias in CiteSeer against papers with a low number of authors (fewer than 4)
• Single-author papers appear to be disadvantaged by the CiteSeer acquisition method
• Two probabilistic models for paper acquisition in CiteSeer result in the same type of bias
  • Crawler model
  • Submission model
Conclusions - coverage
• A simple model of DBLP coverage predicts that DBLP covers approx. 30% of the entire Computer Science literature
• This gives a CiteSeer coverage of approx. 40% and a total of around 2M CS papers
Conclusions - citations
• CiteSeer and DBLP citation distributions are different
• Both indicate that highly cited papers in Computer Science receive a larger citation share than in Physics
• CiteSeer contains fewer low-cited papers than DBLP
Future Work
• Repeat experiments on the most recent CiteSeer data
• Other methods to estimate Computer Science literature size and trends
• Overlap of CiteSeer and DBLP
• Bias introduced by bibliography parsing
• Collaborative network analysis
• Connection to internet surveys?