
A Comparison of On-line Computer Science Citation Databases

Vaclav Petricek, Ingemar J. Cox, Hui Han, Isaac G. Councill, C. Lee Giles. v.petricek@cs.ucl.ac.uk http://www.cs.ucl.ac.uk/staff/V.Petricek


Presentation Transcript


  1. A Comparison of On-line Computer Science Citation Databases • Vaclav Petricek, Ingemar J. Cox, Hui Han, Isaac G. Councill, C. Lee Giles • v.petricek@cs.ucl.ac.uk • http://www.cs.ucl.ac.uk/staff/V.Petricek

  2. Motivation • Autonomous databases have advantages compared to manually constructed ones • Easier maintenance • Lower cost • Are they really an equivalent solution that is just cheaper? • Does automated acquisition introduce any bias?

  3. Talk Overview • Datasets • Acquisition bias and models • CS Citation Distribution • Conclusions • Future Work

  4. Datasets - DBLP • DBLP has been operated by Michael Ley since 1994 [8]. It currently contains over 550,000 computer science references from around 368,000 authors. • Each entry is manually inserted by a group of volunteers and occasionally hired students. The entries are obtained from conference proceedings and journals.

  5. Datasets - CiteSeer • CiteSeer was created by Steve Lawrence and C. Lee Giles in 1997. It currently contains 716,797 documents. • In contrast to DBLP, each entry in CiteSeer is entered automatically from an analysis of documents found on the Web.

  6. Datasets - Publication year • CiteSeer • DBLP • Declining CiteSeer maintenance • Increased DBLP funding

  7. Author bias • CiteSeer papers have a higher average number of authors • Both databases show growing team sizes

  8. Author bias • Crossover for low numbers of authors • CiteSeer has a higher proportion of multi-author papers than DBLP (for number of authors < 4)

  9. Author bias • Hypothesis: “Papers with a higher number of authors are more likely to be included in CiteSeer” • The crawler suffers from acquisition bias due to • Submission • Crawling

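  10. Models - CiteSeer • CiteSeer submission model: the probability of a document being submitted grows with its number of authors • Each author submits the publication with probability β • Probabilities are independent across coauthors • citeseer_s(i) = (1 - (1 - β)^i) * all(i), where i is the number of authors and all(i) is the number of all publications with i authors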
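The submission model on slide 10 can be sketched in a few lines of Python. This is only an illustration of the formula 1 - (1 - β)^i, read as the chance that at least one of i independent coauthors submits the paper; the function name and the value of β are assumptions made for the example, not taken from the talk.

```python
def submission_inclusion_probability(num_authors: int, beta: float) -> float:
    """Chance that at least one of `num_authors` independent coauthors
    submits the paper, when each submits with probability `beta`."""
    return 1.0 - (1.0 - beta) ** num_authors


beta = 0.2  # hypothetical per-author submission probability (not from the slides)
for i in range(1, 6):
    # The inclusion probability rises with the number of authors i.
    print(i, round(submission_inclusion_probability(i, beta), 4))
```

Multiplying this probability by all(i) gives the expected number of i-author papers in CiteSeer under the model.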

  11. Models - CiteSeer • CiteSeer crawler model • The probability of crawling a document grows with the number of its online copies • The probability of a document being online grows with its number of authors • Probabilities are independent between authors • Each author publishes the publication online with probability δ • The crawler finds an online copy with probability γ • citeseer_c(i) = (1 - (1 - γδ)^i) * all(i) • Both models result in an equivalent type of bias
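The equivalence claimed on slide 11 can be checked numerically: the crawler model has the same functional form as the submission model once β is set to γδ. The sketch below assumes illustrative values for γ and δ, and the function names are made up for this example.

```python
def submission_inclusion_probability(num_authors: int, beta: float) -> float:
    """Submission model: 1 - (1 - beta)^i."""
    return 1.0 - (1.0 - beta) ** num_authors


def crawler_inclusion_probability(num_authors: int, gamma: float, delta: float) -> float:
    """Crawler model: each author puts a copy online with probability delta,
    and the crawler finds each copy with probability gamma."""
    per_author = gamma * delta
    return 1.0 - (1.0 - per_author) ** num_authors


gamma, delta = 0.5, 0.4  # hypothetical values, not from the slides
for i in range(1, 6):
    crawler = crawler_inclusion_probability(i, gamma, delta)
    submission = submission_inclusion_probability(i, gamma * delta)
    # Both models give the same inclusion probability for every i.
    print(i, round(crawler, 4), round(submission, 4))
```

Because only the product γδ enters the formula, author-count data alone cannot separate the two parameters, nor tell the crawler model apart from the submission model.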

  12. Coverage • Can we estimate the coverage of DBLP? • Can we estimate the coverage of CiteSeer? • Can we estimate the size of the CS literature? • We need a model of the DBLP acquisition method

  13. Models - DBLP • DBLP model • A publication is included in DBLP with probability α • α is a parameter reflecting DBLP “coverage” of the CS literature • dblp(i) = α * all(i)

  14. Coverage • citeseer(i) = (1 - (1 - β)^i) * all(i) • dblp(i) = α * all(i) • r(i) = dblp(i) / citeseer(i) • r(i) = α / (1 - (1 - β)^i)
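A short sketch of the ratio r(i) from slide 14 shows why it can be used to estimate α: the unknown all(i) cancels, and r(i) falls towards α as the number of authors grows. The value α = 0.3 is the estimate quoted in the talk; β = 0.2 is a hypothetical value chosen only for illustration.

```python
def dblp_to_citeseer_ratio(num_authors: int, alpha: float, beta: float) -> float:
    """r(i) = alpha / (1 - (1 - beta)^i); all(i) cancels out of the ratio."""
    return alpha / (1.0 - (1.0 - beta) ** num_authors)


alpha = 0.3  # DBLP coverage estimate quoted in the talk
beta = 0.2   # hypothetical per-author inclusion probability (not from the slides)
for i in (1, 2, 4, 8, 16, 32):
    # r(i) decreases monotonically and approaches alpha for large i.
    print(i, round(dblp_to_citeseer_ratio(i, alpha, beta), 4))
```

In practice α and β would be fitted to the observed ratio of DBLP to CiteSeer paper counts per author number; the slides do not say which fitting procedure was used.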

  15. Results • r(i) = α / (1 - (1 - β)^i) • α ≈ 0.3 • DBLP covers approx. 30% of the CS literature • CiteSeer covers approx. 40% • CS literature ≈ 2M publications
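The figures on slide 15 can be cross-checked against the dataset sizes from slides 4 and 5. The slides do not state how the 40% and 2M figures were derived, so the following is only a plausibility check under the assumption that the quoted database sizes and α refer to the same snapshot.

```python
dblp_entries = 550_000        # "over 550,000" references (slide 4)
citeseer_documents = 716_797  # documents in CiteSeer (slide 5)
alpha = 0.3                   # estimated DBLP coverage (slide 15)

# If DBLP holds a fraction alpha of the literature, the total is dblp / alpha.
estimated_total = dblp_entries / alpha
# CiteSeer coverage is then its size relative to that estimated total.
citeseer_coverage = citeseer_documents / estimated_total

print(f"estimated CS literature size: {estimated_total:,.0f}")  # 1,833,333
print(f"estimated CiteSeer coverage: {citeseer_coverage:.0%}")  # 39%
```

Both numbers round to the values given in the talk: about 2M publications and about 40% CiteSeer coverage.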

  16. Citation distribution

  17. Citation distribution • Studied before • Citations follow a power law • Redner, Laherrere et al., Lehmann and others • Mostly in the physics community • We use the subset of CiteSeer and DBLP papers that have citation information

  18. Citation distribution • Power law • Sparse data for high numbers of citations

  19. Citation distribution • Exponential binning • Data aggregated in exponentially increasing ‘bins’ • Equivalent to constant bins on a logarithmic scale • Easier interpolation
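Exponential binning as described on slide 19 can be sketched as follows. This is a minimal illustration and not the authors' code: the bin base of 2, the normalisation by bin width, and the sample data are all assumptions made for the example.

```python
from collections import Counter


def exponential_bins(citation_counts, base=2):
    """Aggregate positive citation counts into bins [base**k, base**(k+1)).

    Returns a list of (lower_edge, upper_edge, papers_per_unit_width) tuples,
    one per non-empty bin, ordered by lower edge.
    """
    totals = Counter()
    for count in citation_counts:
        if count < 1:
            continue  # zero-citation papers have no place on a log scale
        k = 0
        while base ** (k + 1) <= count:
            k += 1
        totals[k] += 1
    return [
        (base ** k, base ** (k + 1), totals[k] / (base ** (k + 1) - base ** k))
        for k in sorted(totals)
    ]


sample = [1, 1, 2, 3, 3, 5, 8, 13, 40]  # made-up citation counts
for lower, upper, density in exponential_bins(sample):
    print(f"[{lower}, {upper}): {density}")
```

Dividing by the bin width keeps wide bins in the sparse tail comparable with narrow bins at low citation counts, which is what makes the binned points easier to interpolate on a log-log plot.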

  20. Citation distribution • The distribution of citations is more uneven in CS than in physics • Significant differences between DBLP and CiteSeer

  21. Citation distribution • CiteSeer contains fewer low-cited papers than DBLP • No model yet • Lawrence, “Online or invisible?”

  22. Conclusions - authors • CiteSeer and DBLP have very different acquisition methods • Significant bias in CiteSeer against papers with a low number of authors (fewer than 4) • Single-author papers appear to be disadvantaged by the CiteSeer acquisition method • Two probabilistic models for paper acquisition in CiteSeer result in the same type of bias • Crawler model • Submission model

  23. Conclusions - coverage • A simple model of DBLP coverage predicts coverage of approx. 30% of the entire computer science literature • This gives CiteSeer coverage of approx. 40% and a total of around 2M CS papers

  24. Conclusions - citations • CiteSeer and DBLP citation distributions are different • Both indicate that highly cited papers in computer science receive a larger citation share than in physics • CiteSeer contains fewer low-cited papers

  25. Future Work • Repeat experiments on the most recent CiteSeer data • Other methods to estimate computer science literature size and trends • Overlap of CiteSeer and DBLP • Bias introduced by bibliography parsing • Collaborative network analysis • Connection to internet surveys?

  26. Thank you
