The Geography of arXiv.org
Rui Carvalho and Michael Batty
University College London
rui.carvalho@ucl.ac.uk m.batty@ucl.ac.uk
http://www.casa.ucl.ac.uk/secse/
What is arXiv.org?
• Founded by Paul Ginsparg in ’91 at LANL, moved to Cornell in ’01;
• Self-archive of physics, maths and computer science preprints since ’91;
• Quantitative biology added Sep ’03;
• Papers have a time stamp, so authors can claim ownership;
• Typically, papers appear in refereed journals about 12 months after journal submission;
• Some data for calendar year ’04:
  • total number of submissions (Aug ’91 through Dec ’04): 303 614
  • average submission rate (’04): 3644 papers/month
  • 18 mirror-sites in 16 countries;
  • submission rates (’04): hep 20.5%, cond-mat 20.5%, astro-ph 18.9%, math 11.8%, quant-ph 4.8%, gr-qc 4.3%, nucl 3.9%, physics(other) 3.1%, nlin 2.3%, cs 1.5%, q-bio 0.2%;
  • submissions by country (’00-’04): US edu and gov (27.5%), Germany (9.9%), Italy (6.3%), United Kingdom (5.8%), Japan (5.7%), France (5.6%), Russian Federation (3.2%);
arXiv monthly submission rate stats (Dec ’04)
“hep” = High Energy Physics, “cond-mat” = Condensed Matter Physics, “astro-ph” = Astrophysics, cross-listings in clear
Why study the Geography of arXiv.org?
• Papers often submitted in LaTeX. LaTeX is a text-based document preparation system for high-quality typesetting (it’s not a word processor!);
• In that case, LaTeX source code available for download from arXiv.org;
• Typically (but not always!), LaTeX source encodes author and address data in specific fields;
• These fields can be parsed using custom scripts (e.g. written in Perl) to extract the geographical location of the authors;
• Problem: can we parse author/address fields, extract papers with one or more US authors, and map the zip codes in their addresses?
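As a rough illustration of the field-parsing step, here is a minimal sketch in Python (the slides mention Perl; this is not the authors' script). The command list `FIELD_COMMANDS` and the function name `extract_fields` are assumptions made for the example, and real arXiv sources use many more macros and document classes than this handles.

```python
import re

# Hypothetical set of LaTeX commands assumed to carry author/address data.
FIELD_COMMANDS = ("author", "address", "affiliation", "institute")

# Matches e.g. \address{ or \author[opt]{ and stops just after the opening brace.
FIELD_RE = re.compile(
    r"\\(" + "|".join(FIELD_COMMANDS) + r")\s*(?:\[[^\]]*\])?\s*\{"
)

def extract_fields(tex):
    """Return (command, argument) pairs for each field found in LaTeX source."""
    fields = []
    for match in FIELD_RE.finditer(tex):
        start = match.end()
        i, depth = start, 1
        # Walk forward until the brace opened by the command is closed.
        while i < len(tex) and depth > 0:
            ch = tex[i]
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
            i += 1
        if depth == 0:
            fields.append((match.group(1), tex[start:i - 1].strip()))
    return fields

sample = (
    r"\author{A. N. Other}"
    r"\address{Theoretical Division, Los Alamos, \textit{New Mexico}~87545}"
)
fields = extract_fields(sample)
```

Brace counting is used instead of a single regular expression because address fields often contain nested macros such as `\textit{...}`. Sources with no matching field simply return an empty list.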
Problems with Zip Code extraction
• Identifying zip code look-alikes:
  • Easy:
    • Kiev 03028, Ukraine
    • Roma 00185, Italy
  • Not so easy:
    • Iran 71454
    • Israel 84105
• Could not process:
  • Physics Department, Northeastern University, Boston MA USA
  • address/author fields not found (as in PhD thesis or commentaries)
• Errors (found 6 in a random sample of 400 papers (1.5%))
  • Fargo 58105, ND
  • Theoretical Division and Center for Nonlinear Studyes, Los Alamos, New Mexico~87545
• Zip not in database (found 1 in 400)
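One way to separate genuine US zip codes from look-alikes is to accept a five-digit number only when the address also carries a US marker and names no foreign country. The sketch below takes that approach in Python; it is an assumption about how such a filter could work, not the authors' method. The lists `US_STATES` and `NON_US_COUNTRIES` are deliberately tiny, illustrative subsets, and `find_us_zip` is a hypothetical name.

```python
import re

# Illustrative subsets only; a real run would need complete lists.
US_STATES = {"MA": "Massachusetts", "ND": "North Dakota", "NM": "New Mexico"}
NON_US_COUNTRIES = ("Ukraine", "Italy", "Iran", "Israel")

# Five digits (optionally ZIP+4) not embedded in a longer digit run.
ZIP_RE = re.compile(r"(?<!\d)(\d{5})(?:-\d{4})?(?!\d)")

# A state abbreviation, a state name, or the token "USA" counts as a US marker.
_markers = list(US_STATES) + list(US_STATES.values()) + ["USA"]
US_MARKER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in _markers) + r")\b"
)

def find_us_zip(address):
    """Return the first 5-digit zip in a US-looking address, else None."""
    text = address.replace("~", " ")  # "~" is LaTeX's non-breaking space
    if any(country in text for country in NON_US_COUNTRIES):
        return None  # look-alike: a foreign postal code
    match = ZIP_RE.search(text)
    if match and US_MARKER_RE.search(text):
        return match.group(1)
    return None

examples = [
    "Kiev 03028, Ukraine",
    "Israel 84105",
    "Physics Department, Northeastern University, Boston MA USA",
    "Fargo 58105, ND",
    "Los Alamos, New Mexico~87545",
]
results = [find_us_zip(a) for a in examples]
```

Under these assumptions the foreign postal codes and the Boston address (which has no zip at all) yield `None`, while the Fargo and Los Alamos addresses yield a zip regardless of whether it comes before or after the state. A zip returned here would still need to be looked up in a zip code database before mapping.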
Mapping cond-mat in 2004
Total: 7957; one or more US authors: 2326 (29.2%); couldn’t process: 517 (6.5%)
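The percentages quoted here, and the 1.5% error rate from the random sample of 400 papers, can be re-derived from the raw counts. A minimal Python check, where `share` is a helper name chosen for this example:

```python
def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)

total_papers = 7957      # cond-mat papers in 2004
us_papers = 2326         # papers with one or more US authors
unprocessed = 517        # papers that could not be processed

us_share = share(us_papers, total_papers)
unprocessed_share = share(unprocessed, total_papers)
error_rate = share(6, 400)  # errors found in the random sample
```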
Next Steps
• Extend study to a larger sample of arXiv.org;
• Study spatial dynamics of arXiv papers for the period ’91 to ’05 (knowledge diffusion?);
• Compare with NSF, ARPA, etc. data by state;
• Extract geography of collaboration networks.
To find out more
• http://www.casa.ucl.ac.uk/secse/
• Spatially Embedded Complex Systems Engineering (SECSE): http://www.secse.net/ members: UCL, Leeds, Southampton, Sussex
• rui.carvalho@ucl.ac.uk m.batty@ucl.ac.uk