Scaling in the Geography of US Computer Science
Rui Carvalho and Michael Batty
University College London
rui.carvalho@ucl.ac.uk m.batty@ucl.ac.uk
http://www.casa.ucl.ac.uk/
Thanks: Michael Gastner (SFI), Isaac Councill (PSU), Chris Brunsdon (Leicester), Ben Gimpert (UCL).
Motivation
• Why Geography?
  • Scientists: who can I collaborate with in my city/country?
  • Funding Agencies: where are new research centres emerging? Is the regional distribution of funds optimal?
  • Scientometrics: distinguish between J. Smith (PSU) and J. Smith (UCL);
• Preprint server challenges:
  • [USA] NIH-funded investigators are required to submit their papers to PubMed within 1 year of publication (effective May 2, 2005);
  • [UK] Wellcome Trust-funded papers will in future have to be placed in a central public archive within six months of publication;
• Data mining challenges:
  • Processing large databases promises to uncover knowledge hidden behind the mass of available data;
  • It could dramatically speed up achievements formerly reached solely by human effort and provide new results that could not have been reached by humans unaided;
• Statistical Challenges:
  • Conventional wisdom holds that (geographical) spatial point processes have characteristic scales...
  • Yet “real world” phenomena are often far from equilibrium.
PNAS, 6 April 2004
Plan
• Open Archives Datasets:
  • Citeseer (Computer Science);
  • arXiv.org (mainly Physics, but also Maths and CS)
• Geographical Datasets:
  • The US Census Bureau makes datasets for geocoding available on the web, but Europe lacks a unified ‘open-access’ database;
• Plan:
  • Extract ZIP codes from authors’ addresses;
  • Map research centres geographically;
• Questions about the research centres:
  • How productive are they?
  • Are there non-trivial spatial structures at a geographical scale?
Can Statistical Physics Help?
What is Citeseer?
• Founded by Steve Lawrence and C. Lee Giles in 1997 (NEC);
• Now at Penn State: http://citeseer.ist.psu.edu/
• Archive of computer science research papers harvested from the web and submitted by users;
• Currently (Dec 2005) contains over 730,000 documents;
• Citeseer was developed as a model for Autonomous Citation Indexing, i.e. citation indexes are created automatically;
• Can search content in PostScript and PDF files.
Data Collecting and Parsing
• Citeseer metadata:
  • 525,055 computer science research papers;
  • 399,757 (76.14%) of which are unique;
  • 103,172 (25.81%) of the unique papers have one or more US authors;
  • 2,975 different ZIP codes in the unique papers belong to the conterminous US (48 states, plus the District of Columbia);
• 5 most productive ZIP codes:
  • Count: 3950 ZIP: 15213 Carnegie Mellon Univ, Pittsburgh, PA;
  • Count: 3403 ZIP: 02139 MIT, Cambridge, MA;
  • Count: 2954 ZIP: 94305 Stanford Univ, CA;
  • Count: 2691 ZIP: 94720 Univ California at Berkeley, CA;
  • Count: 2309 ZIP: 61801 Univ Illinois at Urbana-Champaign, IL
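The ZIP-code extraction and counting step could be sketched roughly as follows. This is a minimal illustration and not the parser used for the Citeseer metadata: the regular expression, the function names `extract_zip` and `count_zips`, and the sample addresses are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a US ZIP code appears as a two-letter state abbreviation
# followed by five digits (optionally ZIP+4), e.g. "Pittsburgh, PA 15213".
ZIP_PATTERN = re.compile(r"\b[A-Z]{2}\s+(\d{5})(?:-\d{4})?\b")


def extract_zip(address):
    """Return the last 5-digit ZIP code found in an address, or None."""
    matches = ZIP_PATTERN.findall(address)
    return matches[-1] if matches else None


def count_zips(addresses):
    """Count how many addresses map to each ZIP code."""
    zips = (extract_zip(a) for a in addresses)
    return Counter(z for z in zips if z is not None)


# Hypothetical author addresses, for illustration only.
sample = [
    "School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213",
    "MIT Laboratory for Computer Science, Cambridge, MA 02139",
    "Computer Science Dept., Carnegie Mellon Univ., Pittsburgh PA 15213-3890",
    "University College London, Gower Street, London WC1E 6BT, UK",
]
counts = count_zips(sample)
print(counts.most_common())
```

Requiring a state abbreviation before the digits is one way to avoid picking up street numbers or non-US postcodes; real address strings are messier and would likely need further rules.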
Cartograms
• Diffusion-based method for producing density-equalizing maps, Michael T. Gastner and M. E. J. Newman, Proc. Nat. Acad. Sci. USA, 101, 7499-7504 (2004)
• Density-equalizing map projections: Diffusion-based algorithm and applications, Michael T. Gastner and M. E. J. Newman, Geocomputation 2005 (to appear)
Spatial Point Processes
• Moments:
  • First moment: ρ, the expected number of points per unit area;
  • Second moment: Ripley’s K function. ρK(r) is the expected number of further points within distance r of a given point.
  • For a Poisson process, K(r) = πr²;
• But neither the first nor the second moment gives a feel for the way in which the spatial distribution changes within an area.
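As a rough illustration of the second moment, the sketch below estimates Ripley's K function for a point pattern and compares it with πr², the value for a planar Poisson process. It is a naive estimator without edge correction, so it is biased slightly downward near the border; the function name `ripley_k`, the unit-square domain, and the sample size are assumptions made for the example.

```python
import numpy as np


def ripley_k(points, r, area):
    """Naive estimate of Ripley's K(r), without edge correction.

    points: array of shape (n, 2); r: distance; area: area of the domain.
    """
    n = len(points)
    rho = n / area  # first moment: points per unit area
    # Pairwise distances between all points.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Count ordered pairs (i, j), i != j, closer than r.
    pairs = (dist <= r).sum() - n  # subtract the n zero self-distances
    # Mean number of neighbours per point, divided by the density.
    return pairs / n / rho


rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, size=(1000, 2))  # uniform points in the unit square
r = 0.05
k_hat = ripley_k(pts, r, area=1.0)
print(k_hat, np.pi * r ** 2)
```

For clustered data the estimate would exceed πr² at short distances, and for regularly spaced data it would fall below it.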
The Two-Point Correlation Function
• The two-point correlation function g(r) describes the probability of finding a point in volume dV(x1) and another point in dV(x2) at distance r = |x1-x2|;
• For a Poisson process g(r)=1;
• Edge corrections (Ripley’s weights): take a circle centred on point x passing through another point y. If the circle lies entirely within the domain D, the point y is counted once. If only a proportion p(x,y) of the circle lies within D, the point is counted as 1/p(x,y) points.
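The edge-correction weight can be approximated numerically. The sketch below assumes a rectangular domain D = [0, width] × [0, height], which is a simplification (a real geographical domain is an irregular polygon), and estimates p(x,y) by sampling points along the circle. The function name `ripley_weight` and the parameter `n_angles` are hypothetical.

```python
import numpy as np


def ripley_weight(x, y, width, height, n_angles=3600):
    """Approximate Ripley's edge-correction weight 1/p(x, y).

    p(x, y) is the fraction of the circle centred on x and passing
    through y that lies inside the rectangle [0, width] x [0, height].
    Assumes y lies inside the rectangle, so that p(x, y) > 0.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    radius = np.linalg.norm(y - x)
    # Sample points evenly along the circle's circumference.
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    cx = x[0] + radius * np.cos(angles)
    cy = x[1] + radius * np.sin(angles)
    inside = (cx >= 0) & (cx <= width) & (cy >= 0) & (cy <= height)
    p = inside.mean()  # fraction of sampled circle points inside D
    return 1.0 / p


# Circle entirely inside the unit square: the point is counted once.
w_interior = ripley_weight((0.5, 0.5), (0.6, 0.5), 1.0, 1.0)
# Centre on the left edge: about half the circle is outside.
w_edge = ripley_weight((0.0, 0.5), (0.1, 0.5), 1.0, 1.0)
print(w_interior, w_edge)
```

Sampling the circle is simple but approximate; an exact calculation would intersect the circle with the domain boundary and sum the arc lengths that fall inside.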
Computation of the Two-Point Correlation Function
• Intersection with border gives more than one polygon
• Geographical range at which the two-point correlation function can be approximated by a power-law
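One simple way to check a power-law approximation is a straight-line fit on log-log axes. The slides do not state the fitting method or the exponent, so the sketch below is only an illustration: the form g(r) ≈ c·r^(−γ), the names `fit_power_law`, `gamma` and `c`, and the synthetic data are all assumptions.

```python
import numpy as np


def fit_power_law(r, g):
    """Least-squares fit of g(r) ~ c * r**(-gamma) on log-log axes."""
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(np.log(r), np.log(g), deg=1)
    return -slope, np.exp(intercept)  # (gamma, c)


# Synthetic data with a known exponent, for illustration only.
r = np.logspace(0, 2, 20)
g = 50.0 * r ** -1.2
gamma, c = fit_power_law(r, g)
print(gamma, c)
```

In practice the fit would be restricted to the range of r over which the log-log plot looks linear, since the approximation is only expected to hold over a limited geographical range.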
To find out more
• http://www.casa.ucl.ac.uk/
• Spatially Embedded Complex Systems Engineering (SECSE): http://www.secse.net/ (members: UCL, Leeds, Southampton, Sussex)
• rui.carvalho@ucl.ac.uk m.batty@ucl.ac.uk
Poisson Point Process
• We say that a spatial process is completely random iff:
  • The number of events in any planar region A with area |A| follows a Poisson distribution with mean λ|A|, where λ is the density of points;
  • For any two disjoint regions A and B, the random variables N(A) and N(B) are independent.
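A completely random pattern of this kind can be simulated in two steps: draw the number of points from a Poisson distribution with mean λ|A|, then place that many points uniformly in A. The sketch below assumes a rectangular region; the function name `simulate_poisson` and the parameter values are hypothetical.

```python
import numpy as np


def simulate_poisson(lam, width, height, rng):
    """Simulate a homogeneous Poisson point process on a rectangle."""
    area = width * height
    n = rng.poisson(lam * area)           # N(A) ~ Poisson(lam * |A|)
    xs = rng.uniform(0.0, width, size=n)  # given N(A), points are uniform in A
    ys = rng.uniform(0.0, height, size=n)
    return np.column_stack([xs, ys])


rng = np.random.default_rng(42)
pts = simulate_poisson(lam=100.0, width=2.0, height=1.0, rng=rng)
# Counts in two disjoint halves of the region, each of area 1.
left = int((pts[:, 0] < 1.0).sum())
right = len(pts) - left
print(len(pts), left, right)
```

Such simulated patterns serve as a null model: statistics computed on the observed research centres can be compared with the same statistics computed on completely random patterns of equal density.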