
Caltech Theses Collection Usage Analysis

Caltech Theses Collection Usage Analysis. Ed Sponsler, George Porter, Betsy Coles. California Institute of Technology Library System. Three Kinds of Lies. White Lies, Damned Lies, Statistics. The Devil’s in the Data’s Details. Examining the Data’s Details.


Presentation Transcript


  1. Caltech Theses Collection Usage Analysis Ed Sponsler George Porter Betsy Coles California Institute of Technology Library System

  2. Three Kinds of Lies • White Lies • Damned Lies • Statistics

  3. The Devil’s in the Data’s Details

  4. Examining the Data’s Details • Study the data: What created it? Human? Computer? What does it mean? • WRONG: How can the data address my questions? • RIGHT: What questions can the data address?

  5. Let’s Put Some Honesty into Statistics

  6. Caltech Theses Facts • First Digital Deposit: July, 2001 • Number of Theses: 1208 • Software Used: VT ETDdb (but not for much longer) • Campus Mandate: June, 2002 • Defense Date Range: 1922 to present

  7. Caltech Theses Statistics • Data Source: Apache Web Logs • What is an access? • What can be ignored and why? • What do human v robot accesses look like? • What is a referrer? User Agent? Host IP? Requested Object?

  8. Apache Combined Log Format 63.89.199.36 - - [21/Jul/2003:12:53:01 -0700] "GET /etd/available/etd-12182002-190040/unrestricted/thesis.pdf HTTP/1.1" 200 15767 "http://etd.caltech.edu/etd/available/etd-2182002-190040/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"
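The questions on the previous slide (host IP, requested object, referrer, user agent) map directly onto fields of this log line. A minimal sketch of pulling them apart, assuming a regular expression is an acceptable parser; the names `COMBINED_RE` and `parse_line` are illustrative and not part of the original presentation:

```python
import re
from typing import Optional

# Combined Log Format fields, in order:
# host ident user [time] "request" status bytes "referrer" "user agent"
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # The request field looks like: GET /path HTTP/1.1
    parts = entry["request"].split()
    entry["method"] = parts[0] if len(parts) > 0 else ""
    entry["path"] = parts[1] if len(parts) > 1 else ""
    return entry


# The sample line from the slide, copied verbatim.
SAMPLE = (
    '63.89.199.36 - - [21/Jul/2003:12:53:01 -0700] '
    '"GET /etd/available/etd-12182002-190040/unrestricted/thesis.pdf HTTP/1.1" '
    '200 15767 "http://etd.caltech.edu/etd/available/etd-2182002-190040/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"'
)

entry = parse_line(SAMPLE)
print(entry["host"], entry["status"], entry["path"])
```

Returning `None` for lines that do not match lets a caller count and inspect malformed entries instead of silently dropping them.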

  9. DeDupe • The dedupe filter counts each host’s access to a thesis only once. Repeat requests from that host are ignored, even if they are for a different file from the same thesis, such as a different chapter.

  10. DeDupe • The result of the dedupe filter is an access_log containing at most one log entry for each unique host that has accessed any file of a given thesis.
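The filter described on the two DeDupe slides can be sketched as a single pass over the log that remembers which hosts have already been seen for each thesis. This is an illustration under stated assumptions, not the authors' implementation: it assumes the thesis ID is the `etd-...` directory in the requested path, and the second and third sample lines (including `chapter2.pdf` and the host `131.215.12.22` pairing) are hypothetical.

```python
import re

# Host is the first field; the requested path follows the HTTP method.
LINE_RE = re.compile(r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)')
# ASSUMPTION: the thesis ID is the path component that starts with "etd-".
THESIS_RE = re.compile(r'/(etd-[^/]+)/')


def dedupe(lines):
    """Keep at most one log entry per (thesis, host) pair, in log order."""
    seen = {}   # thesis ID -> set of host IPs already counted
    kept = []   # the deduplicated access_log
    for line in lines:
        line_match = LINE_RE.match(line)
        if line_match is None:
            continue
        thesis_match = THESIS_RE.search(line_match.group("path"))
        if thesis_match is None:
            continue  # not a request for a thesis file
        hosts = seen.setdefault(thesis_match.group(1), set())
        host = line_match.group("host")
        if host in hosts:
            continue  # this host already accessed this thesis
        hosts.add(host)
        kept.append(line)
    return kept, seen


# Hypothetical log lines modeled on the Combined Log Format sample.
LOG = [
    '63.89.199.36 - - [21/Jul/2003:12:53:01 -0700] '
    '"GET /etd/available/etd-12182002-190040/unrestricted/thesis.pdf HTTP/1.1" 200 15767 "-" "Mozilla/4.0"',
    '63.89.199.36 - - [21/Jul/2003:12:55:10 -0700] '
    '"GET /etd/available/etd-12182002-190040/unrestricted/chapter2.pdf HTTP/1.1" 200 9000 "-" "Mozilla/4.0"',
    '131.215.12.22 - - [21/Jul/2003:13:01:44 -0700] '
    '"GET /etd/available/etd-12182002-190040/unrestricted/thesis.pdf HTTP/1.1" 200 15767 "-" "Mozilla/4.0"',
]

kept, seen = dedupe(LOG)
print(len(kept), sorted(seen["etd-12182002-190040"]))
```

The `seen` dictionary mirrors the thesis ID to host IP lists shown on the data structure slide: one set of hosts per thesis.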

  11. DeDupe Data Structure
      Theses ID: etd-3493 | etd-1139 | etd-944
      Host IP: 131.212.13.22 | 124.24.21.1 | 145.46.55.6
      Host IP: 131.212.13.22 | 133.25.5.12 | 154.21.78.9
      Host IP: 131.215.12.22 | 133.42.3.99 | 101.24.21.99
      access_log:
      131.212.13.22 - - [21/Jul/2003:12
      124.24.21.1 - - [12/Aug/2003:15
      145.46.55.6 - - [05/Sep/2003:05
      131.212.13.22 - - [20/Sep/2003:04
      133.25.5.12 - - [28/Sep/2003:11
      154.21.78.9 - - [03/Oct/2003:09
      131.215.12.22 - - [05/Jan/2004:02
      133.42.3.99 - - [09/Jan/2004:07
      101.24.21.99 - - [14/Feb/2004:01

  12. DeDupe Processing

  13. Apache Status Codes

  14. User Agents

  15. User Agents • Known Human Users 71% (Internet Explorer 60%, Netscape 11%) • Bots/Harvesters/Other 29% (Googlebot 14%, Other 15%)

  16. Search Servers

  17. PDF Downloads from 7/1/2001 - 5/31/2004

  18. Country of Origin Report • GeoIP database contains IP blocks and their country of origin • More useful and complete than top-level domain names (.edu, .de, .uk, etc.)
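The lookup this slide describes amounts to finding which IP block contains a host address. A minimal sketch using Python's standard `ipaddress` module; the table `IP_BLOCKS` is a hypothetical stand-in for the GeoIP database, built from reserved documentation address ranges with placeholder country labels rather than real GeoIP data:

```python
import ipaddress

# HYPOTHETICAL stand-in for a GeoIP table: (IP block, country of origin).
# These are reserved documentation ranges, not real country assignments.
IP_BLOCKS = [
    (ipaddress.ip_network("192.0.2.0/24"), "Country A"),
    (ipaddress.ip_network("198.51.100.0/24"), "Country B"),
]


def country_of(host_ip: str) -> str:
    """Return the country whose IP block contains host_ip, or 'Unknown'."""
    address = ipaddress.ip_address(host_ip)
    for network, country in IP_BLOCKS:
        if address in network:
            return country
    return "Unknown"


print(country_of("192.0.2.17"))
print(country_of("203.0.113.5"))
```

A linear scan is enough for illustration; a full GeoIP table would normally be sorted and searched by binary search or accessed through a dedicated reader library.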

  19. Geographic Analysis • 153 countries represented
      United States | 76294
      China | 7943
      Germany | 4763
      United Kingdom | 4646
      Canada | 3918
      India | 3328
      Japan | 3271
      France | 2887
      Italy | 2066
      Taiwan | 2063
      Korea | 1639
      Spain | 1300
      Australia | 1249
      Netherlands | 1239
      Iran | 1208
      Malaysia | 1160
      Hong Kong | 1007
      Turkey | 961
      Brazil | 860
      Poland | 853
      Singapore | 847
      Russian Fed. | 812
      Switzerland | 810
      Sweden | 759
      Israel | 743
      Belgium | 735
      Mexico | 724
      Thailand | 648
      Egypt | 542
      Greece | 511
      Romania | 480
      Vietnam | 455
      Indonesia | 451
      Portugal | 438
      Finland | 419
      Philippines | 418

  20. Most Popular Theses
      Count | Defense Date
      3322 | 2000-10-23
      3199 | 2002-08-07
      3174 | 2002-07-16
      2457 | 2001-10-23
      2153 | 2002-10-02
      2120 | 2002-09-25
      2098 | 2001-05-18
      2073 | 2002-10-04
      1959 | 2002-11-05
      1848 | 2003-01-14
      1675 | 2002-08-14
      1614 | 2002-05-02
      1486 | 2002-09-04
      1378 | 2003-09-02
      1304 | 2001-02-09
      1296 | 2003-05-15
      1176 | 2003-05-15
      1134 | 2001-05-07
      1130 | 2002-01-16
      1124 | 2001-03-08
      1123 | 2003-06-02
      1091 | 2001-01-19
      1087 | 2003-03-20

  21. Most Popular Theses
      Defense Date | Title (>1000 downloads)
      2000-10-23 | Blocking Adhesion to Cell and Tissue Surfaces via Steric Stabilization with Graft Copolymers containing Poly(Ethylene Glycol) and Phenylboronic Acid
      2002-08-07 | Electrochemical Sensors Based on DNA-Mediated Charge Transport Chemistry
      2002-07-16 | Effects of Surface Modification on Charge-Carrier Dynamics at Semiconductor Interfaces
      2001-10-23 | I. Seafloor Morphology of the Osbourn Trough and Kermadec Trench and II. Multiscale Dynamics of Subduction Zones
      2002-10-02 | I. Structure-Function Analysis of the Mechanosensitive Channel of Large Conductance. II. Design of Novel Magnetic Materials using Crystal Engineering.

  22. Most Popular Theses
      Defense Date | Title
      2002-09-25 | Modeling a Hox Gene Network: Stochastic Simulation with Experimental Perturbation
      2001-05-18 | All-Optical Logic Circuits based on the Polarization Properties of Non-Degenerate Four-Wave Mixing
      2002-10-04 | Site-specific incorporation of synthetic amino acids into functioning ion channels
      2002-11-05 | Impact-Ionization Mass Spectrometry of Cosmic Dust
      2003-01-14 | Force-Detected Nuclear Magnetic Resonance Independent of Field Gradients
      2002-08-14 | Fast, High-Order Methods for Scattering by Inhomogeneous Media
      2002-05-02 | Neural dynamics underlying complex behavior in a songbird
      2002-09-04 | Spectroscopic Characterization of DNA-mediated Charge Transfer

  23. Most Popular Theses
      Defense Date | Title
      2003-09-02 | Protein Engineering Through in vivo Incorporation of Phenylalanine Analogs
      2001-02-09 | Synthesis, Passivation and Charging of Silicon Nanocrystals
      2003-05-15 | Sensitizer-linked substrates as probes of heme enzyme structure and catalysis
      2003-05-15 | Mirror Thermal Noise in Interferometric Gravitational Wave Detectors
      2001-05-07 | Analysis and Design of Turbo-like Codes
      2002-01-16 | Computational Enzyme Design
      2001-03-08 | An Investigation of Ion Engine Erosion by Low Energy Sputtering
      2003-06-02 | Laboratory Evolution of Cytochrome P450 Peroxygenase Activity
      2001-01-19 | Passive Hypervelocity Boundary Layer Control Using an Acoustically Absorptive Surface
      2003-03-20 | Mapping the cytochrome c folding landscape

  24. Human / Robot Split • Human activity is identified by ‘MSIE’ or ‘Mozilla’ in the User Agent field of the apache_log
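The rule on this slide is a plain substring test on the User Agent field. A minimal sketch, with the function names `is_human` and `split_counts` chosen here for illustration; the first sample agent is the one from the Combined Log Format slide and the other two are illustrative strings for robots named elsewhere in the deck:

```python
from collections import Counter


def is_human(user_agent: str) -> bool:
    """Apply the slide's rule: 'MSIE' or 'Mozilla' in the User Agent means human."""
    return "MSIE" in user_agent or "Mozilla" in user_agent


def split_counts(user_agents):
    """Count accesses classified as human versus robot."""
    return Counter("human" if is_human(agent) else "robot" for agent in user_agents)


AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)",
    "Googlebot/2.1 (+http://www.googlebot.com/bot.html)",  # illustrative
    "Wget/1.8.2",                                          # illustrative
]

counts = split_counts(AGENTS)
print(counts["human"], counts["robot"])
```

One caveat on applying this rule to other logs: any robot whose User Agent string itself contains "Mozilla" would be counted as human, so the agent list is worth inspecting before trusting the split.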

  25. Referrers by Human Use (MSIE | Mozilla) • etd.caltech.edu 33% • www.google.com 32% • search.yahoo.com 8% • www.google.de 3% • all others <2% (each) • 492 total referrers
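Percentages like these can be produced by reducing each referrer URL to its host name and tallying. A minimal sketch, assuming the referrer strings have already been extracted from human accesses; the sample list below is hypothetical apart from the etd.caltech.edu URL taken from the Combined Log Format slide, and the function names are illustrative:

```python
from collections import Counter
from urllib.parse import urlparse


def referrer_host(referrer: str) -> str:
    """Reduce a referrer URL to its host name; '-' or empty becomes '(none)'."""
    host = urlparse(referrer).hostname
    return host if host else "(none)"


def referrer_shares(referrers):
    """Return {host: percent of accesses}, most common host first."""
    counts = Counter(referrer_host(r) for r in referrers)
    total = sum(counts.values())
    return {host: 100.0 * n / total for host, n in counts.most_common()}


# HYPOTHETICAL sample of referrer fields.
REFERRERS = [
    "http://etd.caltech.edu/etd/available/etd-2182002-190040/",
    "http://etd.caltech.edu/",
    "http://www.google.com/search?q=thesis",
    "-",
]

shares = referrer_shares(REFERRERS)
for host, percent in shares.items():
    print(f"{host} {percent:.0f}%")
```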

  26. Most Active Robots Since April, 2004
      Googlebot | 3524
      Googlebot/Test | 1100
      TurnitinBot | 362
      Wget | 252
      msnbot | 162
      DA | 41
      Contype | 36
      ia_archiver | 33
      FAST-WebCrawler | 18
      NPBot | 16
      NetAnts | 16

  27. Summary • Keep Statistics Honest: understand and scrub your data before analysis • Google is key for discovery • Theses are popular because they are new and have useful content

  28. Next Steps • Compare download frequencies, not just totals • Create local IP -> domain name database • Adapt DeDupe to CODA EPrints Archives
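The first next step, comparing frequencies rather than totals, can be sketched as downloads per day of availability. This is an illustration only. It assumes, without support from the slides, that a thesis became available on the later of its defense date and 7/1/2001, and it uses the 7/1/2001 - 5/31/2004 reporting window from the PDF Downloads slide. The three (count, defense date) pairs come from the Most Popular Theses slide.

```python
from datetime import date

WINDOW_START = date(2001, 7, 1)   # start of the reporting window
WINDOW_END = date(2004, 5, 31)    # end of the reporting window


def downloads_per_day(count: int, defense_date: date) -> float:
    """Downloads divided by days available within the reporting window."""
    # ASSUMPTION: availability starts at the defense date, but no earlier
    # than the start of the window. Actual deposit dates are not in the slides.
    start = max(defense_date, WINDOW_START)
    days = (WINDOW_END - start).days
    if days <= 0:
        raise ValueError("thesis was not available during the window")
    return count / days


# (download count, defense date) pairs from the Most Popular Theses slide.
THESES = [
    (3322, date(2000, 10, 23)),
    (3199, date(2002, 8, 7)),
    (1378, date(2003, 9, 2)),
]

ranked = sorted(THESES, key=lambda t: downloads_per_day(*t), reverse=True)
for count, defended in ranked:
    print(defended.isoformat(), count, round(downloads_per_day(count, defended), 2))
```

Under these assumptions the ordering by frequency differs from the ordering by total, which is the kind of difference this next step is meant to surface.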

  29. Caltech Library System’s Online Digital Archives Theses http://etd.caltech.edu All Archives http://coda.caltech.edu
