
A brief history of markup of social science data: from punched cards to the ‘life cycle’ approach

A brief history of markup of social science data: from punched cards to the ‘life cycle’ approach. Presentation to: International symposium on XML for the long haul (Balisage 2010 pre-conference) By Laine G.M. Ruus, Librarian emeritus, University of Toronto 2010-08-02



Presentation Transcript


  1. A brief history of markup of social science data: from punched cards to the ‘life cycle’ approach Presentation to: International symposium on XML for the long haul (Balisage 2010 pre-conference) By Laine G.M. Ruus, Librarian emeritus, University of Toronto 2010-08-02 http://www.chass.utoronto.ca/~laine/misc/balisage2010.ppt

  2. Overview: • What are data (that is, quantitative social science data)? • History of social science quantitative data and metadata • Lessons learned

  3. What are data?

  4. Data are… • Representations of selected characteristics of a population of entities, eg individuals, companies, periods of time, etc • Characteristics are grouped, and variations of a characteristic are assigned (normally) numeric values • Assigning numeric values to variations of a characteristic allows their manipulation by mathematical/statistical procedures

  5. wisdom / knowledge / information (statistics) / data

  6. Data and statistics are not equals • Statistics are of two kinds: • Descriptive statistics: summaries of common characteristics of the raw data units (one-way tables, two-way tables … multi-way tables) • Inferential statistics: measures of the strength and direction of relationships among characteristics of raw data units

  7. Of course, statistics (descriptive or inferential) can become data in their turn, and used in other statistical procedures.

  8. Data and statistics are not equals (cont’d) • I.e., data: • are the raw materials from which statistics are generated • are ideally available at the level at which they were originally collected (= microdata) • need to be manipulated with statistical software in order to be comprehensible
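The distinction between microdata and descriptive statistics can be illustrated with a minimal sketch. The records and the function names `one_way` and `two_way` are invented for illustration and are not taken from the presentation; the point is only that tables are derived from the raw units, not the other way round.

```python
from collections import Counter

# Hypothetical microdata: one record per respondent (invented values).
microdata = [
    {"province": "ON", "gender": "F"},
    {"province": "ON", "gender": "M"},
    {"province": "QC", "gender": "F"},
    {"province": "ON", "gender": "F"},
    {"province": "QC", "gender": "M"},
]

def one_way(records, var):
    """One-way table: frequency count of a single characteristic."""
    return Counter(r[var] for r in records)

def two_way(records, row_var, col_var):
    """Two-way table: counts for each (row value, column value) pair."""
    return Counter((r[row_var], r[col_var]) for r in records)

province_counts = one_way(microdata, "province")
crosstab = two_way(microdata, "province", "gender")
```

Once only `province_counts` or `crosstab` survives, the individual records cannot be recovered from them, which is one reason access to microdata matters.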

  9. Data

  10. raw data

  11. Metadata

  12. record layout

  13. variable description (aka data dictionary)

  14. province gender

  15. syntax file for SPSS

  16. Metadata are… • Instructions to explain the content and coding of a data set (whether numeric, alphabetic, or other), and aid in their correct interpretation • Can be intended for human or computer consumption, but are ideally both
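How a record layout and a data dictionary together make raw data interpretable can be sketched roughly as follows. The column positions, codes, labels, and the sample record are all assumptions made up for this illustration; the actual layouts shown on the slides are not reproduced in the transcript.

```python
# Hypothetical record layout: variable name -> (start column, width).
# Columns are 1-based, as in a typical printed record layout.
RECORD_LAYOUT = {
    "respondent_id": (1, 4),
    "province": (5, 2),
    "gender": (7, 1),
}

# Hypothetical data dictionary: code -> label for each coded variable.
VALUE_LABELS = {
    "province": {"35": "Ontario", "24": "Quebec"},
    "gender": {"1": "Male", "2": "Female"},
}

def decode_record(line):
    """Slice one fixed-width raw record and translate codes to labels."""
    decoded = {}
    for name, (start, width) in RECORD_LAYOUT.items():
        raw = line[start - 1 : start - 1 + width]
        # Variables without value labels (e.g. identifiers) are kept as-is.
        decoded[name] = VALUE_LABELS.get(name, {}).get(raw, raw)
    return decoded

record = decode_record("0001352")
```

Without `RECORD_LAYOUT` and `VALUE_LABELS`, the string `"0001352"` is just seven digits, which is the sense in which lost metadata make data unusable.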

  17. Raw data + a syntax file, processed through a statistical software package, result in a system file (average shelf life: less than 10 years)

  18. The beginnings • Hollerith cards first used to process the 1890 US census of population • By the 1930s, public opinion polling was being used to, e.g., predict electoral outcomes • 1936 Literary Digest poll predicted defeat of Roosevelt in the US presidential election • Data gathering make-work projects in the 30s in the US, such as economic censuses, surveys on unemployment, crop production, etc

  19. By the 1940s • Polling and survey taking matured • Beginnings of improved sampling methods, such as Gallup’s quota samples • 1948 polls chose Dewey over Truman in the US presidential election, leading to formation of a committee to determine the cause of the error • The Roper Center was created, the first data archive (1946) • Data stored on punched cards, and analyzed using card sorters and similar equipment • And metadata usually looked like this…

  20. The metadata for the May 1945 Canadian Gallup Poll…

  21. The 1950s… • UNIVAC 1, the first alphanumeric computer • A UNIVAC 1 correctly predicted the Eisenhower sweep in the 1952 US presidential election • MIT began working on keyboard entry • Development of the COBOL compiler and Fortran • Magnetic tapes, at 200 bpi, could store the contents of 70,000 punched cards, i.e. about 5.6 megabytes of data • Lucci & Rokkan promoted the idea of data management by libraries
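The 5.6 megabyte figure can be checked with a short calculation. The sketch assumes the standard 80-column punched card with one character per column and a decimal megabyte (1,000,000 characters); the function name is hypothetical.

```python
COLUMNS_PER_CARD = 80  # assumption: standard 80-column card, one character per column

def cards_to_megabytes(cards, columns=COLUMNS_PER_CARD):
    """Approximate capacity of a deck of cards in decimal megabytes."""
    return cards * columns / 1_000_000

tape_mb = cards_to_megabytes(70_000)  # 70,000 cards x 80 columns = 5,600,000 characters
```

Under these assumptions the result matches the figure quoted on the slide.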

  22. But the metadata for the August 1958 Canadian Gallup poll still looked like this…

  23. 1960s… • Development of BASIC, the Unix operating system, and ASCII, which allowed interchange of data among different computers • Statistical software packages: DATA-TEXT, SPSS, P-STAT, BIOMED, NUCROS, SAS • Magnetic tapes moved from 556 to 800 bpi • Most social scientists were still writing their own local software, or using card-sorters and calculators to produce cross-tabulations and compute chi-squares

  24. 1970s, a watershed decade… • Microprocessors, and 8” and later 5-1/4” diskettes • Wang word processor, Ataris, Apple 1 and the Commodore PET • dBASE, VisiCalc and WordStar • ARPANET and expansion of time-sharing and online systems • Online bibliographic services such as Dialog, BRS, and Orbit

  25. 1970s (cont’d) • David Nasatir wrote the first manual on data management under the aegis of UNESCO (1972) • Mid-decade saw the creation of IASSIST, and the first training at ICPSR for data librarians • US census of population 1970 partly disseminated on computer tapes instead of print, forcing libraries to consider this new medium

  26. 1970s (cont’d) • OSIRIS software developed at University of Michigan, included statistical capabilities as well as outstanding data and metadata management • NSF funded the National Conference on Cataloging and Information Services for Machine-Readable Data Files at Airlie House in Virginia • US Department of Justice funded the project which resulted in Roistacher’s Style manual for machine-readable data files – bibliographic identity, methodology, and data dictionary

  27. An OSIRIS codebook generally followed the Roistacher recommendations. The record layout and data dictionary portion looked like this:

  28. 1980s • Supercomputers and NSFNET changed the face of large scale computing, and PCs and Macs did the same for small scale computing • BITNET, followed by the Internet, provided e-mail, listservs and remote login • Tape cartridges held the equivalent of 8 million cards, or four times the capacity of a 6250 bpi tape. Five megabyte hard drives became available for microcomputers • IBM brought microcomputing to the academic sector • CD-ROMs, and the Cuadra directory of databases

  29. 1980s (cont’d) • Sue Dodd’s Cataloging machine-readable data files : an interpretive manual, 1982 • Social forces one of the first journals to include guidelines on citing machine-readable data files • Population index the first bibliographic journal to cite data files • A draft revision of AACR2 chapter 9 (renamed: Computer Files) was published in 1987 – bibliographic control for data files

  30. 1990s • Migration from IBM mainframes (EBCDIC) to Unix (ASCII) • Demise of tapes for storage, in favour of widespread use of CD-ROM • Statistics Canada made the electronic products from the census the primary product • Gopher, developed in 1991, was replaced by the WWW and HTML, and by 1996 there were about 100,000 web servers • Beginning of the DDI (Data Documentation Initiative) project in 1995, which published its first DTD in 1996

  31. Three major developments led up to DDI: • OSIRIS’ metadata management capability • Roistacher’s outline of machine-readable data file documentation (1980) • Dodd’s cataloguing manual (1982)

  32. OSIRIS metadata • OSIRIS dictionary provided structural information: location, size, missing data, a variable name and a variable label (brief) • OSIRIS codebook provided a tagged format: • Introduction (unstructured) • Full question text • Variable values and value labels • Variable-level comments • North American institutions standardized on the OSIRIS type-1 and type-4 codebooks, Europe on the type-3 format codebook

  33. Roistacher’s style manual • Provided outline of the information that should be contained in the full metadata (aka codebook), including • Bibliographic identity • Project history • File processing summary • Data dictionary contents • Recommended appendices

  34. Sue Dodd’s cataloguing manual • Further refined the bibliographic identity component of the metadata • Provided a cross-walk to AACR cataloguing rules • Provided the foundation for the development of a MARC record • Dodd also defined the components of a bibliographic citation

  35. Many kinds of metadata for many purposes • Data collection • Data interpretation • Data preservation • Data discovery • Coding standardization

  36. Based on the NISO metadata classification:

  37. DDI provides a format … • From which other subtypes of metadata (bibliographic records, syntax files, question banks, etc) can be generated • Describes not just microdata but also an intelligent means of describing aggregate statistics as data • Can incorporate all documentation from original project conception to edition management and post-processing
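The idea that other subtypes of metadata can be generated from a single marked-up source can be sketched as follows. The XML fragment is a heavily simplified illustration that borrows element names in the style of the DDI codebook (`var`, `location`, `labl`, `catgry`, `catValu`); it has not been validated against any DDI DTD. The variables, codes, file name, and the function `ddi_to_spss` are assumptions, and the generated SPSS-style commands are a sketch rather than a tested setup file.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical DDI-style fragment (not validated against the DTD).
DDI_FRAGMENT = """
<codeBook>
  <dataDscr>
    <var name="province">
      <location StartPos="5" EndPos="6"/>
      <labl>Province of residence</labl>
      <catgry><catValu>35</catValu><labl>Ontario</labl></catgry>
      <catgry><catValu>24</catValu><labl>Quebec</labl></catgry>
    </var>
    <var name="gender">
      <location StartPos="7" EndPos="7"/>
      <labl>Gender of respondent</labl>
      <catgry><catValu>1</catValu><labl>Male</labl></catgry>
      <catgry><catValu>2</catValu><labl>Female</labl></catgry>
    </var>
  </dataDscr>
</codeBook>
"""

def ddi_to_spss(xml_text, data_file="data.txt"):
    """Derive SPSS-style syntax lines from the variable descriptions."""
    root = ET.fromstring(xml_text)
    columns, var_labels, value_labels = [], [], []
    for var in root.iter("var"):
        name = var.get("name")
        loc = var.find("location")
        start, end = loc.get("StartPos"), loc.get("EndPos")
        # Single-column variables are written as "name 7", others as "name 5-6".
        columns.append(f"{name} {start}" if start == end else f"{name} {start}-{end}")
        var_labels.append(f"{name} '{var.findtext('labl')}'")
        cats = " ".join(
            f"{c.findtext('catValu')} '{c.findtext('labl')}'"
            for c in var.findall("catgry")
        )
        value_labels.append(f"{name} {cats}")
    return "\n".join([
        f"DATA LIST FILE='{data_file}' / {' '.join(columns)}.",
        "VARIABLE LABELS " + " / ".join(var_labels) + ".",
        "VALUE LABELS " + " / ".join(value_labels) + ".",
    ])

syntax = ddi_to_spss(DDI_FRAGMENT)
```

The same parsed structure could equally feed a bibliographic record or a question bank; only the final formatting step would differ.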

  38. DDI provides a format … (cont’d) • 3rd generation data access tools (Nesstar, DDI, and Dataverse (VDC)) all support DDI 2.0 at present and offer a useful way to provide on-line, remote, distributed access to data discovery and data • Leads to proliferation of new applications of metadata and realization of initiatives from earlier decades

  39. Lessons learned • Three killers of data: • Software dependence • Lost metadata • Physical medium on which data are stored • No solution as yet combines data, full metadata and statistical capability in a format that is not software-dependent
