A brief history of markup of social science data: from punched cards to the ‘life cycle’ approach Presentation to: International symposium on XML for the long haul (Balisage 2010 pre-conference) By Laine G.M. Ruus, Librarian emeritus, University of Toronto 2010-08-02 http://www.chass.utoronto.ca/~laine/misc/balisage2010.ppt
Overview: • What are data (that is, quantitative social science data)? • History of social science quantitative data and metadata • Lessons learned
Data are… • Representations of selected characteristics of a population of entities, e.g. individuals, companies, periods of time, etc. • Characteristics are grouped, and variations of a characteristic are assigned (normally) numeric values • Assigning numeric values to the variations of a characteristic allows them to be manipulated by mathematical/statistical procedures
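As a minimal illustration of that coding step (a hypothetical sketch in Python; the respondents, the gender codes, and the province codes are all invented for the example, though the province values resemble Statistics Canada's standard geography codes):

```python
# Hypothetical sketch: assigning numeric codes to variations of a characteristic.
# The categories, codes, and respondents are illustrative, not from a real survey.

gender_codes = {"male": 1, "female": 2}          # coding scheme for one characteristic
province_codes = {"Ontario": 35, "Quebec": 24}   # illustrative geographic codes

respondents = [
    {"gender": "female", "province": "Ontario"},
    {"gender": "male",   "province": "Quebec"},
    {"gender": "female", "province": "Quebec"},
]

# Once coded numerically, the values can be counted, averaged, cross-tabulated, etc.
coded = [
    {"gender": gender_codes[r["gender"]], "province": province_codes[r["province"]]}
    for r in respondents
]
print(coded)
```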
Data and statistics are not equals • Statistics are of two kinds: • Descriptive statistics: summaries of common characteristics of the raw data units (one-way tables, two-way tables … multi-way tables) • Inferential statistics: measures of the strength and direction of relationships among characteristics of the raw data units
Of course, statistics (descriptive or inferential) can become data in their turn, and be used in other statistical procedures.
Data and statistics are not equals (cont’d) • I.e., data are: • the raw materials from which statistics are generated • ideally available at the level at which the data were originally collected (= microdata) • in need of manipulation with statistical software in order to be comprehensible
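To make the descriptive/inferential distinction concrete, here is a small sketch under stated assumptions (hypothetical microdata; pandas and scipy are assumed to be available) that produces a descriptive two-way table and an inferential chi-square statistic from the same coded raw data:

```python
# Hypothetical microdata: each row is one respondent (the unit of collection).
import pandas as pd
from scipy.stats import chi2_contingency

microdata = pd.DataFrame({
    "province": [35, 35, 24, 24, 35, 24, 35, 24],   # coded values (illustrative)
    "gender":   [1, 2, 2, 1, 2, 1, 1, 2],
})

# Descriptive statistic: a two-way table summarizing the raw data units.
table = pd.crosstab(microdata["province"], microdata["gender"])
print(table)

# Inferential statistic: strength of the relationship between the two
# characteristics (chi-square test of independence on the same table).
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)
```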
[Slide images: an example record layout and variable description (aka data dictionary), showing variables such as province and gender]
Metadata are… • Instructions that explain the content and coding of a data set (whether numeric, alphabetic, or other), and aid in its correct interpretation • Can be intended for human or computer consumption, but are ideally both
Raw data + a syntax file, processed through a statistical software package, result in a system file – average shelf life: less than 10 years
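The role of the syntax file can be sketched roughly as follows (a hypothetical fixed-width record layout and coding scheme; pandas stands in here for the statistical package, and the pickle file stands in for its proprietary system file):

```python
# Rough sketch of what a 'syntax file' encodes: the record layout (column
# positions) and the value labels needed to turn raw fixed-width data into
# something a statistical package can use. All values are hypothetical.
import io
import pandas as pd

# Two fixed-width records: columns 1-2 = province code, column 3 = gender code.
raw_records = io.StringIO("351\n242\n")

# Record layout: where each variable lives in the fixed-width record.
colspecs = [(0, 2), (2, 3)]
names = ["province", "gender"]
data = pd.read_fwf(raw_records, colspecs=colspecs, names=names)

# Value labels: the coding scheme, without which the numbers are opaque.
data["gender_label"] = data["gender"].map({1: "male", 2: "female"})

# Saving in the package's own binary format yields the 'system file' --
# compact and convenient, but readable only as long as that format survives.
data.to_pickle("survey_system_file.pkl")
print(data)
```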
The beginnings • Hollerith cards were first used to process the 1890 US census of population • By the 1930s, public opinion polling was being used to, e.g., predict electoral outcomes • The 1936 Literary Digest poll predicted the defeat of Roosevelt in the US presidential election • Data-gathering make-work projects in the 1930s in the US, such as economic censuses and surveys on unemployment, crop production, etc.
By the 1940s • Polling and survey taking matured • Beginnings of improved sampling methods, such as Gallup’s quota samples • The 1948 polls predicted Dewey would defeat Truman in the US presidential election, leading to the formation of a committee to determine the source of the error • The Roper Center, the first data archive, was created (1946) • Data were stored on punched cards, and analyzed using card sorters and similar equipment • And metadata usually looked like this…
The 1950s… • UNIVAC I, the first alphanumeric computer • UNIVAC I correctly predicted the Eisenhower sweep in the 1952 US presidential election • MIT began working on keyboard entry • Development of the COBOL compiler and Fortran • Magnetic tapes, at 200 bpi, could store the contents of 70,000 punched cards, i.e. about 5.6 megabytes of data • Lucci & Rokkan promoted the idea of data management by libraries
But the metadata for the August 1958 Canadian Gallup poll still looked like this…
1960s… • Development of Basic, the Unix operating system, and ASCII, which allowed the interchange of data among different computers • Statistical software packages: DATA-TEXT, SPSS, P-STAT, BIOMED, NUCROS, SAS • Magnetic tapes moved from 556 to 800 bpi • Most social scientists were still writing their own local software, or using card sorters and calculators to produce cross-tabulations and compute chi-squares
1970s a watershed decade… • Microprocessors, and 8” and later 5-1/4” diskettes • The Wang word processor, Ataris, the Apple I and the Commodore PET • dBASE, VisiCalc and WordStar • ARPANET and the expansion of time-sharing and online systems • Online bibliographic services such as Dialog, BRS, and Orbit
1970s (cont’d) • David Nasatir wrote the first manual on data management, under the aegis of UNESCO (1972) • Mid-decade saw the creation of IASSIST, and the first training at ICPSR for data librarians • The 1970 US census of population was partly disseminated on computer tape instead of in print, forcing libraries to consider this new medium
1970s (cont’d) • The OSIRIS software, developed at the University of Michigan, included statistical capabilities as well as outstanding data and metadata management • NSF funded the National Conference on Cataloging and Information Services for Machine-Readable Data Files at Airlie House in Virginia • The US Department of Justice funded the project which resulted in Roistacher’s Style manual for machine-readable data files – covering bibliographic identity, methodology, and the data dictionary
An OSIRIS codebook generally followed the Roistacher recommendations. The record layout and data dictionary portion looked like this:
1980s • Supercomputers and NSFNET changed the face of large-scale computing, and PCs and Macs did the same for small-scale computing • BITNET, followed by the Internet, provided e-mail, listservs and remote login • Tape cartridges held the equivalent of 8 million cards, or four times that of a 6250 bpi tape. Five-megabyte hard drives became available for microcomputers • IBM brought microcomputing to the academic sector • CD-ROMs, and the Cuadra directory of databases
1980s (cont’d) • Sue Dodd’s Cataloging machine-readable data files: an interpretive manual (1982) • Social Forces was one of the first journals to include guidelines on citing machine-readable data files • Population Index was the first bibliographic journal to cite data files • A draft revision of AACR2 chapter 9 (renamed: Computer Files) was published in 1987 – bibliographic control for data files
1990s • Migration from IBM mainframes (EBCDIC) to Unix (ASCII) • Demise of tapes for storage, in favour of widespread use of CD-ROM • Statistics Canada made the electronic products from the census its primary product • Gopher, developed in 1991, was replaced by the WWW and HTML, and by 1996 there were about 100,000 web servers • The DDI (Data Documentation Initiative) project began in 1995, and published its first DTD in 1996
Three major developments led up to DDI: • OSIRIS’ metadata management capability • Roistacher’s outline of machine-readable data file documentation (1980) • Dodd’s cataloguing manual (1982)
OSIRIS metadata • OSIRIS dictionary provided structural information: location, size, missing data, a variable name and a variable label (brief) • OSIRIS codebook provided a tagged format: • Introduction (unstructured) • Full question text • Variable values and value labels • Variable-level comments • North American institutions standardized on the OSIRIS type-1 and type-4 codebooks, Europe on the type-3 format codebook
Roistacher’s style manual • Provided outline of the information that should be contained in the full metadata (aka codebook), including • Bibliographic identity • Project history • File processing summary • Data dictionary contents • Recommended appendices
Sue Dodd’s cataloguing manual • Further refined the bibliographic identity component of the metadata • Provided a cross-walk to AACR cataloguing rules • Provided the foundation for the development of a MARC record • Dodd also defined the components of a bibliographic citation
Many kinds of metadata for many purposes • Data collection • Data interpretation • Data preservation • Data discovery • Coding standardization
DDI provides a format … • from which other subtypes of metadata (bibliographic records, syntax files, question banks, etc.) can be generated • that describes not just microdata, but also provides an intelligent means of describing aggregate statistics as data • that can incorporate all documentation, from original project conception to edition management and post-processing
DDI provides a format … (cont’d) • 3rd-generation data access tools (e.g. Nesstar and Dataverse (VDC)) all support DDI 2.0 at present, and provide a useful way to deliver on-line, remote, distributed access to data discovery and to the data themselves • This has led to a proliferation of new applications of metadata, and to the realization of initiatives from earlier decades
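As a rough sketch of the kind of machine-actionable variable description DDI makes possible, the snippet below builds a single variable entry with Python's standard ElementTree. The element names follow the DDI 2.x Codebook vocabulary as best recalled (var, labl, qstn, qstnLit, catgry, catValu); the variable itself is hypothetical, and the exact element spellings should be checked against the published DTD before reuse.

```python
# Minimal sketch of generating DDI-Codebook-style variable metadata.
# Element names are assumed from DDI 2.x; the variable and labels are invented.
import xml.etree.ElementTree as ET

var = ET.Element("var", name="GENDER")
ET.SubElement(var, "labl").text = "Gender of respondent"

qstn = ET.SubElement(var, "qstn")
ET.SubElement(qstn, "qstnLit").text = "Are you male or female?"

for value, label in [("1", "Male"), ("2", "Female")]:
    cat = ET.SubElement(var, "catgry")
    ET.SubElement(cat, "catValu").text = value
    ET.SubElement(cat, "labl").text = label

print(ET.tostring(var, encoding="unicode"))
```

The point of structuring the metadata this way is that the same record could, in principle, be transformed into a printed codebook page, a statistical-package syntax file, or a data-discovery index entry, rather than living only as free text.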
Lessons learned • Three killers of data: • Software dependence • Lost metadata • The physical medium on which data are stored • No solution as yet combines data, full metadata and statistical capability in a non-software-dependent format