HISTORICAL CENSUS RESCUE PROJECT
• Historical Census Rescue Project at UC DATA
• IASSIST 2003 Conference, June 28, 2003, Ottawa, Canada
• Project Management: Fredric C. Gey, Ilona Einowski, University of California, Berkeley
• Students: Natalia Perelman, Sungman Cho, Tien-hao Lan
• Work performed under a grant from the California Digital Library Counting California project (http://countingcalifornia.cdlib.org)
Fredric C. Gey
HISTORICAL CENSUS RESCUE PROJECT
• Between 1972 and 1988, the Lawrence Berkeley Laboratory of the University of California acquired most known population counts in machine-readable form from the 1970 and 1980 decennial censuses, at levels of geography down to the census enumeration district and block group. It also acquired auxiliary files from the Census Bureau and other sources, such as the 1947-1977 consolidated County and City Data Book and the NCHS mortality detail files for 1965-1985. The data include unique files that do not appear to be held at ICPSR, such as 1960 population by county (1000 items) and the 1970 census second count (single years of age down to the census tract level of geography). Also included are the 1970 Census tract boundary files used to produce the Urban Atlas Series of map portfolios.
• Before the last running computer holding this unique database failed in 2000, the Census Bureau made a complete dump of the data and sent it to UC DATA on DLT tape (34 gigabytes). The final uncompressed version of these datasets should exceed 100 gigabytes.
Lawrence Berkeley Laboratory SEEDIS System
• Lawrence Berkeley Laboratory constructed an information system to store and retrieve this data, called SEEDIS (Socio-Economic-Environmental-Demographic Information System), with the following characteristics:
  • 150 databases organized by geography (State > County > Tract > BG/ED)
  • Geographic join across databases with common geography
  • Data extraction for selected geography and data elements to SPSS, SAS, CODATA (self-documenting data files)
  • Charting
  • Mapping
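The "geographic join" and "data extraction" ideas can be illustrated with a minimal Python sketch. This is not SEEDIS code: the table layout, the function names `geographic_join` and `extract`, and every number below are hypothetical and invented for illustration. The only assumption taken from the slide is that databases sharing a geography can be matched on a common geographic key (here a state and county code pair).

```python
# Two hypothetical county-level tables keyed by (state code, county code).
# All values are invented; they are not census or NCHS figures.
population = {
    ("06", "001"): {"total_pop": 1000},
    ("06", "075"): {"total_pop": 700},
    ("32", "031"): {"total_pop": 200},
}
deaths = {
    ("06", "001"): {"deaths": 9},
    ("06", "075"): {"deaths": 7},
}


def geographic_join(left, right):
    """Inner join: keep only geographies present in both tables."""
    joined = {}
    for geo_key, left_row in left.items():
        right_row = right.get(geo_key)
        if right_row is not None:
            joined[geo_key] = {**left_row, **right_row}
    return joined


def extract(table, state, items):
    """Select rows for one state and keep only the requested data elements."""
    return {
        geo_key: {name: row[name] for name in items}
        for geo_key, row in table.items()
        if geo_key[0] == state
    }


joined = geographic_join(population, deaths)
subset = extract(joined, state="06", items=["total_pop", "deaths"])
for geo_key, row in sorted(subset.items()):
    print(geo_key, row)
```

An inner join is used here for simplicity; a real system would also have to decide what to do with geographies present in only one database, which the slide does not describe.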
Lawrence Berkeley Laboratory SEEDIS System: 1978-1997
• 3 generations of mass storage
  • Photodigital chip store (1970-1980)
  • GSS 6250 BPI tape robot (1978-1987)
  • 8mm Exabyte tape jukeboxes
• Two generations of hardware
  • Control Data supercomputer
  • Digital Equipment VAX-VMS
Databases in the SEEDIS System:
• 1960 Census Population (county)
• 1970 Census Population files (state, county, place, MCD, tract)
  • Second Count (100% population by single years of age)
  • Fourth Count (sample, 1178 items by 5 race-ethnic groups)
  • Fifth Count (sample, includes housing, BG/Enumeration District)
• 1980 Census Population files (all geographies)
• 1947-1977 City-County Data Book (state, county, place)
• NCHS Mortality Summary files (1968-1984)
• NCHS Cancer Mortality
• EPA Air Quality monitoring stations (1974-1976)
• 1970 MED-X (population centroid latitude/longitude)
• 1970 Census Tract Boundary Files (polygon format)
HISTORICAL CENSUS: Challenges to Rescue
• Archiving format
  • GSS on Control Data supercomputer
  • VAX Backup on Digital Equipment VAX machines
• Compressed data format (up to 99 percent compression)
  • Run-length encoding
  • Nibble (half byte) is the smallest unit of storage
  • Computer-architecture independent
• Metadata transformation
  • SEEDIS data definition files (DDF, EDF)
  • Qwick Qwery dictionaries
  • Eye-readable dictionaries
  • NEED TO TRANSFORM TO DDI
• Code (100,000 lines of FORTRAN)
  • Selective recoding in C
  • New Perl code for transformation of metadata to DDI
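To make "run-length encoding with the nibble as the smallest unit" concrete, here is a minimal Python sketch of one possible nibble-based scheme. The slides do not document the actual SEEDIS compressed format, so the layout below is an assumption made purely for illustration: each output byte holds a run length (1 to 15) in its high nibble and a 4-bit value in its low nibble. The function names `rle_encode` and `rle_decode` are hypothetical.

```python
def rle_encode(nibbles):
    """Encode a list of 4-bit values (0-15) as bytes.

    Assumed layout (not the SEEDIS format): high nibble = run length 1-15,
    low nibble = the repeated value.
    """
    out = bytearray()
    i = 0
    while i < len(nibbles):
        value = nibbles[i]
        if not 0 <= value <= 15:
            raise ValueError("each input value must fit in 4 bits")
        run = 1
        # Extend the run while the value repeats, capped at 15 per byte.
        while run < 15 and i + run < len(nibbles) and nibbles[i + run] == value:
            run += 1
        out.append((run << 4) | value)
        i += run
    return bytes(out)


def rle_decode(data):
    """Expand bytes produced by rle_encode back into a list of nibbles."""
    nibbles = []
    for byte in data:
        run, value = byte >> 4, byte & 0x0F
        if run == 0:
            raise ValueError("zero-length run is invalid in this scheme")
        nibbles.extend([value] * run)
    return nibbles


# A sparse row: long runs of zeros around two non-zero digits (invented data).
row = [0] * 40 + [7, 3] + [0] * 20
packed = rle_encode(row)
assert rle_decode(packed) == row
print(len(row), "nibbles ->", len(packed), "bytes")
```

Because the scheme is defined on bytes and nibbles rather than machine words, the same decoder gives the same result on any architecture, which is one plausible reading of "computer-architecture independent" on the slide. How close the real format comes to this layout is not stated in the source.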
HISTORICAL CENSUS: Current Status
• Decompression working on UNIX, Windows
• 1947-1977 County Data Book almost done
• 1960 Census almost done
• 1970 Fifth Count MCD/Tract/ED/BG available for FTP
• 1970 Second Count in GSS format
• Hung up on VAX machine assembly code
• DE, DC previously de-archived
• Many “tapes” not scanned
• Feeding to Counting California (SAS-based web interface)