Effective Data Management Strategies for Research Institutions

SCD Research Data • For UCAR Data Management Working Group • January 10, 2001 • Steven Worley • Scientific Computing Division, Data Support Section

Four Categories of Data Service • User Profile • Data Content • Data Access

Four Categories of Data Service • Archives directly from the MSS • Accessible to all with NCAR computing accounts • Web accessible online data server • Information interface for all data • Individual requests • Customized on per request basis • Data preparation for large projects • E.g. Reanalyses at ECMWF and NCEP

User Profile, MSS User Groups

User profile, online data server • Users based on network address domain, data for 1995-1998 • ~ 20K unique addresses per year

User profile, individual request • Requests excluding CD-ROMS • Based on 1998-1999 data • 28% U.S. Univ. (179 of 638) • 11% Foreign Univ. (69) • 27% Foreign Non-Univ. (171) • 34% U.S. Gov. and Commercial (219) (remarkably, some foreign and government sources find it desirable to acquire their own data from SCD/DSS)
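The percentages on this slide can be checked against the raw counts. The short sketch below uses only the numbers given above; the variable names are illustrative.

```python
# Request counts by user category, as given on the slide
# (1998-1999 data, requests excluding CD-ROMs).
requests = {
    "U.S. Univ.": 179,
    "Foreign Univ.": 69,
    "Foreign Non-Univ.": 171,
    "U.S. Gov. and Commercial": 219,
}

total = sum(requests.values())

# Share of each category, rounded to the nearest whole percent.
shares = {name: round(100 * count / total) for name, count in requests.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}% ({requests[name]} of {total})")
```

The four counts sum to the 638 requests quoted on the slide, and the rounded shares match the percentages shown.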

User profile, all users • All users by year, excluding online category

User profile, finding the data • Peer and colleague recommendations • Acknowledgements in publications • WWW searches and perusing

Quick look at DSS Information Interface • Website, dss.ucar.edu • Top level information and dataset groupings • Oceanographic datasets by Category

Important improvements for the Information Interface • More top level documents to guide users to the “best” datasets • For improved searches • Carefully worded HTML <title> ... </title> elements • Pages with introductory text that clearly defines the dataset, with keywords that promote discovery • HTML <meta> tags; note that not all search engines boost ranking based on these.
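As an illustration of the search-oriented markup listed above, here is a minimal sketch that builds a page head with a descriptive title and description/keywords meta tags. The function name `build_head` and the example dataset text are hypothetical; this is not the DSS page generator.

```python
from html import escape


def build_head(title, description, keywords):
    """Return an HTML <head> with a descriptive title and search-oriented meta tags."""
    keyword_list = ", ".join(keywords)
    return (
        "<head>\n"
        f"  <title>{escape(title)}</title>\n"
        f'  <meta name="description" content="{escape(description)}">\n'
        f'  <meta name="keywords" content="{escape(keyword_list)}">\n'
        "</head>"
    )


# Hypothetical dataset description, for illustration only.
print(build_head(
    "Example Global Ocean Surface Observations",
    "Historical surface marine observations, with documentation and read software.",
    ["ocean", "sea surface temperature", "historical observations"],
))
```

Escaping the text keeps characters such as `&` or quotes in a dataset description from breaking the markup.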

User profile, compliments • Fast service, requests receive prompt action. • Staff with scientific knowledge to offer assistance and guidance. • Flexible system – can adapt to meet users' requirements.

What makes this system work • The data records and files remain in simple structures • This way the archive should always be accessible to programs written in low level languages • The data can survive evolutions in operating systems and software; 50 years is not too much. • Programs can be written that allow fast and efficient manipulation of large collections. • Internal checksum keys can be strategically placed to ensure data integrity – at any level.
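The idea of simple record structures with embedded checksums can be sketched as follows. The record layout (station id, date, value) and the use of CRC-32 are assumptions made for illustration; the slides do not say which layout or checksum the archive actually uses.

```python
import struct
import zlib

# Hypothetical fixed-length record: station id, yyyymmdd date, observed value.
# Big-endian with no padding, so the bytes read the same on any platform.
RECORD = struct.Struct(">iif")
CHECK = struct.Struct(">I")  # CRC-32 of the record body, stored after each record


def pack_record(station, date, value):
    """Serialize one record followed by its checksum."""
    body = RECORD.pack(station, date, value)
    return body + CHECK.pack(zlib.crc32(body))


def read_records(blob):
    """Yield (station, date, value) tuples, verifying each record's checksum."""
    size = RECORD.size + CHECK.size
    if len(blob) % size:
        raise ValueError("file length is not a whole number of records")
    for offset in range(0, len(blob), size):
        body = blob[offset:offset + RECORD.size]
        (stored,) = CHECK.unpack_from(blob, offset + RECORD.size)
        if zlib.crc32(body) != stored:
            raise ValueError(f"checksum mismatch in record at byte {offset}")
        yield RECORD.unpack(body)


blob = pack_record(1, 19990101, 12.5) + pack_record(2, 19990102, -3.25)
for record in read_records(blob):
    print(record)
```

Because every record carries its own check value, corruption can be localized to a single record rather than invalidating a whole file. The same approach applies at the file or collection level.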

User profile, complaints • Not all the data is online, even though putting it all online is quite impractical (12+ TB) • Not all the data is in their favorite format: IDL, HDF, netCDF, GrIB, ASCII, GIS, Binary, .xls, Matlab, etc. • “Can I just get the piece I need?” • “Do you mean I need to know some FORTRAN or C Language?”

User Profile, skill set • Best skill set for our users includes knowing some FORTRAN and/or C. • Trend: more and more people are requesting data in formats specific to their application environment • Will the next generation scientist know a basic computing language?

Data Content, size and characteristics • Veritable smorgasbord of data. • Overall size, 12+ TB • 500+ distinct datasets • Many historical observations from the atmosphere and ocean • Many operational analyses and reanalyses • Dataset sizes, < 1 MB to several TB • Many original formats. GrIB is dominant in our analysis and reanalysis datasets

Data Content, metadata management • Primarily, metadata is managed on our online information server. Each dataset has a WWW page. • All dataset WWW pages are automatically formed. • Corrections, additions, and changes are made to text files manipulated under a Unix change and control system. • Advantage: history of all changes and data files associated with the dataset, and the WWW pages are always current.

Data Content, metadata management • Have considerable amounts of hard copy references and metadata. - We are making scanned images of these now.

Data Content, long term archive and security • Small datasets and irreplaceable observations and analyses have two copies on the MSS • Although we cannot guarantee they reside on separate cartridges • Files are write password protected – prevents accidental overwrites. • We have been fortunate to have a very reliable MSS and our success will continue to rely on it in the future.

Data Content, long term archive and security • Areas of concern • We don’t have adequate offsite backups • At least critical observations should be protected from catastrophe at the Mesa Lab • In the event of loss of single copy large datasets we rely on other centers for replacement • This needs to be discussed more nationally • Redistribution may have restrictions or be costly

Data Content, long term archive and security • Areas of concern, continued • Must always remain on guard so important data are not lost due to short-sighted policy decisions. • Must participate in national and international projects so that the archive content is continually refreshed with the most scientifically important data, at low cost.

Data Access, annual summary

Data Access, aids to access • Maintain FORTRAN code to read all data files • Sometimes for many platforms (Unix, PC) • The MSS file location is defined for all datasets, and is available online. • Staff specialists are assigned and identified for each dataset

Data Access, most frequent • NCEP/NCAR Global Atmospheric Reanalysis, 2.6 TB • How? • MSS • WWW (monthly means) • CD-ROMs • FTP • Various Tape Media (large capacity)

Data Access, largest barrier • Discovering what is available • Gaining access to the MSS collection (when they don’t have a computing account) • Not having experience with low level languages, e.g. FORTRAN and C/C++

Data Access, product development • Yes we do, and we feel it is very important! • Why? • Can QC the data and identify problems early • Can reorganize into logical collections, or create popular subsets. • Reduce the volume of large collections to manageable size for users • Saves many users extra work

Data Access, improvements for scientific advancement • Minimize the barriers that inhibit discovery – metadata problem. • Supply the data in the user's favorite format or provide tools that can convert the data where it is practical and efficient. • Place more data, and valuable higher level data products, online

END
