Regional Databases and Archives: the Effects of Scale…. A Presentation for “Scalable Information Networks for the Environment Workshop” October 31, 2001 San Diego, California Raymond McCord Oak Ridge National Laboratory*
*Oak Ridge National Laboratory is operated by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725
Credits • Concepts are derived from managing data for environmental projects over the past 25 years. • Variations of the concepts have been observed in these disciplines: • Plant community research • Impact assessment in marine systems • National acid rain surveys • Environmental monitoring and cleanup projects at DOE facilities • Military land use assessment • Climate change research (atmospheric research) • Ideas are freely traded with Dick Olson (ORNL)
Presentation Strategy • Motivation and concerns • Archive overview • Definition, components, functions, why & why not, examples • Archives and scale • Effects of scale • Mitigate scale effects • Generate and manage metadata • Future: Archive issues to resolve
My Motivation & Concerns The enemy is our behavior. Will we change or whine??? • Motivation • Describe observations about the effects of scale on Archives • Describe remedies to minimize scale effects • Minimize remedy pain • Concerns • Preaching to the choir!! • Nothing new will happen!! • Continuing unnecessary limits to future science!!
“You can’t keep running in here and demanding data every two years.” Challenge: engage scientists in the process of archiving their data and provide the mechanism for archiving. Source: American Scientist, Vol 886 p 525.
Archives and Scale: Presumptions • Regional data live in Archives • Information sharing is important • The archiving can be improved • Archive “neurons” are metadata • Multidisciplinary data will foster broader ecological discoveries • The limited number of permanent data archives for ecological data will increase
What Is a Data Archive? • A data archive is a permanent, electronic collection of datasets with accompanying metadata such that users of the data can acquire, understand, and use the data. • More than a long-term backup • More than an index or catalog with pointers to datasets stored elsewhere • For more details, see Michener, W. K. and J. W. Brunt. 2000. Ecological Data: Design, Management and Processing. Blackwell Science. 180 pp.
Components of an Archive • Data and metadata • Storage devices • Information system • Network connections • Staff • Data/metadata preparation and review • Systems development and maintenance • User support
Archive Functions • Store data • Submitted by others • Build catalog and structure • Maintain storage across technology generations • Review new data (QA, metadata) • “Advertise” contents • Find data for users • Query and browse logic • Distribute data • Provide access to data • References to documentation
Data Centers at ORNL • CDIAC - Carbon Dioxide Information Analysis Center • ARM Archive - Atmospheric Radiation Measurement Program • ORNL DAAC - Distributed Active Archive Center for Biogeochemical Dynamics • NARSTO - tropospheric air pollution information for North America • OREIS - Oak Ridge Environmental Information System
Atmospheric Radiation Measurement (ARM) Program • ARM research questions: • What happens to all of the sunlight energy? • How is light absorbed by clouds? • What does partly cloudy mean? Statistically? Spatially? • What types of clouds form? When and How? • ARM is a ‘once in a lifetime’ research adventure for atmospheric scientists • ARM research includes instrumentation, system development, data analysis, and modeling (climate and process)
ARM Measurements Scope All data collection is highly automated -- a REAL BLAST!! Data collection is now a peer outcome with scientific discovery
ARM Archive • ARM Archive stores and provides access to the entire accumulation of data • Currently 5 million files and 14,000 GB and growing • The ARM data in the Archive will be accessed for research for many years (decades) • Currently distributes 50,000-100,000 files per month (100-200 GB) • More information: • ARM Program www.arm.gov • ARM Archive www.archive.arm.gov
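As a rough check on the figures above, the average file size can be worked out directly. This is only a sketch: the slide does not state a unit convention, so 1 GB = 1000 MB is an assumption, as is pairing the low and high ends of the monthly ranges with each other.

```python
# Figures quoted on the slide; the unit convention (1 GB = 1000 MB) is an assumption.
MB_PER_GB = 1000

def mean_file_size_mb(total_gb: float, n_files: int) -> float:
    """Average file size in MB for n_files that total total_gb gigabytes."""
    return total_gb * MB_PER_GB / n_files

# Holdings: 5 million files, 14,000 GB.
holdings_mb = mean_file_size_mb(14_000, 5_000_000)

# Monthly distribution: 50,000-100,000 files, 100-200 GB (ends paired by assumption).
monthly_low_mb = mean_file_size_mb(100, 50_000)
monthly_high_mb = mean_file_size_mb(200, 100_000)

print(f"Average stored file: {holdings_mb:.1f} MB")
print(f"Average distributed file: {monthly_low_mb:.1f} to {monthly_high_mb:.1f} MB")
```

Under these assumptions the typical file is a few megabytes, so the scale challenge here comes from the number of files as much as from their total volume.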
ARM Archive Schematic: “Archive Input & Output” [Diagram; labels: Archive web user interface; user query specifications (measurement, location, date); catalog metadata; file list; Data Retrieval; requested files; user copy; Incoming Data Files; Data Reception; Other ARM Systems; Mass Storage System; backup data files; operations metadata]
Core archive functions [Diagram; labels: Data and Metadata Submission; Data/Metadata Ingest; Backup, Security, Migration; Archive Development and Maintenance; User Support; User Interface; Network; User Request. Legend: Request pathways; Archive support; User interactions; Data Flow; Data; Metadata]
Why Archive?? “I am doing Science. Trust me.”
Cycles of Research: “An Information View” [Diagram; labels: Problem Definition (Research Objectives); Planning; Measurement; Collection; Original Observations; Secondary Observations; Automation and review; Selection and extraction; Analysis and modeling; Information review; Publications; Archive of Data; time scales of 20 yrs and 200 yrs]
Why Don’t I Archive My Data? • No incentives - what’s in it for me? • No acknowledgment - does a dataset = paper? • Give up publication rights - will somebody scoop me? • Poor planning - it was not in “the Plan” • No resources - who’s going to pay for it? • Lack of training - what do I do first? • Unsure about metadata content - how much is enough?
Why Should I Archive My Data? (management hints!!) • Career advancement (give them credit) • you will get some recognition • you can publish a data paper in ESA Ecological Archives • it may help you do science with broader scope • Professional incentives (give them training) • good scientific practice (create peer pressure) • Institutional incentives (have expectations) • required by the sponsor • Technological advances (give them systems) • it's easier and there are more options
Archiving Supports Science • Metadata required for archiving will improve data quality • Extends data usefulness • Increases your information base for doing research: • data volume and diversity • Permits replication of results A KEY concept of Science
The Effects of Project Scale on Archives “Metadata are archive neurons??”
Metadata Depends on Your “World View” • Investigator • Doesn’t need extensive formal metadata • Project • Metadata needed for project integration and modeling activities • Project data manager may help write metadata • Data archive • More detailed metadata (e.g., spatial coordinates) • More standardization (e.g., keywords) to communicate clearly with future users • Who writes the metadata?
(In the beginning, was the measurement. It was formless and desolate. Without context…) Measurement
Single Experiment View [Diagram: Measurement, with parameter name, sample ID, location, date]
Research Project View [Diagram: Measurement, with parameter name, media, QA flag, sample ID, location, date]
Long-term or Multidisciplinary View [Diagram: Measurement, with method, parameter name, units, media, QA flag, records, generator, sample ID, location, date]
Integrated System & Archive View [Diagram; labels: Measurement; parameter name; units; method; media; QA flag; sample ID; location; date; generator; records; Parameter def.; Method def. (lab, field); Units def.; QA def.; Sample def.; Record system; GIS (coord., elev., depth); organization (type, name, custodian, address, etc.)]
Project Scale and Recorded Metadata [Chart: recorded metadata plotted against increasing user scope (PI, Group, Program, Archive). Metadata items shown: • Units • Method • QA flag • Media • Parameter name • Measurement • Date • Sample ID • Location • Generator • Records]
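The growth of recorded metadata with scope can be sketched as a cumulative checklist. The field lists below are read from the labels of the three view diagrams (single experiment, research project, long-term or multidisciplinary). The tier keys, the cumulative rule, and the sample record are assumptions made for illustration.

```python
# Fields added at each successive view, taken from the diagram labels.
# Tier names and the "each tier includes the previous ones" rule are assumptions.
FIELDS_ADDED = {
    "single_experiment": ["measurement", "parameter_name", "sample_id", "location", "date"],
    "research_project": ["media", "qa_flag"],
    "long_term": ["method", "units", "records", "generator"],
}
TIER_ORDER = list(FIELDS_ADDED)  # dicts preserve insertion order

def required_fields(tier: str) -> list[str]:
    """All fields expected at `tier`, including those from narrower views."""
    if tier not in FIELDS_ADDED:
        raise KeyError(f"unknown tier: {tier}")
    fields = []
    for name in TIER_ORDER:
        fields.extend(FIELDS_ADDED[name])
        if name == tier:
            break
    return fields

def missing_fields(record: dict, tier: str) -> list[str]:
    """Fields a record still lacks before it is usable at `tier`."""
    return [f for f in required_fields(tier) if record.get(f) in (None, "")]

# Hypothetical record documented only as far as a single investigator needs.
record = {"measurement": 7.2, "parameter_name": "pH", "sample_id": "S-001",
          "location": "Plot 4", "date": "2001-10-31"}

print(missing_fields(record, "single_experiment"))
print(missing_fields(record, "long_term"))
```

A record that is complete for its investigator can still lack most of what a wider audience needs, which is the gap the archive has to close.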
Data Maturation and Scale • Individual Investigators • collect data, quality assure, document, analyze, publish • Groups or Science Teams • collate data, enhance, synthesize, model, publish • Project Information System • collate data, review completeness, maintain data for project • Data Distribution and Archive Center • long-term archive, distribute freely to users • Master Data Directory • searchable index with pointers to data
I will not wait. I will not wait. I will not wait. I will not … Preparing for Archiving
Generic Environmental Data Model (Which Piece Is First…?) [Diagram; labels: Measurement; parameter name; units; method; media; QA flag; sample ID; location; date; generator; records; Parameter def.; Method def. (lab, field); Units def.; QA def.; Sample def.; Record system; GIS (coord., elev., depth); organization (type, name, custodian, address, etc.)]
Sequence of Information Birth [Diagram: the same generic environmental data model, annotated with the order in which its pieces are created; labels as in the Generic Environmental Data Model]
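One way to make the generic data model concrete is a small relational schema. The sketch below uses Python's built-in sqlite3 module. All table names, column names, and sample values are hypothetical, chosen to follow the diagram's labels. The ordering the slide shows is not recoverable from the text, so the example only shows one consequence of enforcing references: a measurement cannot be stored before its definitions exist.

```python
import sqlite3

# Table and column names are hypothetical; they follow the diagram's labels.
SCHEMA = """
CREATE TABLE parameter_def (parameter_name TEXT PRIMARY KEY, description TEXT);
CREATE TABLE units_def     (units TEXT PRIMARY KEY, description TEXT);
CREATE TABLE method_def    (method TEXT PRIMARY KEY, kind TEXT, description TEXT);
CREATE TABLE qa_def        (qa_flag TEXT PRIMARY KEY, description TEXT);
CREATE TABLE location      (location TEXT PRIMARY KEY, x REAL, y REAL, elevation REAL);
CREATE TABLE sample (
    sample_id TEXT PRIMARY KEY,
    location  TEXT NOT NULL REFERENCES location(location),
    date      TEXT NOT NULL,
    media     TEXT,
    generator TEXT
);
CREATE TABLE measurement (
    sample_id      TEXT NOT NULL REFERENCES sample(sample_id),
    parameter_name TEXT NOT NULL REFERENCES parameter_def(parameter_name),
    value          REAL,
    units          TEXT NOT NULL REFERENCES units_def(units),
    method         TEXT REFERENCES method_def(method),
    qa_flag        TEXT REFERENCES qa_def(qa_flag),
    PRIMARY KEY (sample_id, parameter_name)
);
"""

def build_archive() -> sqlite3.Connection:
    """Create an in-memory database with foreign-key checks switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    return conn

conn = build_archive()
# Definitions and context first (placeholder values throughout).
conn.execute("INSERT INTO parameter_def VALUES ('pH', 'Acidity of a water sample')")
conn.execute("INSERT INTO units_def VALUES ('pH units', 'Dimensionless pH scale')")
conn.execute("INSERT INTO method_def VALUES ('probe', 'field', 'Handheld probe')")
conn.execute("INSERT INTO qa_def VALUES ('OK', 'Passed review')")
conn.execute("INSERT INTO location VALUES ('Site A', 0.0, 0.0, 250.0)")
conn.execute("INSERT INTO sample VALUES ('S-001', 'Site A', '2001-10-31', 'water', 'Field crew 1')")
conn.execute("INSERT INTO measurement VALUES ('S-001', 'pH', 7.2, 'pH units', 'probe', 'OK')")

# A measurement whose parameter and units were never defined is rejected.
try:
    conn.execute("INSERT INTO measurement VALUES ('S-001', 'nitrate', 1.5, 'mg/L', 'probe', 'OK')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row = conn.execute(
    "SELECT m.value, m.units, s.location, s.date FROM measurement m "
    "JOIN sample s ON s.sample_id = m.sample_id WHERE m.parameter_name = 'pH'"
).fetchone()
print(row, rejected)
```

Keeping definitions in their own tables means a new parameter or method is added as a row, not as a change to the measurement table.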
Research ~ Publishing ~ Metadata • Metadata design can be a “checklist” for research planning • Metadata preparation can be integrated with publication process • Metadata are an investment in current and future science
Archive Choices • What determines your options? • Sponsor requirements • Repository access • Metadata requirements • Scalable storage • Personal web pages and files • Project or network data centers • Federal data centers • Links “transcend” storage structures • Master directory • Mercury
Personal Web Page • It's fun, rewarding, relatively easy, can share data quickly, can control access to data • Data issues?? • complete metadata • QA checks • Connected to basic archival center functions?? • ready access to data (24 h/d, 7 d/wk) • user support • data available on multiple media • secure, backed-up, long-term storage
ESA Ecological Archives • Publishing datasets as peer reviewed, citable papers (with volume and page numbers) • Data papers are announced in abstract form in a print journal with data available electronically • Citation example • Esser, G., H.F.H. Lieth, J.M.O. Scurlock and R.J. Olson. 2000. Osnabrück net primary productivity data set. (Ecological Archives data paper E081-011). Ecology 81, 1177-1177. • Bill Michener, Editor • http://esa.sdsc.edu/esapubs/Journals_main.htm
Master Data Directory • Provides search capability and pointers to a source of the data (Center does not archive data) • Maintains standard keywords/indices • Collects metadata from many sources • Examples • Global Change Master Directory (GCMD) http://gcmd.gsfc.nasa.gov • ORNL DAAC Mercury System http://mercury.ornl.gov
What is Mercury? Mercury is used to assist an investigator with documenting data and making these data available to others.
1. The data provider uses the Metadata Editor to create a metadata file containing links to the data and documentation
2. Mercury harvests the metadata and builds an index
3. Users query the index
4. Full metadata are returned to the user, including links back to the data provider
5. User links to data provider’s server
6. Data and documentation are downloaded directly from the data provider
[Diagram; labels: Data and documentation; User; NASA / ORNL Metadata Index]
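The harvest, index, and query steps can be sketched in a few lines. This is not Mercury's implementation. The record layout, the example.com-style URLs, and the function names are all hypothetical. The sketch only illustrates the pattern of indexing metadata centrally while the data stay with the provider.

```python
from collections import defaultdict

# Hypothetical metadata records, as a provider might publish them (step 1).
PROVIDER_RECORDS = [
    {"id": "npp-001", "title": "Net primary productivity data set",
     "keywords": ["productivity", "vegetation"],
     "data_url": "http://provider-a.example/npp/data.csv",
     "doc_url": "http://provider-a.example/npp/readme.txt"},
    {"id": "soil-002", "title": "Regional soil carbon survey",
     "keywords": ["soil", "carbon"],
     "data_url": "http://provider-b.example/soil/data.csv",
     "doc_url": "http://provider-b.example/soil/readme.txt"},
]

def harvest(records):
    """Step 2: copy metadata (not data) and build a keyword index over it."""
    catalog = {rec["id"]: rec for rec in records}
    index = defaultdict(set)
    for rec in records:
        terms = rec["title"].lower().split() + [k.lower() for k in rec["keywords"]]
        for term in terms:
            index[term].add(rec["id"])
    return catalog, index

def query(catalog, index, term):
    """Steps 3-4: look up a term and return full metadata, links included."""
    return [catalog[i] for i in sorted(index.get(term.lower(), set()))]

catalog, index = harvest(PROVIDER_RECORDS)
hits = query(catalog, index, "carbon")
for hit in hits:
    # Steps 5-6 happen outside the index: the user follows data_url to the provider.
    print(hit["title"], "->", hit["data_url"])
```

Because only metadata are copied, the index stays small and the provider keeps control of the data, at the cost of depending on the provider's server staying available.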
Sources of Regional Data • Carbon Dioxide Information Analysis Center • National Geophysical Data Center • National Environmental Satellite, Data, and Information Service • National Soils Data Access Facility • National Water Information System • Forest Inventory and Analysis • Breeding Bird Survey • Threatened and Endangered Species • Global Change Master Directory
NASA EOSDIS Distributed Active Archive Centers
• GSFC: Upper Atmosphere, Global Biosphere, and Geophysics
• EDC: Land Processes
• SEDAC: Socio-economic
• U. Colorado: Cryosphere
• JPL: Ocean Circulation and Air-sea Interaction
• U. Alaska: Sea Ice and Polar Processes
• LaRC: Atmospheric Processes
• ORNL: Biogeochemical Dynamics
[Figure: example data layers. Panels: Precipitation; Topography; Soil Carbon; Cloud Amount II; Clear-Sky Albedo; LW Radiation; Fossil Fuel Emissions; Vegetation Biophysics (fPAR)] Global scale, 280 parameters: surface, atmospheric, fluxes
Future: Issues to Resolve • Size, diversity, and longevity • Accommodating change • Teaching good practices
Issues: Size, Diversity, Longevity • Size • Online vs. Offline • Database vs. File structure • Multiple institutions • Too big for technology migration?? • Diversity • Increased logic and documentation for “finding data” • Spatial distribution • Increased potential for uniqueness conflicts • Longevity • Too old to explain or decode • Too much evolution of methods and practices • Asynchronous change in data and metadata
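The “uniqueness conflicts” issue can be illustrated with sample IDs that were unique within each project but collide once datasets are merged. The sketch below is one possible check and one possible remedy (prefixing IDs with their source). The project names and IDs are hypothetical.

```python
from collections import defaultdict

def find_id_conflicts(datasets: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each sample ID used by more than one source to the sources using it."""
    sources_by_id = defaultdict(list)
    for source, sample_ids in datasets.items():
        for sample_id in set(sample_ids):  # ignore repeats within one source
            sources_by_id[sample_id].append(source)
    return {sid: sorted(srcs) for sid, srcs in sources_by_id.items() if len(srcs) > 1}

def qualify_ids(datasets: dict[str, list[str]]) -> list[str]:
    """Prefix every sample ID with its source so merged IDs stay unique."""
    return [f"{source}:{sid}" for source, ids in datasets.items() for sid in ids]

# Hypothetical projects that each numbered their samples independently.
datasets = {"project_a": ["S-001", "S-002"], "project_b": ["S-001", "S-103"]}

conflicts = find_id_conflicts(datasets)
merged = qualify_ids(datasets)
print(conflicts)
print(merged)
```

Source prefixes are only one convention. The broader point is that identifiers need a scope at least as wide as the archive that holds them.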
Issues: Planning and Requirements • Plan for archiving early and ongoing • Avoids missing metadata • Avoids panic • Improves overall data quality and consistency • Consider the timing of requirements • Requirements • Standards: “to be or not to be?” • Documentation expectations • Accessibility “It's mine!! It's my data!! You CAN’T have it!!”
Research Implies Change … [Cycle diagram; labels: Research; Discovery; New questions; New information requirements; repeat…] Not always true for other information systems
Issues: Accommodating Change • Change must be considered in the design • Things that will change • Access expectations • Logical hierarchy of information scope • New parameters • New disciplines • New study sites • New data sources or methods