EaGLe: Data Archiving and Metadata The EaGLe Legacy R-8286750
Why archive the EaGLe data? • To ensure its preservation for future generations of scientists • To ensure it is broadly available for current scientists to use • To create the broadest possible public benefit from this taxpayer-funded program • To help EPA retain the data that is collected / created through its funding • Because we wish that earlier researchers had archived their data for us to use
EaGLe Data Committee Mission Statement • Develop an information management plan to archive EaGLe data with appropriate metadata so that EPA can make it readily available • Ensure that data usefulness outlives the EaGLe project (and does not require continued maintenance by EaGLe researchers)
1 EaGLe Data Types • Geospatial & Imagery • Genomic • Remote Sensing • Biological • Routine Monitoring
2 What data must be archived? All new data created or collected using EaGLe funds: • Field data • Genomics experiments • New GIS coverages • New remote sensing data • Other images, models All important summary, supplemental, and explanatory information: • Journal articles • Poster sessions • Presentations • Rules governing data QC or transforms • SOPs, protocols, experimental design documents, QA/QC documents
3 Types of Data Objects • Literature Objects • Journal articles, bibliographies, books, Adobe PDF files, etc. • Flat Files • Stand-alone tables (e.g., SAS tables), spreadsheet data • Relational Databases • Many normalized tables joined by relational rules • Data views, query objects: combined bits from separate tables • Graphical Objects • Maps, photos, digital sounds, presentations, Web sites • Material Objects • Soil samples, stained slides, microfiche, posters, video tapes, etc.
4 What is a Data Package? • Together, electronic data objects and their metadata file constitute a Data Package. • The metadata file is like the box, inventory tag and instruction manual • The data themselves are the content of the package • Data inventory requires good-quality metadata • Even material objects can have electronic metadata
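The package idea above can be sketched in a few lines of Python. This is only an illustration: the class names, field names, and sample values are hypothetical and do not come from the EaGLe system.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Hypothetical metadata record: the box, inventory tag and instruction manual."""
    title: str
    contact: str
    description: str


@dataclass
class DataPackage:
    """Electronic data objects plus the metadata file that describes them."""
    metadata: Metadata
    # File names of the data objects; left empty for a material object
    # (e.g., a soil sample) that has electronic metadata only.
    data_objects: list[str] = field(default_factory=list)

    def inventory_line(self) -> str:
        # The inventory is built from the metadata alone, which is why
        # good-quality metadata is a precondition for a data inventory.
        count = len(self.data_objects)
        return f"{self.metadata.title} ({count} file(s)), contact: {self.metadata.contact}"


# Made-up example values, for illustration only.
package = DataPackage(
    metadata=Metadata("Wetland fish survey", "librarian@example.org", "Counts by site"),
    data_objects=["fish_counts.csv"],
)
print(package.inventory_line())
```

Keeping the metadata as its own object means a package can be listed, searched, and tracked without opening the data files themselves.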
5 What’s metadata? • Metadata means “beside the data” or “data about data” • Metadata files contain summary and reference data about primary data objects: • Any information needed to identify, decode, interpret, track, store, locate, assign ownership of, or control access to a data object. • Everyday examples: • Library card catalogue; key to map symbols; checkbook register • Scientific metadata examples: • Particulate matter instruments: equipment models and settings, detection limits, replication, sample handling details • Journal article citation, methods citation • Sample indented metadata
6 Why Collect Metadata? • Long-term Storage • Keep EaGLe data safely banked for future reuse • Support long-term data tracking and retrieval • Data Broadcasting • Publish metadata via the Environmental Research and Science Library (ERSL) public interface • Foster collaborative and cross-cutting research • Meta-analyses made possible: small dataset mergers • Cross-regional data, cross-media data • Longitudinal time-series analyses: data recombining
7 The ‘cons’ of metadata • Content: What is in the data object? • Data descriptions, citation info, electronic file formats • Contacts: Who owns the data? • Authors, contact person, organization • Context: What is the provenance of the data? • Applicable knowledge areas, methods, project origins, etc.
8 The ‘locs’ of metadata • Location • Where is the electronic file located? • What is the geographic coverage of the data object? • Locks • Final version (protected against inadvertent updates) • Viewing access controls • Editing/downloading access controls • Release date, expiration date
9 Sample Indented Metadata file
10 Sample 2 indented metadata (embedded PDF)
11 Getting in Gear: • Feb. 1, 2004: Begin metadata creation. • Summer 2004: Begin EaGLe data uploading. • Jan. 2005: EaGLe metadata completed. • End of no-cost extensions (early 2006): Most EaGLe datasets archived but password-protected. • Jan. 2008: Most EaGLe data released to the public.
12 Metadata Creation / Data Uploading • Metadata Entry Form (MEF) • Generates an EML-compliant metadata file in XML format • Automatic upload to ERSL • Data packages stored in EIMS repository (ERSL backend) • EaGLe Portal: intranet interface for grantees • Review, Approval, and Release Processes • Post-Release: Search, Store and Update • Searchable metadata records in one area of EIMS/ERSL • Actual datasets stored in EIMS/ERSL repository
13 Metadata Checklist • General Information • Data Set Title • Point of Contact • Time period of the information contained in the dataset • Abstract (brief description) of the dataset • Geographic coverage of the dataset • Data format (e.g., shapefile, coverage, spreadsheet) • Dataset Creation • Formal authors • Others who contributed • Research objectives for dataset • Common misinterpretations of the data, if any
14 Metadata Checklist (continued) • Dataset Contents • Was a georeferencing system used? If so, what is it? • What does each dataset record describe? • What are the attributes that describe these features? • Define each attribute and provide measurement units. Also provide resolution and estimated accuracy, if possible. • Define or reference coded attributes (e.g., FIPS codes, error codes) • Dataset Processes • Citation of source of original data, if applicable (e.g., GIS data) • Types of major data processing steps • Detailed methodology of data collection, including study designs, protocols, equipment, analyses, etc., and any changes in data collection procedures during the study • Record any QA tests performed and their results
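A checklist like this lends itself to an automated completeness check before a record is submitted. The sketch below covers only the "General Information" items; the key names are illustrative assumptions, not field names used by the Metadata Entry Form.

```python
# Required fields paraphrased from the "General Information" checklist.
# The key names are hypothetical, chosen only for this example.
REQUIRED_FIELDS = (
    "title",
    "point_of_contact",
    "time_period",
    "abstract",
    "geographic_coverage",
    "data_format",
)


def missing_fields(record: dict) -> list[str]:
    """Return the required checklist fields that are absent or blank."""
    return [
        name
        for name in REQUIRED_FIELDS
        if not str(record.get(name) or "").strip()
    ]


# Made-up draft record: two fields filled in, one left blank.
draft = {"title": "Fish survey", "point_of_contact": "Jane Doe", "abstract": "   "}
print(missing_fields(draft))
```

A blank or whitespace-only value is treated the same as a missing one, since an empty abstract is no more useful to a future reader than no abstract at all.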
15 Data File Formats
Acceptable: • Files converted into character-delimited ASCII files (e.g., comma-delimited .csv files) • jpeg, jpg, tiff, gif, img, png, geo-tiff, ecw, ArcView, simple html or htm, xml, LaTeX, TeX, pdf (method files) • Programs in a programming language (must have text support)
Unacceptable: • Excel spreadsheets (convert to .csv) • Presentation files such as PowerPoint (convert to .pdf) • Word-processing files (convert to ASCII) • Proprietary files • RTF files • Special characters (Greek letters and other symbols not found in ASCII)
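The format rules on this slide can be turned into a simple pre-upload check. The following is a minimal sketch: the mapping from the slide's list entries to file extensions (for example ".xls" for Excel or ".doc" for word-processing files) is our own assumption, not an official EaGLe list.

```python
from pathlib import Path

# Extensions read off the "Acceptable" list; this mapping is an assumption.
ACCEPTABLE_EXTENSIONS = {
    ".csv", ".jpeg", ".jpg", ".tiff", ".gif", ".img", ".png", ".ecw",
    ".html", ".htm", ".xml", ".tex", ".pdf",
}
# Formats the slide says to convert before archiving, with the stated target.
CONVERT_FIRST = {".xls": ".csv", ".ppt": ".pdf", ".doc": "plain ASCII text"}


def check_format(filename: str) -> str:
    """Classify a file name as acceptable, convertible, or unacceptable."""
    ext = Path(filename).suffix.lower()
    if ext in ACCEPTABLE_EXTENSIONS:
        return "acceptable"
    if ext in CONVERT_FIRST:
        return f"convert to {CONVERT_FIRST[ext]}"
    return "unacceptable"


def non_ascii_characters(text: str) -> list[str]:
    """List the characters (e.g., Greek letters) that plain ASCII cannot store."""
    return sorted({ch for ch in text if ord(ch) > 127})


print(check_format("site_counts.csv"))
print(check_format("budget.xls"))
print(non_ascii_characters("chl-\u03b1 in \u00b5g/L"))
```

Running `non_ascii_characters` over column headers and unit labels before export is a cheap way to catch Greek letters and symbols that would not survive conversion to ASCII.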
A) Standards for Metadata Creation
• FGDC Content Standard for Digital Geospatial Metadata: http://www.fgdc.gov/metadata/contstan.html, http://www.fgdc.gov/metadata/metadata.html
• National Biological Information Infrastructure: http://www.nbii.gov/
• Ecological Metadata Language: http://knb.ecoinformatics.org/software/eml
• Knowledge Network for Biocomplexity (MORPHO): http://knb.ecoinformatics.org/
• Dublin Core Metadata Element Set: www.dublincore.org
• Encoded Archival Description (EAD): http://www.loc.gov/ead/
• Data Documentation Initiative: http://www.icpsr.umich.edu/DDI/
B) So, what’s EML? • Ecological Metadata Language • A metadata standard designed to handle cross-disciplinary research • A ‘wrapper’ that holds metadata for many different types of primary data (geospatial, biological, genomic, etc.) • Widely accepted standard in the ecological communities of interest. • A container that meshes with other types of metadata standards • A metadata standard based on XML vocabulary. • An information ‘tree’ that can graft on new branches of knowledge when they become necessary to the knowledge community
C) EML: Standard for Ecological Metadata • Core: Definitions and units of the columns (fields or attributes) in all data tables • Methods, procedures, and protocols • Research questions and hypotheses • Site selection • Authors, contacts, and proper citation for use • Sampling Extent: spatial, biological, & temporal • Sample Indented Metadata
D) What good is EML? • Ease of data interchange with other scientists • Enhances precision in data documentation • Forces clarity in defining measurement units • Missing-data codes, other interpretative codes • Enforces data access rules • Improves rapid search capability
E) EML Specialty Terms
F) What is XML? eXtensible Markup Language • A subset of Standard Generalized Markup Language (SGML) • A method for marking up plain text • To distinguish clearly between the: • content (text) • document structure (title, paragraph, line, etc.) • Note: Textual attributes (bold, large, italic, etc.) are NOT included. • To make electronic documents readily machine-readable • Makes document structures explicit and modular • Permits easy transformations between document formats
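To make the content/structure distinction concrete, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The element names (`document`, `title`, `paragraph`, `line`) are illustrative only and are not taken from EML or any other schema.

```python
import xml.etree.ElementTree as ET

# Structure is expressed by the elements; content is the text inside them.
# Nothing here says how the text should look (bold, large, italic).
doc = ET.Element("document")
ET.SubElement(doc, "title").text = "Why archive the EaGLe data?"
body = ET.SubElement(doc, "paragraph")
ET.SubElement(body, "line").text = "To ensure its preservation for future scientists"

xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)

# Because the structure is explicit, a program can retrieve one element
# without knowing anything about fonts or page layout.
parsed = ET.fromstring(xml_text)
print(parsed.findtext("title"))
```

The same tree could be transformed into a web page or a printed report by a separate step, which is what makes the markup reusable across formats.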
G) What good is XML? • Allows document contents to be re-used in new ways • Allows document elements to be stored just like tables of numerical data • Enforces precise translation of document “look and feel” from one presentation mode (hard-copy) to another (web) • Transparency of markup to future readers • Can accommodate new kinds of text markup at need (audio tags, motion tags, etc.) • Converts information to platform- and software-independent formats to maximize long‑term utility
H) Do I have to learn XML? NO! • The Metadata Entry Form automatically creates a valid XML document • Data entered into the form automatically follows the EML constraints on mandatory inclusion of metadata elements • Only system administrators and metadata librarians need XML expertise
IJ) EIMS overview • Information Objects: Data Sets, Databases, Documents, Meetings, Models, Multimedia, Projects, Spatial Data, Web Site • Metadata (data about data)
K) Data Flow: From You to EIMS & back • Metadata entry into existing EaGLe system • Data load into EIMS • Data update / retrieval from EaGLe intranet portal into EIMS • Future: data retrieval from EaGLe internet portal
L) EaGLe Prototype Home Page
M) EaGLe Prototype Global Search • Enter selection criteria and click Global Search
N) EaGLe Prototype Search Results • Click link to display Metadata Report
O) EaGLe Metadata Report • Links to headers in the Metadata Report • Header • Link to top of page
S) EaGLe Prototype Simple Search • Enter selection criteria and click Search
T) EaGLe Prototype Advanced Search • Enter selection criteria…
U) EaGLe Prototype Advanced Search (continued) • Enter selection criteria and click Search
Optional Data Archival • Historical data owned by EaGLe researchers • Data used strictly for QA/QC • e.g., temperature of experimental tanks • Work that produced no analyzable data • Qualitative reports • Pilot data
Do NOT Archive • Data not owned by EaGLe researchers • Data already archived elsewhere • e.g., many GIS coverages • “Dirty” data • Sans quality controls • Containing many missing values, duplicates, etc.
Non-standardized metadata • Field notes • Marginalia • Large object free text fields • Index cards • Voice recordings • Personal communications • Mental notes (non-transcribed knowledge)
Who is working on EaGLe data archiving? • EaGLe data committee (EDC): Valerie Brady (chair), Terry Brown (GLEI), Peter Noble (CEER-GOM), Lexia Valdes (ACE INC), Webb Sprague (PEEIR), Chris Pfeiffer (ASC) • Environmental Information Management System (EIMS): John Sykes (USEPA EIMS) • Computer Sciences Corporation (CSC): Derek Lane, Susan Eversole, Steve Walata III, Geoff Blair, Wally Schwab, and others
1$ Seems awfully complicated… ...but it’s easier than statistics • No need to learn the whole of EML to use the relevant bits • No more complicated than programming a VCR • Time, Date, Channel, Skip commercials • Similar to writing a journal article • Abstract, Background, Protocol, Methods, Analysis, Discussion, Results, Caveats, Secondary analysis potential • Author Names, Affiliations, Bibliography • The EaGLe MEF or Morpho user interface allows production of the most useful metadata
2$ How much does it cost to collect metadata? • Estimate the value of your research results • Total amount of research grant(s) plus 15% added value • Divide by number of years project is funded • Allocate 10% of the resulting dollars/effort to metadata collection • Distribute amounts evenly over the years; don’t stint! • Collecting metadata at the beginning of a study captures important data decisions and research design elements • Use metadata collection as an ad hoc method of data quality control during each year of the study.
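The rule of thumb above is simple arithmetic, sketched here as a small function. The function name and the example figures (a $1,000,000 grant over 4 years) are hypothetical; only the 15% and 10% factors come from the slide.

```python
from decimal import Decimal


def yearly_metadata_budget(total_grants: Decimal, years_funded: int) -> Decimal:
    """Apply the slide's rule of thumb for a per-year metadata budget."""
    if years_funded <= 0:
        raise ValueError("years_funded must be positive")
    research_value = total_grants * Decimal("1.15")  # grants plus 15% added value
    value_per_year = research_value / years_funded   # spread over the funded years
    return value_per_year * Decimal("0.10")          # allocate 10% to metadata


# Hypothetical project: $1,000,000 in grants over 4 years.
budget = yearly_metadata_budget(Decimal("1000000"), 4)
print(f"${budget:,.2f} per year")
```

For this made-up project the steps are 1,000,000 × 1.15 = 1,150,000, then 1,150,000 / 4 = 287,500, then 10% of that, or 28,750 per year. `Decimal` is used instead of floats so that the dollar amounts stay exact.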
3$ How much time is this going to take? • Between 8 and 40 hours per data group • All similar data bundled together: not a per-dataset cost! • More complex datasets take more time • Loading or linking to pre-written material can save time • Training for use of Metadata Entry Form • One-time 3-hour training session • Minimum 3 hours hands-on practice • Availability of live “help” during first solo MEF work
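As a rough planning aid, the figures on this slide can be combined into a low/high estimate. This is a sketch under the assumption that training is a one-time cost and that hours scale linearly with the number of data groups; the function name is hypothetical.

```python
def metadata_hours(data_groups: int) -> tuple[int, int]:
    """Return (low, high) hour estimates using the figures on the slide."""
    if data_groups < 0:
        raise ValueError("data_groups cannot be negative")
    training = 3 + 3  # one-time training session plus minimum hands-on practice
    low = training + 8 * data_groups    # 8 hours per data group
    high = training + 40 * data_groups  # 40 hours per data group
    return low, high


low, high = metadata_hours(5)
print(f"{low} to {high} hours for 5 data groups")
```

Because the slide gives 3 hours as a minimum for hands-on practice, the high figure is a planning estimate, not a hard ceiling.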
4$ What Good are Metadata? High-quality metadata serve 5 purposes: (1) Data integrity maintenance over the long term: 20-year rule • Across expected changes in data storage technology, compression, etc. (2) Tracking, searching for, and retrieving datasets • Like a library card catalogue: where to find data, where to shelve it. (3) Scientific collaboration • Joint analysis and secondary analysis potential (4) Cathedral effect • Pooling data across regions contributes to an environmental “big picture” • Longitudinal studies: building science efforts upon a shared data foundation. (5) Economical • Extending the shelf life of data gives taxpayers more return on investment
5$ Who needs the EaGLe metadata? • Other scientists • Today’s Colleagues & Scientific Collaborators • Tomorrow’s meta-analysts • The next generation • Archivists • Data Librarians • The Public • Data Exchange Tools (CDX) • Citizens and Citizen Groups • Legislators and other decision-makers
1) Data Access and Security • Only registered users may enter or edit a metadata record • Record-level edit permissions required for input and update • Only registered Data Librarians can release records to a designated user base (Public, EPA Only, Group, Owner) • Confidential records can be restricted to a subset of users • EPA Only – accessible only to EPA registered users • Group – accessible only to members of a specified group of users (including system users outside the EPA firewall, if necessary) • Owner – accessible only by the designated owner of the EIMS record • Post-release: any internet user may view metadata records. • Separate access controls for actual datasets
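The four release levels can be read as a small viewing rule. The sketch below is not the EIMS implementation: the class names, the fields, and the assumption that every non-public level requires a registered user are ours, added for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

RELEASE_LEVELS = ("Public", "EPA Only", "Group", "Owner")


@dataclass
class User:
    name: str
    registered: bool = False
    is_epa: bool = False
    groups: set = field(default_factory=set)


@dataclass
class Record:
    owner: str
    release_level: str = "Owner"
    group: Optional[str] = None  # only meaningful for the "Group" level


def can_view(user: User, record: Record) -> bool:
    """Decide whether a user may view a metadata record at its release level."""
    if record.release_level not in RELEASE_LEVELS:
        raise ValueError(f"unknown release level: {record.release_level}")
    if record.release_level == "Public":
        return True  # any internet user may view released metadata
    if not user.registered:
        return False  # assumption: restricted levels need a registered user
    if record.release_level == "EPA Only":
        return user.is_epa
    if record.release_level == "Group":
        return record.group in user.groups
    return user.name == record.owner  # "Owner" level


alice = User("alice", registered=True, groups={"GLEI"})
shared = Record(owner="bob", release_level="Group", group="GLEI")
print(can_view(alice, shared))
```

Access to the actual datasets is controlled separately, so a real system would need a second, independent check for downloads.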
Generations of Research For a true confluence of research efforts, clarity in metadata is the key