Information Management in a Non-Bibliographic Environment: Scientific Data. Joseph A. Hourclé 2007-Nov-20 FLICC Learning@Lunch. About Me. STEREO : Solar TErrestrial RElations Observatory. The Virtual Solar Observatory. Federated Search of Solar Physics Data
Information Management in a Non-Bibliographic Environment: Scientific Data Joseph A. Hourclé 2007-Nov-20 FLICC Learning@Lunch
The Virtual Solar Observatory • Federated Search of Solar Physics Data • 14 organizations (currently) • 4 more organizations being integrated • 62 instruments • Hundreds of distinct data collections • 10s of millions of records • Terabytes of Data
The data is growing … • STEREO • Launched Oct 2006 • Over 1.5 million images @ up to 8 MB • Hinode (Sunrise aka Solar-B) • Launched Sept 2006 • Over 3 million images @ up to 8 MB • SDO • Scheduled to launch Aug 2008 • 1 image per second @ 32 MB • 1.5 TB/day dedicated connection
Other disciplines have even more data • NVO : US National Virtual Observatory • LSST (Large Synoptic Survey Telescope) • Scheduled to start observing in 2012 • 7-10 TB/night, 3.2Gpix images • ~10 PB/yr • EOS/DIS : Earth Observing System/Data Information System • About 2TB/day, per satellite (8?) • Planned to be 16 PB
… and we’re not the only ones • Heliospheric • Magnetospheric • Radiation Belt • ITM (upper atmosphere) • NVO / IVOA : nighttime astronomy • PDS : planetary • EOS : Earth
How is Scientific Data Gathered? • Scientist thinks up a problem • Scientist (and Engineers) create an instrument to conduct an investigation • The instrument collects data via sensors • Data are calibrated • Data are written into scientifically useful formats • Data are distributed to the scientists
But really, what is data? • There is no formal definition. • It’s as ambiguous as the term “book” • Data may be shorthand for: • Data Collection • Data Series • Data Set • Data Product • Data Granule
The problem with “data” • Every investigation has different data needs • Each investigation organizes and catalogs the data to answer their scientific question • What is “good” data for one group may not be useful for another • Because data is being collected continuously, there may not be a consistent boundary on one “granule” of data • Some data is tracked as individual values, and only packaged upon request • Mostly time-series data, not images
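The point about values that are tracked individually and only packaged on request can be sketched in a few lines. This is a minimal illustration, not any archive's actual interface: the function name `package_granule`, the `(timestamp, value)` sample layout, and the half-open time window are all assumptions made for the example.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

def package_granule(samples, start, end):
    """Return the samples with start <= timestamp < end.

    `samples` is a list of (timestamp, value) pairs sorted by timestamp.
    The "granule" only exists once someone asks for a time window.
    """
    times = [t for t, _ in samples]
    lo = bisect_left(times, start)  # first sample at or after `start`
    hi = bisect_left(times, end)    # first sample at or after `end`
    return samples[lo:hi]

# Hypothetical continuous time series: one value per minute.
t0 = datetime(2007, 11, 20, 12, 0, 0)
series = [(t0 + timedelta(minutes=i), float(i)) for i in range(10)]

# Two requests over the same series produce two different "granules".
granule_a = package_granule(series, t0, t0 + timedelta(minutes=3))
granule_b = package_granule(series, t0 + timedelta(minutes=2), t0 + timedelta(minutes=6))
```

Because the boundaries come from the request rather than from the archive, the two granules overlap, which is one reason a consistent definition of a "granule" is hard to pin down.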
Types of Data Archives • Instrument Archives • Maintained by the PI team • Little or no consideration towards re-use • Resident Archive • Maintained by a specific discipline • Re-use within the given discipline • Long-Term Archive • Required for federally funded studies • Focus on preservation, not use of data
Active Archives • Still changing • May be ingesting from an active mission • May still be processing their data • May serve multiple editions or processed states of the data • Final Data in “Physical Units” typically isn’t available until one or more years after the mission • Not directly comparable with data from other instruments until then
Isn’t this just Knowledge Management? • There is no knowledge in the raw data • But there is knowledge in the design of the instruments & sensors • What spectral range are the instruments sensitive to? • What are the instrument’s possible operating modes? • Knowledge of the instruments & sensors affects how the scientists interpret data • The scientists have to interpret the results to determine the knowledge • May be reluctant to have others catalog their data, as it requires understanding the science
Multiple Operating Modes: Filters on SOHO/EIT • [Figure panels: 171Å, 195Å, 284Å, 304Å]
Knowledge Mgmt, cont’d • We do have ‘Event’ and ‘Feature’ Catalogs • Scientists will record when/where they think something interesting is occurring, and share with others.
CCD Calibration • [Figure panels: 195Å, 171Å, 304Å, 284Å]
The Problems … • Cross discipline translation is difficult • Concepts of what makes data useful differ between disciplines • Different disciplines use different search parameters • VSO : time, spectral range, location on sun • Always looking at the same object • VHO : location of observer, time, spectral range • Observatories are moving, in situ measurements • EOS : location of object observed • NVO : direction of pointing (assumed from Earth)
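To make the mismatch in search parameters concrete, the lists above can be written down as sets and intersected. This is only a sketch: the field names (`time`, `spectral_range`, and so on) are illustrative labels taken from the bullets, not the actual query fields of any of these systems.

```python
# Search parameters per system, as listed on the slide.
# The key and field names are illustrative, not real API fields.
SEARCH_PARAMETERS = {
    "VSO": {"time", "spectral_range", "location_on_sun"},
    "VHO": {"observer_location", "time", "spectral_range"},
    "EOS": {"observed_object_location"},
    "NVO": {"pointing_direction"},
}

def shared_parameters(*systems):
    """Return the search parameters that every named system supports.

    At least one system name must be given.
    """
    sets = [SEARCH_PARAMETERS[name] for name in systems]
    return set.intersection(*sets)

solar_and_heliospheric = shared_parameters("VSO", "VHO")
all_four = shared_parameters("VSO", "VHO", "EOS", "NVO")
```

Under these assumed labels, VSO and VHO share time and spectral range, but no single parameter is common to all four systems, so a cross-discipline search has nothing obvious to key on.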
Problems, cont’d. • Even when there is agreement, there are still problems • Which time is important? • Start time? • Average time? • Spacecraft time? • Which coordinate system is used?
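The "which time?" question can be illustrated with a single observation that yields several defensible timestamps. This is a hedged sketch: the function `candidate_times`, the exposure length, and the `clock_offset_seconds` parameter (standing in for the difference between the spacecraft clock and a ground reference) are hypothetical and not drawn from any mission's data format.

```python
from datetime import datetime, timedelta, timezone

def candidate_times(start, exposure_seconds, clock_offset_seconds=0.0):
    """Return several timestamps that could all be called 'the' observation time.

    `start` is the exposure start in the ground reference time.
    `clock_offset_seconds` is a hypothetical spacecraft clock offset.
    """
    exposure = timedelta(seconds=exposure_seconds)
    return {
        "start": start,
        "average": start + exposure / 2,  # midpoint of the exposure
        "end": start + exposure,
        "spacecraft": start + timedelta(seconds=clock_offset_seconds),
    }

obs = candidate_times(
    datetime(2007, 11, 20, 12, 0, 0, tzinfo=timezone.utc),
    exposure_seconds=10,
    clock_offset_seconds=-2.5,
)
```

A search for "12:00:00" matches this observation only if the catalog and the user agree on which of these values was recorded, which is why two archives can both index by time and still disagree.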
Problems, still cont’d • Each discipline is working on solutions within their field • Build systems that suit the needs of their community • Each discipline has different “first class data” • Currently working on metadata standards so data can be discovered and used by other disciplines • SPASE; MMI; GEON • Some work on ontologies to help with discovery and use • VSTO; SWEET; GEON; SESDI
How does this affect libraries? • The library is a changing organism • Data is relatively unanalyzed in LIS • Data connects to bibliographic records, and vice versa • What data was used in this journal article? • Where can I get documentation on using this data? • Has anyone published anything using this data? • Data connects to other data • What other instruments observed a given event? • Is there an alternate version that better meets my needs?
There’s funding for research • NSF: • CDI : Cyber-Enabled Discovery and Innovation • INTEROP : Community-based Data Interoperability Networks • IIS : Information and Intelligent Systems • DataNet : Sustainable Digital Data Preservation and Access Network Partners • NASA: • AISR : Advanced Info. Systems Research • ACCESS : Advancing Collaborative Connections for Earth Science Access
Sunspot on 15 July 2002 from the Swedish 1-m Solar Telescope on La Palma
http://virtualsolar.org/ http://stereo.gsfc.nasa.gov joseph.a.hourcle@nasa.gov