Science, Data, You, and the Future: A variation on “The 3 Little Pigs.” Which “Little Pig” will You be?! A Presentation for “NSF Facilities Users’ Workshop: Working Together to Meet New Observational Challenges.” September 24, 2007 Boulder, CO Raymond McCord
Science, Data, You, and the Future: A variation on “The 3 Little Pigs.” Which “Little Pig” will You be?! A Presentation for “NSF Facilities Users’ Workshop: Working Together to Meet New Observational Challenges.” September 24, 2007 Boulder, CO Raymond McCord Oak Ridge National Laboratory* Oak Ridge, Tennessee *Oak Ridge National Laboratory is operated by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725
Outline • Objectives • Conclusions (already??) • Storytelling • Introduction to “the story” • Evaluation of current and future issues • Pathways to the future?? • An ending to “the story” • Conclusions (again!!)
Objectives • To present my assessment of current scientific data management practices and issues that need to be addressed in the future. • To be informative, provocative, and entertaining. • To STOP YOU from thinking about supper!!??
Conclusions (already??) • We are swamped with more information than we can access* • *Access is a broad topic (EPDUS = ????) • Our current practices may not be sustainable and reliable. • Exponential vs linear capacity increases • Optimization is unbalanced • Scientific expertise within data centers will improve future data access. • Science and data management must be integrated. • Many solutions are NOT technological, but behavioral. • Think - Training • “Data Science” training must be developed and implemented. • The needed changes will not happen by accident. • “My ~30 years of experience and systems observation suggests otherwise!!” • Sooner is better than later.
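The “exponential vs linear” point can be made concrete with a small calculation. The sketch below is illustrative only: the function name and all numbers are hypothetical and are not taken from the presentation. It finds the first year in which an exponentially growing data volume exceeds a capacity that grows by a fixed step each year.

```python
from typing import Optional


def first_year_volume_exceeds_capacity(
    initial_volume: float,
    growth_factor: float,
    initial_capacity: float,
    capacity_step: float,
    max_years: int = 100,
) -> Optional[int]:
    """Return the first year where exponential volume exceeds linear capacity.

    All parameters are hypothetical illustration values, not measurements.
    Returns None if the volume never exceeds capacity within max_years.
    """
    for year in range(max_years + 1):
        volume = initial_volume * growth_factor ** year      # exponential growth
        capacity = initial_capacity + capacity_step * year   # linear growth
        if volume > capacity:
            return year
    return None


if __name__ == "__main__":
    # Hypothetical case: volume doubles yearly from 1 unit,
    # capacity starts at 10 units and gains 10 units per year.
    print(first_year_volume_exceeds_capacity(1.0, 2.0, 10.0, 10.0))
```

Even with a tenfold head start for capacity in this made-up case, the exponential curve overtakes the linear one within a few years, which is the sustainability concern the bullet raises.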
Storytelling • Storytelling is a VERY OLD form of “information technology” (IT) • Preservation and access • When old IT meets new IT • Supercomputer implementation • Just go ask! • Excuse for Analogies • Engages the listener • “The 3 Little Pigs” • “Once upon a time…”
About Raymond • Trained as a Theoretical Ecologist (landscape ecology) • Conducted extensive statistical analysis • Scientific data analyst (to pay the bills) • Tired of rerunning analyses at last minute to correct data management problems • Data manager / System “whacker” • GIS implementation in “early PC days” • Implementer and manager of progressively larger environmental information systems!! • Requires “research” outside of “science” • “Smell the fumes” of many scientific disciplines • Very few publications!!?? • Acquired respect???
Credits • The concepts presented are derived from managing environmental data and information systems over the past 30 years. • Variations of these concepts were observed from many disciplines: • plant community research • impact assessment in marine systems • acid rain surveys • environmental monitoring and cleanup projects at DOE facilities • land use assessment • climate change research (atmospheric research) • These concepts extend to other scientific disciplines.
Quotes from Raymond • “Storing data is easy. Finding and using data later is NOT…” • “Systematically and consistently organized data does not occur without cost.” • “The existence of “no cost”, well-organized data is not supported by the current situation” • “Consider the results from previous science projects with “no cost” for data archiving.” • “The natural tendency over time for data and information is chaos. Effort must be exerted to overcome this.” • “Successfully managed data by projects may not be ready to be archived. (for permanent access)”
Pop Quiz (Wake UP!!) • What is “access combination” to my lock? • Hints: • “I love it” • X=(Yz)/12) - z+1 • How is my necktie related to: • Data? • Metadata? • Scientists? • 2 year old children? • “Why do I care?” (Answers near the end of the presentation.)
“The 3 Little Pigs” • Characters • The Wolf • First Pig builds a house of Straw • Second Pig builds a house of Sticks • Third Pig builds a house of Bricks • What does this have to do with “Data, Science, You, and The Future…?”
The Wolf • Unending appetite • Out of control • Bad mannered • Too clever? • Exponential growth in: • Data retention capacity and habits • Data re-use demands • Significant chaos in: • Data automation styles • Data documentation • Lack of training in {ditto above!!} • Who will eat whom? Scientists or Data managers? • Data in, Data out
The Straw Building Pig • Gathered more and more of “what was at hand.” • Wanted to go back to “being a pig”. • Metadata catalogs • Metadata harvesting • Layers of ontologies • Automated “data mining”?? • Can we sort through all of the details? • What about recommendations and priorities for use? • When will the “straw quality” improve? • Changing the “masses” • Stay out of the way of the Scientists
The Sticks Building Pig • Did a bit more work to gather materials • Used a bit more structure • Did not have good specifications • Only acted after First Pig failed • Wanted to go back to “being a pig” (after some effort). • A mixture of data structures, metadata, and a few standards • XML • Links • Automated data access • Data warehouse • Business information concept • How do we know the balance of structure, metadata, and standards? • What is the evolutionary pathway? • Many “Sticks” to choose from • Can we show the improvement over “Straw”? • Work with the Scientists
The Bricks Building Pig • Did a lot more work to gather (AND PREPARE) materials • Used significantly more structure • Required working with “a plan” • Was a braggart over First and Second Pig • Wanted to go back to “being a pig” (after “winning”). • Metadata standards and more standards • Internet does not decide (distributed vs central) • Removes ambiguity of definitions, but contents get “boxed”. • What about Type I errors vs. Type II errors? • An “odds box or junk bin” will always remain. • “Bricks” are: • Hard to change • Slow and costly to make • CHANGE is fundamental to SCIENCE (more later!!) • Defeat or stymie the Scientists?
Elements of Data Preservation for Future Access • A “framework” for assessing improvements for the future • Restricts flow like irregular plumbing • “We want {more} Cake!!??”
Elements of “Permanent Access”… • “Permanent access” to scientific information requires ALL of the following: • Existence • Permission • Discovery • Understanding • Support
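The word “ALL” in this slide is a logical AND across the five elements: a dataset that fails any one of them is not permanently accessible. A minimal sketch of that checklist follows; the class and function names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class AccessElements:
    """The five elements named on the slide (EPDUS), as yes/no assessments."""
    existence: bool
    permission: bool
    discovery: bool
    understanding: bool
    support: bool


def has_permanent_access(elements: AccessElements) -> bool:
    # Permanent access requires every element to hold, not just most of them.
    return all(getattr(elements, f.name) for f in fields(elements))


def missing_elements(elements: AccessElements) -> List[str]:
    # List the elements that still block permanent access.
    return [f.name for f in fields(elements) if not getattr(elements, f.name)]


if __name__ == "__main__":
    # Hypothetical dataset: stored and shareable, but hard to find.
    dataset = AccessElements(True, True, False, True, True)
    print(has_permanent_access(dataset), missing_elements(dataset))
```

Listing the missing elements, rather than returning only a yes/no answer, reflects how the following slides treat each element as its own set of requirements and issues.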
Existence • Definition • Information is recorded and retained. • Information can be found and used by “experts”. • Requirements • Information technology is used for recording and retention. • Scientists are trained and required to record and retain information. • Issues • The availability of information technology will far surpass the “ability” to use it effectively. • Training will be needed to extend “ability” beyond the immediate need. • Training must include both fact and philosophy. • Plans to use information technology must be “pushed” beyond the immediate objectives. • Need to establish reasonable and more “global” plans and objectives.
“Why Don’t I Archive My Data?” • No incentives - What’s in it for me? • No acknowledgment - Does a dataset = a publication? • Give up publication rights - Will somebody scoop me? • Poor planning - It was not in “the Plan”. • No resources - Who’s going to pay for it? • No future – Who will support this later? • Lack of training - What do I need to do first? • Unsure about metadata content - How much is enough?
Permission • Definition • Someone “beyond” the originator is allowed to acquire and use the data. • Requirements • Scientists relinquish control of the data. • Sponsors and agencies relinquish control of the data. • “They” not only allow future use, but encourage it. • Issues • Encourage data re-use. • Explain larger research objectives. • Reward data citation. • Balance openness and protection. • Allow early discovery. • Prevent resource abuse. • Protect individual privacy.
Discovery • Definition • Starts with the inspiration to look “here” for “what you want”. • Includes knowing how to find “what you want”. • Ends with recognizing it when “what you want” is found. • Requirements • Logical organization. • Good and meaningful metadata (categories and keywords). • Multiple pathways for discovery. • Issues • Documentation must be significantly extended beyond the “local view” of the data. • Documentation development is “not career building” for scientists. • Interactions between “developers and users” must be sustained.
An Initial View of Data… Measurement
Single Experiment View • Measurement, described by: parameter name, sample ID, location, date
Integrated System & Archive View • [Diagram: Measurement records linked to supporting definition records] • Definition records shown: Parameter def., Method def., Units def., QA def., Sample def., Record system, GIS, org. • Field labels shown in the diagram: parameter name, units, method, media, date, QA flag, generator, sample ID, location, lab, field, coord., elev., depth, type, name, custodian, address, etc., “words, words”
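The contrast between the single experiment view and the integrated archive view can be sketched as data structures. This is a hedged illustration, not the ARM Archive schema: the field names come from the diagram labels, while the class names, the `value` and `description` fields, and the way records reference each other are assumptions.

```python
import datetime
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SingleExperimentMeasurement:
    """Single experiment view: context is carried as bare labels."""
    parameter_name: str
    value: float
    sample_id: str
    location: str
    date: datetime.date


@dataclass(frozen=True)
class ParameterDef:
    """Assumed shape of a parameter definition record."""
    name: str
    description: str  # the "words, words" in the diagram
    units: str
    method: str


@dataclass(frozen=True)
class SampleDef:
    """Assumed shape of a sample definition record."""
    sample_id: str
    sample_type: str
    location: str
    date: datetime.date
    generator: str


@dataclass(frozen=True)
class ArchivedMeasurement:
    """Integrated view: the record points at shared definitions."""
    parameter_name: str  # key into the parameter definitions
    value: float
    sample_id: str       # key into the sample definitions
    qa_flag: str


def describe(m: ArchivedMeasurement,
             parameters: Dict[str, ParameterDef],
             samples: Dict[str, SampleDef]) -> str:
    # Resolve the linked definitions so a later user can interpret the value.
    p = parameters[m.parameter_name]
    s = samples[m.sample_id]
    return (f"{p.name} = {m.value} {p.units} ({p.method}); "
            f"sample {s.sample_id} at {s.location} on {s.date.isoformat()}; "
            f"QA flag: {m.qa_flag}")


if __name__ == "__main__":
    # All values below are hypothetical.
    params = {"air_temperature": ParameterDef(
        "air_temperature", "Air temperature near the surface", "degC", "field thermometer")}
    samples = {"S-001": SampleDef(
        "S-001", "air", "site-A", datetime.date(2007, 9, 24), "field crew")}
    print(describe(ArchivedMeasurement("air_temperature", 21.5, "S-001", "good"),
                   params, samples))
```

The design point is that units, methods, and sample context live in shared definition records instead of in one scientist's memory or notebook, which is what lets someone beyond the originator interpret the number.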
Comparison of User Interface Options (armarchive@ornl.gov, 1-888-ARM-DATA)
Each entry lists: Interface name; Accessible data; “Shopping” approach
• ARM Data Browser; Routine ARM data; “I know what I want. Do you have it?” Searching with predefined selection criteria.
• Catalog Interface; Routine ARM data; “I am not sure what I want. I need to see what you have available.” Browsing a hierarchy of availability summaries.
• Thumbnail Browser; Most routine ARM data; “I will know what I want when I see it.” Searching with a combination of predefined selection criteria and visual review of data plots.
• Web Shopping Cart; Routine ARM data and some IOP data; “I need to read about what you have, then I will decide.” Discover areas of interest by browsing the ARM web documentation and collect items of interest.
• IOP Data Browser; IOP, special, PI, and beta data; “I need to look in the odd parts bin.” Direct access to IOP data. Navigate /year/site/iop directory tree. Also use narrow Google search.
Moving on to … Results-based searching • An interface of “Statistical Views” (or data) under development for the ARM Archive. • Not all users want “data.” • Figures: user interface to select thumbnails of Statistical Views; detailed view of graph, with options to order statistics, data, or data files.
Understanding • Definition • The interpretation of the full context of the information. • Requirements • Descriptive metadata that correctly “matches up” information that was: • Generated from a variety of sources, • Collected for a variety of purposes, • Retained over a broad range of time. • “Understanding” applies to both: • Persons who read documentation. • Computers that “read” the data format. • Issues • “Language barriers” must be overcome between scientific disciplines. • Inadequate documentation and software can make “data” useless. • Additional effort will need to be allocated beyond original purpose. • Trade off between: current quantity of measurements and future use.
Sequence of Information Birth • [Diagram: the Integrated System & Archive View, showing Measurement records linked to Parameter def., Method def., Units def., QA def., Sample def., Record system, GIS, and org. records]
Support • Definition • Providing help and service beyond the creation of initial information and documentation. • Requirements • Answers user questions beyond the initial documentation. • Responds to the evolution of information technology. • Includes scientific and technology expertise. • Issues • Maintaining information does not: • Fit traditional science program planning. • Contain “whiz bang” appeal. • Requires development of new “career pathways”.
Research Implies Change … • Cycle: Research, Discovery, New questions, New data requirements, repeat… • This is not always true for other information systems.
Issues to Consider about Change • What will change? • Which changes can be controlled? • How are changes approved? • How are users notified about changes? • How and when can changes be “smoothed” in the cumulative “Archive” view?
EPDUS = ????? (1) • Existence • Technology has pushed this out of control – a path to chaos • Dilution of value causes a recovery problem • Develop procedures for retention guidelines • Permission • Plans that encourage permanent access to scientific data are a “management responsibility” • Consistent rules to protect privacy, resources, and propriety • Discovery • Significant effort on cataloging and searching • Large scale data collections depend on rational metadata • Need interrelated discovery pathways (query, catalog, pictures) • Results-based views are still very limited from large scale data • Inspiration is “an undeveloped frontier”
EPDUS = ????? (2) • Understanding • Expanding human and computer “interpretation” is difficult • Does not keep up with the increase in diversity of information types • Web documentation has an inverted outline compared with scientific publications • Web users don’t read !!! • (??!!! More later !!!??) • Support • Inclusion of scientific expertise in Data Centers is still debated and limited • Programmatic justification of Data Centers outside of (or after!!) a measurement program has limited “sponsor appeal”
Science Publication vs. “Web Reading”: “Inverted” Documentation Outline!?! • Science Publication order: Abstract, Introduction, Literature review, Materials / methods, Results, Discussion, Conclusion, References • “Web Reading” order: Conclusions, Results, Abstract, Materials / methods, Discussion, Literature review, References, Introduction • Reference: McCord (200?) ???
Cross cutting issues • Training about scientific data management • Some for all scientists, graduate program for “data scientists” • Reward system for scientific data “reuse” • Feudal relationship between more “Science” and data preservation • More measurements and experiments • Bigger computers driving science • Stop it!! Cooperation is needed!! • Scientific input needed for: • Metadata creation • Mesh with scientific planning • Defining priorities and recommendations • “An answer is better than NO answer!!” (Going for 0 points??!!) • Defining a reasonable boundary between “system and scientist” • Handshake needed for QA review, analysis tools, documentation, and automated discovery (?!?)
Looking from the Past to the Future (Common questions from my peers.) • “Should I computerize my data?” (~1974-1975) • “Should I save my {computerized} data?” (~1978-1980) • “Why would anyone want my data?” (~1980) • “Can anyone else properly understand my data?” (~1985) • “Can I have your data?” (~1990) • “Can I find your data?” (~1993-1994) • “Will I have to contact you to know how you used your data?” (~1998-1999) • “Can you tell me who else has used your data?” (~2000 - ????) • “Can you tell me where to find similar data?” (~2003 - ????) • “Do you want to know (or get back) how I used your data?” (~2005 - ????) • “Will you work together with me on ‘our’ data?” (????) • “Can we work together with our and ‘their’ data?” (????) • … What next … ?? • Timeline markers alongside the questions: Interactive computing starts; PC gets common; Internet “premie”; www.??? takes off; Cheap storage; Collaboration is common
Conclusions (again!!) • We are swamped with more information than we can access* • *Access is a broad topic (EPDUS = ????) • Our current practices may not be sustainable and reliable. • Exponential vs linear capacity increases • Optimization is unbalanced • Scientific expertise within data centers will improve future data access. • Science and data management must be integrated. • Many solutions are NOT technological, but behavioral. • Think - Training • “Data Science” training must be developed and implemented. • The needed changes will not happen by accident. • “My ~30 years of experience and systems observation suggests otherwise!!” • Sooner is better than later.
An Ending to the Story… (More conclusions) • The best house is probably a combination of: • Bricks to build on • Sticks (wood) to bend and change with • Straw to rest on when sorting out the problem is too early. • The Wolf needs to be tamed with: • Reduction in needless data management and documentation chaos and uninformed practices. • More thought (research??) about “our appetite” (priorities) for storage and retention. • The Wolf and Pigs both need more training!! • And they live happily ever after…!
Pop Quiz (Answer 1) • What is the “access combination” to my Lock? • Hints + missing hints • “I love it” • Decode numeric sequence from word lengths • {X = Y^z - ((Y*z)/12)} • All unknowns are integers • Solution is the integer number showing the sequence • Combination = 142
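Both hints can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that the slide does not state: the first term of the formula is read as Y raised to the power z (a superscript likely lost in extraction), and the values Y = 12, z = 2 are one integer pair that satisfies it. The function names are hypothetical.

```python
from typing import List, Tuple


def decode_word_lengths(phrase: str) -> int:
    """Concatenate the length of each word into one integer."""
    return int("".join(str(len(word)) for word in phrase.split()))


def combination_formula(y: int, z: int) -> int:
    """X = Y^z - ((Y*z)/12), reading the first term as a power (assumption)."""
    return y ** z - (y * z) // 12


def integer_solutions(target: int, max_y: int = 20, max_z: int = 5) -> List[Tuple[int, int]]:
    """Search small positive integers (Y, z) where the formula hits target.

    Only pairs where Y*z is divisible by 12 are kept, so the division is exact.
    The search bounds are arbitrary illustration values.
    """
    return [
        (y, z)
        for y in range(1, max_y + 1)
        for z in range(1, max_z + 1)
        if (y * z) % 12 == 0 and combination_formula(y, z) == target
    ]


if __name__ == "__main__":
    code = decode_word_lengths("I love it")
    print(code, integer_solutions(code))
```

The word-length decoding alone yields the combination; the formula acts as a cross-check, since under the stated reading it reproduces the same number.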
Pop Quiz (Answer 2) • How is my necktie related to: • Data? • They all look alike at first? • Metadata? • Neckties distinguish the teddy bears • Scientists? • They distinguish data in varying and “unseen” ways • 2 year old children? • They distinguish teddy bears in varying and “unseen” ways • Both CRY when given the “wrong one”
Pop Quiz (Answer 3) • “Why do I care?” • Better data access can turbo charge Science. • Things are a bigger mess than necessary. • Progress toward improvement is too passive and too slow. • Independently managing information from each project is like “paying rent” rather than “building equity.”
References • Information about my current project • Atmospheric Radiation Measurement (ARM) Program www.arm.gov • ARM Archive www.archive.arm.gov • Extended version of “The Three Little Pigs” • http://math-www.upb.de/~odenbach/pigs/pigs.html • Linked to a German Math professor’s web site?? • An English version is presented • Very good reference on Data, Science, and the need for new roles • “Long-Lived Digital Data Collections Enabling Research and Education in the 21st Century” • http://www.nsf.gov/pubs/2005/nsb0540/ • Sponsored by NSF National Science Board • Reports on other NSF cyber infrastructure activities to watch and encourage • http://www.nsf.gov/od/oci/reports.jsp