
Foundations VII: Data life-cycle, Mining and Knowledge Discovery

Deborah McGuinness and Joanne Luciano, with Peter Fox and Li Ding. CSCI-6962-01, Week 13, November 29, 2010.


Presentation Transcript


  1. Foundations VII: Data life-cycle, Mining and Knowledge Discovery Deborah McGuinness and Joanne Luciano With Peter Fox and Li Ding CSCI-6962-01 Week 13, November 29, 2010

  2. Contents • Review assignment • More advanced topics; life cycle, mining and adding to your knowledge base • Summary • Next week (your presentations)

  3. Semantic Web Methodology and Technology Development Process • Establish and improve a well-defined methodology vision for Semantic Technology based application development • Leverage controlled vocabularies, etc. [Diagram of the development process; labels: Adopt Technology Approach; Leverage Technology Infrastructure; Science/Expert Review & Iteration; Rapid Prototype; Open World: Evolve, Iterate, Redesign, Redeploy; Use Tools; Evaluation; Analysis; Use Case; Develop model/ontology; Small Team, mixed skills]

  4. Data->Information->Knowledge

  5. Data Life Cycle • Life cycle (we will define these shortly) • Acquisition, curation, preservation • Long term stewardship • Data and information – we use this to get to the discussion of knowledge • Content; the values • Context; the background, setting, etc. • Structure; organization and form • Representation/ storage • Analog • Digital (and born digital)

  6. Why it is important • 1976 NASA Viking mission to Mars (A. Hesseldahl, Saving Dying Data, Sep. 12, 2002, Forbes. [Online]. Available: http://www.forbes.com/2002/09/12/0912data_print.html) • 1986 BBC Digital Domesday (A. Jesdanun, “Digital memory threatened as file formats evolve,” Houston Chronicle, Jan. 16, 2003. [Online]. Available: http://www.chron.com/cs/CDA/story.hts/tech/1739675) • R. Duerr, M. A. Parsons, R. Weaver, and J. Beitler, “The international polar year: Making data available for the long-term,” in Proc. Fall AGU Conf., San Francisco, CA, Dec. 2004. [Online]. Available: ftp://sidads.colorado.edu/pub/ppp/conf_ppp/Duerr/The_International_Polar_Year:_Making_Data_and_Information_Available_for_the_Long_Term.ppt

  7. Why (cont’d) • e-science aims to derive new knowledge from (possibly) multiple sources of data • The data needs to be persistent, available and usable • The rate of creation of knowledge representations is increasing; they are a representation of the known ‘facts’ based on the data • We studied KR creation, engineering, evolution and iteration • Knowledge needs a life-cycle as well

  8. At the heart of it • Inability to read the underlying sources, e.g. the data formats, metadata formats, knowledge formats, etc. • Inability to know the inter-relations, assumptions and missing information • We’ll look at a (data) use case for this shortly • But first we will look at what, how and who in terms of the full life cycle

  9. What to collect? • Documentation • Metadata • Provenance • Ancillary Information • Knowledge

  10. Who does this? • Roles: • Data creator • Data analyst • Data manager • Data curator

  11. How it is done

  12. Acquisition

  13. Curation

  14. Preservation • Usually refers to the full life cycle • Archiving is a component • Stewardship is the act of preservation • Intent is that ‘you can open it any time in the future’ and that ‘it will be there’ • This involves steps that may not be conventionally thought of • Think 10, 20, 50, 200 years…. looking historically gives some guide to future considerations

  15. Some examples and experience • NASA • NOAA • Library community • Note: • Mostly in relation to publications, books, etc., but some for data • Note that knowledge is in publications, but their structure and form are meant for humans, not computers, despite advances in text analysis • Very little for the type of knowledge we are considering: knowledge in machine-accessible form

  16. Back in the day... SEEDS Working Group on Data Lifecycle • Second Workshop Report • https://esdswg.eosdis.nasa.gov/documents/W2_Bothwell.pdf • Many LTA recommendations • Earth Sciences Data Lifecycle Report • https://esdswg.eosdis.nasa.gov/documents/lta_prelim_rprt2.pdf • Many lessons learned from USGS experience, plus some recommendations • SEEDS Final Report (2003) - Section 4 • https://esdswg.eosdis.nasa.gov/documents/FinRec.pdf • Final recommendations vis a vis data lifecycle MODIS Pilot Project • GES DISC, MODAPS, NOAA/CLASS, ESDIS effort • Transferred some MODIS Level 0 data to CLASS

  17. Mostly Technical Issues • Data Preservation • Bit-level integrity • Data readability • Documentation • Metadata • Semantics • Persistent Identifiers • Virtual Data Products • Lineage Persistence • Required ancillary data • Applicable standards
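Bit-level integrity in the list above is commonly checked by recording a checksum at ingest and recomputing it later. The following is a minimal sketch of that idea, not a procedure described in the slides: the choice of SHA-256, the function names, and the in-memory sample bytes are all assumptions made for illustration.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=65536):
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(stream, recorded_digest):
    """Compare a freshly computed digest with the one stored at ingest time."""
    return sha256_of_stream(stream) == recorded_digest


# Hypothetical archive content, held in memory so the sketch is self-contained.
original = b"granule bytes as archived"
recorded = sha256_of_stream(io.BytesIO(original))

# Simulate bit rot by flipping a single bit in a copy.
corrupted = bytearray(original)
corrupted[0] ^= 0x01

intact_ok = verify_fixity(io.BytesIO(original), recorded)
corrupt_ok = verify_fixity(io.BytesIO(bytes(corrupted)), recorded)
```

A check like this only detects damage; recovering from it still depends on having a second copy, which is one reason archive-to-archive policy appears among the non-technical issues.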

  18. Mostly Non-Technical Issues • Policy (constrained by money…) • Front end of the lifecycle • Long-term planning, data formats, documentation... • Governance and policy • Legal requirements • Archive to archive transitions • Money (intertwined with policy) • Cost-benefit trades • Long-term needs of NASA Science Programs • User input • Identifying likely users • Levels of service • Funding source and mechanism

  19. Use case: a real live one; deals mostly with structure and (some) content • HDF4 Format "Maps" for Long Term Readability • C. Lynnes, GES DISC • R. Duerr and J. Crider, NSIDC • M. Yang and P. Cao, The HDF Group • HDF = Hierarchical Data Format • NSIDC = National Snow and Ice Data Center • GES = Goddard Earth Science • DISC = Data and Information Service Center

  20. In the year 2025... A user of HDF-4 data will run into the following likely hurdles: • The HDF-4 API and utilities are no longer supported... • ...now that we are at HDF-7 • The archived API binary does not work on today's OS's • ...like Android 3.1 • The source does not compile on the current OS • ...or is it the compiler version, gcc v. 7.x? • The HDF spec is too complex to write a simple read program... • ...without re-creating much of the API What to do?

  21. HDF Mapping Files Concept:  create text-based "maps" of the HDF-4 file layouts while we still have a viable HDF-4 API (i.e., now) • XML • Stored separately from, but close to the data files • Includes  • internal metadata • variable info • chunk-level info • byte offsets and length • linked blocks • compression information Task funded by ESDIS project •  The HDF Group, NSIDC and GES DISC

  22. Map sample (extract)

        <hdf4:SDS objName="TotalCounts_A" objPath="/ascending/Data Fields" objID="xid-DFTAG_NDG-5">
          <hdf4:Attribute name="_FillValue" ntDesc="16-bit signed integer">
            0 0
          </hdf4:Attribute>
          <hdf4:Datatype dtypeClass="INT" dtypeSize="2" byteOrder="BE" />
          <hdf4:Dataspace ndims="2">
            180 360
          </hdf4:Dataspace>
          <hdf4:Datablock nblocks="1">
            <hdf4:Block offset="27266625" nbytes="20582" compression="coder_type=DEFLATE" />
          </hdf4:Datablock>
        </hdf4:SDS>
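To illustrate why such a map helps, here is a minimal sketch of a reader that uses only the map and standard-library calls, with no HDF-4 API. It is not the prototype C or Perl reader mentioned in the slides. Several things are assumptions: the namespace URI (the extract shows only the `hdf4:` prefix), that the DEFLATE data is a zlib-wrapped stream, that all blocks of a data block form one compressed stream, and that only 2-byte integers in two dimensions need handling. The data file and map below are tiny synthetic stand-ins built in memory.

```python
import io
import struct
import xml.etree.ElementTree as ET
import zlib

# Assumed namespace URI; the slide's extract does not show the declaration.
NS = {"hdf4": "urn:example:hdf4-map"}


def read_sds(map_xml, data_file):
    """Read one 2-D SDS using only the map: offsets, lengths, type and shape."""
    sds = ET.fromstring(map_xml)
    dtype = sds.find("hdf4:Datatype", NS)
    rows, cols = (int(n) for n in sds.find("hdf4:Dataspace", NS).text.split())

    raw = b""
    compressed = False
    for block in sds.iterfind("hdf4:Datablock/hdf4:Block", NS):
        data_file.seek(int(block.get("offset")))
        raw += data_file.read(int(block.get("nbytes")))
        compressed = compressed or "DEFLATE" in block.get("compression", "")
    if compressed:
        # Assumption: the concatenated blocks form a single zlib stream.
        raw = zlib.decompress(raw)

    # Only 2-byte signed integers are handled in this sketch.
    if dtype.get("dtypeClass") != "INT" or dtype.get("dtypeSize") != "2":
        raise ValueError("unsupported datatype in this sketch")
    order = ">" if dtype.get("byteOrder") == "BE" else "<"
    flat = struct.unpack(f"{order}{rows * cols}h", raw)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


# Synthetic stand-in for an archived file: 16 unrelated bytes, then the block.
payload = zlib.compress(struct.pack(">6h", 1, -2, 3, 4, 5, -6))
data_file = io.BytesIO(b"\x00" * 16 + payload)

map_xml = f"""<hdf4:SDS xmlns:hdf4="urn:example:hdf4-map" objName="Demo">
  <hdf4:Datatype dtypeClass="INT" dtypeSize="2" byteOrder="BE" />
  <hdf4:Dataspace ndims="2">2 3</hdf4:Dataspace>
  <hdf4:Datablock nblocks="1">
    <hdf4:Block offset="16" nbytes="{len(payload)}" compression="coder_type=DEFLATE" />
  </hdf4:Datablock>
</hdf4:SDS>"""

grid = read_sds(map_xml, data_file)
```

The point of the design is that everything the reader needs (byte offset, length, compression, byte order, shape) is plain text in the map, so a future user can write a reader like this without re-creating the HDF-4 library.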

  23. Status and Future Status • Map creation utility (part of HDF) • Prototype read programs • C • Perl • Paper in TGRS special issue • Inventory of HDF-4 data products within EOSDIS Possible Future Steps • Revise XML schema • Revise map utility and add to HDF baseline • Implement map creation and storage operationally • e.g., add to ECS or S4PA metadata files

  24. Examples of NASA context

  25. Contextual Information: • Instrument/sensor characteristics including pre-flight or pre-operational performance measurements (e.g., spectral response, noise characteristics, etc.) • Instrument/sensor calibration data and method • Processing algorithms and their scientific basis, including complete description of any sampling or mapping algorithm used in creation of the product (e.g., contained in peer-reviewed papers, in some cases supplemented by thematic information introducing the data set or derived product) • Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product. Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign. 7th Joint ESDSWG meeting, October 22, Philadelphia, PA, Data Lifecycle Workshop sponsored by the Technology Infusion Working Group.

  26. Contextual Information (continued): • Processing history including versions of processing source code corresponding to versions of the data set or derived product held in the archive • Quality assessment information • Validation record, including identification of validation data sets • Data structure and format, with definition of all parameters and fields • In the case of earth based data, station location and any changes in location, instrumentation, controlling agency, surrounding land use and other factors which could influence the long-term record • A bibliography of pertinent Technical Notes and articles, including refereed publications reporting on research using the data set • Information received back from users of the data set or product

  27. However… • Even groups like NASA do not have a governance model for this work • Governance: definition • Stakeholders: • NASA for integrity of their data holdings (is it their responsibility?) • Public for value for and return on investment • Scientists for future use (intended and un-intended) • Historians

  28. NOAA

  29. Library community • OAIS • OAI (PMH and ORE)

  30. Metadata Standards - PREMIS • Provide a core preservation metadata set with broad applicability across the digital preservation community • Developed by an OCLC and RLG sponsored international working group • Representatives from libraries, museums, archives, government, and the private sector. • Based on the OAIS reference model

  31. Metadata Standards - PREMIS • Maintained by the Library of Congress • Editorial board with international membership • User community consulted on changes through the PREMIS Implementers Group • Version 1 was released in June 2005 • Version 2 was just released

  32. PREMIS - Entity-Relationship Diagram • Intellectual Entities: “a coherent set of content that is reasonably described as a unit”, for example, a web site, data set or collection of data sets • Objects: “a discrete unit of information in digital form”, for example, a data file • Events: “an action that involves at least one object or agent known to the preservation repository”, e.g., created, archived, migrated • Agents: “a person, organization, or software program associated with preservation events in the life of an object”, e.g., Dr. Spock donated it • Rights: “assertions of one or more rights or permissions pertaining to an object or an agent”, e.g., copyright notice, legal statute, deposit agreement

  33. PREMIS - Types of Objects • Representation - “the set of files needed for a complete and reasonable rendition of an Intellectual Entity” • File • Bitstream - “contiguous or non-contiguous data within a file that has meaningful common properties for preservation purposes”
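The PREMIS entities and object types on the two slides above can be pictured as a small data model. The sketch below is only an illustration of how the entities relate; it is not the PREMIS schema or data dictionary, and every class and field name in it is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Agent:
    """A person, organization, or software program tied to events."""
    name: str


@dataclass
class Event:
    """An action involving an object or agent, e.g. created, archived, migrated."""
    event_type: str
    agents: List[Agent] = field(default_factory=list)


@dataclass
class PreservedObject:
    """A discrete unit of information in digital form."""
    identifier: str
    category: str  # "representation", "file", or "bitstream"
    events: List[Event] = field(default_factory=list)
    rights: List[str] = field(default_factory=list)


@dataclass
class IntellectualEntity:
    """A coherent set of content reasonably described as a unit."""
    title: str
    objects: List[PreservedObject] = field(default_factory=list)


# Hypothetical example: one data file with a short preservation history.
donor = Agent("Dr. Spock")
data_file = PreservedObject("file-001", "file", rights=["deposit agreement"])
data_file.events.append(Event("created", [donor]))
data_file.events.append(Event("migrated"))
dataset = IntellectualEntity("Example data set", [data_file])

history = [event.event_type for event in dataset.objects[0].events]
```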

  34. Metadata Standards - METS • Metadata Encoding and Transmission Standard • An initiative of the Digital Library Federation • Based on the Making of America II project

  35. METS - What’s Its Purpose? • Provides the means to convey the metadata necessary for • management of digital objects within a repository • exchange of objects between repositories (or between repositories and their users) • Designed to facilitate • shared development of information management tools/services • interoperable exchange of digital materials

  36. METS - What’s its status? • Version 1.6 was released in Sept. 2007 • Maintained by the Library of Congress • International Editorial Board • NISO registration as of 2006

  37. Backup Materials - MODIS Contextual Info

  38. Instrument/sensor characteristics

  39. Processing Algorithms & Scientific Basis

  40. Ancillary Data

  41. Processing History including Source Code

  42. Quality Assessment Information

  43. Validation Information

  44. Other Factors that can Influence the Record

  45. Bibliography

  46. Information from users • Data Errors found • Quality updates • Things that need further explanation • Metadata updates/additions? • Community contributed metadata????

  47. Back to why you need to… • E-science uses data and it needs to be around when what you create goes into service and you go on to something else • That’s why someone on the team must address life-cycle (data, information and knowledge – we’ll get to the latter shortly) and work with other team members to implement organizational, social and technical solutions to the requirements

  48. What would you need to do?

  49. (Digital) Object Identifiers • Object is used here so as not to pre-empt an implementation, e.g. resource, sample, data, catalog • Examples: • DOI • URI • XRI
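As a small illustration of how two of the identifier examples above relate, a DOI can be expressed as a resolvable URI by prefixing it with the public `https://doi.org/` resolver. The sketch below assumes that convention; the function name and the sample DOI are hypothetical and do not refer to a real object.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_uri(doi):
    """Turn a DOI string such as 'doi:10.1234/abc' into a resolver URI."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    prefix, sep, suffix = doi.partition("/")
    # A DOI has a registrant prefix starting with "10." and a non-empty suffix.
    if not (prefix.startswith("10.") and sep and suffix):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URI path.
    return DOI_RESOLVER + quote(doi, safe="/")


# Hypothetical identifier, used only to exercise the function.
uri = doi_to_uri("doi:10.1234/example.dataset.v1")
```

Keeping the identifier separate from the resolver address is what lets the object stay citable even if the resolving service changes.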

  50. Versioning
