
Proposed DataONE TeraGrid Joint Initiative

Proposed DataONE TeraGrid Joint Initiative. John Cobb, TeraGrid, and DataONE Presentation to TeraGrid Quarterly Management Meeting August 31, 2010 Seattle, WA. DataONE objectives.


Presentation Transcript


  1. Proposed DataONE TeraGrid Joint Initiative John Cobb, TeraGrid, and DataONE Presentation to TeraGrid Quarterly Management Meeting August 31, 2010 Seattle, WA

  2. DataONE objectives
     • Develop a distributed cyberinfrastructure architecture to enable the long-term preservation of digital data: support the data life cycle
     • Engage the scientific community to move forward concepts of
       • Digital data archives of scholarly data
       • Best practices for digital data preservation
     • Engage journal publishers’ efforts for digital data repositories (e.g. Dryad)
     • Enable new science via data synthesis
     • Develop a long-term sustainability strategy (decades long)
       • Architecture
       • Technology future-proofing
       • Arrangements/MOUs
     • Focus on ecological, biological, and environmental science areas

  3. What shapes DataONE?
     • Challenges associated with climate variability
     • Community needs good data
     • Good data
       • builds good science
       • makes possible wise management
       • enables sound decisions
     • Good data needs
       • good technical infrastructure
       • sound organization
       • community engagement (you)

  4. Architecture to support the data lifecycle
     Nodes shown: ORC Node, UCSB Node, UNM Node
     The data lifecycle:
     • Deposition/acquisition/ingest
     • Curation and metadata management
     • Protection, including privacy
     • Discovery, access, use, and dissemination
     • Interoperability, standards, and integration
     • Evaluation, analysis, and visualization

  5. DataONE: Building new global CI
     Additional prospective member nodes under discussion

  6. The Character of a member node
     • Source of data
     • Participant in a larger collective
     • (Usually) provides new and interesting data sets (watersheds, satellite remote observations, citizen science data collections, environmental observations, geographical diversity, specific diversity, discipline diversity)
     • Supports DataONE Member Node (MN) software stack
     • May contribute storage to support replicas of other member nodes
     • May differ in scale
       • My data
       • University library digital services arm
       • Associated data repositories for journals
       • DOI infrastructure
       • Project-specific data collections
       • Agency-specific programs for data management
       • National-scale cyberinfrastructure providers (e.g. TG)

  7. The Metadata challenge: “the flood of increasingly heterogeneous data”
     • Data are heterogeneous
       • Syntax (format)
       • Schema (model)
       • Semantics (meaning)
     DataONE Focus: Synthesize data sets with disparate metadata to provide new scientific insights
     Jones et al. 2007
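The three kinds of heterogeneity named on this slide can be illustrated with a minimal sketch. Everything in it is hypothetical and not taken from DataONE: the two sample records, the field names, and the target schema are assumptions chosen only to show syntax (CSV vs. JSON), schema (different field names), and semantics (Fahrenheit vs. Celsius) being reconciled.

```python
import csv
import io
import json

# Two hypothetical observations of the same kind of measurement.
CSV_RECORD = "site,temp_f\nplot-1,68.0\n"                         # syntax: CSV
JSON_RECORD = '{"location": "plot-2", "temperature_c": 21.5}'     # syntax: JSON


def from_csv(text: str) -> dict:
    """Map the CSV record onto the (assumed) common schema."""
    row = next(csv.DictReader(io.StringIO(text)))
    # Semantics: this source reports Fahrenheit, so convert to Celsius.
    celsius = (float(row["temp_f"]) - 32.0) * 5.0 / 9.0
    return {"site": row["site"], "temperature_c": round(celsius, 2)}


def from_json(text: str) -> dict:
    """Map the JSON record onto the (assumed) common schema."""
    obj = json.loads(text)
    # Schema: this source calls the site field "location".
    return {"site": obj["location"], "temperature_c": float(obj["temperature_c"])}


harmonized = [from_csv(CSV_RECORD), from_json(JSON_RECORD)]
for record in harmonized:
    print(record)
```

Once both records share one schema and one unit, they can be combined in a single analysis, which is the kind of synthesis the slide describes.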

  8. DataONE Member Node Operations
     • Minimal set of operations to enable a distributed archive
       • Minimal to enable wide deployment in heterogeneous environment
       • Does not include some operations that are Coordinating node only
       • That set = {C,R,U,D}
         • Create
         • Replicate
         • Update
         • Delete
     • Implementation
       • Pilot now (operational and operational)
       • Eval. of Pilot started
       • V.1 deploy planned next yr.
     • Deployed platforms
       • Python
       • R
       • Mercury
       • …
       • Note the meaning of “platform”
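The minimal operation set listed on this slide (Create, Replicate, Update, Delete) can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the DataONE MN software stack: the `MemberNode` class, its method signatures, and the identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MemberNode:
    """Hypothetical in-memory stand-in for a member node (MN)."""
    name: str
    objects: dict = field(default_factory=dict)  # identifier -> content bytes

    def create(self, pid: str, data: bytes) -> None:
        """Store a new object; refuse to overwrite an existing identifier."""
        if pid in self.objects:
            raise KeyError(f"{pid} already exists on {self.name}")
        self.objects[pid] = data

    def replicate(self, pid: str, target: "MemberNode") -> None:
        """Copy an object held here onto another member node."""
        target.objects[pid] = self.objects[pid]

    def update(self, pid: str, data: bytes) -> None:
        """Replace the content of an object that already exists here."""
        if pid not in self.objects:
            raise KeyError(f"{pid} not found on {self.name}")
        self.objects[pid] = data

    def delete(self, pid: str) -> None:
        """Remove the local copy; replicas on other nodes are untouched."""
        del self.objects[pid]


# Usage with made-up node names and a made-up identifier.
mn_a = MemberNode("mn-a")
mn_b = MemberNode("mn-b")
mn_a.create("example:1", b"v1")
mn_a.replicate("example:1", mn_b)
mn_a.update("example:1", b"v2")
mn_a.delete("example:1")
print(sorted(mn_b.objects))
```

Keeping the interface this small is what makes wide deployment plausible: any repository that can store, copy, overwrite, and remove an object by identifier could in principle sit behind it.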

  9. Coordinating Nodes
     • Contains full metadata catalog of member node data collections
     • Directs certain operations
       • Replication direction
       • Location tracking
       • Ingestion
     • Assisted by deployed platforms. Ex. Mercury leads to automatic ingest capability for NASA DAAC (MODIS data)
     • CN locations also have MN instances. Provides some “free energy” for replication
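The coordinating-node roles above (metadata catalog, location tracking, replication direction) can be sketched as follows. This is a toy model under assumptions: the `CoordinatingNode` class, the "first nodes that lack a copy" placement rule, and the node names are hypothetical and say nothing about how DataONE actually chooses replica targets.

```python
class CoordinatingNode:
    """Hypothetical coordinating node (CN): catalog plus replica tracking."""

    def __init__(self, member_nodes):
        self.member_nodes = list(member_nodes)  # known MN names, in order
        self.catalog = {}  # pid -> {"metadata": dict, "locations": set}

    def register(self, pid, metadata, origin):
        """Record metadata for an object and the MN that first holds it."""
        self.catalog[pid] = {"metadata": dict(metadata), "locations": {origin}}

    def plan_replication(self, pid, copies):
        """Return the MN names that should receive a copy to reach `copies`."""
        held = self.catalog[pid]["locations"]
        candidates = [mn for mn in self.member_nodes if mn not in held]
        needed = max(0, copies - len(held))
        return candidates[:needed]

    def record_replica(self, pid, node):
        """Location tracking: note that `node` now holds a copy."""
        self.catalog[pid]["locations"].add(node)


# Usage with made-up node names and a made-up identifier.
cn = CoordinatingNode(["mn-a", "mn-b", "mn-c"])
cn.register("example:1", {"title": "Sample data set"}, origin="mn-c")
targets = cn.plan_replication("example:1", copies=3)
for node in targets:
    cn.record_replica("example:1", node)
print(targets, sorted(cn.catalog["example:1"]["locations"]))
```

The design point is that the CN only decides and records; the copying itself would be carried out by the member nodes.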

  10. Service layer model of data/knowledge services (Analogy with OSI)
     • Platters
     • Controllers
     • Hardware redundancy
     • I/O Bandwidth provisioning
     • Connections
     • File systems
     • AAAA
     • Federated Identity
     • Wide area data distribution
     • Block level
     • Xnodes
     • File level
     • Metadata generation (Automatically?)
     • Metadata harmonization
     • Replication, decoherent, survivable copies
     • Workflow mediated data operations
     • Semantics and ontology

  11. Natural TG and DataONE interaction
     • TG emphasizes left column
     • DataONE emphasizes right column, for areas of interest
     • DataONE MN collective resembles part of old TeraGrid collections mission
     • DataONE includes large community engagement component with the hope of generating sufficient interest for collected communities to sustain interest (cf. well attended data best practices tutorial at 2010 Ecological Society of Am. meeting)

  12. Proposed interaction
     • For DataONE: TeraGrid RPs (XD SPs) as Member nodes
     • For TeraGrid: DataONE as a data-oriented Science Gateway
     • Requirements:
       • For DataONE:
         • Participate in TG activities
           • SciGwy efforts
           • Some of TG’s distributed data efforts
           • Some of TG outreach
         • Request data allocations
       • TeraGrid RPs:
         • Deploy DataONE MN services
         • Make MN services available as REST services (advertised SW IIS)
       • Both:
         • Interact
         • Investigate “new opportunities”

  13. What about XD?
     • TeraGrid is “Pre-XD”
     • Does XD have a data archive mission?
       • yes (as far as I know now)
       • All things Digital, but eXtreme
     “The goal of this solicitation is to encourage innovation in the design and implementation of an effective, efficient, increasingly virtualized approach to the provision of high-end digital services – extreme digital services – while ensuring that the infrastructure continues to deliver high-quality access for the many researchers and educators that use it in their work.”
     • Conclusion: work with current TeraGrid and plan to manage a smooth transition to XD (DataONE will need to be capable of this pivot if it hopes to have decades-long stewardship)
     • Go ahead and get started now

  14. Sustainability
     • DataONE is called to create an environment for “decades long” sustainability, technically and economically
     • No project has more than a 5-year horizon (not even NASA archives)
     • DataNets must “figure this out”
     • Solution: plan to manage change
     • Recognize the underlying forces. Science wants data preservation
       • “someone will provide” (More detail needed here)

  15. What is the Value add?
     • Helps TG and DataONE meet their respective goals
       • Providing cyberinfrastructure for NSF-funded research
       • Providing curation and life cycle support for digital data archives
     • Diminishes DataONE's need to provision large amounts of low-level data resources internally: partner instead of re-invent
     • Reiterates TeraGrid/XD mission to provide tier 2 (and tier 1) resources for storage

  16. Next steps/action items
     • Commission a combined TG+D1 WG
     • Goals
       • Develop TG RPs as DataONE member nodes
     • Action Items
       • DataONE All hands meeting Nov. 2-5, Tamaya, NM
       • Initiate DataONE SGW
       • Initial TG allocation
       • Deploy pilot MN stack on TG resources
       • Demonstrate CN-orchestrated replication to TG MNs: exercise the CRUD services
     • Composition
       • TeraGrid
         • Chris Jordan, TG AD for Data
         • Nancy Wilkins-Diehr, TG AD for SGW
         • Dan Katz, TG Dir. of Science
         • Others?
       • DataONE
         • Dave Vieglais, DataONE AD for CI
         • John Cobb, Dist Storage WG lead
         • Bruce Wilson, DataONE core cyberinfrastructure team (CCIT)
         • Others

  17. Where are future opportunities?
     • MN replication can be viewed as data placement. Thus DataONE can be a data staging method for large-scale computations on TG/XD
     • Metadata harmonization can imply moderate to large regular computations (“daily farm fresh” data sets may require daily data/computation workflows)
     • “Noodle out” how to support NSF data management plan requirement, perhaps together
     • Ability to integrate with MREs as a ready data management solution
     • Ability to integrate with similar simulation efforts (much more data intensive)

  18. Discussion/Questions? Cobbjw@ornl.gov 865.576.5439

  19. Post discussion action items
     • Smaller team continue discussions (Cobb, Jordan, Katz, Wilkins-Diehr, Vieglais, Wilson, Jones)
     • Bundle pilot MN SW for TG MN deployment
     • Identify MN listening ports for services
     • Initiate Security WG
     • Initiate Gateway project
     • Define RPs willing to deploy these services
     • DataONE to write TG allocation request
     • Gateway services
     • Replicated Data Service
     • Continue larger discussion, particularly as larger needs come down the line
     • Explore mutual line of business opportunities
     • Separately: continue to investigate economic sustainability of large-scale storage needs
