
Big Data Computing: Building a Vision for ARS Information Management

Feb. 5-7, 2012, GWCC, Beltsville, MD. Workshop Organizing Committee: Rosalind R. James, Carolyn Lawrence


Presentation Transcript


  1. Big Data Computing: Building a Vision for ARS Information Management. Feb. 5-7, 2012, GWCC, Beltsville, MD. Workshop Organizing Committee: Rosalind R. James, Carolyn Lawrence, Sharon Papiernik, Curt Van Tassell

  2. Workshop Purpose: Bring ARS scientific capability to the cutting edge

  3. Workshop Purpose: Develop a vision and strategy that defines (1) ARS scientific Big Data needs and (2) an infrastructure for dealing with these needs, now and into the future

  4. What is Big Data? Massive amounts of data that accumulate over time and are difficult to analyze and handle using common data management tools.

  5. Size Isn’t Everything… • Big Data comes in four V-dimensions: • Volume. With large size comes difficulty in finding what is relevant, finding space to store it, and deciding how to index it • Variety. Highly structured data, variably structured data, and unstructured data • Velocity. How fast is the data created, and how fast must it be processed? • Veracity. Uncertain or imprecise data.

  6. What makes Big Data so important? • Researchers no longer simply ask, “What experimental design will best address this question?” • But rather, “What can I glean from extant data?” • Or better yet, “What insights can I glean if I could fuse data from multiple domains?” From: The Fourth Paradigm: Data-Intensive Scientific Discovery

  7. “We are drowning in information… The world will be run by synthesizers, people able to put together the right information at the right time, think critically about it, and make important choices wisely.” E. O. Wilson. 1998. Consilience: The Unity of Knowledge

  8. Scientific computing is becoming increasingly data intensive. We are becoming increasingly able to answer previously intractable questions, solve problems more efficiently, and characterize the natural world in greater detail.

  9. An era of large datasets • Large Hadron Collider • 15 Pbytes/year (15 x 10^6 Gbytes, 15 x 10^3 Tbytes) • Pan-STARRS (panoramic survey telescope) • 2 Gbytes per image, taken every 30 sec from 4 cameras • Several Tbytes/night/telescope • Natl. Human Genome Research Institute • 1000 genomes = 200 Tbytes • Beijing Genomics Institute • 5 Tbytes/day

  10. GenBank Sequence Growth (to 2008)

  11. What it takes to move Big Data • 1 Gbyte of data • T1 line: 1.5 hrs • Thin Ethernet: 14 min • Fast Ethernet: 1 min • 1 Tbyte of data • T1 line: 65 days, 22.5 hrs • Thin Ethernet: 10 days, 4.3 hrs • Fast Ethernet: 1 day, 0.5 hrs • Gig-E: 2 hrs, 26 min
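The transfer times on this slide can be approximated with a simple ideal calculation: payload size in bits divided by the line rate. The sketch below is an illustration only. The nominal line rates (1.544 Mbit/s for T1, 10 Mbit/s for thin Ethernet, 100 Mbit/s for Fast Ethernet, 1 Gbit/s for Gig-E), the use of binary units (1 Gbyte = 2^30 bytes), and all function names are assumptions, not taken from the presentation. Protocol overhead and congestion are ignored, so real transfers would be slower.

```python
# Assumed nominal line rates in bits per second (not stated on the slide).
LINK_RATES_BPS = {
    "T1 line": 1.544e6,
    "Thin Ethernet": 10e6,
    "Fast Ethernet": 100e6,
    "Gig-E": 1e9,
}

# Assumed binary units for data sizes.
GBYTE = 2**30
TBYTE = 2**40


def transfer_seconds(size_bytes, rate_bps):
    """Ideal transfer time in seconds: payload bits / line rate, no overhead."""
    return size_bytes * 8 / rate_bps


def format_duration(seconds):
    """Render a duration in the style used on the slide."""
    days, remainder = divmod(seconds, 86400)
    hours = remainder / 3600
    if days >= 1:
        return f"{int(days)} days, {hours:.1f} hrs"
    if hours >= 1:
        return f"{hours:.1f} hrs"
    return f"{seconds / 60:.0f} min"


def main():
    for label, size in (("1 Gbyte", GBYTE), ("1 Tbyte", TBYTE)):
        print(label)
        for link, rate in LINK_RATES_BPS.items():
            print(f"  {link}: {format_duration(transfer_seconds(size, rate))}")


if __name__ == "__main__":
    main()
```

Under these assumptions the results land close to the slide's figures (for example, roughly 65.9 days for 1 Tbyte over a T1 line); small differences are down to rounding.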

  12. Moving into the cloud • Scientists need to be able to move and share large datasets. • Cloud/Cluster/Grid computing. • Not just for holding data, but for computations • Reduce the need to repeatedly move the same datasets.

  13. Libraries: Provide access and dissemination of information…

  14. Existing Systems for Handling Big Data • XSEDE (replaces TeraGrid) • A virtual system that scientists can use to interactively share supercomputer resources, data, and expertise • Composite of several university advanced computing centers • iPlant (Texas Advanced Computing Center) • Plant genomic data • Cyberinfrastructure for transfer, storage, analysis, visualization, metadata control, discovery, etc. • Cloud computing

  15. Existing Big Data Systems (cont.) • Three Rivers Optical Exchange (part of XSEDE) • Amazon Cloud Computing • Purchase computing power and storage, as needed • John Wesley Powell Center for Analysis & Synthesis • USGS • Earth sciences issues • “Enhancing scientific discovery & problem solving through integrated research.” • European grid systems • Watson (?)

  16. ARS Could Provide Leadership for Agricultural Data. OSTP Big Data Research and Development Initiative, John Holdren (3/29/2012): • The government is underinvesting in data management • The process of going from data to knowledge to understanding is being inhibited • Human capital needs • People with deep analytical skills • Data-savvy managers/executives • More IT-savvy technicians, for both structured and unstructured data

  17. What does ARS have to add? • Decision support software could operate from a cloud system • Public databases could be better organized and, collectively, more easily accessible • Large data • Currently wasting money on redundant hardware • And software • Currently have difficulty moving the data • Cloud systems facilitate fusing datasets • ARS is capable of providing long-term stability for storage and analyses

  18. Thus this Workshop Will • Gather together ARS scientists • who are already working with large data • or with experience and knowledge of our current database collections • or who are trying to work with Big Data • Include speakers familiar with Big Scientific Data issues, who have developed solutions • Develop a Vision for what an ARS solution should look like.

  19. Outcome of the Workshop: A white paper describing a vision for ARS Big Data, including examples of current needs and an infrastructure for meeting current and future needs. This infrastructure will include IT resources, intellectual resources, and personnel resources.

  20. Recipients of the Information • ARS Administrators (AC Council) • ARS Office of National Programs • OCIO and IT Specialists in the Field • ARS Scientific Staff (scientists, technicians, computational biologists, statisticians)

  21. The climb is steep, but there are cairns along the way.

  22. Thank you!
