
Rethinking how we provide science IT in an era of massive data but modest budgets Ian Foster



Presentation Transcript


  1. Rethinking how we provide science IT in an era of massive data but modest budgets Ian Foster

  2. Exploding data volumes in biology: x10^7 in 14 years

  3. Exploding data volumes in astronomy MACHO et al.: 1 TB; Palomar: 3 TB; 2MASS: 10 TB; GALEX: 30 TB; Sloan: 40 TB; 100,000 TB; Pan-STARRS: 40,000 TB

  4. Exploding data volumes in climate science 2004: 36 TB 2012: 2,300 TB Climate model intercomparison project (CMIP) of the IPCC
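The two CMIP figures on this slide imply a growth rate that is easy to work out. The sketch below assumes steady exponential growth between 2004 and 2012, which the slide does not claim; the function name is purely illustrative.

```python
import math

def doubling_time_years(start_tb, end_tb, years):
    """Doubling time implied by growing from start_tb to end_tb over `years`,
    assuming constant exponential growth (an assumption, not a slide claim)."""
    growth = end_tb / start_tb
    return years / math.log2(growth)

# Figures from the slide: 36 TB in 2004, 2,300 TB in 2012.
cmip_growth = 2300 / 36
cmip_doubling = doubling_time_years(36, 2300, 2012 - 2004)

print(f"growth factor: about {cmip_growth:.0f}x")
print(f"implied doubling time: about {cmip_doubling:.2f} years")
```

Under that assumption the archive grows roughly 64-fold in eight years, that is, it doubles about every 16 months.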

  5. The challenge of staying competitive "Well, in our country," said Alice … "you'd generally get to somewhere else — if you run very fast for a long time, as we've been doing.” "A slow sort of country!" said the Queen. "Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!"

  6. Ways of running faster (1) “Civilization advances by extending the number of important operations which we can perform without thinking about them” – Alfred North Whitehead, 1911 Enhance human capabilities

  7. Ways of running faster (2) Utility computing “[t]he computing utility could become the basis for a new and important industry” – McCarthy, 1960 Grid computing “provide access to computing on demand” – The Grid: Blueprint for a New Computing Infrastructure, 1999 Cloud computing “delivery of computing as a service rather than a product” [Wikipedia, 2012] Outsource automatable tasks Enhance human capabilities

  8. Ways of running faster (3) Collaboratories, P2P, crowdsourcing Virtual organizations “flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources”, Anatomy of Grid, 2001 Outsource automatable tasks Join forces with others Enhance human capabilities

  9. Big science has been keeping up OSG: 1.4M CPU-hours/day, >90 sites, >3000 users, >260 pubs in 2010 LIGO: 1 PB data in last science run, distributed worldwide Robust production solutions Substantial teams and expense Sustained, multi-year effort Application-specific solutions, built on common technology ESG: 1.2 PB climate data delivered to 23,000 users; 600+ pubs

  10. But small science is struggling More data, more complex data Ad-hoc solutions Inadequate software, hardware Data plan mandates

  11. Medium science struggles too Blanco 4m on Cerro Tololo Image credit: Roger Smith/NOAO/AURA/NSF • Dark Energy Survey receives 100,000 files each night in Illinois • They transmit files to Texas for analysis … then move results back to Illinois • Process must be reliable, routine, and efficient • The IT team is not large

  12. Science IT crisis demands new approaches • We have exceptional infrastructure for the 1% (e.g., supercomputers, LHC, …) • But not for the 99% (e.g., the vast majority of the 1.8M publicly funded researchers in the EU) We need new approaches to providing science IT, that: — Reduce barriers to entry — Are cheaper — Are sustainable

  13. You can run a company from a coffee shop

  14. Because businesses outsource their IT Web presence Email (hosted Exchange) Calendar Telephony (hosted VOIP) Human resources and payroll Accounting Customer relationship mgmt Software as a Service (SaaS)

  15. And often their large-scale computing too Web presence Email (hosted Exchange) Calendar Telephony (hosted VOIP) Human resources and payroll Accounting Customer relationship mgmt Data analytics Content distribution Software as a Service (SaaS) Infrastructure as a Service (IaaS)

  16. Consumers also outsource much of their IT

  17. Let’s rethink how we provide research IT Accelerate discovery and innovation worldwide by providing research IT as a service Leverage software-as-a-service to • provide millions of researchers with unprecedented access to powerful tools; • enable a massive shortening of cycle times in time-consuming research processes; and • reduce research IT costs dramatically via economies of scale (and address sustainability?)

  18. Also address administrative costs? 42% of the time spent by an average PI on a federally funded research project was reported to be expended on administrative tasks related to that project rather than on research — Federal Demonstration Partnership faculty burden survey, 2007

  19. Time-consuming tasks in science • Run experiments • Collect data • Manage data • Move data • Acquire computers • Analyze data • Run simulations • Compare experiment with simulation • Search the literature • Communicate with colleagues • Publish papers • Find, configure, install relevant software • Find, access, analyze relevant data • Order supplies • Write proposals • Write reports • …

  20. Time-consuming tasks in science • Run experiments • Collect data • Manage data • Move data • Acquire computers • Analyze data • Run simulations • Compare experiment with simulation • Search the literature • Communicate with colleagues • Publish papers • Find, configure, install relevant software • Find, access, analyze relevant data • Order supplies • Write proposals • Write reports • …

  21. 1980 Scientific data delivery, 2012 • “[A] majority of users at BES facilities … physically transport data to a home institution using portable media … data volumes are going to increase significantly in the next few years (to 70 TB/day or more) – data must be transferred over the network” • “the effectiveness of data transfer middleware [is] not just on the transfer speed, but also the time and interruption to other work required to supervise and check on the success of large data transfers” • “It took two weeks and email traffic between network specialists at NERSC and ORNL, sys-admins at NERSC, … and combustion staff at ORNL and SNL to move 10 TB from NERSC to ORNL” Major usability, productivity, performance problems [ESNet Network Requirements Workshops, 2007-2010]

  22. The challenge: Moving big data easily What should be trivial … “I need my data over there – at my _____” (supercomputing center, campus server, etc.) [Data Source → Data Destination] … can be painfully tedious and time-consuming “GAAAH!%&@#&” [Data Source → Data Destination] ! Config issues ! Firewall issues ! Unexpected failure = manual retry

  23. GO PICTURE

  24. GO-Transfer: Data transfer as SaaS • Reliable file transfer. • Easy “fire-and-forget” transfers • Automatic fault recovery • High performance • Across multiple security domains • No IT required. • Software as a Service (SaaS) • No client software installation • New features automatically available • Consolidated support & troubleshooting • Works with existing GridFTP servers • Globus Connect solves “last mile problem” GO-Transfer is the initial offering of the US National Science Foundation’s XSEDE User Access Services (XUAS)
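The contrast between "unexpected failure = manual retry" and "automatic fault recovery" can be illustrated with a small retry loop. This is a minimal sketch of the general idea only, not how GO-Transfer is implemented: `do_transfer` is a hypothetical callable that raises `OSError` on a transient failure, and the exponential backoff schedule is an assumption chosen for illustration.

```python
import time

def transfer_with_retry(do_transfer, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call do_transfer() until it succeeds or the attempts run out.

    do_transfer is a hypothetical callable that raises OSError on a
    transient failure. The wait doubles after each failed attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return do_transfer()
        except OSError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))

class FlakyTransfer:
    """Stand-in for a real transfer: fails a fixed number of times, then succeeds."""
    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise OSError("simulated network drop")
        return "done"

# Record the delays instead of actually sleeping, so the demo runs instantly.
delays = []
flaky = FlakyTransfer(failures=2)
result = transfer_with_retry(flaky, sleep=delays.append)
print(result, flaky.calls, delays)
```

The point of "fire-and-forget" is that a loop like this runs inside the hosted service rather than in a script that a graduate student has to supervise.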

  25. Globus Online’s SaaS/Web 2.0 architecture Command line interface: ls alcf#dtn:/ and scp alcf#dtn:/myfile nersc#dtn:/myfile HTTP REST interface: POST https://transfer.api.globusonline.org/v0.10/transfer <transfer-doc> Web interface OpenID OAuth Shibboleth (Operate) Fire-and-forget data movement Automatic fault recovery High performance No client software install Across multiple security domains (Hosted on) GridFTP servers FTP servers Other protocols: HTTP, WebDAV, SRM, … Globus Connect on local computers
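The REST interface on this slide amounts to POSTing a transfer document to the service URL. The sketch below builds such a request with Python's standard library without sending it. Only the URL and the endpoint names come from the slide; the JSON field names and the bearer-token header are illustrative assumptions standing in for the slide's `<transfer-doc>` and are not the documented API schema.

```python
import json
import urllib.request

# URL as shown on the slide (v0.10 of the hosted transfer API).
TRANSFER_URL = "https://transfer.api.globusonline.org/v0.10/transfer"

def build_transfer_request(src_endpoint, src_path, dst_endpoint, dst_path, token):
    """Build (but do not send) a POST request describing one file transfer.

    The field names in transfer_doc are hypothetical placeholders for the
    slide's <transfer-doc>; consult the service documentation for the real schema.
    """
    transfer_doc = {
        "source_endpoint": src_endpoint,
        "destination_endpoint": dst_endpoint,
        "items": [{"source_path": src_path, "destination_path": dst_path}],
    }
    return urllib.request.Request(
        TRANSFER_URL,
        data=json.dumps(transfer_doc).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,  # hypothetical auth scheme
        },
    )

# Same endpoints and paths as the command-line example on the slide.
req = build_transfer_request("alcf#dtn", "/myfile", "nersc#dtn", "/myfile", "EXAMPLE-TOKEN")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending the request would be a call to `urllib.request.urlopen(req)`; it is left out here so the sketch runs without network access or credentials.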

  26. Statistics and user feedback • Launched November 2010 >3500 users registered >2500 TB user data moved >130 million user files moved >300 endpoints registered • Widely used on TeraGrid/XSEDE; other centers & facilities; internationally • >20x faster than SCP • Comparable to hand-tuned “Last time I needed to fetch 100,000 files from NERSC, a graduate student babysat the process for a month.” “I expected to spend four weeks writing code to manage my data transfers; with Globus Online, I was up and running in five minutes.” “Transferred my data in 20 minutes instead of 61 hours. Makes these global climate simulations manageable.”

  27. Common research data management steps • Dark Energy Survey • Galaxy genomics • LIGO observatory • SBGrid structural biology consortium • NCAR climate data applications • Land use change; economics

  28. Towards “research IT as a service”

  29. Research data management as a service • GO-Store • Access to campus, cloud, XSEDE storage • GO-Catalog • On-demand metadata catalogs • GO-Compute • Access to computers • GO-Galaxy • Share, create, run workflows Today Prototype • GO-User • Credentials and other profile information • GO-Transfer • Data movement • GO-Team • Group membership • GO-Collaborate • Connect to collaborative tools: Jira, Confluence, … Beta

  30. Collaboration Management

  31. SaaS services in action: The XSEDE vision XUAS

  32. Other innovative science SaaS projects

  33. Other innovative science SaaS projects

  34. Other innovative science SaaS projects

  35. Other innovative science SaaS projects

  36. SaaS economics: A quick tutorial • Lower per-user cost (x10 or more?) via aggregation onto common infrastructure • Initial “cost trough” due to fixed costs • Per-user revenue permits positive return to scale • Further reduce per-user cost over time $ 0 Time Lower per-user costs suggest new approaches to sustainability
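The "cost trough" and the later positive return can be made concrete with a toy model. This is a sketch under invented numbers: the fixed monthly cost, per-user cost, per-user revenue, and linear user growth below are all hypothetical and do not come from the talk.

```python
def cumulative_net(fixed_cost_per_month, cost_per_user, revenue_per_user, users_by_month):
    """Return the running net position (revenue minus cost), month by month."""
    net, history = 0.0, []
    for users in users_by_month:
        net += users * (revenue_per_user - cost_per_user) - fixed_cost_per_month
        history.append(net)
    return history

# Hypothetical inputs: 100 new users per month for two years,
# 10,000 fixed cost per month, 2 cost and 12 revenue per user per month.
users = [100 * month for month in range(1, 25)]
history = cumulative_net(10_000, 2.0, 12.0, users)

trough_month = history.index(min(history)) + 1
breakeven_month = next(m for m, v in enumerate(history, start=1) if v >= 0)
print(f"deepest point of the trough: month {trough_month} ({min(history):,.0f})")
print(f"cumulative break-even: month {breakeven_month}")
```

With these made-up inputs the fixed costs dominate early, the cumulative position bottoms out in month 9, and the service recovers its losses by month 19. Changing the per-user margin or the growth rate moves both points, which is the sensitivity the slide's sketch graph is pointing at.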

  37. A 21st C science IT infrastructure strategy • To provide more capability for more people at less cost … • Create infrastructure • Robust and universal • Economies of scale • Positive returns to scale • Via the creative use of • Aggregation (“cloud”) • Federation (“grid”) [Figure: small and medium laboratories (L) and projects (P) with an “aaS” service layer: research data management; collaboration, computation; research administration]

  38. Acknowledgments • Colleagues at UChicago and Argonne • Steve Tuecke, Ravi Madduri, Kyle Chard, Tanu Malik, Rachana Ananthakrishnan, Raj Kettimuthu, and others listed at www.globusonline.org/about/goteam/ • Carl Kesselman and other colleagues at other institutions • Participants in the recent ICiS workshop on “Human-Computer Symbiosis: 50 Years On” • NSF OCI and MPS; DOE ASCR; NIH for support

  39. For more information • www.globusonline.org; Twitter: @globusonline • Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011. • Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K. and Tuecke, S. Software as a Service for Data Scientists. Communications of the ACM, Feb, 2012.

  40. Thank you! foster@uchicago.edu www.globusonline.org Twitter: @globusonline, @ianfoster
