Tera/Petabyte data distribution architectures

Chris A. Mattmann, USC-CSE Annual Research Review, Sunday, October 26, 2014.

Presentation Transcript


  1. Tera/Petabyte data distribution architectures Chris A. Mattmann USC-CSE Annual Research Review Sunday, October 26, 2014

  2. Outline • Research Problem and Importance • Background and Related Work • Problem Statement • Approach • Evaluation Strategy • Conclusions MATTMANN-ARR

  3. Research Problem and Importance • Volume of data returned from scientific experiments and media content providers is growing rapidly • Planetary Data System • Current: 20 terabytes for all NASA missions • Growing to: over 200 terabytes from a single mission! • Orbiting Carbon Observatory • Current: hundreds of gigabytes to a single terabyte • Growing to: over 150 terabytes! * Projected as of 1/11/04

  4. Research Problem and Importance • National Cancer Institute’s Early Detection Research Network (EDRN) • Current: tens of gigabytes to hundreds of gigabytes • Growing to: hundreds of gigabytes to terabytes • Question: how to distribute these voluminous data sets?

  5. Distributing Large Volumes of Data • Use existing infrastructure? • HTTP/REST? • Issues: • Scalability? • Single entry point? • Limited bandwidth? • What about other distribution mechanisms? • RMI • SOAP • GridFTP

  6. Distributing Large Volumes of Data • Few data movement mechanisms are in place for scientists, students, educators, etc. to get their data • EDRN: HTTP/REST • National Space Science Data Archive: FTP • Physical Oceanography Data Active Archive Center: FTP and Aspera commercial UDP technology • Even Google: HTTP/REST, SOAP • Even when there are many mechanisms in place, how do we select the correct one? • Sometimes, we may even need to use them in concert • Certain users may only be able to get data from GridFTP, while others may require HTTP/REST • HTTP combined with a UDP-based mechanism may speed up the transfer

  7. Distributing Large Volumes of Data • Understanding the Tradeoffs • HTTP/REST isn’t all bad: it’s pervasive, it’s ubiquitous, it’s a standard • It’s good in many situations, but not all situations • Same goes for many of the other distribution mechanisms • RMI is scalable, but ties you to Java • Peer-to-peer is highly scalable and efficient, but may neglect dependability and consistency • Understanding how many different data movement technologies there are: • GridFTP, Aspera software, HTTP/REST, RMI, CORBA, SOAP, XML-RPC, BitTorrent, JXTA, UFTP, FTP, SFTP, SCP, Siena, GLIDE/PRISM-MW • …and that’s just off the top of my head! • Understanding the classes of data movement technologies

  8. Software Architecture • The definition of a system in the form of its canonical building blocks • Software Components: the computational units in the system • Software Connectors: the communications and interactions between software components • Software Configurations: arrangements of components and connectors and the rules that guide their composition

  9. A Software Architectural View of the Data Distribution Problem • …Understanding the architectures of existing data systems

  10. A Software Architectural View of the Data Distribution Problem • …Deciding the appropriate software connectors for data distribution (and their combinations) to use

  11. A Software Architectural View of the Data Distribution Problem • …Satisfying specified user scenarios for data distribution

  12. A Software Architectural View of the Data Distribution Problem • …Making these people happy!

  13. Research Question • What types of software connectors are best suited to deliver these huge amounts of data to users, in a way that satisfies their particular scenarios and is performant and scalable, in these highly distributed data systems?

  14. Problem Statement • Identifying and selecting suitable software connectors for data distribution* that satisfy user-specified constraints • Use eight key dimensions of data distribution, drawn from: • Literature review • Our own experience in the context of planetary science and cancer research at JPL • A set of user-specified constraints on the eight dimensions constitutes a data distribution scenario • Identification of four basic distribution connector classes • RPC, P2P, Grid, Event-based • What classes are appropriate for which distribution scenarios? * Referred to as “distribution connectors” or “data distribution connectors”
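
A minimal sketch of how the four connector classes named on this slide might be represented in code. The mapping of individual technologies to classes is an illustrative assumption based on how these technologies are commonly described, not the presentation's own classification table; the names `ConnectorClass`, `TECHNOLOGY_CLASS`, and `technologies_in` are hypothetical.

```python
from enum import Enum


class ConnectorClass(Enum):
    """The four basic distribution connector classes from the slide."""
    RPC = "RPC"
    P2P = "P2P"
    GRID = "Grid"
    EVENT_BASED = "Event-based"


# Assumed, illustrative grouping of technologies mentioned in the talk.
# Technologies whose class is less clear-cut are deliberately left out.
TECHNOLOGY_CLASS = {
    "RMI": ConnectorClass.RPC,
    "CORBA": ConnectorClass.RPC,
    "SOAP": ConnectorClass.RPC,
    "XML-RPC": ConnectorClass.RPC,
    "BitTorrent": ConnectorClass.P2P,
    "JXTA": ConnectorClass.P2P,
    "GridFTP": ConnectorClass.GRID,
    "Siena": ConnectorClass.EVENT_BASED,
}


def technologies_in(connector_class):
    """Return the technology names assigned to one connector class, sorted."""
    return sorted(
        name for name, cls in TECHNOLOGY_CLASS.items() if cls is connector_class
    )
```

Keeping the classification in a plain lookup table makes it easy to revise as technologies are re-examined against the eight dimensions.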

  15. Eight Dimensions of Data Distribution

  16. Eight Dimensions of Data Distribution • Total Volume - the total amount of data that needs to be transferred from providers of data to consumers of data. • Number of Delivery Intervals - the number, size and frequency (timing) of the intervals within which the volume of data should be delivered. • Performance Requirements - any constraints and requirements on the scalability, efficiency, consistency, and dependability of the distribution scenario. • Number of Users - the number of unique users that the data volume needs to be delivered to. • Number of User Types - the number of unique user types, such as scientists or students, that the data volume needs to be delivered to. • Data Types - the number of different data types that are part of the total volume to be delivered. • Geographic Distribution - the geographic distribution of the data providers and consumers. • Access Policies - the number and types of access policies in place at each producer and consumer of data.
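
The eight dimensions above can be read as the fields of a scenario record. The following is a rough sketch under that reading; the class name `DistributionScenario`, the field names, the field types, and the helper `volume_per_interval` are all assumptions made for illustration, not part of the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class DistributionScenario:
    """One data distribution scenario: a constraint on each of the eight dimensions."""
    total_volume_bytes: int            # Total Volume
    delivery_intervals: int            # Number of Delivery Intervals
    num_users: int                     # Number of Users
    num_user_types: int                # Number of User Types
    num_data_types: int                # Data Types
    geographic_distribution: str       # Geographic Distribution, e.g. "LAN" or "WAN"
    # Performance Requirements, e.g. {"consistency": "high"} (hypothetical encoding)
    performance_requirements: dict = field(default_factory=dict)
    # Access Policies at producers and consumers (hypothetical encoding)
    access_policies: list = field(default_factory=list)

    def __post_init__(self):
        if self.delivery_intervals < 1:
            raise ValueError("a scenario needs at least one delivery interval")

    def volume_per_interval(self):
        """Average number of bytes that must be delivered in each interval."""
        return self.total_volume_bytes / self.delivery_intervals


# Hypothetical scenario: 150 terabytes delivered over 10 intervals on a WAN.
scenario = DistributionScenario(
    total_volume_bytes=150 * 10**12,
    delivery_intervals=10,
    num_users=100,
    num_user_types=2,
    num_data_types=3,
    geographic_distribution="WAN",
)
```

Only the 150-terabyte figure echoes a number from the talk; the remaining values are placeholders.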

  17. Approach • Classification • Categorization • Integration • Testing/Evaluation

  18. Evaluation Strategy • Empirical evaluation using real-world systems • NASA Planetary Data System • NASA Orbiting Carbon Observatory Mission • National Cancer Institute’s Early Detection Research Network • Quantifiably measure • consistency (data delivered is data sent) • efficiency (memory footprint and data throughput) • scalability (data volume and number of hosts) • dependability (uptime, number of faults) • Compare to off-the-shelf connector solutions • OODT, GridFTP, Aspera, UFTP, BitTorrent, possibly more
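
Two of the measures listed on this slide lend themselves to a short sketch: consistency ("data delivered is data sent") can be checked by comparing digests, and throughput is bytes moved per unit time. The function names below are hypothetical and the choice of SHA-256 is an assumption; the presentation does not say how these measures were actually computed.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex digest used as a fingerprint of a payload."""
    return hashlib.sha256(data).hexdigest()


def is_consistent(sent: bytes, received: bytes) -> bool:
    """Consistency check: the delivered payload matches the sent payload."""
    return sha256_of(sent) == sha256_of(received)


def throughput_mb_per_s(num_bytes: int, seconds: float) -> float:
    """Data throughput in megabytes (10**6 bytes) per second."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return num_bytes / 1_000_000 / seconds
```

For example, `throughput_mb_per_s(50_000_000, 10.0)` gives 5.0, and `is_consistent(b"abc", b"abd")` is false.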

  19. Current Progress • Preliminary Study with NASA’s Planetary Data System • Classified and Compared Data Movement Technologies • Parallel TCP/IP technologies • GridFTP, bbFTP • UDP bursting technologies • Aspera, UFTP • Baseline technologies • SCP, FTP, HTTP
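
Parallel TCP/IP tools such as GridFTP and bbFTP move one file over several streams at once. The sketch below shows only the underlying partitioning idea, splitting a file into contiguous byte ranges with one range per stream; it is not the wire protocol of either tool, and the function name `split_into_ranges` is hypothetical.

```python
def split_into_ranges(total_bytes, num_streams):
    """Split [0, total_bytes) into up to num_streams contiguous (start, end) ranges.

    Ranges are half-open, differ in length by at most one byte, and empty
    ranges are dropped when there are more streams than bytes.
    """
    if total_bytes < 0 or num_streams < 1:
        raise ValueError("total_bytes must be >= 0 and num_streams >= 1")
    base, extra = divmod(total_bytes, num_streams)
    ranges = []
    start = 0
    for i in range(num_streams):
        # The first `extra` streams each carry one additional byte.
        length = base + (1 if i < extra else 0)
        if length == 0:
            continue
        ranges.append((start, start + length))
        start += length
    return ranges
```

Each range could then be handed to its own connection, with the receiver writing every range at its offset in the destination file.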

  20. Experimental Results • Classified and evaluated each technology against the data distribution dimensions • Measured transfer rate • LAN-based • WAN-based • Varied dataset sizes from 10s of MBs to 10s of GBs • Ease of operation, ease of installation • UDP technologies not testable on WAN (firewall, security, ease of configuration) * GridFTP (blue), bbFTP (red), FTP (green)

  21. Conclusions • Proposed an approach for classifying, selecting and evaluating different software connectors for data distribution • Preliminary results suggest parallel TCP/IP technologies are beneficial in a real-world system (PDS) • Currently formalizing connector metadata and developing connector XML profiles

  22. Questions? • Thanks for your attention!

  23. Backup

  24. Refereed Papers • C. Mattmann, S. Kelly, D. Crichton, S. Hughes, S. Hardman, P. Ramirez and R. Joyner. A Classification and Evaluation of Data Movement Technologies for the Delivery of Highly Voluminous Scientific Data Products. In Proceedings of NASA/IEEE Conference on Mass Storage Systems and Technologies, May 2006. • C. Mattmann, D. Crichton, N. Medvidovic and S. Hughes. A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications. In Proceedings of ICSE, Shanghai, China, May 20th-28th, 2006. • N. Medvidovic and C. Mattmann. The GridLite DREAM: Bringing the Grid to Your Pocket. In Proceedings of the Monterey Workshop on Networked Systems, Irvine, CA, September, 2005. • C. Mattmann, N. Medvidovic, P. Ramirez and V. Jakobac. Unlocking the Grid. In Proceedings of the 8th ACM SIGSOFT International Symposium on Component-based Software Engineering (CBSE8), pp. 322-336. LNCS 3489, St. Louis, Missouri, May 14th-15th, 2005. • C. Mattmann, S. Malek, N. Beckman, M. Mikic-Rakic, N. Medvidovic and D. Crichton. GLIDE: A Grid-based, Lightweight, Infrastructure for Data-intensive Environments. In Proceedings of the European Grid Conference (EGC2005), pp. 68-77. LNCS 3470, Amsterdam, The Netherlands, February 14-16, 2005.

  25. Refereed Papers • J. Steven Hughes, D. Crichton, S. Kelly, C. Mattmann, R. Joyner, J. Wilf and J. Crichton. A Planetary Data System for the 2006 Mars Reconnaissance Orbiter Era and Beyond. In Proceedings of the 2nd ESA Symposium on Ensuring the Long Term Preservation and Adding Value to Scientific and Technical Data (PV-2004). Frascati, Italy, October 5-7, 2004. • C. Mattmann, D. Crichton, J.S. Hughes, S. Kelly and P. Ramirez. Software Architecture for Large-scale, Distributed, Data-Intensive Systems. In Proceedings of the 4th IEEE/IFIP Working Conference on Software Architecture (WICSA-4), pp. 255-264. Oslo, Norway, June 12th-15th, 2004. • C. Mattmann, P. Ramirez, D. Crichton and J.S. Hughes. Packaging Data Products using Data Grid Middleware for Deep Space Mission Systems. In Proceedings of the 8th International Conference on Space Operations (Spaceops-2004), AIAA Press. Montreal, Canada, May 2004.
