PDS Data Movement and Storage Planning (PMWG)
PDS MC F2F, UCLA
Dan Crichton
November 28-29, 2012
Growth of Planetary Data Archived from U.S. Solar System Research
Yes, size matters, but so does complexity…
Big Data Challenges
• Storage
• Computation
• Movement of Data
• Heterogeneity
• Distribution
…can affect how we generate, manage, and analyze science data.
…commodity computing can help, if architected correctly.
Architecting PDS: Towards a Decoupled Architecture
[Architecture diagram: data providers feed ingest/transform into core PDS data management, which drives transform/distribution out to users; data movement, computation, storage, and heterogeneous data surround the core.]
• Improve efficiency and support to deliver high-quality science products to PDS
• Preserve and ensure the stability and integrity of PDS data
• Improve user support and usability of the data in the archive
Big Data Challenges
• Storage
• Computation
• Movement of Data
• Heterogeneity
• Distribution
…can affect how we generate, manage, and analyze science data.
Storage Eye Chart
• Direct Attached Storage (DAS)
  • DAS-based storage (usually disk or tape) is directly attached to an internal server (point-to-point).
• Network Attached Storage (NAS)
  • A NAS unit or "appliance" is a dedicated storage server connected to an Ethernet network that provides file-based data storage services to other devices on the network. NAS units remove the responsibility of file serving from other servers on the network.
• Storage Area Network (SAN)
  • A SAN is an architecture for connecting detached storage devices, such as disk arrays, tape libraries, and optical jukeboxes, to servers in a way that the devices appear as local resources.
• Redundant Array of Inexpensive Disks (RAID)
  • The concept of RAID is to combine multiple inexpensive disk drives into an array that (usually) performs better than a single disk drive. The RAID array appears as a single drive to the connected server. RAID technology is typically employed within a DAS, NAS, or SAN solution.
• Cloud Storage
  • Cloud storage involves storage capacity that is accessed through the Internet or a wide area network (WAN); capacity is usually purchased on an as-needed basis, and users can expand it on the fly. Providers operate a highly scalable storage infrastructure, often in physically dispersed locations.
• Solid State Drive Storage
  • Solid state drive (SSD) storage technology is evolving to a point where SSDs can, in some cases, start to supplant traditional storage. SSDs that use DRAM-based technology (volatile memory) cannot survive a power loss, but flash-based SSDs (non-volatile), although slower than DRAM-based SSDs, do not require a battery backup and are therefore acceptable in the enterprise. 1 TB SSDs have recently been announced for industrial applications such as military and medical uses. SSD technology is rapidly evolving and in the near future will be a major contender in the storage arena.
Storage Architectural Concepts
• Decentralized
  • In-house storage, locally attached
  • Resources managed (procured, backed up, maintained, replenished) locally
• Centralized
  • Common storage at a central, remote facility
  • But not necessarily separation of data, catalog, and services
• Cloud
  • Storage as a virtual cloud infrastructure resourced over the Internet
  • Resources managed by a third-party organization
Cloud Deployment Models
• Public Cloud
  • Cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services (e.g., Amazon, RackSpace, Nirvanix)
  • Applications are typically "multi-tenant" and physical infrastructure is shared
• Private Cloud
  • Cloud infrastructure is operated solely for an organization. It may be managed by the organization or on its behalf by a third party, and may exist on premise or at a provider's hosting center. It may be built using cloud software (e.g., Eucalyptus)
• Hybrid Cloud
  • The organization provides and manages some resources in-house and has others provided externally
  • Possibility to leverage existing and future technologies at minimal cost (e.g., backup/archive data managed externally, operational data managed internally)
Photo credit: AcuteSys
Many Benefits of Cloud Computing
• Broad network access: accessible from anywhere
• Resource pooling: shared pool of configurable computing resources; reliability through replicas, etc.
• Rapid elasticity: scale storage and services/cores when needed
• Measured service: utility computing, pay by the drink, rapidly provisioned
Challenges of Cloud Storage
• Data integrity
• Ownership (local control, etc.)
• Security
• ITAR
• Data movement to/from the cloud
• Procurement
• Cost arrangements
The Planetary Cloud Experiment
• Utility to PDS
  • How does it fit the PDS4 architecture?
  • APIs
  • Decoupled storage and services (see the sketch below)
  • Data movement challenges?
• Cloud storage tested as a secondary storage option
  • iRODS @ SDSC, Amazon (S3), Nirvanix
IEEE Pro, Sept/Oct 2010
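To make "decoupled storage and services" concrete, here is a minimal sketch of treating cloud object storage as an independent storage service behind a simple put/get API. The bucket and key names are hypothetical, and boto3 post-dates the experiment (which used the providers' own APIs); this only illustrates the pattern.

```python
# Minimal sketch: cloud storage as a decoupled "storage service" for a secondary copy.
# Bucket/key names are illustrative, not from the PDS experiment.
import boto3

s3 = boto3.client("s3")
BUCKET = "pds-secondary-archive"  # hypothetical bucket

# Write a secondary copy of an archived product
s3.upload_file("mer_pancam_0123.img", BUCKET, "mer/pancam/mer_pancam_0123.img")

# Any service can later retrieve it through the same API, independent of where
# the primary copy lives
s3.download_file(BUCKET, "mer/pancam/mer_pancam_0123.img", "/tmp/mer_pancam_0123.img")
```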
Results of Study
• Moving massive amounts of data "online" is a limiting factor… more to come
• Varying cost scenarios (target < $500/TB/year)
• Proprietary APIs (but some open source cloud implementations are gaining steam)
• But entirely feasible as a decoupled "storage service" in PDS4
• A low-risk option is to explore the cloud as an operational, secondary copy and access point for planetary data
Providers evaluated: Nirvanix, iRODS @ SDSC, Amazon
MER Planning on the Cloud
* Credit: Khawaja Shams
Daily MER Planning: Backup to the Cloud*
[Workflow diagram: daily Mars data is archived, compressed, and encrypted in memory, then sent to S3 via parallel uploads; Polyphony schedules backups for each of the last 5 days.]
* Credit: Khawaja Shams, George Chang
MER Planning: Data Integrity on the Cloud
[Workflow diagram: backups are downloaded from S3 and verified against the local data; if a downloaded backup does not match, Polyphony immediately schedules another backup of the inconsistent data.]
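The integrity step above amounts to downloading the backup, comparing it against the local data, and re-queuing anything inconsistent. The following is a hedged sketch of that check; the bucket, key, and the Polyphony re-backup hook are hypothetical stand-ins.

```python
# Sketch of the integrity check: download the backup, compare a digest against
# the local data, and flag the item for re-backup on mismatch (names are illustrative).
import hashlib
import boto3

s3 = boto3.client("s3")

def verify_backup(bucket: str, key: str, local_path: str) -> bool:
    remote = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    with open(local_path, "rb") as f:
        local = f.read()
    return hashlib.sha256(remote).hexdigest() == hashlib.sha256(local).hexdigest()

if not verify_backup("mer-planning-backups", "2012-11-28/plan.db", "plan.db"):
    # in the MER workflow, Polyphony would immediately schedule another backup here
    print("Backup inconsistent; rescheduling")
```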
Big Data Challenges
• Storage
• Computation
• Movement of Data
• Heterogeneity
• Distribution
…can affect how we generate, manage, and analyze science data.
Cloud Computing and Computation
• On-demand computation (scaling to a massive number of cores)
  • Amazon EC2, one of the most popular (a launch sketch follows this list)
  • Commoditizing super-computing
• Again, architecting systems so that "processing" and "computation" are decoupled and can be executed on the cloud is key… two examples
  • LMMP example (to come)
  • Airborne data processing (to come)
• Coupled with computational frameworks (e.g., Apache Hadoop)
  • Open source implementation of Map-Reduce
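As a minimal sketch of what "on-demand computation" looks like in practice, the snippet below launches a batch of EC2 workers. It uses boto3 (which post-dates this 2012 talk), and the AMI ID and instance type are hypothetical placeholders for an image with the processing stack installed.

```python
# Hedged sketch of on-demand scaling: launch 20 worker instances for a processing run.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-12345678",   # hypothetical AMI with the processing stack pre-installed
    InstanceType="m1.large",  # era-appropriate instance type, illustrative only
    MinCount=1,
    MaxCount=20,              # scale out to 20 workers for a large job
)
instance_ids = [i["InstanceId"] for i in response["Instances"]]
print(f"Launched {len(instance_ids)} worker instances")
```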
Lunar Mapping and Modeling Project: Big Data Challenges*
• The image files LMMP manages range from a few gigabytes to hundreds of gigabytes in size, with new data arriving every day
• Lunar surface images are too large to load and manipulate efficiently in memory
• LMMP must make the data readily available in a timely manner for users to view and analyze
• LMMP needs to accommodate large numbers of users with minimal latency
* Credit: Emily Law, George Chang
Cloud Computing Solutions with Map-Reduce
• Slice a large image into many small tiles, then repeatedly merge and resize them until the final merge-and-reduce pass yields a reasonably sized image that depicts the entire scene (see the sketch after this list)
• Amazon EC2 for computing; S3 for storage
• Installed the Hadoop framework on a number of EC2 instances
• Used a distributed approach with Elastic Map-Reduce in Hadoop to tile images
• Developed a hybrid solution (multi-tiered data access approach) to serve images to users from cloud storage
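The slide describes the merge-and-resize pyramid only at a high level, so the following is a minimal Python sketch (not the LMMP code) of one pass expressed as a map/reduce pair: the mapper keys each child tile by the parent tile it rolls up to, and the reducer mosaics up to four children and halves the result. The TILE size, file naming, and the local run_level driver are assumptions; a real deployment would run the two steps under Hadoop/Elastic MapReduce rather than in-process.

```python
# Hedged sketch of one merge/resize pass in the tiling pyramid (illustrative only).
from collections import defaultdict
from PIL import Image

TILE = 256  # assumed tile edge length in pixels

def map_tile(zoom, x, y, path):
    """Map step: emit (parent_key, child_record) for one source tile."""
    return ((zoom - 1, x // 2, y // 2), (x, y, path))

def reduce_parent(parent_key, children):
    """Reduce step: mosaic child tiles onto a 2x2 canvas, then downsample to one tile."""
    zoom, px, py = parent_key
    canvas = Image.new("RGB", (2 * TILE, 2 * TILE))
    for x, y, path in children:
        # place each child in its quadrant of the parent mosaic
        canvas.paste(Image.open(path), ((x % 2) * TILE, (y % 2) * TILE))
    out = f"tile_z{zoom}_x{px}_y{py}.png"
    canvas.resize((TILE, TILE)).save(out)
    return out

def run_level(tiles):
    """Local stand-in for one Elastic MapReduce pass over (zoom, x, y, path) records."""
    groups = defaultdict(list)
    for zoom, x, y, path in tiles:
        key, value = map_tile(zoom, x, y, path)
        groups[key].append(value)
    return [(z, x, y, reduce_parent((z, x, y), vals)) for (z, x, y), vals in groups.items()]
```

Repeating run_level on its own output walks the pyramid one zoom level at a time until a single overview image remains.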
LMMP Tiling Test Results (Cloud vs. Local)
• Configuration 1 (local)
  • 2x Sun Fire 4170
  • Gigabit network interconnects
  • 72 GB RAM
  • 64 GB SSD storage
  • $10K each, plus administration and infrastructure costs
• Configuration 2 (cloud)
  • 20 EC2 Large instances (4 Compute Units ~ 4x 1 GHz Xeon)
  • 7.5 GB RAM
  • 850 GB storage
  • $0.34/instance/hour
• Configuration 3 (cloud)
  • 4 EC2 Cluster Compute instances (33.5 Compute Units)
  • Gigabit interconnects
  • 23 GB RAM
  • 1.69 TB storage
  • $1.60/instance/hour
A back-of-the-envelope cost comparison of these configurations follows.
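As a worked example of the pricing difference, the snippet below assumes a hypothetical 10-hour tiling run (the job duration is not taken from the LMMP benchmarks) and applies the per-hour rates above.

```python
# Hypothetical cost comparison for one tiling run; only the hourly rates come from the slide.
HOURS = 10  # assumed run length, not an LMMP-reported figure

config2 = 20 * HOURS * 0.34  # 20 EC2 Large instances at $0.34/instance/hour
config3 = 4 * HOURS * 1.60   # 4 EC2 Cluster Compute instances at $1.60/instance/hour

print(f"Config 2: ${config2:.2f} per run")  # $68.00
print(f"Config 3: ${config3:.2f} per run")  # $64.00
# Config 1 is a fixed ~$20K capital cost (2 x $10K servers) plus administration/infrastructure.
```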
Cloud Computing: Addressing Challenges
• Cloud has shown very promising results, but there are challenges
  • Proprietary APIs
  • Support for ITAR-sensitive data
  • Data transfer rates to the commercial cloud
  • Firewall issues
  • Procurement
  • Costs for long-term storage
• More work ahead
  • Amazon EC2/S3 have reported that an "ITAR Region" is available
  • Continued benchmarking and optimization has demonstrated increased data transfer rates, particularly using Internet2
  • JPL is developing a "Virtual Private Cloud" connection to Amazon, causing EC2 nodes to appear inside the JPL firewall
  • Improved procurement process to allow JPL projects to use AWS
Big Data Challenges
• Storage
• Computation
• Movement of Data
• Heterogeneity
• Distribution
…can affect how we generate, manage, and analyze science data.
The Planetary Data Movement Experiment
• Online data movement has been a limiting factor for embracing big data technologies
• Conducted in 2006*, 2009, and 2012
• Evaluate trade-offs for moving data
  • to PDS
  • between Nodes
  • to NSSDC/deep archive
  • to the Cloud
* C. Mattmann, S. Kelly, D. Crichton, J. S. Hughes, S. Hardman, R. Joyner and P. Ramirez. A Classification and Evaluation of Data Movement Technologies for the Delivery of Highly Voluminous Scientific Data Products. In Proceedings of the NASA/IEEE Conference on Mass Storage Systems and Technologies (MSST2006), pp. 131-135, College Park, Maryland, May 15-18, 2006.
Data Xfer Technologies Evaluated
• FTP uses a single connection for transferring files; in general it is ubiquitous and, where possible, the simplest way for PDS to transfer data electronically
• bbFTP uses multiple threads/connections to improve data transfer (the multi-connection idea is sketched after this list). It works well as long as the number of connections is kept to a reasonable limit
• GridFTP uses multiple threads/connections. It is part of the Globus project and is used by the climate research community to move models. In general, tests have shown that it is more difficult to set up due to its security infrastructure, etc.
• iRODS uses multiple threads/connections to improve data transfer. It works well as long as the number of connections is kept to a reasonable limit
• FDT uses multiple threads/connections to improve data transfer. It works well as long as the number of connections is kept to a reasonable limit
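The common thread among bbFTP, GridFTP, iRODS, and FDT is splitting one large transfer across several parallel connections. The sketch below illustrates that idea generically with HTTP range requests; the URL, thread count, and the assumption that the server honors Range headers are all illustrative, not part of the PDS benchmarks.

```python
# Minimal sketch of multi-connection transfer: fetch byte ranges of one large file in parallel.
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "https://example.org/archive/large_product.img"  # hypothetical source
THREADS = 8

size = int(requests.head(URL).headers["Content-Length"])
chunk = size // THREADS

def fetch(i):
    # each worker pulls one contiguous byte range over its own connection
    start = i * chunk
    end = size - 1 if i == THREADS - 1 else start + chunk - 1
    r = requests.get(URL, headers={"Range": f"bytes={start}-{end}"})
    return start, r.content

with ThreadPoolExecutor(max_workers=THREADS) as pool, open("large_product.img", "wb") as out:
    for start, data in pool.map(fetch, range(THREADS)):
        out.seek(start)
        out.write(data)
```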
Some of our Findings
Data movement over the WAN using TCP/IP
• Transfer speeds among the nodes differ greatly; however, the fundamental findings about how best to transfer data in each scenario are consistent
• Parallel transfer mechanisms show improvement over conventional transfer mechanisms (FTP, socket-to-socket) for files larger than ~10 MB
• Packaging/bundling small files helps achieve significantly better transfer performance with parallel data transfer (see the bundling sketch below)
• Reliability has improved over the past five years in many of the products we have tested
• However, UDP approaches have suffered, largely because more aggressive network infrastructure sees them as distributed denial-of-service (DDoS) attacks
[Chart: transfer rate (Y axis) versus file size (X axis); GridFTP in blue, bbFTP in red, FTP in green]
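The bundling finding boils down to packing many small products into one large archive so the WAN sees a single big file. A minimal sketch of that step, with hypothetical paths, follows.

```python
# Sketch of the "bundle small files before transfer" finding: pack a directory of small
# products into a single compressed tar so one large file moves over the WAN.
import tarfile
from pathlib import Path

def bundle(directory: str, archive: str = "bundle.tar.gz") -> str:
    with tarfile.open(archive, "w:gz") as tar:
        for path in Path(directory).rglob("*"):
            if path.is_file():
                tar.add(path, arcname=path.relative_to(directory))
    return archive

# e.g. bundle("/data/small_products"), then move bundle.tar.gz with the parallel tool of choice
```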
Pilot with DNs (Big Data)
• iRODS has proven the most promising for data transfer
• Setting up an iRODS infrastructure for data movement with 3 zones (GEO, USGS, JPL/IMG) as a pilot; a hedged usage sketch follows
• Run alongside other mechanisms
• Expand to other nodes if this proves successful
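For orientation only, here is a sketch of pushing a product into an iRODS zone using the python-irodsclient library. This library post-dates the pilot (which would more likely use the native iRODS icommands), and the host, credentials, zone name, and paths are hypothetical.

```python
# Hedged sketch of writing a product into an iRODS zone (names and paths are illustrative).
from irods.session import iRODSSession

with iRODSSession(host="irods.geo.example.org", port=1247,
                  user="pds_ingest", password="changeme", zone="GEO") as session:
    # put a local product into the GEO zone's collection for this pilot
    session.data_objects.put("mer_pancam_0123.img",
                             "/GEO/home/pds/pilot/mer_pancam_0123.img")
```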
Recommendations
• Data Movement
  • PMWG will update its current data movement recommendations based on these results
  • Run the current data movement deployment in parallel with FTP and other mechanisms as a pilot
  • Consider adding another "zone" at NSSDC for electronic data transfers
  • Capture updated benchmarks for Flagstaff after the network upgrade
  • Other DNs can address this when they hit the larger thresholds
• Data Storage
  • We now have quite a bit of experience with cloud computing, etc. to comment on
  • Focus on requirements for data storage (e.g., a storage service) once other development activities are under control
• Computation
  • The new PDS4 architecture allows us to run computationally intensive services in many different topologies. Explore as needed.