A BigData Tour – HDFS, Ceph and MapReduce
These slides are possible thanks to these sources: Jonathan Dursi – SciNet Toronto, Hadoop Tutorial; Amir Payberah – Course in Data Intensive Computing, SICS; Yahoo! Developer Network – MapReduce Tutorial
Extra Material: Ceph – A HDFS Replacement
What is Ceph?
• Ceph is a distributed, highly available, unified object, block and file storage system with no single point of failure (SPOF), running on commodity hardware
Ceph Architecture – Host Level
• At the host level…
• We have Object Storage Devices (OSDs) and Monitors
• Monitors keep track of the components of the Ceph cluster (i.e. where the OSDs are)
• The placement hierarchy – device, host, rack, row and room – is stored by the Monitors and used to compute failure domains
• OSDs store the Ceph data objects
• A host can run multiple OSDs, but it needs to be appropriately provisioned
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – Block Level
• At the block device level...
• An Object Storage Device (OSD) can be an entire drive, a partition, or a folder
• OSDs must be formatted as ext4, XFS, or btrfs (experimental)
https://hkg15.pathable.com/static/attachments/112267/1423597913.pdf?1423597913
Ceph Architecture – Data Organization Level
• At the data organization level…
• Data are partitioned into pools
• Pools contain a number of Placement Groups (PGs)
• Ceph data objects map to PGs (by taking a hash of the object name modulo the number of PGs in the pool – see the sketch below)
• PGs then map to multiple OSDs
https://hkg15.pathable.com/static/attachments/112267/1423597913.pdf?1423597913
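A minimal Python sketch of the object-to-PG step, assuming nothing beyond the standard library. Ceph itself uses a Jenkins-style hash of the object name plus a "stable mod" against the pool's PG count; md5 and the pool/object names here are only for illustration:

```python
import hashlib

def object_to_pg(pool_id: int, object_name: str, pg_num: int) -> str:
    """Map an object to a placement group: hash the object name, then take
    the hash modulo the pool's PG count. Ceph PG ids look like <pool>.<pg>."""
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return f"{pool_id}.{h % pg_num:x}"

# Prints a PG id such as '2.1f' (the exact value depends on the hash).
print(object_to_pg(pool_id=2, object_name="nyan.png", pg_num=128))
```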
Ceph Placement Groups
• Ceph shards a pool into placement groups, distributed evenly and pseudo-randomly across the cluster
• The CRUSH algorithm dynamically assigns each object to a placement group and each placement group to a set of OSDs, creating a layer of indirection between the Ceph client and the OSDs storing the copies of an object
• This layer of indirection allows the Ceph storage cluster to re-balance dynamically when new Ceph OSDs come online or when Ceph OSDs fail
Red Hat Ceph Architecture v1.2.3
Ceph Architecture – Overall View https://www.terena.org/activities/tf-storage/ws16/slides/140210-low_cost_storage_ceph-openstack_swift.pdf
Ceph Architecture – RADOS
• An application interacts with a RADOS cluster
• RADOS (Reliable Autonomic Distributed Object Store) is a distributed object service that manages the distribution, replication, and migration of objects
• On top of that reliable storage abstraction Ceph builds a range of services, including a block storage abstraction (RBD, or RADOS Block Device) and a cache-coherent distributed file system (CephFS)
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – RADOS Components http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – Where Do Objects Live? http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – Where Do Objects Live? • Contact a Metadata server? http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – Where Do Objects Live? • Or calculate the placement via static mapping? http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – CRUSH Maps http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – CRUSH Maps
• Data objects are distributed across Object Storage Devices (OSDs), which refer to either physical or logical storage units, using CRUSH (Controlled Replication Under Scalable Hashing)
• CRUSH is a deterministic hashing function that allows administrators to define flexible placement policies over a hierarchical cluster structure (e.g., disks, hosts, racks, rows, datacenters)
• The location of objects can be calculated from the object identifier and the cluster layout (similar to consistent hashing), so there is no need for a metadata index or server in the RADOS object store
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
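CRUSH itself is a hierarchical, weighted algorithm; as a rough stand-in (not the real CRUSH), the following rendezvous-hashing sketch shows the core idea of computing a PG's OSD set purely from the PG id and the cluster membership, with little data movement when an OSD disappears. The OSD names and the PG id are made up for the example:

```python
import hashlib

def pg_to_osds(pg_id: str, osds: list[str], replicas: int = 3) -> list[str]:
    """Deterministically choose `replicas` OSDs for a PG by scoring every
    (pg, osd) pair and keeping the highest scores (rendezvous hashing).
    Real CRUSH additionally honours the device/host/rack hierarchy and
    per-device weights."""
    score = lambda osd: hashlib.sha1(f"{pg_id}:{osd}".encode()).hexdigest()
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(8)]
before = pg_to_osds("2.1f", osds)
after = pg_to_osds("2.1f", [o for o in osds if o != "osd.3"])  # osd.3 fails
print(before, after)  # only placements that included osd.3 change
```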
Ceph Architecture – CRUSH – 1/2 http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – CRUSH – 2/2 http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – librados http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
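The librados slide above is a diagram; as a hedged sketch of what the librados API looks like from Python, assuming the python-rados bindings are installed, a cluster is reachable via the default /etc/ceph/ceph.conf, and a pool named 'data' exists (the pool and object names are hypothetical):

```python
import rados

# Connect to the cluster using the default config file and keyring.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

try:
    ioctx = cluster.open_ioctx('data')            # I/O context for one pool
    ioctx.write_full('greeting', b'Hello RADOS')  # store a whole object
    print(ioctx.read('greeting'))                 # read it back
    ioctx.remove_object('greeting')               # clean up
    ioctx.close()
finally:
    cluster.shutdown()
```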
Ceph Architecture – RADOS Gateway http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – RADOS Block Device (RBD) – 1/3 http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – RADOS Block Device (RBD) – 2/3
• Virtual machine storage using RBD
• Live migration using RBD
http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph Architecture – RADOS Block Device (RBD) – 3/3 • Direct host access from Linux http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
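To complement the RBD slides, a hedged sketch of creating and writing an RBD image programmatically, assuming the python-rbd and python-rados bindings and a pool named 'rbd'; the image name 'vm-disk-1' is made up:

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                 # pool backing the block images

rbd_inst = rbd.RBD()
rbd_inst.create(ioctx, 'vm-disk-1', 4 * 1024**3)  # 4 GiB thin-provisioned image

with rbd.Image(ioctx, 'vm-disk-1') as image:
    image.write(b'boot sector bytes...', 0)       # write at byte offset 0
    print(image.size())                           # image size in bytes

ioctx.close()
cluster.shutdown()
```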
Ceph Architecture – CephFS – POSIX F/S http://konferenz-nz.dlr.de/pages/storage2014/present/2.%20Konferenztag/13_06_2014_06_Inktank.pdf
Ceph – Read/Write Flows https://software.intel.com/en-us/blogs/2015/04/06/ceph-erasure-coding-introduction
Ceph Replicated I/O Red Hat Ceph Architecture v1.2.3
Ceph – Erasure Coding – 1/5
• Erasure coding is a branch of coding theory dating back to the 1960s. The best-known algorithm is Reed-Solomon; many variations followed, such as Fountain Codes, Pyramid Codes and Locally Repairable Codes.
• An erasure code is usually defined by the total number of disks (N) and the number of data disks (K); it can tolerate N – K failures with a storage overhead of N/K
• E.g., a typical Reed-Solomon scheme is RS(8, 5): 8 total disks, of which 5 hold data chunks and the remaining 3 hold coding chunks
• RS(8, 5) can tolerate 3 arbitrary failures: if some chunks are missing, the remaining available chunks can be used to restore the original content
https://software.intel.com/en-us/blogs/2015/04/06/ceph-erasure-coding-introduction
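A quick sanity check of those numbers; the schemes follow the slide (and the NYAN example later), and the 3-way replication line is added for comparison:

```python
def erasure_profile(n_total: int, k_data: int) -> dict:
    """Failures tolerated and storage overhead for an (N, K) erasure code."""
    return {
        "tolerates_failures": n_total - k_data,
        "storage_overhead": n_total / k_data,  # raw bytes stored per user byte
    }

print(erasure_profile(8, 5))  # RS(8, 5): tolerates 3 failures, 1.6x overhead
print(erasure_profile(5, 3))  # RS(5, 3): tolerates 2 failures, ~1.67x overhead
# Contrast: 3-way replication also tolerates 2 failures, but at 3.0x overhead.
```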
Ceph – Erasure Coding – 2/5
• As in replicated pools, the primary OSD in the up set receives all write operations to an erasure-coded pool
• In replicated pools, Ceph makes a deep copy of each object in the placement group on the secondary OSD(s) in the set
• For erasure coding the process is a bit different: an erasure-coded pool stores each object as K+M chunks – K data chunks and M coding chunks. The pool is configured with a size of K+M so that each chunk is stored on one OSD in the acting set.
• The rank of each chunk is stored as an attribute of the object. The primary OSD is responsible for encoding the payload into K+M chunks and sending them to the other OSDs; it also maintains the authoritative version of the placement group logs.
https://software.intel.com/en-us/blogs/2015/04/06/ceph-erasure-coding-introduction
Ceph – Erasure Coding – 3/5
• 5 OSDs (K+M = 5); can sustain the loss of 2 (M = 2)
• The object NYAN, with data “ABCDEFGHI”, is split into K = 3 data chunks (ABC, DEF, GHI); content is padded if its length is not a multiple of K
• The two coding chunks are YXY and QGC
Red Hat Ceph Architecture v1.2.3
Ceph – Erasure Coding – 4/5
• When reading object NYAN from the erasure-coded pool, the decoding function reads any K = 3 of the five chunks
• If chunks are missing (called erasures) – up to M = 2 of them – the decoding function can still reconstruct the original content from the remaining chunks
Red Hat Ceph Architecture v1.2.3
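Reed-Solomon arithmetic is beyond a slide, but the “any K of K+M chunks suffice” idea can be miniaturised with a single XOR parity chunk (K = 3, M = 1). This toy sketch is not what Ceph's jerasure/ISA plugins do (they generalise to M lost chunks); it only splits an object, pads it to a multiple of K, and rebuilds one lost data chunk:

```python
from functools import reduce

def split(data: bytes, k: int) -> list[bytes]:
    """Split into k equal data chunks, padding with zero bytes if needed."""
    size = -(-len(data) // k)                 # ceiling division
    data = data.ljust(k * size, b'\x00')
    return [data[i * size:(i + 1) * size] for i in range(k)]

def xor_parity(chunks: list[bytes]) -> bytes:
    """Single coding chunk: byte-wise XOR across the data chunks (M = 1)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

data_chunks = split(b"ABCDEFGHI", k=3)        # [b'ABC', b'DEF', b'GHI']
parity = xor_parity(data_chunks)

# Lose chunk 2 (b'DEF'); rebuild it from the surviving chunks plus parity.
rebuilt = xor_parity([data_chunks[0], data_chunks[2], parity])
assert rebuilt == data_chunks[1]
```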