Distributed Computing at Web Scale. Kyonggi University, DBLAB, Haesung Lee
Data analysis at a large scale • Very large data collections (TB to PB) stored on distributed file systems • Query logs • Search engine indexes • Sensor data • Need efficient ways for analyzing, reformatting, and processing them • In particular, we want • Parallelization of computation (benefiting from the processing power of all nodes in a cluster) • Resilience to failure
Centralized computing with distributed data storage • Run the program at the client node, fetching data from the distributed system • Downsides: heavy data transfers, no use of the cluster's computing resources
Pushing the program near the data • MapReduce: A programming model to facilitate the development and execution of distributed tasks • Published by Google Labs in 2004 at OSDI (DG04). Widely used since then, open-source implementation in Hadoop
MapReduce in Brief • The programmer defines the program logic as two functions • Map transforms the input into key-value pairs to process • Reduce aggregates the list of values for each key • The MapReduce environment takes charge of the distribution aspects • A complex program can be decomposed as a succession of Map and Reduce tasks • Higher-level languages (Pig, Hive, etc.) help with writing distributed applications
Three operations on key-value pairs • User-defined: map transforms each input item into a list of intermediate key-value pairs • Fixed behavior: shuffle regroups all the intermediate pairs on the key • User-defined: reduce aggregates all the values associated with a key into a final result
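A minimal, single-process sketch of these three operations, using word counting as the running example. The function names (map_fn, shuffle, reduce_fn) and the in-memory shuffle are assumptions for illustration; a real MapReduce engine distributes and parallelizes each step.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # User-defined: emit one (word, 1) pair per word of the input document
    for word in text.split():
        yield word, 1

def shuffle(pairs):
    # Fixed behavior: regroup all intermediate pairs on their key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_fn(key, values):
    # User-defined: aggregate the list of values associated with a key
    yield key, sum(values)

# Local, sequential simulation of the job workflow
docs = {1: "web scale data", 2: "web data"}
intermediate = [kv for d, t in docs.items() for kv in map_fn(d, t)]
results = [kv for k, vs in shuffle(intermediate) for kv in reduce_fn(k, vs)]
print(sorted(results))  # [('data', 2), ('scale', 1), ('web', 2)]
```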
Job workflow in MapReduce • Each pair, at each phase, is processed independently of the other pairs • Network and distribution are transparently managed by the MapReduce environment
A MapReduce cluster • Nodes inside a MapReduce cluster are organized as follows • A jobtracker acts as a master node • MapReduce jobs are submitted to it • Several tasktrackers run the computation itself, i.e., map and reduce tasks • A given tasktracker may run several tasks in parallel • Tasktrackers usually also act as data nodes of a distributed filesystem (e.g., GFS, HDFS)
Processing a MapReduce job • The MapReduce environment takes care of distribution, synchronization, and failure handling • The input is split into M groups; each group is assigned to a mapper (assignment is based on the data locality principle) • Each mapper processes a group and stores the intermediate pairs locally • Grouped instances are assigned to reducers via a hash function • Intermediate pairs are sorted on their key by the reducer • Remark: data locality no longer holds for the reduce phase, since it reads from the mappers
Assignment to mappers and reducers • Each mapper task processes a fixed amount of data (a split), usually set to the distributed filesystem block size (e.g., 64 MB) • The number of mapper nodes is a function of the number of mapper tasks and the number of available nodes in the cluster: each mapper node can process (in parallel and sequentially) several mapper tasks • Assignment to mappers tries to optimize data locality • The number of reducer tasks is set by the user • Assignment to reducers is done by hashing the key, usually uniformly at random; no data locality is possible (see the sketch below)
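A short sketch of this hash-based assignment of intermediate keys to reducer tasks. The choice of MD5 and the sample keys are illustrative assumptions; the point is that the partition depends only on the key and the user-chosen number of reducers, so no data locality is possible.

```python
import hashlib

NUM_REDUCERS = 4  # set by the user when submitting the job

def partition(key, num_reducers=NUM_REDUCERS):
    # Hash the key and spread it uniformly over the reducer tasks
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_reducers

for key in ["query", "log", "sensor", "index"]:
    print(key, "-> reducer", partition(key))
```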
Failure management • Because the tasks are distributed over hundreds or thousands of machines, the chances that a problem occurs somewhere are much higher; restarting the job from the beginning is not a valid option • The master periodically checks the availability and reachability of the tasktrackers (heartbeats) and whether map or reduce jobs make any progress • If a reducer fails, its task is reassigned to another tasktracker; this usually requires restarting mapper tasks as well • If a mapper fails, its task is reassigned to another tasktracker • If the jobtracker fails, the whole job must be re-initiated
Joins in MapReduce • Two datasets, A and B, that we need to join for a MapReduce task • If one of the datasets is small, it can be sent in full to each tasktracker and exploited inside the map (and possibly reduce) functions • Otherwise, each dataset should be grouped according to the join key, and the result of the join can be computed in the reduce function • Not very convenient to express in MapReduce
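A hedged sketch of the second (reduce-side) join pattern: each mapper tags its records with the dataset they come from, and the reducer pairs up the two sides for every join key. The tagging convention and the tiny local simulation are assumptions for illustration.

```python
from collections import defaultdict

def map_a(record):          # record = (join_key, payload_a)
    yield record[0], ("A", record[1])

def map_b(record):          # record = (join_key, payload_b)
    yield record[0], ("B", record[1])

def reduce_join(key, tagged_values):
    # Pair every A-side value with every B-side value sharing the key
    a_side = [v for tag, v in tagged_values if tag == "A"]
    b_side = [v for tag, v in tagged_values if tag == "B"]
    for a in a_side:
        for b in b_side:
            yield key, (a, b)

# Tiny local simulation of the shuffle between map and reduce
A, B = [(1, "alice"), (2, "bob")], [(1, "paris"), (1, "rome")]
groups = defaultdict(list)
for record in A:
    for k, v in map_a(record):
        groups[k].append(v)
for record in B:
    for k, v in map_b(record):
        groups[k].append(v)
for k, vs in groups.items():
    print(list(reduce_join(k, vs)))
```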
Using MapReduce for solving a problem • Prefer • Simple map and reduce functions • Mapper tasks processing large data chunks (at least the size of a distributed filesystem block) • A given application may have • A chain of map functions (input processing, filtering, extraction…) • A sequence of several map-reduce jobs (see the sketch below) • No reduce task when everything can be expressed in the map (zero reducers, or the identity reducer function) • Not the right tool for everything (see further)
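A rough sketch of how such compositions can be simulated locally. The helper run_job is an assumed stand-in for submitting one real MapReduce job; chaining jobs then amounts to feeding one job's output into the next, and a None reducer plays the role of the zero-reducer / identity case.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn=None):
    # Local stand-in for one MapReduce job submission
    groups = defaultdict(list)
    for rec in records:
        for k, v in map_fn(rec):
            groups[k].append(v)
    if reduce_fn is None:                         # zero reducers / identity
        return [(k, v) for k, vs in groups.items() for v in vs]
    return [kv for k, vs in groups.items() for kv in reduce_fn(k, vs)]

# Job 1: extract and count words; Job 2: map-only reformatting of the output
lines = ["web scale data", "web data"]
counts = run_job(lines,
                 lambda line: ((w, 1) for w in line.split()),
                 lambda k, vs: [(k, sum(vs))])
report = run_job(counts, lambda kv: [(f"{kv[0]}\t{kv[1]}", None)])
print(report)
```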
Data replication and consistency • Replication • A mechanism that copies a data item located on a machine A to a remote machine B • One obtains replicas • Consistency • The ability of a system to behave as if each user's transactions always run in isolation from other transactions and never fail • Example: shopping basket in an e-commerce application • Difficult in centralized systems because of multiple users and concurrency • Even more difficult in distributed systems because of replicas
Data replication and consistency • Some illustrative scenarios • Case (a): Eager, primary • Because the replication is managed synchronously, a read request by Client 2 always accesses a consistent state of d • Because there is a primary copy, requests sent by several clients relating to the same item can be queued, which ensures that updates are applied sequentially and not in parallel • Downside • Applications have to wait for the completion of other clients' requests, both for writing and reading
Data replication and consistency • Some illustrative scenarios • Case (b): Async, primary • There is still a primary copy, but the replication is asynchronous • Thus, some of the replicas may be out of date with respect to the clients' requests • Because of the primary copy, the replicas will be eventually consistent, since there cannot be independent updates of distinct replicas • This is considered acceptable in many modern NoSQL data management systems, which trade strong consistency for higher read throughput
Data replication and consistency • Some illustrative scenarios • Case (c): Eager, no primary • A complex situation where two clients can simultaneously write on distinct replicas • Eager replication implies that these writes must be synchronized right away • This can lead to interlocking, where each client waits for a resource locked by the other
Data replication and consistency • Some illustrative scenarios • Case (d): Async, no primary • The most flexible case • Client operations are never stalled by concurrent operations, at the price of possibly inconsistent states • These must later be resolved by data reconciliation
Data replication and consistency • Consistency management in distributed systems • Strong consistency (ACID properties) requires synchronous replication, and possibly heavy locking mechanisms • Weak consistency – accept to serve some requests with outdated data • Eventual consistency – same as before, but the system is guaranteed to converge towards a consistent state based on the last version • In a system that is not eventually consistent, conflicts occur and the application must take care of data reconciliation: given two conflicting copies, determine the new current one • Standard RDBMSs favor consistency over availability – relaxing this is one of the NoSQL trends
Distributed transactions • A transaction is a sequence of data update operations that is required to be an all-or-nothing unit of work • The two-phase commit (2PC) protocol • The main algorithm of choice to ensure ACID properties in a distributed setting • First, the coordinator asks each participant whether it is able to perform the required operation, with a Prepare message • Second, if all participants answered with a confirmation, the coordinator sends a Decision message: the transaction is then committed at each site
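A minimal sketch of the two message rounds just described. The Participant class and its methods are illustrative assumptions rather than the API of a specific system; a real implementation would also log every step persistently.

```python
class Participant:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy

    def prepare(self, operation):
        # Phase 1 (Prepare): vote on whether the operation can be performed here
        return self.healthy

    def decide(self, commit):
        # Phase 2 (Decision): apply the coordinator's global decision
        print(self.name, "COMMIT" if commit else "ABORT")

def two_phase_commit(operation, participants):
    votes = [p.prepare(operation) for p in participants]   # Prepare round
    decision = all(votes)                                   # commit iff all voted yes
    for p in participants:                                  # Decision round
        p.decide(decision)
    return decision

two_phase_commit("transfer", [Participant("site1"), Participant("site2")])
```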
Distributed transactions • Two phase commit
Failure recovery • Recovery techniques for centralized architecture • Recovery techniques for replicated architecture
Failure recovery • Replicated architectures • Synchronous protocol • The server acknowledges the Client only when the remote nodes have sent a confirmation of the successful completion of their write() operation • This may severely hinder the efficiency of updates, but the obvious advantage is that all the replicas are consistent • Asynchronous protocol • The client application waits only until one of the copies has been effectively written • The multi-log recovery process has a cost, but it brings availability
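The contrast between the two protocols can be sketched as follows; the Replica class and the use of threads for the asynchronous case are assumptions made for illustration.

```python
import threading

class Replica:
    def write(self, item):
        pass  # persist the item on this copy

def synchronous_write(item, replicas):
    # Acknowledge the client only once every replica has completed its write:
    # slower updates, but all replicas stay consistent
    for r in replicas:
        r.write(item)
    return "ack"

def asynchronous_write(item, replicas):
    # Acknowledge as soon as one copy is written; propagate to the others in
    # the background, so replicas may temporarily diverge
    replicas[0].write(item)
    for r in replicas[1:]:
        threading.Thread(target=r.write, args=(item,)).start()
    return "ack"

print(synchronous_write("d", [Replica(), Replica(), Replica()]))
```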
Scalability • Scalability refers to the ability of a system to continuously evolve in order to support an ever-growing number of tasks • A scalable system should (i) distribute the task load evenly over all participants, and (ii) ensure a negligible distribution management cost
Availability • Availability is the capacity of a system to limit its latency as much as possible • Failure detection • Monitor the participating nodes to detect failures as early as possible (usually via “heartbeat” messages) • Design quick restart protocols • Replication on several nodes
Efficiency • Two usual measures of a distributed system's efficiency are the response time (or latency), which denotes the delay to obtain the first item, and the throughput (or bandwidth), which denotes the number of items delivered in a given time unit • Unit costs • Number of messages globally sent by the nodes of the system, regardless of message size • Size of messages, representing the volume of data exchanged
The CAP theorem • No distributed system can simultaneously provide all three of the following properties • Consistency: all nodes see the same data at the same time • Availability: node failures do not prevent survivors from continuing to operate • Partition tolerance: the system continues to operate despite arbitrary message loss
Basics: Centralized Hash files • The collection consists of (key, value) pairs • A hash function evenly distributes the values in buckets w.r.t. the key • This is the basic, static scheme • The number of buckets is fixed • Dynamic hashing extends the number of buckets as the collection grows • The most popular method is linear hashing
Basics: Centralized Hash files • Issues with hash structures distribution • Straightforward idea: everybody uses the same hash function, and buckets are replaced by servers • Two issues • Dynamicity: At web scale, we must be able to add or remove servers at any moment • Inconsistencies: It is very hard to ensure that all participants share an accurate view of the system (e.g. the hash function)
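A small sketch of this straightforward scheme and of why it breaks when the number of servers changes. The modulo rule and the MD5 hash are illustrative assumptions.

```python
import hashlib

def server_for(key, num_servers):
    # Every client applies the same hash function and the same modulo
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_servers

# Going from 4 to 5 servers remaps almost every key: the dynamicity problem
for key in ["user:1", "user:2", "user:3"]:
    print(key, server_for(key, 4), "->", server_for(key, 5))
```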
Consistent hashing • Let N be the number of servers. The function server(key) = h(key) mod N maps a (key, value) pair to a server • If N changes, or if a client uses an invalid value of N, the mapping becomes inconsistent • With consistent hashing, the addition or removal of an instance does not significantly change the mapping of keys to servers • A simple, non-mutable hash function h maps both keys and server IPs to a large address space A • A is organized as a ring, scanned in clockwise order • If S and S’ are two adjacent servers on the ring, all the keys in the range [h(S), h(S’)] are mapped to S’
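A hedged sketch of the ring just described: server identifiers and keys are hashed into the same address space, and a key goes to the first server met clockwise (its successor on the ring). The MD5-based hash, the bisect lookup, and the server names are implementation assumptions.

```python
import bisect
import hashlib

def h(value):
    # Fixed hash function mapping both keys and server IPs onto the ring
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, servers):
        self.points = sorted((h(s), s) for s in servers)

    def lookup(self, key):
        # First server whose position is >= h(key), wrapping around the ring
        positions = [p for p, _ in self.points]
        i = bisect.bisect_left(positions, h(key)) % len(self.points)
        return self.points[i][1]

    def add(self, server):
        # Only the keys between the new server and its predecessor move:
        # a local re-hashing is sufficient
        bisect.insort(self.points, (h(server), server))

ring = Ring(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.lookup("user:42"))
```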
Consistent hashing • Illustration • A server is added or removed? • A local re-hashing is sufficient
Consistent hashing • What if a server fails? How can we balance the load? • Failure • Use replication: put a copy on the next machine (on the ring), then on the next after the next, and so on • Load balancing • Map a server to several points on the ring • The more points, the more load received by a server • Also useful if the server fails • Also useful in case of heterogeneity (the rule in large-scale systems)
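Continuing the Ring sketch above (still an assumption, not a library API), mapping each server to several points on the ring can look like this; the number of virtual nodes per server controls how much load it receives.

```python
class VirtualRing(Ring):
    def __init__(self, servers, vnodes=100):
        # Each physical server appears at several positions ("virtual nodes"),
        # which smooths the load and helps with heterogeneous machines
        self.points = sorted((h(f"{s}#{i}"), s)
                             for s in servers for i in range(vnodes))

vring = VirtualRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(vring.lookup("user:42"))
```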
Consistent hashing • Distributed indexing based on consistent hashing • Where is the hash directory? (several possible answers) • On a specific (“Master”) node acting as a load balancer • Raises scalability issues • Each node records its successor on the ring • May require O(N) messages for routing queries – not resilient to failures • Each node records log N carefully chosen other nodes • Ensures O(log N) messages for routing queries • Full duplication of the hash directory at each node • Ensures 1 message for routing, at the price of a heavy maintenance protocol, which can be achieved through gossiping (broadcast of any event affecting the network topology)
History and development of GFS • The Google File System, a paper published in 2003 by Google Labs at SOSP • Explains the design and architecture of a distributed system serving very large data files • Internally used by Google for storing documents collected from the Web • Open-source versions were developed soon after • Hadoop Distributed File System (HDFS) and Kosmos File System (KFS)
The problem • Why do we need a distributed file system in the first place? • Fact • Standard NFS (left part) does not meet scalability requirements (what if file 1 gets really big?) • Right part • GFS/HDFS storage based on (i) a virtual file namespace, and (ii) partitioning of files into “chunks”
Architecture • A master node performs administrative tasks, while servers store “chunks” and send them to client nodes • The client maintains a cache with chunk locations and directly communicates with the servers
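A hedged sketch of this interaction pattern: the client asks the master only for chunk locations, caches them, and then reads data directly from a chunk server. Class names, method names, and the 64 MB constant are illustrative assumptions, not the actual GFS/HDFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in the split size mentioned earlier

class ChunkServer:
    def read_chunk(self, path, index, offset, length):
        return b""  # stub: would return the requested bytes of the chunk

class Master:
    def locate(self, path, chunk_index):
        return [ChunkServer()]  # stub: would return the replica locations

class Client:
    def __init__(self, master):
        self.master = master
        self.location_cache = {}      # (path, chunk_index) -> list of servers

    def read(self, path, offset, length):
        chunk_index = offset // CHUNK_SIZE
        key = (path, chunk_index)
        if key not in self.location_cache:
            # Administrative metadata comes from the master node...
            self.location_cache[key] = self.master.locate(path, chunk_index)
        server = self.location_cache[key][0]
        # ...but the data itself is fetched directly from a chunk server
        return server.read_chunk(path, chunk_index, offset % CHUNK_SIZE, length)

client = Client(Master())
client.read("/logs/2024.log", offset=70 * 1024 * 1024, length=1024)
```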
Workflow of a write() operation • The following figure shows a non-concurrent append() operation • In case of concurrent appends to a chunk, the primary replica assigns serial numbers to the mutations and coordinates the secondary replicas
Namespace updates: distributed recovery protocol • Extension of standard techniques for recovery (left: centralized; right: distributed) • If a node fails, the replicated log file can be used to recover the last transactions on one of its mirrors