Thus Far • Locality is important!!! • Need to get processing closer to storage • Need to get tasks close to their data • Rack locality: Hadoop prefers rack-local slots • Kill a running task and reschedule it when a data-local slot becomes available: Quincy • Why? • The network is bad: going over it gives horrible performance • Why? • Over-subscription of the network
What Has Changed? • The network is no longer over-subscribed • Fat-tree, VL2 • The network has fewer congestion points • Helios, c-Through, Hedera, MicroTE • Server uplinks are much faster • Implication: network transfers are much faster • The network is now about as fast as disk I/O • The difference between a local and a rack-local read is only ~8% (rough back-of-the-envelope below) • Storage practices have also changed • Compression is widely used, so less data needs to be transferred • De-replication is practiced • With only one copy, locality is really hard to achieve anyway
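A rough back-of-the-envelope in Python for the "network is now about as fast as disk" claim. The throughput numbers and the 8% overhead are illustrative assumptions, not figures taken from the papers; the point is only that once the NIC outruns the disk, the disk stays the bottleneck either way:

```python
# Back-of-the-envelope sketch (illustrative numbers, not from the papers):
# if the disk, not the network, is the bottleneck, reading a block over a
# non-oversubscribed network costs barely more than reading it locally.

DISK_MBps = 125        # assumed sequential read throughput of one SATA disk
NIC_MBps = 1250        # assumed 10 Gbps server uplink (~1.25 GB/s)
NET_OVERHEAD = 0.08    # assumed extra per-byte cost of a rack-local hop (~8%)

block_mb = 256         # one HDFS-style block

local_time = block_mb / DISK_MBps
remote_time = block_mb / min(DISK_MBps, NIC_MBps) * (1 + NET_OVERHEAD)

print(f"local read : {local_time:.2f} s")
print(f"remote read: {remote_time:.2f} s "
      f"({(remote_time / local_time - 1) * 100:.0f}% slower)")
```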
So What Now? • No need to worry about locality when doing placement • Placement can happen faster • Scheduling algorithms can be smaller/simpler • The network is as fast as a SATA disk, but still a lot slower than an SSD • If SSDs are used, then disk locality is a problem AGAIN! • However, SSDs are too costly to be used for all storage
Caching with Memory/SSD • 94% of all jobs have inputs that fit in memory • So the new problem is memory locality • Want to place a task where its input data is already cached in memory • Interesting challenges: • 46% of tasks read data that is never re-used • So we need to pre-fetch for these tasks • Current caching schemes are ineffective (a simple placement sketch follows)
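A minimal placement sketch, assuming a hypothetical scheduler that can see per-node cache contents (none of these structures come from a real system): prefer a node that already caches the task's input, otherwise fall back to prefetching, which is what the 46% of read-once inputs require:

```python
# Minimal sketch (hypothetical data structures, not a real scheduler API) of
# memory-locality-aware placement: prefer a node that already caches the
# task's input; otherwise fall back to any free node and ask it to prefetch.

def place_task(task_input, free_nodes, cache_contents):
    """cache_contents: node -> set of blocks currently held in memory."""
    # 1. Memory locality: a free node that already has the input cached.
    for node in free_nodes:
        if task_input in cache_contents.get(node, set()):
            return node, "cache hit"
    # 2. No cached copy: pick any free node and start prefetching the input
    #    (needed for the ~46% of inputs that are read exactly once).
    node = free_nodes[0]
    cache_contents.setdefault(node, set()).add(task_input)  # simulate prefetch
    return node, "prefetch"

# toy usage
free = ["n1", "n2", "n3"]
caches = {"n2": {"blockA"}}
print(place_task("blockA", free, caches))  # -> ('n2', 'cache hit')
print(place_task("blockB", free, caches))  # -> ('n1', 'prefetch')
```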
How Do You Build a FS That Ignores Locality? • FDS (Flat Datacenter Storage) from MSR ignores locality • Eliminate the networking problems so locality no longer matters • Eliminate the meta-data server problems to improve the throughput of the whole system
Meta-data Server • The current meta-data server (HDFS name-node) • Stores the mapping of chunks to servers • Central point of failure • Central bottleneck • Processing issue: every read/write must first consult the meta-data server • Storage issue: must store the location and size of EVERY chunk
FDS’s Meta-data Server • Only stores the list of servers • Smaller memory footprint: # servers <<< # chunks • Clients only interact with it at startup, not every time they read/write • # client startups <<<< # reads/writes • To determine where to read/write: consistent hashing over the server list • Read/write the data at the server at index Hash(GUID) mod #servers (sketch below)
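A minimal sketch of the slide's simplified lookup, assuming the client fetched the server list once at startup. FDS itself indexes a tract locator table using the blob GUID plus a tract number, so treat this as an illustration of the idea rather than the real protocol:

```python
# Minimal sketch: hash the blob GUID into the server list handed out at
# client startup, so no per-read/write metadata lookup is needed.
import hashlib

def server_for(guid: str, servers: list[str]) -> str:
    # Stable hash of the GUID -> index into the server list.
    h = int.from_bytes(hashlib.sha1(guid.encode()).digest()[:8], "big")
    return servers[h % len(servers)]

# Server list obtained from the metadata server once, at startup.
servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
print(server_for("blob-42", servers))   # every client computes the same answer
print(server_for("blob-43", servers))   # no metadata server on the data path
```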
Network Changes • Uses a VL2-style Clos network • Eliminates over-subscription + congestion • 1 TCP flow doesn’t saturate a server’s 10-gig NIC • Use ~5 TCP connections to saturate the link • With VL2 there is no congestion in the core, but there may be at the receiver • So the receiver controls the senders’ sending rates • The receiver sends rate-limiting messages to the senders (sketch below)
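A toy illustration of receiver-driven rate limiting, using my own simplified classes rather than FDS's actual mechanism: since the receiver's NIC is the only congestion point left in a non-oversubscribed Clos fabric, the receiver divides its capacity among the senders currently talking to it and tells each one how fast to send:

```python
# Illustrative sketch (a simplification, not FDS's actual protocol) of
# receiver-driven rate limiting.

NIC_GBPS = 10.0

class Receiver:
    def __init__(self):
        self.senders = set()

    def register(self, sender):
        self.senders.add(sender)
        self.broadcast_rates()

    def finish(self, sender):
        self.senders.discard(sender)
        self.broadcast_rates()

    def broadcast_rates(self):
        if not self.senders:
            return
        # Equal share of the receiver NIC across active senders.
        rate = NIC_GBPS / len(self.senders)
        for s in self.senders:
            print(f"rate-limit msg -> {s}: send at {rate:.2f} Gbps")

rx = Receiver()
rx.register("senderA")        # senderA gets 10 Gbps
rx.register("senderB")        # both throttled to 5 Gbps
rx.finish("senderA")          # senderB back to 10 Gbps
```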
Disk Locality Is Almost a Problem of the Past • Advances in networking eliminate over-subscription/congestion • FDS is a prototype file system that doesn’t need locality • Uses a VL2-style network • Eliminates the meta-data server bottleneck • New problem, new challenges: memory locality • New cache-replacement techniques • New pre-fetching schemes
Class Wrap-Up • What have we covered and learned? • The big-data stack • How to optimize each layer • What the challenges are in each layer • Whether there are opportunities to optimize across layers
Big-Data Stack: App Paradigms • Commodity devices shape the design of application paradigms • Hadoop: deals with failures • Addresses n/w over-subscription with rack-aware placement • Straggler detection and mitigation: restart (speculatively re-execute) tasks • Dryad: Hadoop for smarter programmers • Can create more expressive task DAGs (acyclic) • Can specify which vertices should run locally on the same machine • Dryad adds its own optimizations, e.g. extra vertices that do intermediate aggregation (toy DAG sketch below)
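A toy, hypothetical sketch of the Dryad idea of an explicit acyclic job graph with extra aggregation vertices. Dryad's real API is C++ and far richer; all names here are invented:

```python
# Toy representation of a Dryad-style job graph: an explicit acyclic DAG with
# per-rack aggregation vertices inserted between map and reduce, so the data
# crossing racks is already pre-aggregated (smaller).
from collections import defaultdict

edges = defaultdict(list)          # vertex -> downstream vertices

def connect(src, dst):
    edges[src].append(dst)

mappers = [f"map{i}" for i in range(6)]
aggregators = {"rack0": "agg0", "rack1": "agg1"}   # one aggregator per rack
for i, m in enumerate(mappers):
    rack = f"rack{i % 2}"
    connect(m, aggregators[rack])
for agg in aggregators.values():
    connect(agg, "reduce")

for v, outs in edges.items():
    print(v, "->", ", ".join(outs))
```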
[Stack figure: the course’s big-data stack (App paradigms, Sharing, Virt drawbacks, N/W paradigm, Tail latency, N/W sharing, SDN, Storage), with Hadoop and Dryad placed in the App layer]
Big-Data Stack: App Paradigms Revisited • User-visible services are complex and composed of multiple M-R jobs • FlumeJava & DryadLINQ • Delay execution until the output is actually required • This allows various optimizations • Writing output to HDFS between M-R jobs adds time, so eliminate HDFS between jobs • Programmers aren’t perfect and often include unnecessary steps • Knowing what the final output needs, you can eliminate the unnecessary work (lazy-evaluation sketch below)
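A minimal sketch of deferred execution in the FlumeJava/DryadLINQ spirit. The LazyCollection class is my own toy, not either system's API; it only shows that building a plan first lets the runtime fuse steps into one pass and skip work whose output is never requested:

```python
# Operations only record a plan; run() fuses the chain and executes it in one
# pass, so nothing is materialized between "jobs".

class LazyCollection:
    def __init__(self, data, ops=()):
        self.data, self.ops = data, list(ops)

    def map(self, fn):
        return LazyCollection(self.data, self.ops + [("map", fn)])

    def filter(self, pred):
        return LazyCollection(self.data, self.ops + [("filter", pred)])

    def run(self):
        out = []
        for x in self.data:               # single fused pass, no HDFS writes
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    x = fn(x)
                elif kind == "filter" and not fn(x):
                    keep = False
                    break
            if keep:
                out.append(x)
        return out

words = LazyCollection(["a", "bb", "ccc", "dddd"])
plan = words.map(len).filter(lambda n: n > 1).map(lambda n: n * 10)
unused = plan.map(str)        # never run() -> never executed (dead step costs nothing)
print(plan.run())             # [20, 30, 40]
```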
[Stack figure: FlumeJava and DryadLINQ added to the App layer]
Big-Data Stack: App Paradigms Revisited Yet Again • User-visible services require interactivity, so jobs need to be fast and should return results before processing completes • Hadoop Online: • Pipeline results from map to reduce before the map is done • Pipeline too early and the reducers must do the sorting, which increases processing overhead on the reduce side: BAD!!! • RDDs: Spark • Store data in memory: much faster than disk • Instead of executing immediately, build an abstract graph (lineage) of the processing and run it only when output is required • Allows for optimizations • Failure recovery is the challenge: lost partitions are recomputed from the lineage (sketch below)
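A toy sketch of the RDD idea, not Spark's API: each dataset records the transformation (lineage) that produced it, so a lost in-memory partition can be recomputed from its parent instead of being replicated on disk:

```python
class ToyRDD:
    def __init__(self, parent=None, fn=None, source=None):
        self.parent, self.fn, self.source = parent, fn, source
        self.cache = None                      # in-memory copy, may be lost

    def map(self, fn):
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        if self.cache is not None:
            return self.cache
        if self.source is not None:            # base data (e.g., from HDFS)
            data = list(self.source)
        else:                                  # recompute from lineage
            data = [self.fn(x) for x in self.parent.compute()]
        self.cache = data
        return data

base = ToyRDD(source=[1, 2, 3, 4])
squares = base.map(lambda x: x * x)
print(squares.compute())      # [1, 4, 9, 16], now cached in memory
squares.cache = None          # simulate losing the cached partition
print(squares.compute())      # recovered by replaying the lineage
```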
[Stack figure: Hadoop Online and Spark added to the App layer]
Big-Data Stack: Sharing Is Caring (How to Share a Non-Virtualized Cluster) • Sharing is good: you have too much data, and it costs too much to build many clusters for the same data • Need dynamic sharing: static partitioning wastes resources • Mesos: • Resource offers: give each application a choice of resources and let it pick; the app knows best • Omega: • Optimistic allocation: each scheduler picks resources independently; if there’s a conflict, Omega detects it and gives the resources to only one scheduler, and the others pick new resources • Even with conflicts this is much better than a single centralized scheduler (conflict-detection sketch below)
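A toy sketch of Omega-style optimistic allocation; the data structures and scheduler names are invented, not Omega's shared-state API. Schedulers plan in parallel against their own view, and a commit that conflicts with the shared state is rejected so only one scheduler wins:

```python
cluster = {"m1": "free", "m2": "free", "m3": "free"}   # shared cell state

def try_commit(scheduler, wanted):
    # Detect conflicts against the current shared state.
    conflicts = [m for m in wanted if cluster[m] != "free"]
    if conflicts:
        return False, conflicts              # loser retries with fresh state
    for m in wanted:
        cluster[m] = scheduler               # winner gets the machines
    return True, []

# Two schedulers planned optimistically against the same (stale) snapshot:
print(try_commit("batch-sched", ["m1", "m2"]))     # (True, [])
print(try_commit("service-sched", ["m2", "m3"]))   # (False, ['m2'])
print(try_commit("service-sched", ["m3"]))         # retry succeeds
print(cluster)
```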
[Stack figure: Mesos and Omega added to the Sharing layer]
Big-Data Stack: Sharing Is Caring (Cloud Sharing) • Clouds give the illusion of equality • H/W differences lead to different performance • Poor isolation: tenants can impact each other • I/O-bound and CPU-bound jobs can conflict
[Stack figure: BobTail, RFA, and Cloud Gaming added to the Virt-drawbacks layer]
Big-Data Stack: Better Networks • Networks give bad performance • Cause: congestion + over-subscription • VL2/PortLand • Eliminate over-subscription + congestion with commodity devices + ECMP load balancing (ECMP sketch below) • Helios/c-Through • Mitigate congestion by carefully adding extra capacity where it is needed
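A minimal sketch of ECMP-style flow spreading, with an invented hash function and uplink names (real switches hash the 5-tuple in hardware): each flow is pinned to one of several equal-cost uplinks, so flows spread across the fabric while packets within a flow stay on one path and avoid reordering:

```python
import hashlib

UPLINKS = ["core1", "core2", "core3", "core4"]   # equal-cost paths upward

def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    # Hash the flow's 5-tuple to deterministically choose one uplink.
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    h = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return UPLINKS[h % len(UPLINKS)]

print(pick_uplink("10.0.1.5", "10.0.9.7", 5123, 80))   # same flow -> same path
print(pick_uplink("10.0.1.5", "10.0.9.7", 5124, 80))   # new flow may take another path
```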
[Stack figure: VL2, PortLand, Hedera, c-Through, Helios, and MicroTE added to the N/W-paradigm layer]
Big-Data Stack: Tail Latency • When you need multiple servers to service a request, the tail dominates: if each server is fast 99% of the time, 0.99^100 ≈ 0.37, so roughly 63% of 100-server requests wait on at least one straggler (HORRIBLE; see the sketch below) • Duplicate requests: send the same request to 2 servers • At least one will finish within an acceptable time • Dolly: be smart when selecting the 2 servers • You don’t want I/O contention because that leads to bad performance • Avoid map clones using the same replicas • Avoid reducer clones reading the same intermediate output
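A quick check of that arithmetic, plus the effect of cloning, in Python. This is standard fan-out reasoning and assumes clones are independent, which is exactly the assumption Dolly protects by keeping clones off shared replicas and shared intermediate outputs:

```python
# If one server answers within its 99th-percentile latency 99% of the time,
# what fraction of requests that fan out to N servers see no straggler?

def prob_no_straggler(p_fast=0.99, n_servers=100, clones=1):
    # With k independent clones per sub-request, a sub-request is slow only
    # if *all* of its clones are slow: (1 - p_fast) ** clones.
    p_subreq_fast = 1 - (1 - p_fast) ** clones
    return p_subreq_fast ** n_servers

print(f"no cloning : {prob_no_straggler():.2f}")           # ~0.37 -> ~63% of requests stall
print(f"2 clones   : {prob_no_straggler(clones=2):.2f}")   # ~0.99 -> duplication tames the tail
```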
[Stack figure: Mantri and Dolly (clones) added to the Tail-latency layer]
Big-Data Stack: Network Sharing • How to share the network efficiently while making guarantees • ElasticSwitch • Two-level bandwidth-allocation system • Orchestra • M/R has barriers, and completion depends on a set of flows (a transfer), not individual flows • So make optimizations over a set of flows • HULL: trade bandwidth for latency • Want near-zero buffering, but TCP needs buffering • Limit traffic to ~90% of the link and use the remaining ~10% as headroom (sketch below)
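A toy illustration of HULL's bandwidth-for-latency trade, not its phantom-queue hardware: admit traffic only up to ~90% of line rate and signal senders to slow down beyond that, so the real queue, and therefore queuing latency, stays near zero:

```python
LINE_RATE = 10e9          # 10 Gbps link
HEADROOM = 0.10           # sacrifice ~10% of bandwidth
CAP = LINE_RATE * (1 - HEADROOM)

def simulate(offered_gbps, seconds=1.0):
    # Traffic above the cap generates early congestion signals instead of
    # building a standing queue on the real link.
    offered = offered_gbps * 1e9 * seconds          # bits offered
    admitted = min(offered, CAP * seconds)
    signaled = offered - admitted                   # bits that trigger "slow down"
    return admitted / 1e9, signaled / 1e9

print(simulate(9.5))   # above the cap: senders get told to back off early
print(simulate(8.0))   # below the cap: passes through, queue stays ~empty
```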
[Stack figure: Orchestra, ElasticSwitch, and HULL added to the N/W-sharing layer]
Big-Data Stack: Enter SDN • Remove the control plane from the switches and centralize it • Centralization == scalability challenges • NOX: how does a single controller scale to data centers? • How many controllers do you need? • How should you design these controllers? • Kandoo: a hierarchy (many local controllers and 1 root controller; locals handle local events and escalate the rest to the root) • ONIX: a mesh of controllers (state shared through a DHT or a DB) (sketch below)
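A toy sketch of the Kandoo-style split, with invented event names rather than Kandoo's API: frequent, switch-local events are absorbed by local controllers, and only events needing a network-wide view reach the single root controller:

```python
LOCAL_EVENTS = {"port_stats", "elephant_detection_probe"}   # handled locally

class RootController:
    def handle(self, event, switch):
        print(f"root: handling {event} from {switch} (global view needed)")

class LocalController:
    def __init__(self, name, root):
        self.name, self.root = name, root

    def handle(self, event, switch):
        if event in LOCAL_EVENTS:
            print(f"{self.name}: handled {event} from {switch} locally")
        else:
            self.root.handle(event, switch)    # escalate rare/global events

root = RootController()
local = LocalController("local-0", root)
local.handle("port_stats", "sw1")              # stays local -> root not loaded
local.handle("elephant_flow_reroute", "sw1")   # needs global view -> escalated
```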
[Stack figure: Kandoo and ONIX added to the SDN layer]
Big Data Stack: SDN + Big Data • FlowComb: • Detect application transfer patterns and have the SDN controller assign paths based on its knowledge of traffic patterns and contention • Sinbad: • HDFS writes are important, and their destinations are flexible • Let the SDN controller tell HDFS the best place to write data, based on its knowledge of network congestion (sketch below)
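A toy sketch of the Sinbad idea, with made-up utilization numbers and no real controller API: because an HDFS replica may legally land on many servers, consult the network's view of link utilization to pick the destination behind the least-congested downlink:

```python
link_utilization = {        # controller's view: rack downlink -> load (0..1)
    "rack1": 0.85,
    "rack2": 0.30,
    "rack3": 0.55,
}
candidates = {"rack1": ["s11", "s12"], "rack2": ["s21"], "rack3": ["s31"]}

def pick_write_destination():
    # Choose a server in the rack whose downlink is currently least utilized.
    best_rack = min(candidates, key=lambda r: link_utilization[r])
    return candidates[best_rack][0]

print(pick_write_destination())   # -> 's21' (rack2 is least congested)
```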
[Stack figure: FlowComb and Sinbad added to the SDN layer]
Big Data Stack: Distributed Storage • Ideal: nice API, low latency, scalable • Problem: H/W fails a lot, sits in a limited set of locations, and has limited resources • Partition: gives good performance • Cassandra: uses consistent hashing to partition the key space • Megastore: each partition == a small RDBMS with strong consistency guarantees • Replicate: multiple copies survive failures • Megastore: replicas also allow low-latency access (partition/replication sketch below)
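A minimal sketch of Cassandra-style partitioning plus replication using a generic consistent-hash ring (not Cassandra's actual partitioner): a key is owned by the next node clockwise on the ring and copied to the following nodes, so losing one node loses no data:

```python
import bisect
import hashlib

def ring_pos(name: str) -> int:
    # Map a node name or key to a position on the hash ring.
    return int.from_bytes(hashlib.sha1(name.encode()).digest()[:8], "big")

nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = sorted((ring_pos(n), n) for n in nodes)
positions = [p for p, _ in ring]

def replicas_for(key: str, n_replicas: int = 3) -> list[str]:
    # Owner = first node clockwise from the key; replicas = next nodes on the ring.
    i = bisect.bisect(positions, ring_pos(key)) % len(ring)
    return [ring[(i + k) % len(ring)][1] for k in range(n_replicas)]

print(replicas_for("user:42"))    # owner plus two replicas
print(replicas_for("user:43"))    # different keys spread across different owners
```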
[Stack figure: Megastore and Cassandra added to the Storage layer]
Big Data Stack: Disk Locality Irrelevant • Disk locality is becoming irrelevant • Data is compressed, so transfer times are smaller • Networks are much faster (rack-local reads only ~8% slower than local) • Memory locality is the new challenge • The inputs of 94% of jobs fit in memory • Need new caching + prefetching schemes
[Final stack figure: the full big-data stack covered in the course] • App paradigms: Hadoop, Dryad, FlumeJava, DryadLINQ, Hadoop Online, Spark • Sharing: Mesos, Omega • Virt drawbacks: BobTail, RFA, Cloud Gaming • N/W paradigm: VL2, PortLand, Hedera, c-Through, Helios, MicroTE • Tail latency: Mantri, Dolly (clones) • N/W sharing: Orchestra, ElasticSwitch, HULL • SDN: Kandoo, ONIX, FlowComb, Sinbad • Storage: Megastore, Cassandra, FDS (disk locality irrelevant)