Thus Far

Presentation Transcript


  1. Thus Far
  • Locality is important!
    • Need to get processing closer to storage
    • Need to get tasks close to data
  • Rack locality: Hadoop (placement sketch below)
  • Kill a task and re-run it when a local slot becomes available: Quincy
  • Why? The network is bad: it gives horrible performance
  • Why is the network bad? Over-subscription of the network
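
A rough sketch of the locality preference described above: prefer a node-local slot, then a rack-local one, then anything. The names here (place_task, block_locations, rack_of) are hypothetical and invented for illustration; this is not Hadoop's or Quincy's actual scheduler code.

```python
from types import SimpleNamespace

def place_task(task, free_slots, block_locations, rack_of):
    """task.input_block identifies the block the task reads.
    free_slots: node names with an open slot; block_locations: block -> set of
    nodes holding a replica; rack_of: node -> rack id. All names hypothetical."""
    replicas = block_locations.get(task.input_block, set())

    # 1. Node-local: a free slot on a node that already stores the input.
    for node in free_slots:
        if node in replicas:
            return node, "node-local"

    # 2. Rack-local: a free slot in the same rack as some replica.
    replica_racks = {rack_of[n] for n in replicas}
    for node in free_slots:
        if rack_of[node] in replica_racks:
            return node, "rack-local"

    # 3. Otherwise any free slot (remote read across the core), or wait.
    return (free_slots[0], "remote") if free_slots else (None, "wait")

task = SimpleNamespace(input_block="blk_7")
print(place_task(task, ["n3", "n8"],
                 {"blk_7": {"n1", "n8"}},
                 {"n1": "r1", "n3": "r2", "n8": "r1"}))   # ('n8', 'node-local')
```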

  2. What Has Changed?
  • The network is no longer over-subscribed
    • Fat-tree, VL2
  • The network has fewer congestion points
    • Helios, C-thru, Hedera, MicroTE
  • Server uplinks are much faster
  • Implications: network transfers are much faster
    • The network is now just as fast as disk I/O
    • The difference between local and rack-local reads is only 8%
  • Storage practices have also changed
    • Compression is being used
      • A smaller amount of data needs to be transferred
    • De-replication is being practiced
      • There is only one copy, so locality is really hard to achieve

  3. So What Now?
  • No need to worry about locality when doing placement
    • Placement can happen faster
    • Scheduling algorithms can be smaller/simpler
  • The network is as fast as a SATA disk, but still a lot slower than an SSD
    • If SSDs are used, then disk locality is a problem AGAIN!
    • However, SSDs are too costly to be used for all storage

  4. Caching with Memory/SSD
  • 94% of all jobs can have their input fit in memory
  • So a new problem is memory locality
    • Want to place a task where it will have access to data already in memory (sketch below)
  • Interesting challenges:
    • 46% of tasks use data that is never re-used
      • So we need to prefetch for these tasks
    • Current caching schemes are ineffective
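
A minimal sketch of memory-locality placement plus prefetching for inputs with no cached copy, assuming a per-node view of what is in memory. The names (place_with_memory_locality, mem_cache, prefetch_queue, task.reused) are hypothetical; this is not the caching scheme the slide alludes to.

```python
from types import SimpleNamespace

def place_with_memory_locality(task, free_nodes, mem_cache, prefetch_queue):
    """mem_cache: node -> set of dataset ids currently held in memory.
    prefetch_queue: (node, dataset) prefetch requests to issue."""
    # Memory-local: some free node already caches the input.
    for node in free_nodes:
        if task.input_id in mem_cache.get(node, set()):
            return node

    # No cached copy: pick a node and prefetch the input so it is (mostly)
    # resident by the time the task runs.
    node = free_nodes[0]
    prefetch_queue.append((node, task.input_id))
    if task.reused:                      # caching only pays off if re-read later
        mem_cache.setdefault(node, set()).add(task.input_id)
    return node

cache, queue = {"n1": {"logs-01"}}, []
t = SimpleNamespace(input_id="logs-02", reused=False)
print(place_with_memory_locality(t, ["n1", "n2"], cache, queue), queue)
```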

  5. How Do You Build a FS That Ignores Locality?
  • FDS from MSR ignores locality
    • Eliminates the networking problems to remove the importance of locality
    • Eliminates the meta-data server problems to improve the throughput of the whole system

  6. Meta-data Server
  • The current meta-data server (name-node)
    • Stores the mapping of chunks to servers
    • Central point of failure
    • Central bottleneck
      • Processing issues: before anyone reads/writes, they must consult the metadata server
      • Storage issues: it must store the location and size of EVERY chunk

  7. FDS’s Meta-data Server
  • Only stores the list of servers
    • Smaller memory footprint: # servers <<< # chunks
  • Clients only interact with it at startup, not every time they need to read/write
  • To determine where to read/write: consistent hashing (see the lookup sketch below)
    • Write/read data at the server at this index in the array: Hash(GUID) mod #servers
  • # client boots <<<< # reads/writes, so the meta-data server stays off the data path
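
A minimal sketch of the client-side lookup the slide describes, assuming the metadata server only hands out the server list at startup. The real FDS tract-locator scheme is more elaborate; server_for and the example GUID are hypothetical.

```python
# The client fetches the server list once, then computes targets itself:
# no per-read or per-write metadata lookups.
import hashlib

def server_for(guid: bytes, servers: list) -> str:
    """Map a blob GUID to a server by hashing into the server array."""
    digest = hashlib.sha1(guid).digest()
    index = int.from_bytes(digest, "big") % len(servers)
    return servers[index]

servers = ["s0", "s1", "s2", "s3"]            # obtained once at client startup
target = server_for(b"blob-42", servers)      # no metadata server on this path
print(target)
```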

  8. Network Changes
  • Uses a VL2-style Clos network
    • Eliminates over-subscription + congestion
  • 1 TCP connection doesn’t saturate a server’s 10-gig NIC
    • Use 5 TCP connections to saturate the link
  • With VL2 there is no congestion in the core, but maybe at the receiver
    • The receiver controls the senders’ sending rate
    • The receiver sends rate-limiting messages to the senders

  9. Disk Locality Is Almost a Distant Problem
  • Advances in networking
    • Eliminate over-subscription/congestion
  • We have a prototype, FDS, that doesn’t need locality
    • Uses VL2
    • Eliminates meta-data servers
  • New problem, new challenges: memory locality
    • New cache-replacement techniques
    • New pre-caching schemes

  10. Class Wrap-Up
  • What have we covered and learned?
  • The big-data stack
    • How to optimize each layer?
    • What are the challenges in each layer?
    • Are there any opportunities to optimize across layers?

  11. Big-Data Stack: App Paradigms
  • Commodity devices impact the design of application paradigms
  • Hadoop: dealing with failures
    • Addresses n/w over-subscription via rack-aware placement
    • Straggler detection and mitigation: restart tasks (sketch below)
  • Dryad: Hadoop for smarter programmers
    • Can create more expressive task DAGs (non-cyclic)
    • Can determine which tasks should run locally on the same devices
    • Dryad does optimizations: adds extra nodes to do temporary aggregation
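
A toy sketch of straggler detection by progress rate, with a backup copy launched for any laggard. The 0.5 threshold and the find_stragglers/speculate names are invented for illustration; this is not Hadoop's exact heuristic.

```python
from collections import namedtuple

def find_stragglers(tasks, slowdown=0.5):
    """Return tasks whose progress rate is below `slowdown` times the mean."""
    rates = [t.progress / t.runtime for t in tasks if t.runtime > 0]
    if not rates:
        return []
    mean_rate = sum(rates) / len(rates)
    return [t for t in tasks
            if t.runtime > 0 and t.progress / t.runtime < slowdown * mean_rate]

def speculate(tasks, launch_copy):
    # Run a backup copy of each straggler elsewhere; the first finisher wins.
    for t in find_stragglers(tasks):
        launch_copy(t)

Task = namedtuple("Task", "id progress runtime")
tasks = [Task("t1", 0.9, 10), Task("t2", 0.8, 10), Task("t3", 0.1, 10)]
speculate(tasks, lambda t: print("launch backup for", t.id))   # -> t3
```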

  12. [Big-data stack diagram: layers App Paradigm (Hadoop, Dryad), Sharing, Virt Drawbacks, N/W, Tail Latency, N/W Sharing, SDN, Storage]

  13. Big-Data Stack: App Paradigms Revisited
  • User-visible services are complex and composed of multiple M-R jobs
  • FlumeJava & DryadLinQ
    • Delay execution until output is required (sketch below)
    • Allows for various optimizations
      • Storing output to HDFS between M-R jobs adds time: eliminate HDFS between jobs
      • Programmers aren’t smart and often have extra, unnecessary steps: knowing what is required for the output, you can eliminate the unnecessary ones
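
A minimal sketch of deferred (lazy) execution in the FlumeJava/DryadLinQ style, illustrative only: operations just build a graph, nothing runs until output is demanded, undemanded branches never execute, and adjacent per-element steps are fused so no intermediate result is materialized. The Lazy class is invented for this sketch.

```python
class Lazy:
    def __init__(self, fn=None, parent=None):
        self.fn, self.parent = fn, parent
    def map(self, fn):
        return Lazy(fn, parent=self)          # record the step, don't run it
    def chain(self):
        node, fns = self, []
        while node.fn is not None:
            fns.append(node.fn)
            node = node.parent
        return list(reversed(fns))
    def collect(self, data):
        fns = self.chain()                    # only steps on the demanded path
        out = []
        for x in data:                        # fused loop: no materialization
            for fn in fns:                    # between the chained maps
                x = fn(x)
            out.append(x)
        return out

pipeline = Lazy().map(lambda x: x * 2).map(lambda x: x + 1)
unused = pipeline.map(lambda x: -x)           # never demanded, so never runs
print(pipeline.collect([1, 2, 3]))            # [3, 5, 7]
```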

  14. [Stack diagram: FlumeJava and DryadLinQ added to the App Paradigm layer]

  15. Big-Data Stack: App Paradigms Revisited Yet Again
  • User-visible services require interactivity, so jobs need to be fast; jobs should return results before completing processing
  • Hadoop Online:
    • Pipeline results from map to reduce before the map is done
    • Pipeline too early and the reduce needs to do the sorting
      • Increases processing overhead on reduce: BAD!!!
  • RDDs: Spark
    • Store data in memory: much faster than disk
    • Instead of processing eagerly, create an abstract graph of the processing and do the processing only when output is required
      • Allows for optimizations
    • Failure recovery is the challenge (lineage sketch below)
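
A hypothetical sketch of lineage-based recovery in the RDD style, not Spark's implementation: each in-memory partition remembers how it was derived, so a lost partition is rebuilt by re-running its lineage rather than restored from a disk checkpoint.

```python
class RDD:
    def __init__(self, compute, parent=None):
        self.compute, self.parent = compute, parent
        self.cache = None                     # in-memory copy, may be lost
    def map(self, fn):
        return RDD(lambda data: [fn(x) for x in data], parent=self)
    def materialize(self):
        if self.cache is None:                # lost or never computed:
            source = self.parent.materialize() if self.parent else []
            self.cache = self.compute(source) # recompute from lineage
        return self.cache

base = RDD(lambda _: list(range(5)))          # source "partition"
derived = base.map(lambda x: x * x)
print(derived.materialize())                  # [0, 1, 4, 9, 16]
derived.cache = None                          # simulate losing the partition
print(derived.materialize())                  # rebuilt from lineage: same result
```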

  16. [Stack diagram: Hadoop Online and Spark added to the App Paradigm layer; Mesos and Omega added to the Sharing layer]

  17. Big-Data Stack: Sharing Is Caring (How to Share a Non-Virtualized Cluster)
  • Sharing is good: you have too much data, and it costs too much to build many clusters for the same data
  • Need dynamic sharing: if sharing is static, you can waste resources
  • Mesos:
    • Resource offers: give apps options of resources and let them pick
    • The app knows best
  • Omega:
    • Optimistic allocation: each scheduler picks resources, and if there’s a conflict Omega detects it and gives the resources to only one; the others pick new resources (sketch below)
    • Even with conflicts this is much better than a centralized entity
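
A minimal sketch of Omega-style optimistic allocation, illustrative rather than Omega's code: every scheduler reads shared cluster state, picks machines, and then tries to commit; a commit fails if another scheduler already claimed one of those machines, and the loser simply re-picks. SharedState and try_commit are invented names.

```python
class SharedState:
    def __init__(self, machines):
        self.free = set(machines)
    def try_commit(self, wanted):
        # Commit succeeds only if no other scheduler claimed these machines.
        if wanted <= self.free:
            self.free -= wanted
            return True
        return False              # conflict: the caller re-picks and retries

cell = SharedState({"m1", "m2", "m3", "m4"})
print(cell.try_commit({"m1", "m2"}))     # True: first scheduler gets m1, m2
print(cell.try_commit({"m2", "m3"}))     # False: m2 already taken, conflict
print(cell.try_commit({"m3", "m4"}))     # True after the loser re-picks
```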

  18. [Stack diagram repeated: App Paradigm and Sharing layers populated]

  19. Big-Data Stack: Sharing Is Caring (Cloud Sharing)
  • Clouds give the illusion of equality
    • H/W differences → different performance
    • Poor isolation → tenants can impact each other
      • I/O-bound and CPU-bound jobs can conflict

  20. [Stack diagram: BobTail, RFA, and CloudGaming added to the Virt Drawbacks layer]

  21. Big-Data Stack: Better Networks
  • Networks give bad performance
    • Cause: congestion + over-subscription
  • VL2/Portland
    • Eliminate over-subscription + congestion with commodity devices + ECMP (ECMP sketch below)
  • Helios/C-through
    • Mitigate congestion by carefully adding new capacity
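
An illustrative sketch of ECMP next-hop selection, the load-spreading mechanism VL2 and Portland rely on: the switch hashes the flow's 5-tuple and uses the result to pick one of several equal-cost uplinks, so one flow stays on one path while different flows spread across the fabric. The function and example values are made up.

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, uplinks):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    h = int.from_bytes(hashlib.md5(key).digest(), "big")
    return uplinks[h % len(uplinks)]          # same flow -> same uplink

uplinks = ["agg1", "agg2", "agg3", "agg4"]
print(ecmp_next_hop("10.0.0.1", "10.0.1.9", 5123, 80, "tcp", uplinks))
```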

  22. [Stack diagram: VL2, Portland, Helios, C-Thru, Hedera, and MicroTE added to the N/W layer]

  23. Big-Data Stack: Better Networks
  • When you need multiple servers to service a request, the slowest one determines the response time
    • If each server is fast 99% of the time, P(all 100 are fast) = 0.99^100 ≈ 0.37, so roughly 63% of requests hit at least one slow server (HORRIBLE)
  • Duplicate requests: send the same request to 2 servers (sketch below)
    • At least one will finish within an acceptable time
  • Dolly: be smart when selecting the 2 servers
    • You don’t want I/O contention, because that leads to bad performance
      • Avoid maps using the same replicas
      • Avoid reducers reading the same intermediate output
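
A quick check of the fan-out math, plus a toy simulation of request duplication. This is illustrative only; Dolly's actual clone placement avoids shared replicas and shared intermediate data, which this sketch does not model, and the latency numbers are made up.

```python
import random

p_fast = 0.99
print(f"P(all 100 sub-requests are fast) = {p_fast ** 100:.2f}")   # ~0.37

def latency():
    """One server: usually ~20 ms, but slow (1 s) with probability 1%."""
    return random.uniform(0.01, 0.03) if random.random() < 0.99 else 1.0

def single(n=100):
    return max(latency() for _ in range(n))          # wait for the slowest of n

def duplicated(n=100):
    return max(min(latency(), latency()) for _ in range(n))   # best of 2 clones

trials = 1000
print("slow requests (>0.5 s) without cloning:",
      sum(single() > 0.5 for _ in range(trials)))      # roughly 63% of trials
print("slow requests (>0.5 s) with 2 clones:  ",
      sum(duplicated() > 0.5 for _ in range(trials)))  # only a handful
```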

  24. [Stack diagram: Mantri and Dolly (Clones) added to the Tail Latency layer]

  25. Big-Data Stack: Network Sharing
  • How to share efficiently while making guarantees
  • ElasticSwitch
    • Two-level bandwidth allocation system
  • Orchestra
    • M/R has barriers, and completion is based on a set of flows, not individual flows
    • Make optimizations to a set of flows
  • HULL: trade bandwidth for latency (sketch below)
    • Want zero buffering, but TCP needs buffering
    • Limit traffic to 90% of the link and use the remaining 10% as headroom for buffering
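
A hedged sketch of the HULL idea of capping utilization below line rate: a simplified "phantom queue" that drains at 90% of link speed and signals congestion (ECN) once its virtual backlog grows, so senders back off before any real buffer builds up. All constants here are made up for illustration, and this simplifies the mechanism in the paper.

```python
LINK_BPS = 10e9
DRAIN_BPS = 0.9 * LINK_BPS        # pretend the link is only 90% as fast
MARK_THRESHOLD_BYTES = 15_000     # ~10 full-size packets of virtual backlog

class PhantomQueue:
    def __init__(self):
        self.backlog = 0.0        # virtual bytes, never actually buffered
        self.last_t = 0.0
    def on_packet(self, size_bytes, now):
        # Drain the virtual queue at 90% of line rate since the last packet.
        self.backlog = max(0.0, self.backlog - (now - self.last_t) * DRAIN_BPS / 8)
        self.last_t = now
        self.backlog += size_bytes
        return self.backlog > MARK_THRESHOLD_BYTES   # True => set ECN mark

pq = PhantomQueue()
t = 0.0
for i in range(100):
    t += 1500 * 8 / LINK_BPS      # packets arriving back-to-back at line rate
    if pq.on_packet(1500, t):
        print(f"packet {i}: ECN-marked (virtual backlog {pq.backlog:.0f} B)")
```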

  26. [Stack diagram: Hull, Elastic Cloud, and Orchestra added to the N/W Sharing layer]

  27. Big-Data Stack: Enter SDN
  • Remove the control plane from the switches and centralize it
  • Centralization == scalability challenges
    • NOX: how does it scale to data centers?
    • How many controllers do you need?
  • How should you design these controllers?
    • Kandoo: a hierarchy (many local controllers and 1 global controller; the local controllers communicate with the global one)
    • ONIX: a mesh (communication through a DHT or a DB)

  28. [Stack diagram: Kandoo and ONIX added to the SDN layer]

  29. Big-Data Stack: SDN + Big Data
  • FlowComb:
    • Detect app patterns and have the SDN controller assign paths based on knowledge of traffic patterns and contention
  • Sinbad:
    • HDFS writes are important
    • Let the SDN controller tell HDFS the best place to write data, based on knowledge of n/w congestion (sketch below)
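
A hypothetical sketch of Sinbad-style write placement, illustrative only: HDFS asks for a replica destination, and the choice is driven by the controller's view of how loaded each candidate's bottleneck link is. The function name and the utilization map are invented.

```python
def pick_replica_destination(candidates, link_utilization):
    """candidates: nodes eligible to store the new block.
    link_utilization: node -> utilization (0..1) of its bottleneck link,
    as reported by the (hypothetical) SDN controller."""
    return min(candidates, key=lambda n: link_utilization.get(n, 1.0))

utilization = {"n1": 0.85, "n2": 0.20, "n3": 0.55}   # controller's network view
print(pick_replica_destination(["n1", "n2", "n3"], utilization))   # -> n2
```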

  30. [Stack diagram: FlowComb and SinBaD added to the SDN layer]

  31. Big-Data Stack: Distributed Storage
  • Ideal: nice API, low latency, scalable
  • Problem: H/W fails a lot, lives in limited locations, and contains limited resources
  • Partition: gives good performance
    • Cassandra: uses consistent hashing (ring sketch below)
    • Megastore: each partition == an RDBMS with good consistency guarantees
  • Replicate: multiple copies avoid failures
    • Megastore: replicas allow for low latency
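
A minimal sketch of consistent hashing on a ring, the partitioning idea the slide attributes to Cassandra. It is illustrative: real systems add virtual nodes and replication. A key goes to the first node clockwise from its hash, so adding or removing a node only moves the keys in one arc of the ring.

```python
import bisect, hashlib

def h(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)
    def node_for(self, key: str) -> str:
        keys = [p for p, _ in self.points]
        i = bisect.bisect_right(keys, h(key)) % len(self.points)
        return self.points[i][1]              # first node clockwise of the key

ring = Ring(["nodeA", "nodeB", "nodeC"])
for k in ["user:1", "user:2", "user:3"]:
    print(k, "->", ring.node_for(k))
```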

  32. [Stack diagram: Megastore and Cassandra added to the Storage layer]

  33. Big-Data Stack: Disk Locality Irrelevant
  • Disk locality is becoming irrelevant
    • Data is getting smaller (compressed), so transfer times are smaller
    • Networks are getting much faster (rack-local is only 8% slower than local)
  • Memory locality is the new challenge
    • The input of 94% of jobs fits in memory
    • Need new caching + prefetching schemes

  34. The Complete Big-Data Stack
  • App Paradigm: Hadoop, Dryad, FlumeJava, DryadLinQ, HadoopOnline, Spark
  • Sharing: Mesos, Omega
  • Virt Drawbacks: BobTail, RFA, CloudGaming
  • N/W: VL2, Portland, Helios, C-Thru, Hedera, MicroTE
  • Tail Latency: Mantri, Dolly (Clones)
  • N/W Sharing: Hull, Elastic Cloud, Orchestra
  • SDN: Kandoo, ONIX, FlowComb, SinBaD
  • Storage: Megastore, Cassandra, FDS (disk-locality irrelevant)

