
Datacenter As a Computer

Explore datacenter computing strategies, storage hierarchy, networking topologies, power management, and application-level software. Learn about resource allocation models, memory management techniques, and network scheduling. Discover the challenges and advancements in datacenter operating systems.


Presentation Transcript


  1. Datacenter As a Computer Mosharaf Chowdhury EECS 582 – W16

  2. Announcements • Midterm grades are out • There were many interesting approaches. Thanks! • Meeting on April 11 moved earlier to April 8 (Friday) • No reviews for the papers for April 8 • Meeting on April 13 moved later to April 15 (Friday)

  3. Mid-Semester Presentations • Strictly followed 20 minutes per group • 15 + 5 minutes Q/A • Four parts: motivation, approach overview, current status, and end goal • March 21 • Juncheng and Youngmoon • Andrew • Dong-hyeon and Ofir • Hanyun, Ning, and Xianghan • March 23 • Clyde, Nathan, and Seth • Chao-han, Chi-fan, and Yayun • Chao and Yikai • Kuangyuan, Qi, and Yang

  4. Why is One Machine Not Enough? • Too much data • Too little storage capacity • Not enough I/O bandwidth • Not enough computing capability

  5. Warehouse-Scale Computers • Single organization • Homogeneity (to some extent) • Cost efficiency at scale • Multiplexing across applications and services • Rent it out! • Many concerns • Infrastructure • Networking • Storage • Software • Power/Energy • Failure/Recovery • …

  6. Architectural Overview (figure labels: Server, Memory Bus, PCIe, SATA, Ethernet, ToR, Aggregation)

  7. Datacenter Networks • Traditional hierarchical topology • Expensive • Difficult to scale • High oversubscription • Smaller path diversity • … (figure layers: Core, Agg., Edge)

  8. Datacenter Networks • Clos topology • Cheaper • Easier to scale • No/low oversubscription • Higher path diversity • … (figure layers: Core, Agg., Edge)
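
A minimal sketch of the scaling and oversubscription claims on the two network slides. It assumes a k-ary fat-tree, which is one common Clos instance; the slides do not name a specific design. The function names and the example port counts are hypothetical.

```python
def fat_tree_size(k: int) -> dict:
    """Element counts for a k-ary fat-tree built from identical k-port switches.

    Assumption: the Clos network is a k-ary fat-tree (k must be even).
    """
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive even number")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,           # k/2 edge switches per pod
        "agg_switches": k * half,            # k/2 aggregation switches per pod
        "core_switches": half * half,        # (k/2)^2
        "hosts": k * half * half,            # k^3 / 4
        "inter_pod_paths": half * half,      # equal-cost paths between pods
    }


def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    """Ratio of host-facing bandwidth to uplink bandwidth at one switch."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)


if __name__ == "__main__":
    print(fat_tree_size(48)["hosts"])        # 27648 hosts from 48-port switches
    print(oversubscription(40, 1, 2, 10))    # hypothetical hierarchical ToR: 2.0
    print(oversubscription(24, 10, 24, 10))  # fat-tree edge switch: 1.0
```

A ratio of 1.0 means hosts can, in principle, send at full line rate out of the rack; higher ratios mean the uplinks are shared.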

  9. Storage Hierarchy • L1 cache • L2 cache • L3 cache • RAM • 3D XPoint • SSD • HDD • Across machines, racks, and pods (https://www.youtube.com/watch?v=IWsjbqbkqh8)

  10. Power, Energy, Modeling, Building, … • Many challenges • We’ll focus primarily on software infrastructure in this class

  11. Datacenter Needs an Operating System • A datacenter is a collection of • CPU cores • Memory modules • SSDs and HDDs • All connected by an interconnect • A computer is a collection of • CPU cores • Memory modules • SSDs and HDDs • All connected by an interconnect

  12. Some Differences • High level of parallelism • Diversity of workloads • Resource heterogeneity • Failure is the norm • Communication dictates performance

  13. Three Categories of Software • Platform-level • Software and firmware present on every machine • Cluster-level • Distributed systems to enable everything • Application-level • User-facing applications built on top

  14. Common Techniques

  15. Common Techniques

  16. Datacenter Programming Models • Fault-tolerant, scalable, and easy access to all the distributed datacenter resources • Users submit jobs to these models without having to worry about low-level details • MapReduce • Grandfather of big data as we know it today • Two-stage, disk-based, network-avoiding • Spark • Common substrate for diverse programming requirements • Many-stage, memory-first
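
The two-stage structure of MapReduce can be sketched on a single machine as below. This is only an in-memory illustration of the map, group-by-key, and reduce steps; it is not the Hadoop or Google API, and it leaves out the distribution, disk spilling, and fault tolerance that the slide refers to. All names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_reduce(records: Iterable[str],
               mapper: Callable[[str], Iterator[Tuple[str, int]]],
               reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Run a map stage, group values by key, then run a reduce stage."""
    groups: Dict[str, List[int]] = defaultdict(list)
    # Stage 1 (map): each record emits zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # stands in for the shuffle
    # Stage 2 (reduce): each key's values are combined independently.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line: str) -> Iterator[Tuple[str, int]]:
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word: str, counts: List[int]) -> int:
    return sum(counts)


if __name__ == "__main__":
    lines = ["a b a", "b c"]
    print(map_reduce(lines, word_count_mapper, word_count_reducer))
    # {'a': 2, 'b': 2, 'c': 1}
```

The user writes only the mapper and the reducer; everything inside `map_reduce` is what the framework hides.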

  17. Datacenter “Operating Systems” • Fair and efficient distribution of resources among many competing programming models and jobs • Does the dirty work so that users won’t have to • Mesos • Started with a simple question – how to run different versions of Hadoop? • Fairness-first allocator • Borg • Google’s cluster manager • Utilization-first allocator

  18. Resource Allocation and Scheduling • How do we divide the resources anyway? • DRF • Multi-resource max-min fairness • Two-level; implemented in Mesos and YARN • HUG: DRF + High utilization • Omega • Shared-state resource allocator • Many schedulers interact through transactions
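
A minimal sketch of DRF as progressive filling: repeatedly give one more task to the user with the smallest dominant share. The example numbers (9 CPUs and 18 GB shared by two users) follow the running example in the DRF paper. The function name is hypothetical, and one detail is an assumption of this sketch: a user whose next task no longer fits is dropped so that other users can keep filling.

```python
from typing import Dict


def drf_allocate(capacity: Dict[str, float],
                 demands: Dict[str, Dict[str, float]]) -> Dict[str, int]:
    """Return the number of tasks each user gets under DRF.

    capacity: total amount of each resource.
    demands:  per-task demand of each user, keyed by resource name.
    """
    for user, demand in demands.items():
        if not any(demand.get(r, 0) > 0 for r in capacity):
            raise ValueError(f"user {user!r} must demand some resource")

    used = {r: 0.0 for r in capacity}
    tasks = {u: 0 for u in demands}
    dominant_share = {u: 0.0 for u in demands}
    active = set(demands)

    while active:
        # Pick the user with the smallest dominant share (ties: by name).
        user = min(sorted(active), key=lambda u: dominant_share[u])
        demand = demands[user]
        if any(used[r] + demand.get(r, 0) > capacity[r] for r in capacity):
            active.remove(user)  # assumption: skip users that no longer fit
            continue
        for r in capacity:
            used[r] += demand.get(r, 0)
        tasks[user] += 1
        # Dominant share = largest fraction of any one resource the user holds.
        dominant_share[user] = max(
            tasks[user] * demand.get(r, 0) / capacity[r] for r in capacity
        )
    return tasks


if __name__ == "__main__":
    capacity = {"cpu": 9, "mem": 18}
    demands = {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}}
    print(drf_allocate(capacity, demands))  # {'A': 3, 'B': 2}
```

In the result, user A holds 12 of 18 GB and user B holds 6 of 9 CPUs, so both end with a dominant share of 2/3.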

  19. File Systems • Fault-tolerant, efficient access to data • GFS • Data resides with compute resources • Compute goes to data; hence, data locality • The game changer: centralization isn’t too bad! • FDS • Data resides separately from compute • Data comes to compute; hence, requires very fast network

  20. Memory Management • What to store in cache and what to evict? • PACMan • Disk locality is irrelevant for fast-enough network • All-or-nothing property: caching is useless unless all tasks’ inputs are cached • Best eviction algorithm for single machine isn’t so good for parallel computing • Parameter Server • Shared-memory architecture (sort of) • Data and compute are still collocated, but communication is automatically batched to minimize overheads
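
The all-or-nothing property can be illustrated with a toy model. It assumes that all tasks of a job run in parallel in a single wave, so the job finishes when its slowest task finishes. The task times (10 s from disk, 1 s from memory) and the function name are hypothetical.

```python
from typing import Sequence


def wave_completion_time(cached: Sequence[bool],
                         disk_time: float = 10.0,
                         memory_time: float = 1.0) -> float:
    """Completion time of one wave of parallel tasks.

    cached[i] says whether task i reads its input from the memory cache.
    The wave is as slow as its slowest task.
    """
    if not cached:
        raise ValueError("need at least one task")
    return max(memory_time if hit else disk_time for hit in cached)


if __name__ == "__main__":
    print(wave_completion_time([False, False, False, False]))  # 10.0
    print(wave_completion_time([True, True, True, False]))     # 10.0, no gain
    print(wave_completion_time([True, True, True, True]))      # 1.0
```

Caching three of four inputs gives a 75% hit rate but no speedup, which is why a per-block hit-rate policy that works well on one machine can do poorly for parallel jobs.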

  21. Network Scheduling • Communication cannot be avoided; how do we minimize its impact? • DCTCP • Application-agnostic; point-to-point • Outperforms TCP through ECN-enabled multi-level congestion notifications • Varys • Application-aware; multipoint-to-multipoint; all-or-nothing in communication • Concurrent open-shop scheduling with coupled resources • Centralized network bandwidth management
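
A sketch of the sender-side DCTCP reaction to ECN marks, to show what "multi-level" means: the window is cut in proportion to the fraction of marked packets rather than always halved. The update rules follow the DCTCP paper (alpha ← (1 − g)·alpha + g·F, then cwnd ← cwnd·(1 − alpha/2)). The function name is hypothetical, and the additive-increase part of TCP is left out.

```python
from typing import Tuple


def dctcp_update(alpha: float, cwnd: float,
                 marked: int, total: int,
                 g: float = 1 / 16) -> Tuple[float, float]:
    """Apply one per-window DCTCP update.

    alpha:  running estimate of the fraction of ECN-marked packets.
    cwnd:   congestion window, in packets.
    marked: ECN-marked ACKs seen in the last window.
    total:  all ACKs seen in the last window.
    """
    if total <= 0 or not 0 <= marked <= total:
        raise ValueError("need 0 <= marked <= total and total > 0")
    fraction = marked / total
    alpha = (1 - g) * alpha + g * fraction
    if marked > 0:
        # Mild congestion (small alpha) gives a small cut; alpha = 1 halves.
        cwnd = max(1.0, cwnd * (1 - alpha / 2))
    return alpha, cwnd


if __name__ == "__main__":
    print(dctcp_update(0.0, 100.0, marked=1, total=10))   # small reduction
    print(dctcp_update(1.0, 100.0, marked=10, total=10))  # (1.0, 50.0)
```

With persistent full marking the sender behaves like standard TCP and halves its window; with light marking it backs off only slightly, which keeps switch queues short without giving up throughput.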

  22. Unavailability and Failure • In a 10,000-server DC with 10,000-day MTBF machines, one machine will fail every day on average • Build fault-tolerant software infrastructure and hide failure-handling complexity from application-level software as much as possible • Configuration is one of the largest sources of service disruption • Storage subsystems are the biggest sources of machine crashes • Tolerating/surviving failures is different from hiding failures
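
The arithmetic behind the first bullet, as a small sketch. The expected failure count is simply servers divided by MTBF. The probability estimate additionally assumes independent machines with a constant failure rate (an exponential model), which the slide does not state. Function names are hypothetical.

```python
import math


def expected_failures_per_day(servers: int, mtbf_days: float) -> float:
    """Average number of machine failures per day across the datacenter."""
    return servers / mtbf_days


def prob_at_least_one_failure(servers: int, mtbf_days: float,
                              days: float = 1.0) -> float:
    """Chance of seeing one or more failures within `days`.

    Assumption: independent failures with a constant rate of 1/MTBF each.
    """
    return 1.0 - math.exp(-servers * days / mtbf_days)


if __name__ == "__main__":
    print(expected_failures_per_day(10_000, 10_000))            # 1.0
    print(round(prob_at_least_one_failure(10_000, 10_000), 3))  # 0.632
```

Under this model, a machine that fails about once in 27 years still produces a datacenter-wide failure on most days, which is why failure handling belongs in the infrastructure rather than in each application.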

  23. What’s the most critical resource in a datacenter? • Why?

  24. Will we come back to client-centric models? • As opposed to the server-centric/datacenter-driven model of today • If yes, why and when? • If not, why not?
