
Warehouse-Scale Computing

Warehouse-Scale Computing. Mu Li, Kiryong Ha. 10/17/2012, 15-740 Computer Architecture. Overview: Motivation; explore architectural issues as computing moves toward the cloud; impact of sharing memory-subsystem resources (LLC, memory bandwidth, …).


Presentation Transcript


  1. Warehouse-Scale Computing Mu Li, Kiryong Ha 10/17/2012 15-740 Computer Architecture

  2. Overview • Motivation • Explore architectural issues as computing moves toward the cloud • Impact of sharing memory-subsystem resources (LLC, memory bandwidth, …) • Maximize resource utilization by co-locating applications without hurting QoS • Inefficiencies of traditional processors running scale-out workloads

  3. Overview

  4. Impact of memory subsystem sharing

  5. Impact of memory subsystem sharing • Motivation & problem definition • Machines are multi-core and multi-socket • For better utilization, applications should share the Last-Level Cache (LLC) / Front-Side Bus (FSB) → It is important to understand the memory-sharing interaction between (datacenter) applications

  6. Impact of thread-to-core mapping • Sharing cache, separate FSBs (XX..XX..) • Sharing cache, sharing FSBs (XXXX....) • Separate caches, separate FSBs (X.X.X.X.)
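The three thread-to-core (TTC) mappings above can be realized with CPU affinity. Below is a minimal Linux sketch; the 8-core dual-socket layout and the `cores_for`/`pin` helpers are illustrative assumptions, not part of the paper:

```python
import os

# Assumed topology for illustration: 8 cores, two sockets, where cores
# 0-3 share one LLC/FSB and cores 4-7 share the other.
SOCKET0 = [0, 1, 2, 3]
SOCKET1 = [4, 5, 6, 7]

def cores_for(mapping):
    """Return the core set for a 4-thread app under each TTC mapping.
    'XXXX....'  sharing cache, sharing FSB   -> all threads on one socket
    'XX..XX..'  sharing cache, separate FSBs -> two threads per socket
    'X.X.X.X.'  separate caches, separate FSBs -> spread one per cache pair
    """
    if mapping == "XXXX....":
        return set(SOCKET0)
    if mapping == "XX..XX..":
        return set(SOCKET0[:2]) | set(SOCKET1[:2])
    if mapping == "X.X.X.X.":
        # every other core: threads avoid sharing a cache pair on this layout
        return {0, 2, 4, 6}
    raise ValueError(mapping)

def pin(mapping):
    # Pin the calling process to the chosen cores (Linux-only syscall wrapper).
    os.sched_setaffinity(0, cores_for(mapping))
```

In practice each benchmark run would call `pin(...)` before spawning worker threads, so all children inherit the affinity mask.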

  7. Impact of thread-to-core mapping • Performance varies by up to 20% • Each application shows a different trend • TTC behavior changes depending on the co-located application <CONTENT ANALYZER co-located with other applications>

  8. Observation • Performance can swing significantly based simply on how application threads are mapped to cores • The best TTC mapping changes depending on the co-located program • Application characteristics that impact performance: memory bus usage, cache-line sharing, cache footprint • Ex) CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint → it works better when it does not share the LLC and FSB. STITCH uses more bus bandwidth, so a co-located CONTENT ANALYZER will contend with it on the FSB
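The observation above suggests a simple decision rule for picking a TTC mapping from the three characteristics. The sketch below is an illustrative heuristic only; the `pick_mapping` name and all thresholds are assumptions, not values from the paper:

```python
# Illustrative heuristic: choose a TTC mapping from normalized (0..1)
# application characteristics. Thresholds are assumed for illustration.
def pick_mapping(bus_usage, cache_line_sharing, cache_footprint_mb, llc_mb=4):
    if cache_line_sharing > 0.5:
        # Heavy inter-thread sharing: keep threads on one shared LLC.
        return "XXXX...."
    if bus_usage > 0.5 or cache_footprint_mb > llc_mb:
        # High bus pressure or a footprint exceeding the LLC: isolate
        # caches and FSBs to reduce contention (the CONTENT ANALYZER case).
        return "X.X.X.X."
    # Otherwise a balanced placement across both FSBs.
    return "XX..XX.."
```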

  9. Increasing Utilization in Warehouse scale Computers via Co-location

  10. Increasing Utilization via Co-location • Motivation • Cloud computing wants higher resource utilization • However, overprovisioning is used to ensure performance isolation for latency-sensitive tasks, which lowers utilization → Need precise prediction of shared-resource interference for better utilization without violating QoS <Google’s web search QoS when co-located with other products>

  11. Bubble-up Methodology • QoS sensitivity curve • Get the sensitivity of the application by iteratively increasing the amount of pressure on the memory subsystem • Bubble score • Get the amount of pressure the application causes on a reporter <sensitivity curve for Bigtable> <sensitivity curve> <pressure score>

  12. Better Utilization • Now we know • how QoS changes depending on bubble size (sensitivity curve) • how the application can affect others (bubble score) • Can co-locate applications by estimating the change in QoS <utilization improvement with search-render under each QoS target>
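The Bubble-up prediction step combines the two measurements: look up the batch application's bubble score on the latency-sensitive application's sensitivity curve. A minimal sketch, assuming linear interpolation between measured bubble sizes (the function names and the 98% QoS threshold are illustrative, not from the paper):

```python
import bisect

def predict_qos(sensitivity_curve, bubble_score):
    """Predict the QoS of a latency-sensitive app when co-located with a
    batch app whose measured memory-subsystem pressure is `bubble_score`.

    sensitivity_curve: sorted list of (bubble_size, qos) points measured
    by running the app against an expanding memory "bubble".
    """
    xs = [x for x, _ in sensitivity_curve]
    ys = [y for _, y in sensitivity_curve]
    if bubble_score <= xs[0]:
        return ys[0]
    if bubble_score >= xs[-1]:
        return ys[-1]
    # Linear interpolation between the two surrounding measured points.
    i = bisect.bisect_left(xs, bubble_score)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (bubble_score - x0) / (x1 - x0)

def safe_to_colocate(sensitivity_curve, bubble_score, qos_threshold=0.98):
    # Allow co-location only if predicted QoS stays above the target.
    return predict_qos(sensitivity_curve, bubble_score) >= qos_threshold
```

A scheduler could then pack batch jobs onto a machine hosting a latency-sensitive service whenever `safe_to_colocate` holds for the combined pressure.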

  13. Scale-out workload

  14. Scale-out workload • Examples: • Data Serving • MapReduce • Media Streaming • SAT Solver • Web Frontend • Web Search

  15. Execution-time breakdown • A major part of execution time is spent waiting for cache misses → a clear micro-architectural mismatch

  16. Frontend inefficiencies • Cores sit idle due to high instruction-cache miss rates • L2 caches increase average instruction-fetch latency • Excessive LLC capacity leads to long instruction-fetch latency • How to improve? • Bring instructions closer to the cores

  17. Core inefficiencies • Low instruction-level parallelism prevents effective use of the full core width • Low memory-level parallelism underutilizes reorder buffers and load-store queues • How to improve? • Run many threads together: a multi-threaded, multi-core architecture

  18. Data-access inefficiencies • A large LLC consumes area but does not improve performance • Simple data prefetchers are ineffective • How to improve? • Shrink the LLC to make room for more cores

  19. Bandwidth inefficiencies • Lack of data sharing makes elaborate coherence and connectivity unnecessary • Provisioned off-chip bandwidth exceeds needs by an order of magnitude • How to improve? • Scale back the on-chip interconnect and off-chip memory bus to make room for more cores

  20. Scale-out processors • So: the LLC, interconnect, and memory bus are oversized, while there are not enough cores • A scale-out processor rebalances these resources and improves throughput by 5x-6.5x!

  21. Q&A or Discussion

  22. Supplement slides

  23. Datacenter Applications - Google’s production applications

  24. Key takeaways • TTC behavior is mostly determined by • memory bus usage (for FSB sharing) • data sharing: cache-line sharing • cache footprint: use last-level cache misses to estimate footprint size • Example • CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint → it works better when it does not share the LLC and FSB • STITCH actually uses more bus bandwidth, so it is better for CONTENT ANALYZER not to share an FSB with STITCH

  25. 1% prediction error on average Prediction accuracy for pairwise co-locations of Google applications
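The accuracy figure above is an average over pairwise co-location predictions. A small sketch of how such a number is computed; the function name and the sample values are illustrative, not the paper's data:

```python
def mean_abs_error_pct(predicted, measured):
    """Average |predicted - measured| QoS over co-location pairs,
    expressed as a percentage."""
    errors = [abs(p - m) for p, m in zip(predicted, measured)]
    return 100.0 * sum(errors) / len(errors)
```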
