Warehouse-Scale Computing Mu Li, Kiryong Ha 10/17/2012 15-740 Computer Architecture
Overview • Motivation • Explore architectural issues as computing moves toward the cloud • Impact of sharing memory subsystem resources (LLC, memory bandwidth, etc.) • Maximize resource utilization by co-locating applications without hurting QoS • Inefficiencies of traditional processors running scale-out workloads
Impact of memory subsystem sharing • Motivation & Problem definition • Machines are multi-core and multi-socket • For better utilization, applications must share the Last Level Cache (LLC) / Front Side Bus (FSB) → It is important to understand the memory-sharing interactions between (datacenter) applications
Impact of thread-to-core mapping • Sharing cache, separate FSBs (XX..XX..) • Sharing cache, sharing FSB (XXXX….) • Separate caches, separate FSBs (X.X.X.X.) (see the pinning sketch below)
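To make these thread-to-core (TTC) mappings concrete, here is a minimal C sketch of pinning threads to explicit cores on Linux with pthread_setaffinity_np. The core IDs are illustrative assumptions; which numbering actually separates caches and FSBs depends on the machine's topology.

```c
// Pin four worker threads to explicit cores to realize a TTC mapping.
// Example: the "X.X.X.X." mapping (separate caches, separate FSBs)
// might correspond to cores 0, 2, 4, 6 on a 2-socket machine with two
// dual-core dies per socket -- check /sys/devices/system/cpu/*/topology.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    /* one application thread's work would go here */
    return NULL;
}

int main(void) {
    int cpu_ids[4] = {0, 2, 4, 6};   /* illustrative core numbering */
    pthread_t threads[4];

    for (int i = 0; i < 4; i++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu_ids[i], &set);
        pthread_create(&threads[i], NULL, worker, NULL);
        /* bind the thread so the OS scheduler cannot migrate it */
        pthread_setaffinity_np(threads[i], sizeof(set), &set);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```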
Impact of thread-to-core mapping • Performance varies by up to 20% • Each application shows a different trend • TTC behavior changes depending on the co-located application <figure: CONTENT ANALYZER co-located with other applications>
Observation • Performance can swing significantly based simply on how application threads are mapped to cores • The best TTC mapping depends on the co-located program • Application characteristics that impact performance: memory bus usage, cache line sharing, cache footprint • Ex) CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint → it works better when it does not share the LLC and FSB; STITCH uses more bus bandwidth, so a co-located CONTENT ANALYZER would contend with it on the FSB
Increasing Utilization in Warehouse-Scale Computers via Co-location
Increasing Utilization via Co-location • Motivation • Cloud providers want higher resource utilization • However, machines are overprovisioned to ensure performance isolation for latency-sensitive tasks, which lowers utilization → Need precise prediction of shared-resource interference to improve utilization without violating QoS <figure: QoS of Google's web search when co-located with other products>
Bubble-Up Methodology • QoS sensitivity curve • Measure the application's sensitivity by iteratively increasing the amount of pressure applied to the memory subsystem • Bubble score • Measure the amount of pressure the application itself puts on a reporter <figures: sensitivity curve for Bigtable; pressure score>
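The "bubble" itself is essentially a tunable memory-pressure generator. The sketch below is an illustrative stand-in, not the paper's actual bubble/reporter code: it streams over a working set whose size acts as the pressure dial on the LLC and memory bandwidth.

```c
/* Illustrative bubble: apply tunable pressure to the memory subsystem
   by repeatedly touching one byte per cache line over a working set
   of the requested size. Usage: ./bubble <megabytes>                 */
#include <stdio.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* bytes; typical x86 line size */

static void bubble(size_t bytes) {
    volatile char *buf = malloc(bytes);
    if (!buf) { perror("malloc"); exit(1); }
    for (;;)                                  /* run until killed */
        for (size_t i = 0; i < bytes; i += CACHE_LINE)
            buf[i]++;                         /* dirty each line */
}

int main(int argc, char **argv) {
    size_t mb = argc > 1 ? strtoul(argv[1], NULL, 10) : 1;
    bubble(mb << 20);   /* pressure knob: working-set size in MB */
    return 0;
}
```

Sweeping the size argument while recording the latency-sensitive application's QoS at each step yields its sensitivity curve.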
Better Utilization • Now we know • how QoS changes with bubble size (QoS sensitivity curve) • how much an application affects others (bubble score) • Applications can be co-located by estimating the resulting change in QoS <figure: utilization improvement with search-render under each QoS target>
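As a sketch of how the two measurements combine (the curve values and QoS target below are made up for illustration): index the sensitive application's sensitivity curve with the candidate's bubble score, and co-locate only if the predicted QoS clears the target.

```c
/* Illustrative Bubble-Up co-location test. qos_curve[i] is the
   latency-sensitive app's measured QoS with an i-MB bubble running
   beside it; the candidate batch app's bubble score indexes into it. */
#include <stdio.h>

static const double qos_curve[] =   /* made-up example measurements */
    {1.00, 0.99, 0.97, 0.93, 0.86, 0.78, 0.70, 0.64, 0.60};
enum { CURVE_LEN = sizeof qos_curve / sizeof qos_curve[0] };

static double predict_qos(int bubble_mb) {
    if (bubble_mb < 0) bubble_mb = 0;
    if (bubble_mb >= CURVE_LEN) bubble_mb = CURVE_LEN - 1;
    return qos_curve[bubble_mb];
}

int main(void) {
    double qos_target = 0.90;   /* e.g. tolerate a 10% QoS degradation */
    int bubble_score = 3;       /* candidate's measured pressure, in MB */
    double predicted = predict_qos(bubble_score);
    printf("predicted QoS %.2f -> %s\n", predicted,
           predicted >= qos_target ? "co-locate" : "keep isolated");
    return 0;
}
```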
Scale-out workloads • Examples: • Data Serving • MapReduce • Media Streaming • SAT Solver • Web Frontend • Web Search
Execution-time breakdown • A major part of execution time is spent waiting on cache misses → A clear micro-architectural mismatch
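Breakdowns like this come from hardware performance counters. As a minimal, Linux-only sketch (assuming a perf-capable kernel; this is not the paper's measurement infrastructure), the snippet below counts last-level-cache read misses around a region of interest with the raw perf_event_open interface:

```c
/* Count LLC read misses for a region of interest via perf_event_open. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CACHE_LL |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* pid = 0 (this process), cpu = -1 (any CPU) */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the workload of interest here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof misses);
    printf("LLC read misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```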
Frontend inefficiencies • Cores idle due to high instruction-cache miss rates • L2 caches increase average I-fetch latency • Excessive LLC capacity leads to long I-fetch latency • How to improve? • Bring instructions closer to the cores
Core inefficiencies • Low instruction-level parallelism precludes effectively using the full core width • Low memory-level parallelism underutilizes reorder buffers and load-store queues • How to improve? • Run many things together: multi-threaded, multi-core architectures
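The memory-level-parallelism (MLP) point can be demonstrated with a toy experiment (a sketch assuming a working set much larger than the LLC): one dependent pointer chase serializes its misses, while two independent chases let the out-of-order core overlap them, so the same number of loads finishes in noticeably less time.

```c
/* Low vs. higher MLP: dependent pointer chasing over an LLC-sized-plus
   array. One chain exposes MLP = 1; two independent chains overlap. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M entries (128 MB), far larger than the LLC */

/* Sattolo's algorithm: a random permutation forming one big cycle,
   so each load depends on the previous one. */
static size_t *make_chain(void) {
    size_t *next = malloc((size_t)N * sizeof *next);
    if (!next) exit(1);
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    return next;
}

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    size_t *c1 = make_chain(), *c2 = make_chain();

    double t0 = seconds();
    size_t p = 0;
    for (long i = 0; i < N; i++) p = c1[p];   /* one chain: MLP = 1 */
    double t1 = seconds();

    size_t q = 0, r = 0;
    for (long i = 0; i < N / 2; i++) {        /* two chains overlap */
        q = c1[q]; r = c2[r];
    }
    double t2 = seconds();

    /* print the sink values so the loops are not optimized away */
    printf("1 chain: %.2fs, 2 chains: %.2fs (sink %zu)\n",
           t1 - t0, t2 - t1, p + q + r);
    return 0;
}
```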
Data-access inefficiencies • Large LLC consumes area, but does not improve performance • Simple data prefetchers are ineffective • How to improve? • Reduce the LLC and free the area for more cores
Bandwidth inefficiencies • Lack of data sharing renders coherence and rich on-chip connectivity unnecessary • Off-chip bandwidth exceeds needs by an order of magnitude • How to improve? • Scale back the on-chip interconnect and off-chip memory bus to make room for more cores
Scale-out processors • In short: the LLC, interconnect, and memory bus are oversized, while there are not enough cores • A scale-out design rebalances them, improving throughput by 5x-6.5x!
Datacenter Applications <Google's production applications>
Key takeaways • TTC behavior is mostly determined by • Memory bus usage (for FSB sharing) • Data sharing: cache line sharing • Cache footprint: use last-level cache misses to estimate the footprint size • Example • CONTENT ANALYZER has high bus usage, little cache sharing, and a large cache footprint → it works better when it does not share the LLC and FSB • STITCH actually uses more bus bandwidth, so it is better for CONTENT ANALYZER not to share an FSB with STITCH
<figure: prediction accuracy for pairwise co-locations of Google applications; 1% prediction error on average>