Instant Profiling: Instrumentation Sampling for Profiling Datacenter Applications
Hyoun Kyu Cho¹, Tipp Moseley², Richard Hank², Derek Bruening², Scott Mahlke¹
¹University of Michigan, ²Google
Datacenter Applications (http://googleblog.blogspot.com)
• In 2010, US datacenters consumed 70 to 90 billion kWh [Koomey`11]
• Datacenter application performance is critical
• Profiling can help
Traditional Profiling
[Diagram: Source Code, Instrumentation Build, Instrumented Binary, Input Data, Training Run, Profile Data]
• Challenges for Datacenters
  • Need to run on live traffic
  • Difficult to isolate
  • Overheads
    • Value profiling: 3.8x slowdown [Calder`99]
    • Path profiling: 31%, edge profiling: 16% [Ball`96]
  • Binary management
    • Many programs, multiple versions
Google-Wide Profiling [Ren et al.`10]
• Continuous profiling infrastructure for datacenters
• Negligible overhead
  • Sampling based
  • Aggregated profiling overhead less than 0.01%
• Limitations
  • Relies heavily on Performance Monitoring Units
  • Limited flexibility and portability
Goals
• Unified profiling infrastructure for datacenters
  • Flexible types of profile data
  • Portable across heterogeneous datacenters
• While maintaining
  • Low overhead
  • No added burden on binary management
• Approach: Dynamic Binary Instrumentation + Sampling
Instrumentation Sampling
[Diagram: application, system call gateway, operating system, hardware]
Instrumentation Sampling
[Diagram: application running under DynamoRIO [Bruening`04] with dispatch, instrumentation engine, client, context switch, and code cache; operating system; hardware]
Instrumentation Sampling
[Diagram: application with dispatch, instrumentation engine, client, and code cache; a shepherding thread issues "start profiling" and "stop profiling"; operating system; hardware]
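The start/stop behavior in this diagram can be illustrated with a small, deterministic simulation. This is only a sketch under stated assumptions, not the talk's implementation: the class `SampledProfiler`, the knobs `profile_ticks` and `period_ticks`, and the notion of a "tick" are all hypothetical stand-ins for the shepherding thread's schedule.

```python
from dataclasses import dataclass, field


@dataclass
class SampledProfiler:
    """Toy model: a shepherd toggles profiling on a fixed schedule."""
    profile_ticks: int   # hypothetical knob: ticks per period spent profiling
    period_ticks: int    # hypothetical knob: length of one sampling period
    profiling: bool = False
    samples: list = field(default_factory=list)

    def shepherd(self, tick):
        # Profiling is on for the first profile_ticks of every period,
        # mimicking "start profiling" / "stop profiling" requests.
        self.profiling = (tick % self.period_ticks) < self.profile_ticks

    def execute(self, tick, block):
        # Only the instrumented path records an event; the native path
        # does no profiling work at all.
        if self.profiling:
            self.samples.append((tick, block))


def run(trace, profiler):
    """Replay a trace of basic-block names, one block per tick."""
    for tick, block in enumerate(trace):
        profiler.shepherd(tick)
        profiler.execute(tick, block)
    return profiler.samples


trace = ["A", "B", "C", "D"] * 5
print(run(trace, SampledProfiler(profile_ticks=2, period_ticks=10)))
# [(0, 'A'), (1, 'B'), (10, 'C'), (11, 'D')]
```

Only 4 of the 20 ticks are observed, which is the point of sampling: the application runs uninstrumented most of the time.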
Problems with Basic Implementation
• Unbounded profiling periods due to fragment linking
• Latency degradation due to initial instrumentation
• Multi-threaded programs
Temporal Unlinking/Relinking of Fragments
[Diagram: code cache holding BB1 and BB2 with a direct link BB2->BB1; dispatch; context switch]
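Why linked fragments make profiling periods unbounded, and how unlinking fixes it, can be shown with a toy code cache. This is a minimal sketch under assumptions: `Fragment`, `run_in_cache`, and `set_links` are hypothetical names, and a real code cache patches branch targets rather than flipping a flag.

```python
class Fragment:
    """A cached basic block with one successor (hypothetical model)."""

    def __init__(self, name, successor):
        self.name = name
        self.successor = successor  # name of the next block to run
        self.linked = True          # True: jump straight to the successor


def run_in_cache(cache, start, max_steps):
    """Run until control returns to the dispatcher or max_steps is reached.

    Returns (executed block names, block the dispatcher should run next),
    with None as the second item if the dispatcher was never reached.
    """
    executed = []
    current = start
    while len(executed) < max_steps:
        frag = cache[current]
        executed.append(frag.name)
        if not frag.linked:
            # Unlinked exit: control goes back to the dispatcher, which
            # is where a pending stop request can take effect.
            return executed, frag.successor
        current = frag.successor
    return executed, None


def set_links(cache, linked):
    """Temporarily unlink (False) or relink (True) every fragment."""
    for frag in cache.values():
        frag.linked = linked


cache = {"BB1": Fragment("BB1", "BB2"), "BB2": Fragment("BB2", "BB1")}
print(run_in_cache(cache, "BB1", 6))  # loop stays inside the cache
set_links(cache, False)
print(run_in_cache(cache, "BB1", 6))  # (['BB1'], 'BB2')
```

With the BB2->BB1 link in place, the loop never leaves the cache within the step budget. After unlinking, the first fragment exit hands control back to the dispatcher.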
S/W Code Cache Pre-population
[Diagram: application with dispatch, instrumentation engine, client, and code cache; shepherding thread; operating system; hardware]
• Still have latency degradation for initial instrumentation phases
Multithreaded Program Support
• Sampling makes it possible to miss thread operations
• Forces Instant Profiling’s signal handler for every thread
• Enumerates all threads and sends profiling start signal to each thread
Experimental Setup
• 6-core Intel Xeon 2.67GHz w/ 12MB L3
• 12GB main memory
• Linux kernel 2.6.32
• gcc 4.4.3 w/ -O3
• SPEC INT2006, BigTable, Web search
• Edge profiling client
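For readers unfamiliar with edge profiling, the idea is to count how often control flows from one basic block to the next. The following is a minimal sketch of that counting step only, assuming a trace of basic-block names is already available. The function name `edge_profile` is hypothetical, and this is not the client used in the experiments.

```python
from collections import Counter


def edge_profile(block_trace):
    """Count (source, destination) pairs between consecutive basic blocks."""
    return Counter(zip(block_trace, block_trace[1:]))


profile = edge_profile(["A", "B", "A", "B", "C"])
print(profile[("A", "B")])  # 2
print(profile[("B", "C")])  # 1
```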
Conclusion
• Low-overhead, portable, flexible profiling needed
• Instant Profiling
  • Combines sampling and DBI
  • Pre-populates S/W code cache
  • Tunable tradeoff between overhead and information
  • Provides eventual profiling accuracy
• Less than 5% overhead, more than 80% accuracy for naïve edge profiling client
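The overhead side of the tunable tradeoff can be approximated with a first-order model: if a fraction of execution time runs instrumented at some slowdown, the expected overhead is that fraction times the extra cost. This is an assumption made for illustration, not a formula from the talk. It ignores startup and code-cache costs, and the name `expected_overhead` and the numbers below are hypothetical.

```python
def expected_overhead(duty_cycle, instrumented_slowdown):
    """First-order estimate of sampling overhead (illustrative model only).

    duty_cycle: fraction of time spent instrumented, between 0 and 1.
    instrumented_slowdown: run-time ratio while instrumented (1.0 = no cost).
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    if instrumented_slowdown < 1.0:
        raise ValueError("instrumented_slowdown must be at least 1.0")
    return duty_cycle * (instrumented_slowdown - 1.0)


# Hypothetical numbers: profile 2% of the time at a 3x slowdown.
print(f"{expected_overhead(0.02, 3.0):.1%}")  # 4.0%
```

Raising the duty cycle collects more profile data per unit time at proportionally higher cost under this model.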