This presentation explores the performance of multi-threaded Java applications on multicore hardware, examining the impact of thread scheduling and hardware resources on performance. The study includes experiments with frequency scaling, isolation of threads, and pairing of application and collection threads. Gain insights on how to optimize performance for multi-threaded applications on multicore hardware.
Exploring Multi-Threaded Java Application Performance on Multicore Hardware Jennifer B. Sartor, Lieven Eeckhout Ghent University, Belgium OOPSLA 2012 presentation – October 24th 2012
Modern Software & Hardware • Managed languages • Ubiquitous, but added runtime layer • Many service threads interact with application • JIT compilation, on-stack replacement, collector • Stop the application, possibly critical • Share hardware resources • Multicore with multiple sockets • How do we schedule threads with constrained resources? • Scale core frequency for power • Use caches of all sockets, or limit communication
Extensive Performance Study • Multi-threaded Java application on multicore, multi-socket hardware • Large space to explore • Number of threads • Thread-to-core/socket mapping • Pairing or isolating application and JVM threads • Pinning • Impact of frequency scaling • Difference between startup and steady state How do choices with scheduling and hardware resources affect performance?
Experimental Machine: Nehalem • Scale frequency per socket to 1.596 or 3.059 GHz
Gain Insight on Scheduling • Application • Java Virtual Machine • Garbage collector • Just-in-time compiler with on-stack replacement • Cao, et al. [ISCA 2012] studied JVM amenability to heterogeneity by measuring service threads’ performance per energy • We study end-to-end performance
Roadmap • Cost of Isolation • Frequency Scaling • Pairing Threads
Experimental Methodology • Jikes Research Virtual Machine (Dec 2011) • Generational Immix collector • 1.5, 2, and 3x minimum heap sizes • Multithreaded DaCapo benchmarks 9.12-bach • Avrora, lusearch (with fix), pmd, sunflow, xalan • Also, pseudojbb2005 • Timed 10 invocations • Steady state, measure 15th iteration • Startup, measure 1st iteration
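The methodology above spans several benchmarks, three heap sizes, and two measurement phases. A minimal sketch of how that grid could be enumerated is shown here. The helper `experiment_grid` and the dictionary keys are illustrative names, not part of the authors' harness; only the benchmark names, heap multipliers, measured iterations, and invocation count come from the slide, and treating the 10 invocations as applying to each configuration is an assumption.

```python
from itertools import product

# Values taken from the methodology slide.
BENCHMARKS = ["avrora", "lusearch", "pmd", "sunflow", "xalan", "pseudojbb2005"]
HEAP_MULTIPLIERS = [1.5, 2, 3]         # multiples of each benchmark's minimum heap
PHASES = {"startup": 1, "steady": 15}  # phase name -> iteration that is measured
INVOCATIONS = 10                       # assumed to apply to every configuration


def experiment_grid():
    """Return one dict per (benchmark, heap multiplier, phase) configuration."""
    grid = []
    for bench, mult, (phase, iteration) in product(
        BENCHMARKS, HEAP_MULTIPLIERS, PHASES.items()
    ):
        grid.append(
            {
                "benchmark": bench,
                "heap_multiplier": mult,
                "phase": phase,
                "measured_iteration": iteration,
                "invocations": INVOCATIONS,
            }
        )
    return grid


if __name__ == "__main__":
    configs = experiment_grid()
    print(len(configs), "configurations")
    print(sum(c["invocations"] for c in configs), "timed invocations in total")
```

Thread counts, thread-to-core mappings, and socket frequencies would add further dimensions to this grid; they are left out here because the slide does not list their values.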
Baseline Setup • Pin application & collection threads [Diagram: Nehalem cores 0-7 across socket 0 and socket 1; legend: application threads, JVM service threads (collection, compilation)]
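The baseline pins application and collection threads to cores. As a rough sketch of what such a thread-to-core mapping looks like, the snippet below builds a pinning plan. It assumes cores 0-3 sit on socket 0 and cores 4-7 on socket 1, and the names `SOCKET_CORES`, `pinning_plan`, and the `app-`/`gc-` labels are hypothetical. It only computes the mapping; the actual pinning would be done inside the JVM or through an OS affinity mechanism, which the slide does not describe.

```python
# Assumed core layout for illustration: cores 0-3 on socket 0, cores 4-7 on socket 1.
SOCKET_CORES = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}


def pinning_plan(n_app, n_gc, app_socket=0, gc_socket=0):
    """Map each application and collection thread to one core.

    Threads are assigned round-robin over the cores of their socket.
    Using the same socket for both kinds of thread pairs them on shared
    cores; using different sockets isolates the collector.
    """
    plan = {}
    app_cores = SOCKET_CORES[app_socket]
    gc_cores = SOCKET_CORES[gc_socket]
    for i in range(n_app):
        plan[f"app-{i}"] = app_cores[i % len(app_cores)]
    for i in range(n_gc):
        plan[f"gc-{i}"] = gc_cores[i % len(gc_cores)]
    return plan


if __name__ == "__main__":
    print("paired on socket 0:", pinning_plan(4, 4, app_socket=0, gc_socket=0))
    print("collector isolated:", pinning_plan(4, 4, app_socket=0, gc_socket=1))
```

Keeping the plan as plain data makes it easy to compare configurations (paired versus isolated) before applying any of them.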
Boosting Socket Frequency • 1.596 to 3.059 GHz • 27-50% improvement in execution time [Diagram: Nehalem cores 0-7 across socket 0 and socket 1]
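The improvement figures on this slide are reductions in execution time. A small sketch of that arithmetic follows; the function name `improvement_pct` and the timings are made up for illustration, while the two frequencies are the ones given in the talk.

```python
LOW_GHZ, HIGH_GHZ = 1.596, 3.059  # the two socket frequencies from the slides


def improvement_pct(t_low, t_high):
    """Percent reduction in execution time when moving from low to high frequency."""
    return 100.0 * (t_low - t_high) / t_low


# Ratio between the two clock settings, about 1.92x.
clock_ratio = HIGH_GHZ / LOW_GHZ

if __name__ == "__main__":
    # Hypothetical timings in milliseconds, not measurements from the study.
    print(round(improvement_pct(1000.0, 650.0), 1))  # prints 35.0
    print(round(clock_ratio, 2))                     # prints 1.92
```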
Exploring The Cost of Isolation • Collection threads [Diagram: Nehalem cores 0-7 across socket 0 and socket 1]
Isolating Collection Threads Isolating collector does not significantly hurt performance
Exploring The Cost of Isolation • Compiler thread [Diagram: Nehalem cores 0-7 across socket 0 and socket 1]
Isolating Compiler Thread at Startup Isolating compiler at startup has little impact
Isolating On-Stack-Replace at Startup Isolating OSR at startup improves performance
Exploring The Cost of Isolation • All JVM service threads [Diagram: Nehalem cores 0-7 across socket 0 and socket 1]
Isolating All JVM Threads Isolating service threads significantly hurts only one benchmark
Exploring Frequency Scaling • Baseline: JVM service threads isolated, all cores at highest frequency [Diagram: Nehalem cores 0-3 on socket 0, cores 4-7 on socket 1]
Exploring Frequency Scaling • Lower frequency of JVM service threads versus lower frequency of application threads [Diagram: Nehalem cores 0-7]
Lower Frequency: Collector vs App Lowering collector frequency affects performance 5x less than lowering application frequency
Lower Freq at Startup: Compiler vs App Lowering compiler frequency is not detrimental compared to lowering application frequency
Lower Frequency: JVM vs App Lowering JVM frequency affects performance 5x less than lowering application frequency
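A comparison such as "5x less" can be read as a ratio of slowdowns relative to a common baseline. The sketch below shows one way to compute it. The function names and the timings are hypothetical and chosen only so the arithmetic is easy to follow; they are not data from the study, and the talk does not state that it computed the ratio this way.

```python
def slowdown_pct(t_base, t_scaled):
    """Percent increase in execution time relative to the baseline run."""
    return 100.0 * (t_scaled - t_base) / t_base


def relative_sensitivity(t_base, t_service_low, t_app_low):
    """How many times larger the application slowdown is than the service slowdown.

    t_service_low: time with only the JVM service threads' socket at low frequency.
    t_app_low:     time with only the application threads' socket at low frequency.
    Raises ZeroDivisionError if lowering the service frequency causes no slowdown.
    """
    return slowdown_pct(t_base, t_app_low) / slowdown_pct(t_base, t_service_low)


if __name__ == "__main__":
    # Hypothetical timings in milliseconds.
    print(relative_sensitivity(1000.0, 1080.0, 1400.0))  # prints 5.0
```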
Exploring Pairing Threads • Pair application and collection threads [Diagram: Nehalem cores 0-7 across socket 0 and socket 1]
Pairing App & Collector, 2 Sockets For all benchmarks except avrora, pairing application and collection threads performs best
Overall Performance Comparison Either use 1 socket, or isolate compiler thread
Conclusions: Scheduling Insights • 1 socket: # application = # collection threads • 2 sockets: • Isolate compilation thread • Pair application and collection threads • Set # application threads = # cores, fewer collection threads • Increasing the frequency of application threads matters more than increasing the frequency of JVM service threads • Analyzed Java performance given hardware resources