Performance Analysis of Multiple Threads/Cores Using the UltraSPARC T1 (Niagara)
Dimitris Kaseridis & Lizy K. John
The University of Texas at Austin, Laboratory for Computer Architecture
http://lca.ece.utexas.edu
Unique Chips and Systems (UCAS-4)
Outline
• Brief Description of UltraSPARC T1 Architecture
• Analysis Objectives / Methodology
• Analysis of Results
  • Interference on Shared Resources
  • Scaling of Multiprogrammed Workloads
  • Scaling of Multithreaded Workloads
UltraSPARC T1 (Niagara)
• A multithreaded processor that combines CMP and SMT into a CMT design
• 8 cores, each handling 4 hardware context threads → 32 active hardware context threads
• Simple in-order pipeline per core, with no branch predictor unit
• Optimized for multithreaded performance (throughput)
• High throughput: memory and pipeline stalls/latencies are hidden by scheduling other ready threads, with a zero-cycle thread-switch penalty
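A rough intuition for the throughput claim: while one thread waits on memory, the other hardware contexts on the core keep the pipeline busy. The toy model below is not from the paper; it is a minimal sketch with made-up compute/stall latencies that estimates pipeline utilization as a function of the number of hardware threads per core.

```python
# Toy model of fine-grained multithreading on a single in-order core.
# Hypothetical assumption: each thread alternates between `compute` cycles of
# useful work and `stall` cycles waiting on memory, and the core switches to
# another ready thread with zero penalty (as on the T1).

def core_utilization(threads: int, compute: int = 20, stall: int = 100) -> float:
    """Fraction of cycles the pipeline issues useful instructions."""
    # One thread keeps the pipeline busy for compute / (compute + stall) of the
    # time; with N threads the stalls overlap, up to full utilization.
    per_thread = compute / (compute + stall)
    return min(1.0, threads * per_thread)

if __name__ == "__main__":
    for n in (1, 2, 4):  # the T1 supports 4 hardware threads per core
        print(f"{n} thread(s): ~{core_utilization(n):.0%} pipeline utilization")
```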
UltraSPARC T1 Core Pipeline
• A thread group shares the L1 caches, TLBs, execution units, pipeline registers and data path
• Blue areas in the pipeline diagram are replicated per hardware context thread
Objectives
• Purpose
  • Analyze the interference of multiple executing threads on the shared resources of Niagara
  • Evaluate the scaling ability of CMT architectures for both multiprogrammed and multithreaded workloads
• Methodology
  • Interference on shared resources (SPEC CPU2000)
  • Scaling of a multiprogrammed workload (SPEC CPU2000)
  • Scaling of a multithreaded workload (SPECjbb2005)
Analysis Objectives / Methodology
Methodology (1/2)
• On-chip performance counters for real, accurate results
• On Niagara:
  • Solaris 10 tools: cpustat and cputrack to read the counters, psrset to bind processes to H/W threads
  • 2 counters per hardware thread, with one dedicated to the instruction count
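As a concrete illustration of the measurement flow (not the authors' scripts), the sketch below wraps cputrack from Python. The pic0/pic1 event names (DC_miss, Instr_cnt) and the -c syntax follow the usual Solaris/UltraSPARC T1 conventions but are assumptions here; they should be checked with cputrack -h on the actual machine.

```python
import subprocess

# Hypothetical wrapper around the Solaris cputrack tool (not from the paper).
# Assumes T1-style counter events: pic1 dedicated to Instr_cnt, and pic0
# programmable to an event such as DC_miss; verify the exact event names
# before use.

def run_with_counters(cmd: list[str], pic0_event: str = "DC_miss") -> str:
    """Run `cmd` under cputrack and return its counter output for parsing."""
    cputrack_cmd = [
        "cputrack",
        "-c", f"pic0={pic0_event},pic1=Instr_cnt",  # two counters per H/W thread
        *cmd,
    ]
    result = subprocess.run(cputrack_cmd, capture_output=True, text=True, check=True)
    return result.stdout  # raw counter samples/totals printed by cputrack

# Example with a hypothetical benchmark binary:
# print(run_with_counters(["./benchmark"]))
```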
Methodology (2/2)
• Niagara has only one FP unit → only integer benchmarks were considered
• The performance counter unit works at the granularity of a single H/W context thread
  • No way to break down the effects of multiple software threads per H/W thread
  • Software profiling tools are too invasive
• Only pairs of benchmarks were considered, to allow correlating benchmarks with events
• Many iterations were run and the average behavior was used
Analysis of Results
• Interference on shared resources
• Scaling of a multiprogrammed workload
• Scaling of a multithreaded workload
Interference on Shared Resources
Two modes were considered:
• "Same core" mode executes both benchmarks of a pair on the same core
  • Sharing of pipeline, TLBs, and L1 bandwidth
  • More like an SMT
• "Two cores" mode executes each member of the pair on a different core
  • Sharing of L2 capacity/bandwidth and main memory
  • More like a CMP
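To make the two placements concrete, here is a hypothetical helper (not the paper's code) that picks the virtual-CPU IDs a benchmark pair would be bound to, assuming the common Solaris enumeration on the T1 where virtual CPUs 4c through 4c+3 are the four hardware threads of core c.

```python
# Hypothetical placement helper for the interference experiments.
# Assumption: Solaris numbers the 32 virtual CPUs so that core c owns
# virtual CPUs 4*c .. 4*c+3 (check with `psrinfo -pv` on the real machine).

THREADS_PER_CORE = 4

def placement(mode: str, core_a: int = 0, core_b: int = 1) -> tuple[int, int]:
    """Return the virtual-CPU IDs for the two benchmarks of a pair."""
    if mode == "same_core":
        # Both benchmarks on two hardware threads of the same core:
        # they share the pipeline, TLBs, and L1 bandwidth (SMT-like).
        return (core_a * THREADS_PER_CORE, core_a * THREADS_PER_CORE + 1)
    if mode == "two_cores":
        # One benchmark per core: they only share L2 capacity/bandwidth
        # and main memory (CMP-like).
        return (core_a * THREADS_PER_CORE, core_b * THREADS_PER_CORE)
    raise ValueError(f"unknown mode: {mode}")

# e.g. placement("same_core") -> (0, 1); placement("two_cores") -> (0, 4)
# Each benchmark would then be bound to its virtual CPU before measurement.
```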
Interference "same core" (1/2)
• On average, a 12% drop in IPC when running in a pair
• Crafty, followed by twolf, showed the worst performance
• Eon showed the best behavior, keeping its IPC close to the single-thread case
Interference "same core" (2/2)
• DC misses increased by 20% on average (15% excluding crafty)
• The worst DC misses are for vortex and perlbmk
• The pairs with the highest L2 miss ratios are not the ones with an important decrease in IPC → the mcf and eon pairs show more than 70% L2 misses
• Overall, a small performance penalty even when sharing the pipeline and the L1/L2 bandwidth → the latency-hiding technique is promising
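The percentages above are ratios of paired-run counter totals to solo-run totals. A minimal sketch of that arithmetic, using hypothetical counter readings chosen only for illustration:

```python
# Minimal sketch of the interference metrics; the counter readings below are
# hypothetical, standing in for the per-thread hardware counter totals.

def ipc(instructions: int, cycles: int) -> float:
    return instructions / cycles

def pct_change(paired: float, solo: float) -> float:
    """Relative change of a metric when a benchmark runs as part of a pair."""
    return (paired - solo) / solo * 100.0

# Hypothetical solo vs. paired measurements for one benchmark:
solo = {"instr": 1_000_000_000, "cycles": 1_400_000_000, "dc_miss": 12_000_000}
pair = {"instr": 1_000_000_000, "cycles": 1_590_000_000, "dc_miss": 14_400_000}

ipc_drop = -pct_change(ipc(pair["instr"], pair["cycles"]),
                       ipc(solo["instr"], solo["cycles"]))
dc_miss_increase = pct_change(pair["dc_miss"], solo["dc_miss"])

print(f"IPC drop when paired:         {ipc_drop:.1f}%")        # ~12% in this example
print(f"DC miss increase when paired: {dc_miss_increase:.1f}%")  # ~20% in this example
```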
Interference "two cores"
• Only the L2 and the shared communication buses are stressed
• On average, the L2 misses are almost the same as in the "same core" case:
  • The available resources are underutilized
  • It is a multiprogrammed workload with no data sharing
Scaling of Multiprogrammed Workload
• Reduced benchmark pair set
• Scaling 4 → 8 → 16 threads, using the two placement configurations on the next slide
Scaling of Multiprogrammed Workload
• "Same core" mode
• "Mixed" mode
Scaling of Multiprogrammed "same core"
(Charts: IPC ratio, DC misses ratio, L2 misses ratio)
• 4 → 8 case
  • IPC and data cache misses are not affected
  • L2 data misses increase, but IPC does not suffer
  • Enough resources even when running fully occupied → memory latency hiding
• 8 → 16 case
  • More cores running the same benchmarks → larger footprint and more requests to L2 and main memory
  • Increased L2 requirements and shared interconnect traffic → decreased performance
Scaling of Multiprogrammed "mixed mode"
(Chart: IPC ratio)
• Mixed mode case
  • Significant decrease in IPC when moving both from 4 → 8 and from 8 → 16 threads
  • Same behavior as the "same core" case for DC and L2 misses, with an average difference of 1%-2%
• Overall, for both modes
  • Niagara demonstrated that moving from 4 to 16 threads can be done with less than a 40% average performance drop
  • Both modes showed that significantly increased L1 and L2 misses can be handled, favoring throughput
Scaling of Multithreaded Workload
• Scaled from 1 up to 64 threads
  • 1 → 8 threads: 1 thread mapped per core
  • 8 → 16 threads: at most 2 threads mapped per core
  • 16 → 32 threads: up to 4 threads per core
  • 32 → 64 threads: more threads than hardware contexts per core, so swapping is necessary
(Table: configuration used for SPECjbb2005)
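The mapping above follows directly from the chip's 8 cores with 4 hardware contexts each; a small hypothetical helper that reproduces it:

```python
import math

# Hypothetical helper reproducing the thread-to-core mapping described above,
# assuming threads are spread evenly across the T1's 8 cores, each offering
# 4 hardware contexts.

CORES = 8
HW_THREADS_PER_CORE = 4

def threads_per_core(total_threads: int) -> int:
    """Maximum number of software threads that land on any single core."""
    return math.ceil(total_threads / CORES)

def needs_swapping(total_threads: int) -> bool:
    """True once a core must time-share more threads than hardware contexts."""
    return threads_per_core(total_threads) > HW_THREADS_PER_CORE

for n in (8, 16, 32, 64):
    print(n, "threads ->", threads_per_core(n), "per core,",
          "swapping" if needs_swapping(n) else "no swapping")
```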
Scaling of Multithreaded Workload
(Chart: SPECjbb2005 score per warehouse, with the GC effect annotated)
Scaling of Multithreaded Workload
• Ratios are over the 8-thread case with 1 thread per core
• Instruction fetch and the DTLB are stressed the most
• The L1 data and L2 caches managed to scale even for more than 32 threads
Scaling of Multithreaded Workload
• Scaling of performance
  • Nearly linear scaling of about 0.66x per thread up to 32 threads
  • ~20x speedup at 32 threads (0.66 × 32 ≈ 21x)
  • 2 threads/core gives on average a 1.8x speedup over the single-threaded CMP configuration (region 1)
  • Up to 4 threads/core gives a 1.3x and 2.3x speedup over the 2-way SMT per core and the single-threaded CMP, respectively
Conclusions
• Demonstration of interference on a real CMT system
• The long-latency-hiding technique is effective for L1 and L2 misses, and therefore could be a promising alternative to aggressive speculation
• Promising scaling of up to 20x for multithreaded workloads, with an average of 0.66x per thread
• The instruction fetch subsystem and the DTLBs are the most contended resources, followed by the L2 cache
Q/A
Thank you… Questions?
The Laboratory for Computer Architecture web site: http://lca.ece.utexas.edu
Email: kaseridi@ece.utexas.edu