CMP/CMT Scaling of SPECjbb2005 on UltraSPARC T1 (Niagara)

10th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-10)
Dimitris Kaseridis and Lizy K. John
The University of Texas at Austin, Laboratory for Computer Architecture
http://lca.ece.utexas.edu

Outline
• Brief Description of UltraSPARC T1
• Objectives
• SPECjbb2005 Benchmark
• Results

UltraSPARC T1
• A new multithreaded processor that combines CMP and SMT into CMT
• 8 cores, each handling 4 hardware thread contexts, for 32 active hardware threads
• Simple in-order pipeline per core, with no branch prediction unit
• Optimized for multithreaded performance, i.e. throughput
• High throughput: hides memory and pipeline stalls/latencies by scheduling other threads, with a zero-cycle thread switch penalty
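The latency-hiding idea can be illustrated with a back-of-the-envelope model. The sketch below is not from the slides: it assumes each thread alternates between a fixed number of compute cycles and a fixed number of stall cycles, that the core issues from one ready thread per cycle, and that switching threads is free. The function name and the cycle counts are hypothetical.

```python
def core_utilization(threads: int, compute_cycles: float, stall_cycles: float) -> float:
    """Fraction of cycles in which the core issues an instruction.

    Idealized model (an assumption, not the T1's actual behavior):
    every thread computes for `compute_cycles`, then stalls for
    `stall_cycles`, and other threads fill the stall at zero switch cost.
    """
    if threads < 1:
        raise ValueError("need at least one thread")
    if compute_cycles <= 0 or stall_cycles < 0:
        raise ValueError("cycle counts must be positive")
    # Share of time a single thread keeps the pipeline busy.
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    # Additional threads fill idle slots until the core saturates.
    return min(1.0, threads * per_thread)


# Hypothetical workload: 10 compute cycles followed by a 30-cycle stall.
for n in (1, 2, 4):
    print(n, "thread(s):", core_utilization(n, 10, 30))
```

Under these idealized assumptions four threads saturate the core. Real threads compete for the shared caches and pipeline, so this is an upper bound on the gain rather than a prediction.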

SMP vs. CMT (figure not reproduced)

UltraSPARC T1 Core Pipeline
• Thread group shares L1 cache, TLBs, execution units, pipeline registers and datapath
• Core area = 11 mm² (90 nm technology)
• 4-way MT adds ~20% area to the core

Objectives
• Evaluate CMP/CMT benefits
• Quantify the benefits that additional cores and/or additional hardware threads provide in a multithreaded environment
• Show the effectiveness of latency hiding

SPECjbb2005 Benchmark
Characteristics
• Models a self-contained 3-tier system: server, database and clients
• Every warehouse is a collection of Java objects with ~25 MB of data
• Each client is represented by an individual thread
• No I/O effects
• Reported score: business operations per second (BOPS)
• Targets the performance of CPUs, caches, the memory hierarchy and the scalability of shared-memory processors
• Stresses the implementations of the JVM (Java Virtual Machine), the JIT (Just-In-Time) compiler, garbage collection and threads
Figure: SPECjbb2005 3-tier architecture

Experimental parameters (parameter table not reproduced)

Measurement Methodology
• On-chip performance counters for real/accurate results
• Niagara: Solaris 10 tools cpustat and cputrack
• 2 counters per hardware thread, one of which counts only instructions
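Given a per-thread instruction counter, IPC can be derived by dividing the instructions counted in a sampling interval by the cycles elapsed in that interval. A minimal sketch, assuming the cycle count is estimated from wall-clock time and a known clock frequency; the function names, the 1 GHz clock and the sample values are illustrative assumptions, not measurements from the talk.

```python
from typing import Iterable


def thread_ipc(instr_delta: float, interval_seconds: float, clock_hz: float) -> float:
    """IPC of one hardware thread over a sampling interval.

    `instr_delta` is the difference between two readings of the
    instruction counter; cycles are estimated as time * frequency.
    """
    if interval_seconds <= 0 or clock_hz <= 0:
        raise ValueError("interval and clock frequency must be positive")
    cycles = interval_seconds * clock_hz
    return instr_delta / cycles


def core_ipc(thread_instr_deltas: Iterable[float],
             interval_seconds: float, clock_hz: float) -> float:
    """Aggregate IPC of a core: the sum over its hardware threads."""
    return sum(thread_ipc(d, interval_seconds, clock_hz)
               for d in thread_instr_deltas)


# Hypothetical sample: 4 threads, 250M instructions each, 1 s at 1 GHz.
ASSUMED_CLOCK_HZ = 1_000_000_000  # assumption, not taken from the slides
print(core_ipc([250_000_000] * 4, 1.0, ASSUMED_CLOCK_HZ))
```

Summing per-thread values gives a per-core figure, which is the natural unit when comparing configurations with different numbers of threads per core.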

Results – Latency hiding pays off
• Single-thread execution on T1 (chart: SPECjbb score (BOPS) vs. number of warehouses)
• Single-core execution using 4 threads on one core (chart: SPECjbb score (BOPS) vs. number of warehouses)
• Score improves x2 instead of x4

CMP / CMT Scaling – CMP benefits
• 8 cores x 1 thread/core (chart: SPECjbb score (BOPS) vs. number of warehouses)

CMP / CMT Scaling – CMT benefits
• 8 cores x 2 threads/core (chart: SPECjbb score (BOPS) vs. number of warehouses)
• 75% of the benefit of adding a single core
• Significantly lower area and power requirements (recall that 4-way MT adds ~20% area to each core)
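The area argument can be made concrete with a little arithmetic. The sketch below uses only the two figures quoted on the slides (a second thread per core yields 75% of the benefit of an extra core, and 4-way MT adds ~20% core area) and assumes, as a pessimistic bound, that the second thread is charged the full 20% overhead. Normalizing a full core to one unit of area and one unit of benefit is likewise an assumption made for illustration.

```python
# Normalization (assumption): one extra core = 1 unit of area, 1 unit of benefit.
CORE_AREA = 1.0
CORE_BENEFIT = 1.0

# Figures quoted on the slides.
MT_AREA_OVERHEAD = 0.20      # 4-way MT adds ~20% area to a core
MT_RELATIVE_BENEFIT = 0.75   # 75% of the benefit of adding a core


def benefit_per_area(benefit: float, area: float) -> float:
    """Throughput benefit obtained per unit of silicon area."""
    if area <= 0:
        raise ValueError("area must be positive")
    return benefit / area


core_efficiency = benefit_per_area(CORE_BENEFIT, CORE_AREA)
# Pessimistic: charge the whole 4-way MT overhead to the second thread.
mt_efficiency = benefit_per_area(MT_RELATIVE_BENEFIT * CORE_BENEFIT,
                                 MT_AREA_OVERHEAD * CORE_AREA)

advantage = mt_efficiency / core_efficiency
print(f"MT returns about {advantage:.2f}x the benefit per unit area of a core")
```

With these numbers, multithreading returns roughly 3.75 times the benefit per unit of area of an additional core. The estimate ignores power and contention for shared resources.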

CMP / CMT Scaling – SMT benefits
• 8 cores x 4 threads/core (chart: SPECjbb score (BOPS) vs. number of warehouses)

CMP / CMT Scaling – SMT benefits (chart: SPECjbb score (BOPS) vs. number of warehouses)
• Going beyond 2 hardware threads per core gives an additional benefit of 45%
• Gradually diminishing returns in terms of SMT efficiency
• The garbage collector significantly affects regions 4 and 5

SPECjbb Score Scaling
• IPC of three configurations (chart: IPC vs. number of virtual processors)
• Best-case SPECjbb score speedup (chart: normalized SPECjbb score vs. number of virtual processors)

Conclusions
• Throughput vs. latency in multiprocessing/multithreaded environments
• Latency hiding is a promising alternative to aggressive speculation
• Adding SMT can give up to 75% of the benefit of CMP at significantly lower cost
• Moving to higher levels of SMT shows diminishing returns: there are tradeoffs between the number of cores and the number of hardware threads per core

Thank you… Questions?
The Laboratory for Computer Architecture web site: http://lca.ece.utexas.edu
