Evaluating Multi-threading in the Prototype XMT Environment
Dorit Naishlos, Joseph Nuzman, Chau-Wen Tseng, Uzi Vishkin
Department of Computer Science, University of Maryland, College Park
Introduction
• 1-billion-transistor chips are around the corner; approaches to using that on-chip parallelism include:
  • instruction-level parallelism (ILP)
  • single-chip multiprocessing (CMP)
  • simultaneous multi-threading (SMT)
• Goal: bridge the gap between on-chip parallelism and the parallel program
• XMT: a computational framework that encompasses both architecture design and programmability
  • programming model targets on-chip parallelism
  • explicit parallelism
  • fine granularity
  • scalability
  • wider range of applications
XMT (Explicit Multi-Threading)
• Goal: single-task completion time (rather than multiprogramming throughput / IPC)
  • speedups of the parallel program over the best serial program
• Framework: algorithm → programming → hardware
• Architecture: CMP consisting of SMT units
  • scalability
  • high-bandwidth communication
  • efficient synchronization
  • hardware prefix-sum primitive
• Explicit parallel programming model (rather than compiler threading / speculation)
  • targets general applications
Contributions
• Prototype XMT environment
  • compiler
  • simulator
• Experimental evaluation
  • wide range of applications (12 benchmarks)
  • parallel speedups over the serial program
  • parallel applications: scalability to high levels of parallelism
  • speedups for less parallel, irregular applications
• Compiler optimization: thread coarsening
Outline
• Motivation
• XMT Programming Model
• XMT Architecture
• XMT Compiler
• Experimental Evaluation
• Conclusion
XMT Programming Model
• Explicit spawn-join parallel regions:

  spawn (nthreads, off); {
    …
    xfork();
  }

• Independence of Order Semantics (IOS)
  • threads run to completion at their own speed
  • no busy waiting
• Spawn statement: generates a parallel region
• Prefix-sum operation, used for synchronization:
  • ps (base, incr): base ← base + incr; returns the initial value of base
• Fork (xfork): dynamically increments the spawn size
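Not on the original slide: a minimal sketch of the ps semantics in portable C, assuming C11 atomics. On XMT, ps is a hardware primitive that combines concurrent requests; atomic_fetch_add only reproduces its return-the-initial-value semantics, not its performance.

  #include <stdatomic.h>

  /* Emulation of the XMT prefix-sum primitive: atomically performs
   * base <- base + incr and returns the initial value of base. */
  static inline int ps(atomic_int *base, int incr) {
      return atomic_fetch_add(base, incr);
  }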
Example - Quicksort
[Figure: an array is partitioned around pivot 5 into a “left” partition (values lower than the pivot, e.g. 1, 3, 4, 2) and a “right” partition (values higher than the pivot)]

quicksort(input, n) {
  while (…) {
    partition(input, output);
    swap(input, output);
  }
}

partition(input, output, n) {
  int pivot = p;                       /* chosen pivot value */
  int low = 0, high = 0;
  spawn(n, 0); {
    int indx;
    if (input[TID] < pivot) {
      indx = ps(low, 1);
      output[indx] = input[TID];       /* fill from the left end */
    } else {
      indx = ps(high, 1);
      output[n-1-indx] = input[TID];   /* fill from the right end */
    }
  }
  join();
}
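For reference (not from the talk): a serial partition with the same fill-from-both-ends layout, to make explicit what the spawn block above computes. The paper measures speedups against the best serial program, so this sketch is illustrative only; serial_partition is a name assumed here.

  /* Hypothetical serial counterpart of the parallel partition above. */
  void serial_partition(int *input, int *output, int n, int pivot) {
      int low = 0, high = 0;
      for (int i = 0; i < n; i++) {
          if (input[i] < pivot)
              output[low++] = input[i];            /* left end */
          else
              output[n - 1 - high++] = input[i];   /* right end */
      }
  }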
Outline
• Motivation
• XMT Programming Model
• XMT Architecture
• XMT Compiler
• Experimental Evaluation
• Conclusion
XMT Architecture
• Goals:
  • exploit explicit parallelism
  • simplify hardware
  • maximize resource usage
  • decentralize design
XMT Architecture
• Thread control units (TCUs)
  • PCs, instruction fetch/decode, local registers
• Clusters
  • multiple TCUs, L1 caches, functional units
XMT Execution Model
• Serial mode: TCU 0 runs the serial code while the other TCUs wait
• spawn → all TCUs run threads → join → back to TCU 0
[Figure: Spawn(10, 0) — TCUs 0-5 start threads 0-5 and pick up threads 6-9 as they finish; serial code runs on TCU 0 before and after the parallel region]
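A sketch (not from the talk) of this execution model emulated in portable C with pthreads. NTCU, tcu, and thread_body are assumed names; the shared atomic counter stands in for the hardware prefix-sum unit that hands out fresh thread IDs.

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  #define NTCU 4                 /* emulated thread control units */

  static atomic_int next_tid;    /* stand-in for the prefix-sum unit */
  static int nthreads;

  static void thread_body(int tid) {          /* one virtual thread */
      printf("virtual thread %d\n", tid);
  }

  static void *tcu(void *arg) {               /* one emulated TCU */
      (void)arg;
      for (;;) {
          int tid = atomic_fetch_add(&next_tid, 1);  /* get_new_id() */
          if (tid >= nthreads) break;                /* no more threads: join */
          thread_body(tid);
      }
      return NULL;
  }

  static void spawn(int n, int off) {
      pthread_t t[NTCU];
      nthreads = n;
      atomic_store(&next_tid, off);
      for (int i = 0; i < NTCU; i++) pthread_create(&t[i], NULL, tcu, NULL);
      for (int i = 0; i < NTCU; i++) pthread_join(t[i], NULL);  /* join */
  }

  int main(void) {
      /* serial mode ... */
      spawn(10, 0);              /* parallel region: 10 virtual threads */
      /* ... back to serial mode */
      return 0;
  }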
XMT Architecture
• Global functions
  • banked memory system
  • specialized global registers (pread, pset)
  • spawn unit (spawn)
  • prefix-sum unit (pinc)
• Concurrent prefix-sum requests are efficiently combined
Outline
• Motivation
• XMT Programming Model
• XMT Architecture
• XMT Compiler
• Experimental Evaluation
• Conclusion
XMT Compiler
• Front end: XPASS, a SUIF pass
  • XMT-C = C + specialized templates
  • each parallel region is outlined into a separate procedure
  • emits assembly constructs for parallel execution
  • supports thread coarsening
• Back end: GCC
  • produces code for the SimpleScalar MIPS ISA
Compilation Scheme
• Phase 1: Outlining — the spawn block is extracted into a separate procedure

  main() {
    int global_vars;
    spawn (nthreads, off); {
      THREAD CODE;
    }
    join();
    print(data);
  }

  becomes:

  main() {
    int global_vars;
    spawn_setup(nthreads, off);
    spawn_0_func();
    print(data);
  }

• Phase 2: Spawn-function transformation — each TCU loops over virtual thread IDs

  spawn_0_func() {
    int TID, max_tid;
    TID = TCUID + offset;
    while (TID < max_tid) {
      THREAD CODE;
      TID = get_new_id();
    }
  }

• Phase 3: Templates are replaced with XMT assembly
  • spawn_setup(nthreads, off) → TCU-init(nthreads, off): pset PR0, pset …
  • reads of global values → pread PR0, pread …
  • TID = get_new_id() → pinc PR1 $tid
  • spawn-end → halt/suspend
Outline
• Motivation
• XMT Programming Model
• XMT Architecture
• XMT Compiler
• Experimental Evaluation
• Conclusion
Experimental Methodology
• Simulator
  • SimpleScalar parameters for instruction latencies
  • 1, 4, 16, 64, 256 TCUs
• Configuration:
  • 8 TCUs per cluster
  • 8KB L1 caches
  • banked shared L2 cache, 1MB
• Programs rewritten in XMT-C
• Speedups of the parallel XMT program compared to the best serial program
  • parallel applications: scalability to high levels of parallelism
  • speedups for less parallel, irregular applications
First Application Set
• Computation:
  • regular
  • mostly array-based
  • limited synchronization needed
Speedups over Serial
[Figure: speedup curves for the first application set]
• Speedups scale up to 256 TCUs
• Factors: memory behavior, overheads, coarsening
Overheads
[Figure: overheads as a function of problem size]
• Overheads are less than 0.01% of total execution time
Overheads
[Figure: overheads as a function of problem size at extremely fine granularity]
Thread Clustering Impact on Overheads
[Figure: overheads for problem sizes 64×64, 128×128, and 256×256]
Second Application Set
• Computation:
  • irregular
  • unpredictable
  • synchronization needed
Speedups over Serial
• Dynamic load balancing
• Dynamic forking
• Exploiting fine granularity
Dynamic Load Balancing
• DAG (initial step of the computation)
  • 256 nodes, 9679 edges
• A spawn block on 16 TCUs
Conclusion
• XMT as a complete environment
• Extensive experimental evaluation on a range of applications and computations
  • speedups scale up to 256 TCUs for parallel applications
  • better speedups for less parallel applications
Related Work
• On-chip parallelism:
  • SMT
  • CMP
  • Multiscalar
  • M-Machine
  • Raw
• Multithreaded architectures:
  • Tera
Current & Future Work
• Compiler optimizations
• Enlarge the benchmark suite
• Detailed simulator
Example - Dot Product

dot(A, B, lb, ub) {
  int dot = 0;
  for (i = lb; i < ub; i++) {
    dot += A[i] * B[i];
  }
  return dot;
}

int A[N], B[N], global_val = 0;
spawn(nthreads, 0); {
  int lb = N * TID / nthreads;
  int ub = N * (TID + 1) / nthreads;
  int my_part;
  my_part = dot(A, B, lb, ub);
  ps(&global_val, my_part);    /* accumulate the partial sum */
}
join();
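As a usage illustration (not from the talk), the same dot product emulated in portable C with pthreads and C11 atomics; atomic_fetch_add stands in for ps, and NTHREADS and worker are assumed names.

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  #define N 1024
  #define NTHREADS 4

  static int A[N], B[N];
  static atomic_int global_val;

  static void *worker(void *arg) {
      int tid = (int)(long)arg;
      int lb = N * tid / NTHREADS;
      int ub = N * (tid + 1) / NTHREADS;
      int my_part = 0;
      for (int i = lb; i < ub; i++)
          my_part += A[i] * B[i];
      atomic_fetch_add(&global_val, my_part);  /* ps(&global_val, my_part) */
      return NULL;
  }

  int main(void) {
      pthread_t t[NTHREADS];
      for (int i = 0; i < N; i++) { A[i] = 1; B[i] = 2; }
      for (long i = 0; i < NTHREADS; i++)
          pthread_create(&t[i], NULL, worker, (void *)i);
      for (int i = 0; i < NTHREADS; i++)
          pthread_join(t[i], NULL);            /* join() */
      printf("dot = %d\n", atomic_load(&global_val));  /* expect 2048 */
      return 0;
  }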
Example - Quicksort (2): the Fork Operation

int input[N], thread_data[N];
fspawn(1, 0); {
  int my_size, my_start;                 /* unpacked from thread_data[TID] */
  while (my_size > 1) {
    int pivot = f(my_size);
    int low, high = g(my_start, my_size);
    ser_partition();                     /* serial partition into low/high parts */
    xfork(thread_data, high_partition);  /* new thread handles the high partition */
    my_size = high - my_start;           /* continue with the remaining partition */
  }
}
join();
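A sketch (not from the talk) of how fspawn/xfork-style dynamic thread creation can be emulated in portable C. The descriptor type, the mutex-protected pool, and the halving “partition” stand-in are all assumptions for illustration; real XMT hands out new thread IDs through the hardware prefix-sum unit and involves no busy waiting.

  #include <pthread.h>
  #include <stdio.h>

  #define MAXT 1024   /* capacity of the virtual-thread pool */
  #define NTCU 4      /* emulated thread control units */

  typedef struct { int start, size; } desc_t;   /* one sub-problem */

  static desc_t thread_data[MAXT];
  static int total = 1, next_tid = 0, live = 0;
  static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;

  static void xfork(desc_t d) {                 /* grow the spawn dynamically */
      pthread_mutex_lock(&mu);
      thread_data[total++] = d;
      pthread_mutex_unlock(&mu);
  }

  static void thread_body(int tid) {
      desc_t d = thread_data[tid];
      while (d.size > 1) {
          int half = d.size / 2;                /* stand-in for ser_partition() */
          xfork((desc_t){ d.start + half, d.size - half });  /* "high" part */
          d.size = half;                        /* keep the "low" part */
      }
  }

  static void *tcu(void *arg) {
      (void)arg;
      for (;;) {                                /* busy-waits; emulation only */
          int tid = -1;
          pthread_mutex_lock(&mu);
          if (next_tid < total) { tid = next_tid++; live++; }
          else if (live == 0) { pthread_mutex_unlock(&mu); break; }
          pthread_mutex_unlock(&mu);
          if (tid >= 0) {
              thread_body(tid);
              pthread_mutex_lock(&mu); live--; pthread_mutex_unlock(&mu);
          }
      }
      return NULL;
  }

  int main(void) {
      thread_data[0] = (desc_t){ 0, 16 };       /* initial thread: whole range */
      pthread_t t[NTCU];
      for (int i = 0; i < NTCU; i++) pthread_create(&t[i], NULL, tcu, NULL);
      for (int i = 0; i < NTCU; i++) pthread_join(t[i], NULL);
      printf("ran %d virtual threads\n", total);
      return 0;
  }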