
Evaluating Multi-threading in the Prototype XMT Environment



  1. Evaluating Multi-threading in the Prototype XMT Environment Dorit Naishlos, Joseph Nuzman, Chau-Wen Tseng, Uzi Vishkin Department of Computer Science University of Maryland, College Park

  2. Introduction
  • 1-billion-transistor chips are around the corner
  • proposed ways to use them:
    • instruction-level parallelism (ILP)
    • single-chip multiprocessing (CMP)
    • simultaneous multi-threading (SMT)
  • bridge the gap between on-chip parallelism and the parallel program
  • XMT: a computational framework that encompasses both architecture design and programmability
    • programming model targets on-chip parallelism: explicit parallelism, fine granularity, scalability
    • wider range of applications

  3. XMT (Explicit Multi-Threading)
  [Diagram: related approaches — multiprogramming, IPC; compiler threading, speculation]
  • Goal:
    • single-task completion time
    • speedups of the parallel program over the best serial program
  • Framework: algorithm → programming → hardware
  • Architecture: a CMP consisting of SMT units
    • scalability
    • high-bandwidth communication
    • efficient synchronization
    • hardware prefix-sum primitive
  • Explicit parallel programming model
    • targets general applications

  4. Contributions
  • Prototype XMT environment: compiler, simulator
  • Experimental evaluation
    • wide range of applications (12 benchmarks)
    • parallel speedups over the serial program
    • parallel applications: scalability to high levels
    • speedups for less parallel, irregular applications
  • Compiler optimization: thread coarsening

  5. Outline • Motivation • XMT Programming Model • XMT Architecture • XMT Compiler • Experimental Evaluation • Conclusion

  6. XMT Programming Model
  • Explicit spawn-join parallel regions:
      spawn(nthreads, off); { … xfork(); … } join();
  • Independence of Order Semantics (IOS)
    • threads run to completion at their own speed
    • no busy waiting
  • Spawn statement generates a parallel region
  • Prefix-sum operation for synchronization:
      ps(base, incr): base ← base + incr; returns the initial value of base
  • Fork (xfork) dynamically increments the spawn size
  (a plain-C sketch of the ps semantics follows below)
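A minimal C11 sketch of the ps semantics. This is our own stand-in, not XMT hardware: a plain fetch-and-add serializes concurrent requests, whereas the XMT prefix-sum unit combines a whole wave of them in constant time.

    #include <stdatomic.h>
    #include <stdio.h>

    /* ps(base, incr): atomically base <- base + incr and return the
     * initial value of base -- the semantics described above. */
    static int ps(atomic_int *base, int incr) {
        return atomic_fetch_add(base, incr);
    }

    int main(void) {
        atomic_int low = 0;
        printf("%d\n", ps(&low, 1));  /* prints 0; low is now 1 */
        printf("%d\n", ps(&low, 1));  /* prints 1; low is now 2 */
        return 0;
    }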

  7. Example – Quicksort
  [Figure: array partitioned around pivot 5 into a "left" partition (lower than the pivot) and a "right" partition (higher than the pivot)]

  quicksort(input, n) {
    while (…) {
      partition(input, output);
      swap(input, output);
    }
  }

  partition(input, output, n) {
    int pivot = p;           /* pivot value */
    int low = 0, high = 1;   /* prefix-sum bases for the two partitions */
    spawn(n, 0); {
      int indx;
      if (input[TID] < pivot) {
        indx = ps(low, 1);             /* claim next slot on the left */
        output[indx] = input[TID];
      } else {
        indx = ps(high, 1);            /* claim next slot on the right */
        output[n - indx] = input[TID];
      }
    }
    join();
  }
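To make the IOS point concrete, here is a rough emulation of the partition's spawn block in portable C (pthreads + C11 atomics). The name worker and the sample values are ours, and one OS thread per element stands in for XMT's far lighter hardware threads; only the semantics carry over.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define N 8
    static int input[N] = {1, 9, 3, 8, 4, 7, 2, 6};
    static int output[N];
    static int pivot = 5;
    static atomic_int low = 0, high = 1;   /* prefix-sum bases */

    /* Body of one XMT thread: place input[tid] in the left or right
     * partition, claiming a unique output slot via fetch-and-add. */
    static void *worker(void *arg) {
        int tid = (int)(long)arg;
        if (input[tid] < pivot)
            output[atomic_fetch_add(&low, 1)] = input[tid];
        else
            output[N - atomic_fetch_add(&high, 1)] = input[tid];
        return NULL;
    }

    int main(void) {
        pthread_t t[N];
        for (int i = 0; i < N; i++)   /* "spawn(N, 0)" */
            pthread_create(&t[i], NULL, worker, (void *)(long)i);
        for (int i = 0; i < N; i++)   /* "join()" */
            pthread_join(t[i], NULL);
        for (int i = 0; i < N; i++)
            printf("%d ", output[i]);
        printf("\n");
        return 0;
    }

The order of elements within each partition depends on which threads happen to run first — exactly the freedom that IOS grants.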

  8. Outline • Motivation • XMT Programming Model • XMT Architecture • XMT Compiler • Experimental Evaluation • Conclusion

  9. XMT Architecture • Goals • exploit explicit parallelism • simplify hardware • maximize resource usage • decentralize design

  10. XMT Architecture • Thread control units (TCU) • PCs, instruction fetch/decode, local registers • Clusters • multiple TCUs, L1 caches, functional units

  11. XMT Execution Model
  • Serial mode: serial code runs on TCU 0
  • spawn → all TCUs; join → back to TCU 0
  [Figure: Spawn(10, 0) — ten threads distributed across the TCUs; a TCU that runs out of threads waits at the join; serial code runs on TCU 0 before the spawn and after the join]

  12. XMT Architecture
  • Global functions:
    • banked memory system
    • specialized global registers (pread, pset)
    • spawn unit (spawn)
    • prefix-sum unit (pinc)
  • parallel prefix-sums are efficiently combined

  13. Outline • Motivation • XMT Programming Model • XMT Architecture • XMT Compiler • Experimental Evaluation • Conclusion

  14. XMT Compiler
  • Front end: XPASS, a SUIF pass
    • XMT-C → C + specialized templates
    • each parallel region → a separate procedure
    • assembly constructs for parallel execution
    • supports thread coarsening
  • Back end: GCC
    • produces SimpleScalar MIPS ISA

  15. Compilation Scheme
  • Phase 1: Outlining — the spawn block

      Spawn(nthreads, off); { THREAD CODE; } join();

    is moved into its own procedure, called from Main():

      Main() {
        int global_vars;
        spawn_0_func();
        print(data);
      }

      spawn_0_func() {
        int TID, max_tid;
        TID = TCUID + offset;
        while (TID < max_tid) {
          THREAD CODE;
        }
      }

  • Phase 2: Spawn-function transformation — templates are inserted:
      spawn_setup(nthreads, off); TCU-init(nthreads, off); TID = get_new_id(); spawn-end();
  • Phase 3: Templates are replaced with XMT assembly:
      spawn_setup → pset PR0, pset …; TCU-init → pread PR0, pread …; get_new_id → pinc PR1, $tid; spawn-end → halt/suspend
  (a rough C emulation of the outlined worker loop follows below)
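The heart of the outlined procedure is the ID-grabbing loop: a physical TCU keeps executing virtual threads until the pool of IDs runs dry, which is also what makes thread coarsening cheap. A rough C emulation under our own naming (P, NTHREADS, thread_code, and next_tid are ours; the real compiler emits a pinc instruction where we use an atomic fetch-and-add):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    enum { P = 4, NTHREADS = 10 };    /* 4 "TCUs", 10 virtual threads */
    static atomic_int next_tid = P;   /* IDs 0..P-1 are handed out at the spawn */

    static void thread_code(int tid) {   /* stand-in for THREAD CODE */
        printf("virtual thread %d\n", tid);
    }

    /* Emulates spawn_0_func(): run virtual threads until the pool of
     * IDs is exhausted; get_new_id() becomes a fetch-and-add. */
    static void *spawn_0_func(void *arg) {
        int tid = (int)(long)arg;        /* initial ID = TCU ID + offset */
        while (tid < NTHREADS) {
            thread_code(tid);
            tid = atomic_fetch_add(&next_tid, 1);   /* get_new_id() */
        }
        return NULL;                     /* "halt/suspend" until the next spawn */
    }

    int main(void) {
        pthread_t tcus[P];
        for (int i = 0; i < P; i++)
            pthread_create(&tcus[i], NULL, spawn_0_func, (void *)(long)i);
        for (int i = 0; i < P; i++)
            pthread_join(tcus[i], NULL);
        return 0;
    }

Each of the ten virtual threads runs exactly once, no matter how the four workers interleave.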

  16. Outline • Motivation • XMT Programming Model • XMT Architecture • XMT Compiler • Experimental Evaluation • Conclusion

  17. Experimental Methodology
  • Simulator
    • SimpleScalar parameters for instruction latencies
    • 1, 4, 16, 64, 256 TCUs
  • Configuration:
    • 8 TCUs per cluster
    • 8KB L1 cache
    • 1MB banked shared L2 cache
  • Programs rewritten in XMT-C
  • Speedups of the parallel XMT program compared to the best serial program
    • parallel applications: scalability to high levels
    • speedups for less parallel, irregular applications

  18. First Application Set • Computation: regular, mostly array-based, limited synchronization needed

  19. Speedups over Serial • speedups scale up to 256 TCUs • remaining factors: memory, overheads, coarsening

  20. Memory

  21. Overheads • across problem sizes, overheads are less than 0.01% of total execution time

  22. Overheads • problem size • extremely fine granularity

  23. Thread Clustering Impact on Overheads
  [Figure: overheads for problem sizes 64×64, 128×128, 256×256]

  24. Second Application Set • Computation: irregular, unpredictable, synchronization needed

  25. Speedups over Serial • dynamic load balancing • dynamic forking • exploiting fine granularity

  26. Dynamic Load Balancing • DAG (initial step of the computation): 256 nodes, 9,679 edges • a spawn block, 16 TCUs

  27. Fork

  28. Conclusion
  • XMT as a complete environment
  • extensive experimental evaluation on a range of applications and computations
    • speedups scale up to 256 TCUs for parallel applications
    • better speedups for less parallel applications

  29. Related Work
  • On chip: SMT, CMP, Multiscalar, M-Machine, Raw
  • Multithreaded architectures: Tera

  30. Summary

  31. Current & Future Work • Compiler optimizations • Enlarge benchmark suite • Detailed simulator

  32. Example – Dot Product

  dot(A, B, lb, ub) {
    int dot = 0;
    for (i = lb; i < ub; i++) {
      dot += A[i] * B[i];
    }
    return dot;
  }

  int A[N], B[N], global_val = 0;
  spawn(nthreads, 0); {
    int lb = N * TID / nthreads;        /* this thread's chunk */
    int ub = N * (TID + 1) / nthreads;
    int my_part = dot(A, B, lb, ub);    /* serial dot product on the chunk */
    ps(&global_val, my_part);           /* add the partial sum into the total */
  }
  join();
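For comparison only (this swaps XMT-C for standard OpenMP; it is not part of XMT): the same chunk-and-combine pattern in conventional shared-memory C, where the spawn block becomes a parallel loop and the final ps() accumulation becomes a reduction clause. Compile with -fopenmp.

    #include <stdio.h>

    #define N 1000

    int main(void) {
        static int A[N], B[N];
        for (int i = 0; i < N; i++) { A[i] = 1; B[i] = 2; }

        long dot = 0;
        /* Per-thread partial sums are combined at the end of the loop --
         * the role ps(&global_val, my_part) plays in the XMT-C version. */
        #pragma omp parallel for reduction(+:dot)
        for (int i = 0; i < N; i++)
            dot += (long)A[i] * B[i];

        printf("%ld\n", dot);   /* 2000 */
        return 0;
    }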

  33. Example – Quicksort (2): the fork operation

  int input[N], thread_data[N];
  fspawn(1, 0); {
    int my_size = …, my_start = …;          /* this thread's segment, from thread_data[TID] */
    while (my_size > 1) {
      int pivot = f(my_size);
      int low = …, high = g(my_start, my_size);
      ser_partition();                      /* serial partition into low/high partitions */
      xfork(thread_data, high_partition);   /* spawn a new thread for the high partition */
      my_size = high - my_start;            /* keep working on the low partition */
    }
  }
  join();

  34. Clustering heuristics (Jacobi, 64 TCUs)
