Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments Junghee Lee, Chrysostomos Nicopoulos, Yongjae Lee, Hyung Gyu Lee and Jongman Kim Presented by Junghee Lee
Introduction • Manycore systems • Number of cores is increasing • Challenges in scalability • Memory • Power consumption • Cache coherence protocol • Load balancing
Contents • Introduction • Background • Programming models • Motivation • IsoNet • Fault-tolerance • Evaluation • Conclusion
Programming Models • Parallel programming models • MPI • OpenMP • Fine-grained parallelism • Emerging applications: Recognition, Mining and Synthesis • Execution time of each computation kernel is very short, but it has abundant parallelism • Excessive overhead in multithreading
Job Queuing • Creates jobs instead of threads • One thread per core is created • Thread: a set of instructions and states of execution • Job: a set of data that is processed by a thread • Job queue • Manages the list of jobs • Maintains load balance [Figure: jobs, threads, and CPUs]
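The job-queuing model on this slide can be illustrated with a minimal Python sketch. It is not the authors' implementation: `run_jobs`, `process`, and `num_workers` are hypothetical names, and Python threads merely stand in for the one-thread-per-core arrangement. Jobs are plain data items, and a single shared queue hands them to a fixed set of long-lived workers.

```python
import queue
import threading


def run_jobs(jobs, process, num_workers=4):
    """Process every job with a fixed pool of worker threads.

    One long-lived thread stands in for each core; a job is just a piece
    of data pulled from a single shared queue.
    """
    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)  # all jobs are enqueued before the workers start

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                job = job_queue.get_nowait()
            except queue.Empty:
                return  # queue drained, so this worker exits
            out = process(job)
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Result order depends on scheduling, so sort before printing.
    print(sorted(run_jobs(range(8), lambda x: x * x)))
```

Every worker contends for the same queue here, which is the source of the conflicts discussed on the next slide.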
Conflicts in Job Queue • Chance of conflicts increases as: • The number of cores increases • The time taken to update the job queue increases • The job queue is accessed more frequently (job is short) • Previous approaches • Distributed queues • Load balance is maintained by job-stealing • The chance of conflicts in one local queue is decreased • Hardware implementation • Time spent on updating the queue is reduced
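The distributed-queue approach with job stealing can be sketched as follows. This is a software illustration under stated assumptions, not the scheme evaluated in the talk: the class `StealingPool`, the victim-scanning order, and the choice to steal from the opposite end of the victim's queue are all hypothetical, and `None` is reserved to mean "no job available".

```python
import collections
import threading


class StealingPool:
    """One local job queue per worker, each guarded by its own lock."""

    def __init__(self, num_workers):
        self.queues = [collections.deque() for _ in range(num_workers)]
        self.locks = [threading.Lock() for _ in range(num_workers)]

    def push(self, worker_id, job):
        with self.locks[worker_id]:
            self.queues[worker_id].append(job)

    def pop_local(self, worker_id):
        # The common case touches only the worker's own lock.
        with self.locks[worker_id]:
            if self.queues[worker_id]:
                return self.queues[worker_id].pop()
        return None

    def steal(self, thief_id):
        # Assumption: scan victims in index order, take their oldest job.
        for victim in range(len(self.queues)):
            if victim == thief_id:
                continue
            with self.locks[victim]:
                if self.queues[victim]:
                    return self.queues[victim].popleft()
        return None

    def next_job(self, worker_id):
        job = self.pop_local(worker_id)
        return job if job is not None else self.steal(worker_id)


def run(pool, process):
    """Drain the pool with one thread per local queue."""
    results, lock = [], threading.Lock()

    def worker(worker_id):
        while True:
            job = pool.next_job(worker_id)
            if job is None:
                return  # nothing local and nothing to steal
            out = process(job)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(pool.queues))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Contention on any single local queue drops because most accesses are local, but an idle worker still has to search for and lock a victim's queue.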
Profile of SMVM [Figure: ratio of execution time (0 to 1.0) vs. number of cores (4 to 256), broken down into Conflicts, Stealing job, and Processing job]
Objectives • Requirements of load balancer • Scalability: conflict-free • Fault-tolerance • The probability of faults increases exponentially as technology scales • Contributions of this paper • Lightweight micro-network for load balancing • Scalable even with more than a thousand cores • Comprehensive fault-tolerance support
Contents • Introduction • Background • IsoNet • Architecture • Implementation • Fault-tolerance • Evaluation • Conclusion
System View [Figure: grid of tiles, each labeled CPU, I, and R]
Microarchitecture of IsoNet Node [Figure: block diagram with Job Count inputs, MUXes, Max Selector, Min Selector, comparators (Comp), Switch, Job MUX and DEMUX, and Dual Clock Stack]
How It Works [Figure: grid of nodes labeled with values 0 to 2] • Tree-based routing: for fault-tolerance
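One way to read the Max Selector, Min Selector, and Switch blocks named on the microarchitecture slide is as a tree reduction that finds the most and least loaded nodes, followed by a job transfer between them. The Python sketch below models that reading in software only. The function names, the one-job-per-step transfer, and the stopping rule (difference of at most one job) are assumptions, not details taken from the slides.

```python
def select_extremes(job_counts):
    """Tournament reduction over per-node job counts.

    Returns (index of the most loaded node, index of the least loaded node).
    Assumes at least one node.
    """
    # Each leaf forwards (count, index) as both its max and min candidate.
    level = [((c, i), (c, i)) for i, c in enumerate(job_counts)]
    while len(level) > 1:
        merged = []
        for k in range(0, len(level), 2):
            if k + 1 < len(level):
                (max_a, min_a), (max_b, min_b) = level[k], level[k + 1]
                merged.append((max(max_a, max_b), min(min_a, min_b)))
            else:
                merged.append(level[k])  # odd node out passes through
        level = merged
    (_, src), (_, dst) = level[0]
    return src, dst


def balance_step(queues):
    """Move one job from the fullest queue to the emptiest one.

    Returns False once the load is balanced to within one job.
    """
    counts = [len(q) for q in queues]
    src, dst = select_extremes(counts)
    if counts[src] - counts[dst] <= 1:
        return False
    queues[dst].append(queues[src].pop())
    return True


if __name__ == "__main__":
    queues = [[1, 2, 3, 4, 5, 6], [], [7], []]
    while balance_step(queues):
        pass
    print([len(q) for q in queues])
```

In this model no worker ever locks another worker's queue; the selection logic alone decides which job moves where.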
Single Cycle Implementation • Estimated critical path delay • 11.38 ns (87.8 MHz) • By Elmore delay model • Single cycle implementation offers low hardware cost [Figure: path between a source (Src) node and a destination (Dest) node through leaf, intermediate (Int.), and root nodes, with switches (Swt)]
Hardware Cost Estimation 674.50 * 240 * 4 = 647.52 K = 0.046% of 1.4 B (NVIDIA GTX285)
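The arithmetic on this slide can be checked with a few lines of Python. The slide does not label its factors, so the parameter names below (`per_node`, `nodes`, `scale`) are assumptions about their roles; only the numbers come from the slide.

```python
def isonet_cost(per_node, nodes, scale, chip_budget):
    """Return (total cost, percentage of the chip budget).

    Parameter roles are assumed: a per-node cost, a node count, and a
    conversion factor, compared against a whole-chip budget.
    """
    total = per_node * nodes * scale
    return total, 100.0 * total / chip_budget


if __name__ == "__main__":
    total, pct = isonet_cost(674.50, 240, 4, 1.4e9)
    print(f"{total / 1e3:.2f} K")  # 647.52 K
    print(f"{pct:.3f} %")          # 0.046 %
```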
Contents • Introduction • Background • IsoNet • Fault-tolerance • Transparent mode • Reconfiguration mode • Evaluation • Conclusion
Supporting Fault-Tolerance • Transparent mode • For faulty CPUs • Bypass the corresponding IsoNet node • Reconfiguration mode • For faulty IsoNet node • Operation • When a fault is detected, all IsoNet nodes go into the reconfiguration mode • Reconfigure the topology of IsoNet so that the faulty node is excluded • Assign a new root node if the root node fails
Reconfiguration [Figure: grid of IsoNet nodes labeled with values 0 to 3; legend: Root Node Candidate]
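The reconfiguration mode described on the previous slide (exclude the faulty node, and assign a new root if the root itself failed) can be sketched as rebuilding a routing tree over the healthy nodes. The Python below is a hedged illustration only: the 2D mesh layout, the breadth-first construction, and the rule for picking a replacement root are assumptions, and `reconfigure` is a hypothetical name. Transparent mode, which handles faulty CPUs, is not modeled.

```python
from collections import deque


def reconfigure(width, height, faulty, root):
    """Rebuild a routing tree over healthy nodes of a width x height mesh.

    Returns (root, parent, depth). Healthy nodes that the faults cut off
    from the root do not appear in the result.
    """
    healthy = {(x, y) for x in range(width) for y in range(height)} - set(faulty)
    if not healthy:
        raise ValueError("no healthy IsoNet node left")
    if root not in healthy:
        # Assumption: any deterministic rule all nodes agree on would do.
        root = min(healthy)

    parent = {root: None}
    depth = {root: 0}
    frontier = deque([root])
    while frontier:
        x, y = frontier.popleft()
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in healthy and nxt not in parent:
                parent[nxt] = (x, y)
                depth[nxt] = depth[(x, y)] + 1
                frontier.append(nxt)
    return root, parent, depth


if __name__ == "__main__":
    # The original root (1, 1) has failed, so a new root is chosen.
    root, parent, depth = reconfigure(3, 3, {(1, 1)}, (1, 1))
    print(root, depth[(2, 2)])
```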
Contents • Introduction • Background • IsoNet • Fault-tolerance • Evaluation • Experimental setup • Results • Conclusion
Experimental Setup • Simulation framework • Wind River’s Simics full-system simulator • CMP with 4 to 64 x86-compatible cores • Fedora 12 with kernel 2.6.33 • Benchmarks from recognition, mining and synthesis applications • GS: Gauss-Seidel • MMM: Dense Matrix-Matrix Multiply • SVA: Scaled Vector Addition • MVM: Dense Matrix Vector Multiply • SMVM: Sparse Matrix Vector Multiply
Results [Figure: two panels, MMM (6,473 instructions) and SMVM (2,872 instructions); execution time (10^7 cycles) and speedup vs. number of cores (4 to 64); series: Job stealing, Carbon, IsoNet, IsoNet speedup, Carbon speedup]
Beyond Hundred Cores • MMM (6,473 instructions) [Figure: relative execution time (0 to 1.0) vs. number of cores (4 to 1024); series: Carbon, IsoNet]
Profile of IsoNet [Figure: ratio of execution time (0 to 1.0) vs. number of cores (4 to 64), broken down into Conflicts, Stealing job, and Processing job]
Conclusion • Scalability is one of the key challenges in the manycore domain • Scalable load balancing is critical to utilizing a large number of processing elements • This paper proposes a novel hardware-based dynamic load distributor and balancer, called IsoNet • IsoNet also provides comprehensive fault-tolerance support • Experimental results in a full-system simulation with real applications demonstrate that IsoNet scales better than alternative techniques
Questions? Contact info Junghee Lee junghee.lee@gatech.edu Electrical and Computer Engineering Georgia Institute of Technology