Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments

Junghee Lee, Chrysostomos Nicopoulos, Yongjae Lee, HyungGyu Lee and Jongman Kim • Presented by Junghee Lee

Introduction • Manycore systems • Number of cores is increasing • Challenges in scalability • Memory • Power consumption • Cache coherence protocol • Load balancing

Contents • Introduction • Background • Programming models • Motivation • IsoNet • Fault-tolerance • Evaluation • Conclusion

Programming Models • Parallel programming models • MPI • OpenMP • Fine-grained parallelism • Emerging applications: Recognition, Mining and Synthesis • Execution time of each computation kernel is very short, but it has abundant parallelism • Excessive overhead in multithreading

Job Queuing • Creates jobs instead of threads • One thread per core is created • Thread: a set of instructions and states of execution • Job: a set of data that is processed by a thread • Job queue • Manages the list of jobs • Maintains load balance • [Figure: jobs, threads, and CPUs]
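The job-queuing idea on this slide can be sketched in software as a fixed pool of threads (one per core) pulling jobs from a single shared queue. This is only an illustration of the concept, not the hardware design from the talk; `run_jobs`, `scale_chunk`, and the job layout are hypothetical names chosen for the example.

```python
import os
import queue
import threading

def run_jobs(jobs, worker, num_threads=None):
    """Process `jobs` with a fixed pool of threads sharing one job queue."""
    num_threads = num_threads or os.cpu_count() or 1
    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    results = []
    results_lock = threading.Lock()

    def thread_main():
        # Each thread lives for the whole run and keeps fetching jobs.
        while True:
            try:
                job = job_queue.get_nowait()
            except queue.Empty:
                return  # no jobs left
            out = worker(job)
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=thread_main) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def scale_chunk(job):
    # A job is just data: here, a (start, stop) range to process.
    lo, hi = job
    return sum(2 * x for x in range(lo, hi))

# Hypothetical workload: ten short jobs handled by four threads.
jobs = [(i, i + 10) for i in range(0, 100, 10)]
total = sum(run_jobs(jobs, scale_chunk, num_threads=4))
```

Every access to the shared queue is serialized, which is the source of the conflicts discussed on the next slide.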

Conflicts in Job Queue • Chance of conflicts increases as: • The number of cores increases • The time taken to update the job queue increases • The job queue is accessed more frequently (job is short) • Previous approaches • Distributed queues • Load balance is maintained by job-stealing • The chance of conflicts in one local queue is decreased • Hardware implementation • Time spent on updating the queue is reduced
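A minimal software sketch of the distributed-queue approach with job stealing, under the assumption that each thread owns one local queue and only touches another queue when its own is empty. `LocalQueue` and `run_with_stealing` are hypothetical names, and the victim-selection order is an arbitrary choice for the example.

```python
import collections
import threading

class LocalQueue:
    """One job queue per thread, protected by its own lock."""
    def __init__(self, jobs=()):
        self.jobs = collections.deque(jobs)
        self.lock = threading.Lock()

    def pop_local(self):
        # The owner takes jobs from the tail.
        with self.lock:
            return self.jobs.pop() if self.jobs else None

    def steal(self):
        # A thief takes jobs from the head.
        with self.lock:
            return self.jobs.popleft() if self.jobs else None

def run_with_stealing(queues, worker):
    done = [[] for _ in queues]

    def thread_main(me):
        while True:
            job = queues[me].pop_local()
            if job is None:
                # Local queue is empty: try to steal from the other queues.
                for victim in range(len(queues)):
                    if victim != me:
                        job = queues[victim].steal()
                        if job is not None:
                            break
            if job is None:
                return  # nothing left anywhere (no new jobs are created)
            done[me].append(worker(job))

    threads = [threading.Thread(target=thread_main, args=(i,))
               for i in range(len(queues))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

# All jobs start in queue 0; the other three threads must steal to get work.
queues = [LocalQueue(range(100)), LocalQueue(), LocalQueue(), LocalQueue()]
done = run_with_stealing(queues, lambda x: x * x)
```

Contention on any single lock is lower than with one shared queue, but a thief still conflicts with the owner of the queue it steals from.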

Profile of SMVM • [Figure: ratio of execution time spent on conflicts, stealing jobs, and processing jobs versus number of cores (4 to 256)]

Objectives • Requirements of a load balancer • Scalability: conflict-free • Fault-tolerance • The probability of faults increases exponentially as technology scales • Contributions of this paper • Lightweight micro-network for load balancing • Scalable even with more than a thousand cores • Comprehensive fault-tolerance support

Contents • Introduction • Background • IsoNet • Architecture • Implementation • Fault-tolerance • Evaluation • Conclusion

System View • [Figure: array of tiles, each containing a CPU and blocks labeled R and I]

Microarchitecture of IsoNet Node • [Figure: block diagram with job count inputs, MUXes, a max selector, a min selector, comparators (Comp), a switch, a MUX/DEMUX pair for jobs, and a dual-clock stack]

How It Works • [Figure: grid of per-node numeric values] • Tree-based routing: for fault-tolerance
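The slides do not spell out the balancing rule, so the following is a sketch under a stated assumption: based on the "Max Selector" and "Min Selector" labels in the node diagram, each step picks the node with the most queued jobs and the node with the fewest, and moves one job between them. The stop threshold (a difference below 2) and the names `balance_step` and `balance` are choices made for this example only.

```python
def balance_step(job_counts):
    """Move one job from the fullest node to the emptiest node, in place.

    Returns False when the counts are already balanced to within one job.
    """
    indices = range(len(job_counts))
    src = max(indices, key=job_counts.__getitem__)  # assumed role of the max selector
    dst = min(indices, key=job_counts.__getitem__)  # assumed role of the min selector
    if job_counts[src] - job_counts[dst] < 2:
        return False
    job_counts[src] -= 1
    job_counts[dst] += 1
    return True

def balance(job_counts):
    """Repeat balance_step until nothing moves; return (counts, steps)."""
    counts = list(job_counts)
    steps = 0
    while balance_step(counts):
        steps += 1
    return counts, steps

# Hypothetical starting point: all eight jobs sit on node 0.
counts, steps = balance([8, 0, 0, 0])
```

In this model no node ever contends for a shared queue; the source and destination are decided globally in each step.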

Single Cycle Implementation • Estimated critical path delay • 11.38 ns (87.8 MHz) • By Elmore delay model • Single cycle implementation offers low hardware cost • [Figure: critical path from source node to destination node through leaf, intermediate, and root nodes and switches]
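For readers unfamiliar with the Elmore delay model, here is a minimal sketch for a simple RC ladder, plus the conversion from a critical path delay to a maximum clock frequency. The resistance and capacitance values in the usage lines are hypothetical; the talk does not give the wire parameters behind the 11.38 ns estimate.

```python
def elmore_delay(resistances, capacitances):
    """Elmore delay at the far end of an RC ladder.

    Segment i has series resistance R_i followed by capacitance C_i to ground;
    the delay is the sum over i of C_i * (R_1 + ... + R_i).
    """
    delay = 0.0
    upstream_r = 0.0
    for r, c in zip(resistances, capacitances):
        upstream_r += r
        delay += upstream_r * c
    return delay

def max_frequency_mhz(critical_path_ns):
    """Highest clock frequency (MHz) that fits one critical path per cycle."""
    return 1e3 / critical_path_ns

# Hypothetical three-segment wire: ohms and farads give seconds.
wire_delay_s = elmore_delay([100.0, 100.0, 100.0], [1e-12, 1e-12, 1e-12])

# The delay quoted on the slide corresponds to a clock just under 88 MHz.
f_mhz = max_frequency_mhz(11.38)
```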

Hardware Cost Estimation • 674.50 × 240 × 4 = 647.52 K • About 0.046% of the 1.4 B transistors in an NVIDIA GTX285
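The arithmetic on this slide can be checked in a few lines. The slide does not label the three factors, so the variable names below are an assumed reading (a per-node cost, a node count, and a per-unit multiplier), not something the talk states.

```python
# Assumed reading of the slide's factors; the slide itself gives no units.
cost_per_node = 674.50       # per-node cost figure from the slide
num_nodes = 240              # number of nodes in the estimate
multiplier = 4               # per-unit multiplier from the slide
reference_budget = 1.4e9     # transistor count quoted for the NVIDIA GTX285

total_cost = cost_per_node * num_nodes * multiplier
percent_of_budget = 100.0 * total_cost / reference_budget
```

Here `total_cost` matches the slide's 647.52 K, and `percent_of_budget` rounds to the quoted 0.046%.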

Contents • Introduction • Background • IsoNet • Fault-tolerance • Transparent mode • Reconfiguration mode • Evaluation • Conclusion

Supporting Fault-Tolerance • Transparent mode • For faulty CPUs • Bypass the corresponding IsoNet node • Reconfiguration mode • For faulty IsoNet node • Operation • When a fault is detected, all IsoNet nodes go into the reconfiguration mode • Reconfigure the topology of IsoNet so that the faulty node is excluded • Assign a new root node if the root node fails
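One way to picture the reconfiguration mode is as rebuilding a routing tree over the healthy nodes only. The sketch below assumes a 2D mesh and uses a breadth-first search from the first healthy root candidate; the mesh layout, the candidate list, and the function name `reconfigure` are assumptions for illustration, and the actual hardware procedure may differ.

```python
from collections import deque

def reconfigure(width, height, faulty, root_candidates):
    """Return (root, parent) describing a tree over the healthy mesh nodes.

    `parent` maps each reachable healthy node to its parent; the root maps to None.
    Raises StopIteration if every root candidate is faulty.
    """
    healthy = {(x, y) for x in range(width) for y in range(height)} - set(faulty)
    # Assign a new root if the preferred root has failed.
    root = next(c for c in root_candidates if c in healthy)

    parent = {root: None}
    frontier = deque([root])
    while frontier:
        x, y = frontier.popleft()
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            # Faulty nodes are never in `healthy`, so the tree excludes them.
            if nxt in healthy and nxt not in parent:
                parent[nxt] = (x, y)
                frontier.append(nxt)
    return root, parent

# Hypothetical 3x3 mesh whose preferred root (1, 1) has failed.
root, parent = reconfigure(3, 3, faulty={(1, 1)}, root_candidates=[(1, 1), (0, 0)])
```

Transparent mode is simpler and is not modeled here: the node itself stays in the tree and only its faulty CPU is bypassed.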

Reconfiguration • [Figure: grid of per-node numeric values during reconfiguration, with the root node candidate marked]

Contents • Introduction • Background • IsoNet • Fault-tolerance • Evaluation • Experimental setup • Results • Conclusion

Experimental Setup • Simulation framework • Wind River’s Simics full-system simulator • CMP with 4 to 64 x86-compatible cores • Fedora 12 with kernel 2.6.33 • Benchmarks from recognition, mining and synthesis applications • GS: Gauss-Seidel • MMM: Dense Matrix-Matrix Multiply • SVA: Scaled Vector Addition • MVM: Dense Matrix Vector Multiply • SMVM: Sparse Matrix Vector Multiply

Results • [Figure: two panels, MMM (6,473 instructions) and SMVM (2,872 instructions); execution time (10^7 cycles) and speedup versus number of cores (4 to 64) for job stealing, Carbon, and IsoNet, with speedup curves for IsoNet and Carbon]

Beyond Hundred Cores • MMM (6,473 instructions) • [Figure: relative execution time versus number of cores (4 to 1024) for Carbon and IsoNet]

Profile of IsoNet • [Figure: ratio of execution time spent on conflicts, stealing jobs, and processing jobs versus number of cores (4 to 64)]

Conclusion • Scalability is one of the key challenges in the manycore domain • Scalability in load balancing is critical to utilizing a large number of processing elements • This paper proposes a novel hardware-based dynamic load distributor and balancer, called IsoNet • IsoNet also provides comprehensive fault-tolerance support • Experimental results in a full-system simulation with real applications demonstrate that IsoNet scales better than alternative techniques

Questions? Contact info Junghee Lee junghee.lee@gatech.edu Electrical and Computer Engineering Georgia Institute of Technology

Thank you!
