Design and Analysis of Parallel N-Queens on Reconfigurable Hardware with Handel-C and MPI Vikas Aggarwal, Ian Troxel, and Alan D. George High-performance Computing and Simulation (HCS) Research Lab Department of Electrical and Computer Engineering University of Florida Gainesville, FL
Outline • Introduction • N-Queens Solutions • Backtracking Approach • N-Queens Parallelization • Experimental Setup • Handel-C and Lessons Learned • Results and Analysis • Conclusions • Future Work and Acknowledgements • References
Introduction • N-Queens dates back to the 19th century (studied by Gauss) • Classical combinatorial problem, widely used as a benchmark because of its simple and regular structure • Problem involves placing N queens on an N×N chessboard such that no queen can attack any other • Benchmark code versions include finding the first solution and finding all solutions
Introduction • Example (N = 5): BOARD = [1 3 5 2 4], i.e. the queen in column i sits in row BOARD(i) • Mathematically stated: find a permutation of the BOARD() vector containing the numbers 1:N, such that for any i != j: Board( i ) - i != Board( j ) - j and Board( i ) + i != Board( j ) + j
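The permutation condition above can be sketched as a small checker. This is an illustrative Python stand-in, not code from the slides; the function name is ours.

```python
# Hypothetical helper: checks the permutation condition from the slide for a
# 1-indexed placement vector, where board[i-1] is the row of the queen in column i.
def is_valid(board):
    n = len(board)
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            bi, bj = board[i - 1], board[j - 1]
            # Board(i) - i == Board(j) - j or Board(i) + i == Board(j) + j
            # means two queens share a diagonal.
            if bi - i == bj - j or bi + i == bj + j:
                return False
    # The rows must form a permutation of 1..N (no shared rows or columns).
    return sorted(board) == list(range(1, n + 1))

print(is_valid([1, 3, 5, 2, 4]))  # the 5x5 example above -> True
```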
N-Queens Solutions • Various approaches to the problem • Brute force[2] • Local search algorithms[4] • Backtracking[2], [7] , [11], [12], [13] • Divide and conquer approach[1] • Permutation generation[2] • Mathematical solutions[6] • Graph theory concepts[2] • Heuristics and AI[4], [14]
Backtracking Approach • One of the few approaches that guarantees a solution, though it can be slow • Can be seen as a form of intelligent depth-first search • Complexity of backtracking typically rises exponentially with problem size • Good test case for performance analysis of RC systems, as the problem is complex even for small data size* • Traditional processors provide a suboptimal platform for this iterative application due to the serial nature of their processing pipelines • Tremendous speedups achieved by adding parallelism at the logic level via RC * For an 8x8 board, 981 moves (876 tests + 105 backtracks) are required for the first solution alone
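The slides implement this search in Handel-C on the FPGA; a minimal Python sketch of the same sequential backtracking (column by column, trying rows in order, undoing a placement when a column dead-ends) looks like this:

```python
def first_solution(n):
    """Depth-first backtracking: place one queen per column, trying rows
    1..n in order; backtrack when no row in the current column is safe."""
    board = []  # board[c-1] = row (1-indexed) of the queen in column c

    def safe(row, col):
        # Candidate square conflicts if it shares a row or a diagonal
        # with any previously placed queen.
        for c, r in enumerate(board, start=1):
            if r == row or abs(r - row) == abs(c - col):
                return False
        return True

    def place(col):
        if col > n:
            return True
        for row in range(1, n + 1):
            if safe(row, col):
                board.append(row)
                if place(col + 1):
                    return True
                board.pop()  # backtrack: undo the move and try the next row
        return False

    return board if place(1) else None

print(first_solution(8))  # -> [1, 5, 8, 6, 3, 7, 2, 4]
```

Trying rows in ascending order makes the search find the lexicographically first solution, which for the 8x8 board is the well-known placement 15863724.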
Backtracking Approach [Tables: # of operations for 1st solution [7]; number of solutions [8]] • Tables provide an estimate of the backtracking approach’s complexity • Problem can be made to find the first solution or the total number of solutions • Total number of solutions is obviously a more challenging problem • Interesting observation: 1st solution’s complexity (i.e. number of operations) does not increase monotonically with board size
N-Queens Parallelization • Different levels of parallelism added to improve performance • Functional Unit replication • Parallel column check • Parallel row check • Parallelization comparison (assuming the first four queens have been placed and the fifth queen starts from the 1st row): Sequential: 11 cycles; Parallel column check: 3 cycles; Multiple row check appended: 1 cycle, an 11x speedup over sequential operation
Experimental setup • Experiments conducted using RC1000 boards from Celoxica, Inc., and Tarari RC boards from Tarari, Inc. • Each RC1000 board features a Xilinx Virtex-2000 FPGA, 8 MB of on-card SRAM, and PCI Mezzanine Card (PMC) sockets for connecting two daughter cards • Each Tarari board features two user-programmable Xilinx Virtex-II FPGAs in addition to a controller FPGA, 256 MB of DDR SDRAM • Configurations designed in Handel-C using Celoxica’s application mapping tool DK-2, along with Xilinx ISE for place and route • Performance compared against 2.4 GHz Xeon server and 1.33 GHz Athlon server
Celoxica RC1000 • PCI-based card having one Xilinx FPGA and four memory banks • FPGA configured from the host processor over the PCI bus • Four memory banks, each of 2 MB, accessible to both the FPGA and any other device on the PCI bus • Data transfers: the RC1000 provides 3 methods of transferring data over the PCI bus between host processor and FPGA: • Bulk data transfers performed via the memory banks • Two unidirectional 8-bit ports, called control and status ports, for direct comm. between FPGA and PCI bus (note: this method used in our experiments) • User I/O pins USER1 and USER0 for single-bit communication with the FPGA • API-layer calls from host to configure and communicate with RC board * Figure courtesy of Celoxica RC1000 manual
Tarari Content Processing Platform • PCI-based board having 3 FPGAs and a 256 MB memory bank • Two Xilinx Virtex-II FPGAs available for the user to load configuration files from the host over the PCI bus • Each Content Processing Engine or CPE (user FPGA) configured with one or two agents • Third FPGA acts as a controller, providing high-bandwidth access to memory and configuration of the CPP with agents • 256 MB of DDR SDRAM for data sharing between the CPEs and the host application • Configuration files first uploaded into the memory slots and then used to configure each FPGA • Both single-word transfers and DMA transfers supported between the host and the CPP * Figure courtesy of Tarari CP-DK manual
Handel-C Programming Paradigm • Handel-C acts as a bridge between VHDL and “C” • Comparison with conventional C • More explicit provisioning of parallelism within the code • Variables declared with exact bit-lengths to save space • Provides more bit-level manipulations beyond shifts and logic operations • Limited support for many ANSI C standards and extensions • Comparison with VHDL • Application porting is much faster for experienced coders • Similar to VHDL behavioral models • Lacks VHDL concurrent signal assignments, which can be suspended until changes on input triggers (Handel-C requires polling) • Provides more high-level routines
Handel-C Design Specifics • Design makes use of the following two approaches • Approach 1 • Use of an array of binary numbers holding a ‘1’ at a particular bit position to indicate the location of the queen in that column • A 32 x 32 board will require an array of 32 elements of 32 bits each • Correspondingly use bit-shift operations and logical-and operations to check the diagonal and row conditions • More closely corresponds to the way the operations take place on the RC fabric • Approach 2 • Use of an array of integers instead of binary numbers • Correspondingly use the mathematical model of the problem to check the validation conditions • Smaller variables yield better device utilization; slices occupied drop from about 75% to about 15% for similar performance and parallelism • Approach 2 found to be more amenable for Handel-C designs
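The two validation styles can be sketched side by side (Python stand-ins for the Handel-C code; names are illustrative). Both must agree on every candidate square; Approach 1 shifts a one-hot bit by the column distance, Approach 2 uses the arithmetic diagonal conditions.

```python
def safe_bits(columns, row_bit, col):
    """Approach 1: placed columns as one-hot bit vectors; shift the
    candidate's bit by the column distance to test row and diagonals."""
    for c, placed in enumerate(columns, start=1):
        dist = col - c
        if placed & (row_bit | row_bit << dist | row_bit >> dist):
            return False
    return True

def safe_ints(rows, row, col):
    """Approach 2: placements as integers; the conditions
    Board(i) +/- i != Board(j) +/- j detect shared rows and diagonals."""
    for c, r in enumerate(rows, start=1):
        if r == row or r - c == row - col or r + c == row + col:
            return False
    return True

# Same four placed queens (rows 1, 3, 5, 2) in both encodings; test column 5.
as_bits = [1 << (r - 1) for r in [1, 3, 5, 2]]
print([safe_ints([1, 3, 5, 2], row, 5) for row in range(1, 6)])
# -> [False, False, False, True, False]
```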
Lessons Learned with Handel-C Some interesting observations: • Code for which place and route did not work finally worked when the function parameters were replaced by global variables • Less control at a lower level, with place and route being a consistent problem even with designs using only 40% of total slices • Self-referenced operations (e.g. a = a + x) affect the design adversely, so use intermediate variables • Order of operations and conditional statements can affect the design • Useful to reduce wider-bit operations into a sequence of narrower-bit operations • Balancing “if” with “else” branches leads to better designs • Comments in the main program sometimes affected the synthesis, leading to place and route errors in fully commented code • We are still learning more every day!
Sequential First-Solution Results • Sequential version does not perform well versus the Xeon and Athlon CPUs • Algorithm needs an efficient design to minimize resource utilization • The results do not include the one-time configuration overhead of ~150 ms RC1000 clock speed @ 40 MHz
Parallel First-Solution Results RC1000 clock speed @ 25 MHz • The most parallel algorithm runs about 20x faster than the sequential algorithm on the RC fabric • Parallel algorithm with two row checks almost matches the 2.4 GHz Xeon server, while the 6-row check outperforms it by 74% • Further increasing the number of rows checked is likely to further improve performance for larger problem sizes
Total Number of Solutions Method • Employ divide-and-conquer approach • Seen as a parallel depth-first search • Solutions obtained with queen positioned in any row in the first column are independent from solutions with queens in other positions • Technique allows for high degree of parallelism (DoP)
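A Python sketch of the divide-and-conquer idea (the slides implement it in Handel-C): fix the column-1 queen in a given row, count that subtree with bitmask backtracking, and sum the independent subtree counts.

```python
def count_with_first_queen(n, first_row):
    """Count all solutions whose column-1 queen sits in `first_row`
    (1-indexed). Each such subtree is independent of the others."""
    full = (1 << n) - 1

    def go(cols, d1, d2):
        # cols/d1/d2 = rows of the current column attacked via occupied
        # rows, rising diagonals, and falling diagonals respectively.
        if cols == full:
            return 1
        total = 0
        free = full & ~(cols | d1 | d2)
        while free:
            bit = free & -free  # lowest remaining safe row
            free -= bit
            total += go(cols | bit, ((d1 | bit) << 1) & full, (d2 | bit) >> 1)
        return total

    bit = 1 << (first_row - 1)
    return go(bit, (bit << 1) & full, bit >> 1)

# The subtrees are independent, so these calls could run on separate FUs;
# combining the results is a simple sum.
partials = [count_with_first_queen(8, r) for r in range(1, 9)]
print(sum(partials))  # -> 92 solutions on the 8x8 board
```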
One-Board Total-Solutions Results • Designs on hardware perform around 1.7x faster than the Xeon server • Performance on both RC platforms similar for same clock rates • RC1000 performs a notch better for smaller chess board sizes, while the Tarari CPP’s performance improves with chess board size • Almost the entire Virtex-II chip on the Tarari is occupied for one FU RC1000 and Tarari clock speed @ 33 MHz
Multiple Functional Units (FUs) • Used additional FUs per chip to increase parallelism per chip • Each FU searches for the number of solutions corresponding to a subset of rows in the first column • The controller • Handles communication with the host • Invokes all FUs in parallel • Combines all results [Figure: host processor connected to the on-board FPGA, whose controller drives FUs fu1 through fu10]
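A software model of the controller pattern above (threads stand in for FUs; names are illustrative): hand each FU a subset of the first-column rows, invoke all FUs in parallel, and combine the partial counts.

```python
from concurrent.futures import ThreadPoolExecutor

def count_with_first_queen(n, first_row):
    """Bitmask count of all solutions whose column-1 queen is `first_row`."""
    full = (1 << n) - 1
    def go(cols, d1, d2):
        if cols == full:
            return 1
        total, free = 0, full & ~(cols | d1 | d2)
        while free:
            bit = free & -free
            free -= bit
            total += go(cols | bit, ((d1 | bit) << 1) & full, (d2 | bit) >> 1)
        return total
    bit = 1 << (first_row - 1)
    return go(bit, (bit << 1) & full, bit >> 1)

def controller(n, num_fus):
    """Round-robin the first-column rows across `num_fus` workers, run
    them in parallel, and sum the results, as the on-chip controller does."""
    subsets = [range(1, n + 1)[i::num_fus] for i in range(num_fus)]
    fu = lambda rows: sum(count_with_first_queen(n, r) for r in rows)
    with ThreadPoolExecutor(max_workers=num_fus) as pool:
        return sum(pool.map(fu, subsets))

print(controller(8, 3))  # -> 92
```

Because the subtrees never interact, the combine step is a plain sum; the same structure carries over to the multi-board MPI version on the following slides.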
Total-Solutions Results with Multiple FUs RC1000 clock speed @ 30 MHz • RC1000 with three FUs performs almost 5x faster than Xeon server • Speedup increases near linearly with number of FUs • Area occupied scales linearly with number of FUs RC speedup vs. Xeon server for board size of 17
MPI for Inter-Board Communication • To further increase system speedup (having more functional units), multiple boards employed • Each FU programmed to search a subset of the solution space • Servers communicate using the Message Passing Interface (MPI) to start the search in parallel and obtain the final result [Figure: two host servers linked via MPI, each with an on-board FPGA holding one or multiple FUs]
Total-Solutions Results with MPI • Results show total execution time including MPI overhead • Minimal MPI overhead incurred (high computation-to-communication ratio) • Communication overhead bounded to 3 ms regardless of problem size and initialization overhead is around 750 ms • Overhead becomes negligible for large problem sizes • Speedup scales near linearly with number of boards • 4-board Tarari design performs about 6.5x faster than Xeon server Tarari CPP clock speed @ 33 MHz RC speedup vs. Xeon server for board size of 12
Total-Solutions Results with MPI • Results show total execution time including MPI overhead • Minimal MPI overhead incurred (high computation-to-communication ratio) • Communication overhead bounded to 3 ms regardless of problem size and initialization overhead is around 750 ms • Overhead becomes negligible for large problem sizes • Speedup scales near linearly with number of boards • 4-board RC1000 design performs about 12x faster than Xeon server RC1000 clock speed @ 30 MHz RC speedup vs. Xeon server for board size of 12
Total-Solutions Results with MPI • Communication overheads still remain low, while MPI initialization overheads increase with the number of boards (now 1316 ms for 8 boards) • Heterogeneous mixture of boards employed to solve the problem, coordinating via MPI • Total of 8 boards (4 RC1000 and 4 Tarari boards) allows up to 16 (4x3 + 4x1) FUs • 8 boards perform about 21x faster than the Xeon server for chess board size of 16 • What appears to be an unfair comparison really shows how the approach scales to many more FUs per FPGA (on higher-density chips) RC1000 clock speed @ 30 MHz and Tarari clock speed @ 33 MHz
Conclusions • Parallel backtracking for solving the N-Queens problem in RC shows promise for performance • N-Queens is an important benchmark in the HPC community • RC devices outperform CPUs for N-Queens due to RC’s efficient processing of fine-grained, parallel, bit-manipulation operations • Methods previously considered inefficient on CPUs, like backtracking, can be improved by reexamining their design • This approach can be applied to many other applications • Numerous parallel approaches developed at several levels • Handel-C lessons learned • A “C-based” programming model for application mapping provides a degree of higher-level abstraction, yet still requires the programmer to code from a hardware perspective • Solutions produced to date show promise for application mapping
Future Work and Acknowledgements • Compare application mappers with HDL design in terms of mapping efficiency • Develop and use direct communication between FPGAs to avoid MPI overhead • Export approach featured in this talk to variety of algorithms and HPC benchmarks for performance analysis and optimization • Develop library of application and middleware kernels for RC-based HPC • We wish to thank the following for their support of this research: • Department of Defense • Xilinx • Celoxica • Tarari • Key vendors of our HPC cluster resources (Intel, AMD, Cisco, Nortel)
References [1] B. Abramson and M. M. Yung, “Divide and Conquer under Global Constraints: A Solution to the N-Queens Problem” [2] C. Erbas, S. Sarkeshik, and M. M. Tanik, “Different Perspectives of the N-Queens Problem”, Department of Computer Science and Engineering, Southern Methodist University, Dallas [3] H. S. Wilf, “Algorithms and Complexity”, University of Pennsylvania, Philadelphia [4] R. Sosic and J. Gu, “Fast Search Algorithms for the N-Queens Problem”, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 21, No. 6, pp. 1572-1576, Nov/Dec 1991 [5] http://www.cit.gu.edu.au/~sosic/nqueens.html [6] http://bridges.canterbury.ac.nz/features/eight.html [7] http://www.math.utah.edu/~alfeld/queens/queens.html [8] http://www.jsomers.com/nqueen_demo/nqueens.html [9] “A Polynomial Time Algorithm for the N-Queens Problem” [10] http://remus.rutgers.edu/~rhoads/Code/code.html [11] http://www.mactech.com/articles/mactech/Vol.13/13.12/TheEightQueensProblem/index.html [12] http://www2.ilog.com/preview/Discovery/samples/nqueens/ [13] http://www.infosun.fmi.uni-passau.de/br/lehrstuhl/Kurse/Proseminar_ss01/backtracking_nm.pdf [14] J. Han, J. Liu, and Q. Cai, “From ALife Agents to a Kingdom of N Queens” [15] http://www.wi.leidenuniv.nl/~kosters/nqueens.html [16] http://www.dsitri.de/projects/NQP/