Cellular Neural Networks Training and Implementation on a GPU Cluster
Bipul Luitel, Cameron Johnson & Sinchan Roychowdhury
Real-Time Power and Intelligent Systems Lab, Missouri S&T, Rolla, MO
CS387 Spring 2011, April 26, 2011
Outline
• Cellular Neural Networks
  • Architecture
  • Application – Wide Area Monitoring
• Training a neural network using particle swarm optimization
  • PSO - Introduction
  • Implementation
  • Parallelization – Concurrencies in CNN and PSO
• GPU Computing in RTPIS Lab
  • Architecture of the RTPIS GPU cluster
  • Some timing comparisons
• Discussion
Cellular Neural Networks
• Cellular Neural Networks (CNN)
  • A class of neural network architecture distinct from the feedforward and feedback architectures.
  • Traditionally, each element of the network (cell) is a computational unit.
  • A different variation places a neural network (NN) on each cell, with the cells connected to each other in a problem-dependent fashion.
  • Each cell is a feedforward or feedback neural network.
  • The output of one NN (cell) may be connected to the input(s) of one or more neighboring cells, thus forming an iterative feedback loop during training.
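The coupling between cells described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each cell is reduced to a single tanh neuron (the slides use a full feedforward or feedback NN per cell), and the three-cell layout, the weights, and the names `cell_output` and `step` are hypothetical.

```python
import math

def cell_output(weights, own_input, neighbor_outputs):
    """One cell, reduced here to a single tanh neuron (an assumption).

    weights[0] scales the cell's own input; the remaining weights scale
    the outputs received from neighboring cells.
    """
    total = weights[0] * own_input
    for w, y in zip(weights[1:], neighbor_outputs):
        total += w * y
    return math.tanh(total)

def step(weights_per_cell, inputs, prev_outputs, neighbors):
    """Advance every cell one time step using the neighbors' previous outputs."""
    return [
        cell_output(
            weights_per_cell[i],
            inputs[i],
            [prev_outputs[j] for j in neighbors[i]],
        )
        for i in range(len(inputs))
    ]

# Hypothetical 3-cell network: every cell listens to the other two.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
weights_per_cell = [[0.5, 0.1, -0.2] for _ in range(3)]
outputs = [0.0, 0.0, 0.0]
for _ in range(5):
    # Each pass feeds the previous outputs back in: the iterative feedback loop.
    outputs = step(weights_per_cell, [1.0, 0.5, -0.5], outputs, neighbors)
```

Because every cell reads only the previous step's outputs, all cells in a time step can be evaluated independently, which is the property the later parallelization slides rely on.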
Application – Wide Area Monitoring in Power Systems
• The smart grid is a complex distributed cyber-physical power and energy system
  • Communication, computation and control
• Remote monitoring of distributed systems is necessary to assess the status of the system
• Wide area monitoring
  • Monitor status of generators – speed deviation
  • Monitor status of buses – bus voltage
  • Assist in predictive state estimation in order to provide real-time control
• Becomes a challenge when the network is large, with many parameters to be monitored
• Generator speed deviation predictions using cellular neural networks
Implementation of CNN-based WAM
[Figure: Wide Area Monitoring System for Twelve-bus System]
[Figure: Wide Area Monitoring System for Two-Area Four-Machine System]
Implementation
• The architecture of the CNN can be problem dependent.
Training
• Training of CMLP is a challenge due to iterative feedback.
    Initialize weights
    For each training sample
        For each cell
            Train the neural network
Training
• Training in parallel using particle swarm optimization.
Particle Swarm Optimization
• Population-based search algorithm
  • Many points tested
  • Form of guided search
• Sound familiar?
  • GAs are also population-based
  • Darwinian evolution
• PSO assumes a flock or swarm instead
  • Goals scattered about a search space
  • More searchers means they’re found faster
PSO Algorithm
• Define the objective function
• Identify the independent variables
  • These make up the coordinate system of the search hyperspace
• Initialize a population of n particles with random coordinates and velocities
• Find their fitnesses
  • Record personal best
  • Record global best
• Update velocities
• Update locations
• Find new fitnesses
  • Update personal best
  • Update global best
• Run until termination conditions reached
  • Minimum acceptable fitness reached
  • Maximum number of fitness tests reached
  • Maximum real-world time allotted passed
[Figure: particles in an x-y search space labeled with fitness values (f=3, f=6, f=0, f=1, f=1, f=0, f=0, f=5)]
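The steps above can be sketched as a short, self-contained Python function. This is a minimal sketch rather than the authors' implementation: it assumes minimization, uses the common inertia-weight velocity update, and the coefficients (`w`, `c1`, `c2`), the search bounds, and the `sphere` test function are illustrative choices that do not come from the slides. Of the three termination conditions, only a minimum acceptable fitness (`target_fit`) and an iteration cap are shown.

```python
import random

def pso(objective, dim, n_particles=20, iterations=100, seed=0,
        w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0, target_fit=None):
    """Global-best PSO that minimizes `objective` over `dim` variables."""
    rng = random.Random(seed)
    # Initialize particles with random coordinates and velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    # Find fitnesses; record personal and global bests.
    pbest_pos = [p[:] for p in pos]
    pbest_fit = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest_pos, gbest_fit = pbest_pos[g][:], pbest_fit[g]

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Update velocity, then location.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest_pos[i][d] - pos[i][d])
                             + c2 * r2 * (gbest_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            # Find the new fitness; update personal and global bests.
            fit = objective(pos[i])
            if fit < pbest_fit[i]:
                pbest_pos[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest_pos, gbest_fit = pos[i][:], fit
        # Termination: minimum acceptable fitness reached.
        if target_fit is not None and gbest_fit <= target_fit:
            break
    return gbest_pos, gbest_fit

def sphere(x):
    """Illustrative objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

best_pos, best_fit = pso(sphere, dim=2)
```

For neural network training, `dim` would be the number of weights and `objective` would score the network output, as the later implementation slides state.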
Parallel PSO Implementation
• Obvious concurrency
  • Each particle on a node
  • Each particle’s fitness is independent
  • Particles can update velocities and positions concurrently
• What’s the catch?
  • Communication overhead
  • Have to share fitness information to determine new gbest
PSO Topology and Cluster Topology
• How do the particles communicate?
• Consider your hardware
  • Particles on the same node (if #particles > #nodes) can have full connection
  • Particles on adjacent nodes can communicate in O(1)
  • Particles on further nodes require more hops
• Local Best: best amongst all neighbors
[Figure: ring, wheel, and star topologies, annotated "Finds local bests", "Multi-step to find Global best", and "Global bests"]
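A small Python sketch of the local-best idea on a ring follows. It assumes that lower fitness is better and that each particle's neighborhood is itself plus one particle on either side; the fitness values and the helper names `ring_local_best` and `ring_hops` are made up for illustration.

```python
def ring_local_best(fitness, i, radius=1):
    """Index of the best (lowest-fitness) particle in particle i's ring neighborhood."""
    n = len(fitness)
    neighborhood = [(i + k) % n for k in range(-radius, radius + 1)]
    return min(neighborhood, key=lambda j: fitness[j])

def ring_hops(i, j, n):
    """Number of hops between positions i and j on a ring of n nodes."""
    d = abs(i - j)
    return min(d, n - d)

# Hypothetical fitness values for 8 particles arranged on a ring.
fitness = [3.0, 6.0, 0.5, 1.0, 1.2, 0.2, 0.9, 5.0]
lbest = [ring_local_best(fitness, i) for i in range(len(fitness))]
# lbest == [0, 2, 2, 2, 5, 5, 5, 6]
```

Each particle only needs data from adjacent positions, so the exchange stays within one hop; the price is that the overall best (particle 5 here) reaches distant particles only after several iterations.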
Training a CNN with a PSO Using Parallel Processing
• Concurrency: each particle of the PSO
  • Implement the entire CNN on each node, and treat each node as a particle
• Concurrency: NNs that make up the CNN
  • Implement the CNN in parallel
  • One NN per node; operate independently during a single time step
  • Communication between NNs is sparse
  • Have each candidate weight set tested sequentially on the node holding a given NN
• With arbitrary nodes?
  • Take advantage of both concurrencies!
  • Each node receives one cell
  • A CNN takes c nodes, where c is the number of cells
  • For n particles, the PSO then uses n x c nodes
[Figure: grids of particles P1–P16 and neural networks NN1–NN16]
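The n x c layout can be made concrete with a small index mapping. The sketch below assumes a simple row-major assignment in which the c cells of one particle occupy consecutive node ranks; the slides do not specify the actual assignment, and the function names and the sizes (16 particles, 4 cells) are hypothetical.

```python
def node_rank(particle, cell, cells_per_cnn):
    """Rank of the node that holds `cell` of `particle` (row-major, assumed)."""
    return particle * cells_per_cnn + cell

def particle_and_cell(rank, cells_per_cnn):
    """Inverse mapping: which (particle, cell) pair a node rank holds."""
    return divmod(rank, cells_per_cnn)

n_particles, n_cells = 16, 4          # hypothetical sizes
total_nodes = n_particles * n_cells   # n x c = 64 nodes

# Cell 2 of particle 3 lands on rank 14, and rank 14 maps back to (3, 2).
rank = node_rank(3, 2, n_cells)
```

With this layout, the cells of one candidate CNN sit on neighboring ranks, which keeps the sparse cell-to-cell traffic local, while fitness sharing between particles spans the larger strides.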
Architecture of RTPIS Cluster
• Hardware Configuration
  • Nodes: 1 (master) + 16 (nodes) = 17
  • CPU: 17 (nodes) x 2 (CPUs/node) x 4 (cores/CPU) x 2 (threads/core) = 272 hardware threads
  • GPU: 16 (nodes) x 2 (NVIDIA Tesla C2050 GPUs/node) x 448 (CUDA cores/GPU) = 14336 CUDA cores
  • Memory: 17 (nodes) x 12 GB = 204 GB
  • Storage: 2 x 500 GB (OS) + 10 x 2 TB (Master) + 16 x 500 GB (nodes) = 29 TB
• Software
  • Operating System: OpenSUSE 11 Linux
  • Others: Torque, Maui scheduler, CUDA toolkit/libraries and GNU compilers, C/C++ with MPI libraries, MATLAB Distributed Computing Server
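The totals on this slide can be checked with a few lines of arithmetic. The variable names are illustrative, and the storage figure assumes decimal units (1 TB = 1000 GB).

```python
nodes = 1 + 16                      # master + compute nodes
cpu_threads = nodes * 2 * 4 * 2     # CPUs/node x cores/CPU x threads/core
cuda_cores = 16 * 2 * 448           # compute nodes x GPUs/node x CUDA cores/GPU
memory_gb = nodes * 12              # 12 GB per node

# Storage: OS disks and node disks are given in GB, the master array in TB.
storage_tb = (2 * 500 + 16 * 500) / 1000 + 10 * 2

print(nodes, cpu_threads, cuda_cores, memory_gb, storage_tb)
# prints: 17 272 14336 204 29.0
```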
GPU Computing - MATLAB
• GPU computing is built into MATLAB R2010b
• Run part of the code on the GPU from MATLAB
  • Use the gpuArray or arrayfun commands in MATLAB
  • Use compiled CUDA code as a PTX file in MATLAB
• Only useful when the computing time on the CPU exceeds the communication time for transferring variables between CPU and GPU.
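The last point is a break-even condition, and a toy cost model makes it explicit. The sketch below is an assumption-laden illustration, not a measurement: it compares the CPU time against GPU compute time plus transfer time, which is slightly stricter than the rule of thumb on the slide, and all timings are invented.

```python
def gpu_offload_pays_off(cpu_seconds, gpu_seconds, transfer_seconds):
    """True when running on the GPU, including data transfer, beats the CPU."""
    return cpu_seconds > gpu_seconds + transfer_seconds

# Small problem: the transfer dominates, so staying on the CPU is faster.
small = gpu_offload_pays_off(cpu_seconds=0.002, gpu_seconds=0.0005,
                             transfer_seconds=0.01)

# Large problem: the computation dominates, so the offload is worthwhile.
large = gpu_offload_pays_off(cpu_seconds=12.0, gpu_seconds=0.8,
                             transfer_seconds=0.3)
```

In practice the three numbers would come from timing the actual code paths, and keeping data resident on the GPU across several operations reduces the transfer term.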
MATLAB Commands for Working on the GPU
• Create a variable on the CPU and move it to the GPU
  • gpuVar = gpuArray(cpuVar);
• Create a variable on the GPU directly
  • gpuVar = parallel.gpu.GPUArray.zeros(5,10);
• Any function that uses GPU variables is performed on the GPU
  • gpuVar = fft(gpuVar);
  • gpuVar = abs(gpuVar1*gpuVar2);
• Use arrayfun to perform an operation on the GPU
  • gpuVar = arrayfun(@min,gpuVar1);
  • gpuVar = arrayfun(@customFn,gpuVar1,cpuVar);
  • cpuVar = arrayfun(@customFn,cpuVar1,cpuVar2,cpuVar3)
• Move a variable from the GPU to the CPU
  • cpuVar = gather(gpuVar);
Parallel Programming Introduction – MATLAB
• MPI in C vs. MATLAB parallel computing
    C (MPI)                                                    MATLAB
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);                       labindex
    MPI_Bcast(buffer,count,datatype,root,comm);                labBroadcast(source,value), labBroadcast(source)
    MPI_Reduce(sendbuf,recvbuf,count,datatype,op,root,comm);   gop(@function,value)
    MPI_Barrier(comm)                                          labBarrier;
[Figure: MATLAB PCT clients, MATLAB Job Manager, and labs]
Examples - implementation
• Sequential neural network training using PSO
    Initialization
    For each iteration
        For each particle
            Fitness evaluation
            gbest update
            Update position and velocity
• Dimensions = number of weights
• Fitness = F(NN output)
[Figure: particles P1–P16]
Examples - implementation
• Parallel neural network training using PSO
    Initialization
    For each iteration
        Fitness evaluation
        Synchronize
        gbest update
        Update position and velocity
[Figure: particles P1–P16, each labeled "x, v, pbest_pos, pbest_fit", alongside gbest_pos and gbest_fit]
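The parallel loop above can be sketched in Python with a thread pool standing in for the cluster nodes. This is a structural illustration only: Python threads do not give real CPU parallelism for this kind of arithmetic, the slides use MPI or MATLAB labs instead, and the `sphere` objective, the coefficients, and the function name `parallel_pso` are assumptions.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def sphere(x):
    """Illustrative objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

def parallel_pso(objective, dim, n_particles=16, iterations=50, workers=4,
                 seed=1, w=0.7, c1=1.5, c2=1.5):
    rng = random.Random(seed)
    # Initialization: each particle keeps x, v, pbest_pos, pbest_fit.
    x = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    pbest_pos = [p[:] for p in x]
    pbest_fit = [float("inf")] * n_particles
    gbest_pos, gbest_fit = x[0][:], float("inf")

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(iterations):
            # Fitness evaluation: independent per particle, so it is farmed out.
            # Synchronize: list() blocks until every result has come back.
            fits = list(pool.map(objective, x))
            for i, fit in enumerate(fits):
                if fit < pbest_fit[i]:
                    pbest_pos[i], pbest_fit[i] = x[i][:], fit
            # gbest update: the one step that needs information from all particles.
            best = min(range(n_particles), key=lambda i: pbest_fit[i])
            if pbest_fit[best] < gbest_fit:
                gbest_pos, gbest_fit = pbest_pos[best][:], pbest_fit[best]
            # Update position and velocity.
            for i in range(n_particles):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    v[i][d] = (w * v[i][d]
                               + c1 * r1 * (pbest_pos[i][d] - x[i][d])
                               + c2 * r2 * (gbest_pos[d] - x[i][d]))
                    x[i][d] += v[i][d]
    return gbest_pos, gbest_fit

gbest_pos, gbest_fit = parallel_pso(sphere, dim=3)
```

In an MPI version, the synchronization and gbest update would typically map to a reduction followed by a broadcast, which is where the communication overhead discussed earlier appears.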
Parallel Programming
• Computation vs. communication
• Data parallelization vs. task parallelization
[Figure: Processor, Shared Memory, Local Cache, Message Passing, Local Memory]
Discussions
Thank you!