
Cellular Neural Networks Training and Implementation on a GPU Cluster



  1. Cellular Neural Networks Training and Implementation on a GPU Cluster Bipul Luitel, Cameron Johnson & Sinchan Roychowdhury Real-Time Power and Intelligent Systems Lab, Missouri S&T, Rolla, MO CS387 Spring 2011, April 26, 2011

  2. Outline
  • Cellular Neural Networks
  • Architecture
  • Application – Wide Area Monitoring
  • Training a neural network using particle swarm optimization
  • PSO – Introduction
  • Implementation
  • Parallelization – Concurrencies in CNN and PSO
  • GPU Computing in RTPIS Lab
  • Architecture of the RTPIS GPU cluster
  • Some timing comparisons
  • Discussion

  3. Cellular Neural Networks
  • Cellular Neural Networks (CNN): a class of neural network architectures distinct from the feedforward and feedback architectures.
  • Traditionally, each element of the network (a cell) is a single computational unit.
  • A different variation places a full neural network (NN) on each cell, with the cells connected to each other in a fashion that depends on the problem.
  • Each cell is a feedforward or feedback neural network.
  • The output of one NN (cell) may be connected to the input(s) of one or more neighboring cells, forming an iterative feedback loop during training (see the sketch below).
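A minimal MATLAB sketch of this cell-grid structure (the 4x4 grid, the von Neumann neighborhood, and the per-cell layer sizes are illustrative assumptions, not the architecture used in the lab's experiments):

    rows = 4; cols = 4;                     % 4x4 grid of cells (assumed size)
    localInput = randn(rows, cols);         % one external input per cell (placeholder data)
    cells = cell(rows, cols);
    for r = 1:rows
        for c = 1:cols
            cells{r,c}.Wih = randn(6, 5);   % input-to-hidden weights: own input + 4 neighbor outputs
            cells{r,c}.Who = randn(1, 6);   % hidden-to-output weights
            cells{r,c}.out = 0;             % last output, fed to neighbors at the next step
        end
    end
    % One synchronous update: each cell reads its own input plus the
    % previous outputs of its four neighbors (edges clamped).
    prevOut = cellfun(@(s) s.out, cells);
    for r = 1:rows
        for c = 1:cols
            nb = [prevOut(max(r-1,1),c); prevOut(min(r+1,rows),c); ...
                  prevOut(r,max(c-1,1)); prevOut(r,min(c+1,cols))];
            x  = [localInput(r,c); nb];                % 5x1 input vector
            h  = tanh(cells{r,c}.Wih * x);
            cells{r,c}.out = tanh(cells{r,c}.Who * h);
        end
    end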

  4. Application – Wide Area Monitoring in Power Systems
  • The smart grid is a complex, distributed cyber-physical power and energy system: communication, computation and control.
  • Remote monitoring of distributed systems is necessary to assess the status of the system.
  • Wide area monitoring:
  • monitor the status of generators – speed deviation
  • monitor the status of buses – bus voltage
  • assist in predictive state estimation in order to provide real-time control
  • Becomes a challenge when the network is large, with many parameters to be monitored.
  • Here: generator speed deviation prediction using cellular neural networks.

  5. Implementation of CNN-based WAM
  [Figures: Wide Area Monitoring System for the twelve-bus system; Wide Area Monitoring System for the two-area four-machine system]

  6. Implementation
  • The architecture of the CNN can be problem-dependent.

  7. Training
  • Training of the CMLP (cellular multilayer perceptron) is a challenge due to the iterative feedback:
  Initialize weights
  For each training sample
      For each cell
          Train the neural network
  (a training-loop skeleton follows)
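A minimal MATLAB skeleton of that loop (maxEpochs, numSamples, samples, targets, and the evalCell helper are hypothetical placeholders; it reuses the cell-grid structure sketched under slide 3):

    for epoch = 1:maxEpochs                        % assumed outer loop over the data
        totalErr = 0;
        for s = 1:numSamples
            prevOut = cellfun(@(q) q.out, cells);  % last step's outputs feed back in
            for r = 1:rows
                for c = 1:cols
                    y = evalCell(cells{r,c}, samples(s), prevOut, r, c);  % hypothetical helper
                    cells{r,c}.out = y;
                    totalErr = totalErr + (y - targets(s, r, c))^2;
                end
            end
        end
        % The feedback loops make gradients awkward to compute, so the
        % weights are updated by PSO (next slides) using totalErr as fitness.
    end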

  8. Training
  • Training in parallel using particle swarm optimization.

  9. Outline
  • Cellular Neural Networks
  • Architecture
  • Application – Wide Area Monitoring
  • Training a neural network using particle swarm optimization
  • PSO – Introduction
  • Implementation
  • Parallelization – Concurrencies in CNN and PSO
  • GPU Computing in RTPIS Lab
  • Architecture of the RTPIS GPU cluster
  • Some timing comparisons
  • Discussion

  10. Particle Swarm Optimization
  • Population-based search algorithm: many points tested; a form of guided search.
  • Sound familiar? GAs are also population-based, modeled on Darwinian evolution.
  • PSO assumes a flock or swarm instead.
  • Goals are scattered about a search space; more searchers mean they are found faster.

  11. PSO Algorithm
  • Define the objective function.
  • Identify the independent variables; these make up the coordinate system of the search hyperspace.
  • Initialize a population of n particles with random coordinates and velocities.
  • Find their fitnesses; record each personal best and the global best.
  • Repeat: update velocities, update locations, find new fitnesses, update personal bests, update the global best.
  • Run until a termination condition is reached (a minimal sketch follows):
  • minimum acceptable fitness reached
  • maximum number of fitness tests reached
  • maximum real-world time allotted passed
  [Figure: particles scattered over an x-y search space, each labeled with its fitness f]
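A minimal gbest-PSO sketch of these steps in MATLAB (the sphere objective, the swarm size, and the inertia/acceleration constants w, c1, c2 are assumed stand-ins, not values from the presentation):

    f  = @(x) sum(x.^2, 2);           % stand-in objective: minimize the sphere function
    n  = 20;  d = 10;                 % particles, dimensions (assumed)
    w  = 0.72;  c1 = 1.49;  c2 = 1.49;
    x  = 2*rand(n, d) - 1;            % random coordinates
    v  = zeros(n, d);                 % velocities
    pbest = x;  pfit = f(x);          % personal bests
    [gfit, gi] = min(pfit);  gbest = x(gi, :);
    for it = 1:1000                   % or stop on a fitness/time budget
        v = w*v + c1*rand(n,d).*(pbest - x) ...
                + c2*rand(n,d).*(repmat(gbest, n, 1) - x);
        x = x + v;                    % update locations
        fit = f(x);                   % new fitnesses
        better = fit < pfit;          % update personal bests
        pbest(better, :) = x(better, :);
        pfit(better) = fit(better);
        [m, i] = min(pfit);           % update the global best
        if m < gfit, gfit = m; gbest = pbest(i, :); end
    end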

  12. Parallel PSO Implementation
  • Obvious concurrency: each particle on a node.
  • Each particle's fitness is independent; particles can update velocities and positions concurrently.
  • What's the catch? Communication overhead: the particles have to share fitness information to determine the new gbest.

  13. PSO Topology and Cluster Topology
  • How do the particles communicate? Consider your hardware.
  • Particles on the same node (if #particles > #nodes) can have a full connection.
  • Particles on adjacent nodes can communicate in O(1); particles on farther nodes require more hops.
  • Local best: the best amongst all neighbors.
  [Figure: a ring topology finds local bests and needs multiple steps to find the global best; wheel and star topologies find global bests directly]

  14. Training a CNN with a PSO Using Parallel Processing
  • Concurrency: each particle of the PSO. Implement the entire CNN on each node, and treat each node as a particle.
  • Concurrency: the NNs that make up the CNN. Implement the CNN in parallel, one NN per node; the NNs operate independently during a single time step, and communication between them is sparse. Each candidate weight set is tested sequentially on the node holding a given NN.
  • With arbitrary nodes? Take advantage of both concurrencies: each node receives one cell, so a CNN takes c nodes, where c is the number of cells, and for n particles the PSO then uses n x c nodes (a mapping sketch follows).
  [Figure: a 4x4 grid of nodes, each holding one particle Pi paired with one cell NNi]
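A small sketch of the n x c node mapping just described (the linear numbering convention is an assumption for illustration):

    n = 4;  c = 4;                        % particles, cells per CNN (assumed)
    nodeOf = @(p, k) (p-1)*c + k;         % node hosting cell k of particle p's CNN
    fprintf('particle 3, cell 2 -> node %d\n', nodeOf(3, 2));
    % Inverse: which particle/cell pair lives on a given node
    node = 10;
    p = floor((node-1)/c) + 1;            % particle index
    k = mod(node-1, c) + 1;               % cell index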

  15. Outline
  • Cellular Neural Networks
  • Architecture
  • Application – Wide Area Monitoring
  • Training a neural network using particle swarm optimization
  • PSO – Introduction
  • Implementation
  • Parallelization – Concurrencies in CNN and PSO
  • GPU Computing in RTPIS Lab
  • Architecture of the RTPIS GPU cluster
  • Some timing comparisons
  • Discussion

  16. Architecture of the RTPIS Cluster
  • Hardware configuration:
  • Nodes: 1 (master) + 16 (compute nodes) = 17
  • CPU: 17 (nodes) x 2 (CPUs/node) x 4 (cores/CPU) x 2 (threads/core) = 272 hardware threads
  • GPU: 16 (nodes) x 2 (NVIDIA Tesla C2050 GPUs/node) x 448 (CUDA cores/GPU) = 14336 CUDA cores
  • Memory: 17 (nodes) x 12 GB = 204 GB
  • Storage: 2 x 500 GB (OS) + 10 x 2 TB (master) + 16 x 500 GB (nodes) = 29 TB
  • Software:
  • Operating system: OpenSUSE 11 Linux
  • Others: Torque, the Maui scheduler, the CUDA toolkit/libraries and GNU compilers, C/C++ with MPI libraries, MATLAB Distributed Computing Server

  17. GPU Computing – MATLAB
  • GPU computing is built into MATLAB as of R2010b.
  • Run part of the code on the GPU from MATLAB:
  • use the gpuArray or arrayfun commands in MATLAB, or
  • compile CUDA code to a PTX file and call it from MATLAB (see the sketch below).
  • Only useful when the CPU computing time saved exceeds the communication time for transferring variables between CPU and GPU.
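A minimal sketch of the PTX route (the kernel files addVec.ptx/addVec.cu and the kernel's signature are hypothetical; parallel.gpu.CUDAKernel and feval are the Parallel Computing Toolbox calls):

    k = parallel.gpu.CUDAKernel('addVec.ptx', 'addVec.cu');   % load a compiled kernel (hypothetical files)
    n = 4096;
    k.ThreadBlockSize = [256 1 1];
    k.GridSize        = [ceil(n/256) 1 1];
    a = gpuArray(rand(n, 1, 'single'));
    b = gpuArray(rand(n, 1, 'single'));
    out = parallel.gpu.GPUArray.zeros(n, 1, 'single');         % preallocate the result on the GPU
    out = feval(k, out, a, b, n);                              % launch: out = a + b elementwise (assumed kernel)
    result = gather(out);                                      % move the result back to the CPU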

  18. MATLAB Commands for Working on a GPU
  • Create a variable on the CPU and move it to the GPU:
  gpuVar = gpuArray(cpuVar);
  • Create a variable directly on the GPU:
  gpuVar = parallel.gpu.GPUArray.zeros(5,10);
  • Any function that uses GPU variables is performed on the GPU:
  gpuVar = fft(gpuVar);
  gpuVar = abs(gpuVar1*gpuVar2);
  • Use arrayfun to perform an operation on the GPU:
  gpuVar = arrayfun(@min, gpuVar1);
  gpuVar = arrayfun(@customFn, gpuVar1, cpuVar);
  cpuVar = arrayfun(@customFn, cpuVar1, cpuVar2, cpuVar3);   (all inputs are CPU variables, so this one runs on the CPU)
  • Move a variable from the GPU to the CPU:
  cpuVar = gather(gpuVar);
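A quick timing sketch of the transfer-overhead caveat from the previous slide (the 4096x4096 fft workload is an arbitrary choice):

    A = rand(4096);                 % data starts on the CPU
    tic; B = fft(A); tCpu = toc;    % CPU-only compute
    tic;
    G  = gpuArray(A);               % host-to-device transfer is part of the cost
    Bg = fft(G);                    % compute on the GPU
    B2 = gather(Bg);                % device-to-host transfer (also forces completion)
    tGpu = toc;
    fprintf('CPU: %.3f s   GPU incl. transfers: %.3f s\n', tCpu, tGpu);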

  19. Parallel Programming Introduction – MATLAB
  • MPI C vs. MATLAB parallel computing (PCT) equivalents:
  MPI_Comm_rank(MPI_COMM_WORLD,&myid);                           ->  labindex
  MPI_Bcast(buffer,count,datatype,root,comm);                    ->  labBroadcast(source,value) / labBroadcast(source)
  MPI_Reduce(sendbuf,recvbuf,count,datatype,op,root,comm);       ->  gop(@function,value)
  MPI_Barrier(comm);                                             ->  labBarrier
  [Figure: a MATLAB job manager dispatching work to PCT labs]
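A minimal spmd sketch exercising these equivalents (assumes a worker pool, e.g. via matlabpool, is already open; the broadcast value 42 is arbitrary):

    spmd
        myid = labindex;                 % ~ MPI_Comm_rank
        if myid == 1
            seed = labBroadcast(1, 42);  % sender form, ~ MPI_Bcast from lab 1
        else
            seed = labBroadcast(1);      % receiver form
        end
        localVal = seed + myid;          % some per-lab work
        total = gop(@plus, localVal);    % ~ a reduction (sum over all labs)
        labBarrier;                      % ~ MPI_Barrier
    end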

  20. Examples – Implementation
  • Sequential neural network training using PSO:
  Initialization
  For each iteration
      For each particle
          Fitness evaluation
      gbest update
      Update position and velocity
  • Dimensions = number of weights; Fitness = F(NN output)
  [Figure: all sixteen particles P1-P16 evaluated on a single machine]

  21. Examples – Implementation
  • Parallel neural network training using PSO (sketch below):
  Initialization
  For each iteration
      Fitness evaluation (concurrent, one particle per lab)
      Synchronize – gbest update
      Update position and velocity
  • Each lab keeps its particle's state locally (x, v, pbest_pos, pbest_fit); only gbest_pos and gbest_fit are shared across labs.
  [Figure: particles P1-P16 distributed across labs, each with local x, v, pbest_pos, pbest_fit; gbest_pos/gbest_fit synchronized]
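A hedged sketch of this loop, one particle per lab, with gbest found via gop and shared via labBroadcast (fitnessFn is a hypothetical fitness helper; the dimension and the PSO constants are assumptions):

    spmd
        d = 30;                                   % dimensions = number of NN weights (assumed)
        x = 2*rand(1, d) - 1;  v = zeros(1, d);   % this lab's particle
        pbest = x;  pfit = Inf;
        for it = 1:200
            fit = fitnessFn(x);                   % hypothetical fitness helper
            if fit < pfit, pfit = fit; pbest = x; end
            gfit = gop(@min, pfit);               % best fitness over all labs
            cand = labindex;                      % find the (lowest) lab holding it
            if pfit > gfit, cand = Inf; end
            src = gop(@min, cand);
            if labindex == src
                gbest = labBroadcast(src, pbest); % share the best position
            else
                gbest = labBroadcast(src);
            end
            v = 0.72*v + 1.49*rand(1,d).*(pbest - x) ...
                       + 1.49*rand(1,d).*(gbest - x);
            x = x + v;
        end
    end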

  22. Examples – Comparison of Platforms
  [Figure: timing comparison across platforms]

  23. Examples – Comparison of Platforms
  [Figure: timing comparison across platforms]

  24. Parallel Programming
  • Computation vs. communication
  • Data parallelization vs. task parallelization
  [Figure: shared-memory model (processors with local caches sharing memory) vs. message-passing model (processors with local memory exchanging messages)]

  25. Discussion
  Thank you!
