“Matrix Multiply ― in parallel” Joe Hummel, PhD U. of Illinois, Chicago Loyola University Chicago joe@joehummel.net
Background… • Class: “Introduction to CS for Engineers” • Lang: C/C++ • Focus: programming basics, vectors, matrices • Timing: present this after introducing 2D arrays…
Matrix multiply • Yes, it’s boring, but… • everyone understands the problem • good example of triply-nested loops • non-trivial computation

for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            C[i][j] += (A[i][k] * B[k][j]);

1500x1500 matrix: 2.25M elements » 32 seconds…
Multicore • Matrix multiply is a great candidate for multicore • embarrassingly parallel • easy to parallelize via the outermost loop

#pragma omp parallel for
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            C[i][j] += (A[i][k] * B[k][j]);

1500x1500 matrix: quad-core CPU » 8 seconds…
Designing for HPC • Parallelism alone is not enough…

HPC == Parallelism + Memory Hierarchy − Contention

• Expose parallelism • Minimize interaction (contention): false sharing, locking, synchronization • Maximize data locality: network » disk » RAM » cache » core
Data locality • What’s the other half of the chip? Cache! • Implications? No one implements MM this way • Rewrite to use loop interchange, and access B row-wise…

#pragma omp parallel for
for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            C[i][j] += (A[i][k] * B[k][j]);

1500x1500 matrix: quad-core + cache » 2 seconds…