Operating Systems and Architectures CS-M98: Coursework Solution

Dr. Benjamin Mora, Swansea University

Marking range
• Full understanding of problem and solution (>97)
    • Ready for employment in the HPC sector
    • None of you (some very close, though)!
• Almost there with multithreading (70 to 97)
    • Just need to see and understand the solution. Most students are in this category.
• Real issues with multithreading concepts, merging temporary results, and a few basic C errors (50 to 70)
    • Some hard work is really needed to understand the full solution
• <50: Issues with basic (C) programming and algorithmic concepts, including pointers and creating data structures
    • Catching up is crucial!

Q1
• Alignment of data.
• Similar to the lab exercise.
• See CPU part 1.
• 35 marks.

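Q1
    void AoS_to_SoA(float *image, int x, int y)
    {
        // Over-allocate so that a 32-byte aligned start fits inside each buffer.
        imageRed = new float[x*y + PADDING];
        imageGreen = new float[x*y + PADDING];
        imageBlue = new float[x*y + PADDING];

        // Offset, in floats, of each buffer start within its 32-byte block.
        // The cast applies to the pointer itself, not to the value it points to.
        unsigned long long alignR = (((unsigned long long) imageRed) & 31) / 4;
        unsigned long long alignG = (((unsigned long long) imageGreen) & 31) / 4;
        unsigned long long alignB = (((unsigned long long) imageBlue) & 31) / 4;

        // Advance to the next 32-byte boundary (a full 8 floats if already aligned).
        alignedRed = imageRed + 8 - alignR;
        alignedGreen = imageGreen + 8 - alignG;
        alignedBlue = imageBlue + 8 - alignB;

        float *R = alignedRed;
        float *G = alignedGreen;
        float *B = alignedBlue;

        // De-interleave RGBRGB... into three separate planes.
        for (int i = 0; i < x*y; i++)
        {
            R[i] = image[3*i];
            G[i] = image[3*i + 1];
            B[i] = image[3*i + 2];
        }
    }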
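The alignment step can also be written without the divide-by-4 bookkeeping. The sketch below is not the coursework solution: `alignTo32` and `splitChannels` are hypothetical helper names, and unlike the slide code the aligned pointer stays where it is when the buffer is already aligned. It assumes the caller has over-allocated by at least 7 floats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: first address at or after p that is a multiple of
// 32 bytes (the size of one 8-float register).
inline float *alignTo32(float *p)
{
    std::uintptr_t address = reinterpret_cast<std::uintptr_t>(p);
    std::uintptr_t aligned = (address + 31) & ~static_cast<std::uintptr_t>(31);
    return reinterpret_cast<float *>(aligned);
}

// Hypothetical helper: de-interleave RGBRGB... into three planes.
inline void splitChannels(const float *rgb, std::size_t pixels,
                          float *r, float *g, float *b)
{
    for (std::size_t i = 0; i < pixels; i++)
    {
        r[i] = rgb[3*i];
        g[i] = rgb[3*i + 1];
        b[i] = rgb[3*i + 2];
    }
}
```

Rounding up with `(address + 31) & ~31` moves the pointer forward by at most 28 bytes for a float pointer, so 7 floats of padding are enough in this variant.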

Q2 Loop for k iterations
    for (int k = 0; k < knnIterations; k++)
    {
        // 1. Init seed sums to 0
        for (int seed = 0; seed < N; seed++)
        {
            seedSums[0][seed] = 0;
            seedSums[1][seed] = 0;
            seedSums[2][seed] = 0;
            seedCounters[seed] = 0;
        }
        …

Q2 Then …
    // 2. Determine and compute average of closer seeds
    for (int pixel = 0; pixel < x*y*3; pixel += 3)
    {
        float maxDistance = 10;
        int found = -1;
        for (int seed = 0; seed < N; seed++) // Loop to be optimized
        {
            float dx = image[pixel + 0] - seeds[0][seed];
            float dy = image[pixel + 1] - seeds[1][seed];
            float dz = image[pixel + 2] - seeds[2][seed];
            float distanceSquare = dx*dx + dy*dy + dz*dz;
            if (distanceSquare < maxDistance)
            {
                // A closer seed has been found
                maxDistance = distanceSquare;
                found = seed;
            }
        }

Q2 Recompute new seeds
        // Last step for the iteration: compute average and update the current seed list
        for (int seed = 0; seed < N; seed++)
        {
            if (seedCounters[seed] > 0.01)
            {
                seeds[0][seed] = seedSums[0][seed] / seedCounters[seed];
                seeds[1][seed] = seedSums[1][seed] / seedCounters[seed];
                seeds[2][seed] = seedSums[2][seed] / seedCounters[seed];
            }
        }
        …
    } // End of iteration

Q2
• Optimizing the inner loop
    • Process 8 pixels at a time.
    • Compare 8 pixels against one seed!
        • Some were confused and tried 8 pixels vs 8 seeds
    • Use cmplt and blend to replace the condition.
        • 2 blend instructions needed!
        • Some replicated mask computations!
    • The part after the inner loop cannot be parallelized though.
• Still good speed-up using SIMD
    • Especially when # seeds > 32
• Many ways to do it.
    • Extra cast computations done by all of you!
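The compare-then-blend idea can be illustrated without any SIMD header. The sketch below emulates 8 lanes with plain arrays, so `Lanes8`, `Mask8`, `lessThan`, `select` and `updateNearest` are hypothetical stand-ins rather than the course's `float8`, `cmplt8` and `blend8` wrappers. It also stores the seed index as an ordinary float value, which is a simplification. The point it shows is that the mask is computed once and reused by both selections.

```cpp
#include <cassert>

// Hypothetical stand-ins for an 8-float register and its comparison mask.
struct Lanes8 { float v[8]; };
struct Mask8  { bool m[8]; };

// Lane-wise a < b, producing one mask bit per lane.
inline Mask8 lessThan(const Lanes8 &a, const Lanes8 &b)
{
    Mask8 mask;
    for (int i = 0; i < 8; i++) mask.m[i] = a.v[i] < b.v[i];
    return mask;
}

// Lane-wise select: take ifTrue where the mask is set, ifFalse elsewhere.
inline Lanes8 select(const Lanes8 &ifFalse, const Lanes8 &ifTrue, const Mask8 &mask)
{
    Lanes8 out;
    for (int i = 0; i < 8; i++) out.v[i] = mask.m[i] ? ifTrue.v[i] : ifFalse.v[i];
    return out;
}

// One seed step for 8 pixels: a single mask drives both selections.
inline void updateNearest(const Lanes8 &distance, float seedIndex,
                          Lanes8 &best, Lanes8 &found)
{
    Mask8 closer = lessThan(distance, best);   // computed once
    Lanes8 id;
    for (int i = 0; i < 8; i++) id.v[i] = seedIndex;
    best  = select(best, distance, closer);    // first selection: distance
    found = select(found, id, closer);         // second selection: seed id
}
```

Each lane keeps its own running minimum, so no lane ever needs to know what the others decided.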

Q2
• Optimization comes from:
    • Processing 8 pixels at a time.
    • Removing the branch (no if/then).
• Still tricky to get a good speed-up.
• Going further
    • Loop unrolling.
    • Minimize the number of computations inside the inner loop.
    • Put all constant operations like set1 outside the loop.
    • Avoid shared cache lines when multithreading!
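One common way to avoid shared cache lines is to give each thread its own result block, aligned and padded to a cache-line multiple. The sketch below assumes a 64-byte cache line and a seed count of 5. `kSeeds`, `kCacheLine`, `ThreadResults` and `startsOnCacheLine` are hypothetical names chosen for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const int kSeeds = 5;                // assumed seed count, for illustration only
const std::size_t kCacheLine = 64;   // assumed cache-line size in bytes

// alignas pads the struct to a multiple of the cache line, so two threads
// writing to neighbouring array elements never touch the same line.
struct alignas(kCacheLine) ThreadResults
{
    float sums[3][kSeeds];
    float counters[kSeeds];
};

// True if p sits exactly on a cache-line boundary.
inline bool startsOnCacheLine(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % kCacheLine == 0;
}
```

With these assumed sizes the payload is 80 bytes, which the compiler rounds up to 128 bytes per thread.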

Q2 Loop for k iterations
    float seedSums[3][N];
    float seedCounters[N];

    // Seed initialization
    for (int j = 0; j < 3; j++)
        for (int i = 0; i < N; i++)
            seeds[j][i] = (rand() + 0.5f) / (RAND_MAX + 1.f);

    for (int k = 0; k < knnIterations; k++)
    {
        for (int seed = 0; seed < N; seed++)
        {
            seedSums[0][seed] = 0;
            seedSums[1][seed] = 0;
            seedSums[2][seed] = 0;
            seedCounters[seed] = 0;
        }

Q2 Loop for k iterations
    float seedSums[3][N];
    float seedCounters[N];

    // Seed indices, stored bit for bit in float lanes so that blend can select them
    float8 seedId[N];
    for (int seed = 0; seed < N; seed++)
        seedId[seed] = set1((float &) seed);

    for (int j = 0; j < 3; j++)
        for (int i = 0; i < N; i++)
            seeds[j][i] = (rand() + 0.5f) / (RAND_MAX + 1.f);

    for (int k = 0; k < knnIterations; k++)
    {
        // Broadcast each seed once per iteration, outside the pixel loop
        float8 seeds8[3][N];
        for (int seed = 0; seed < N; seed++)
        {
            seedSums[0][seed] = 0;
            seedSums[1][seed] = 0;
            seedSums[2][seed] = 0;
            seedCounters[seed] = 0;
            seeds8[0][seed] = set1(seeds[0][seed]);
            seeds8[1][seed] = set1(seeds[1][seed]);
            seeds8[2][seed] = set1(seeds[2][seed]);
        }
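The `(float &) seed` cast above reinterprets the integer's bits as a float so that the seed index can travel through float lanes unchanged. A portable way to express the same reinterpretation is `std::memcpy`. The helper names `intBitsToFloat` and `floatBitsToInt` below are hypothetical, and the sketch assumes `int` and `float` are both 32 bits wide.

```cpp
#include <cassert>
#include <cstring>

static_assert(sizeof(int) == sizeof(float), "sketch assumes 32-bit int and float");

// Copy the bit pattern of an int into a float (no numeric conversion).
inline float intBitsToFloat(int value)
{
    float result;
    std::memcpy(&result, &value, sizeof result);
    return result;
}

// Copy the bit pattern of a float back into an int.
inline int floatBitsToInt(float value)
{
    int result;
    std::memcpy(&result, &value, sizeof result);
    return result;
}
```

Because only bits are copied, the round trip returns the original index exactly, which is what the later `(int &)` read relies on.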

Q2 Then …
    // 2. Determine and compute average of closer seeds
    for (int pixel = 0; pixel < x*y*3; pixel += 3)
    {
        float maxDistance = 10;
        int found = -1;
        for (int seed = 0; seed < N; seed++) // Loop to be optimized
        {
            float dx = image[pixel + 0] - seeds[0][seed];
            float dy = image[pixel + 1] - seeds[1][seed];
            float dz = image[pixel + 2] - seeds[2][seed];
            float distanceSquare = dx*dx + dy*dy + dz*dz;
            if (distanceSquare < maxDistance)
            {
                // A closer seed has been found
                maxDistance = distanceSquare;
                found = seed;
            }
        }

Q2 Then
    float8 *R = (float8 *) alignedRed;
    float8 *G = (float8 *) alignedGreen;
    float8 *B = (float8 *) alignedBlue;
    for (int pixel = 0; pixel < x*y; pixel += 8)
    {
        float8 maxDistance = set1(10);
        float8 found8 = set1(-1.f); // Just for initialization
        for (int seed = 0; seed < N; seed++) // Loop to be optimized
        {
            float8 dx = sub8(R[0], seeds8[0][seed]);
            float8 dy = sub8(G[0], seeds8[1][seed]);
            float8 dz = sub8(B[0], seeds8[2][seed]);
            float8 distanceSquare = add8(add8(mul8(dx, dx), mul8(dy, dy)), mul8(dz, dz));
            // One mask, reused by both blends; no branch
            float8 comparison = cmplt8(distanceSquare, maxDistance);
            maxDistance = blend8(maxDistance, distanceSquare, comparison);
            found8 = blend8(found8, seedId[seed], comparison);
        }

Q2 Then
        // Sum the pixel values to the appropriate seed
        for (int i = 0; i < 8; i++)
        {
            int found = (int &) found8.m256_f32[i];
            seedCounters[found] += 1.;
            seedSums[0][found] += ((float *) R)[i];
            seedSums[1][found] += ((float *) G)[i];
            seedSums[2][found] += ((float *) B)[i];
        }
        R++; G++; B++;
    }
    …

Q2 Recompute new seeds
Still the same!
        // Last step for the iteration: compute average and update the current seed list
        for (int seed = 0; seed < N; seed++)
        {
            if (seedCounters[seed] > 0.01)
            {
                seeds[0][seed] = seedSums[0][seed] / seedCounters[seed];
                seeds[1][seed] = seedSums[1][seed] / seedCounters[seed];
                seeds[2][seed] = seedSums[2][seed] / seedCounters[seed];
            }
        }
        …
    } // End of iteration

Q3
• Most of you got the principles more or less right
• Practical implementation was wrong!
    • Barriers were sometimes at the wrong location.
    • Most of you added extra, unneeded barriers.
• Mutexes were accepted.
    • Putting a lock on every seed change is too much/not good!
• Errors:
    • Only using results from one thread at each iteration.

Q3 Idea
• Break down the image into 4 pieces
• For each thread iteration:
    • Copy seeds into local variables (performance)
    • Loop over the current chunk of pixels.
        • Compute seedSums and seedCounters the same way.
    • Copy results into globally visible but separate variables.
    • Barrier
    • One thread
        • Adds results from the other threads to its own results
        • Then computes the RGB average and updates the seeds.
    • Barrier
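The publish, barrier, merge, barrier pattern can be tried on a much smaller problem. The sketch below sums an array instead of clustering pixels and swaps POSIX threads for C++11 `std::thread`, with a small hand-written barrier in place of `pthread_barrier_t`. `SimpleBarrier` and `parallelSum` are hypothetical names, and the program needs to be built with thread support enabled.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Minimal reusable barrier (stand-in for pthread_barrier_t).
class SimpleBarrier
{
public:
    explicit SimpleBarrier(int count) : threshold_(count), waiting_(0), generation_(0) {}
    void wait()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        const int generation = generation_;
        if (++waiting_ == threshold_)
        {
            waiting_ = 0;          // last arrival resets and releases everyone
            ++generation_;
            released_.notify_all();
        }
        else
        {
            released_.wait(lock, [this, generation] { return generation != generation_; });
        }
    }
private:
    std::mutex mutex_;
    std::condition_variable released_;
    int threshold_, waiting_, generation_;
};

// Each thread sums its own slice, publishes it in its own slot, then
// thread 0 merges all slots between two barriers.
inline float parallelSum(const std::vector<float> &data, int nbThreads)
{
    std::vector<float> partial(nbThreads, 0.f);
    float total = 0.f;
    SimpleBarrier barrier(nbThreads);
    std::vector<std::thread> threads;
    const std::size_t chunk = data.size() / nbThreads;
    for (int t = 0; t < nbThreads; t++)
    {
        threads.emplace_back([&, t] {
            const std::size_t first = t * chunk;
            const std::size_t last = (t == nbThreads - 1) ? data.size() : first + chunk;
            float local = 0.f;                     // thread-local accumulation
            for (std::size_t i = first; i < last; i++) local += data[i];
            partial[t] = local;                    // separate slot per thread
            barrier.wait();                        // all slots are now written
            if (t == 0)
                for (int s = 0; s < nbThreads; s++) total += partial[s];
            barrier.wait();                        // merged result is final
        });
    }
    for (std::thread &th : threads) th.join();
    return total;
}
```

Only two barriers are needed per round: one before the merge so that every slot is complete, and one after it so that no thread starts the next round with stale data.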

Q3 Creating Threads
    void knnCompressionSIMDPosix(float *image, int x, int y)
    {
        AoS_to_SoA(image, x, y);
        threadJobSize = x*y / nbThreads;
        pthread_t threads[nbThreads];
        pthread_barrier_init(&barrier, NULL, nbThreads);
        // The thread index travels in the pointer argument; widen it first so the
        // cast matches the (long long) read in posixThread.
        for (int i = 0; i < nbThreads; i++)
            pthread_create(&threads[i], NULL, posixThread, (void *) (long long) i);
        for (int i = 0; i < nbThreads; i++) // separate loop
            pthread_join(threads[i], NULL);
    }
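The division `x*y / nbThreads` is exact only when the pixel count divides evenly, and the SIMD loop additionally wants each chunk to start on a multiple of 8 pixels. One possible way to compute such ranges is sketched below. `PixelRange` and `chunkFor` are hypothetical names, not part of the coursework code.

```cpp
#include <algorithm>
#include <cassert>

// Half-open pixel range [first, last) handled by one thread.
struct PixelRange { int first; int last; };

// Split pixelCount pixels into nbThreads chunks whose starts are multiples
// of 8; the final chunk absorbs whatever is left and may be shorter.
inline PixelRange chunkFor(int thread, int nbThreads, int pixelCount)
{
    const int blocks = (pixelCount + 7) / 8;                         // 8-pixel blocks, rounded up
    const int blocksPerThread = (blocks + nbThreads - 1) / nbThreads;
    PixelRange range;
    range.first = std::min(thread * blocksPerThread * 8, pixelCount);
    range.last  = std::min(range.first + blocksPerThread * 8, pixelCount);
    return range;
}
```

When the last range is not a multiple of 8 pixels long, the caller still needs a scalar tail loop or a masked final step, otherwise padding values would be counted as pixels.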

Q3 Thread’s Job
    void *posixThread(void *arg)
    {
        long long threadNumber = (long long) arg;
        int firstPixel = threadNumber * threadJobSize;
        int lastPixel = firstPixel + threadJobSize;
        float seedSums[3][N];
        float seedCounters[N];

        float8 seedId[N];
        for (int seed = 0; seed < N; seed++)
            seedId[seed] = set1((float &) seed);

        // Seed initialization, done by one thread only
        if (threadNumber == 0)
            for (int j = 0; j < 3; j++)
                for (int i = 0; i < N; i++)
                    seeds[j][i] = (rand() + 0.5f) / (RAND_MAX + 1.f);
        pthread_barrier_wait(&barrier);

Q3 Thread’s Job
        for (int k = 0; k < knnIterations; k++)
        {
            … Seed initialization is the same
            float8 *R = (float8 *) (alignedRed + firstPixel);
            float8 *G = (float8 *) (alignedGreen + firstPixel);
            float8 *B = (float8 *) (alignedBlue + firstPixel);
            for (int pixel = firstPixel; pixel < lastPixel; pixel += 8)
            {
                … loop code does not change …
                R++; G++; B++;
            }

Q3 Merging Results
            // Publish this thread's results in its own slot
            for (int seed = 0; seed < N; seed++)
            {
                temporaryResults[threadNumber][0][seed] = seedSums[0][seed];
                temporaryResults[threadNumber][1][seed] = seedSums[1][seed];
                temporaryResults[threadNumber][2][seed] = seedSums[2][seed];
                temporaryCounters[threadNumber][seed] = seedCounters[seed];
            }
            pthread_barrier_wait(&barrier);

Q3 Merging Results
            if (threadNumber == 0)
            {
                // Add the results of every other thread to thread 0's slot
                for (int thread = 1; thread < nbThreads; thread++)
                    for (int seed = 0; seed < N; seed++)
                    {
                        temporaryResults[0][0][seed] += temporaryResults[thread][0][seed];
                        temporaryResults[0][1][seed] += temporaryResults[thread][1][seed];
                        temporaryResults[0][2][seed] += temporaryResults[thread][2][seed];
                        temporaryCounters[0][seed] += temporaryCounters[thread][seed];
                    }
                …

Q3 Merging Results
                for (int seed = 0; seed < N; seed++)
                {
                    if (temporaryCounters[0][seed] > 0.01)
                    {
                        seeds[0][seed] = temporaryResults[0][0][seed] / temporaryCounters[0][seed];
                        seeds[1][seed] = temporaryResults[0][1][seed] / temporaryCounters[0][seed];
                        seeds[2][seed] = temporaryResults[0][2][seed] / temporaryCounters[0][seed];
                    }
                }
            } // end condition threadNumber==0
            pthread_barrier_wait(&barrier);
            // end of iteration, seeds have been updated!
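The merge step is easy to test on its own when it is pulled out of the thread function. The sketch below is one possible shape for that, using hypothetical types `SeedAccumulator` and `Seed` and a hypothetical function `mergeAndUpdate`. It keeps the rule from the slides that a seed with no assigned pixels is left unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-seed partial result from one thread: channel sums and pixel count.
struct SeedAccumulator { float sum[3]; float count; };

// One seed colour (R, G, B).
struct Seed { float rgb[3]; };

// Add up every thread's partial results and replace each seed by the
// average colour of its pixels; seeds with no pixels keep their old value.
inline void mergeAndUpdate(const std::vector<std::vector<SeedAccumulator> > &perThread,
                           std::vector<Seed> &seeds)
{
    for (std::size_t s = 0; s < seeds.size(); s++)
    {
        SeedAccumulator total = {{0.f, 0.f, 0.f}, 0.f};
        for (std::size_t t = 0; t < perThread.size(); t++)
        {
            for (int c = 0; c < 3; c++) total.sum[c] += perThread[t][s].sum[c];
            total.count += perThread[t][s].count;
        }
        if (total.count > 0.01f)
            for (int c = 0; c < 3; c++) seeds[s].rgb[c] = total.sum[c] / total.count;
    }
}
```

Summing into a fresh local total, rather than into thread 0's slot, leaves every thread's published data untouched, which can make debugging a wrong merge easier.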
