Operating Systems and Architectures CS-M98: Coursework Solution


  1. Operating Systems and Architectures CS-M98: Coursework Solution. Dr. Benjamin Mora, Swansea University

  2. Marking range
     • Full understanding of problem and solution (>97)
       • Ready for employment in the HPC sector.
       • None of you (some very close though)!
     • Almost there with multithreading (70 to 97)
       • Just need to see and understand the solution. Most students are in this category.
     • Real issues with multithreading concepts, merging temporary results, and a few basic C errors (50 to 70)
       • Some hard work is really needed to understand the full solution.
     • <50: Issues with basic (C) programming and algorithmic concepts, including pointers and creating data structures
       • Catching up is crucial!

  3. Q1
     • Alignment of data.
     • Similar to the lab exercise.
     • See CPU part 1.
     • 35 marks.

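  4. Q1
     void AoS_to_SoA(float *image, int x, int y)
     {
         // One padded array per channel, so that each can be shifted to a 32-byte boundary.
         imageRed = new float[x*y + PADDING];
         imageGreen = new float[x*y + PADDING];
         imageBlue = new float[x*y + PADDING];
         // Offset of each array from the previous 32-byte boundary, in floats.
         // The pointer itself is cast, not the value it points to.
         unsigned long long alignR = (((unsigned long long) imageRed) & 31) / 4;
         unsigned long long alignG = (((unsigned long long) imageGreen) & 31) / 4;
         unsigned long long alignB = (((unsigned long long) imageBlue) & 31) / 4;
         // Move forward to the next 32-byte boundary (a shift of 1 to 8 floats).
         alignedRed = imageRed + 8 - alignR;
         alignedGreen = imageGreen + 8 - alignG;
         alignedBlue = imageBlue + 8 - alignB;
         float *R = alignedRed;
         float *G = alignedGreen;
         float *B = alignedBlue;
         // Split the interleaved RGB triplets into three separate channels.
         for (int i = 0; i < x*y; i++)
         {
             R[i] = image[3*i];
             G[i] = image[3*i + 1];
             B[i] = image[3*i + 2];
         }
     }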
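
The alignment arithmetic on this slide can also be written with a mask. The block below is a minimal sketch under stated assumptions, not the coursework solution: `alignUp` and `isAligned32` are hypothetical helper names, and the rounding differs slightly from the slide, because an already aligned pointer is returned unchanged instead of being moved forward by 8 floats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the slides): returns the first address at or
// after p that is a multiple of `alignment` bytes.
inline float *alignUp(float *p, std::size_t alignment = 32)
{
    // The mask trick below only works for powers of two.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    const std::uintptr_t address = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<float *>((address + mask) & ~mask);
}

// Same check the slide performs with "& 31": true when p sits on a
// 32-byte boundary, which is what aligned 8-float loads expect.
inline bool isAligned32(const float *p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 31) == 0;
}
```

With this variant the raw allocation needs at least 7 spare floats, and the original pointer must still be kept so that it can be passed to `delete[]` later.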

  5. Q2 Loop for k iterations
     for (int k = 0; k < knnIterations; k++)
     {
         // 1. Init seed sums to 0
         for (int seed = 0; seed < N; seed++)
         {
             seedSums[0][seed] = 0;
             seedSums[1][seed] = 0;
             seedSums[2][seed] = 0;
             seedCounters[seed] = 0;
         }
         …

  6. Q2 Then …
     // 2. Determine and compute average of closer seeds
     for (int pixel = 0; pixel < x*y*3; pixel += 3)
     {
         float maxDistance = 10;
         int found = -1;
         for (int seed = 0; seed < N; seed++) // Loop to be optimized
         {
             float dx = image[pixel + 0] - seeds[0][seed];
             float dy = image[pixel + 1] - seeds[1][seed];
             float dz = image[pixel + 2] - seeds[2][seed];
             float distanceSquare = dx*dx + dy*dy + dz*dz;
             if (distanceSquare < maxDistance)
             {
                 // A closer seed has been found
                 maxDistance = distanceSquare;
                 found = seed;
             }
         }

  7. Q2 Recompute new seeds
     // Last step for the iteration: compute average and update the current seed list
     for (int seed = 0; seed < N; seed++)
     {
         if (seedCounters[seed] > 0.01)
         {
             seeds[0][seed] = seedSums[0][seed] / seedCounters[seed];
             seeds[1][seed] = seedSums[1][seed] / seedCounters[seed];
             seeds[2][seed] = seedSums[2][seed] / seedCounters[seed];
         }
     }
     … // End of iteration

  8. Q2
     • Optimizing the inner loop
       • Process 8 pixels at a time.
         • Compare 8 pixels against one seed!
         • Some were confused and tried 8 pixels vs 8 seeds.
       • Use cmplt and blend to replace the condition.
         • 2 blend instructions needed!
         • Some replicated mask computations!
     • The part after the inner loop cannot be parallelized though.
       • Still a good speed-up using SIMD,
       • especially when # seeds > 32.
     • Many ways to do it.
       • Extra cast computations done by all of you!
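
The compare-then-blend idea can be tried without any SIMD hardware. The sketch below is a portable, scalar emulation written for illustration only: `Lanes8`, `Mask8`, `cmplt`, `blend` and `updateClosest` are hypothetical stand-ins, not the course's wrappers. On AVX, the corresponding intrinsics would be `_mm256_cmp_ps` with `_CMP_LT_OQ` and `_mm256_blendv_ps`. For simplicity the seed index is stored here as an ordinary float value.

```cpp
// Portable stand-ins for an 8-lane register and an 8-lane mask (hypothetical).
struct Lanes8 { float v[8]; };
struct Mask8  { bool  m[8]; };

// Lane-wise a < b, in the spirit of cmplt.
inline Mask8 cmplt(const Lanes8 &a, const Lanes8 &b)
{
    Mask8 r;
    for (int i = 0; i < 8; i++) r.m[i] = a.v[i] < b.v[i];
    return r;
}

// Lane-wise select: takes b where the mask is set and a elsewhere.
inline Lanes8 blend(const Lanes8 &a, const Lanes8 &b, const Mask8 &mask)
{
    Lanes8 r;
    for (int i = 0; i < 8; i++) r.v[i] = mask.m[i] ? b.v[i] : a.v[i];
    return r;
}

// One step of the inner loop: 8 pixels against ONE seed, with no branch on
// the distance test. The mask is computed once and reused by both blends.
inline void updateClosest(const Lanes8 &distanceSquare, float seedIndex,
                          Lanes8 &maxDistance, Lanes8 &found)
{
    Lanes8 seedId;
    for (int i = 0; i < 8; i++) seedId.v[i] = seedIndex;   // broadcast, like set1
    const Mask8 closer = cmplt(distanceSquare, maxDistance);
    maxDistance = blend(maxDistance, distanceSquare, closer);
    found = blend(found, seedId, closer);
}
```

Both blends use the same mask, which is why recomputing the comparison for the second blend is wasted work.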

  9. Q2
     • Optimization comes from:
       • Processing 8 pixels at a time.
       • Removing the branch (no if/then).
       • Still tricky to get a good speed-up.
     • Going further
       • Loop unrolling.
       • Minimize the number of computations inside the inner loop.
         • Put all constant operations like set1 outside the loop.
       • Avoid shared cache lines when multithreading!
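
One common way to avoid shared cache lines is to give each thread its own accumulator block, aligned and padded to the cache-line size. The sketch below assumes a 64-byte cache line and a seed count of 16; `kCacheLine`, `kSeeds`, `ThreadAccumulators` and `sharesCacheLine` are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;   // assumed cache-line size in bytes
constexpr int kSeeds = 16;               // hypothetical seed count (N in the slides)

// Per-thread accumulators. alignas makes every instance start on its own
// cache line and rounds sizeof up to a multiple of the line size, so two
// threads never write to the same line.
struct alignas(kCacheLine) ThreadAccumulators
{
    float seedSums[3][kSeeds];
    float seedCounters[kSeeds];
};

static_assert(sizeof(ThreadAccumulators) % kCacheLine == 0,
              "size must be a whole number of cache lines");
static_assert(alignof(ThreadAccumulators) == kCacheLine,
              "instances must start on a cache-line boundary");

// Diagnostic helper: true when two addresses fall in the same cache line.
inline bool sharesCacheLine(const void *a, const void *b)
{
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}
```

An array such as `ThreadAccumulators perThread[4]` then keeps the data of neighbouring threads on separate lines.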

  10. Q2 Loop for k iterations
      float seedSums[3][N];
      float seedCounters[N];
      // Seed initialization
      for (int j = 0; j < 3; j++)
          for (int i = 0; i < N; i++)
              seeds[j][i] = (rand() + 0.5f) / (RAND_MAX + 1.f);
      for (int k = 0; k < knnIterations; k++)
      {
          for (int seed = 0; seed < N; seed++)
          {
              seedSums[0][seed] = 0;
              seedSums[1][seed] = 0;
              seedSums[2][seed] = 0;
              seedCounters[seed] = 0;
          }

  11. Q2 Loop for k iterations
      float seedSums[3][N];
      float seedCounters[N];
      // Each seed index, stored bit for bit in all 8 lanes
      float8 seedId[N];
      for (int seed = 0; seed < N; seed++)
          seedId[seed] = set1((float &) seed);
      for (int j = 0; j < 3; j++)
          for (int i = 0; i < N; i++)
              seeds[j][i] = (rand() + 0.5f) / (RAND_MAX + 1.f);
      for (int k = 0; k < knnIterations; k++)
      {
          float8 seeds8[3][N];
          for (int seed = 0; seed < N; seed++)
          {
              seedSums[0][seed] = 0;
              seedSums[1][seed] = 0;
              seedSums[2][seed] = 0;
              seedCounters[seed] = 0;
              // Broadcast each seed component once, outside the pixel loop
              seeds8[0][seed] = set1(seeds[0][seed]);
              seeds8[1][seed] = set1(seeds[1][seed]);
              seeds8[2][seed] = set1(seeds[2][seed]);
          }

  12. Q2 Then …
      // 2. Determine and compute average of closer seeds
      for (int pixel = 0; pixel < x*y*3; pixel += 3)
      {
          float maxDistance = 10;
          int found = -1;
          for (int seed = 0; seed < N; seed++) // Loop to be optimized
          {
              float dx = image[pixel + 0] - seeds[0][seed];
              float dy = image[pixel + 1] - seeds[1][seed];
              float dz = image[pixel + 2] - seeds[2][seed];
              float distanceSquare = dx*dx + dy*dy + dz*dz;
              if (distanceSquare < maxDistance)
              {
                  // A closer seed has been found
                  maxDistance = distanceSquare;
                  found = seed;
              }
          }

  13. Q2 Then
      float8 *R = (float8 *) alignedRed;
      float8 *G = (float8 *) alignedGreen;
      float8 *B = (float8 *) alignedBlue;
      for (int pixel = 0; pixel < x*y; pixel += 8)
      {
          float8 maxDistance = set1(10);
          float8 found8 = set1(-1.f); // Just for initialization
          for (int seed = 0; seed < N; seed++) // Loop to be optimized
          {
              float8 dx = sub8(R[0], seeds8[0][seed]);
              float8 dy = sub8(G[0], seeds8[1][seed]);
              float8 dz = sub8(B[0], seeds8[2][seed]);
              float8 distanceSquare = add8(add8(mul8(dx, dx), mul8(dy, dy)), mul8(dz, dz));
              // One mask, reused by both blends
              float8 comparison = cmplt8(distanceSquare, maxDistance);
              maxDistance = blend8(maxDistance, distanceSquare, comparison);
              found8 = blend8(found8, seedId[seed], comparison);
          }

  14. Q2 Then
      // Sum the pixel values to the appropriate seed
      for (int i = 0; i < 8; i++)
      {
          // Reinterpret the lane's bits as the integer seed index
          int found = (int &) found8.m256_f32[i];
          seedCounters[found] += 1.;
          seedSums[0][found] += ((float *) R)[i];
          seedSums[1][found] += ((float *) G)[i];
          seedSums[2][found] += ((float *) B)[i];
      }
      R++; G++; B++;
      }
      …
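
The seed index travels through the SIMD code as raw bits inside a float lane: `(float &) seed` on the way in, `(int &)` on the way out. A portable way to express the same reinterpretation is `std::memcpy`, sketched below. `intBitsToFloat` and `floatBitsToInt` are hypothetical helper names, and the sketch assumes that `int` and `float` have the same size.

```cpp
#include <cstring>

static_assert(sizeof(int) == sizeof(float), "the trick needs equal sizes");

// Copies the bit pattern of an int into a float without converting the value.
inline float intBitsToFloat(int value)
{
    float result;
    std::memcpy(&result, &value, sizeof result);
    return result;
}

// Inverse operation: recovers the int whose bits were stored in the float.
inline int floatBitsToInt(float value)
{
    int result;
    std::memcpy(&result, &value, sizeof result);
    return result;
}
```

This works for the seed index because a blend only copies bits from one lane to another. Arithmetic or comparisons on such a lane would not give meaningful results.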

  15. Q2 Recompute new seeds
      Still the same!!!
      // Last step for the iteration: compute average and update the current seed list
      for (int seed = 0; seed < N; seed++)
      {
          if (seedCounters[seed] > 0.01)
          {
              seeds[0][seed] = seedSums[0][seed] / seedCounters[seed];
              seeds[1][seed] = seedSums[1][seed] / seedCounters[seed];
              seeds[2][seed] = seedSums[2][seed] / seedCounters[seed];
          }
      }
      … // End of iteration

  16. Q3
      • Most of you got the principles more or less right.
      • Practical implementation was wrong!
        • Barriers were sometimes at the wrong location.
        • Most of you added extra, unneeded barriers.
        • Mutexes have been accepted.
          • Putting a lock on every seed change is too much/not good!
      • Errors:
        • Only using results from one thread at each iteration.

  17. Q3 Idea
      • Break down the image into 4 pieces.
      • For each thread iteration:
        • Copy seeds into local variables (performance).
        • Loop over the current chunk of pixels.
          • Compute seedSums and seedCounters the same way.
        • Copy results into globally visible but separate variables.
        • Barrier
        • One thread
          • adds results from the other threads to its own results,
          • then computes the RGB average and updates the seeds.
        • Barrier
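
The two-barrier pattern in this list can be sketched on a much smaller problem. In the block below, which is an illustration under stated assumptions and not the coursework solution, each thread sums its own chunk of the integers 0 to 1023, publishes the partial sum in its own slot, and thread 0 merges the slots between the two barriers. All names (`kThreads`, `partialSums`, `sketchWorker`, `runBarrierSketch`, and so on) are hypothetical.

```cpp
#include <cstdint>
#include <pthread.h>

constexpr int kThreads = 4;        // hypothetical thread count
constexpr int kItems = 1024;       // hypothetical workload size
constexpr int kIterations = 3;     // hypothetical iteration count

static pthread_barrier_t sketchBarrier;
static float partialSums[kThreads];   // globally visible, one slot per thread
static float mergedSum;               // written by thread 0 only

static void *sketchWorker(void *arg)
{
    const int id = static_cast<int>(reinterpret_cast<std::intptr_t>(arg));
    const int chunk = kItems / kThreads;
    const int first = id * chunk;
    const int last = first + chunk;
    for (int k = 0; k < kIterations; k++)
    {
        float local = 0.f;                       // private accumulation, no locking
        for (int i = first; i < last; i++)
            local += static_cast<float>(i);
        partialSums[id] = local;                 // publish in this thread's own slot
        pthread_barrier_wait(&sketchBarrier);    // every slot is now written
        if (id == 0)
        {
            float sum = 0.f;
            for (int t = 0; t < kThreads; t++)   // use the results of ALL threads
                sum += partialSums[t];
            mergedSum = sum;
        }
        pthread_barrier_wait(&sketchBarrier);    // merged state is ready for the next iteration
    }
    return nullptr;
}

float runBarrierSketch()
{
    pthread_t threads[kThreads];
    pthread_barrier_init(&sketchBarrier, nullptr, kThreads);
    for (int i = 0; i < kThreads; i++)
        pthread_create(&threads[i], nullptr, sketchWorker,
                       reinterpret_cast<void *>(static_cast<std::intptr_t>(i)));
    for (int i = 0; i < kThreads; i++)
        pthread_join(threads[i], nullptr);
    pthread_barrier_destroy(&sketchBarrier);
    return mergedSum;
}
```

Only two barriers per iteration are needed: one before the merge, so that every slot is complete, and one after it, so that no thread starts the next iteration with stale shared state.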

  18. Q3 Creating Threads
      void knnCompressionSIMDPosix(float *image, int x, int y)
      {
          AoS_to_SoA(image, x, y);
          threadJobSize = x*y / nbThreads;
          pthread_t threads[nbThreads];
          pthread_barrier_init(&barrier, NULL, nbThreads);
          for (int i = 0; i < nbThreads; i++)
              // Widen i before the pointer cast; posixThread casts it back to long long.
              pthread_create(&threads[i], NULL, posixThread, (void *) (long long) i);
          for (int i = 0; i < nbThreads; i++) // separate loop
              pthread_join(threads[i], NULL);
      }
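
`threadJobSize = x*y / nbThreads` works cleanly when the pixel count divides evenly into chunks that are multiples of 8. Otherwise a chunk can start at an address that is no longer 32-byte aligned, and leftover pixels can be skipped. A minimal sketch of one alternative split follows. `PixelRange` and `threadRange` are hypothetical names, and the sketch assumes that the final, possibly partial, group of 8 is covered by the padding of the channel arrays or by a scalar tail loop.

```cpp
#include <cassert>

struct PixelRange { int first; int last; };   // half-open range [first, last)

// Hypothetical helper: splits pixelCount pixels into threadCount contiguous
// ranges whose starting index is always a multiple of 8.
inline PixelRange threadRange(int pixelCount, int threadCount, int threadNumber)
{
    assert(pixelCount >= 0 && threadCount > 0);
    assert(threadNumber >= 0 && threadNumber < threadCount);
    const int groups = (pixelCount + 7) / 8;        // 8-pixel groups, rounded up
    const int groupsPerThread = groups / threadCount;
    const int extra = groups % threadCount;         // first `extra` threads get one more group
    const int firstGroup = threadNumber * groupsPerThread
                         + (threadNumber < extra ? threadNumber : extra);
    const int groupCount = groupsPerThread + (threadNumber < extra ? 1 : 0);
    PixelRange range;
    range.first = firstGroup * 8;
    range.last = (firstGroup + groupCount) * 8;
    if (range.first > pixelCount) range.first = pixelCount;   // clip to the image
    if (range.last > pixelCount) range.last = pixelCount;
    return range;
}
```

For 100 pixels and 4 threads this gives the ranges [0, 32), [32, 56), [56, 80) and [80, 100).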

  19. Q3 Thread’s Job
      void *posixThread(void *arg)
      {
          long long threadNumber = (long long) arg;
          int firstPixel = threadNumber * threadJobSize;
          int lastPixel = firstPixel + threadJobSize;
          float seedSums[3][N];
          float seedCounters[N];
          // Seed initialization
          float8 seedId[N];
          for (int seed = 0; seed < N; seed++)
              seedId[seed] = set1((float &) seed);
          // Only one thread picks the initial seeds; the others wait at the barrier.
          if (threadNumber == 0)
              for (int j = 0; j < 3; j++)
                  for (int i = 0; i < N; i++)
                      seeds[j][i] = (rand() + 0.5f) / (RAND_MAX + 1.f);
          pthread_barrier_wait(&barrier);

  20. Q3 Thread’s Job
      for (int k = 0; k < knnIterations; k++)
      {
          … // Seed initialization is the same
          float8 *R = (float8 *) (alignedRed + firstPixel);
          float8 *G = (float8 *) (alignedGreen + firstPixel);
          float8 *B = (float8 *) (alignedBlue + firstPixel);
          for (int pixel = firstPixel; pixel < lastPixel; pixel += 8)
          {
              … // Loop code does not change
              R++; G++; B++;
          }

  21. Q3 Merging Results
      // Publish this thread's results in its own globally visible slot
      for (int seed = 0; seed < N; seed++)
      {
          temporaryResults[threadNumber][0][seed] = seedSums[0][seed];
          temporaryResults[threadNumber][1][seed] = seedSums[1][seed];
          temporaryResults[threadNumber][2][seed] = seedSums[2][seed];
          temporaryCounters[threadNumber][seed] = seedCounters[seed];
      }
      pthread_barrier_wait(&barrier);

  22. Q3 Merging Results
      if (threadNumber == 0)
      {
          // Add the results of every other thread to thread 0's own results
          for (int thread = 1; thread < nbThreads; thread++)
              for (int seed = 0; seed < N; seed++)
              {
                  temporaryResults[0][0][seed] += temporaryResults[thread][0][seed];
                  temporaryResults[0][1][seed] += temporaryResults[thread][1][seed];
                  temporaryResults[0][2][seed] += temporaryResults[thread][2][seed];
                  temporaryCounters[0][seed] += temporaryCounters[thread][seed];
              }
          …

  23. Q3 Merging Results
          for (int seed = 0; seed < N; seed++)
          {
              if (temporaryCounters[0][seed] > 0.01)
              {
                  seeds[0][seed] = temporaryResults[0][0][seed] / temporaryCounters[0][seed];
                  seeds[1][seed] = temporaryResults[0][1][seed] / temporaryCounters[0][seed];
                  seeds[2][seed] = temporaryResults[0][2][seed] / temporaryCounters[0][seed];
              }
          }
      } // end condition threadNumber==0
      pthread_barrier_wait(&barrier);
      // end of iteration, seeds have been updated!
