
EECS 583 – Class 22 Research Topic 4: Automatic SIMDization - Superword Level Parallelism

EECS 583 – Class 22 Research Topic 4: Automatic SIMDization - Superword Level Parallelism. University of Michigan December 10, 2012. Announcements. Last class today! No more reading Dec 12-18 – Project presentations Each group sign up for 30-minute slot


Presentation Transcript


  1. EECS 583 – Class 22 • Research Topic 4: Automatic SIMDization - Superword Level Parallelism • University of Michigan • December 10, 2012

  2. Announcements • Last class today! • No more reading • Dec 12-18 – Project presentations • Each group sign up for 30-minute slot • See me after class if you have not signed up • Course evaluations reminder • Please fill one out, it will only take 5 minutes • I do read them • Improve the experience for future 583 students

  3. Notes on Project Demos • Demo format • Each group gets 30 minutes • Strict deadlines enforced because many back to back groups • Don’t be late! • Figure out your room number ahead of time (see schedule on my door) • Plan for 20 mins of presentation (no more!), 10 mins questions • Some slides are helpful, try to have all group members say something • Talk about what you did (basic idea, previous work), how you did it (approach + implementation), and results • Demo or real code examples are good • Report • 5 pg double spaced including figures – what you did + why, implementation, and results • Due either when you do your demo or Dec 18 at 6pm

  4. SIMD Processors: Larrabee (now called Knights Corner) Block Diagram

  5. Vector Unit Block Diagram

  6. Processor Core Block Diagram

  7. Larrabee vs Conventional GPUs • Each Larrabee core is a complete Intel processor • Context switching & pre-emptive multi-tasking • Virtual memory and page swapping, even in texture logic • Fully coherent caches at all levels of the hierarchy • Efficient inter-block communication • Ring bus for full inter-processor communication • Low latency high bandwidth L1 and L2 caches • Fast synchronization between cores and caches • Larrabee: the programmability of IA with the parallelism of graphics processors

  8. Exploiting Superword Level Parallelism with Multimedia Instruction Sets

  9. Multimedia Extensions • Additions to all major ISAs • SIMD operations

  10. Using Multimedia Extensions • Library calls and inline assembly • Difficult to program • Not portable • Different extensions to the same ISA • MMX and SSE • SSE vs. 3DNow! • Need automatic compilation

  11. Vector Compilation • Pros: • Successful for vector computers • Large body of research • Cons: • Involved transformations • Targets loop nests

  12. Superword Level Parallelism (SLP) • Small amount of parallelism • Typically 2 to 8-way • Exists within basic blocks • Uncovered with a simple analysis • Independent isomorphic operations • New paradigm

  13. 1. Independent ALU Ops • Scalar statements: R = R + XR * 1.08327; G = G + XG * 1.89234; B = B + XB * 1.29835 • Packed form: [R G B] = [R G B] + [XR XG XB] * [1.08327 1.89234 1.29835]
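A minimal sketch of the equivalence on this slide, assuming a superword can be modeled as a Python tuple with one element per lane. The function names and sample inputs are hypothetical; real SIMD hardware would execute each packed line as a single instruction.

```python
# Coefficients taken from the slide; everything else is illustrative.
COEFFS = (1.08327, 1.89234, 1.29835)

def scalar_update(r, g, b, xr, xg, xb):
    """Three independent, isomorphic scalar statements."""
    r = r + xr * COEFFS[0]
    g = g + xg * COEFFS[1]
    b = b + xb * COEFFS[2]
    return r, g, b

def packed_update(rgb, x):
    """The same work as one packed multiply and one packed add."""
    prod = tuple(xi * ki for xi, ki in zip(x, COEFFS))   # models one SIMD multiply
    return tuple(ci + pi for ci, pi in zip(rgb, prod))   # models one SIMD add

rgb_scalar = scalar_update(0.1, 0.2, 0.3, 1.0, 2.0, 3.0)
rgb_packed = packed_update((0.1, 0.2, 0.3), (1.0, 2.0, 3.0))
```

Each lane performs the same multiply-then-add as its scalar counterpart, so the two results match exactly.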

  14. 2. Adjacent Memory References • Scalar statements: R = R + X[i+0]; G = G + X[i+1]; B = B + X[i+2] • Packed form: [R G B] = [R G B] + X[i:i+2]
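A small sketch of the adjacent-reference case, again modeling a pack as a tuple. The function names are hypothetical. Note that the slide's X[i:i+2] is an inclusive range, whereas a Python slice excludes its end, so the equivalent slice is X[i:i+3].

```python
def scalar_loads(r, g, b, X, i):
    """Three scalar loads from consecutive addresses, each feeding an add."""
    r = r + X[i + 0]
    g = g + X[i + 1]
    b = b + X[i + 2]
    return r, g, b

def packed_load(rgb, X, i):
    """One wide load replaces the three scalar loads."""
    x = X[i:i + 3]  # reads X[i], X[i+1], X[i+2]; the operands arrive already packed
    return tuple(c + v for c, v in zip(rgb, x))  # models one SIMD add

X = [10, 20, 30, 40, 50]
result_scalar = scalar_loads(1, 2, 3, X, 1)
result_packed = packed_load((1, 2, 3), X, 1)
```

Because the three values are contiguous in memory, no separate packing step is needed before the add.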

  15. 3. Vectorizable Loops for (i=0; i<100; i+=1) A[i+0] = A[i+0] + B[i+0]

  16. 3. Vectorizable Loops • Unrolled: for (i=0; i<100; i+=4) { A[i+0] = A[i+0] + B[i+0]; A[i+1] = A[i+1] + B[i+1]; A[i+2] = A[i+2] + B[i+2]; A[i+3] = A[i+3] + B[i+3] } • Packed: for (i=0; i<100; i+=4) A[i:i+3] = A[i:i+3] + B[i:i+3]
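A sketch of the unroll-then-pack transformation for this loop, assuming the trip count is a multiple of the pack width (100 and 4 on the slide). The function names and the `width` parameter are hypothetical.

```python
def scalar_loop(A, B):
    """Original loop: one element per iteration."""
    A = list(A)
    for i in range(len(A)):
        A[i] = A[i] + B[i]
    return A

def slp_loop(A, B, width=4):
    """Unrolled by `width`, with the unrolled statements packed into one operation."""
    A = list(A)
    assert len(A) % width == 0, "sketch assumes no remainder iterations"
    for i in range(0, len(A), width):
        a = A[i:i + width]                            # one wide load
        b = B[i:i + width]                            # one wide load
        A[i:i + width] = [x + y for x, y in zip(a, b)]  # one SIMD add, one wide store
    return A

A0 = list(range(100))
B0 = list(range(100, 200))
out_scalar = scalar_loop(A0, B0)
out_slp = slp_loop(A0, B0)
```

The unrolled statements in each iteration are independent and isomorphic, and their memory references are adjacent, which is why the whole loop body collapses into wide operations.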

  17. 4. Partially Vectorizable Loops for (i=0; i<16; i+=1) { L = A[i+0] - B[i+0]; D = D + abs(L) }

  18. 4. Partially Vectorizable Loops • Unrolled: for (i=0; i<16; i+=2) { L = A[i+0] - B[i+0]; D = D + abs(L); L = A[i+1] - B[i+1]; D = D + abs(L) } • Packed: for (i=0; i<16; i+=2) { [L0 L1] = A[i:i+1] - B[i:i+1]; D = D + abs(L0); D = D + abs(L1) }
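A sketch of the partially vectorizable case, with hypothetical function names and sample data. Only the subtraction is packed. The accumulation into D is a serial dependence chain, so it stays scalar and each lane is unpacked before use.

```python
def sad_scalar(A, B):
    """Original loop: sum of absolute differences over 16 elements."""
    D = 0
    for i in range(0, 16, 1):
        L = A[i] - B[i]
        D = D + abs(L)
    return D

def sad_slp(A, B):
    """Unrolled by 2: the subtract is packed, the accumulate remains scalar."""
    D = 0
    for i in range(0, 16, 2):
        # Models one 2-wide SIMD subtract over adjacent elements.
        L0, L1 = (a - b for a, b in zip(A[i:i + 2], B[i:i + 2]))
        D = D + abs(L0)  # lane 0 unpacked, scalar accumulate
        D = D + abs(L1)  # lane 1 unpacked, scalar accumulate
    return D

A1 = list(range(16))
B1 = [(3 * i) % 7 for i in range(16)]
d_scalar = sad_scalar(A1, B1)
d_slp = sad_slp(A1, B1)
```

A loop-level vectorizer would typically need reduction support to handle this loop, while statement-level packing still captures the parallel part.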

  19. Exploiting SLP with SIMD Execution • Benefit: • Multiple ALU ops → One SIMD op • Multiple ld/st ops → One wide mem op • Cost: • Packing and unpacking • Reshuffling within a register

  20. Packing/Unpacking Costs • Scalar statements: C = A + 2; D = B + 3 • Packed form: [C D] = [A B] + [2 3]

  21. Packing/Unpacking Costs • Packing source operands • Scalar statements: A = f(); B = g(); C = A + 2; D = B + 3 • Packed form: scalars A and B are first copied into the pack [A B], then [C D] = [A B] + [2 3]

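  22. Packing/Unpacking Costs • Packing source operands • Unpacking destination operands • Scalar statements: A = f(); B = g(); C = A + 2; D = B + 3; E = C / 5; F = D * 7 • Packed form: scalars A and B are first copied into the pack [A B], then [C D] = [A B] + [2 3], then C and D are copied out of the pack for E = C / 5 and F = D * 7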
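A toy cost model for the three situations on these slides. The unit costs are assumptions chosen for illustration (one cycle per scalar op, per SIMD op, and per lane moved in or out of a pack), and the constant pack [2 3] is assumed to be free. Real costs depend on the target ISA.

```python
def scalar_cost(n_ops, op_cost=1):
    """Cost of executing n_ops statements as scalar instructions."""
    return n_ops * op_cost

def packed_cost(n_lanes, pack_sources, unpack_results, simd_cost=1, move_cost=1):
    """Cost of one SIMD op plus any per-lane packing and unpacking moves."""
    cost = simd_cost
    if pack_sources:
        cost += n_lanes * move_cost   # copy each scalar source into the pack
    if unpack_results:
        cost += n_lanes * move_cost   # copy each result lane back to a scalar
    return cost

baseline = scalar_cost(2)                                           # C = A + 2; D = B + 3
already_packed = packed_cost(2, pack_sources=False, unpack_results=False)
pack_only = packed_cost(2, pack_sources=True, unpack_results=False)
pack_and_unpack = packed_cost(2, pack_sources=True, unpack_results=True)
```

Under these assumed costs the packed add only wins when its operands are already packed and its results stay packed, which is the tension the slides illustrate.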

  23. Optimizing Program Performance • To achieve the best speedup: • Maximize parallelization • Minimize packing/unpacking • Many packing possibilities • Worst case: n ops → n! configurations • Different cost/benefit for each choice

  24. Observation 1: Packing Costs can be Amortized • Use packed result operands • A = B + C; D = E + F; G = A - H; I = D - J

  25. Observation 1: Packing Costs can be Amortized • Use packed result operands: A = B + C; D = E + F; G = A - H; I = D - J • Share packed source operands: A = B + C; D = E + F; G = B + H; I = E + J
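A sketch of the amortization argument, counting how many source packs must be assembled from scalars for each example on the slide. The pack names (such as "BE" for [B E]) and the function `count_packs_built` are hypothetical, and the model assumes that once a pack exists it stays available in a register.

```python
def count_packs_built(statements):
    """statements: list of (dest_pack, src_pack, src_pack) in execution order.
    Returns how many source packs had to be built from scalars."""
    available = set()
    built = 0
    for dest, *srcs in statements:
        for s in srcs:
            if s not in available:
                built += 1          # pack must be assembled from scalar values
                available.add(s)
        available.add(dest)         # a packed result can be reused directly
    return built

# [A D] = [B E] + [C F]; [G I] = [A D] - [H J]  (reuses the packed result)
reuse_result = count_packs_built([("AD", "BE", "CF"), ("GI", "AD", "HJ")])

# [A D] = [B E] + [C F]; [G I] = [B E] + [H J]  (shares a packed source)
share_source = count_packs_built([("AD", "BE", "CF"), ("GI", "BE", "HJ")])

# Treating each packed statement in isolation would build two packs per statement.
naive = 2 * 2
```

In both examples three packs are built instead of four, and the saving grows as more statements reuse the same packs.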

  26. Observation 2: Adjacent Memory is Key • Large potential performance gains • Eliminate ld/st instructions • Reduce memory bandwidth • Few packing possibilities • Only one ordering exploits pre-packing

  27. SLP Extraction Algorithm • Identify adjacent memory references • Basic block: A = X[i+0]; C = E * 3; B = X[i+1]; H = C - A; D = F * 5; J = D - B

  28. SLP Extraction Algorithm • Identify adjacent memory references • Basic block: A = X[i+0]; C = E * 3; B = X[i+1]; H = C - A; D = F * 5; J = D - B • Packs so far: [A B] = X[i:i+1]

  29. SLP Extraction Algorithm • Follow def-use chains • Basic block: A = X[i+0]; C = E * 3; B = X[i+1]; H = C - A; D = F * 5; J = D - B • Packs so far: [A B] = X[i:i+1]

  30. SLP Extraction Algorithm • Follow def-use chains • Basic block: A = X[i+0]; C = E * 3; B = X[i+1]; H = C - A; D = F * 5; J = D - B • Packs so far: [A B] = X[i:i+1]; [H J] = [C D] - [A B]

  31. SLP Extraction Algorithm • Follow use-def chains • Basic block: A = X[i+0]; C = E * 3; B = X[i+1]; H = C - A; D = F * 5; J = D - B • Packs so far: [A B] = X[i:i+1]; [H J] = [C D] - [A B]

  32. SLP Extraction Algorithm • Follow use-def chains • Basic block: A = X[i+0]; C = E * 3; B = X[i+1]; H = C - A; D = F * 5; J = D - B • Packs so far: [A B] = X[i:i+1]; [C D] = [E F] * [3 5]; [H J] = [C D] - [A B]

  33. SLP Extraction Algorithm • Follow use-def chains • Basic block: A = X[i+0]; C = E * 3; B = X[i+1]; H = C - A; D = F * 5; J = D - B • Packs so far: [A B] = X[i:i+1]; [C D] = [E F] * [3 5]; [H J] = [C D] - [A B]
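The three steps on these slides (seed with adjacent memory references, follow def-use chains, follow use-def chains) can be sketched on the example basic block. This is a simplified illustration, not the published algorithm: the statement encoding, all function names, and the independence test (direct dependences only) are assumptions, and it omits cost estimation, alignment analysis, and scheduling.

```python
from collections import namedtuple

# A statement is dest = op(srcs). A load's single source is an (array, offset) pair.
Stmt = namedtuple("Stmt", "dest op srcs")

BLOCK = [
    Stmt("A", "load", (("X", 0),)),
    Stmt("C", "*", ("E", 3)),
    Stmt("B", "load", (("X", 1),)),
    Stmt("H", "-", ("C", "A")),
    Stmt("D", "*", ("F", 5)),
    Stmt("J", "-", ("D", "B")),
]

def can_pack(s, t, packed):
    """Distinct, isomorphic, directly independent, and not already packed."""
    isomorphic = s.op == t.op and len(s.srcs) == len(t.srcs)
    independent = s.dest not in t.srcs and t.dest not in s.srcs
    unused = s.dest not in packed and t.dest not in packed
    return s.dest != t.dest and isomorphic and independent and unused

def find_seeds(block):
    """Pairs of loads from the same array at consecutive offsets."""
    loads = [s for s in block if s.op == "load"]
    return [(s, t) for s in loads for t in loads
            if s.srcs[0][0] == t.srcs[0][0] and t.srcs[0][1] == s.srcs[0][1] + 1]

def extract_slp(block):
    defs = {s.dest: s for s in block}
    packs, packed, worklist = [], set(), []

    def add(s, t):
        if can_pack(s, t, packed):
            packs.append((s.dest, t.dest))
            packed.update((s.dest, t.dest))
            worklist.append((s, t))

    for s, t in find_seeds(block):          # step 1: adjacent memory references
        add(s, t)
    while worklist:
        s, t = worklist.pop(0)
        for u in block:                     # step 2: def-use chains
            for v in block:
                for pos in range(min(len(u.srcs), len(v.srcs))):
                    if u.srcs[pos] == s.dest and v.srcs[pos] == t.dest:
                        add(u, v)
        for a, b in zip(s.srcs, t.srcs):    # step 3: use-def chains
            if a in defs and b in defs:
                add(defs[a], defs[b])
    return packs

found = extract_slp(BLOCK)
```

On the example block this discovers the packs in the same order as the slides: [A B] from the adjacent loads, [H J] by following uses of A and B, and [C D] by following the definitions of the operands of H and J.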

  34. SLP Availability

  35. SLP vs. Vector Parallelism

  36. Conclusions • Multimedia architectures abundant • Need automatic compilation • SLP is the right paradigm • 20% non-vectorizable in SPEC95fp • SLP extraction successful • Simple, local analysis • Provides speedups from 1.24 – 6.70 • Found SLP in general-purpose codes
