Optimizing Matrix Multiplication with a Classifier Learning System

This research presents a tuning library for recursive matrix multiplication aimed at efficient memory utilization, exploring cache-aware algorithms and applying automatic tuning based on input characteristics. The Recursive Matrix Partitioning method is detailed, providing a step-by-step approach to improving matrix multiplication performance. The framework includes a Classifier Learning System that intelligently searches for the best partitioning strategy. Experimental results and insights on the partition methods and the classifier learning approach are discussed.


Presentation Transcript


  1. Optimizing Matrix Multiplication with a Classifier Learning System. Xiaoming Li (presenter), María Jesús Garzarán, University of Illinois at Urbana-Champaign

  2. A tuning library for recursive matrix multiplication • Uses cache-aware algorithms that take architectural features into account • Memory hierarchy • Register file, … • Takes input characteristics into account • Matrix sizes • The tuning process is automatic.

  3. Recursive Matrix Partitioning • Previous approaches: • Multiple recursive steps • Only divide by half [figure: matrices A and B before partitioning]

  4. Recursive Matrix Partitioning • Previous approaches: • Multiple recursive steps • Only divide by half [figure: matrices A and B after the first divide-by-half step]

  5. Recursive Matrix Partitioning • Previous approaches: • Multiple recursive steps • Only divide by half [figure: matrices A and B after the second divide-by-half step]

  6. Recursive Matrix Partitioning • Our approach is more general • No need to divide by half • May use a single step to reach the same partition • Faster and more general [figure: matrices A and B reaching the same partition in one step]

  7. Our approach • A general framework to describe a family of recursive matrix multiplication algorithms, where given the input dimensions of the matrices, we determine: • Number of partition levels • How to partition at each level • An intelligent search method based on a classifier learning system • Search for the best partitioning strategy in a huge search space
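As a concrete illustration of the framework on slide 7, the following minimal C sketch recurses with arbitrary per-level partition factors instead of always halving. The factor values, helper names, and plain-loop base case are illustrative assumptions, not the authors' code; a real library would use a register-tiled kernel and handle non-divisible sizes via padding.

    #include <stddef.h>

    #define LEVELS 3                       /* slide 26: only 3 levels considered */

    /* Hypothetical per-level partition factors; the tuning system would
       choose these from the matrix dimensions. */
    static const int pm[LEVELS] = {2, 3, 2};
    static const int pn[LEVELS] = {2, 2, 2};
    static const int pk[LEVELS] = {1, 2, 2};

    /* Base case: a plain triple loop for clarity. */
    static void kernel_mmm(const double *A, const double *B, double *C,
                           int M, int N, int K, int lda, int ldb, int ldc)
    {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < K; k++)
                    C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
    }

    /* One recursive step may split each dimension by any factor, not only
       by two; dimensions are assumed divisible (padding handles the rest). */
    static void rec_mmm(const double *A, const double *B, double *C,
                        int M, int N, int K, int lda, int ldb, int ldc, int lvl)
    {
        if (lvl == LEVELS) {
            kernel_mmm(A, B, C, M, N, K, lda, ldb, ldc);
            return;
        }
        int bm = M / pm[lvl], bn = N / pn[lvl], bk = K / pk[lvl];
        for (int i = 0; i < pm[lvl]; i++)
            for (int j = 0; j < pn[lvl]; j++)
                for (int k = 0; k < pk[lvl]; k++)
                    rec_mmm(A + ((size_t)i * bm) * lda + (size_t)k * bk,
                            B + ((size_t)k * bk) * ldb + (size_t)j * bn,
                            C + ((size_t)i * bm) * ldc + (size_t)j * bn,
                            bm, bn, bk, lda, ldb, ldc, lvl + 1);
    }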

  8. Outline • Background • Partition Methods • Classifier Learning System • Experimental Results

  9. Recursive layout framework • Multiple levels of recursion • Takes into account the cache hierarchy

  10. Recursive layout framework • Multiple levels of recursion • Takes into account the cache hierarchy [figure: first-level layout, a 2 x 2 split with tiles numbered 1–4]

  11. Recursive layout in our framework • Multiple levels of recursion • Takes into account the cache hierarchy

  12. Recursive layout framework • Multiple levels of recursion • Takes into account the cache hierarchy

  13. Recursive layout framework • Multiple levels of recursion • Takes into account the cache hierarchy [figure: two-level layout, 4 x 4 tiles numbered row by row as 1 2 5 6 / 3 4 7 8 / 9 10 13 14 / 11 12 15 16]
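The numbering on slide 13 is the recursive layout: tiles are contiguous within each first-level quadrant. Below is a minimal sketch of the index computation for this specific two-level, factor-2 case; the function name is illustrative, and the framework generalizes to arbitrary per-level factors.

    /* Map a tile's (row, col) position in the 4 x 4 grid of slide 13 to
       its 0-based position in memory under the two-level recursive layout. */
    int recursive_tile_index(int row, int col)
    {
        int outer = (row / 2) * 2 + (col / 2);  /* which 2 x 2 quadrant (row-major) */
        int inner = (row % 2) * 2 + (col % 2);  /* position inside the quadrant     */
        return outer * 4 + inner;               /* slide numbers are this value + 1 */
    }

For example, tile (row 2, col 1) lands in quadrant 2 at inner position 1, giving index 9, which matches the "10" in the slide's 1-based numbering.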

  14. Padding • Necessary when the partition factor is not a divisor of the matrix dimension. [figure: matrix dimension 2000, to be divided by 3]

  15. Padding • Necessary when the partition factor is not a divisor of the matrix dimension. [figure: dimension padded from 2000 to 2001; dividing by 3 gives tiles of 667]

  16. Padding • Necessary when the partition factor is not a divisor of the matrix dimension. [figure: a second level divides each 667-element tile by 4]

  17. Padding • Necessary when the partition factor is not a divisor of the matrix dimension. [figure: each tile padded from 667 to 668, so the overall dimension grows to 2004]
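The padding rule behind these numbers is simply rounding each dimension up to the next multiple of the partition factor, as in this minimal sketch (the helper name is illustrative):

    /* Smallest multiple of p that is >= n.
       pad_dim(2000, 3) == 2001  (first level: tiles of 667);
       pad_dim(667, 4)  == 668   (second level: the whole dimension
                                  becomes 3 * 668 = 2004). */
    int pad_dim(int n, int p)
    {
        return ((n + p - 1) / p) * p;
    }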

  18. Recursive layout in our framework • Multiple levels of recursion • Supports the cache hierarchy • Square tile → rectangular tile • Fits non-square matrices

  19. Recursive layout in our framework • Multiple levels of recursion • Supports the cache hierarchy • Square tile → rectangular tile • Fits non-square matrices [figure: a matrix with dimensions 8 and 9]

  20. Recursive layout in our framework • Multiple levels of recursion • Supports the cache hierarchy • Square tile → rectangular tile • Fits non-square matrices [figure: after padding, dimensions 8 and 10]

  21. Recursive layout in our framework • Multiple levels of recursion • Supports the cache hierarchy • Square tile → rectangular tile • Fits non-square matrices [figure: rectangular 4 x 3 tiles]

  22. Outline • Background • Partition Methods • Classifier Learning System • Experimental Results

  23. Two methods to partition matrices • Partition by Block (PB) • Specify the size of each tile • Example: • Dimensions (M, N, K) = (100, 100, 40) • Tile size (bm, bn, bk) = (50, 50, 20) → Partition factors (pm, pn, pk) = (2, 2, 2) • Tiles need not be square
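In code, Partition by Block derives each factor by dividing the dimension by the tile size and rounding up; the rounding is what triggers padding. A minimal sketch with illustrative helper names:

    int ceil_div(int n, int d) { return (n + d - 1) / d; }

    /* Partition by Block: tile sizes in, partition factors out.
       (M, N, K) = (100, 100, 40) with tiles (50, 50, 20) -> (2, 2, 2). */
    void partition_by_block(int M, int N, int K,
                            int bm, int bn, int bk,
                            int *pm, int *pn, int *pk)
    {
        *pm = ceil_div(M, bm);
        *pn = ceil_div(N, bn);
        *pk = ceil_div(K, bk);
    }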

  24. Two methods to partition matrices • Partition by Size (PS) • Specify the maximum size of the three tiles • Keep the ratios between the dimensions constant • Example: • (M, N, K) = (100, 100, 50) • Maximum tile size for M, N = 1250 → (pm, pn, pk) = (2, 2, 1) • Generalization of the “divide-by-half” approach • Tile size = 1/4 * matrix size
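One way to read the constant-ratio rule is to pick factors proportional to the dimensions, which reproduces the example's (2, 2, 1). The sketch below captures only that reading; the slide's maximum-tile-size cap would scale these factors further, and that rule is not reconstructed here.

    int gcd2(int a, int b) { return b ? gcd2(b, a % b) : a; }

    /* Partition by Size (one plausible reading): minimal factors that
       keep the M : N : K ratio, e.g. (100, 100, 50) -> (2, 2, 1),
       generalizing the divide-by-half pattern. */
    void partition_by_size(int M, int N, int K,
                           int *pm, int *pn, int *pk)
    {
        int g = gcd2(gcd2(M, N), K);   /* largest common tile edge */
        *pm = M / g;
        *pn = N / g;
        *pk = K / g;
    }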

  25. Outline • Background • Partition Methods • Classifier Learning System • Experimental Results

  26. Classifier Learning System • Use the two partition primitives to determine how the input matrices are partitioned • Determine partition factors at each level: f: (M, N, K) → (pm_i, pn_i, pk_i), i = 0, 1, 2 (only 3 levels are considered) • The partition factors depend on the matrix size • E.g., the partition factors of a (1000 x 1000) matrix should be different from those of a (50 x 1000) matrix. • The partition factors also depend on architectural characteristics, like cache size.

  27. Determine the best partition factors • The search space is huge → exhaustive search is impossible • Our proposal: use a multi-step classifier learning system • Creates a table that, given the matrix dimensions, determines the partition factors

  28. Classifier Learning System • The result of the classifier learning system is a table with two columns • Column 1 (Pattern): A string of ‘0’, ‘1’, and ‘*’ that encodes the dimensions of the matrices • Column 2 (Action): Partition method for one step • Built using the “partition-by-block” and “partition-by-size” primitives with different parameters.
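To make the table concrete, here is a minimal, self-contained sketch of the pattern side: dimensions are encoded as fixed-width bit strings (5 bits per dimension, as in the example that follows) and matched against ternary patterns in which '*' matches either bit. The widths and names are illustrative; the real system presumably normalizes dimensions to fit the encoding width.

    #define BITS 5   /* bits per dimension, as on slide 30 */

    /* Encode (m, n, k) as a 15-character string of '0'/'1', MSB first. */
    void encode(int m, int n, int k, char out[3 * BITS + 1])
    {
        int dims[3] = {m, n, k};
        for (int d = 0; d < 3; d++)
            for (int b = 0; b < BITS; b++)
                out[d * BITS + b] =
                    ((dims[d] >> (BITS - 1 - b)) & 1) ? '1' : '0';
        out[3 * BITS] = '\0';
    }

    /* A table row's pattern matches if every non-'*' position agrees. */
    int matches(const char *pattern, const char *encoded)
    {
        for (int i = 0; i < 3 * BITS; i++)
            if (pattern[i] != '*' && pattern[i] != encoded[i])
                return 0;
        return 1;
    }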

  29. Learn with Classifier System [figure: the classifier table and an example run]

  30. Learn with Classifier System [figure: each matrix dimension encoded with 5 bits]

  31. Learn with Classifier System [figure: example input with dimensions 24 and 16]

  32. Learn with Classifier System [figure: the encoded 24 x 16 input is matched against the table]

  33. Learn with Classifier System [figure: the selected action partitions the input into 12 x 8 tiles]

  34. Learn with Classifier System [figure: the 12 x 8 tiles are encoded for the next level]

  35. Learn with Classifier System [figure: the 12 x 8 tiles are matched against the table]

  36. Learn with Classifier System [figure: the recursion ends with 4 x 4 tiles]

  37. How does the classifier learning algorithm work? • Change the table based on feedback about performance and accuracy from previous runs. • Mutate the condition part of a table entry to adjust the range of matrix dimensions it matches. • Mutate the action part to find the best partition method for the matching matrices.
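A hedged, self-contained sketch of that feedback loop follows. The Rule type, the exponential reward update, and the mutation choices are illustrative stand-ins for the mechanism the slide describes, not the authors' implementation.

    #include <stdlib.h>

    typedef struct {
        char   pattern[16]; /* ternary condition over the encoded dimensions */
        int    action;      /* id of a PB/PS primitive with its parameters   */
        double reward;      /* running estimate of delivered performance     */
    } Rule;

    /* Update one rule from a measured run; mutate it if it underperforms. */
    void learn_step(Rule *r, double gflops, double avg_reward)
    {
        r->reward = 0.9 * r->reward + 0.1 * gflops;  /* exponential average */
        if (r->reward < avg_reward) {
            /* Mutate the condition: rewrite one position among '0','1','*'
               to widen or narrow the range of matching dimensions. */
            r->pattern[rand() % 15] = "01*"[rand() % 3];
            /* Mutate the action: try a neighbouring partition method
               (bounds checking omitted in this sketch). */
            r->action += (rand() % 3) - 1;
        }
    }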

  38. Outline • Background • Partition Methods • Classifier Learning System • Experimental Results

  39. Experimental Results • Experiments on three platforms • Sun UltraSparcIII • P4 Intel Xeon • Intel Itanium2 • Matrices of sizes from 1000 x 1000 to 5000 x 5000

  40. Algorithms • Classifier MMM: our approach • Includes the overhead of copying in and out of the recursive layout • ATLAS: library generated by ATLAS using the search procedure, without hand-written code • Has some type of blocking for L2 • L1: one level of tiling • Tile size: the same as ATLAS uses for L1 • L2: two levels of tiling • L1 tile and L2 tile: the same as ATLAS uses for L1

  41. Conclusion and Future Work • Preliminary results show the effectiveness of our approach • Sun UltraSparcIII and Xeon: 18% and 5% improvement, respectively • Itanium: -14% • Need to improve the padding mechanism • Reduce the amount of padding • Avoid unnecessary computation on padding

  42. Thank you!
