This presentation discusses the optimization of matrix multiplication algorithms using a classifier learning system. It covers cache-aware algorithms, recursive matrix partitioning, and the use of a classifier learning system for finding the best partitioning strategy. Experimental results are also presented.
Optimizing Matrix Multiplication with a Classifier Learning System Xiaoming Li (presenter) María Jesús Garzarán University of Illinois at Urbana-Champaign
Tuning library for recursive matrix multiplication • Use cache-aware algorithms that take into account architectural features • Memory hierarchy • Register file, … • Take into account input characteristics • matrix sizes • The process of tuning is automatic.
Recursive Matrix Partitioning • Previous approaches: • Multiple recursive steps • Only divide by half [Figure: matrices A and B halved in two successive steps, Step 1 and Step 2]
Recursive Matrix Partitioning • Our approach is more general • No need to divide by half • May use a single step to reach the same partition • Faster and more general [Figure: matrices A and B partitioned in a single step, Step 1]
Our approach • A general framework to describe a family of recursive matrix multiplication algorithms, where given the input dimensions of the matrices, we determine: • Number of partition levels • How to partition at each level • An intelligent search method based on a classifier learning system • Search for the best partitioning strategy in a huge search space
Outline • Background • Partition Methods • Classifier Learning System • Experimental Results
Recursive layout framework • Multiple levels of recursion • Takes into account the cache hierarchy [Figure: tiles stored in recursive order; in the 16-tile example the tiles are numbered 1 2 5 6 / 3 4 7 8 / 9 10 13 14 / 11 12 15 16]
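The tile numbering on this slide can be reproduced with a short sketch. It assumes the 16 numbers are read row by row as a 4 x 4 grid of tiles, and that each recursion level stores its sub-blocks in row-major order. The names `recursive_tile_index`, `layout_grid`, and the `factors` parameter are illustrative and not taken from the presented library.

```python
def recursive_tile_index(row, col, factors):
    """Return the 0-based storage position of tile (row, col).

    factors lists (row_parts, col_parts) per recursion level, outermost first.
    Assumption: sub-blocks at every level are stored in row-major order.
    """
    rows = cols = 1
    for pr, pc in factors:
        rows *= pr
        cols *= pc
    index = 0
    for pr, pc in factors:
        rows //= pr          # tiles per sub-block along rows at this level
        cols //= pc          # tiles per sub-block along columns at this level
        block_row, row = divmod(row, rows)
        block_col, col = divmod(col, cols)
        index = index * (pr * pc) + block_row * pc + block_col
    return index


def layout_grid(factors):
    """Return a grid holding the 1-based storage order of every tile."""
    rows = cols = 1
    for pr, pc in factors:
        rows *= pr
        cols *= pc
    return [[recursive_tile_index(r, c, factors) + 1 for c in range(cols)]
            for r in range(rows)]


if __name__ == "__main__":
    for line in layout_grid([(2, 2), (2, 2)]):
        print(line)
```

Because the factors are passed per level, the same sketch covers partitions that do not divide by half, for example `[(3, 2), (2, 2)]`.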
Padding • Necessary when the partition factor is not a divisor of the matrix dimension. [Figure: dividing by 3 pads 2000 to 2001, giving tiles of 667; dividing by 4 then pads 667 to 668, for a padded dimension of 2004]
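The padding arithmetic in this example can be checked with a minimal sketch. It assumes a dimension is padded up to the next multiple of the product of all partition factors, which gives the same result as padding level by level. The function names are illustrative only.

```python
import math


def padded_dimension(n, factors):
    """Smallest size >= n that is divisible by every level's partition factor."""
    product = math.prod(factors)
    return -(-n // product) * product   # integer ceiling of n / product, times product


def tile_sizes(n, factors):
    """Tile size along one dimension after each partition level, outermost first."""
    size = padded_dimension(n, factors)
    sizes = []
    for p in factors:
        size //= p
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    print(padded_dimension(2000, [3]), tile_sizes(2000, [3]))
    print(padded_dimension(2000, [3, 4]), tile_sizes(2000, [3, 4]))
```

With factors 3 and then 4, the sketch pads 2000 to 2004 and yields tiles of 668 and 167, matching the 2001/667 and 2004/668 values on the slides.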
Recursive layout in our framework • Multiple levels of recursion • Supports the cache hierarchy • Square tiles generalized to rectangular tiles • Fits non-square matrices [Figure: non-square example; labels 8 and 9, then 8 and 10 after padding, then 4 and 3]
Two methods to partition matrices • Partition by Block (PB) • Specify the size of each tile • Example: • Dimensions (M,N,K) = (100, 100, 40) • Tile size (bm, bn, bk) = (50, 50, 20) • Resulting partition factors (pm, pn, pk) = (2,2,2) • Tiles need not be square
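The Partition by Block example can be worked through in a few lines. This is a sketch that assumes each partition factor is the dimension divided by the tile size, rounded up when the tile size is not a divisor (the case that requires padding). The function name `partition_by_block` is illustrative.

```python
def partition_by_block(dims, tile):
    """Return (pm, pn, pk) for dimensions (M, N, K) and tile sizes (bm, bn, bk).

    Each factor is ceil(dimension / tile size), computed with integer arithmetic.
    """
    return tuple(-(-d // b) for d, b in zip(dims, tile))


if __name__ == "__main__":
    print(partition_by_block((100, 100, 40), (50, 50, 20)))
```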
Two methods to partition matrices • Partition by Size (PS) • Specify the maximum size of the three tiles • Keep the ratios between dimensions constant • Example: • (M,N,K) = (100, 100, 50) • Maximum tile size for M,N = 1250 • Resulting partition factors (pm, pn, pk) = (2,2,1) • Generalization of the “divide-by-half” approach • Tile size = 1/4 * matrix size
Classifier Learning System • Use the two partition primitives to determine how the input matrices are partitioned • Determine the partition factors at each level: f: (M,N,K) → (pmi,pni,pki), i=0,1,2 (only 3 levels are considered) • The partition factors depend on the matrix size • E.g., the partition factors of a (1000 x 1000) matrix should differ from those of a (50 x 1000) matrix • The partition factors also depend on architectural characteristics, such as cache size
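To make the role of the per-level partition factors concrete, here is a minimal sketch of a recursive multiplication driven by a list of (pm, pn, pk) triples. It is not the presented library: it works on plain Python lists in row-major form rather than a recursive layout, it assumes the dimensions are already padded so every factor divides evenly, and the names `matmul_recursive` and `_multiply` are hypothetical.

```python
def matmul_recursive(A, B, factors):
    """Multiply A (M x K) by B (K x N), partitioning by factors at each level.

    factors: list of (pm, pn, pk), outermost level first.
    Assumption: every factor divides the block size it is applied to.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    _multiply(A, B, C, 0, 0, 0, M, N, K, list(factors))
    return C


def _multiply(A, B, C, i0, j0, k0, m, n, k, factors):
    if not factors:
        # Innermost level: plain triple loop over one tile.
        for i in range(i0, i0 + m):
            for kk in range(k0, k0 + k):
                a = A[i][kk]
                for j in range(j0, j0 + n):
                    C[i][j] += a * B[kk][j]
        return
    (pm, pn, pk), rest = factors[0], factors[1:]
    if m % pm or n % pn or k % pk:
        raise ValueError("dimensions must be padded to a multiple of the factors")
    bm, bn, bk = m // pm, n // pn, k // pk
    for bi in range(pm):
        for bj in range(pn):
            for bx in range(pk):
                _multiply(A, B, C,
                          i0 + bi * bm, j0 + bj * bn, k0 + bx * bk,
                          bm, bn, bk, rest)


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(matmul_recursive(A, B, [(2, 2, 1)]))
```

Changing only the `factors` list changes how the work is blocked without touching the arithmetic, which is the degree of freedom the learning system searches over.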
Determine the best partition factors • The search space is huge, so exhaustive search is infeasible • Our proposal: use a multi-step classifier learning system • It creates a table that, given the matrix dimensions, determines the partition factors
Classifier Learning System • The result of the classifier learning system is a table with two columns • Column 1 (Pattern): A string of ‘0’, ‘1’, and ‘*’ that encodes the dimensions of the matrices • Column 2 (Action): Partition method for one step • Built using the “partition-by-block” and “partition-by-size” primitives with different parameters.
Learn with Classifier System [Figure: each matrix dimension is encoded with 5 bits]
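The pattern/action table can be illustrated with a small lookup sketch. It assumes the three dimensions are concatenated into one 15-symbol string (5 bits per dimension) and that the first matching row wins. The slides do not state how a dimension is mapped to its 5-bit code, so the sketch takes already-encoded bit strings; the action strings and the table contents are hypothetical placeholders.

```python
def matches(pattern, bits):
    """True if every pattern symbol is '*' or equals the corresponding bit."""
    if len(pattern) != len(bits):
        return False
    return all(p == "*" or p == b for p, b in zip(pattern, bits))


def lookup(table, bits):
    """Return the action of the first row whose pattern matches, else None."""
    for pattern, action in table:
        if matches(pattern, bits):
            return action
    return None


# Hypothetical table: 5 symbols each for M, N, K; actions are placeholder labels.
EXAMPLE_TABLE = [
    ("00001" + "*" * 10, "PB(50, 50, 20)"),
    ("*" * 15, "PS(1250)"),
]

if __name__ == "__main__":
    print(lookup(EXAMPLE_TABLE, "00001" + "00010" + "00011"))
    print(lookup(EXAMPLE_TABLE, "1" * 15))
```

A '*' widens the range of matrix dimensions a row covers, so a single row can stand for many input sizes.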
How does the classifier learning algorithm work? • It changes the table based on performance and accuracy feedback from previous runs. • It mutates the condition part of the table to adjust the range of matching matrix dimensions. • It mutates the action part to find the best partition method for the matching matrices.
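Mutating the condition part can be sketched as follows. This is a generic operator in the style of learning classifier systems, not the authors' exact procedure: each symbol of the pattern is replaced, with some probability, by one of the other two symbols. The feedback step that keeps or discards a mutated row is omitted, and `mutate_pattern` and its `rate` parameter are illustrative names.

```python
import random


def mutate_pattern(pattern, rate, rng):
    """Return a copy of pattern where each symbol changes with probability rate.

    Turning a bit into '*' widens the set of matching dimensions;
    turning '*' into a bit narrows it.
    """
    symbols = "01*"
    out = []
    for ch in pattern:
        if rng.random() < rate:
            out.append(rng.choice([s for s in symbols if s != ch]))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)   # fixed seed so repeated runs are reproducible
    print(mutate_pattern("00001**********", 0.2, rng))
```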
Experimental Results • Experiments on three platforms • Sun UltraSparcIII • P4 Intel Xeon • Intel Itanium2 • Matrices of sizes from 1000 x 1000 to 5000 x 5000
Algorithms • Classifier MMM: our approach • Includes the overhead of copying in and out of the recursive layout • ATLAS: library generated by ATLAS using its search procedure, without hand-written codes • Has some type of blocking for L2 • L1: one level of tiling • Tile size: the same as ATLAS uses for L1 • L2: two levels of tiling • L1tile and L2tile: the same as ATLAS for L1
Conclusion and Future Work • Preliminary results show that our approach is effective on two of the three platforms • Sun UltraSparcIII and Xeon: 18% and 5% improvement, respectively • Itanium2: -14% • Need to improve the padding mechanism • Reduce the amount of padding • Avoid unnecessary computation on padding