260 likes | 367 Views
What is the WHT anyway, and why are there so many ways to compute it?. Jeremy Johnson. 1, 2, 6, 24, 112, 568, 3032, 16768,…. Walsh-Hadamard Transform. y = WHT N x, N = 2 n. WHT Algorithms. Factor WHT N into a product of sparse structured matrices Compute: y = (M 1 M 2 … M t )x
E N D
What is the WHT anyway, and why are there so many ways to compute it? Jeremy Johnson 1, 2, 6, 24, 112, 568, 3032, 16768,…
Walsh-Hadamard Transform • y = WHTN x, N = 2n
WHT Algorithms • Factor WHTN into a product of sparse structured matrices • Compute: y = (M1 M2 … Mt)x yt = Mtx … y2 = M2 y3 y = M1 y2
Factoring the WHT Matrix • AC Ä BD = (A Ä B)(C Ä D) • A Ä B = (A Ä I)(I Ä B) • A Ä (BÄ C) = (A Ä B)Ä C • ImÄ In = Imn WHT2Ä WHT2 = (WHT2 Ä I2)(I2 Ä WHT2)
Recursive and Iterative Factorization WHT8= (WHT2 Ä I4)(I2 Ä WHT4) • = (WHT2 Ä I4)(I2 Ä ((WHT2 ÄI2) (I2 Ä WHT2))) • = (WHT2 Ä I4)(I2 Ä (WHT2 ÄI2)) (I2 Ä (I2 Ä WHT2)) • = (WHT2 Ä I4)(I2 Ä (WHT2 ÄI2)) ((I2 Ä I2) Ä WHT2) • = (WHT2 Ä I4)(I2 Ä WHT2 ÄI2) ((I2 Ä I2) Ä WHT2) = (WHT2 Ä I4)(I2 Ä WHT2 ÄI2) (I4 Ä WHT2)
Recursive Algorithm WHT8 = (WHT2 I4)(I2 (WHT2 I2)(I2 WHT2))
é ù 1 1 1 1 1 1 1 1 ê ú - - - - 1 1 1 1 1 1 1 1 ê ú ê ú - - - - 1 1 1 1 1 1 1 1 ê ú - - - - 1 1 1 1 1 1 1 1 ê ú = ê ú - - - - 1 1 1 1 1 1 1 1 ê ú - - - - 1 1 1 1 1 1 1 1 ê ú ê ú - - - - 1 1 1 1 1 1 1 1 ê ú - - - - ê ú 1 1 1 1 1 1 1 1 ë û é ù é ù é ù 1 1 1 1 1 1 ê ú ê ú ê ú - 1 1 1 1 1 1 ê ú ê ú ê ú ê ú ê ú ê ú - 1 1 1 1 1 1 ê ú ê ú ê ú - - 1 1 1 1 1 1 ê ú ê ú ê ú ê ú ê ú ê ú - 1 1 1 1 1 1 ê ú ê ú ê ú - - 1 1 1 1 1 ê ú ê ú ê ú ê ú ê ú ê ú - - 1 1 1 1 1 1 ê ú ê ú ê ú - - - ê ú ê ú ê ú 1 1 1 1 1 1 ë û ë û ë û Iterative Algorithm WHT8 = (WHT2 I4)(I2 WHT2 I2)(I4 WHT2)
WHT Algorithms • Recursive • Iterative • General
n + ··· + n n + ··· + n i +1 t 1 i - 1 WHT Implementation • Definition/formula • N=N1* N2NtNi=2ni • x=WHTN*x x =(x(b),x(b+s),…x(b+(M-1)s)) • Implementation(nested loop) R=N; S=1; for i=t,…,1 R=R/Ni forj=0,…,R-1 for k=0,…,S-1 S=S* Ni; M b,s t Õ ) Ä Ä ( I I WHT WHT = n n 2 2 2 2 i i = 1
9 4 4 3 4 2 1 1 3 3 4 1 2 1 1 1 2 2 1 1 1 1 4 1 1 1 1 1 1 2 2 1 1 1 1 Partition Trees Left Recursive Right Recursive Balanced Iterative
Ordered Partitions • There is a 1-1 mapping from ordered partitions of n onto (n-1)-bit binary numbers. • There are 2n-1 ordered partitions of n. 162 = 1 0 1 0 0 0 1 0 1|1 1|1 1 1 1|1 1 1+2+4+2 = 9
3 2 1 1 1 Enumerating Partition Trees 00 01 01 3 3 3 2 1 2 1 1 1 10 10 11 3 3 1 2 1 1 1
Search Space • Optimization of the WHT becomes a search, over the space of partition trees, for the fastest algorithm. • The number of trees:
Size of Search Space • Let T(z) be the generating function for Tn Tn = (n/n3/2), where =4+8 6.8284 • Restricting to binary trees Tn = (5n/n3/2)
WHT PackagePüschel & Johnson (ICASSP ’00) • Allows easy implementation of any of the possible WHT algorithms • Partition tree representation W(n)=small[n] | split[W(n1),…W(nt)] • Tools • Measure runtime of any algorithm • Measure hardware events • Search for good implementation • Dynamic programming • Evolutionary algorithm
Histogram (n = 16, 10,000 samples) • Wide range in performance despite equal number of arithmetic operations (n2n flops) • Pentium III consumes more run time (more pipeline stages) • Ultra SPARC II spans a larger range
Operation Count Theorem. Let WN be a WHT algorithm of size N. Then the number of floating point operations (flops) used by WN is Nlg(N). Proof. By induction.
Instruction Count Model • A(n) = number of calls to WHT procedure • = number of instructions outside loops Al(n) = Number of calls to base case of size l • l = number of instructions in base case of size l • Li = number of iterations of outer (i=1), middle (i=2), and • outer (i=3) loop • i = number of instructions in outer (i=1), middle (i=2), and • outer (i=3) loop body
Small[1] .file "s_1.c" .version "01.01" gcc2_compiled.: .text .align 4 .globl apply_small1 .type apply_small1,@function apply_small1: movl 8(%esp),%edx //load stride S to EDX movl 12(%esp),%eax //load x array's base address to EAX fldl (%eax) // st(0)=R7=x[0] fldl (%eax,%edx,8) //st(0)=R6=x[S] fld %st(1) //st(0)=R5=x[0] fadd %st(1),%st // R5=x[0]+x[S] fxch %st(2) //st(0)=R5=x[0],s(2)=R7=x[0]+x[S] fsubp %st,%st(1) //st(0)=R6=x[S]-x[0] ????? fxch %st(1) //st(0)=R6=x[0]+x[S],st(1)=R7=x[S]-x[0] fstpl (%eax) //store x[0]=x[0]+x[S] fstpl (%eax,%edx,8) //store x[0]=x[0]-x[S] ret
Histogram using Instruction Model (P3) l = 12, l = 34, and l = 106 = 27 1 = 18, 2 = 18, and 1 = 20
Dynamic Programming n Cost( ), min n1+… + nt= n … T T nt n1 where Tn is the optimial tree of size n. This depends on the assumption that Cost only depends on the size of a tree and not where it is located. (true for IC, but false for runtime). For IC, the optimal tree is iterative with appropriate leaves. For runtime, DP is a good heuristic (used with binary trees).
Optimal Formulas UltraSPARC [1], [2], [3], [4], [5], [6] [[3],[4]] [[4],[4]] [[4],[5]] [[5],[5]] [[5],[6]] [[4],[[4],[4]]] [[[4],[5]],[4]] [[4],[[5],[5]]] [[[5],[5]],[5]] [[[5],[5]],[6]] [[4],[[[4],[5]],[4]]] [[4],[[4],[[5],[5]]]] [[4],[[[5],[5]],[5]]] [[5],[[[5],[5]],[5]]] Pentium [1], [2], [3], [4], [5], [6] [7] [[4],[4]] [[5],[4]] [[5],[5]] [[5],[6]] [[2],[[5],[5]]] [[2],[[5],[6]]] [[2],[[2],[[5],[5]]]] [[2],[[2],[[5],[6]]]] [[2],[[2],[[2],[[5],[5]]]]] [[2],[[2],[[2],[[5],[6]]]]] [[2],[[2],[[2],[[2],[[5],[5]]]]]] [[2],[[2],[[2],[[2],[[5],[6]]]]]] [[2],[[2],[[2],[[2],[[2],[[5],[5]]]]]]]
Different Strides • Dynamic programming assumption is not true. Execution time depends on stride.