This presentation compares the performance of FFTW 2.1.5 and FFTW 3.x on the SiCortex architecture, including optimizations and benchmarks. It also discusses how FFTW's serial and parallel speed might be improved and looks at alternative algorithms.
FFTW and the SiCortex Architecture • Po-Ru Loh • MIT 18.337 • May 8, 2008
The email that launched a thousand processors • “Part of my work here at SiCortex is to work with the FFT libraries we provide to our customers.... When I do a comparison of performance of the serial version of FFTW 2.1.5 and 3.2 alpha, FFTW 3.2 alpha is significantly slower. • Both codes are compiled with the same compiler flags. I am running the 64 bit version: • CC=scpathcc LD=scld AR=scar RANLIB=scranlib F77=scpathf95 • CFLAGS="-g -O3 -OPT:Ofast -mips64" MPILIBS="-lscmpi" • ./configure --build=x86_64-pc-linux-gnu --host=mips64el-gentoo-linux-gnu --enable-fma • FFTW 3.2alpha • bench -opatient -s icf2048x2048 • Problem: icf2048x2048, setup: 3.79 s, time: 4.21 s, ``mflops'': 109.7 • FFTW 2.1.5 • Please wait (and dream of faster computers). • SPEED TEST: 2048x2048, FFTW_FORWARD, in place, specific • time for one fft: 2.914490 s (694.868565 ns/point) • "mflops" = 5 (N log2 N) / (t in microseconds) = 158.303319 • One thing I should point out is that FFTW 2.1.5 has been modified from its original stock version. It has had some optimizations based on the work by Franchetti and his paper "FFT Algorithms for Multiply-Add Architectures." From my understanding these optimizations have been folded into the genFFT code generator for FFTW 3.*.* • Am I correct in my assumption that FFTW 3.2 should be significantly faster?”
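The "mflops" figure quoted in the FFTW 2.1.5 output is a nominal rate, 5 N log2 N divided by the time in microseconds, where N is the total number of complex points. As a quick check, the sketch below recomputes both numbers from the email; the function name `fftw_mflops` is made up for illustration and is not part of FFTW.

```python
import math

def fftw_mflops(n_points, seconds):
    """Nominal "mflops" as defined in the FFTW 2.1.5 output:
    5 * N * log2(N) / (time in microseconds)."""
    microseconds = seconds * 1e6
    return 5.0 * n_points * math.log2(n_points) / microseconds

n = 2048 * 2048  # total points in the 2048x2048 complex transform

# FFTW 2.1.5: 2.914490 s per transform, as reported in the email
print("FFTW 2.1.5   :", round(fftw_mflops(n, 2.914490), 3))

# FFTW 3.2alpha: 4.21 s per transform (the reported time is rounded,
# so this lands close to, not exactly on, the quoted 109.7)
print("FFTW 3.2alpha:", round(fftw_mflops(n, 4.21), 1))
```

Because the flop count is a convention rather than a measurement, the figure is useful mainly for comparing runs of the same transform size.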
Where to start? • Maybe try benchmarking FFTW for myself • But I'm too lazy to write benchmarking code, especially if it already exists somewhere • Unfortunately, that “somewhere” happens to be inside the FFTW package • … and it has its own dependencies • … so I can't simply grab a file and link to the FFTW library on our SC648… ugh • Fine, I'll just download the whole package • … compile the whole thing (with the default config, since I don't know how to do anything smarter) • … and then hijack the last link step to relink against the "real" SC-optimized library
Some surprises • “The numbers you sent previously (for a 2048x2048 complex transform) were: • fftw-2.1.5: ~155 mflops • fftw-3.2alpha: ~110 mflops • My initial tests didn't find a discrepancy nearly that big: • fftw-2.1.5: ~140 mflops • fftw-3.1.2: ~120 mflops • fftw-3.2alpha3: ~130 mflops • Of course, these results are simply using the downloaded version of the libraries without the SiCortex tweaks you mentioned before (and also with gcc and not PathScale).”
More surprises • “Strangely, I then tried re-linking fftw_test and bench using the precompiled fftw libraries that came on the machine and got a substantial slowdown! • /usr/lib/libsfftw.a [2.1.5]: ~45 mflops • /usr/lib/libfftw3.la [3.2alpha]: ~90 mflops • One other thing I noticed was that the configure script for fftw3 complained about not finding a hardware cycle counter… • But fftw2 didn't produce this warning -- perhaps it doesn't use it. • Does a cycle counter actually exist (and do I just need to tell configure where it is)?”
Meanwhile, let's play with FFTW and our machine • In the end, the algorithms make the difference, but we need to know where to look • Previously, I thought about algorithms in terms of O(n log n) flops • Need to get a feel for what's really important
How fast is FFTW anyway? Is there any hope of doing better? • Grab a competitor from “Jörg's useful and ugly FFT page” • 2-dim FFT modernized and cleaned up by Stefan Gustavson • Simple but decently optimized radix-8-4-2 transformation that does rows, then cols • Output time taken for just rows, then total time taken • ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out • Enter number of rows: 1024 • Enter number of cols: 1024 • Time elapsed: 2.040359 sec • Time elapsed: 4.149302 sec • ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out • Enter number of rows: 2048 • Enter number of cols: 2048 • Time elapsed: 8.482921 sec • Time elapsed: 17.633655 sec • ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out • Enter number of rows: 4096 • Enter number of cols: 4096 • Time elapsed: 37.851723 sec • Time elapsed: 80.187843 sec
How fast is FFTW anyway? Is there any hope of doing better? • How about this -O3 business? • ploh@sc1-m3n6 ~/kube-gustavson $ gcc -c timed-kube-gustavson-fft.c -O3 • ploh@sc1-m3n6 ~/kube-gustavson $ gcc test_kube.c timed-kube-gustavson-fft.o -lm • ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out • Enter number of rows: 1024 • Enter number of cols: 1024 • Time elapsed: 0.555099 sec • Time elapsed: 1.172668 sec • ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out • Enter number of rows: 2048 • Enter number of cols: 2048 • Time elapsed: 2.321096 sec • Time elapsed: 5.325778 sec • ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out • Enter number of rows: 4096 • Enter number of cols: 4096 • Time elapsed: 11.917786 sec • Time elapsed: 28.651167 sec
How fast is FFTW anyway? Is there any hope of doing better? • And how about SiCortex’s optimized math library? • ploh@sc1-m3n6 ~/kube-gustavson $ gcc test_kube.c timed-kube-gustavson-fft.o -O3 -lscm -lm • ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out • Enter number of rows: 1024 • Enter number of cols: 1024 • Time elapsed: 0.540417 sec • Time elapsed: 1.143589 sec • ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out • Enter number of rows: 2048 • Enter number of cols: 2048 • Time elapsed: 2.260734 sec • Time elapsed: 5.208607 sec • ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out • Enter number of rows: 4096 • Enter number of cols: 4096 • Time elapsed: 11.729611 sec • Time elapsed: 28.277497 sec
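To put the Kube-Gustavson timings on the same scale as the FFTW numbers, the total times for the 2048x2048 case can be converted with the same nominal 5 N log2 N convention. This is only a rough sketch: the helper name `nominal_mflops` is made up, and the convention counts nominal flops, not the operations this particular code performs.

```python
import math

def nominal_mflops(n_points, seconds):
    """Nominal "mflops": 5 * N * log2(N) / (time in microseconds)."""
    return 5.0 * n_points * math.log2(n_points) / (seconds * 1e6)

n = 2048 * 2048

# Total times (rows + cols) for 2048x2048, copied from the three runs above
total_seconds = {
    "default gcc flags": 17.633655,
    "gcc -O3": 5.325778,
    "gcc -O3 with -lscm": 5.208607,
}

rates = {label: nominal_mflops(n, t) for label, t in total_seconds.items()}
for label, rate in rates.items():
    print(f"{label:>20}: {rate:6.1f} nominal mflops")

# Speedup from turning on -O3
o3_speedup = total_seconds["default gcc flags"] / total_seconds["gcc -O3"]
print(f"-O3 speedup: {o3_speedup:.2f}x")
```

On these numbers the optimized build reaches somewhat under 90 nominal mflops, which is below the 110 to 158 range reported for FFTW in the emails for the same transform size.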
How fast is the machine anyway? How speedy can we hope to get? • 150 mflops sounds really slow compared to benchFFT graphs! • http://www.fftw.org/speed/CoreDuo-3.0GHz-icc64/ • Why: 500MHz MIPS processors • Capable of performing two instructions per cycle*, but still… just 500MHz • (On the other hand, the machine is cool. • Very cool: 1W per processor! • More like 3W counting interconnect and everything else needed to keep a processor happy, but still just 16 kW for 5832 procs.) • So if you want to run a serial FFT, grab a desktop -- even a five-year-old desktop.
Let's talk parallel • “Meanwhile, I've been trying to get a feel for what our SC648 has to offer in terms of relative speeds of communication and computation. (Mainly I wanted a clearer picture of where to look for parallel FFT speedup.) A couple numbers I got were a little surprising -- please comment if you have insight. • Point-to-point communication speed for MPI Isend/Irecv nears 2GB/sec as advertised for message sizes in the MB range. Latency is such that a 32KB message achieves half the max. • The above speeds seem to be mostly independent of the particular pair of processors which are trying to talk, except that in a few cases -- apparently when the processors belong to the same 6-CPU node -- top speed is only 1GB/sec. This is the opposite of what I (perhaps naively) expected -- any explanation?” • [Where's the Kautz graph???]
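The remark that a 32KB message reaches half the peak rate can be turned into a latency estimate. The sketch below assumes a simple linear cost model, time = t0 + n / B, which is not stated in the email; under that model the half-bandwidth message size equals t0 * B. It also reads "32KB" as 32768 bytes and "2GB/sec" as 2e9 bytes/sec, and all names are illustrative.

```python
def effective_rate(n_bytes, startup_s, peak_bytes_per_s):
    """Achieved bandwidth under the assumed model: time = startup + n / peak."""
    return n_bytes / (startup_s + n_bytes / peak_bytes_per_s)

def startup_from_half_size(n_half_bytes, peak_bytes_per_s):
    """At the half-bandwidth size, startup time equals transfer time n_half / peak."""
    return n_half_bytes / peak_bytes_per_s

PEAK = 2e9           # assumed: "nears 2GB/sec"
N_HALF = 32 * 1024   # assumed: "a 32KB message achieves half the max"

t0 = startup_from_half_size(N_HALF, PEAK)
print(f"implied per-message overhead: {t0 * 1e6:.3f} microseconds")

# Predicted rate at a few message sizes under the same model
for size in (4 * 1024, 32 * 1024, 1024 * 1024):
    rate = effective_rate(size, t0, PEAK)
    print(f"{size:>8} bytes: {rate / 1e9:.2f} GB/sec")
```

Under these assumptions the implied overhead is on the order of tens of microseconds per message, which suggests that transposes built from many small messages would be dominated by startup cost.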
Let's talk parallel • “Upon introducing network traffic by asking 128 processors to simultaneously perform 64 pairwise communications, speed drops to around 0.5GB/sec with substantial fluctuations. • For comparison, the speed of memcpy on a single CPU is as follows: • Size (bytes) Time (sec) Rate (MB/sec) • 256 0.000000 1152.369380 • 512 0.000000 1367.888343 • 1024 0.000001 1509.023977 • 2048 0.000001 1591.101956 • 4096 0.000003 1421.082913 • 8192 0.000006 1436.982715 • 16384 0.000010 1638.373220 • 32768 0.000097 337.867886 • 65536 0.000194 337.250067 • 131072 0.000413 317.622002 • 262144 0.001054 248.758781 • 524288 0.002281 229.862557 • 1048576 0.004371 239.917118 • 2097152 0.008626 243.109613 • The table above shows clear fall-offs when the L1 cache (32KB) and L2 cache (256KB) are exceeded. What really surprises me, however, is that according to the previous experiments, sending data to another processor is as fast as memcpy even in the best-case scenario (memcpy staying within L1), and is many times faster once the array being copied no longer fits in cache! (Don't the MPI operations also have to leave cache though? How can point-to-point copy be so much faster?)”
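A table like the one above can be produced by timing repeated copies of buffers of increasing size. The following is a rough Python stand-in for such a measurement, not the code used for the slide: it copies between two `bytearray` buffers with slice assignment, so interpreter overhead inflates the small-size timings compared with a C `memcpy` loop. The function name and repeat count are arbitrary choices.

```python
import time

def copy_rate(size_bytes, repeats=200):
    """Return (seconds per copy, MB/sec) for copying size_bytes between buffers."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    start = time.perf_counter()
    for _ in range(repeats):
        dst[:] = src  # whole-buffer copy into the existing destination
    elapsed = max(time.perf_counter() - start, 1e-12)  # guard against a zero reading
    per_copy = elapsed / repeats
    rate_mb = size_bytes / per_copy / 1e6  # MB taken as 10^6 bytes
    return per_copy, rate_mb

results = []
print("Size (bytes)  Time (sec)   Rate (MB/sec)")
size = 256
while size <= 2 * 1024 * 1024:
    per_copy, rate_mb = copy_rate(size)
    results.append((size, per_copy, rate_mb))
    print(f"{size:>12}  {per_copy:.6f}  {rate_mb:14.6f}")
    size *= 2
```

On a machine with a cache hierarchy, the rate column would be expected to drop as the buffers outgrow each cache level, as in the slide's table.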
A trip to Maynard • Sample output from FFTW 3.1.2: • ~3 sec planning, ~30 sec for the transform, ~3 mflops?! • But I got 100+ mflops on my unoptimized version! • Try rebuilding with a couple different config options… • Aha! Blame “--enable-long-double” • Still off by a factor of 2 from FFTW 2.1.5; the missing cycle.h probably accounts for some of that but probably not all of it
Back to algorithms: What plans does FFTW actually choose? • “I found out that turning on the "verbose" option (fftw_test -v for 2.x and bench -v5, say, for 3.x) outputs the plan. Based on a handful of test runs on power-of-2 transforms, FFTW 2.1.5 seems to prefer: • radix 16 and radix 8, for transform lengths up to about 2^15 • primarily radix 4 but finishing off with three radix 16 steps, for larger transforms. • There are also occasional radix 2 and 32 steps but these seem to be rare. • [In practice, FFTW 2 for 1D power-of-2 transforms seems to just try the various possible radices up to 64 -- nothing fancier than that. FFTW 3 tries a lot more things, including a radix-\sqrt{n} first step.] • Last week we weren't totally satisfied with the 258 "mflops" we were getting from Andy's optimized FFTW 2.1.5; however, the transform we were running was fairly large, and in fact Andy's FFTW 2 gets 500+ "mflops" on peak sizes (around 1024). Basically, there's a falloff in performance as soon as the L1 cache is exceeded and another one once we start reaching into main memory. All of the graphs on the benchFFT page show these falloffs; indeed, getting a peak "mflops" number slightly better than the clock rate is about the best anyone can do (on non-SIMD processors) -- congrats. • On the other hand, we still don't really have a working FFTW 3: without a cycle counter, the code currently does no timing whatsoever and instead chooses plans based on a heuristic. Hopefully we can get that straightened out soon, because at the moment its planner information is pretty useless to us! • Once that's fixed we'll see how its speed compares to FFTW 2. Either way, I'd say FFTW 2 is pretty well-optimized (although the library you're currently distributing with your machines is not!) and our focus should really be on FFTW 3. In particular, the algorithms that FFTW 2 uses for parallel transforms are just plain slow (I can elaborate on that if you wish). 
On the bright side, 3.2alpha already implements several improvements -- the first three or four possibilities that came to mind for me are all already there -- and despite being in alpha, it's in good enough shape to start playing with, once we have a cycle counter.”
What does parallel FFTW actually do? • FFTW 2: Transpose, 1D FFTs, transpose, 1D FFTs, transpose -- straight from the book • Choose radix as close to \sqrt{n} as possible • Example: 1D complex double forward transform of size 2^24 (with workspace) • "Parallel" algorithm on 1 proc: 25% slowdown • 2 procs: 1.5x speedup • … • 64 procs: 29.900830x • 128 procs: 67.845401x • 256 procs: 119.773536x (12 GFlops) • 512 procs: 143.338615x (15 GFlops) • FFTW 3.2alpha: Same general "six-step" algorithm, but with twists • Choose radix equal to np if possible • Skip local transposes -- let FFTW decide whether or not to do them • Consider doing transposes in stages to avoid all-to-all communication • Example (sans cycle counter!): • Problem: ocf16777216, setup: 18.27 s, time: 308.69 ms, ``mflops'': 6522 • Problem: ocf16777216, setup: 13.82 s, time: 158.16 ms, ``mflops'': 12730 • Problem: ocf16777216, setup: 17.21 s, time: 118.28 ms, ``mflops'': 17020
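The "six-step" structure named above can be sketched serially for a transform of size n = n1 * n2. The slide's list of steps leaves the twiddle-factor multiplication between the two rounds of 1D FFTs implicit; the sketch includes it. This is an illustration under stated assumptions, not FFTW's code: `six_step_fft` and `naive_dft` are made-up names, and a naive O(n^2) DFT stands in for the real 1D FFTs. In the parallel setting each transpose would be the communication step.

```python
import cmath

def naive_dft(x):
    """Direct O(n^2) forward DFT; stands in for a real 1D FFT routine."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transpose(rows):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*rows)]

def six_step_fft(x, n1, n2):
    """Forward DFT of length n1*n2 via transposes and short 1D transforms."""
    n = n1 * n2
    assert len(x) == n
    m = [x[r * n1:(r + 1) * n1] for r in range(n2)]  # view as n2 x n1, row-major
    m = transpose(m)                                 # step 1: m[j1][j2] = x[j1 + n1*j2]
    m = [naive_dft(row) for row in m]                # step 2: n1 transforms of length n2
    m = [[m[j1][k2] * cmath.exp(-2j * cmath.pi * j1 * k2 / n)  # step 3: twiddles
          for k2 in range(n2)] for j1 in range(n1)]
    m = transpose(m)                                 # step 4: now indexed m[k2][j1]
    m = [naive_dft(row) for row in m]                # step 5: n2 transforms of length n1
    m = transpose(m)                                 # step 6: now indexed m[k1][k2]
    return [v for row in m for v in row]             # output index k = k1*n2 + k2

# Check against the direct DFT on a small example
x = [complex(i % 5, -0.5 * i) for i in range(32)]
reference = naive_dft(x)
result = six_step_fft(x, 4, 8)
max_err = max(abs(a - b) for a, b in zip(result, reference))
print("max abs error vs direct DFT:", max_err)
```

Choosing n1 near \sqrt{n}, as the FFTW 2 bullet describes, keeps both rounds of 1D transforms about the same length, while choosing n1 equal to the number of processors, as in the FFTW 3.2alpha bullet, gives each processor one long transform in one of the rounds.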
Ideas for improvement • Serial • Micro level • Low-level vectorization for cache line efficiency and SIMD • Macro level • Investigate large-radix steps (and include them in tuning?) • Stop wasting time waiting for memory • Multithreading -- even on one processor -- to compute while waiting? • Prefetching? • Parallel • What's the best initial radix? • What's the best number of communication steps to use in a transpose? • Can we do FFTW-esque self-tuning? • Latency-hiding by reordering work • Do half the computation in row-col order, other half in col-row order so that processors always have work to do • Exploit the fact that the DMA Engine/Fabric Switch is a "processor" that takes care of sending and receiving messages • Load-balancing by having pairs of processors work together at each step?
How we know there’s more to do • Serial • The edge-of-cache cliff: http://www.fftw.org/speed/CoreDuo-3.0GHz-icc64/ • Parallel • HPCC has adopted FFTE, which claims it's faster on large transforms • Whether or not that's true, everything is slow in parallel • SiCortex talk: 174 GFlops for HPCC FFTE(?) on 5832 500MHz procs • University of Tsukuba Center for Computational Sciences poster (2006): 13 GFlops for FFTE on 32 3.0GHz procs • http://www.rccp.tsukuba.ac.jp/SC/sc2006/Posters/16_hpcs-ol.pdf • Intel Math Kernel Library: 25 GFlops on 64 dual-core 3.0GHz procs • Intel claims they’re winning on all fronts: • http://www.intel.com/cd/software/products/asmo-na/eng/266852.htm
Anyone want to help write a faster parallel FFT? • Let me know! • Joke of the day: You know you've been spending too much time thinking about parallel computing when… • You cite "multithreading between different class projects" as the reason for slow progress on each (and lament the fact that you have but one processor to apply to an embarrassingly parallel problem) • You start thinking about time wasted switching between tasks in terms of memory reads and writes, and it dawns on you that a cache-oblivious algorithm could help • You realize that LRU is a pretty good model for cramming (and wish you could achieve nanosecond latencies)