This review explores the performance of several SIMD matrix-multiply implementations from Lab 1: a baseline using a naïve transpose with fused multiply-add (FMA), a tiled variant, and a multithreaded variant. The results are analyzed against the specifications and performance measurements of two test machines.
CS295: Modern Systems
Lab 1 Review
Sang-Woo Jun, Spring 2019
Baseline SIMD Implementation
• Naïve transpose + fused multiply-add (FMA)
[Figure: A × Bᵀ = C, computed with FMA steps followed by a non-SIMD add]
Baseline Tiled SIMD Implementation
• Naïve transpose + temporary C of size 8N to delay the non-SIMD addition
• Fixed tile size (64 elements)
• Does not optimize for cache size or for the existence of L2+ caches
[Figure: tiled A × Bᵀ multiplication; FMA steps with a delayed non-SIMD add]
Multithreaded Implementation
• Naïve, single-threaded transpose + temporary C of size 8N to delay the non-SIMD addition
• Round-robin assignment of row blocks to threads
[Figure: with 2 threads, row blocks of A and C alternate between Thread 0 and Thread 1]
Machine Specs
• Machine 1: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz, 6 cores / 12 threads, 32 GB DRAM (4 DDR4 DIMMs)
• Machine 2: Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz, 2 cores / 4 threads, 8 GB DRAM (1 DDR4 DIMM)
Results on Machine 2
[Figure: performance normalized against the naïve implementation]
What happened here?
Two Different Ways to Do Blocking
[Figure: two options for partitioning the A × Bᵀ = C multiplication across threads; Option 1 gives each thread a working set of N*N/Threads elements, which doesn't fit in cache, Option 2 one of N/Threads]