This review explores the performance of several SIMD matrix-multiply implementations from Lab 1: a baseline using a naïve transpose with fused multiply-add (FMA), a tiled variant, and a multithreaded variant. The results are analyzed against the specifications and performance measurements of two test machines.
CS295: Modern Systems
Lab 1 Review
Sang-Woo Jun, Spring 2019
Baseline SIMD Implementation
• Naïve transpose + fused multiply-add (FMA)
[Figure: A × Bᵀ = C, computed with FMA steps followed by a non-SIMD add]
Baseline Tiled SIMD Implementation
• Naïve transpose + temporary C of size 8N to delay the non-SIMD addition
• Fixed tile size (64 elements)
• Does not optimize for cache size or for the existence of L2+ caches
[Figure: tiled A × Bᵀ multiplication; FMA steps with a delayed non-SIMD add]
Multithreaded Implementation
• Naïve, single-threaded transpose + temporary C of size 8N to delay the non-SIMD addition
• Round-robin assignment of row blocks to threads
[Figure: with 2 threads, row blocks of A and C alternate between Thread 0 and Thread 1]
Machine Specs
• Machine 1: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz, 6 cores / 12 threads, 32 GB DRAM (4 DDR4 DIMMs)
• Machine 2: Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz, 2 cores / 4 threads, 8 GB DRAM (1 DDR4 DIMM)
Results on Machine 2
[Figure: performance normalized against the naïve implementation]
What happened here?
Two Different Ways to Do Blocking
[Figure: two options for partitioning the A × Bᵀ = C multiplication across threads; Option 1 gives each thread a working set of N*N/Threads elements, which doesn't fit in cache, Option 2 one of N/Threads]