Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors
Shuo Li
Financial Services Engineering, Software and Services Group, Intel Corporation
Agenda • A Tale of Two Architectures • Transcendental Functions - Step 5 Lab part 1/3 • Thread Affinity - Step 5 Lab part 2/3 • Prefetch and Streaming Store • Data Blocking - Step 5 Lab part 3/3 • Summary
A Tale of Two Architectures
[figure: multicore Intel® Xeon® processor alongside the manycore Intel® Xeon Phi™ coprocessor]
Extended Math Unit • Fast single-precision approximations of transcendental functions using a hardware lookup table • Minimax quadratic polynomial approximation • Full effective 23-bit mantissa accuracy • 1-2 cycles of throughput for the 4 elementary functions • Other functions that build directly on them benefit as well
Use EMU Functions in Finance Algorithms • The challenge in using exp2() and log2(): exp() and log() are widely used in finance algorithms, and their base is e (= 2.718…), not 2 • The change-of-base formula: exp(x) = exp2(x*M_LOG2E), log(x) = log2(x)*M_LN2 • M_LOG2E and M_LN2 are defined in math.h • Cost: one multiplication, plus its effect on result accuracy • In general it works for any base b: expb(x) = exp2(x*log2b), logb(x) = log2(x)*logb2 • Manage the cost of conversion: absorb the multiply into other constant calculations outside the loop, and always convert from other bases to base 2
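To make the change-of-base rule concrete before the full Black-Scholes example, here is a minimal sketch (not from the original deck); the my_* wrapper names are hypothetical:

#include <math.h>   /* M_LOG2E, M_LN2, exp2f, log2f */

/* exp(x) = 2^(x * log2(e)): one extra multiply in front of the fast EMU exp2 */
static inline float my_expf(float x) { return exp2f(x * (float)M_LOG2E); }

/* log(x) = log2(x) * ln(2): one extra multiply after the fast EMU log2 */
static inline float my_logf(float x) { return log2f(x) * (float)M_LN2; }

/* general base b: exp_b(x) = 2^(x * log2(b)), log_b(x) = log2(x) / log2(b) */
static inline float my_expbf(float x, float b) { return exp2f(x * log2f(b)); }
static inline float my_logbf(float x, float b) { return log2f(x) / log2f(b); }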
Black-Scholes can use EMU Functions

Before, with base-e expf()/logf():

const float R   = 0.02f;
const float V   = 0.30f;
const float RVV = (R + 0.5f * V * V) / V;

for (int opt = 0; opt < OPT_N; opt++)
{
    float T = OptionYears[opt];
    float X = OptionStrike[opt];
    float S = StockPrice[opt];
    float sqrtT = sqrtf(T);
    float d1 = logf(S/X)/(V*sqrtT) + RVV*sqrtT;
    float d2 = d1 - V * sqrtT;
    CNDD1 = CNDF(d1);
    CNDD2 = CNDF(d2);
    float expRT   = X * expf(-R * T);
    float CallVal = S * CNDD1 - expRT * CNDD2;
    CallResult[opt] = CallVal;
    PutResult[opt]  = CallVal + expRT - S;
}

After, with the EMU-backed exp2f()/log2f() and the base conversion folded into loop-invariant constants:

const float LN2_V  = M_LN2 * (1/V);
const float RLOG2E = -R * M_LOG2E;

for (int opt = 0; opt < OPT_N; opt++)
{
    float T = OptionYears[opt];
    float X = OptionStrike[opt];
    float S = StockPrice[opt];
    float rsqrtT = 1/sqrtf(T);
    float sqrtT  = 1/rsqrtT;
    float d1 = log2f(S/X)*LN2_V*rsqrtT + RVV*sqrtT;
    float d2 = d1 - V * sqrtT;
    CNDD1 = CNDF(d1);
    CNDD2 = CNDF(d2);
    float expRT   = X * exp2f(RLOG2E * T);
    float CallVal = S * CNDD1 - expRT * CNDD2;
    CallResult[opt] = CallVal;
    PutResult[opt]  = CallVal + expRT - S;
}
Using Intel® Xeon Phi™ Coprocessors • Use an Intel® Xeon® E5-2670 platform with an Intel® Xeon Phi™ coprocessor • Make sure you can invoke the Intel C/C++ compiler: source /opt/intel/Compiler/2013.2.146/composerxe/pkg_bin/compilervars.sh intel64 • Try icpc -V to print out the Intel® C/C++ compiler banner • Build the native Intel® Xeon Phi™ coprocessor application: change the Makefile and use -mmic in lieu of -xAVX • Copy the program from host to coprocessor • Find out the host name using hostname (suppose it returns esg014) • Copy the executable using scp ./MonteCarlo esg014-mic0: • Establish an execution environment: ssh esg014-mic0 • Set the environment variable: export LD_LIBRARY_PATH=. • Optionally: export KMP_AFFINITY="compact,granularity=fine" • Invoke the program: ./MonteCarlo
Step 5 Transcendental Functions • Use the transcendental functions in the EMU • The inner loop calls expf(MuByT + VBySqrtT * random[pos]); • Call exp2f instead and adjust its argument by a factor of M_LOG2E • Fold that multiplication into the per-option constants MuByT and VBySqrtT, as in the sketch below: float VBySqrtT = VLOG2E * sqrtf(T[opt]); float MuByT = RVVLOG2E * T[opt];
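A rough sketch of the change, assuming VLOG2E and RVVLOG2E are the original per-option factors pre-multiplied by M_LOG2E (the surrounding payoff code is elided because the deck does not show it):

/* Step 4 (base e): the innermost loop evaluates */
expf(MuByT + VBySqrtT * random[pos]);

/* Step 5 (base 2): scale the per-option constants by M_LOG2E once per option ... */
float VBySqrtT = VLOG2E   * sqrtf(T[opt]);
float MuByT    = RVVLOG2E * T[opt];

/* ... so the innermost loop can call the EMU-backed exp2f with the same argument expression */
exp2f(MuByT + VBySqrtT * random[pos]);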
More on Thread Affinity • Bind the worker threads to specific processor cores/threads • Minimizes thread migration and context switches • Improves data locality; reduces coherency traffic • Two components to the problem: how many worker threads to create, and how to bind worker threads to cores/threads • Two ways to specify thread affinity • Environment variables: OMP_NUM_THREADS, KMP_AFFINITY • C/C++ API: kmp_set_defaults("KMP_AFFINITY=compact"); omp_set_num_threads(244); • Add #include <omp.h> to your source file • Compile with -openmp • Uses libiomp5.so • A minimal example follows
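A minimal, self-contained sketch of the API route; the thread count of 244 and the compact type are just the example values above, and the kmp_* call is an Intel OpenMP extension from libiomp5:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* set affinity and thread count before the first parallel region */
    kmp_set_defaults("KMP_AFFINITY=compact");
    omp_set_num_threads(244);          /* e.g. 61 cores x 4 hardware threads */

    #pragma omp parallel
    {
        #pragma omp single
        printf("running with %d pinned threads\n", omp_get_num_threads());
    }
    return 0;
}

Build with the Intel compiler and -openmp (add -mmic for a native coprocessor binary).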
The Optimal Thread Number • Intel MIC maintains 4 hardware contexts per core • Round-robin execution policy • Requires 2 threads per core for decent performance • Financial algorithms take all 4 threads per core to peak • Intel Xeon processors optionally use Hyper-Threading • Execute-until-stall execution policy • Truly compute-intensive kernels peak with 1 thread per core • Finance algorithms like Hyper-Threading: 2 threads per core • For an OpenMP application on a platform with ncore cores • Host only: 2 x ncore (or 1 x ncore if Hyper-Threading is disabled) • MIC native: 4 x ncore • Offload: 4 x (ncore - 1), since the OpenMP runtime avoids the core the coprocessor OS runs on
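As a hedged sketch of those recommendations, __MIC__ (defined by the Intel compiler for native coprocessor builds) can select the count at compile time; PHYS_CORES is a hypothetical constant for the physical core count of the target:

#include <omp.h>

#define PHYS_CORES 61   /* hypothetical: physical core count of the target */

static void set_thread_count(void)
{
#ifdef __MIC__
    omp_set_num_threads(4 * PHYS_CORES);   /* native run: 4 hardware threads per core */
#else
    omp_set_num_threads(2 * PHYS_CORES);   /* host run: 2 per core with Hyper-Threading */
#endif
}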
Thread Affinity Choices • Intel® OpenMP* supports the following affinity types: • compact - assigns threads to consecutive hardware contexts on the same physical core, to benefit from the shared cache • scatter - assigns consecutive threads to different physical cores, to maximize access to memory • balanced - a blend of compact and scatter (currently only available for Intel® MIC Architecture) • You can also specify affinity modifiers • Explicit setting: KMP_AFFINITY set to granularity=fine,proclist="1-240",explicit • Affinity is particularly important if not all available threads are used • Affinity type is less important under full thread subscription
Thread Affinity in Monte Carlo • Monte Carlo can use all the threads available on a core • Enable Hyper-Threading for the Intel® Xeon® processor • Set -opt-threads-per-core=4 for the coprocessor build • Affinity type is less important when all cores are used • Argument for compact: maximize the benefit of sharing the random numbers • Argument for scatter: maximize the bandwidth to memory • Either way, you still have to set the thread affinity type • API call:

#ifdef _OPENMP
kmp_set_defaults("KMP_AFFINITY=compact,granularity=fine");
#endif

• Environment variables:

~ $ export LD_LIBRARY_PATH=.
~ $ export KMP_AFFINITY="scatter,granularity=fine"
~ $ ./MonteCarlo
Step 5 Thread Affinity • Add #include <omp.h> • Add the following lines before the very first #pragma omp:

#ifdef _OPENMP
kmp_set_defaults("KMP_AFFINITY=compact,granularity=fine");
#endif

• Add -opt-threads-per-core=4 to the Makefile
Prefetch on Intel Multicore and Manycore Platforms • Objective: move data from memory into the L1 or L2 cache in anticipation of CPU loads/stores • More important on the in-order Intel Xeon Phi coprocessor • Less important on the out-of-order Intel Xeon processor • Compiler prefetching is on by default for Intel® Xeon Phi™ coprocessors at -O2 and above • Compiler prefetching is not enabled by default on Intel® Xeon® processors • Use the option -opt-prefetch[=n], n = 1..4 • Use the compiler reporting options to see detailed per-loop prefetching diagnostics: -opt-report-phase hlo -opt-report 3
Automatic Prefetches • Loop prefetch • Compiler-generated prefetches target memory accesses in a future iteration of the loop • They target regular, predictable array and pointer accesses • Interaction with the hardware prefetcher • The Intel® Xeon Phi™ coprocessor has a hardware L2 prefetcher • If software prefetches are doing a good job, hardware prefetching does not kick in • References not prefetched by the compiler may get prefetched by the hardware prefetcher
Explicit Prefetch • Use intrinsics: _mm_prefetch((char *) &a[i], hint); see xmmintrin.h for possible hints (L1, L2, non-temporal, …) • But you have to specify the prefetch distance yourself • There are also gather/scatter prefetch intrinsics; see zmmintrin.h and the compiler user guide, e.g. _mm512_prefetch_i32gather_ps • Use a pragma / directive (easier): #pragma prefetch a [:hint[:distance]] • You specify what to prefetch, but can let the compiler figure out how far ahead to do it • Use compiler switches: -opt-prefetch-distance=n1[,n2] specifies how many iterations ahead the compiler-generated prefetches inside loops reach; n1 is the distance for prefetching from memory into L2, n2 from L2 into L1 • BlackScholes uses -opt-prefetch-distance=8,2
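A small illustrative sketch of both explicit routes, assuming a hypothetical prefetch distance PF_DIST; the reduction loop itself is not from the deck:

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

#define PF_DIST 64       /* assumed distance in elements; tune per platform */

float sum_with_intrinsic(const float *a, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        /* you choose the hint (here L1) and the distance; prefetching past the
           end of the array is harmless because prefetch instructions do not fault */
        _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}

float sum_with_pragma(const float *a, int n)
{
    float sum = 0.0f;
    /* name what to prefetch and let the compiler pick the distance */
    #pragma prefetch a
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}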
Streaming Store • Avoids read-for-ownership on certain memory write operations • Bypasses the prefetch associated with that memory read • Use #pragma vector nontemporal (v1, v2, …) to drop a hint to the compiler • Without streaming stores: 448 bytes read/written per iteration • With streaming stores: 320 bytes read/written per iteration • Relieves bandwidth pressure; improves cache utilization • -vec-report6 displays the compiler action:

bs_test_sp.c(215): (col. 4) remark: vectorization support: streaming store was generated for CallResult.
bs_test_sp.c(216): (col. 4) remark: vectorization support: streaming store was generated for PutResult.

The loop with the streaming-store hint:

for (int chunkBase = 0; chunkBase < OptPerThread; chunkBase += CHUNKSIZE)
{
    #pragma simd vectorlength(CHUNKSIZE)
    #pragma vector aligned
    #pragma vector nontemporal (CallResult, PutResult)
    for (int opt = chunkBase; opt < (chunkBase + CHUNKSIZE); opt++)
    {
        float CNDD1;
        float CNDD2;
        float CallVal = 0.0f, PutVal = 0.0f;
        float T = OptionYears[opt];
        float X = OptionStrike[opt];
        float S = StockPrice[opt];
        ……
        CallVal = S * CNDD1 - XexpRT * CNDD2;
        PutVal  = CallVal + XexpRT - S;
        CallResult[opt] = CallVal;
        PutResult[opt]  = PutVal;
    }
}
Data Blocking • Partition data into small blocks that fit in the L2 cache • Exploits data reuse in the application • Ensures the data remains in the cache across multiple uses • Using the data in cache removes the need to go to memory • A bandwidth-limited program may then execute at its FLOPS limit • Simple 1D case: data of size DATA_N is used by WORK_N work items across hundreds of threads; each thread handles a piece of work and has to traverse all the data • Without blocking: hundreds of threads pound on different areas of DATA_N, and the memory interconnect limits performance • With blocking: a cacheable BSIZE chunk of data is processed by all threads at a time; each chunk is read once and kept being reused until all threads are done with it

Without blocking:

#pragma omp parallel for
for (int wrk = 0; wrk < WORK_N; wrk++)
{
    initialize_the_work(wrk);
    for (int ind = 0; ind < DATA_N; ind++)
    {
        dataptr datavalue = read_data(ind);
        result = compute(datavalue);
        aggregate = combine(aggregate, result);
    }
    postprocess_work(aggregate);
}

With blocking:

for (int BBase = 0; BBase < DATA_N; BBase += BSIZE)
{
    #pragma omp parallel for
    for (int wrk = 0; wrk < WORK_N; wrk++)
    {
        initialize_the_work(wrk);
        for (int ind = BBase; ind < BBase + BSIZE; ind++)
        {
            dataptr datavalue = read_data(ind);
            result = compute(datavalue);
            aggregate[wrk] = combine(aggregate[wrk], result);
        }
        postprocess_work(aggregate[wrk]);
    }
}
Blocking in Monte Carlo European Options • Each thread runs Monte Carlo using all the random numbers • The random numbers are too big to fit in each thread's L2 • Random number size is RAND_N * sizeof(float) = 1 MB, i.e. 256K floats • Each thread's share of L2 is 512 KB / 4 = 128 KB; effectively 64 KB, or 16K floats • Without blocking: each thread makes an independent pass over the RAND_N data for all the options it runs; the interconnect is busy satisfying read requests from different threads at different points, and prefetched data from different points saturates the memory bandwidth • With blocking: the random numbers are partitioned into cacheable blocks; a block is brought into cache once its predecessor has been processed by all threads, and it remains in the caches of all threads until every thread has finished its pass; each thread repeatedly reuses the data in cache for all the options it needs to run

const int nblocks = RAND_N / BLOCKSIZE;
for (int block = 0; block < nblocks; ++block)
{
    vsRngGaussian(VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, BLOCKSIZE, random, 0.0f, 1.0f);
    #pragma omp parallel for
    for (int opt = 0; opt < OPT_N; opt++)
    {
        float VBySqrtT = VLOG2E * sqrtf(T[opt]);
        float MuByT    = RVVLOG2E * T[opt];
        float Sval = S[opt];
        float Xval = X[opt];
        float val = 0.0, val2 = 0.0;
        #pragma vector aligned
        #pragma simd reduction(+:val) reduction(+:val2)
        #pragma unroll(4)
        for (int pos = 0; pos < BLOCKSIZE; pos++)
        {
            … … …
            val  += callValue;
            val2 += callValue * callValue;
        }
        h_CallResult[opt]     += val;
        h_CallConfidence[opt] += val2;
    }
}
#pragma omp parallel for
for (int opt = 0; opt < OPT_N; opt++)
{
    const float val   = h_CallResult[opt];
    const float val2  = h_CallConfidence[opt];
    const float exprt = exp2f(-RLOG2E * T[opt]);
    h_CallResult[opt] = exprt * val * INV_RAND_N;
    … … …
    h_CallConfidence[opt] = (float)(exprt * stdDev * CONFIDENCE_DENOM);
}
Step 5 Data Blocking • 1D data blocking of the random numbers • L2: 512 KB per core, 128 KB per thread • Your application can use about 64 KB • BLOCKSIZE = 16*1024

• Add an initialization loop before the triple-nested loops:

#pragma omp parallel for
for (int opt = 0; opt < OPT_N; opt++)
{
    h_CallResult[opt]     = 0.0f;
    h_CallConfidence[opt] = 0.0f;
}

• Change the inner loop bound from RAND_N to BLOCKSIZE and wrap the existing code in a loop over blocks:

const int nblocks = RAND_N / BLOCKSIZE;
for (block = 0; block < nblocks; ++block)
{
    vsRngGaussian(VSL_METHOD_SGAUSSIAN_ICDF, Randomstream, BLOCKSIZE, random, 0.0f, 1.0f);
    <<<Existing Code>>>
    // Save intermediate results here
    h_CallResult[opt]     += val;
    h_CallConfidence[opt] += val2;
}

• Move the loop that calculates per-option data from the middle loop to outside the loop:

#pragma omp parallel for
for (int opt = 0; opt < OPT_N; opt++)
{
    const float val   = h_CallResult[opt];
    const float val2  = h_CallConfidence[opt];
    const float exprt = exp2f(-RLOG2E * T[opt]);
    h_CallResult[opt] = exprt * val * INV_RAND_N;
    const float stdDev = sqrtf((F_RAND_N * val2 - val * val) * STDDEV_DENOM);
    h_CallConfidence[opt] = (float)(exprt * stdDev * CONFIDENCE_DENOM);
}
Run Your Program on the Intel® Xeon Phi™ Coprocessor • Build the native Intel® Xeon Phi™ coprocessor application • Change the Makefile and use -mmic in lieu of -xAVX • Copy the program from host to coprocessor • Find out the host name using hostname (suppose it returns esg014) • Copy the executable using scp MonteCarlo esg014-mic0: • Establish an execution environment: ssh esg014-mic0 • Set the environment variable: export LD_LIBRARY_PATH=. • Invoke the program: ./MonteCarlo
Summary • Base-2 exponential and logarithmic functions are always faster than other bases on Intel multicore and manycore platforms • Set thread affinity when you use OpenMP* • Let the hardware prefetcher work for you; fine-tune your loops with prefetch pragmas/directives • Optimize your data access and rearrange your computation around the data in cache