Getting Reproducible Results with Intel® MKL 11.0 Todd Rosenquist Technical Consulting Engineer Intel® Math Kernel Library
The agenda
• Reproducible results in Intel MKL
• The symptom
• The problem
• The reality
• The requirements
• A conditional solution
• A beginner’s guide
• Performance
• Further resources
• Try the feature in the recently released Intel® MKL 11.0
Ever seen something like this?
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678902222
…or this?
Intel® Xeon® Processor E5540:
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
Intel® Xeon® Processor E3-1275:
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
Why do results vary?
• Root cause for variations in results: with floating-point numbers, the order of computation matters!
• Double-precision example where $(a+b)+c \neq a+(b+c)$:
  • $2^{-63} + 1 + (-1) = 2^{-63}$ (infinitely precise result)
  • $(2^{-63} + 1) + (-1) = 0$ (correct IEEE double-precision result)
  • $2^{-63} + (1 + (-1)) = 2^{-63}$ (correct IEEE double-precision result)
Order matters when doing floating-point arithmetic.
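The example above can be checked directly in C. This is a minimal sketch, assuming IEEE 754 double arithmetic evaluated in double precision (as on x86-64 with SSE2) and no value-changing compiler flags such as -ffast-math. The function names are hypothetical, chosen only for illustration.

```c
/* Two groupings of the same three-term sum.
 * Hypothetical helper names; not part of Intel MKL. */

/* Computes (a + b) + c. */
double add_left_first(double a, double b, double c)
{
    double ab = a + b;   /* rounded once here */
    return ab + c;       /* rounded again here */
}

/* Computes a + (b + c). */
double add_right_first(double a, double b, double c)
{
    double bc = b + c;   /* rounded once here */
    return a + bc;       /* rounded again here */
}
```

Calling both functions with a = 0x1p-63 (the C99 hexadecimal literal for 2^-63), b = 1.0 and c = -1.0 gives 0 for the first grouping, because 2^-63 is lost when it is added to 1, and 2^-63 for the second grouping, because 1 and -1 cancel exactly first.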
Why does the order of operations change in Intel MKL? Many optimizations require a change in order of operations.
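One common way an optimization changes the order of operations is by splitting a sum into partial sums, for example one per thread or per vector lane. The sketch below is not Intel MKL code; it is a hypothetical serial illustration (the name chunked_sum is an assumption) of how the number of partial sums alone changes the grouping of the additions, again assuming IEEE 754 double arithmetic without -ffast-math.

```c
#include <assert.h>
#include <stddef.h>

/* Sums x[0..n-1] by first summing `chunks` contiguous blocks and then
 * adding the partial sums in block order. Hypothetical illustration of
 * how work splitting regroups additions; not Intel MKL code. */
double chunked_sum(const double *x, size_t n, size_t chunks)
{
    assert(chunks > 0);

    size_t base = n / chunks;    /* minimum block length */
    size_t extra = n % chunks;   /* first `extra` blocks get one more element */
    size_t start = 0;
    double total = 0.0;

    for (size_t c = 0; c < chunks; ++c) {
        size_t len = base + (c < extra ? 1 : 0);
        double partial = 0.0;
        for (size_t i = start; i < start + len; ++i)
            partial += x[i];     /* additions inside one block */
        total += partial;        /* combine partial sums */
        start += len;
    }
    return total;
}
```

For the input {1.0, 0x1p-53, 0x1p-53, 0x1p-53}, one chunk yields exactly 1.0, because each tiny term is rounded away when added to 1, while two chunks yield 1.0 + 0x1p-52, because two of the tiny terms are first added to each other. Repeating a call with the same chunk count always gives the same result, which is why a fixed thread count is one of the conditions for reproducibility.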
Why are reproducible results important for Intel MKL users?
• Technical/legacy: Software correctness is determined by comparison to previous ‘gold’ results.
• Debugging: When developing and debugging, a higher degree of run-to-run stability is required to find potential problems.
• Legal: Accreditation or approval of software might require exact reproduction of previously defined results.
• Customer perception: Developers may understand the technical issues with reproducibility but still require reproducible results, since end users or customers will be disconcerted by the inconsistencies.
Source: Email correspondence with Kai Diethelm of GNS. See his whitepaper: http://www.computer.org/cms/Computer.org/ComputingNow/homepage/2012/0312/W_CS_TheLimitsofReproducibilityinNumericalSimulation.pdf
Balancing Reproducibility and Performance: Conditional Numerical Reproducibility (CNR)
Goal: Achieve the best performance possible for cases that require reproducibility
Why “Conditional”?
In Intel MKL 11.0, reproducibility is currently available under certain conditions:
• Within a single operating system / architecture
  • Reproducibility only applies within the blue boxes on the slide (one per operating system / architecture combination), not between them
  • Reproducibility on all supported servers and workstations
  • No support yet for Intel® Xeon Phi™ coprocessors
• Within a particular version of Intel MKL
  • Results in version 11.0 update 1 may differ from results in version 11.0
• Reproducibility controls in Intel MKL only affect Intel MKL functions
Conditions for reproducibility
• Aligned input and output arrays in function calls
  • 16-byte alignment for the family of SSE instruction sets
  • 32-byte alignment for AVX
  • 64-byte alignment for future processors (choose this to be safe)
• Set the same number of computational threads for the library in each run
• Use the same Intel MKL parameters from run to run
  • Example: You cannot call a function in 3 blocks in one run and 4 blocks in the next
• Use the new functions & controls to ensure deterministic task scheduling and to control code paths
  • CNR controls must be set or called before any computational math functions in Intel MKL
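The alignment condition can be met with any allocator that guarantees 64-byte alignment. The sketch below uses the C11 standard function aligned_alloc as one portable option; the helper names and the CNR_ALIGNMENT macro are hypothetical. On toolchains without aligned_alloc (for example Microsoft Visual C++, which offers _aligned_malloc instead), or when linking against Intel MKL (which provides its own mkl_malloc), a different allocator would take its place.

```c
#include <stdint.h>
#include <stdlib.h>

/* 64 bytes is the "safe" choice from the list above.
 * Hypothetical macro name, for illustration only. */
#define CNR_ALIGNMENT 64

/* Allocates room for `count` doubles on a 64-byte boundary.
 * Returns NULL on failure; release with free().
 * Sketch only: no overflow check on count * sizeof(double). */
double *alloc_aligned_doubles(size_t count)
{
    size_t bytes = count * sizeof(double);

    /* aligned_alloc expects the size to be a multiple of the alignment,
     * so round up (and never request zero bytes). */
    size_t padded = (bytes + CNR_ALIGNMENT - 1) / CNR_ALIGNMENT * CNR_ALIGNMENT;
    if (padded == 0)
        padded = CNR_ALIGNMENT;

    return aligned_alloc(CNR_ALIGNMENT, padded);
}

/* Returns 1 if p sits on an `alignment`-byte boundary, 0 otherwise. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

A check such as is_aligned(ptr, 64) on every array passed to the library is a cheap way to confirm the condition holds before comparing results between runs.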
Example – COMPATIBLE
For reproducible results on Intel and Intel-compatible CPUs supporting SSE2 instructions or later, use either:
• function call mkl_cbwr_set(MKL_CBWR_COMPATIBLE)
• or environment variable set MKL_CBWR=COMPATIBLE
Note: MKL_CBWR_COMPATIBLE is provided because Intel and Intel-compatible CPUs have approximation instructions (e.g., rcpps/rsqrtps) that may return different results. This option ensures that Intel MKL uses an SSE2-only code path that does not contain any of these instructions.
Example – SSE2
For the same results on every Intel processor that supports SSE2 instructions or later, use either:
• function call mkl_cbwr_set(MKL_CBWR_SSE2)
• or environment variable set MKL_CBWR=SSE2
Note: on non-Intel processors the results may differ, since only the MKL_CBWR_COMPATIBLE path is supported there.
Example – SSE4.2
For the same results on every Intel processor that supports SSE4.2 instructions or later, use either:
• function call mkl_cbwr_set(MKL_CBWR_SSE4_2)
• or environment variable set MKL_CBWR=SSE4_2
Note: on non-Intel processors the results may differ, since only the MKL_CBWR_COMPATIBLE path is supported there.
Example – deterministic task scheduling
For consistent run-to-run results on any one supported processor without fixing the code branch, use either:
• function call mkl_cbwr_set(MKL_CBWR_AUTO)
• or environment variable set MKL_CBWR=AUTO
Note:
• This will ensure deterministic task scheduling
• It will not give you reproducibility from processor to processor
Example – Find out the best performing option from a pool of processors
For the best option given a pool of computing resources in a grid setting, you may launch a simple program on each processor as follows:

    #include <stdio.h>
    #include <mkl.h>

    int main(void)
    {
        int my_cbwr_branch;

        /* Find the best branch available on this processor */
        my_cbwr_branch = mkl_cbwr_get_auto_branch();

        /* mkl_cbwr_set returns MKL_CBWR_SUCCESS when the branch was set */
        if (mkl_cbwr_set(my_cbwr_branch) != MKL_CBWR_SUCCESS) {
            printf("Error in setting branch. Aborting...\n");
            return 1;   /* 1 is not one of the branch values listed below */
        }

        /* Report the branch through the exit code */
        return my_cbwr_branch;
    }

Examine all results and use mkl_cbwr_set(<minimum_result>).
The full list of options:
    COMPATIBLE   3
    SSE2         4
    SSE3         5
    SSSE3        6
    SSE4_1       7
    SSE4_2       8
    AVX          9
    AVX2        10
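The "examine all results and use the minimum" step can be sketched in plain C. The numeric codes below are copied from the option list above; the enum names and the function lowest_common_branch are hypothetical and are not part of the Intel MKL API. The sketch assumes every node reported a valid code from that list.

```c
#include <stddef.h>

/* Branch codes as given in the option list above.
 * Hypothetical local names; Intel MKL defines its own MKL_CBWR_* constants. */
enum cbwr_code {
    CODE_COMPATIBLE = 3,
    CODE_SSE2       = 4,
    CODE_SSE3       = 5,
    CODE_SSSE3      = 6,
    CODE_SSE4_1     = 7,
    CODE_SSE4_2     = 8,
    CODE_AVX        = 9,
    CODE_AVX2       = 10
};

/* Returns the smallest code among the values reported by the nodes,
 * i.e. the most capable branch that every node can run.
 * Returns -1 if no values were supplied. */
int lowest_common_branch(const int *reported, size_t n)
{
    if (reported == NULL || n == 0)
        return -1;

    int lowest = reported[0];
    for (size_t i = 1; i < n; ++i) {
        if (reported[i] < lowest)
            lowest = reported[i];
    }
    return lowest;
}
```

For a pool containing one processor that reports 8 (SSE4_2) and one that reports 9 (AVX), the function returns 8, so every node would then be configured with the SSE4_2 branch.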
Change this sort of inconsistency…
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678902222
…with these steps:
• Align memory
• Constant # of threads
• Turn on CNR with either mkl_cbwr_set(MKL_CBWR_AUTO) or set MKL_CBWR=AUTO
…to get this:
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
Change this inconsistency in results…
Intel® Xeon® Processor E5540:
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
Intel® Xeon® Processor E3-1275:
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
…to get reproducible results?
• Align memory
• Constant # of threads
• Turn on CNR with either mkl_cbwr_set(MKL_CBWR_SSE4_2) or set MKL_CBWR=SSE4_2
Intel® Xeon® Processor E5540 (supporting SSE4.2 instructions):
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
Intel® Xeon® Processor E3-1275 (supporting AVX instructions):
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
What’s next? https://softwareproductsurvey.intel.com/survey/150072/1afd/
Further resources on conditional numerical reproducibility
• Intel MKL Documentation (online and in the product)
  • Intel MKL User’s Guide
  • Reference Manual
• Knowledgebase articles on CNR
• Support
  • Intel MKL user forum
  • Intel Premier support
• Feedback
  • Survey: https://softwareproductsurvey.intel.com/survey/150072/1afd/
New optimizations and features
• Support for the Intel® Xeon Phi™ coprocessor based on the Intel® Many Integrated Core Architecture (Intel® MIC Architecture) on Linux* only
• Optimizations using the new Intel® Advanced Vector Extensions 2 (AVX2), including the new FMA3 instructions
• FFTs: Completed support for real-to-complex transforms with sizes given by 64-bit integers
• Local threading control function
  • mkl_set_num_threads_local()
• Sept 18th, 2012 9:00AM: Interesting ties between tools and new hardware features: How Intel Tools support the many new features in processors and coprocessors
• Oct 2nd, 2012 9:00AM: Pointer Checker: Catch Out-of-Bounds Memory Accesses Easily!
• Oct 16th, 2012 9:00AM: How Intel® Parallel Studio XE is used to improve the HMMER application
• Oct 30th, 2012 9:00AM: Using the Intel® Math Kernel Library 11.0 and Compiler to Obtain Run-to-Run Reproducible Results
• Oct 9th, 2012 9:00AM: Achieving better parallel performance of Fortran programs with Intel® VTune™ Amplifier XE profiling
• Oct 23rd, 2012 9:00AM: Three common Fortran mistakes you can avoid by using Intel® Inspector XE
• Nov 6th, 2012 9:00AM: Avoid common parallelization mistakes with the help of Intel® Advisor XE
• Dec 4th, 2012 9:00AM: Fortran 2008 Standard Parallel Programming Features in Intel® Fortran Composer XE*
http://software.intel.com/en-us/fall-webinar-series-psxe-and-fsxe
Summary
Conditional Numerical Reproducibility (CNR) provides:
• reproducible results from run to run
• reproducible results from processor to processor
• the ability to balance reproducibility requirements with great performance
Evaluate CNR in the following:
• Intel® Math Kernel Library 11.0
• Intel® Composer XE 2013
• Intel® Parallel Studio XE 2013
• Intel® Cluster Studio XE 2013
Provide feedback: https://softwareproductsurvey.intel.com/survey/150072/1afd/