
Getting Reproducible Results with Intel® MKL 11.0



Presentation Transcript


  1. Getting Reproducible Results with Intel® MKL 11.0 Todd Rosenquist Technical Consulting Engineer Intel® Math Kernel Library

  2. The agenda • Reproducible results in Intel MKL • The symptom • The problem • The reality • The requirements • A conditional solution • A beginner’s guide • Performance • Further resources • Try the feature in the recently released Intel® MKL 11.0

  3. Ever seen something like this? • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678902222 • C:\Users\me>test.exe • 4.012345678902222 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678902222

  4. …or this? • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678902222 • C:\Users\me>test.exe • 4.012345678902222 • C:\Users\me>test.exe • 4.012345678902222 • C:\Users\me>test.exe • 4.012345678902222 • C:\Users\me>test.exe • 4.012345678902222 Intel® Xeon® Processor E5540 Intel® Xeon® Processor E3-1275

  5. Why do results vary? • Root cause for variations in results • floating-point numbers → order of computation matters! • double precision example where (a+b)+c ≠ a+(b+c) • 2^-63 + 1 + -1 = 2^-63 (infinitely precise result) • (2^-63 + 1) + -1 ≈ 0 (correct IEEE double precision result) • 2^-63 + (1 + -1) ≈ 2^-63 (correct IEEE double precision result) • Order matters when doing floating point arithmetic.
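The example on this slide can be checked directly. The following is a minimal C sketch, not part of the original presentation; the function names are made up for illustration, and `volatile` is used only to make sure each intermediate sum is rounded to double precision before the next addition.

```c
/* Sum three doubles as (a + b) + c.
   The volatile temporary forces the intermediate sum to be stored,
   and therefore rounded, as a double. */
double sum_left_first(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* Sum the same three doubles as a + (b + c). */
double sum_right_first(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With a = 0x1p-63 (the C hexadecimal literal for 2^-63), b = 1.0 and c = -1.0, `sum_left_first` returns 0 because 1 + 2^-63 rounds back to 1, while `sum_right_first` returns 2^-63.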

  6. Why does the order of operations change in Intel MKL? Many optimizations require a change in order of operations.
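As an illustration of how an optimization can change the order of operations, here is a small C sketch that is not from the presentation. It assumes a simple model in which a kernel splits an array into contiguous blocks (as a blocked or multithreaded implementation might), sums each block, then combines the partial sums. The function name `blocked_sum` is hypothetical, and the caller is assumed to pass at least one block.

```c
#include <stddef.h>

/* Sum x[0..n) by splitting it into `blocks` contiguous chunks, summing
   each chunk on its own, then adding the partial sums in chunk order.
   Assumes blocks >= 1. The volatile accumulators force every step to be
   rounded to double precision. */
double blocked_sum(const double *x, size_t n, size_t blocks)
{
    volatile double total = 0.0;
    for (size_t b = 0; b < blocks; ++b) {
        size_t begin = b * n / blocks;        /* first index of chunk b */
        size_t end = (b + 1) * n / blocks;    /* one past its last index */
        volatile double partial = 0.0;
        for (size_t i = begin; i < end; ++i)
            partial += x[i];
        total += partial;
    }
    return total;
}
```

For the input {1.0, 2^-53, 2^-53, -1.0}, one block gives 0 and two blocks give 2^-53. Repeating the call with the same block count always gives the same bits, which is why a fixed thread count and fixed blocking matter for reproducibility.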

  7. Why are reproducible results important for Intel MKL users? • Technical/legacy: Software correctness is determined by comparison to previous ‘gold’ results. • Debugging: When developing and debugging, a higher degree of run-to-run stability is required to find potential problems. • Legal: Accreditation or approval of software might require exact reproduction of previously defined results. • Customer perception: Developers may understand the technical issues with reproducibility but still require reproducible results since end users or customers will be disconcerted by the inconsistencies. • Source: Email correspondence with Kai Diethelm of GNS. See his whitepaper: http://www.computer.org/cms/Computer.org/ComputingNow/homepage/2012/0312/W_CS_TheLimitsofReproducibilityinNumericalSimulation.pdf

  8. Balancing Reproducibility and Performance: Conditional Numerical Reproducibility (CNR) • New! • Goal: Achieve the best performance possible for cases that require reproducibility

  9. Why “Conditional”? • In Intel MKL 11.0 reproducibility is currently available under certain conditions: • Within single operating systems / architecture • Reproducibility only applies within the blue boxes, not between them… • Reproducibility on all supported servers and workstations • No support yet for Intel® Xeon Phi™ coprocessors • Within a particular version of Intel MKL • Results in version 11.0 update 1 may differ from results in version 11.0 • Reproducibility controls in Intel MKL only affect Intel MKL functions

  10. Conditions for reproducibility • Aligned input and output arrays in function calls • 16-byte alignment for the family of SSE instruction sets • 32-byte alignment for AVX • 64-byte alignment for future processors <- choose this to be safe • Set the same number of computational threads for the library in each run • Use the same Intel MKL parameters from run-to-run • Example: You cannot call a function in 3 blocks in one run and 4 blocks in the next • Use the new functions & controls to ensure deterministic task scheduling and to control code paths • CNR controls must be set or called before any computational math functions in Intel MKL
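The alignment condition can be sketched in standard C as follows. This is an illustrative example, not code from the presentation: it assumes a C11 runtime that provides `aligned_alloc` (MSVC does not; it offers `_aligned_malloc` instead), and the helper names and the `CNR_ALIGNMENT` macro are hypothetical. Intel MKL also ships its own aligned allocator, `mkl_malloc`; the sketch avoids it so that it builds without the library.

```c
#include <stdint.h>
#include <stdlib.h>

#define CNR_ALIGNMENT 64  /* the "choose this to be safe" value from the slide */

/* Allocate room for n doubles on a 64-byte boundary.
   aligned_alloc requires the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the buffer with free(). */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + CNR_ALIGNMENT - 1) / CNR_ALIGNMENT * CNR_ALIGNMENT;
    if (rounded == 0)
        rounded = CNR_ALIGNMENT;
    return aligned_alloc(CNR_ALIGNMENT, rounded);
}

/* Return 1 if p sits on a 64-byte boundary, 0 otherwise. */
int is_cnr_aligned(const void *p)
{
    return ((uintptr_t)p % CNR_ALIGNMENT) == 0;
}
```

A buffer obtained this way satisfies the 16-byte and 32-byte requirements as well, since any 64-byte boundary is also a 16-byte and a 32-byte boundary.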

  11. Example - COMPATIBLE • For reproducible results on Intel and Intel-compatible CPUs supporting SSE2 instructions or later • function call mkl_cbwr_set(MKL_CBWR_COMPATIBLE) • or environment variable set MKL_CBWR=COMPATIBLE • Note: MKL_CBWR_COMPATIBLE is provided because Intel and Intel-compatible CPUs have approximation instructions (e.g., rcpps/rsqrtps) that may return different results. This option ensures that Intel MKL uses an SSE2-only code path that does not contain any of these instructions.

  12. Example – SSE2 • For the same results on every Intel processor that supports SSE2 instructions or later • function call mkl_cbwr_set(MKL_CBWR_SSE2) • or environment variable set MKL_CBWR=SSE2 • Note: on non-Intel processors the results may differ since only the MKL_CBWR_COMPATIBLE path is supported

  13. Example – SSE4.2 • For the same results on every Intel processor that supports SSE4.2 instructions or later • function call mkl_cbwr_set(MKL_CBWR_SSE4_2) • or environment variable set MKL_CBWR=SSE4_2 • Note: on non-Intel processors the results may differ since only the MKL_CBWR_COMPATIBLE path is supported

  14. Example – deterministic task scheduling • For run-to-run consistent results on any supported processor without fixing the code branch • function call mkl_cbwr_set(MKL_CBWR_AUTO) • or environment variable set MKL_CBWR=AUTO • Note: • This will ensure deterministic task scheduling • It will not give you reproducibility from processor to processor

  15. Example – Find out the best performing option from a pool of processors • For the best option given a pool of computing resources in a grid setting, you may launch a simple program as follows:

      #include <stdio.h>
      #include <mkl.h>

      int main(void)
      {
          /* Find the best CNR branch available on this processor */
          int my_cbwr_branch = mkl_cbwr_get_auto_branch();

          /* mkl_cbwr_set returns MKL_CBWR_SUCCESS when the branch is accepted */
          if (mkl_cbwr_set(my_cbwr_branch) != MKL_CBWR_SUCCESS) {
              printf("Error in setting branch. Aborting...\n");
              return 1;  /* 1 is not one of the branch codes listed below */
          }

          /* Report the branch code as the exit status */
          return my_cbwr_branch;
      }

      • Examine all results and use mkl_cbwr_set(<minimum_result>) • The full list of options: COMPATIBLE = 3, SSE2 = 4, SSE3 = 5, SSSE3 = 6, SSE4_1 = 7, SSE4_2 = 8, AVX = 9, AVX2 = 10
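The "examine all results" step can be sketched as a small helper that takes the exit codes collected from every machine in the pool and returns the smallest one. This is an illustrative sketch, not code from the presentation: the enum and function names are hypothetical, the numeric values are copied from the option list on this slide, and real code would use the MKL_CBWR_* constants from mkl.h.

```c
#include <stddef.h>

/* Branch codes copied from the option list on the slide.
   Local, illustrative names only; not part of Intel MKL. */
enum pool_branch {
    POOL_COMPATIBLE = 3,
    POOL_SSE2 = 4,
    POOL_SSE3 = 5,
    POOL_SSSE3 = 6,
    POOL_SSE4_1 = 7,
    POOL_SSE4_2 = 8,
    POOL_AVX = 9,
    POOL_AVX2 = 10
};

/* Return the smallest branch code reported by the pool,
   or -1 if no codes were collected. */
int lowest_common_branch(const int *codes, size_t n)
{
    if (n == 0)
        return -1;
    int lowest = codes[0];
    for (size_t i = 1; i < n; ++i) {
        if (codes[i] < lowest)
            lowest = codes[i];
    }
    return lowest;
}
```

If, for instance, one machine were to report 8 (SSE4_2) and another 9 (AVX), the helper returns 8, and every machine in the pool would then call mkl_cbwr_set with the SSE4_2 branch.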

  16. Change this sort of inconsistency… • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678902222 • C:\Users\me>test.exe • 4.012345678902222 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678902222 • Align memory • Constant # of threads • Turn on CNR with either • mkl_cbwr_set(MKL_CBWR_AUTO) • or • set MKL_CBWR=AUTO • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111

  17. Change this inconsistency in results… • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678902222 • C:\Users\me>test.exe • 4.012345678902222 • C:\Users\me>test.exe • 4.012345678902222 • C:\Users\me>test.exe • 4.012345678902222 • C:\Users\me>test.exe • 4.012345678902222 Intel® Xeon® Processor E5540 Intel® Xeon® Processor E3-1275

  18. …to get reproducible results? • Align memory • Constant # of threads • Turn on CNR with either… • mkl_cbwr_set(MKL_CBWR_SSE4_2) • or • set MKL_CBWR=SSE4_2 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 • C:\Users\me>test.exe • 4.012345678901111 Intel® Xeon® Processor E5540 (Supporting SSE4.2 instructions) Intel® Xeon® Processor E3-1275 (Supporting AVX instructions)

  19. What’s next? https://softwareproductsurvey.intel.com/survey/150072/1afd/

  20. Further resources on conditional numerical reproducibility • Intel MKL Documentation – online and in the product • Intel MKL User’s Guide • Reference Manual • Knowledgebase articles on CNR • Support • Intel MKL user forum • Intel Premier support • Feedback • Survey: https://softwareproductsurvey.intel.com/survey/150072/1afd/

  21. New optimizations and features • Support for the Intel® Xeon Phi™ coprocessor based on the Intel® Many Integrated Core Architecture (Intel® MIC Architecture) on Linux* only • Optimizations using the new Intel® Advanced Vector Extensions 2 (AVX2) including the new FMA3 instructions • FFTs: Completed support for real-to-complex transforms with sizes given by 64-bit integers • Local threading control function • mkl_set_num_threads_local()

  22. Sept 18th, 2012 9:00AM • Interesting ties between tools and new hardware features: How Intel Tools support the many new features in processors and coprocessors • Oct 2nd, 2012 9:00AM • Pointer Checker: Catch Out-of-Bounds Memory Accesses Easily! • Oct 16th, 2012 9:00AM • How Intel® Parallel Studio XE is used to improve the HMMER application • Oct 30th, 2012 9:00AM • Using the Intel® Math Kernel Library 11.0 and Compiler to Obtain Run-to-Run Reproducible Results • Oct 9th, 2012 9:00AM • Achieving better parallel performance of Fortran programs with Intel® VTune™ Amplifier XE profiling. • Oct 23rd, 2012 9:00AM • Three common Fortran mistakes you can avoid by using Intel® Inspector XE • Nov 6th, 2012 9:00AM • Avoid common parallelization mistakes with the help of Intel® Advisor XE • Dec 4th, 2012 9:00AM • Fortran 2008 Standard Parallel Programming Features in Intel® Fortran Composer XE* http://software.intel.com/en-us/fall-webinar-series-psxe-and-fsxe

  23. Summary • Conditional Numerical Reproducibility (CNR) provides: • reproducible results from run-to-run • reproducible results from processor-to-processor • the ability to balance reproducibility requirements with great performance Evaluate CNR in the following: Intel® Math Kernel Library 11.0 Intel® Composer XE 2013 Intel® Parallel Studio XE 2013 Intel® Cluster Studio XE 2013 Provide feedback: https://softwareproductsurvey.intel.com/survey/150072/1afd/
