Musepack Encoder Performance Tuning

Tal Rath and Eyal Enav
May 2008, Technion Softlab

Agenda
• Project goals
• Project description
• What is Musepack?
• Using a multithreading approach
• Applying SIMD
• Analyzing micro-architecture problems
• Results – Speedup overview
• Conclusions and recommendations
• Our benefits
• Next steps

Project goals
• Speed up and optimize a Musepack encoder while maintaining bitwise output compatibility:
  • Examine the encoder’s structure and methods.
  • Analyze the time distribution of the encoder’s functions using Intel’s VTune.
  • Apply multithreading, SIMD instructions and other techniques to achieve speedup, guided by VTune measurements.
  • Return the code to the open source community.

Project description
• Project platform:
  • Intel Core 2 Duo, 2.4 GHz, 64-bit, 2 GB of RAM.
  • Windows XP OS.
• Speedup measurement: [formula not preserved]

Project description
• What is Musepack?
  • Musepack is an open source audio codec.
  • It is a lossy encoder.
  • Musepack has performed well in various listening tests at both lower and higher bitrates.

Using Multithreading approach
• Thread-level parallelism reduces program execution time by executing multiple code sections on both cores simultaneously.
• Amdahl’s law: if P is the fraction of the program that can run in parallel, the maximum speedup achievable with 2 processors is $S = \frac{1}{(1 - P) + P/2}$. Therefore, P should be maximized.
• Intel’s VTune was used to find time-consuming functions that are suitable targets for multithreading.
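Amdahl’s bound is easy to evaluate directly. The following is a minimal sketch in C; the function names and the sample fractions are illustrative and are not part of the encoder.

```c
#include <stdio.h>

/* Amdahl's law: upper bound on speedup when a fraction p of the run time
   is spread perfectly over n processors and the rest stays serial. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Hypothetical helper: prints the two-core bound for a few fractions. */
void print_amdahl_table(void)
{
    const double fractions[] = { 0.50, 0.732, 0.90 };
    for (int i = 0; i < 3; ++i)
        printf("P = %.3f -> max speedup on 2 cores = %.2f\n",
               fractions[i], amdahl_speedup(fractions[i], 2));
}
```

With P = 0.732, the parallel part reported on the results slide, the bound on two cores is about 1.58, so the measured threading speedup of 1.43X sits below this ideal limit.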

Using Multithreading approach
• Functions’ total timer events: [chart not preserved]
• Psychoakustic_Modell accounts for a large share of the run time and is therefore a target for multithreading.

Multithreading Psychoakustic Function – First attempt
• The function contains two separate models that run the same instructions on different data.
• Idea: execute each model in a different thread.

Multithreading Psychoakustic – First attempt
• Problem: the two models are tightly coupled through local and global variables; the second model uses the first one’s output.

Multithreading Psychoakustic – Second attempt
• Observation: the Psychoakustic function contains left and right channel handling functions.
• These functions can be divided into two types:
  • Single channel functions, for example: FunctionL(Left Param1, Left Param2, …, Local Param1, Local Param2).
  • Dual channel functions, for example: FunctionLR(Left Param1, Right Param1, …).
• Single channel functions do not access the opposite channel’s local variables.
• Timer events distribution:
  • Single: 84%
  • Dual: 16%

Multithreading Psychoakustic – Second attempt
• Strategy: one single channel function in each thread.
• [Timeline diagram: thread A runs the left single channel functions while thread B runs the right ones; each pair of single channel functions is followed by a dual channel function.]
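The strategy can be sketched as follows. This is a minimal illustration, assuming POSIX threads rather than the Windows threading the project presumably used; ChannelState, the two stage functions and BANDS are hypothetical stand-ins, not names from the encoder.

```c
#include <pthread.h>

#define BANDS 32  /* illustrative size, not taken from the encoder */

/* Hypothetical per-channel state: each thread touches only its own copy. */
typedef struct {
    const float *samples;   /* input for this channel */
    float energy[BANDS];    /* output of the single channel stage */
} ChannelState;

/* Stand-in for a single channel function: reads and writes one channel. */
static void *single_channel_stage(void *arg)
{
    ChannelState *ch = (ChannelState *)arg;
    for (int i = 0; i < BANDS; ++i)
        ch->energy[i] = ch->samples[i] * ch->samples[i];
    return NULL;
}

/* Stand-in for a dual channel function: it needs both channels, so it runs
   only after both single channel stages have finished. */
static float dual_channel_stage(const ChannelState *l, const ChannelState *r)
{
    float sum = 0.0f;
    for (int i = 0; i < BANDS; ++i)
        sum += l->energy[i] + r->energy[i];
    return sum;
}

/* Returns the dual channel result, or -1.0f if the thread cannot be created. */
float process_frame(const float *left, const float *right)
{
    ChannelState l = { .samples = left };
    ChannelState r = { .samples = right };
    pthread_t thread_b;

    /* Thread B handles the right channel... */
    if (pthread_create(&thread_b, NULL, single_channel_stage, &r) != 0)
        return -1.0f;
    /* ...while the calling thread (thread A) handles the left channel. */
    single_channel_stage(&l);
    /* Join before any dual channel work, because it reads both channels. */
    pthread_join(thread_b, NULL);

    return dual_channel_stage(&l, &r);
}
```

Creating a thread per call keeps the sketch short; a real encoder would more likely keep a persistent worker thread and signal it for each frame.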

Multithreading Psychoakustic – Second attempt
• Implementation:
  • Left channel local variables are used by thread A, right channel ones by thread B.
  • Shared variables, used by both threads, are duplicated: one copy for each thread.
• Technical problem: the program contains a large number of global variables.
  • They are accessed by both left and right single channel functions, so they would be accessed from both threads simultaneously.
  • Examples (partial list): A, About, ANSspec_L, ANSspec_M, ANSspec_R, ANSspec_S, APE_Version, array, b, Bandwidth, Buffer, BufferBytes, BufferedBits, bump_exp, bump_start, Butfly, __C, c, Ci_opt, CombPenalities, Cos_Tab, CosWin, CP_10000, CP_10079, CP_1250, CP_1251, CP_1252, CP_1253, CP_1254, CP_1255, CP_1256, CP_1257, CP_1258, CP_37, CP_42, CP_437, CP_500, CVD_used, __D, d, data_finished, DelInput, DisplayUpdateTime

Multithreading Psychoakustic – Global Variables
• Solution: a “divide and conquer” approach:
  • Map all globals using a globals marking script.
  • Duplicate the globals that are accessed by functions at the deepest level of the call tree.
  • After these functions are handled, proceed to the next higher level.
  • The process ends once the globals accessed from the upper level (Psychoakustic’s own code) have been duplicated.
• [Diagram, from the upper level (Psychoakustic()) down to the deepest level:]
  • Before: float g_var1 is a global/static variable, and Function A() executes g_var1 = value.
  • After: g_var1 becomes a field of a duplicated struct, and Function A(thread num) executes struct.g_var1 = value.
  • The duplicated structs are aligned to 64 bytes to avoid shared cache lines.
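The before/after transformation in the diagram can be sketched in C11 as follows. ThreadGlobals, thread_globals, function_a and read_g_var1 are hypothetical names used for illustration; only g_var1 and the 64-byte alignment come from the slide.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE  64
#define NUM_THREADS 2

/* Before: "float g_var1;" was a global/static written by Function A. */

/* After: one copy per thread. alignas on the first member raises the
   struct's alignment to 64 bytes, so sizeof is padded to a multiple of 64
   and neighbouring array elements never share a cache line. */
typedef struct {
    alignas(CACHE_LINE) float g_var1;
    /* ...other former globals used by the same functions would go here... */
} ThreadGlobals;

static ThreadGlobals thread_globals[NUM_THREADS];

/* Hypothetical rewrite of "Function A": the caller passes its thread index. */
void function_a(int thread_num, float value)
{
    thread_globals[thread_num].g_var1 = value;
}

/* Hypothetical accessor, used here only to show that the copies are independent. */
float read_g_var1(int thread_num)
{
    return thread_globals[thread_num].g_var1;
}
```

Keeping each thread’s copy on its own cache line means a write by thread A does not invalidate the line that thread B is working on.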

Multithreading – Results
• After Psychoakustic was multithreaded, two more functions were multithreaded using the same mechanism.
• Total threading speedup: 1.43X
  • Parallel part: 73.2%.
  • Assuming the serial part does not change, the new execution time of the multithreaded part is 57% of its original time.
• Threading overhead:
  • Total program instruction count (IC) increased by 2.6%.
  • Total timer event count increased by 0.62%.
• Intel Thread Checker found no errors. (Thread Profiler)

Floating Point Issues
• The original encoder settings use the “Precise F.P. model” instead of the “Fast F.P. model”.
  • Precise mode increases calculation time.
  • The F.P. model was changed to “fast” (after consulting our instructors).
• In the original program, sqrt calls with single-precision F.P. arguments were performed in double precision.
  • These calls were changed to single precision.
• Speedup gained so far: 1.77X
• The output file is bitwise compatible only with the output of the original encoder built in “Fast F.P. mode”:
  • Value differences from the “Precise mode” output are due to rounding.
  • Such minor differences cannot be noticed by the human ear.

SIMD Instructions
• SIMD is a technique for achieving data-level parallelism: SIMD instructions execute 4 F.P. operations at a time.
• Function self time distribution: [chart not preserved]
• The Sqrt function is the main target for SIMD instructions.

SIMD Instructions – Implementation
• SIMD instructions were used in the four functions that call Sqrt.
• These functions were transformed into SIMD-oriented functions: sqrt as well as other mathematical operations are performed by SIMD instructions.
• In one of the functions, because the number of loop iterations varies, an array of Sqrt values is calculated in advance using SIMD instructions.
• No calls to the original Sqrt remained after applying SIMD.
• Speedup gained from SIMD: 23% (with multithreading).

Micro Architecture Issues
• Using VTune’s Tuning Assistant, several micro-architecture problems were discovered:
• RAT_STALLS.FLAGS: indicates partial flag stalls.
  • About [count not preserved] events, each causing a stall of ~10 cycles: ~4 sec in total.
  • Possible solution: instruction substitution, such as replacing INC with ADD.
  • The events occur inside the ‘fread’ function, so the code cannot be modified.
• LOAD_BLOCK.OVERLAP_STORE: load instructions are blocked; the cause can be 4K (page size) aliasing or a load-store block overlap.
  • Possible solution: increase 4K-sized arrays by one block size and use 64-byte alignment.
  • The solution was applied; the effect was unnoticeable.
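The padding idea for 4K aliasing can be sketched as below. The assumption here is that two 4 KB float arrays sit back to back, so that in[i] and out[i] are exactly 4096 bytes apart; PaddedBuffers, its fields and buffer_distance are hypothetical names.

```c
#include <stdalign.h>
#include <stddef.h>

#define ARRAY_FLOATS 1024                  /* 1024 * 4 bytes = 4 KB */
#define PAD_FLOATS   (64 / sizeof(float))  /* one 64-byte block of padding */

/* Each array is 64-byte aligned and grown by one block, so equal indices
   in the two arrays no longer differ by a multiple of 4096 bytes. */
typedef struct {
    alignas(64) float in[ARRAY_FLOATS + PAD_FLOATS];
    alignas(64) float out[ARRAY_FLOATS + PAD_FLOATS];
} PaddedBuffers;

/* Distance in bytes between in[i] and out[i] for any index i. */
size_t buffer_distance(void)
{
    return offsetof(PaddedBuffers, out) - offsetof(PaddedBuffers, in);
}
```

Without the padding the distance would be 4096 bytes, which is the case in which a load from one array can be blocked behind a recent store to the other.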

Speedup Overview
• [Speedup chart not preserved; the surviving value is 2.03.]

Conclusions
• Multithreading
  • Can produce a significant program acceleration.
  • Global variables can be an obstacle in the process of multithreading.
• SIMD instructions
  • Enhance speedup.
  • Can be implemented only on specific code parts.
  • Sometimes the implementation has to be “creative”.
• Micro architecture
  • In this program no major problems were found.
  • VTune’s Tuning Assistant is a powerful tool for tracking micro-architecture problems.

Optional Future Steps
• Adjusting the encoder for quad-core processors by creating 4 threads.
• Designing a multithreading assistance program that traces and handles global variables using the suggested algorithm.

Our Benefit
• Improving our expertise in identifying the dominant factors in a process and handling them.
• Enhancing our knowledge of multithreading techniques.
• Learning how to use SIMD instructions.
• Being exposed to a few micro-architecture problems.

The End Thank you
