Musepack Encoder Performance Tuning. Tal Rath and Eyal Enav May 2008 Technion Softlab. Agenda. Project goals Project description What is Musepack? Using multithreading approach Applying SIMD Analyzing Micro-architecture problems Results – Speedup overview
Musepack Encoder Performance Tuning Tal Rath and Eyal Enav May 2008 Technion Softlab
Agenda • Project goals • Project description • What is Musepack? • Using multithreading approach • Applying SIMD • Analyzing Micro-architecture problems • Results – Speedup overview • Conclusions and recommendations • Our benefits • Next Steps
Project goals • Speeding up and optimizing a Musepack encoder while maintaining bitwise output compatibility: • Examining the encoder’s structure and methods. • Analyzing the time distribution of the encoder's functions using Intel’s VTune. • Applying multithreading, SIMD instructions and other techniques to achieve speedup, guided by VTune. • Contributing the code back to the open source community.
Project description • Project Platform: • Intel Core 2 Duo, 2.4 GHz, 64-bit, 2 GB of RAM. • Windows XP OS. • Speedup measurement:
Project description • What is Musepack? • Musepack is an open source audio codec. • It is a lossy encoder. • Musepack has performed well in various listening tests at both lower and higher bitrates.
Using Multithreading approach • Thread Level Parallelism is used to reduce program execution time by executing multiple code sections on both cores simultaneously. • Amdahl’s law: if P is the proportion of the program that can run in parallel, then the maximum speedup that can be achieved by using 2 processors is $S = \frac{1}{(1 - P) + \frac{P}{2}}$. Therefore, P should be maximized. • Intel’s VTune was used to identify time consuming functions that are suitable for multithreading.
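The bound from Amdahl's law is easy to evaluate numerically. The following is a minimal sketch in C; the function name `amdahl_speedup` is hypothetical and not part of the encoder.

```c
#include <assert.h>

/* Upper bound on speedup when a fraction p of the run time is
 * parallelized perfectly across n processors (Amdahl's law).
 * Hypothetical helper, for illustration only. */
double amdahl_speedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For the 73.2% parallel part reported on the results slide, `amdahl_speedup(0.732, 2)` gives a bound of roughly 1.58, which puts the measured 1.43X threading speedup in context.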
Using Multithreading approach • Functions’ total timer events: • Psychoakustic_Modell’s time consumption is high, so it should be a target for multithreading.
Multithreading Psychoakustic Function – First attempt • The function contains two separate models that execute the same instructions on different data. Each model should be executed in a different thread.
Multithreading Psychoakustic – First attempt • Problem: Very high dependency between the models through local and global variables: the second model uses the first one’s output.
Multithreading Psychoakustic – Second attempt • Observation: The Psychoakustic function contains left and right channel handling functions. • These functions can be divided into two types: • Single channel functions, for example: FunctionL(Left Param1, Left Param2, …, Local Param1, Local Param2). • Dual channel functions, for example: FunctionLR(Left Param1, Right Param1, …) • Single channel functions do not access the opposite channel’s local variables. • Timer events distribution: • Single – 84% • Dual – 16%
Multithreading Psychoakustic – Second attempt • Strategy: • One single channel function in each thread. [Timeline diagram: Thread A handles the Left channel and Thread B handles the Right channel. Phases with two single channel functions running in parallel alternate with dual channel functions.]
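The strategy can be sketched in C as follows. This is an illustration under stated assumptions, not the encoder's code: the slides do not say which threading API was used, so POSIX threads stand in here, and `channel_job`, `single_channel_stage`, `dual_channel_stage` and `process_frame` are hypothetical names with placeholder computations.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-channel job: input samples plus the stage's result. */
typedef struct {
    const float *samples;
    size_t n;
    float energy;
} channel_job;

/* Stand-in for a single channel function: touches only its own channel. */
static void *single_channel_stage(void *arg) {
    channel_job *job = arg;
    float sum = 0.0f;
    for (size_t i = 0; i < job->n; ++i)
        sum += job->samples[i] * job->samples[i];
    job->energy = sum;
    return NULL;
}

/* Stand-in for a dual channel function: needs both results, runs serially. */
static float dual_channel_stage(const channel_job *l, const channel_job *r) {
    return l->energy + r->energy;
}

float process_frame(const float *left, const float *right, size_t n) {
    assert(left != NULL && right != NULL);
    channel_job l = { left, n, 0.0f };
    channel_job r = { right, n, 0.0f };
    pthread_t thread_b;

    /* Thread B takes the right channel; if it cannot start, run it inline. */
    int started = (pthread_create(&thread_b, NULL, single_channel_stage, &r) == 0);
    single_channel_stage(&l);            /* Thread A (the caller): left channel */
    if (started)
        pthread_join(thread_b, NULL);    /* both channels done before the dual stage */
    else
        single_channel_stage(&r);

    return dual_channel_stage(&l, &r);
}
```

Creating a thread per call keeps the sketch short. A real encoder would more likely keep a worker thread alive and hand it work for each frame, since thread creation cost would otherwise eat into the speedup.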
Multithreading Psychoakustic – Second attempt • Implementation: • Thread A uses the left channel's local variables, while thread B uses the right channel's. Shared variables, used by both threads, are duplicated: one copy for each thread. • Technical problem: The program contains a large number of global variables. • These are accessed by both left and right single channel functions, and would therefore be accessed from both threads simultaneously. A, About, ANSspec,_LANSspec_,MANSspec_,RANSspec,_S, APE_Version, array, b, Bandwidth, Buffer, BufferBytes, BufferedBits, bump_exp, bump_start, Butfly, __C ,c, Ci_,opt ,CombPenalities, Cos,_Tab ,CosWin, CP_10000 ,CP_10079, CP_1250, CP_1251 CP_1252 CP_1253 CP_1254 CP_1255 CP_1256 CP_1257 CP_1258 CP_37, CP_42, CP_437 , CP_500, CVD_used, __D , d ,data,_finishedDelInputDisplayUpdateTime
Multithreading Psychoakustic - Global Variables • Solution - “Divide and Conquer” approach: • Map all globals using a globals marking script. • Duplicate the globals that are accessed by functions at the deepest level of the function call tree. • After these functions are handled, proceed to a higher level. • The process ends once the globals accessed from within the upper level (Psychoakustic's own code) have been duplicated. • Levels run from Psychoakustic() (upper level) down to the deepest level. • Before: float g_var1; (global/static var) … FunctionA() { g_var1 = value; } • After: 64-byte aligned duplicated struct { float g_var1; } … FunctionA(thread num) { struct.g_var1 = value; } • The structs are aligned to 64 bytes to avoid shared cache lines.
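The duplication step might look like the sketch below. It assumes C11 (`_Alignas`) and a 64-byte cache line; `per_thread_globals`, `g_copy`, `function_a` and `read_g_var1` are hypothetical names that mirror the `g_var1` example on the slide.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_THREADS 2
#define CACHE_LINE  64   /* assumed cache line size in bytes */

/* One copy of the former globals per thread. Aligning the struct to the
 * cache line size also pads its size to a multiple of 64 bytes, so the
 * two copies never share a cache line. */
typedef struct {
    _Alignas(CACHE_LINE) float g_var1;
} per_thread_globals;

static per_thread_globals g_copy[NUM_THREADS];

/* Formerly: void function_a(float value) { g_var1 = value; } */
void function_a(int thread_num, float value) {
    assert(thread_num >= 0 && thread_num < NUM_THREADS);
    g_copy[thread_num].g_var1 = value;
}

/* Accessor so callers can read a given thread's copy. */
float read_g_var1(int thread_num) {
    assert(thread_num >= 0 && thread_num < NUM_THREADS);
    return g_copy[thread_num].g_var1;
}
```

Passing the thread number down the call chain is what makes the bottom-up order on the slide practical: once the deepest functions accept `thread_num`, each caller one level up only has to forward it.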
Multithreading - Results • After Psychoakustic multithreading, two more functions were multithreaded using the same mechanism. Total threading speedup: 1.43X • Parallel part: 73.2%. • Assuming the serial part does not change, the new execution time of the multithreaded part is 57% of its original time. • Threading overhead: • Total program instruction count (IC) increased by 2.6%. • Total timer event count increased by 0.62%. • Intel Thread Checker found no errors. (Thread Profiler)
Floating Point Issues • The original encoder settings use the “Precise” F.P. model instead of the “Fast” F.P. model. • Precise mode increases calculation time. • The F.P. model was changed to “Fast” (after consulting our instructors). • In the original program, sqrt instructions with single precision F.P. arguments were performed in double precision. • These instructions were changed to single precision. • Speedup gained so far: 1.77X • The output file is bitwise compatible only with the output of the original encoder built in “Fast” F.P. mode: • The value difference from the “Precise mode” output is due to rounding. • Such minor differences cannot be noticed by the human ear.
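Why a change of F.P. model can alter the output bits at all may be easier to see with a small example. A fast model generally lets the compiler regroup floating point arithmetic, and regrouping changes where rounding happens. The sketch below imitates that by summing the same three values in two orders; `sum_forward` and `sum_backward` are hypothetical helpers, and the `volatile` accumulator only forces each partial sum to be rounded to single precision.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulate left to right, rounding to float after every addition. */
float sum_forward(const float *x, size_t n) {
    assert(x != NULL);
    volatile float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc = acc + x[i];
    return acc;
}

/* Same values, accumulated right to left. */
float sum_backward(const float *x, size_t n) {
    assert(x != NULL);
    volatile float acc = 0.0f;
    for (size_t i = n; i > 0; --i)
        acc = acc + x[i - 1];
    return acc;
}
```

For the input `{1e8f, -1e8f, 1.0f}` the forward sum is 1 and the backward sum is 0, because adding 1 to -1e8 is lost to rounding in single precision. Real audio data produces far smaller discrepancies, but the mechanism is the same.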
SIMD Instructions • SIMD is a technique for achieving data level parallelism. SIMD instructions enable the execution of 4 single precision F.P. operations at a time. • Function self time distribution: the Sqrt function is the main target for SIMD instructions.
SIMD Instructions - implementation • SIMD instructions were used in the four functions that call the Sqrt instruction. • These functions were transformed into SIMD oriented functions: sqrt as well as other mathematical operations are performed by SIMD instructions. • In one of the functions, because the loop iteration count varies, the Sqrt array was calculated in advance using SIMD instructions. • No calls to the original Sqrt remained after applying SIMD. • SIMD gained speedup: 23% (with multithreading).
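Precomputing an array of square roots four elements at a time can be sketched with SSE intrinsics as below. This is an assumption about the general shape of the change, not the encoder's code: `sqrt_array_sse` is a hypothetical name, and unaligned loads and stores are used so the sketch works on any buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics (x86 / x86-64) */

/* out[i] = sqrt(in[i]) for i in [0, n), four floats per instruction. */
void sqrt_array_sse(const float *in, float *out, size_t n) {
    assert(in != NULL && out != NULL);
    size_t i = 0;

    /* Main loop: one packed square root handles four elements. */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);
        _mm_storeu_ps(out + i, _mm_sqrt_ps(v));
    }

    /* Tail: remaining 0..3 elements with the scalar SSE square root. */
    for (; i < n; ++i) {
        __m128 v = _mm_set_ss(in[i]);
        out[i] = _mm_cvtss_f32(_mm_sqrt_ss(v));
    }
}
```

Filling a table once and indexing into it afterwards is one way to keep the SIMD loop regular even when the consuming loop's trip count is not a fixed multiple of four.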
Micro Architecture Issues • Using VTune’s Tuning Assistant, several micro architecture problems were discovered: • RAT_STALLS.FLAGS – indicates partial flag stalls. • Each event causes a stall of ~10 cycles, ~4 sec in total. • Possible solution: instruction substitution, such as replacing INC with ADD. • The events occur in the ‘fread’ function, which cannot be modified. • LOAD_BLOCK.OVERLAP_STORE – load instructions are blocked. The cause can be 4K (page size) aliasing or load-store block overlap. • Possible solution: increase 4K sized arrays by one block size and use 64 byte alignment. • The solution was applied, with no noticeable effect.
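The padding fix for 4K aliasing can be illustrated with a short C11 sketch. The layout and names (`padded_buffers`, `start_offset_mod_4k`) are hypothetical; the point is only that adding one 64-byte block to each 4 KB array moves the next array's start off the 4 KB boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES   4096u
#define BLOCK_BYTES  64u
#define N_FLOATS     (PAGE_BYTES / sizeof(float))    /* 1024 floats = 4 KB */
#define PAD_FLOATS   (BLOCK_BYTES / sizeof(float))   /* 16 floats = 64 bytes */

/* Two 4 KB arrays, each grown by one 64-byte block and 64-byte aligned.
 * Without the padding, a[i] and b[i] would sit exactly 4096 bytes apart
 * and share their low 12 address bits. */
typedef struct {
    _Alignas(BLOCK_BYTES) float a[N_FLOATS + PAD_FLOATS];
    _Alignas(BLOCK_BYTES) float b[N_FLOATS + PAD_FLOATS];
} padded_buffers;

static padded_buffers buffers;

/* Distance between the two array starts, modulo the 4 KB page size. */
size_t start_offset_mod_4k(void) {
    uintptr_t a = (uintptr_t)buffers.a;
    uintptr_t b = (uintptr_t)buffers.b;
    assert(a % BLOCK_BYTES == 0 && b % BLOCK_BYTES == 0);
    return (size_t)((b - a) % PAGE_BYTES);
}
```

With this layout the offset is 64 rather than 0, so corresponding elements of the two arrays no longer alias at 4 KB granularity.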
Speedup Overview 2.03
Conclusions • Multithreading • Can produce significant program acceleration. • Global variables can be an obstacle in the process of multithreading. • SIMD instructions • Enhance speedup. • Can be applied only to specific parts of the code. • Sometimes, the implementation has to be “creative”. • Micro architecture • In this program no major problems were found. • VTune's Tuning Assistant is a powerful tool for tracking micro architecture problems.
Optional Future Steps • Making adjustments for a quad core processor by creating 4 threads. • Designing a multithreading assistance program that will trace and handle global variables using the suggested algorithm.
Our Benefit • Improving our expertise in identifying the dominant factors in a process and handling them. • Enhancing our knowledge of multithreading techniques. • Learning how to use SIMD instructions. • Being exposed to a few micro architecture problems.
The End Thank you