
InCoB2007 - August 30, 2007 - HKUST

“Speedup Bioinformatics Applications on Multicore-based Processor using Vectorizing & Multithreading Strategies”. InCoB2007 - August 30, 2007 - HKUST. Kridsadakorn Chaichoompu kridsadakorn.cha@biotec.or.th. Dr. Sissades Tongsima. Dr. Surin Kittitornkun.


Presentation Transcript


  1. “Speedup Bioinformatics Applications on Multicore-based Processor using Vectorizing & Multithreading Strategies” InCoB2007 - August 30, 2007 - HKUST • Kridsadakorn Chaichoompu (kridsadakorn.cha@biotec.or.th) • Dr. Sissades Tongsima • Dr. Surin Kittitornkun • National Center for Genetic Engineering and Biotechnology, Thailand • King Mongkut’s Institute of Technology, Ladkrabang, Thailand

  2. Outline • Introduction • Case Study • Existing works • Speedup of our approach • Comparison • Discussion • Our strategies • Limitation • Conclusion

  3. Motivation • New processors keep being launched • How can we make use of the new technologies? [Figure: quad-core CPU and dual-core CPU]

  4. Motivation [2] • What is the difference between old and new CPUs? • Dual-core: max. speedup ~2x • Quad-core: max. speedup ~4x

  5. Problems • Is old sequential software still in use? • Yes, especially in scientific and bioinformatics tools • Why do scientists still use it? • They mostly care about novel algorithms and knowledge, not about speed • Why not use a PC cluster? • It is very expensive and consumes much more electric power. A PC cluster is not needed for small software that searches, matches, or groups data

  6. Our Contribution • The hardware has changed, so old sequential software should change too. To harness the power of the new multicore architecture, certain compiler techniques must be considered • Using the popular ClustalW application as our case study, optimization and multithreading techniques were applied to speed up ClustalW

  7. Case Study: ClustalW • ClustalW is a general-purpose multiple alignment program for DNA or proteins.

  8. ClustalW example • Sequences: S1 ALSK, S2 TNSD, S3 NASK, S4 NTSD • Stages: all pairwise alignments → distance matrix → neighbor joining → multiple alignment • Multiple alignment steps: 1. Align S1 with S3 (-ALSK / NA-SK) 2. Align S2 with S4 (-TNSD / NT-SD) 3. Align (S1, S3) with (S2, S4) • Resulting multiple alignment: -ALSK, -TNSD, NA-SK, NT-SD

  9. Existing works • ClustalW-MPI: ClustalW analysis using distributed and parallel computing • K.B. Li, Bioinformatics 19, 2003 • Parallel MSA: Parallel Multiple Sequence Alignment with Dynamic Scheduling • J. Luo, I. Ahmad, M. Ahmed and R. Paul, ITCC’05 • SGI: Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL • D. Mikhailov, Haruna C., and R. Gomperts, SGI ChemBio

  10. Speedup of our approach • Data set: protein sequences from NCBI • Test data: 800 sequences, 1000 amino acids • Elapsed times (ms) per stage and overall speedup:

      Running mode*   Distance Matrix   Neighbor Joining   Progressive Alignment   Overall speedup
      I                    11,918,672            932,718                 333,110                 -
      II                   10,387,046            881,125                 338,016              1.14
      III                   9,656,750            880,969                 327,985              1.21
      IV                    7,009,875            511,047                 252,984              1.70
      V                     5,900,891            473,359                 253,188              1.98
      VI                    5,472,407            474,109                 244,672              2.12

      *Note: Running modes are defined as follows: (I) ClustalW without optimization (II) ClustalW with optimization (III) ClustalW with optimization and our assist (IV) MT-ClustalW without optimization (V) MT-ClustalW with optimization (VI) MT-ClustalW with optimization and our assist

      • Run time: from 3 h 40 m down to 1 h 43 m
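The overall speedup and run-time figures on slide 10 can be rechecked from the per-stage times. The sketch below is illustrative only: the function names are hypothetical, and it assumes that "overall speedup" means the total time of mode I divided by the total time of the given mode.

```c
/* Elapsed times in ms, copied from slide 10.
   Columns: distance matrix, neighbor joining, progressive alignment. */
static const long long elapsed_ms[6][3] = {
    {11918672LL, 932718LL, 333110LL},  /* mode I   */
    {10387046LL, 881125LL, 338016LL},  /* mode II  */
    { 9656750LL, 880969LL, 327985LL},  /* mode III */
    { 7009875LL, 511047LL, 252984LL},  /* mode IV  */
    { 5900891LL, 473359LL, 253188LL},  /* mode V   */
    { 5472407LL, 474109LL, 244672LL},  /* mode VI  */
};

/* Total elapsed time of one running mode (0 = mode I ... 5 = mode VI). */
long long total_ms(int mode)
{
    return elapsed_ms[mode][0] + elapsed_ms[mode][1] + elapsed_ms[mode][2];
}

/* Assumed definition: total time of mode I over total time of this mode. */
double overall_speedup(int mode)
{
    return (double)total_ms(0) / (double)total_ms(mode);
}

/* Whole minutes of total run time, for comparison with "3 h 40 m" etc. */
long long total_minutes(int mode)
{
    return total_ms(mode) / 60000LL;
}
```

Under that assumption, mode I totals 13,184,500 ms (about 3 h 40 m) and mode VI totals 6,191,188 ms (about 1 h 43 m). The recomputed ratios agree with the listed speedups to within one unit in the last digit.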

  11. ClustalW • Speedup of the optimized versions of ClustalW as a function of number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids.

  12. Multithreaded ClustalW • Speedup of the optimized versions of MT-ClustalW as a function of number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids.

  13. Comparison • Why is the speedup over 2x? • Because of the special unit in the new CPU • Does the special unit normally work with common software? • No, we have to activate it.

  14. Speedup > 2x for dual-CPU? [1] • Amdahl’s Law (S: speedup)
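For reference, Amdahl's law is usually stated as follows. The symbols P (the fraction of run time that can be parallelized) and N (the number of cores) are chosen here for illustration and are not taken from the slide.

```latex
S(N) = \frac{1}{(1 - P) + \dfrac{P}{N}},
\qquad
S(2) = \frac{1}{1 - \dfrac{P}{2}} \le 2 \quad \text{for } 0 \le P \le 1 .
```

With N = 2 the bound of 2 is reached only when P = 1, so multithreading alone on two cores cannot account for a speedup above 2x.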

  15. Speedup > 2x for dual-CPU? [2] • Speedup 1.21 • Speedup 1.70 • Data set: 800 sequences, 1000 amino acids
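One possible reading of these two figures: 1.21 matches running mode III (sequential ClustalW with optimization and assist) and 1.70 matches mode IV (MT-ClustalW without optimization) on slide 10. Assuming the two gains are independent and compose multiplicatively, which the slides do not state explicitly, the combined estimate is:

```latex
S_{\text{combined}} \approx S_{\text{opt}} \times S_{\text{MT}} = 1.21 \times 1.70 \approx 2.06
```

This is close to the 2.12 measured for mode VI, and it shows how a total above 2x is possible on two cores when part of the gain comes from optimization within each thread.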

  16. Our strategies • Step 1: Analyzing and Profiling • To find the software structure and where the bottleneck is • Step 2: Applying the methodologies • Multithreading & Vectorizing (one of the optimization methods) • Step 3: Validating • To compare the result with the original one and make sure that the result is unchanged

  17. Strategy: Multithreading • The Proposed Multithreading Strategy • To relieve the bottleneck of the software, which is the non-threaded part • To raise the throughput of the program by applying the multithreading strategy • To reduce the overhead of thread creation

  18. Profile the software • Profiled by Intel Thread Profiler • Stages shown: distance matrix, neighbor joining, progressive alignment

  19. Implementation • Apply the thread library to this loop

  20. Trick: Reduce Thread Creation Overhead • 4 threads: T1, T2, T3, T4 • 12 parameters: P1 to P12
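A minimal sketch of the idea on slide 20, using POSIX threads: instead of creating one thread per parameter (12 creations), 4 threads are created once and each handles several parameters. The round-robin assignment, the function names, and `process_param` are assumptions made for illustration; the slide does not show how parameters are mapped to threads, and this is not the authors' code.

```c
#include <pthread.h>

#define NUM_THREADS 4   /* matches the 4 threads on the slide */

/* Hypothetical stand-in for one unit of work, e.g. one pairwise alignment. */
static long process_param(int p)
{
    return (long)p * p;
}

typedef struct {
    int   thread_id;   /* 0 .. NUM_THREADS-1 */
    int   num_params;  /* total number of parameters */
    long *results;     /* results[p] is written only by the thread owning p */
} worker_arg;

static void *worker(void *arg)
{
    worker_arg *w = (worker_arg *)arg;
    /* Assumed round-robin split: thread t handles p = t, t+4, t+8, ... */
    for (int p = w->thread_id; p < w->num_params; p += NUM_THREADS)
        w->results[p] = process_param(p);
    return NULL;
}

/* Creates NUM_THREADS threads once, regardless of how many parameters
   there are. Returns 0 on success, -1 if a thread could not be created. */
int run_all(int num_params, long *results)
{
    pthread_t  tid[NUM_THREADS];
    worker_arg args[NUM_THREADS];
    int started = 0;
    int rc = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        args[t].thread_id  = t;
        args[t].num_params = num_params;
        args[t].results    = results;
        if (pthread_create(&tid[t], NULL, worker, &args[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    /* Wait for every thread that was actually started. */
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    return rc;
}
```

Because each result slot is written by exactly one thread, no locking is needed in this sketch. With 12 parameters, each of the 4 threads processes 3 of them.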

  21. Strategy: Vectorizing • Proposed Optimizing and Vectorizing Methodology • Find the frequently used functions in the program • Apply the loop optimizing methodologies • Take advantage of the Intel C++ Compiler to optimize the code, with its vectorization option enabled

  22. Frequently used functions • Profiled by Intel VTune

  23. Loop Reversal • That is, to run a loop backward. Reversal is legal when no iteration depends on the result of another iteration (as in a forall-style loop), since the execution is then not defined in terms of the order of the index set.
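A small sketch of loop reversal in C, with hypothetical function names. The first pair of loops has independent iterations, so running it backward gives the same result (assuming `a` and `b` do not overlap). The second pair carries a dependence from one iteration to the next, so reversing it changes the result; a compiler or programmer has to rule out that case before reversing.

```c
#include <stddef.h>

/* Forward loop: each iteration touches only element i. */
void scale_forward(const float *a, float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] = 2.0f * a[i];
}

/* Reversed loop: same result, because iterations are independent. */
void scale_reversed(const float *a, float *b, size_t n)
{
    for (size_t i = n; i-- > 0; )
        b[i] = 2.0f * a[i];
}

/* Running sum: iteration i reads the value written by iteration i-1. */
void prefix_sum_forward(float *x, size_t n)
{
    for (size_t i = 1; i < n; i++)
        x[i] += x[i - 1];
}

/* Reversing the running sum is NOT equivalent: x[i-1] is read before
   it has been updated. {1,1,1,1} becomes {1,2,2,2} instead of {1,2,3,4}. */
void prefix_sum_reversed(float *x, size_t n)
{
    for (size_t i = n; i-- > 1; )
        x[i] += x[i - 1];
}
```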

  24. Loop Fission • A single loop can be broken into two or more smaller loops. Loop fission can break up the block of conditionally executed statements.
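A small sketch of loop fission in C, with hypothetical function and parameter names. The original loop mixes a plain arithmetic statement with a conditionally executed one. After fission, the first loop contains only straight-line arithmetic, which may be easier for a compiler to vectorize, and the conditional is isolated in its own loop. The split is valid here because the two statements do not depend on each other (assuming the arrays do not overlap).

```c
#include <stddef.h>

/* Before: arithmetic and a conditional statement share one loop body. */
void fused(const float *a, float *b, int *flags, size_t n, float threshold)
{
    for (size_t i = 0; i < n; i++) {
        b[i] = a[i] * a[i];
        if (a[i] > threshold)
            flags[i] = 1;
    }
}

/* After fission: two smaller loops over the same index range. */
void fissioned(const float *a, float *b, int *flags, size_t n, float threshold)
{
    /* Loop 1: branch-free arithmetic only. */
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] * a[i];

    /* Loop 2: the conditionally executed statement on its own. */
    for (size_t i = 0; i < n; i++) {
        if (a[i] > threshold)
            flags[i] = 1;
    }
}
```

Whether the split loop is actually vectorized depends on the compiler and its options, so the effect should be checked with profiling, as in Step 1 of the strategies on slide 16.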

  25. Limitation • Available compilers and programming languages • C/C++ → Intel C++ compiler (Windows, Linux, Mac) • Fortran → Intel Fortran compiler (Windows, Linux, Mac) • Available processors • CPU with Hyper-Threading Technology or above (Intel, AMD)

  26. Conclusion • A generic compiling strategy to assist the compiler in improving the performance of bioinformatics applications written in C/C++ • Proposed framework: multithreading and vectorizing strategies • Higher speedup by taking advantage of multicore architecture technology • The proposed optimization could be more appropriate than parallelization on a small cluster computer

  27. Thank you Questions?
