Case studies in Optimizing High Performance Computing Software


Presentation Transcript


  1. Case studies in Optimizing High Performance Computing Software. Jan Westerholm, High Performance Computing, Department of Information Technologies, Faculty of Technology / Åbo Akademi University

  2. FINHPC / Åbo Akademi Objectives • Sub-project in FINHPC • Three-year duration, 01.07.2005-30.06.2008 • Objective: to improve code that individuals and research groups have written and are running on CSC machines • faster code, in many cases with exactly the same numerical results as before • ability to run bigger problems • Work approach: apply well-known techniques from computer science • Faster programs may imply better-quality results • Better throughput for everybody

  3. FINHPC / Åbo Akademi Limitations • We will use: • parallelization techniques • code optimization • cache utilization (particularly the L2 cache) • microprocessor pipeline continuity • data blocking: grid scan order • introduction of new data structures • replacement of very simple algorithms • sorting (quicksort instead of bubble sort, see the sketch below) • open source libraries
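
As a concrete illustration of the sorting replacement mentioned above, a minimal sketch of swapping a hand-written bubble sort for the C standard library's qsort(); the function and array names and the element type are hypothetical and not taken from any of the codes discussed below:

#include <stdlib.h>

/* Comparison callback for qsort(): ascending order of doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* The O(n^2) hand-written bubble sort is replaced by the library's
 * O(n log n) quicksort, with no change in the sorted result. */
void sort_values(double *values, size_t n)
{
    qsort(values, n, sizeof(double), cmp_double);
}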

  4. FINHPC / Åbo Akademi Limitations • We will not: • introduce better physics, chemistry, etc. • replace the chosen basic numerical technique • replace individual algorithms unless they are clearly modularized (e.g. matrix inversion as a library routine)

  5. 3 case studies • Lattice-Boltzmann fluid simulation: 3DQ19 • Protein covariance analysis: Covana • Fusion reactor simulation: Elmfire

  6. 3DQ19: Lattice Boltzmann fluid mechanics • Jyväskylä University / Jussi Timonen, Keijo Mattila; ÅA / Anders Gustafsson • Physical background: • phase-space distribution simulated in time • Boltzmann's equation: drift term and collision term • physical quantities = moments of the distribution (see the equations below)
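
The equations themselves are not reproduced in the transcript; for reference, the standard lattice BGK update consistent with the drift/collision description on this slide, together with the moments that give the physical quantities, reads:

% Lattice BGK update: streaming (drift) on the left, collision on the right
f_i(\mathbf{x} + \mathbf{c}_i \Delta t,\ t + \Delta t) - f_i(\mathbf{x}, t)
  = -\frac{1}{\tau}\,\bigl[ f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t) \bigr]

% Density and momentum as moments of the distribution
\rho = \sum_i f_i , \qquad \rho\,\mathbf{u} = \sum_i \mathbf{c}_i\, f_i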

  7. 3DQ19: Program Profiling
Flat profile:
  %   cumulative    self                self     total
 time    seconds   seconds      calls  ms/call  ms/call  name
33.96      43.65     43.65         50   873.00  1230.10  everything2to1()
30.79      83.22     39.57         50   791.40  1148.50  everything1to2()
27.79     118.93     35.71   49000000     0.00     0.00  relaxation_BGK()
 2.30     121.89      2.96                                shmem_msgs_available
 1.19     123.42      1.53        100    15.30    15.30  send_west()
 1.11     124.85      1.43        100    14.30    14.30  send_east()
 0.82     125.91      1.06                                recv_message
 0.45     126.49      0.58                                sock_msg_avail_on_fd
 0.37     126.97      0.48        100     4.80     4.80  per_bound_xslice()
 0.33     127.40      0.43          1   430.00   430.00  init_fluid()
 0.31     127.80      0.40          1   400.00   400.00  local_profile_y()
 0.23     128.10      0.30                                socket_msgs_available
 0.19     128.34      0.24          1   240.00   240.00  calc_mass()
 0.04     128.39      0.05                                net_recv
 0.03     128.43      0.04          1    40.00    40.00  allocation()
 0.02     128.46      0.03                                main

  8. 3DQ19: Optimizations • Parallelization: well done already! • Code optimization • blocking: grid scan order (see the sketch below) • remove anti-dependencies: make blocks of code independent • deep fluid: mark those grid points that have no solid neighbours
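
The actual 3DQ19 loops are not shown in the transcript; the sketch below only illustrates the blocking idea with assumed names (blocked_scan, f, NX/NY/NZ and the tile size BX are all hypothetical): the grid is visited tile by tile so that a tile's values are still in the L2 cache when they are reused, instead of in one long sweep that evicts them.

#include <stddef.h>

#define BX 8   /* tile edge length, tuned so one tile fits in the L2 cache */

/* Blocked scan over an NX x NY x NZ grid: the three outer loops pick a
 * tile, the three inner loops update the sites inside it. */
void blocked_scan(double *f, int NX, int NY, int NZ)
{
    for (int zb = 0; zb < NZ; zb += BX)
        for (int yb = 0; yb < NY; yb += BX)
            for (int xb = 0; xb < NX; xb += BX)
                for (int z = zb; z < zb + BX && z < NZ; z++)
                    for (int y = yb; y < yb + BX && y < NY; y++)
                        for (int x = xb; x < xb + BX && x < NX; x++) {
                            size_t idx = ((size_t)z * NY + y) * NX + x;
                            f[idx] *= 0.99;  /* stands in for the real site update */
                        }
}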

  9. 3DQ19: Blocking

  10. 3DQ19: Results on three parallel systems
                        Athlon 1800   IBMSC   AMD64
  everything1to2():         18.8      19.48   10.06
  everything2to1():         19.34     18.78   10.52
  send_west():               8.4       0.68    1.96
  send_east():               8.31      1.17    3.14
  Total time (s):           55.15     40.28   25.76
  Time gained (s):          27.48     14.13   14.76
  Speed up (%):             33%       26%     36%

  11. 2nd case study: Covana, protein covariance analysis • Institute of Medical Technology, University of Tampere / Mauno Vihinen, Bairong Chen; ÅA / André Norrgård • Biological background • physico-chemical groups of amino acids • protein function from structure • pair and triple correlations between amino acids • web server for covariance analysis

  12. Covana: Protein covariance analysis • Protein sequences: calculate correlations between columns of amino acids (see the sketch below) • Typical size • 50-150 sequences (rows) • 300-1500 amino acids per sequence (columns) • Example sequence from an alignment:
>Q9XW32_CAEEL/9-307
IDVTKPTFLLTFYSIHGTFALVFNILGIFLIMK-NPKIVKMYKGFMINMQ-ILSLLADAQ
TTLLMQPVYILPIIGGYTNGLLWQVFR----LSSHIQMAMF---LLLLY---------LQ
VASIVCAIVTKYHVVSNIGKLSDRSI-LFWIF---VIVYHGCAFVITGFFSVS-CLARQ-
-EEENLIK------T-KFPNAISVFTLEN--VAIYDLQVN---KWMMITTILFAFMLTSS
IVISFY--FSVRLLKTLPSKRNTISARSFRGHQIAVTSLM-AQAT-VPFLVL---IIP--
IGTIVYLFVHVLP------NAQ-----EISNIMMAV--YSFHASLST---FVMIISTPQY
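
Covana's actual correlation measure is not given in the transcript; the sketch below only shows the raw ingredient of a pair correlation between two alignment columns, the joint amino-acid counts. The names (pair_counts, aln stored as nseq rows of length len) are hypothetical.

#include <string.h>

#define ALPHA 26   /* one slot per letter A-Z; gaps ('-') are skipped */

/* Count how often each amino-acid pair (a, b) occurs in columns ci and cj
 * over all sequences of an alignment stored row by row in aln. */
void pair_counts(const char *aln, int nseq, int len, int ci, int cj,
                 unsigned counts[ALPHA][ALPHA])
{
    memset(counts, 0, sizeof(unsigned) * ALPHA * ALPHA);
    for (int s = 0; s < nseq; s++) {
        char a = aln[(size_t)s * len + ci];
        char b = aln[(size_t)s * len + cj];
        if (a >= 'A' && a <= 'Z' && b >= 'A' && b <= 'Z')
            counts[a - 'A'][b - 'A']++;
    }
}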

  13. Covana: Code optimization • Effective data structures: dynamic memory allocation • Effective generic algorithms: sorting • Avoid recalculations (see the sketch below)
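
A minimal sketch, with hypothetical names, of the dynamic-allocation and avoid-recalculations points: per-column amino-acid counts are computed once into a table allocated to fit the actual alignment, then reused by every column pair instead of being recomputed inside the pair loop.

#include <stdlib.h>

/* Build a table of per-column amino-acid counts (len rows of 26 entries),
 * allocated to fit the actual alignment size. Computing it once lets the
 * O(columns^2) pair loop reuse the counts instead of rescanning all
 * sequences for every column pair. Caller frees the returned table. */
unsigned *column_counts(const char *aln, int nseq, int len)
{
    unsigned *counts = calloc((size_t)len * 26, sizeof(unsigned));
    if (counts == NULL)
        return NULL;
    for (int s = 0; s < nseq; s++)
        for (int c = 0; c < len; c++) {
            char a = aln[(size_t)s * len + c];
            if (a >= 'A' && a <= 'Z')
                counts[(size_t)c * 26 + (a - 'A')]++;
        }
    return counts;
}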

  14. Covana: Run time

  15. Covana: Results • Runtime: • Original: 227.8 s • Final version: 2.0 s • Improvement: 112 times faster • Computer memory usage: • Original: 3250 MB • Final version: 37 MB • Improvement: 88 times less • Disk space usage: • Original: 277 MB • Final version: 21 MB • Improvement: 13 times less

  16. 3rd case study: ELMFIRE, tokamak fusion reactor simulation • Jukka Heikkinen, Salomon Janhunen, Timo Kiviniemi / Advanced Energy Systems / HUT; ÅA / Artur Signell • Physical background: • particle simulation with averaged gyrokinetic Larmor orbits • turbulence and plasma modes

  17. Elmfire: Tokamak fusion reactor simulation • Goal 1: Computer platform independence • replacing proprietary library routines for random number generation with open source routines • replacing proprietary library routines for the distributed solution of sparse linear systems with open source library routines • Goal 2: Scalability • Elmfire previously ran on at most 8 processors • new data structures for sparse matrices were devised that make element updates efficient (see the sketch below)
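
The new Elmfire sparse matrix structures are not described in the transcript; the sketch below shows one generic way to make repeated element updates cheap, with all names hypothetical: matrix contributions are appended as (row, column, value) triplets in constant time, and duplicates are merged only once before the matrix is handed to the solver, instead of searching a compressed-row structure on every update.

#include <stdlib.h>

/* One accumulated contribution: A(i, j) += value. */
typedef struct { int i, j; double value; } triplet_t;

/* Growable triplet buffer: element updates are O(1) appends. Sorting the
 * triplets and merging duplicates is done once, just before the solve. */
typedef struct { triplet_t *t; size_t n, cap; } coo_matrix_t;

int coo_add(coo_matrix_t *m, int i, int j, double value)
{
    if (m->n == m->cap) {                          /* grow geometrically */
        size_t cap = m->cap ? 2 * m->cap : 1024;
        triplet_t *t = realloc(m->t, cap * sizeof(triplet_t));
        if (t == NULL)
            return -1;
        m->t = t;
        m->cap = cap;
    }
    m->t[m->n++] = (triplet_t){ i, j, value };
    return 0;
}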

  18. Elmfire

  19. Elmfire

  20. Conclusions • Software can be improved! • modern microprocessor architecture is taken into account: • cache utilization • pipeline continuity • use of well-established computer science methods

  21. Conclusions • In 1 case out of 3, a clear impact on run time was made • In 2 cases out of 3, previously intractable results can now be obtained • Are these three cases representative of code running on CSC machines? • the next two cases are under study!

  22. What have we learnt? • Computer scientists with minimal prior knowledge of e.g. the physical sciences can contribute to HPC • Are supercomputers needed to the extent they are used today at CSC? • Interprocess communication is often a bottleneck • Parallel computing with 1000 processors may become routine in the future for certain types of problems • Who should do the coding? • Should code for production use (intensive use cycles, maintainability requirements) be outsourced?

  23. Co-workers: • Mats Aspnäs, Ph.D. • Anders Gustafsson, M.Sc. • Artur Signell, M.Sc. • André Norrgård THANK YOU!
