Evaluating Coprocessor Effectiveness for the Data Assimilation Research Testbed Ye Feng IMAGe DAReS SIParCS University of Wyoming
Introduction • Task: evaluate the feasibility and effectiveness of a coprocessor for DART. • Target: get_close_obs (profiling shows it is computationally intensive and executed many times during a typical DART run). • Coprocessor: NVIDIA GPUs with CUDA Fortran. • Result: the parallel version of the exhaustive search on the GPU is faster.
Problem • Calculate: the horizontal distances between the base location and the observation locations. • Find: the close observations, i.e. those within maxdist, returning their indices (cclose_ind) and distances (cdist). • EASY! ...or is it?
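The problem statement above can be sketched sequentially. This is a minimal illustration, not DART's actual implementation: the function name follows the slides, but the argument layout and the haversine distance formula are assumptions.

```python
import math

def get_close_obs(base, obs, maxdist):
    """Sequential sketch of the exhaustive search: great-circle distance
    from the base location to every observation ((lon, lat) in radians).
    Names and the haversine formula are illustrative assumptions."""
    cdist, cclose_ind = [], []
    for i, (lon, lat) in enumerate(obs):
        dlon, dlat = lon - base[0], lat - base[1]
        a = (math.sin(dlat / 2) ** 2
             + math.cos(base[1]) * math.cos(lat) * math.sin(dlon / 2) ** 2)
        d = 2 * math.asin(math.sqrt(a))   # distance on the unit sphere
        if d < maxdist:                    # keep only the close observations
            cclose_ind.append(i)
            cdist.append(d)
    return len(cclose_ind), cclose_ind, cdist
```

On the CPU this loop is trivial; the rest of the talk is about why it is not trivial on the GPU.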
Data Dependency • It is easy on the CPU, but the GPU doesn't work this way! • Problems with data dependencies usually don't scale well on the GPU: • cnum_close depends on its previous value. • cclose_ind and cdist both depend on cnum_close.
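The dependency is easiest to see in a sequential sketch (names follow the slides; the exact loop body is an assumption): each accepted element bumps cnum_close, so the write position of element i depends on every earlier element, which blocks a naive one-thread-per-element GPU port.

```python
def compact_sequential(dist, maxdist):
    """CPU-style compaction: cnum_close is a running counter, so the
    write position of element i depends on all prior iterations --
    exactly the dependency that doesn't scale on the GPU."""
    cnum_close = 0
    cclose_ind = [0] * len(dist)
    cdist = [0.0] * len(dist)
    for i, d in enumerate(dist):
        if d < maxdist:
            cclose_ind[cnum_close] = i   # position depends on prior iterations
            cdist[cnum_close] = d
            cnum_close += 1
    return cnum_close, cclose_ind[:cnum_close], cdist[:cnum_close]
```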
GPU Scan: psum • Take the most significant (sign) bit of dist - maxdist: 1 if dist < maxdist (close), 0 if dist > maxdist (not close). This gives the flag array diff. • A prefix sum over diff produces psum; its last element is cnum_close.
GPU Scan: Diff_sum • [Diagram: dist and the flags diff feed the scan; the scanned result Diff_sum assigns each close observation its slot in cdist.]
Extract: cclose_ind • What we have: diff, Diff_sum, and each element's Thread ID. What we want: cclose_ind and cdist with the zeros eliminated. • How can we independently eliminate the zeros and extract the indices?
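The scan steps above can be sketched sequentially (a real GPU scan runs in parallel over threads; the names diff and psum mirror the slides, but the exact data layout is an assumption): flag each element, then take an inclusive prefix sum of the flags.

```python
def scan_pipeline(dist, maxdist):
    """Sequential sketch of the GPU scan steps: a 0/1 flag per element
    (conceptually the sign bit of dist - maxdist), then an inclusive
    prefix sum. The last prefix-sum entry is cnum_close, and each close
    element's prefix-sum value is its (1-based) output slot."""
    diff = [1 if d < maxdist else 0 for d in dist]   # close flags
    psum, running = [], 0
    for f in diff:                                    # inclusive scan
        running += f
        psum.append(running)
    cnum_close = psum[-1] if psum else 0
    return diff, psum, cnum_close
```

Unlike the running counter on the previous slide, a prefix sum has well-known parallel formulations, which is what makes this reformulation GPU-friendly.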
Solution • If diff .not. 0 then cclose_ind = Thread ID; if diff = 0 then throw it away. • Diff_sum gives each surviving element its write position, so every thread does the same independent work: NO branching (divergence) in the scatter!
Solution! • [Diagram: each Thread ID scatters its dist into cdist and its index into cclose_ind at the position given by Diff_sum; the final Diff_sum entry is cnum_close.]
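A minimal sketch of the extraction step, under the assumptions above: the loop index stands in for the Thread ID, every element computes its own write slot from the scan result, and elements with diff = 0 are simply never written. Array names follow the slides; 0-based indexing is an assumption.

```python
def extract(dist, diff, psum):
    """Each element i writes its index and distance to slot psum[i] - 1.
    Slots are unique per close element, so all writes are independent and
    can run as one GPU thread per element; zeros are thrown away by
    never being written."""
    cnum_close = psum[-1] if psum else 0
    cclose_ind = [0] * cnum_close
    cdist = [0.0] * cnum_close
    for i in range(len(dist)):            # one GPU thread per element
        if diff[i] != 0:                  # predicated write, no data race
            cclose_ind[psum[i] - 1] = i   # Thread ID into its unique slot
            cdist[psum[i] - 1] = dist[i]
    return cnum_close, cclose_ind, cdist
```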
Device Functions: • gpu_dist • gpu_scan (si: number of iterations performed in this kernel) • extract (sn: number of gpu_scan blocks that each extract block in this kernel handles) • [Diagram: with si = 2, 8 threads/block, and 16 elements/block, the dist array is split into Block 1 and Block 2 for gpu_scan; with sn = 4, each extract block combines the results of four gpu_scan blocks.]
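A sketch of the block-wise structure behind si and sn: each gpu_scan block scans its chunk of the flags independently, and a combine step (the role extract plays across sn scan blocks) adds in the totals of the preceding blocks. The block size and the serial combine here are illustrative assumptions, not the kernel's actual decomposition.

```python
def blocked_scan(flags, block):
    """Scan each block of `flags` independently (what gpu_scan's blocks
    do in parallel), then add the running total of earlier blocks (what
    extract does when it stitches sn scan-block results together)."""
    psum, carry = [], 0
    for start in range(0, len(flags), block):
        running = 0
        for f in flags[start:start + block]:   # per-block inclusive scan
            running += f
            psum.append(carry + running)
        carry += running                       # block total forwarded on
    return psum
```

How much work each kernel launch does per block is exactly what si and sn control, which is why the conclusion says they need tuning.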
Conclusion • CUDA Fortran on the GPU gave a significant speedup over the CPU (10x+). • Step outside the box: redesign the algorithm rather than porting it directly. • To get good performance, si and sn need to be tuned. • Be careful when using device memory. • There is still room to improve the performance of this project.
Acknowledgements DAReS/IMAGe: Helen Kershaw (Mentor), Nancy Collins (Mentor), Jeff Anderson, Tim Hoar, Kevin Raeder. UCAR, NCAR, University of Wyoming: Kristin Mooney, Silvia Gentile, Carolyn Mueller, Richard Loft, Raghu Raj Prasanna Kumar.