National Sun Yat-sen University Embedded System Laboratory

A Generic Platform for Estimation of Multi-threaded Program Performance on Heterogeneous Multiprocessors
Presenter: Ming-Shiun Yang
Sah, A., Balakrishnan, M., Panda, P.R. Design, Automation & Test in Europe Conference & Exhibition, 2009 (DATE '09).

Abstract
• This paper presents a methodology for software performance estimation that enables design space exploration of heterogeneous multiprocessor systems. Starting from a fork-join representation of the application specification, a high-level description of the target multiprocessor architecture, and a mapping of application components onto architecture resource elements, it estimates the performance of the application on the target architecture. The proposed methodology accounts for the effect of basic compiler optimizations and integrates lightweight memory simulation and instruction mapping for complex instructions to improve estimation accuracy. To estimate performance degradation due to contention for shared resources such as memory and bus, synthetic access traces coupled with an interval analysis technique are employed. The methodology has been validated on a real heterogeneous platform. Results show that the estimation predicts performance with an average error of around 11%.

What’s the problem?
• There are many possible mappings between an application and a hardware architecture.
• How do we know whether the mapping we chose is the best one?
• We need a performance estimator to estimate the performance of each mapping.
• The estimator takes three inputs:
  • Application specification: tasks, data and communication
  • Architecture specification: processors, memory and bus
  • Mapping description: mapping of application components onto architecture components
[Figure: timelines of tasks A, B, C and D on processors P1 and P2 for two mappings, with the saved time marked.]
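The three estimator inputs listed on this slide can be pictured as plain data records. The following is a minimal sketch, not the paper's actual input format: the names `Task`, `Architecture` and `validate_mapping`, and all of their fields, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    """One application component (hypothetical fields)."""
    name: str
    reads: List[str] = field(default_factory=list)   # data items consumed
    writes: List[str] = field(default_factory=list)  # data items produced


@dataclass
class Architecture:
    """Architecture specification: processors, memories and buses."""
    processors: List[str]
    memories: List[str]
    buses: List[str]


def validate_mapping(tasks: List[Task], arch: Architecture,
                     mapping: Dict[str, str]) -> List[str]:
    """Check that every task is mapped onto an existing processor.

    `mapping` goes from task name to processor name.
    Returns the list of problems found (empty if the mapping is usable).
    """
    problems = []
    for task in tasks:
        target = mapping.get(task.name)
        if target is None:
            problems.append(f"task {task.name} is unmapped")
        elif target not in arch.processors:
            problems.append(f"task {task.name} is mapped to unknown processor {target}")
    return problems


# Illustrative values only, loosely following the A-D / P1-P2 figure.
tasks = [Task("A"), Task("B", reads=["x"]), Task("C"), Task("D")]
arch = Architecture(processors=["P1", "P2"], memories=["shared"], buses=["bus0"])
mapping = {"A": "P1", "B": "P2", "C": "P1", "D": "P2"}
print(validate_mapping(tasks, arch, mapping))
```

Keeping the mapping separate from both specifications is what makes design space exploration cheap in such a setup: a new candidate only changes the dictionary, not the application or architecture records.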

Related Works
• Fork-join task graph: represents the parallel phases of computation.
• SUIF Compiler: software profiling.
• HMDES (High Level Machine Description): target processor description, including pipeline stages, memory, link, …
[Figure: these works feed the application specification and the architecture specification used by this paper.]

Proposed Estimation Method
• Estimation of a mapped task on a processor.
• Estimation of communication and synchronization delays of multi-threaded tasks.
• Estimation of contention delays of shared resources.

Estimation of Mapped Task on Uni-Processor
• Introduction
[Figure: Example of a multi-threaded application running on multiprocessors. The application forks tasks onto Processor #1, Processor #2, …, Processor #n, which interact through communication, synchronization and resource contention.]

Estimation of Mapped Task on Uni-Processor
• Estimation Input – Application Specification
  • Fork-join task graph: a task graph of alternating sequential and parallel phases, where each parallel phase consists of independent tasks.
  • Vertex: a task, which is a unit of work in a parallel program.
  • Edge: a precedence relation between a pair of tasks.
[Figure: Examples of fork-join task graphs, with the sequential and parallel phases labeled.]
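To make the vertex and edge definitions concrete, here is a minimal sketch of a fork-join task graph and its critical path. It assumes each parallel task runs on its own processor and ignores communication and contention, so it is only a lower bound and not the paper's estimator; `critical_path_length` and the cycle counts are hypothetical.

```python
from collections import defaultdict


def critical_path_length(task_cycles, edges):
    """Longest path through a task DAG, weighting each vertex by its cycles.

    task_cycles: dict mapping task name -> estimated cycles (vertex weight).
    edges: list of (before, after) precedence pairs.
    """
    successors = defaultdict(list)
    indegree = {task: 0 for task in task_cycles}
    for before, after in edges:
        successors[before].append(after)
        indegree[after] += 1

    # Topological traversal; finish[t] is the earliest time t can complete.
    finish = {t: task_cycles[t] for t in task_cycles if indegree[t] == 0}
    ready = list(finish)
    while ready:
        task = ready.pop()
        for nxt in successors[task]:
            finish[nxt] = max(finish.get(nxt, 0), finish[task] + task_cycles[nxt])
            indegree[nxt] -= 1
            if indegree[nxt] == 0:  # all predecessors done, value is final
                ready.append(nxt)
    return max(finish.values())


# One sequential phase (fork), one parallel phase (t1..t3), one sequential phase (join).
cycles = {"fork": 10, "t1": 40, "t2": 25, "t3": 30, "join": 5}
edges = [("fork", "t1"), ("fork", "t2"), ("fork", "t3"),
         ("t1", "join"), ("t2", "join"), ("t3", "join")]
print(critical_path_length(cycles, edges))  # 10 + 40 + 5 = 55
```

For a fork-join graph the critical path reduces to the sum of the sequential phases plus the slowest task of each parallel phase, which is why the parallel phase above contributes 40 cycles rather than 95.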

Estimation of Mapped Task on Uni-Processor
• Register Allocation
• Notation: bb: basic block; L: latency; T: total time; F: frequency
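The formula from this slide did not survive, but the notation suggests the usual profile-based form, in which the total time T is the sum over basic blocks of latency L times execution frequency F. The sketch below assumes exactly that form; it is not confirmed by the source, and the function name and the numbers are hypothetical.

```python
def estimate_task_cycles(blocks):
    """Sum latency * frequency over basic blocks (assumed form of T).

    blocks: dict mapping basic-block name -> (latency_cycles, frequency),
        where latency is the estimated cycles for one execution of the block
        and frequency is how many times profiling saw it execute.
    """
    return sum(latency * frequency for latency, frequency in blocks.values())


# Illustrative profile: an entry block, a hot loop body, and an exit block.
profile = {
    "bb0_entry": (12, 1),
    "bb1_loop": (8, 100),
    "bb2_exit": (20, 10),
}
print(estimate_task_cycles(profile))  # 12*1 + 8*100 + 20*10 = 1012
```

Under this form, per-block latency is where compiler effects such as register allocation would show up, since spills add memory instructions to the block.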

Experiment – Uni-Processor Estimation
• Estimated processors
  • Cradle PE
  • Leon3 with FPU
  • SS-mips
• Estimated and actual execution cycles
  • Average error rate: 14%

Estimation of communication and synchronization delays of multi-threaded tasks
• An application may be composed of some sequential pre/post-processing and nested fork-joins.
• A fork-join may be iterated many times.
• The estimate includes the shared resource contention delay.
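The structure described on this slide (sequential pre/post-processing around nested, possibly iterated fork-joins) can be evaluated recursively. This is a minimal sketch under stated assumptions: the `Seq` and `ForkJoin` classes are hypothetical, and the per-iteration `overhead` field stands in for communication, synchronization and contention delays that the paper estimates separately.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Seq:
    cycles: int  # sequential work, e.g. pre- or post-processing


@dataclass
class ForkJoin:
    branches: List[List["Node"]]  # each branch is a sequence of nodes
    iterations: int = 1           # how many times the fork-join repeats
    overhead: int = 0             # assumed per-iteration sync/contention cycles


Node = Union[Seq, ForkJoin]


def estimate(nodes):
    """Total cycles of a sequence of nodes, recursing into nested fork-joins."""
    total = 0
    for node in nodes:
        if isinstance(node, Seq):
            total += node.cycles
        else:
            # A fork-join finishes when its slowest branch finishes.
            slowest = max(estimate(branch) for branch in node.branches)
            total += node.iterations * (slowest + node.overhead)
    return total


inner = ForkJoin(branches=[[Seq(10)], [Seq(20)]], iterations=2)
app = [
    Seq(100),                                        # pre-processing
    ForkJoin(branches=[[Seq(50)], [Seq(30), inner]],
             iterations=3, overhead=5),
    Seq(40),                                         # post-processing
]
print(estimate(app))  # 100 + 3 * (70 + 5) + 40 = 365
```

In the example the second branch costs 30 + 2 * 20 = 70 cycles, so it dominates the 50-cycle branch in every iteration.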

Interval Analysis
• Interval analysis → generate access rate data
• Quantities used: request arrival rate, time interval, request served rate

Interval Analysis – Delay Calculation
• Spreading of interval
[Figure: interval spreading into the next time slot for processors P1, P2, P3 and P4.]
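The delay formulas from these slides are not preserved, so the following is only a rough sketch of the general idea of spreading: when the requests arriving in one time interval exceed what the shared resource can serve, the excess carries over into the next interval and shows up as delay. The function name, the fixed service capacity and the request counts are all assumptions, not the paper's model.

```python
def contention_backlog(arrivals_per_interval, service_capacity):
    """Track unserved requests for one shared resource, interval by interval.

    arrivals_per_interval: list of lists; entry [i][p] is the number of
        requests processor p issues during interval i.
    service_capacity: requests the resource can serve per interval.
    Returns (backlog_history, waited), where waited counts request-intervals
    spent queued, a simple proxy for contention delay.
    """
    backlog = 0
    backlog_history = []
    waited = 0
    for arrivals in arrivals_per_interval:
        demand = backlog + sum(arrivals)
        served = min(demand, service_capacity)
        backlog = demand - served   # excess spreads into the next interval
        backlog_history.append(backlog)
        waited += backlog           # each leftover request waits one more interval
    return backlog_history, waited


# Synthetic access counts for four processors (P1..P4) over four intervals.
arrivals = [
    [2, 1, 0, 1],
    [3, 3, 2, 2],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
history, waited = contention_backlog(arrivals, service_capacity=5)
print(history, waited)  # [0, 5, 1, 0] 6
```

Working with per-interval rates rather than individual accesses is what keeps this kind of analysis much cheaper than trace-driven simulation, at the cost of accuracy inside each interval.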

Experiment – Interval Analysis (1)
[Figure: plot against time interval.]

Experiment – Interval Analysis (2)
[Figure: plot against time interval.]

Experimental HW
• Architecture of the Cradle CT3400 heterogeneous multi-processor chip.
  • 4 processors
  • 4 DSEs (Digital Signal Engines)

Experimental SW and Mapping
• JPEG application
[Figure: the mapping description, with four parallel regions marked. P: processor; D: DSE (Digital Signal Engine).]

Experimental Results
• Estimated cycles for 8 mappings of the JPEG application over the Cradle architecture.
• Estimated cycles without contention delay must be lower than those that include it.

Conclusion and Comment
• Conclusion
  • The paper presents a framework for retargetable performance estimation of multi-threaded applications on heterogeneous multi-processors.
  • The estimated performance includes the shared resource contention delay and the task execution time on a uni-processor.
• Comment
  • The mapping of multi-threaded application components onto the hardware architecture is very important for improving performance.
  • The error rate is still high (around 11% on average, and 14% for uni-processor estimation).
