Dataflow Execution of Sequential Imperative Programs on Multicore Architectures 陳品杰 Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
Introduction • Presents a novel execution model • Enables statically-sequential programs to execute in parallel • Programs in this model are easy to develop yet expose parallelism • Using an ordinary imperative programming language (e.g., C++), statically-sequential programs can achieve dataflow parallel execution • Sequential programs are dynamically parallelized and executed in dataflow fashion on multiple processing cores
Dataflow Execution of Sequential Imperative Programs (1/6) • In the dataflow model, program execution is data-driven • In contrast to control flow, which executes instructions sequentially, an instruction executes as soon as its operands become available • Data-dependent instructions automatically execute in order, while independent instructions can execute in parallel • To successfully adopt a similar model in a multicore environment, several known and new challenges must be addressed: • a practical programming paradigm • dependences • resource management • principles for applying the model on multicores
Dataflow Execution of Sequential Imperative Programs (2/6) • Dataflow on Multicores • Traditional dataflow machines exploit instruction-level parallelism (ILP) • Here the granularity is raised to functions, which are already used for task-level computation • promotes code reuse • better suited to the scale of multicores • Functions are executed on individual cores in dataflow fashion • achieving Function-Level Parallelism (FLP)
Dataflow Execution of Sequential Imperative Programs (3/6) • Dataflow on Multicores • At run time, the sequential program is processed one function at a time (rather than one instruction at a time) • A function's operands must be identified before it executes • Operands are generalized from individual registers or memory addresses to objects • A function's input operands form its read set • its output operands form its write set • together they are called its data set • the sets are built on the C++ STL (a sketch follows below)
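As a rough illustration of the read/write/data-set idea, the sketch below groups shared objects into C++ STL sets. The slides only say that the sets are STL-based; the SharedObject type and the object names here are assumptions made for this sketch, not the prototype's actual classes.

```cpp
// Illustrative only: shared objects carry an identity, and a function's
// operands are grouped into STL-based read and write sets of such objects.
#include <iostream>
#include <set>
#include <string>

struct SharedObject {
    std::string name;                          // identity used for dependence tracking
    explicit SharedObject(std::string n) : name(std::move(n)) {}
};

int main() {
    SharedObject A{"A"}, B{"B"}, C{"C"};

    std::set<SharedObject*> read_set{&A};       // input operands: only read
    std::set<SharedObject*> write_set{&B, &C};  // output operands: may be modified

    // The data set is simply the union of the two sets.
    std::set<SharedObject*> data_set(read_set);
    data_set.insert(write_set.begin(), write_set.end());

    std::cout << "data set holds " << data_set.size() << " objects\n";  // prints 3
}
```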
Dataflow Execution of Sequential Imperative Programs (4/6) • Dataflow on Multicores • Every object in a data set has an identity • used to establish data dependences between functions • to determine whether the function about to execute depends on any currently executing function(s) • If there is no dependence, the function is submitted (or delegated) to a core for execution • If there is a dependence, the function is shelved until the dependence is resolved • In either case, the runtime then moves on to the next function of the sequential program
Dataflow Execution of Sequential Imperative Programs (5/6) • Handling Dependences • Tokens are used to handle dependences in the dataflow mechanism • The technique resembles that of conventional dataflow mechanisms, with two key improvements • First: instead of associating tokens with individual memory addresses, as in traditional dataflow, tokens are associated with objects, providing data abstraction • Second: each object is given multiple read tokens but only one write token • thereby managing the production and consumption of data
Dataflow Execution of Sequential Imperative Programs (6/6) • Handling Dependences • When a function is to be executed via dataflow execution: • the function requests read (write) tokens for the objects in its read (write) set • once the function has acquired all required tokens, it is ready and may execute • when the function finishes, it relinquishes all its tokens to shelved functions that need them • when a shelved function obtains the tokens it needs, it is unshelved and executed • The model can also execute certain functions serially • such a function executes only after all operations preceding it in program order have completed • and subsequent operations execute only after that function completes
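The token lifecycle above can be pictured with a minimal, single-threaded sketch: try to acquire tokens for the data set, execute if all are granted, otherwise shelve, and release the tokens on completion. Obj, Task, try_acquire and release are illustrative names assumed here, not the runtime's real types.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct Obj {                         // a shared object with its token state
    std::string name;
    int readers = 0;                 // read tokens currently granted
    bool writer = false;             // whether the single write token is granted
};

struct Task {                        // one dynamic function invocation
    std::string name;
    std::vector<Obj*> read_set, write_set;
    std::function<void()> body;
};

bool try_acquire(Task& t) {
    for (Obj* o : t.read_set)                      // a read needs the write token to be free
        if (o->writer) return false;
    for (Obj* o : t.write_set)                     // a write needs no readers and no writer
        if (o->writer || o->readers > 0) return false;
    for (Obj* o : t.read_set)  ++o->readers;       // grant read tokens
    for (Obj* o : t.write_set) o->writer = true;   // grant the write token
    return true;
}

void release(Task& t) {                            // give tokens back on completion
    for (Obj* o : t.read_set)  --o->readers;
    for (Obj* o : t.write_set) o->writer = false;
}

int main() {
    Obj A{"A"}, B{"B"};
    Task t1{"T1", {&A}, {&B}, [] { std::cout << "T1 runs\n"; }};
    std::vector<Task*> shelved;                    // functions waiting for tokens

    if (try_acquire(t1)) { t1.body(); release(t1); }   // ready: execute, then release
    else                 { shelved.push_back(&t1); }   // dependent: shelve for later
}
```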
Dataflow Execution of Sequential Imperative Programs (1/7) • Model Overview – Example • Figure 1. (a) Example pseudocode that invokes functions T and T'. T: {write set} {read set} modifies (reads) objects in its write set (read set). Data set of T' is unknown. • Figure 1. (b) Dynamic invocations of the functions T and T', in program order, and the data set of each invocation.
Dataflow Execution of Sequential Imperative Programs (2/7) • Model Overview – Data dependences (RAW, WAR, WAW) between the functions • Figure 1. (c) Dataflow graph of the dynamic function stream.
Dataflow Execution of Sequential Imperative Programs (3/7) • Model Overview – Execution of the code as per the model – t1 • T1 successfully acquires a read token for object A and write tokens for B & C > submitted for execution • T2 successfully acquires a read token for object A and a write token for D > submitted for execution • T6 successfully acquires a read token for object H and a write token for G > submitted for execution • Figure 1. (d) Dataflow execution schedule of the function stream
Dataflow Execution of Sequential Imperative Programs (4/7) • Model Overview – Execution of the code as per the model – t2 • T1 finishes execution > releases write tokens B & C > T4 acquires write token B but still lacks read token D
Dataflow Execution of Sequential Imperative Programs (5/7) • Model Overview – Execution of the code as per the model – t3 • T2 finishes execution > releases write token D and read token A > T3 acquires write token A and starts execution > T4 acquires read token D and starts execution
Dataflow Execution of Sequential Imperative Programs (6/7) • Model Overview – Execution of the code as per the model – t4 • T4 finishes execution > releases write token B and read token D > T5 acquires write token B and starts execution
Dataflow Execution of Sequential Imperative Programs (7/7) • Model Overview – Execution of the code as per the model – after t4 • T' will be submitted for execution after all previous functions complete.
Dataflow Execution of Sequential Imperative Programs (1/1) • Deadlock Avoidance • If two or more functions form a circular token dependence, the token mechanism of a conventional dataflow model can deadlock • For example, invocations T4 and T5 could create the request order: • T4: acquires B → T5: waits for B → T5: acquires D → T4: waits for D > resulting in deadlock • To avoid token deadlocks, the runtime ensures that: • (i) a token may be requested by only one function at a time • (ii) an object's tokens are granted in the order the functions request them (first requested, first granted) • Hence T5 can request its tokens only after T4
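The ordered-grant rule can be illustrated with a toy sketch: each object's token keeps a FIFO queue of requests, and a function receives the token only when it heads the queue. Because requests are issued one function at a time in program order, T5 queues up behind T4 on both B and D, so the circular wait above cannot form. All names below are illustrative; this is not the prototype's code.

```cpp
#include <deque>
#include <iostream>
#include <map>
#include <string>

int main() {
    // Per-object FIFO queues of pending token requests, in program order.
    std::map<std::string, std::deque<std::string>> requests;
    requests["B"] = {"T4", "T5"};   // T4 requested B before T5 did
    requests["D"] = {"T4", "T5"};   // ...and D as well

    // An object's token may be granted to fn only when fn heads the queue.
    auto may_grant = [&](const std::string& obj, const std::string& fn) {
        const auto& q = requests[obj];
        return !q.empty() && q.front() == fn;
    };

    std::cout << std::boolalpha
              << "grant B to T4? " << may_grant("B", "T4") << '\n'   // true
              << "grant D to T5? " << may_grant("D", "T5") << '\n';  // false: T4 is first
}
```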
Prototype Implementation (1/4) • A software prototype of the execution model is developed as a C++ runtime library • Static Sequential Program • To fully support an imperative language, the model lets a program switch between dataflow and serial execution • Programs in the current model are written in C++, just like traditional sequential programs • In particular, the user must identify • the potential parallelism between functions • the objects shared between functions • the read set and write set
Prototype Implementation (2/4) • Static Sequential Program • Dataflow Functions • The library provides the df_execute interface for executing program functions in parallel • df_execute is a runtime function implemented with C++ templates • Statements outside df_execute calls execute in their specified (program) order • Shared data, in the form of global, passed-by-reference objects or pointers to them, that are accessed by a function are passed to it as arguments. • Users group them into two sets, one that may be modified (write set) and another that is only read (read set). • The C++ STL-based set data structure of the token base class is used to create them. • Figure 2. Example program in the proposed model.
Prototype Implementation (3/4) • Static Sequential Program • Serial Segments/Functions • The user can return to serial execution through the df_end interface • df_end is similar to a barrier: it takes the program out of dataflow execution • Once all code before df_end has finished executing, the code that follows runs serially • T' in our example. • Figure 2. Example program in the proposed model.
Prototype Implementation (4/4) • Static Sequential Program • Serial Segments/Functions • To operate on a shared object in program order within the main program, the df_seq interface is provided • df_seq accepts the object instance, the function (object method) pointer and any arguments to it. • df_seq executes only after all earlier functions that use the object have completed, and the code that follows it proceeds only after df_seq finishes • df_seq causes the runtime to suspend the main program context until the associated function finishes operating on the specified object. • Execution will proceed from line 6 only after print finishes, but potentially in parallel with other (prior) functions (that are not accessing G). • Figure 2. Example program in the proposed model.
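A hand-written sketch of what a program in this model might look like, loosely following the description of Figure 2. The exact signatures of df_execute, df_seq and df_end are assumptions here (the slides describe them only informally), and obj_t, T1 and the stub bodies are placeholders that run everything serially so the example is self-contained.

```cpp
#include <functional>
#include <iostream>
#include <set>

struct obj_t {                            // placeholder shared-object type
    const char* name;
    void print() const { std::cout << name << "\n"; }
};

// Hypothetical stand-ins for the runtime interfaces described on these slides.
// A real implementation would acquire tokens and delegate work to other cores;
// these stubs run everything serially so the example compiles and runs.
void df_execute(std::set<obj_t*> /*write_set*/, std::set<obj_t*> /*read_set*/,
                std::function<void()> fn) { fn(); }
template <class O, class M>
void df_seq(O& obj, M method) { (obj.*method)(); }
void df_end() { /* a real runtime would wait here for all in-flight functions */ }

obj_t A{"A"}, B{"B"}, G{"G"};

void T1(obj_t& b, const obj_t& a) { (void)b; (void)a; /* reads a, modifies b */ }

int main() {
    // Delegate T1 for (potentially parallel) dataflow execution:
    // write set {B}, read set {A}.
    df_execute({&B}, {&A}, [] { T1(B, A); });

    // Serialize on G: runs only after all earlier functions accessing G finish,
    // and the main program resumes only after print() completes.
    df_seq(G, &obj_t::print);

    df_end();   // barrier: code after this (T' in the example) executes serially
}
```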
Runtime Mechanics (1/10) • The mechanism is implemented with multiple threads; parallel execution is realized with the Pthread API • Thread management is abstracted away so that users never deal with it directly • Users need not understand the underlying structure of the mechanism • Executing Functions on Processing Cores • At the start of a program, the runtime creates threads, usually one per hardware context available to it. • A double-ended work queue (deque) is then assigned to each thread in the system. • Computations are scheduled for execution by a thread by queuing them in the corresponding work deque.
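A rough sketch of the setup just described: one worker thread per hardware context, each with its own double-ended work queue. It uses std::thread and a mutex-guarded std::deque for simplicity, whereas the prototype uses the Pthread API and a work-stealing scheduler; Worker and the seeded tasks are assumptions for this sketch.

```cpp
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

using Work = std::function<void()>;

struct Worker {
    std::deque<Work> deque;   // this thread's double-ended work queue
    std::mutex m;             // a real runtime would use a lock-free / stealing deque
};

int main() {
    unsigned n = std::thread::hardware_concurrency();  // usually one thread per context
    if (n == 0) n = 2;
    std::vector<Worker> workers(n);

    // Schedule one piece of work on each worker by queuing it in its deque.
    for (unsigned i = 0; i < n; ++i)
        workers[i].deque.push_back([i] { std::printf("worker %u ran a task\n", i); });

    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i)
        threads.emplace_back([&workers, i] {
            std::lock_guard<std::mutex> g(workers[i].m);
            while (!workers[i].deque.empty()) {         // drain this thread's own deque
                Work w = std::move(workers[i].deque.front());
                workers[i].deque.pop_front();
                w();
            }
        });

    for (auto& t : threads) t.join();
}
```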
Runtime Mechanics (2/10) • Discovering Functions for Parallel Execution • At the start of execution only one processor runs the program; the other processors sit idle, waiting for work to arrive in their deques • The beginning of execution therefore resembles serial execution • When a df_execute call is encountered, the runtime is activated
Runtime Mechanics (3/10) • Discovering Functions for Parallel Execution • The runtime processes a dataflow function in three decoupled phases, • prelude • execute • postlude Figure 3. Logical view of runtime operations to process a dataflow function.
Runtime Mechanics (4/10) • Discovering Functions for Parallel Execution • In the prelude phase, the runtime dereferences pointers to objects in the read/write sets, if need be, and attempts to acquire the tokens • Execute phase: successful acquisition of the tokens leads to the execute phase (Figure 3: 2), in which the function is delegated for (potentially parallel) execution • Specifically, the runtime pushes the program continuation (the remainder of the program past the df_execute call) onto the thread's work deque, and executes the function on the same thread.
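The execute phase can be sketched as follows: on successful token acquisition, the runtime queues the program continuation on the current thread's work deque and runs the delegated function on the same thread (the prelude and postlude are only hinted at in comments). The names below are assumptions for illustration, not the prototype's API.

```cpp
#include <deque>
#include <functional>
#include <iostream>

std::deque<std::function<void()>> work_deque;   // this thread's work deque

void process_dataflow_function(std::function<void()> fn,
                               std::function<void()> continuation) {
    // Prelude (omitted): dereference read/write-set pointers, try to acquire tokens.
    // Execute phase: queue the rest of the program so an idle thread could steal
    // it, then run the delegated function right here.
    work_deque.push_back(std::move(continuation));
    fn();
    // Postlude (omitted): release tokens and wake any shelved functions.
}

int main() {
    process_dataflow_function(
        [] { std::cout << "delegated function runs on this thread\n"; },
        [] { std::cout << "program continuation resumes later\n"; });

    // With no other threads in this toy, simply run whatever was queued.
    while (!work_deque.empty()) {
        auto w = std::move(work_deque.front());
        work_deque.pop_front();
        w();
    }
}
```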
Runtime Mechanics (5/10) • Discovering Functions for Parallel Execution • A task-stealing scheduler • running on each hardware context will cause an idle processor to steal the program continuation and continue its execution until it encounters the next df_execute, repeating the process of delegation and pushing of the program continuation onto its work deque. • Thus the execution of the program unravels in parallel with the executing functions, possibly on different hardware contexts rather than on a single one.
Runtime Mechanics (6/10) • Tokens and Dependency Tracking • During program execution, each allocated object has • one write token • an unbounded number of read tokens (limited only by the number of bits used to represent tokens) • a wait list • Tokens are acquired for the objects that dataflow functions operate on • and released when the functions complete
Runtime Mechanics (7/10) • Tokens and Dependency Tracking • A token may be granted only if it is available. • Figure 4a gives the definition of availability of read and write tokens, • and Figure 4b shows the token acquisition protocol. • The wait list is used to track functions to which the token could not be granted at the time of their requests. • A non-empty wait list signifies pending requests, in the enlisted order. • An available token is not granted if an earlier function enqueued in the wait list is waiting to acquire it (Figure 4b: 1).
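Figure 4a is not reproduced in the slides, so the availability rules below are a plausible reading of the one-write-token / many-read-token scheme rather than the paper's exact definitions; TokenState and the function names are assumptions for this sketch.

```cpp
#include <iostream>

struct TokenState {
    int  read_tokens_out = 0;      // read tokens currently granted for the object
    bool write_token_out = false;  // whether the single write token is granted
};

// A read token is available while no one holds the write token.
bool read_token_available(const TokenState& t)  { return !t.write_token_out; }

// The write token is available only when no tokens at all are outstanding.
bool write_token_available(const TokenState& t) {
    return !t.write_token_out && t.read_tokens_out == 0;
}

int main() {
    TokenState A;
    A.read_tokens_out = 2;         // two functions are currently reading A
    std::cout << std::boolalpha
              << "read token available?  " << read_token_available(A)  << '\n'   // true
              << "write token available? " << write_token_available(A) << '\n';  // false
    // Note: even an available token is not granted if an earlier function is
    // already enqueued in the object's wait list (Figure 4b: 1).
}
```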
Runtime Mechanics (8/10) • Tokens and Dependency Tracking • Figure 4. The token protocol: (a) Definition of availability; (b) Read/Write token acquisition
Runtime Mechanics (9/10) • Shelving Functions/Program Continuations • If a function's tokens cannot be acquired, • the function is enqueued in the wait lists of all the objects for which tokens could not be acquired (Figure 4b: 4 or 5), and subsequently shelved (Figure 3: 1.2). • While the shelved function waits for its dependences to resolve, the runtime looks for other independent work from the program continuation to perform • Figure 4. The token protocol: (b) Read/Write token acquisition
Runtime Mechanics (10/10) • Completion of Function Execution Figure 3. Logical view of runtime operations to process a dataflow function. Figure 4. The token protocol:(c) Token release
Example Execution (1/15) • Initial state: no tokens are held on any of the objects A–H, and CPU0–CPU2 are idle • Figure 5. Example execution
Example Execution (2/15) • T1 acquires its tokens and begins executing on CPU0 • Figure 5. Example execution
Example Execution (3/15) • An idle CPU steals the program continuation; T2 acquires its tokens and begins executing on CPU1 • Figure 5. Example execution
Example Execution (4/15) • The program continuation is stolen again as execution unravels past T3 • Figure 5. Example execution
Example Execution (5/15) • T4 is encountered; the token state and CPU activity are shown in the figure • Figure 5. Example execution
Example Execution (6/15) • T5 is encountered; the token state and CPU activity are shown in the figure • Figure 5. Example execution
Example Execution (7/15) • T6 acquires its tokens (including a read token on H) and begins executing on CPU2 • Figure 5. Example execution
Example Execution (8/15) • A pending function cannot execute yet ("Can't execute" in the figure) while T2 and T6 continue running • Figure 5. Example execution
Example Execution (9/15) • df_seq causes the runtime to shelve the program continuation beyond df_seq in G's wait list and await completion of all functions accessing G • Figure 5. Example execution
Example Execution (10/15) • Token state after the continuation is shelved; see the figure • Figure 5. Example execution
Example Execution (11/15) • T3 begins executing • Figure 5. Example execution
Example Execution (12/15) • T4 is executing; T5 will be scheduled for execution once T4 completes • Figure 5. Example execution
Example Execution (13/15) • G.print() executes; after print completes, the runtime schedules the program continuation for execution • Figure 5. Example execution
Example Execution (14/15) • The continuation is shelved again, preventing further processing of the program until all in-flight functions finish • Figure 5. Example execution
Example Execution (15/15) • T' executes after all previous functions complete • Figure 5. Example execution
Conclusion • Presented a novel execution model that achieves function-level parallel execution of statically-sequential imperative programs on multicore processors. • Parallel tasks (program functions) are dynamically extracted from a sequential program and executed in dataflow fashion on multiple processing cores, using tokens associated with shared data objects and a token protocol to manage the dependences between tasks. • The model thus combines the benefits of sequential programming and dataflow execution.