Dataflow Execution of Sequential Imperative Programs on Multicore Architectures 陳品杰 Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
Introduction • Presents a novel execution model • Enables statically-sequential programs to execute in parallel • Programs in this model are easy to develop yet expose parallelism • Using an ordinary imperative programming language (e.g., C++), statically-sequential programs can achieve dataflow parallel execution • Sequential programs are dynamically parallelized and executed in dataflow fashion on multiple processing cores
Dataflow Execution of Sequential Imperative Programs (1/6) • In the dataflow model, program execution is data-driven • In contrast to control flow, which executes instructions sequentially, an instruction executes as soon as its operands become available • Data-dependent instructions automatically execute in order, while independent instructions can execute in parallel • To successfully adopt a similar model in a multicore environment, several known and new challenges must be addressed: • a practical programming paradigm • dependences • resource management • principles for applying the model on multicores
Dataflow Execution of Sequential Imperative Programs (2/6) • Dataflow on Multicores • Traditional dataflow machines exploit instruction-level parallelism (ILP) • Here the granularity is raised to functions, which are already used for task-level computation • promotes code reuse • better suited to the scale of multicores • Functions are executed on individual cores in dataflow fashion • achieving Function-Level Parallelism (FLP)
Dataflow Execution of Sequential Imperative Programs (3/6) • Dataflow on Multicores • At run time, the sequential program is processed one function at a time (rather than one instruction at a time) • A function's operands must be identified before it executes • Operands are generalized from individual registers or memory addresses to objects • A function's input operands form its read set • its output operands form its write set • together they are called its data set • the sets are built on the C++ STL (a sketch follows below)
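As a rough illustration of the read/write/data-set idea, the sketch below groups shared objects into C++ STL sets. The slides only say that the sets are STL-based; the SharedObject type and the object names here are assumptions made for this sketch, not the prototype's actual classes.

```cpp
// Illustrative only: shared objects carry an identity, and a function's
// operands are grouped into STL-based read and write sets of such objects.
#include <iostream>
#include <set>
#include <string>

struct SharedObject {
    std::string name;                          // identity used for dependence tracking
    explicit SharedObject(std::string n) : name(std::move(n)) {}
};

int main() {
    SharedObject A{"A"}, B{"B"}, C{"C"};

    std::set<SharedObject*> read_set{&A};       // input operands: only read
    std::set<SharedObject*> write_set{&B, &C};  // output operands: may be modified

    // The data set is simply the union of the two sets.
    std::set<SharedObject*> data_set(read_set);
    data_set.insert(write_set.begin(), write_set.end());

    std::cout << "data set holds " << data_set.size() << " objects\n";  // prints 3
}
```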
Dataflow Execution of Sequential Imperative Programs (4/6) • Dataflow on Multicores • Every object in a data set has an identity • used to establish data dependences between functions • to determine whether the function about to execute depends on any currently executing function(s) • If there is no dependence, the function is submitted (or delegated) to a core for execution • If there is a dependence, the function is shelved until the dependence is resolved • In either case, the runtime then moves on to the next function of the sequential program
Dataflow Execution of Sequential Imperative Programs (5/6) • Handling Dependences • Tokens are used to handle dependences in the dataflow mechanism • The technique resembles that of conventional dataflow mechanisms, with two key improvements • First: instead of associating tokens with individual memory addresses, as in traditional dataflow, tokens are associated with objects, providing data abstraction • Second: each object is given multiple read tokens but only one write token • thereby managing the production and consumption of data
Dataflow Execution of Sequential Imperative Programs (6/6) • Handling Dependences • When a function is to be executed via dataflow execution: • the function requests read (write) tokens for the objects in its read (write) set • once the function has acquired all required tokens, it is ready and may execute • when the function finishes, it relinquishes all its tokens to shelved functions that need them • when a shelved function obtains the tokens it needs, it is unshelved and executed • The model can also execute certain functions serially • such a function executes only after all operations preceding it in program order have completed • and subsequent operations execute only after that function completes
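The token lifecycle above can be pictured with a minimal, single-threaded sketch: try to acquire tokens for the data set, execute if all are granted, otherwise shelve, and release the tokens on completion. Obj, Task, try_acquire and release are illustrative names assumed here, not the runtime's real types.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct Obj {                         // a shared object with its token state
    std::string name;
    int readers = 0;                 // read tokens currently granted
    bool writer = false;             // whether the single write token is granted
};

struct Task {                        // one dynamic function invocation
    std::string name;
    std::vector<Obj*> read_set, write_set;
    std::function<void()> body;
};

bool try_acquire(Task& t) {
    for (Obj* o : t.read_set)                      // a read needs the write token to be free
        if (o->writer) return false;
    for (Obj* o : t.write_set)                     // a write needs no readers and no writer
        if (o->writer || o->readers > 0) return false;
    for (Obj* o : t.read_set)  ++o->readers;       // grant read tokens
    for (Obj* o : t.write_set) o->writer = true;   // grant the write token
    return true;
}

void release(Task& t) {                            // give tokens back on completion
    for (Obj* o : t.read_set)  --o->readers;
    for (Obj* o : t.write_set) o->writer = false;
}

int main() {
    Obj A{"A"}, B{"B"};
    Task t1{"T1", {&A}, {&B}, [] { std::cout << "T1 runs\n"; }};
    std::vector<Task*> shelved;                    // functions waiting for tokens

    if (try_acquire(t1)) { t1.body(); release(t1); }   // ready: execute, then release
    else                 { shelved.push_back(&t1); }   // dependent: shelve for later
}
```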
Dataflow Execution of Sequential Imperative Programs (1/7) • Model Overview – Example • Figure 1. (a) Example pseudocode that invokes functions T and T'. T: {write set} {read set} modifies (reads) objects in its write set (read set). Data set of T' is unknown. • Figure 1. (b) Dynamic invocations of the functions T and T', in program order, and the data set of each invocation.
Dataflow Execution of Sequential Imperative Programs (2/7) • Model Overview – Data dependences (RAW, WAR, WAW) between the functions • Figure 1. (c) Dataflow graph of the dynamic function stream.
Dataflow Execution of Sequential Imperative Programs (3/7) • Model Overview – Execution of the code as per the model – t1 • T1 successfully acquires a read token for object A and write tokens for B & C > submitted for execution • T2 successfully acquires a read token for object A and a write token for D > submitted for execution • T6 successfully acquires a read token for object H and a write token for G > submitted for execution • Figure 1. (d) Dataflow execution schedule of the function stream
Dataflow Execution of Sequential Imperative Programs (4/7) • Model Overview – Execution of the code as per the model – t2 • T1 finishes execution > releases write tokens B & C > T4 acquires write token B but still lacks read token D
Dataflow Execution of Sequential Imperative Programs (5/7) • Model Overview – Execution of the code as per the model – t3 • T2 finishes execution > releases write token D and read token A > T3 acquires write token A and starts execution > T4 acquires read token D and starts execution
Dataflow Execution of Sequential Imperative Programs (6/7) • Model Overview – Execution of the code as per the model – t4 • T4 finishes execution > releases write token B and read token D > T5 acquires write token B and starts execution
Dataflow Execution of Sequential Imperative Programs (7/7) • Model Overview – Execution of the code as per the model – after t4 • T' will be submitted for execution after all previous functions complete.
Dataflow Execution of Sequential Imperative Programs (1/1) • Deadlock Avoidance • If two or more functions form a circular token dependence, the token mechanism of a conventional dataflow model can deadlock • For example, invocations T4 and T5 could create the request order: • T4: acquires B → T5: waits for B → T5: acquires D → T4: waits for D > resulting in deadlock • To avoid token deadlocks, the runtime ensures that: • (i) a token may be requested by only one function at a time • (ii) an object's tokens are granted in the order the functions request them (first requested, first granted) • Hence T5 can request its tokens only after T4
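The ordered-grant rule can be illustrated with a toy sketch: each object's token keeps a FIFO queue of requests, and a function receives the token only when it heads the queue. Because requests are issued one function at a time in program order, T5 queues up behind T4 on both B and D, so the circular wait above cannot form. All names below are illustrative; this is not the prototype's code.

```cpp
#include <deque>
#include <iostream>
#include <map>
#include <string>

int main() {
    // Per-object FIFO queues of pending token requests, in program order.
    std::map<std::string, std::deque<std::string>> requests;
    requests["B"] = {"T4", "T5"};   // T4 requested B before T5 did
    requests["D"] = {"T4", "T5"};   // ...and D as well

    // An object's token may be granted to fn only when fn heads the queue.
    auto may_grant = [&](const std::string& obj, const std::string& fn) {
        const auto& q = requests[obj];
        return !q.empty() && q.front() == fn;
    };

    std::cout << std::boolalpha
              << "grant B to T4? " << may_grant("B", "T4") << '\n'   // true
              << "grant D to T5? " << may_grant("D", "T5") << '\n';  // false: T4 is first
}
```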
Prototype Implementation (1/4) • A software prototype of the execution model is developed as a C++ runtime library • Static Sequential Program • To fully support an imperative language, the model lets a program switch between dataflow and serial execution • Programs in the current model are written in C++, just like traditional sequential programs • In particular, the user must identify • the potential parallelism between functions • the objects shared between functions • the read set and write set
Prototype Implementation (2/4) • Static Sequential Program • Dataflow Functions • The library provides the df_execute interface for executing program functions in parallel • df_execute is a runtime function implemented with C++ templates • Statements outside df_execute calls execute in their specified (program) order • Shared data, in the form of global, passed-by-reference objects or pointers to them, that are accessed by a function are passed to it as arguments. • Users group them into two sets, one that may be modified (write set) and another that is only read (read set). • The C++ STL-based set data structure of the token base class is used to create them. • Figure 2. Example program in the proposed model.
Prototype Implementation (3/4) • Static Sequential Program • Serial Segments/Functions • The user can return to serial execution through the df_end interface • df_end is similar to a barrier: it takes the program out of dataflow execution • Once all code before df_end has finished executing, the code that follows runs serially • T' in our example. • Figure 2. Example program in the proposed model.
Prototype Implementation (4/4) • Static Sequential Program • Serial Segments/Functions • To operate on a shared object in program order within the main program, the df_seq interface is provided • df_seq accepts the object instance, the function (object method) pointer and any arguments to it. • df_seq executes only after all earlier functions that use the object have completed, and the code that follows it proceeds only after df_seq finishes • df_seq causes the runtime to suspend the main program context until the associated function finishes operating on the specified object. • Execution will proceed from line 6 only after print finishes, but potentially in parallel with other (prior) functions (that are not accessing G). • Figure 2. Example program in the proposed model.
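A hand-written sketch of what a program in this model might look like, loosely following the description of Figure 2. The exact signatures of df_execute, df_seq and df_end are assumptions here (the slides describe them only informally), and obj_t, T1 and the stub bodies are placeholders that run everything serially so the example is self-contained.

```cpp
#include <functional>
#include <iostream>
#include <set>

struct obj_t {                            // placeholder shared-object type
    const char* name;
    void print() const { std::cout << name << "\n"; }
};

// Hypothetical stand-ins for the runtime interfaces described on these slides.
// A real implementation would acquire tokens and delegate work to other cores;
// these stubs run everything serially so the example compiles and runs.
void df_execute(std::set<obj_t*> /*write_set*/, std::set<obj_t*> /*read_set*/,
                std::function<void()> fn) { fn(); }
template <class O, class M>
void df_seq(O& obj, M method) { (obj.*method)(); }
void df_end() { /* a real runtime would wait here for all in-flight functions */ }

obj_t A{"A"}, B{"B"}, G{"G"};

void T1(obj_t& b, const obj_t& a) { (void)b; (void)a; /* reads a, modifies b */ }

int main() {
    // Delegate T1 for (potentially parallel) dataflow execution:
    // write set {B}, read set {A}.
    df_execute({&B}, {&A}, [] { T1(B, A); });

    // Serialize on G: runs only after all earlier functions accessing G finish,
    // and the main program resumes only after print() completes.
    df_seq(G, &obj_t::print);

    df_end();   // barrier: code after this (T' in the example) executes serially
}
```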
Runtime Mechanics (1/10) • The mechanism is implemented with multiple threads; parallel execution is realized with the Pthread API • Thread management is abstracted away so that users never deal with it directly • Users need not understand the underlying structure of the mechanism • Executing Functions on Processing Cores • At the start of a program, the runtime creates threads, usually one per hardware context available to it. • A double-ended work queue (deque) is then assigned to each thread in the system. • Computations are scheduled for execution by a thread by queuing them in the corresponding work deque.
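A rough sketch of the setup just described: one worker thread per hardware context, each with its own double-ended work queue. It uses std::thread and a mutex-guarded std::deque for simplicity, whereas the prototype uses the Pthread API and a work-stealing scheduler; Worker and the seeded tasks are assumptions for this sketch.

```cpp
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

using Work = std::function<void()>;

struct Worker {
    std::deque<Work> deque;   // this thread's double-ended work queue
    std::mutex m;             // a real runtime would use a lock-free / stealing deque
};

int main() {
    unsigned n = std::thread::hardware_concurrency();  // usually one thread per context
    if (n == 0) n = 2;
    std::vector<Worker> workers(n);

    // Schedule one piece of work on each worker by queuing it in its deque.
    for (unsigned i = 0; i < n; ++i)
        workers[i].deque.push_back([i] { std::printf("worker %u ran a task\n", i); });

    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i)
        threads.emplace_back([&workers, i] {
            std::lock_guard<std::mutex> g(workers[i].m);
            while (!workers[i].deque.empty()) {         // drain this thread's own deque
                Work w = std::move(workers[i].deque.front());
                workers[i].deque.pop_front();
                w();
            }
        });

    for (auto& t : threads) t.join();
}
```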
Runtime Mechanics (2/10) • Discovering Functions for Parallel Execution • At the start of execution only one processor runs the program; the other processors sit idle, waiting for work to arrive in their deques • The beginning of execution therefore resembles serial execution • When a df_execute call is encountered, the runtime is activated
Runtime Mechanics (3/10) • Discovering Functions for Parallel Execution • The runtime processes a dataflow function in three decoupled phases, • prelude • execute • postlude Figure 3. Logical view of runtime operations to process a dataflow function.
Runtime Mechanics (4/10) • Discovering Functions for Parallel Execution • In the prelude phase, the runtime dereferences pointers to objects in the read/write sets, if need be, and attempts to acquire the tokens • Execute phase: successful acquisition of the tokens leads to the execute phase (Figure 3: 2), in which the function is delegated for (potentially parallel) execution • Specifically, the runtime pushes the program continuation (the remainder of the program past the df_execute call) onto the thread's work deque, and executes the function on the same thread.
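The execute phase can be sketched as follows: on successful token acquisition, the runtime queues the program continuation on the current thread's work deque and runs the delegated function on the same thread (the prelude and postlude are only hinted at in comments). The names below are assumptions for illustration, not the prototype's API.

```cpp
#include <deque>
#include <functional>
#include <iostream>

std::deque<std::function<void()>> work_deque;   // this thread's work deque

void process_dataflow_function(std::function<void()> fn,
                               std::function<void()> continuation) {
    // Prelude (omitted): dereference read/write-set pointers, try to acquire tokens.
    // Execute phase: queue the rest of the program so an idle thread could steal
    // it, then run the delegated function right here.
    work_deque.push_back(std::move(continuation));
    fn();
    // Postlude (omitted): release tokens and wake any shelved functions.
}

int main() {
    process_dataflow_function(
        [] { std::cout << "delegated function runs on this thread\n"; },
        [] { std::cout << "program continuation resumes later\n"; });

    // With no other threads in this toy, simply run whatever was queued.
    while (!work_deque.empty()) {
        auto w = std::move(work_deque.front());
        work_deque.pop_front();
        w();
    }
}
```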
Runtime Mechanics (5/10) • Discovering Functions for Parallel Execution • A task-stealing scheduler • running on each hardware context will cause an idle processor to steal the program continuation and continue its execution until it encounters the next df_execute, repeating the process of delegation and pushing of the program continuation onto its work deque. • Thus the execution of the program unravels in parallel with the executing functions, possibly on different hardware contexts rather than on a single one.
Runtime Mechanics (6/10) • Tokens and Dependency Tracking • During program execution, each allocated object has • one write token • an unbounded number of read tokens (limited only by the number of bits used to represent tokens) • a wait list • Tokens are acquired for the objects that dataflow functions operate on • and released when the functions complete
Runtime Mechanics (7/10) • Tokens and Dependency Tracking • A token may be granted only if it is available. • Figure 4a gives the definition of availability of read and write tokens, • and Figure 4b shows the token acquisition protocol. • The wait list is used to track functions to which the token could not be granted at the time of their requests. • A non-empty wait list signifies pending requests, in the enlisted order. • An available token is not granted if an earlier function enqueued in the wait list is waiting to acquire it (Figure 4b: 1).
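Figure 4a is not reproduced in the slides, so the availability rules below are a plausible reading of the one-write-token / many-read-token scheme rather than the paper's exact definitions; TokenState and the function names are assumptions for this sketch.

```cpp
#include <iostream>

struct TokenState {
    int  read_tokens_out = 0;      // read tokens currently granted for the object
    bool write_token_out = false;  // whether the single write token is granted
};

// A read token is available while no one holds the write token.
bool read_token_available(const TokenState& t)  { return !t.write_token_out; }

// The write token is available only when no tokens at all are outstanding.
bool write_token_available(const TokenState& t) {
    return !t.write_token_out && t.read_tokens_out == 0;
}

int main() {
    TokenState A;
    A.read_tokens_out = 2;         // two functions are currently reading A
    std::cout << std::boolalpha
              << "read token available?  " << read_token_available(A)  << '\n'   // true
              << "write token available? " << write_token_available(A) << '\n';  // false
    // Note: even an available token is not granted if an earlier function is
    // already enqueued in the object's wait list (Figure 4b: 1).
}
```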
Runtime Mechanics (8/10) • Tokens and Dependency Tracking • Figure 4. The token protocol: (a) Definition of availability; (b) Read/Write token acquisition
Runtime Mechanics (9/10) • Shelving Functions/Program Continuations • If a function's tokens cannot be acquired, • the function is enqueued in the wait lists of all the objects for which tokens could not be acquired (Figure 4b: 4 or 5), and subsequently shelved (Figure 3: 1.2). • While the shelved function waits for its dependences to resolve, the runtime looks for other independent work from the program continuation to perform • Figure 4. The token protocol: (b) Read/Write token acquisition
Runtime Mechanics (10/10) • Completion of Function Execution Figure 3. Logical view of runtime operations to process a dataflow function. Figure 4. The token protocol:(c) Token release
Example Execution (1/15) • Initial state: no tokens are held on any of the objects A–H, and CPU0–CPU2 are idle • Figure 5. Example execution
Example Execution (2/15) • T1 acquires its tokens and begins executing on CPU0 • Figure 5. Example execution
Example Execution (3/15) • An idle CPU steals the program continuation; T2 acquires its tokens and begins executing on CPU1 • Figure 5. Example execution
Example Execution (4/15) • The program continuation is stolen again as execution unravels past T3 • Figure 5. Example execution
Example Execution (5/15) • T4 is encountered; the token state and CPU activity are shown in the figure • Figure 5. Example execution
Example Execution (6/15) • T5 is encountered; the token state and CPU activity are shown in the figure • Figure 5. Example execution
Example Execution (7/15) • T6 acquires its tokens (including a read token on H) and begins executing on CPU2 • Figure 5. Example execution
Example Execution (8/15) • A pending function cannot execute yet ("Can't execute" in the figure) while T2 and T6 continue running • Figure 5. Example execution
Example Execution (9/15) • df_seq causes the runtime to shelve the program continuation beyond df_seq in G's wait list and await completion of all functions accessing G • Figure 5. Example execution
Example Execution (10/15) • Token state after the continuation is shelved; see the figure • Figure 5. Example execution
Example Execution (11/15) • T3 begins executing • Figure 5. Example execution
Example Execution (12/15) • T4 is executing; T5 will be scheduled for execution once T4 completes • Figure 5. Example execution
Example Execution (13/15) • G.print() executes; after print completes, the runtime schedules the program continuation for execution • Figure 5. Example execution
Example Execution (14/15) • The continuation is shelved again, preventing further processing of the program until all in-flight functions finish • Figure 5. Example execution
Example Execution (15/15) • T' executes after all previous functions complete • Figure 5. Example execution
Conclusion • Presented a novel execution model that achieves function-level parallel execution of statically-sequential imperative programs on multicore processors. • Parallel tasks (program functions) are dynamically extracted from a sequential program and executed in dataflow fashion on multiple processing cores, using tokens associated with shared data objects and a token protocol to manage the dependences between tasks. • The model thus combines the benefits of sequential programming and dataflow execution.