Demand and Penalty-Based Resource Allocation for Reconfigurable Systems with Runtime Partitioning. John Ardini. Motivation: for a given HW architecture, including reconfigurable components, optimize performance in consideration of long reconfiguration times and current demands for processing.

Presentation Transcript


  1. Demand and Penalty-Based Resource Allocation for Reconfigurable Systems with Runtime Partitioning. John Ardini

  2. Motivation
     • For a given HW architecture, including reconfigurable components
       • Optimize performance in consideration of long reconfiguration times and current demands for processing
     • Application in systems with unknown runtime processing demands
       • Cognitive systems
       • Multisensor systems
       • Systems with unknown data lengths
     • Take advantage of the ability to express hardware implementations in a high-level language (C) common to processors and programmable devices

  3. Related Work
     • Li, Compton, Hauck [00], based on Young [94]
       • “Credit” for an RC unit is proportional to the size of the unit
       • Penalty algorithm for defragmentation
       • A scoring approach is used here as well, but “credit” is proportional to the amount of “acceleration” achieved, with a decision threshold based on size
     • Vuletić, Pozzi, Ienne [04]
       • HW/SW abstraction layer proposed for an RC-transparent programming model

  4. Goals
     • Examine possible RTL generators allowing one set of source code for an algorithm
       • Binds to processor or programmable device (FPGA)
       • Minimal changes (I/O only) required to source
       • However, the scheduling approach does not depend on the capability of C-to-RTL generators
     • Show easy creation of processor and FPGA implementations of logic
     • Assume task scheduling is unknown at build time and is based on service requests
     • Allow each task to support SW-only and hardware-accelerated versions
     • Define simple logic to make “best” use of hardware resources, assigning ownership dynamically
     • Show benefit of RC via DMA in algorithms that can be bound to HW or SW
     • Define API for application threads
     • Demonstrate concept in real hardware

  5. Experimental Environment
     • Worker thread and coprocessor DMA model set up in Windows using a VC++ multithreaded app
     • Coprocessor is an FPGA on a PCI Alpha-Data card
     • Implemented algorithm execution with and without the coprocessor
     • Used DMA to help hide overhead of reconfiguration: SW-only threads can execute during configuration
     • Service requests initiated by adjustable timers to exercise RC logic
     • Event logging for analysis
     [Figure labels: worker thread registration, manager thread, service request, dataset, DMA config, savings, worker thread 1, worker thread 2, coproc]

  6. Hardware Environment
     • Alpha-Data Virtex-II Pro card on PCI bus
     • Simple bus wrapper gets coprocessor IP onto Alpha-Data local bus
     • PC chosen for easy development and focus on unique logic
     [Figure labels: PC, local bus to PCI bridge, FPGA, wrapper, IP]

  7. RTL Generator
     • ImpulseC chosen for this study
     • ANSI C-like
     • Simple modifications to algorithm to compile for processor
       • Data I/O path
       • Word types as simple #defines
     • High level of abstraction
       • Small learning curve
       • Give up low-level control of registers/signals
       • Some control over max gate delay using #pragma
     • Desktop simulation for fast algorithm debug

  8. Software
     • Manager and application in VC++
       • Easily implemented in C as well
     • For the demo, the Windows “worker thread” model was used, but other static thread + messaging methods could be used as well

  9. Test Algorithms
     • Two tasks implemented
       • FIR
       • FFT
     • HW implementation flow
       • Code in C
       • ImpulseC RTL generator
       • Synplify
       • Xilinx implementation tools
     • SW flow
       • Change I/O in HW algorithm to use shared memory buffer

  10. IP Development Outline
      • Write task coprocessor for HW using ImpulseC
      • Modify I/O for processor implementation
      • Quantify savings in clock cycles for HW-accelerated version
      • Wrap both implementations into a “worker thread” that will use one of the implementations based on coprocessor ownership
        • Need to check coprocessor ownership on thread start
      • Worker thread registration not considered here
        • Could be defined on power-up, or
        • Dynamically registered
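The wrapping step in this outline can be pictured roughly as follows. This is a minimal Python sketch rather than the original VC++ code: `fir_sw`, `fir_hw`, `run_worker`, and the `savings_per_run` figure are hypothetical names and values, and the "hardware" path is only a stand-in that reuses the software result.

```python
from typing import List, Tuple


def fir_sw(samples: List[float], taps: List[float]) -> List[float]:
    """Software-only FIR: y[n] = sum over k of taps[k] * x[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]
        out.append(acc)
    return out


def fir_hw(samples: List[float], taps: List[float]) -> List[float]:
    """Stand-in for the coprocessor path (assumption: same result as SW)."""
    return fir_sw(samples, taps)


def run_worker(samples: List[float], taps: List[float],
               owns_coprocessor: bool,
               savings_per_run: int) -> Tuple[List[float], int, bool]:
    """Check ownership at thread start, pick one implementation, report savings."""
    impl = fir_hw if owns_coprocessor else fir_sw
    result = impl(samples, taps)
    # Savings are reported either way: achieved (HW) or could-have-been (SW).
    return result, savings_per_run, owns_coprocessor
```

Checking ownership once at thread start keeps the choice of implementation fixed for the whole run, which matches the outline's note that ownership is checked on thread start.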

  11. Worker Thread Control Block
      • One instantiated per worker thread
      • Contains information about the coprocessor bit stream
      • Points to the HW resource it currently owns
        • Would be used in multiple-coprocessor systems for faster manager logic
      • Contains base address of its coprocessor
      • Maintained by the manager and used as a semaphore for coprocessor use
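One way to picture this control block is as a small record that only the manager mutates. The sketch below is an assumption about its shape, not the original VC++ structure: every field name, the `grant` and `revoke` helpers, and the example bit stream name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerThreadControlBlock:
    thread_id: int
    bitstream_name: str                       # hypothetical: identifies the bit stream
    bitstream_bytes: int                      # hypothetical: size of the bit stream
    owned_resource_id: Optional[int] = None   # HW resource currently owned, if any
    coproc_base_addr: Optional[int] = None    # base address of the owned coprocessor

    def owns_coprocessor(self) -> bool:
        # The ownership field doubles as the flag the worker checks before use.
        return self.owned_resource_id is not None


def grant(wtcb: WorkerThreadControlBlock, resource_id: int, base_addr: int) -> None:
    """Manager-side helper: record ownership of a coprocessor."""
    wtcb.owned_resource_id = resource_id
    wtcb.coproc_base_addr = base_addr


def revoke(wtcb: WorkerThreadControlBlock) -> None:
    """Manager-side helper: clear ownership so the worker falls back to SW."""
    wtcb.owned_resource_id = None
    wtcb.coproc_base_addr = None
```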

  12. RC Thread Control Block
      • Control block for HW resource
      • Holds information about the resource, e.g. the ID of the resource
      • Member function to kick off bit stream load process via DMA
        • Target thread can continue to run SW only until configuration is complete
      • Member function to gain coprocessor access on behalf of a worker thread, based on ownership and state (is it done loading the bit stream?)
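The two member functions described here might look something like the sketch below. It is an illustration under assumptions: the class, its three states, and the method names `start_load`, `load_done`, and `try_acquire` are hypothetical, and the DMA transfer is reduced to a state change.

```python
import threading
from enum import Enum
from typing import Optional


class RcState(Enum):
    IDLE = 0      # no bit stream loaded
    LOADING = 1   # bit stream load in progress
    READY = 2     # bit stream loaded, coprocessor usable


class RcControlBlock:
    def __init__(self, resource_id: int) -> None:
        self.resource_id = resource_id
        self.owner: Optional[int] = None
        self.state = RcState.IDLE
        self._lock = threading.Lock()

    def start_load(self, new_owner: int) -> None:
        """Kick off the bit stream load for new_owner (DMA itself not modelled)."""
        with self._lock:
            self.owner = new_owner
            self.state = RcState.LOADING

    def load_done(self) -> None:
        """Called when the load completes."""
        with self._lock:
            self.state = RcState.READY

    def try_acquire(self, thread_id: int) -> bool:
        """Grant access only to the owner, and only once loading has finished."""
        with self._lock:
            return self.owner == thread_id and self.state is RcState.READY
```

While `try_acquire` returns `False`, the target thread would keep running its software-only implementation, which is how the load time stays hidden.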

  13. Coprocessor Ownership
      • All service requests pass through the thread manager
      • Manager uses “scoring” logic
        • Upon completion, worker threads report the “savings” that were achieved, or could have been achieved, using a coprocessor
        • Manager increments score for that thread
        • Highest-scoring threads receive a coprocessor
      • Reassignment not done until a threshold is passed
        • Set based on relative time penalty of performing a reconfiguration, e.g. reconfigure when the score delta exceeds 10x the reconfiguration time
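A small worked example may help with the threshold arithmetic. The sketch assumes scores and savings are both expressed in microseconds and uses a made-up reconfiguration time; `report_savings` and `requests_to_exceed` are hypothetical helper names.

```python
RECONFIG_TIME_US = 50_000                # hypothetical reconfiguration time
RC_THRESHOLD = 10 * RECONFIG_TIME_US     # the "10x" example from the slide


def report_savings(scores: dict, thread_id: str, savings_us: int) -> int:
    """Manager side: add the reported savings to the thread's score."""
    scores[thread_id] = scores.get(thread_id, 0) + savings_us
    return scores[thread_id]


def requests_to_exceed(lead_per_request_us: int,
                       threshold: int = RC_THRESHOLD) -> int:
    """Smallest number of requests whose accumulated lead strictly exceeds threshold."""
    return threshold // lead_per_request_us + 1
```

With these numbers, a thread that gains 2,000 us on the current owner per request needs `requests_to_exceed(2000)`, i.e. 251 requests, before a reconfiguration would be triggered. That delay is the source of the hysteresis.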

  14. Scoring Logic
      • Need to bound scores
        • Bound should be greater than RC threshold
        • 2x RC threshold used in these tests
      • Need to maintain “relative” performance of competing tasks, i.e. can’t have most scores saturating
      • Therefore, when updating scores at thread completion, subtract the current lowest score from all registered threads

  15. Scoring Details
      • Simple subtraction of lowest score is not enough
        • One inactive thread would allow “integrator windup” on the remaining threads
        • Slow response when the inactive thread comes back online
        • Saturation logic would prevent the selection of coprocessor owners, i.e. they would all “collect” at the top of the score list
        • Prevents initial accumulation of scores
      • Therefore, subtract score x from each task, where x is the lowest nonzero score among all tasks other than the top-scoring m threads, and m is the number of available coprocessors
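The difference between the two subtraction rules can be sketched as below. This is an illustrative reading of the slide, not the original code: the function names are hypothetical, and clamping the result to the range 0..bound is an assumption about how the bounding is applied.

```python
def subtract_lowest(scores: dict) -> dict:
    """Naive rule: subtract the lowest score from every registered thread."""
    lowest = min(scores.values())
    return {t: s - lowest for t, s in scores.items()}


def normalize_scores(scores: dict, m: int, bound: int) -> dict:
    """Refined rule: subtract x, the lowest nonzero score among the
    tasks outside the top-scoring m, then clamp to 0..bound (assumption)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    outside_top_m = ranked[m:]
    nonzero = [scores[t] for t in outside_top_m if scores[t] > 0]
    x = min(nonzero) if nonzero else 0
    return {t: min(bound, max(0, s - x)) for t, s in scores.items()}
```

For scores `{"fft": 900, "fir": 400, "idle": 0}` with one coprocessor, the naive rule subtracts 0 and changes nothing, so the active threads keep winding up toward the bound. The refined rule subtracts 400 and leaves `fft` at 500 with the other two at 0.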

  16. Coproc Assignment
      • Get highest-scoring non-owner in top m tasks
      • Compare score to lowest-ranking owner
      • If diff is greater than threshold, RC
      • If current owner is using the resource, skip RC
        • If RC is still the right decision after current owner finishes, RC will happen at that time
        • More logic could be used to continue comparing against current coproc owners
      [Figure labels: ranked task scores *t1, t2, t3, *t4, t5 (* = current owner); top m tasks eligible for coprocessor ownership; lower-ranking tasks will run in SW; Δ > thresh?]
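The decision steps listed on this slide can be sketched as a single function. This is a minimal illustration under assumptions: `pick_reconfiguration` and its parameters are hypothetical, and the initial case where no thread owns a coprocessor yet is not described on the slide, so it is left out.

```python
from typing import Dict, FrozenSet, Optional, Set, Tuple


def pick_reconfiguration(scores: Dict[str, int], owners: Set[str], m: int,
                         threshold: int,
                         busy: FrozenSet[str] = frozenset()
                         ) -> Optional[Tuple[str, str]]:
    """Return (new_owner, evicted_owner) if a reconfiguration should start now."""
    if not owners:
        return None  # initial assignment is outside this sketch
    ranked = sorted(scores, key=scores.get, reverse=True)
    challengers = [t for t in ranked[:m] if t not in owners]
    if not challengers:
        return None  # every top-m task already owns a coprocessor
    challenger = challengers[0]                  # highest-scoring non-owner
    lowest_owner = min(owners, key=scores.get)   # lowest-ranking owner
    if scores[challenger] - scores[lowest_owner] <= threshold:
        return None  # inside the hysteresis band
    if lowest_owner in busy:
        return None  # skip for now; re-evaluate when the owner finishes
    return challenger, lowest_owner
```

Returning `None` for a busy owner, rather than queueing the reconfiguration, follows the slide's note that the decision is simply made again once the current owner finishes.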

  17. Reconfiguration Thread
      • Created by manager
      • Kicks off DMA process
      • Waits for done event
      • Sends reconfiguration-complete message back to manager
      • Manager can then give access to the worker thread owner
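The lifecycle of this thread can be mimicked with standard threading primitives. The sketch below uses Python's `threading` and `queue` modules in place of the Windows facilities used in the original, and the DMA transfer is simulated with a short sleep; all function names and the message format are hypothetical.

```python
import queue
import threading
import time


def reconfiguration_thread(resource_id: int, dma_seconds: float,
                           to_manager: queue.Queue) -> None:
    done = threading.Event()

    def fake_dma() -> None:
        time.sleep(dma_seconds)   # stand-in for the bit stream transfer
        done.set()                # the "done" event

    threading.Thread(target=fake_dma, daemon=True).start()   # kick off DMA
    done.wait()                                               # wait for done event
    to_manager.put(("reconfig_complete", resource_id))        # notify the manager


def manager_reconfigure(resource_id: int, dma_seconds: float = 0.01):
    """Manager side: create the thread, then wait for its completion message."""
    inbox: queue.Queue = queue.Queue()
    worker = threading.Thread(target=reconfiguration_thread,
                              args=(resource_id, dma_seconds, inbox))
    worker.start()
    message = inbox.get(timeout=5)
    worker.join()
    return message
```

In a real manager the wait on the inbox would be part of its normal message loop, so service requests for software-only threads could still be handled during the load.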

  18. Test Configuration
      • Single HW resource available
      • Two competing threads: FFT and FIR processing
      • Fixed HW block sizes
      • Fixed data set sizes = fixed savings
      • Adjust for mismatch in microprocessor vs. FPGA clock rates
      • Service request rates for each thread adjustable to exercise RC logic
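The slides do not give the formula used to adjust for the clock-rate mismatch. One plausible reading, shown here purely as an assumption, is to convert the FPGA cycle count into equivalent microprocessor cycles before subtracting. The function name and all numbers below are hypothetical.

```python
def savings_in_cpu_cycles(sw_cpu_cycles: int, hw_fpga_cycles: int,
                          cpu_hz: int, fpga_hz: int) -> int:
    """Savings of the HW version, expressed in microprocessor clock cycles."""
    # Scale FPGA cycles by the clock ratio so both terms share one unit.
    hw_as_cpu_cycles = hw_fpga_cycles * cpu_hz // fpga_hz
    return sw_cpu_cycles - hw_as_cpu_cycles
```

For example, a task that takes 1,000,000 cycles on a 2 GHz processor and 20,000 cycles on a 50 MHz FPGA costs the equivalent of 800,000 processor cycles in hardware, for a saving of 200,000 cycles per data set.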

  19. Results
      [Figure labels: score saturation, RC threshold, hysteresis, RC event, Thread 1 owns, Thread 2 owns, No owner. Caption: service request rates for two threads vary with time]

  20. Reconfiguration Detail
      [Figure label: RC DMA period]

  21. RC DMA with Higher Demand Rate
      [Figure label: RC DMA period]

  22. Conclusions
      • Coprocessor ownership given based on best sustained use of the resource
      • Provides hysteresis to prevent frequent reconfigurations
      • Low-overhead RC decision logic
      • Hardware and software implementations allow DMA to hide reconfiguration overhead
      • IP description in C allows it to be created once, then compiled for microprocessor and FPGA targets
