
Development Support for Concurrent Threaded Pipelining


Presentation Transcript


  1. Development Support for Concurrent Threaded Pipelining Core Research Laboratory Dr. Manish Vachharajani Brian Bushnell, Graham Price, Rohan Ambli University of Colorado at Boulder 2007.02.10 John Giacomoni

  2. Outline • Problem • Concurrent Threaded Pipelining • High-rate Networking • Communication • FastForward • Results • Additional Development Support

  3. Problem • Uniprocessor (UP) performance at “end of life” • Chip-Multiprocessor systems • Individual cores less powerful than a UP • Asymmetric and heterogeneous • 10-100s of cores • How do we program them?

  4. PL Support • Programmers are: • Bad at explicitly parallel programming • FShm architectures make this easier • Better at sequential programming • Hide parallelism • Compilers • Sequential libraries? • Math, iteration, searching, and ??? routines

  5. Using Multi-Core • Task Parallelism • Desktop - easy • Data Parallelism • Web serving - “easy” • Sequential applications • HARD (data dependencies) • Ex: Video Decoding • Ex: Network Processing

  6. A Solution: Concurrent Threaded Pipelining • Arrange applications as pipelines • (Pipeline-parallel) • Each stage bound to a processor • Sequential data flow • Data Hazards are a problem • Software solution • Frame Shared Memory (FShm) • FastForward
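A minimal sketch of the “stage bound to a processor” idea, assuming Linux and the GNU pthread_setaffinity_np extension (the helper name is mine; the slides do not show this code):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to one core so a pipeline stage
     * runs on a dedicated processor (Linux-specific). */
    static int bind_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

Each stage thread would call bind_to_core() with its own core number before entering its processing loop.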

  7. Concurrent Threaded Pipelining

  8. Sequential Threaded Pipelining

  9. Network Scenario • How do we protect? • GigE network properties (minimum-size frames): • 1,488,095 frames/sec • 672 ns/frame • Frame dependencies
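These figures follow from minimum-size Ethernet frames: a 64-byte frame plus the 8-byte preamble and 12-byte inter-frame gap is 84 bytes = 672 bits, which takes 672 ns on a 1 Gb/s link; 1 s / 672 ns ≈ 1,488,095 frames/sec.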

  10. FShm • Frame Shared Memory • User programming environment • Linked Kernel and User-Space Threads • Input (Kernel) • Processing (User-Space) • Output (Kernel) • GigE frame forwarding (672ns/frame)

  11. FShm Network Pipelining [Figure: pipeline with Input (IP) and Output (OP) stages.]

  12. Some Other Existing Work • Decoupled Software Pipelining • Automatic pipeline extraction • Modified IMPACT compiler • Assumes hardware queues • Stream-oriented languages • StreamIt, Cg, etc…

  13. AMD Opteron Structure

  14. Communication is Critical for CTPs • Hardware modifications: fast but expensive ($$$) • DSWP (<= 100 cycles, assumes hardware queues) • Software communication • Serializing queues (locks): >= ~600 ns per operation • How do we forward faster? • Concurrent Lock-Free (CLF) queues • Point-to-point CLF queues (Lamport ’83): ~200 ns per operation • Good… can we do better?

  15. FastForward • Portable, software-only framework • Architecturally tuned CLF queues • Works with all consistency models • Temporal slipping & prefetching hide die-die communication • ~35-40 ns per queue operation (core-core & die-die) • Fits within DSWP’s performance envelope • Cross-domain communication • Kernel/process/thread

  16. Optimized CLF Queues

    ff_enqueue(data) {
        while (0 != buf[head]);
        buf[head] = data;
        head = NEXT(head);
    }

[Figure: ring buffer buf[0]..buf[n] with head and tail indices.] Observe how the head/tail cachelines will NOT ping-pong, BUT “buf” will still cause cachelines to ping-pong.
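A self-contained sketch of the queue above; ff_dequeue and the NULL-slot convention are reconstructions by analogy with the enqueue shown, not code from the slides, and `volatile` stands in for the per-architecture fences a production version would need:

    #include <stddef.h>

    #define QSIZE 1024                    /* power of two (assumed) */
    #define NEXT(i) (((i) + 1) & (QSIZE - 1))

    /* Single-producer/single-consumer queue: the producer writes only
     * head, the consumer writes only tail, and full/empty is detected
     * through the slot contents, so head and tail never ping-pong. */
    static void *volatile buf[QSIZE];
    static size_t head, tail;             /* producer-local / consumer-local */

    void ff_enqueue(void *data)           /* data must be non-NULL */
    {
        while (buf[head] != NULL)
            ;                             /* spin: slot still occupied */
        buf[head] = data;
        head = NEXT(head);
    }

    void *ff_dequeue(void)
    {
        void *data;
        while ((data = buf[tail]) == NULL)
            ;                             /* spin: slot still empty */
        buf[tail] = NULL;                 /* free the slot for the producer */
        tail = NEXT(tail);
        return data;
    }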

  17. Slip Timing

  18. Performance

  19. Performance: FShm Forwarding

  20. Additional Development Support • OS support • Identification and visualization tools

  21. OS Support • Hardware virtualization • Asymmetric and heterogeneous cores • Cores may not share main memory (e.g., a GPU) • Pipelined OS services • Pipelines may cross process domains • FShm • Each domain should keep its memory private • Protection • Need a label for each pipeline • Co-/gang-scheduling of pipelines

  22. Reported Results • http://www.cs.colorado.edu/~jgiacomo/publications.html • John Giacomoni and Manish Vachharajani, “Harnessing Chip-Multiprocessors with Concurrent Threaded Pipelines,” Technical Report CU-CS-1024-07, University of Colorado at Boulder, January 2007. • John Giacomoni, Manish Vachharajani and Tipp Moseley, “FastForward for Concurrent Threaded Pipelines,” Technical Report CU-CS-1023-07, University of Colorado at Boulder, January 2007. • John Giacomoni, John K. Bennett, Antonio Carzaniga, Manish Vachharajani and Alexander L. Wolf, “FShm: High-Rate Frame Manipulation in Kernel and User-Space,” Technical Report CU-CS-1015-06, University of Colorado at Boulder, October 2006.

  23. Questions?

  24. Intel Structure

  25. Evaluation Methodology • 2 GHz AMD Opteron • Dual-processor & dual-core • Compute average time per call via the TSC • Instrument performance counters • Memory accesses: • Serviced by L1 data cache • Serviced by L2 • Serviced by system (coherence/main memory)
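A minimal sketch of the average-time-per-call measurement using the x86 time-stamp counter (the __rdtsc intrinsic and the iteration count are my choices; the slides only name the TSC):

    #include <x86intrin.h>                /* __rdtsc() on GCC/Clang */

    #define ITERS 1000000ULL

    /* Amortize the TSC read over many calls and return the average
     * cycle cost of one operation; on a 2 GHz Opteron, ns = cycles / 2. */
    static unsigned long long avg_cycles(void (*op)(void))
    {
        unsigned long long start = __rdtsc();
        for (unsigned long long i = 0; i < ITERS; i++)
            op();
        return (__rdtsc() - start) / ITERS;
    }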

  26. Performance: On/Off Die [Figure: on-die pairing, CPUs 2&3, vs. off-die pairing, CPUs 2&3 and CPU 1.]

  27. CLF Queues

    enqueue(data) {
        NH = NEXT(head);
        while (NH == tail) {};
        buf[head] = data;
        head = NH;
    }
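The slides show only the producer side; a matching consumer, reconstructed by analogy (not code from the slides):

    dequeue(data) {
        while (head == tail) {};          /* spin: queue empty */
        *data = buf[tail];
        tail = NEXT(tail);
    }

Note that each side reads the other’s index on every operation, which is what forces the head/tail cachelines to bounce between cores.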

  28. CLF Queues (same enqueue as above) [Figure: ring buffer buf[0]..buf[n] with head and tail indices.] Observe the mandatory cacheline ping-ponging for each enqueue and dequeue operation.

  29. CLF Queues (same enqueue as above) [Figure: ring buffer buf[0]..buf[n] with head and tail indices.] Observe how the cachelines will still ping-pong. What if the head/tail comparison were eliminated?

  30. Optimized CLF Queues

Original:

    enqueue(data) {
        NH = NEXT(head);
        while (NH == tail) {};
        buf[head] = data;
        head = NH;
    }

Optimized (no shared head/tail comparison):

    new_enqueue(data) {
        while (0 != buf[head]);
        buf[head] = data;
        head = NEXT(head);
    }

  31. Optimized CLF Queues

    new_enqueue(data) {
        while (0 != buf[head]);
        buf[head] = data;
        head = NEXT(head);
    }

[Figure: ring buffer with head and tail a cacheline apart.] Solution: temporally slip the stages by a cacheline, giving an N:1 reduction in compulsory cache misses per stage.
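One way the slip could be maintained, sketched against the queue variables from the sketch after slide 16 (the threshold and helpers are hypothetical, not FastForward’s actual code):

    #define SLIP_MIN 8    /* pointer-sized slots per 64-byte cacheline (assumed) */

    /* Hypothetical helper: how many filled slots separate producer
     * and consumer. */
    static size_t slip_distance(void)
    {
        return (head - tail) & (QSIZE - 1);
    }

    /* Called occasionally by the consumer: wait until the producer is
     * at least one cacheline ahead, so the two cores never touch the
     * same buffer cacheline. */
    static void maintain_slip(void)
    {
        while (slip_distance() < SLIP_MIN)
            ;                             /* let the producer run ahead */
    }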

  32. Performance: FShm Forwarding
