Decoupled Architectures for Complexity-Effective General Purpose Processors

Ronny Krashinsky and Mike Sung. 6.893 Term Project Presentation, MIT Laboratory for Computer Science, 12-7-2000.


Presentation Transcript


  1. Decoupled Architectures for Complexity-Effective General Purpose Processors Ronny Krashinsky and Mike Sung 6.893 Term Project Presentation MIT Laboratory for Computer Science 12-7-2000

  2. Motivation
     • out-of-order superscalar designs are inefficient and hard to scale
     • decoupled architectures can provide latency hiding, dynamic scheduling, and ILP in a much more complexity-effective and scalable manner
     • in previous work, decoupled architectures have been investigated for scientific apps
     • superscalar architectures are used universally for general purpose computing
     • why? superscalars provide more flexibility, and decoupled architectures break down when there is a loss of decoupling

  3. Proposal
     • use decoupled architectures for complexity-effective general purpose computing
     • multithreading can be used to hide the latency of loss-of-decoupling events
     • potentially get the best of both architectures by providing a superscalar processor with decoupled engines for complexity-effective streaming computations
     • we will present a survey of prior work and our proposed architectural innovations; unfortunately, a lot of infrastructure (e.g. a compiler) is required for a more detailed investigation

  4. Decoupled Access/Execute Architecture
     • AP (access processor) and EP (execute processor) process separate instruction streams
     • EP used for computation (floating point)
     • ILP
     • data values communicated via queues
     • slip – AP runs ahead of EP
     • memory latency hiding
     • dynamic scheduling
     • head of the AEQ (access-to-execute queue) can be used as an instruction operand in the EP
     • blocks if data isn’t available
     • takes the place of register renaming
     • store addresses wait in the WAQ (write address queue) until the corresponding data arrives from the EP
     • loads can bypass stores (check address)
     Decoupled Access/Execute Computer Architectures, Smith, 1982
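
The slip and latency-hiding behaviour on this slide can be illustrated with a small cycle-count model. This is only a sketch under stated assumptions, not the machine described by Smith: the function `run_dae`, its parameters, the one-load-per-cycle AP, and the one-value-per-cycle EP are all hypothetical simplifications introduced here.

```python
from collections import deque

def run_dae(n_elems, mem_latency, queue_depth):
    """Count cycles for the EP to consume n_elems loaded values.

    Assumptions (illustrative only): the AP issues at most one load per
    cycle, memory returns each value mem_latency cycles after issue, and
    the EP consumes one value per cycle from the head of the AEQ,
    blocking when the queue is empty. queue_depth bounds the number of
    loads that are outstanding or waiting in the AEQ, i.e. the slip.
    """
    in_flight = deque()   # arrival cycles of outstanding loads, in order
    aeq = deque()         # access-to-execute queue
    issued = consumed = 0
    cycle = 0
    while consumed < n_elems:
        cycle += 1
        # Memory delivers every load whose latency has elapsed.
        while in_flight and in_flight[0] <= cycle:
            in_flight.popleft()
            aeq.append(1)
        # EP: use the head of the AEQ as an operand, or block.
        if aeq:
            aeq.popleft()
            consumed += 1
        # AP: run ahead while there is room for more slip.
        if issued < n_elems and len(in_flight) + len(aeq) < queue_depth:
            in_flight.append(cycle + mem_latency)
            issued += 1
    return cycle

decoupled_cycles = run_dae(100, 10, 16)   # AP may slip up to 16 loads ahead
lockstep_cycles = run_dae(100, 10, 1)     # no slip: one load at a time
```

With enough slip the memory latency is paid once (110 cycles for 100 elements in this model); with a queue depth of 1 every element pays the full latency (1001 cycles).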

  5. Decoupled Access/Execute Architecture
     • program control flow implemented with a corresponding conditional branch in each stream
     • branch condition queues allow the AP to hide branch latency from the EP
     • loss of decoupling if the AP depends on a branch condition from the EP
     • not discussed in early works
     • implemented in the Astronautics ZS-1 processor
     • single interleaved instruction stream is split to feed instruction queues
     • control flow instructions executed in the splitter
     Decoupled Access/Execute Computer Architectures, Smith, 1982
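
A minimal functional sketch of the paired-branch idea, assuming a simple array-sum loop: the AP resolves the loop branch and pushes each outcome into a branch queue, and the EP's corresponding branch just pops that outcome. The names `access_stream`, `execute_stream`, `aeq`, and `branch_q` are hypothetical, and the AP is run to completion first to stand in for unbounded slip.

```python
from collections import deque

def access_stream(data, aeq, branch_q):
    """AP stream: loads values and resolves the loop branch."""
    for value in data:
        branch_q.append(True)    # loop branch taken: another iteration
        aeq.append(value)        # loaded value goes to the AEQ
    branch_q.append(False)       # loop exit

def execute_stream(aeq, branch_q):
    """EP stream: its loop branch consumes the AP's branch outcomes."""
    total = 0.0
    while branch_q.popleft():
        total += aeq.popleft()   # head of AEQ used as an operand
    return total

aeq, branch_q = deque(), deque()
access_stream([1.0, 2.0, 3.0], aeq, branch_q)   # AP runs ahead
total = execute_stream(aeq, branch_q)           # EP catches up later
```

If the loop condition instead depended on a value computed by the EP, the AP could not run ahead like this, which is the loss-of-decoupling case named on the slide.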

  6. Simultaneous Multithreading with DAE
     • observation that functional unit latencies and true data dependencies in the EP hinder performance
     • use SMT and thread-level parallelism to better utilize functional units (same as with SMT in superscalars)
     • few threads are required
     • decoupling provides memory latency tolerance, SMT hides functional unit latencies
     The Synergy of Multithreading and Access/Execute Decoupling, Parcerisa and Gonzalez, 1998

  7. Decoupled Control/Access/Execute Architecture
     • further optimization: control decoupling
     • three instruction streams, dynamic slip
     • CP processes the control flow graph, sends directives to the AP and EP to execute basic blocks
     • limited control capabilities in the AP and EP: loop count and predication
     • fetch engines fill queues with valid instructions
     • dynamic loop unrolling
     • control latency hidden (without speculation)
     • “stream units”
     • CU can operate in stand-alone mode
     • implemented as a 21064, ran the OS
     The Effectiveness of Decoupling, Bird et al., 1993
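
The directive mechanism can be sketched as follows, assuming each directive is simply a (basic block, loop count) pair. The block names, instruction mnemonics, and the functions `control_processor` and `fetch_engine` are hypothetical and only illustrate how a fetch engine could fill a queue with valid instructions, unrolling loops dynamically and without speculation.

```python
from collections import deque

# Hypothetical basic blocks for one stream; contents are illustrative only.
BLOCKS = {
    "init": ["li"],
    "body": ["ld", "add"],
    "exit": ["st"],
}

def control_processor(cfg_walk):
    """CP: walk the control flow and emit one directive per basic block."""
    directives = deque()
    for block, loop_count in cfg_walk:
        directives.append((block, loop_count))
    return directives

def fetch_engine(directives, blocks):
    """Fetch engine: expand each directive into the instruction queue,
    repeating the block loop_count times (dynamic loop unrolling)."""
    instruction_queue = deque()
    while directives:
        block, loop_count = directives.popleft()
        for _ in range(loop_count):
            instruction_queue.extend(blocks[block])
    return instruction_queue

directives = control_processor([("init", 1), ("body", 3), ("exit", 1)])
instruction_queue = fetch_engine(directives, BLOCKS)
```

Because the CP has already resolved the control flow, every instruction placed in the queue is known to be on the correct path.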

  8. Decoupled Control/Access/Execute Architecture
     • loss of decoupling events cause breakdown
     The Performance of Decoupled Architectures, Parcerisa et al., 1996

  9-24. Decoupled Control/Access/Execute Architecture (title-only slides in the transcript; slide 14 is marked “LOD!”)

  25. Decoupled vs. Superscalar Architectures
     • Dynamic “out-of-order” execution with less complexity
     • Allows non-speculative instruction and data prefetching. We can shrink data structures like first-level caches, potentially reducing critical paths as well as reducing power
     • Inherent tolerance of long memory latencies – provides a performance advantage for streaming applications, etc., where lack of locality mitigates the performance advantages of caches
     • Simplified issue logic which can be implemented with small structures/queues (contrast with ROB/IW/bypass structures)
     • Better resource utilization by partitioning between CP/AP/DP; processors can have specialized ISAs
     • Scalability – direct consequence of simplified logic
     • For superscalar processors, need to increase the IW, which does not scale (Palacharla/Agarwal papers)
     • Decoupled machines alleviate centralized resource bottlenecks
     • Queue-based structure is amenable to tiled architectures with on-chip networks

  26. Decoupled Architectures for General Purpose Computing
     So why haven’t decoupled machines taken over the world?
     • Because superscalar architectures took over the world first
     • The primary drawback of decoupled architectures comes from LOD events: “twisty” C code can cause severe performance degradation
     • Inability of compilers to generate code effectively for separate instruction streams – lack of research/development in the area of programming/compiling analysis
     Wheel of Reincarnation: no such thing as a new idea…
     • If we can augment existing decoupled architectures to remove the effects of LOD events, we effectively have an architecture that can feasibly be used for general purpose computing
     • Leverage existing ideas to augment decoupling – Multithreading and Auxiliary Processing

  27. Multithreading on a DCAE Architecture
     [Figure: block diagram. Labels: Instruction Memory Interface; IFB, param, IFE, RF, cond (each appearing twice); CP, AP, DP; SAQ, SDQ, LDQ, RD, LAQ; Data Memory Interface]
     • Multithreading hides the latency of LOD events.
     • LOD events result in very long latencies (hundreds of cycles) to reestablish decoupling
     • Motivation is to hide LOD events to prevent the need to resynchronize
     • SMT hides functional unit latencies.
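
The latency-hiding argument on this slide can be made concrete with a back-of-the-envelope model. This is an assumption introduced here, not a result from the presentation: each thread is taken to run decoupled for `run_cycles` and then stall for `lod_penalty` cycles, context switches are treated as free, and the function name `utilization` is hypothetical.

```python
def utilization(n_threads, run_cycles, lod_penalty):
    """Fraction of cycles doing useful work in a simple model where each
    thread alternates run_cycles of decoupled execution with a
    lod_penalty-cycle stall, and other threads fill the stall for free."""
    return min(1.0, n_threads * run_cycles / (run_cycles + lod_penalty))

# Illustrative numbers only: 100 useful cycles between LOD events,
# 300 cycles to reestablish decoupling.
single = utilization(1, 100, 300)
dual = utilization(2, 100, 300)
quad = utilization(4, 100, 300)
```

In this model one thread keeps the machine busy a quarter of the time, and four contexts are enough to cover a 300-cycle penalty completely.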

  28. Multithreading on a DCAE Architecture
     [Figure: block diagram. Labels: Instruction Memory Interface; IFB, param, IFE, RF, cond (each appearing twice); CP, AP, DP; SAQ, SDQ, LDQ, RD, LAQ; Data Memory Interface]
     Multithreading in access/execute units:
     • Multiple contexts (IP/RF) for fast context-switching during an LOD event
     • Interleaved SMT to hide horizontal as well as vertical waste within the execute processor

  29. Multithreading on a DCAE Architecture
     [Figure: block diagram. Labels: Instruction Memory Interface; IFB, param, IFE, RF, cond (each appearing twice); CP, AP, DP; SAQ, SDQ, LDQ, RD, LAQ; Data Memory Interface]
     • With multithreading, utilization of the CP/AP/EP by different threads is pipelined
     • analogous to instruction pipelining in a CPU datapath

  30-46. Multithreading on a DCAE Architecture (diagram-only slides in the transcript, each repeating the block-diagram labels: Instruction Memory Interface; IFB, param, IFE, RF, cond; CP, AP, DP; SAQ, SDQ, LDQ, RD, LAQ; Data Memory Interface. Slide 35 is marked “LOD!”)

  47. Auxiliary Decoupled Access/Execute Streaming Units
     • Implement the control processor as a fully functional high-performance microprocessor. The compiler can avoid decoupling control-intensive code.
     • When decoupling is possible (e.g. streaming computations), the decoupled access/execute engines provide a high-performance, complexity-effective alternative.
     • Analogous to vector coprocessors or SIMD array coprocessors. The basic idea is to utilize specialized hardware when possible and have a fallback plan when the Achilles' heel is exposed.

  48. Extensions for Improved Performance
     • Wider-issue access/execute processors
     • Speculative multithreading
     • Control processor can spawn speculative threads when only a single thread of control is available
     • Misspeculation detection can be performed by checking accessed memory addresses (in queues) for collisions
     • Kill a speculative thread by simply flushing its queues/context
     • Can merge concepts, with multithreaded decoupled execution under the auxiliary access/execute units paradigm.
     • Use decoupling/multithreading when possible, and fall back on the high-performance control processor otherwise
     • Tiled architectures: extend decoupled architectures to scalable multiprocessor systems such as RAW.
     • Queue-based structure is a good fit for incorporating communication from other tiles
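
The collision check and queue flush for speculative threads might look roughly like the sketch below. It assumes, purely for illustration, that the older thread's pending store addresses and the speculative thread's load addresses are both visible as queues; the names `collides`, `kill_speculative_thread`, and the example addresses are hypothetical.

```python
from collections import deque

def collides(store_addr_queue, spec_load_addrs):
    """Return True if any speculative load address matches a pending
    store address of the older, non-speculative thread."""
    pending = set(store_addr_queue)
    return any(addr in pending for addr in spec_load_addrs)

def kill_speculative_thread(queues):
    """Squash the speculative thread by flushing its queues."""
    for q in queues:
        q.clear()

# Illustrative state: the older thread has stores pending to 0x100 and
# 0x108; the speculative thread has already loaded from 0x200 and 0x108.
store_addr_queue = deque([0x100, 0x108])
spec_load_addrs = deque([0x200, 0x108])
spec_data_queue = deque([1.5, 2.5])

squashed = collides(store_addr_queue, spec_load_addrs)
if squashed:
    kill_speculative_thread([spec_load_addrs, spec_data_queue])
```

Since the speculative thread's state lives in its queues and context, discarding it is cheap compared with recovering a reorder buffer.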

  49. Summary
     • Decoupled architectures represent a complexity-effective and scalable way to provide dynamic scheduling, hide latency, and exploit ILP
     • To enable general purpose computation, we can augment decoupling with multithreading to hide the latency of LODs
     • By using decoupled access and execute units as auxiliary processors, we can leverage the benefits of both decoupling for streaming computations and out-of-order superscalars for control-flow-intensive computations
