
Scalability Achievement by Low-Overhead, Transparent Threads on an Embedded Many-Core Processor

Takeshi Kodaka, Akira Takeda, Shunsuke Sasaki, Akira Yokosawa, Toshiki Kizu, Takahiro Tokuyoshi, Hui Xu, Toru Sano, Hiroyuki Usui, Jun Tanabe, Takashi Miyamori and Nobu Matsumoto


Presentation Transcript


  1. Scalability Achievement by Low-Overhead, Transparent Threads on an Embedded Many-Core Processor
     Takeshi Kodaka, Akira Takeda, Shunsuke Sasaki, Akira Yokosawa, Toshiki Kizu, Takahiro Tokuyoshi, Hui Xu, Toru Sano, Hiroyuki Usui, Jun Tanabe, Takashi Miyamori and Nobu Matsumoto
     Center for Semiconductor Research and Development, Toshiba Corporation
     DAC 2013 Designer/User Track

  2. Background
     • Requirements for embedded processors
       • Various types of processing
         • Video codecs (HEVC, H.264, MPEG-2, WMV, ...)
         • Face detection/recognition, audio/video playback, mobile TV
       • Wide range of required processing performance
         • Must serve a variety of products, from mobile phones to tablets and beyond
         • Example: video decoding from QVGA at 15 fps to 1080p at 60 fps or more
       • Low-cost, short-schedule development that meets market requirements
         • Reuse existing software to reduce development cost

  3. Challenges
     • What kind of hardware architecture to employ?
       • The number of cores should be easy to increase or decrease
     • How can we realize scalable performance?
       • A parallelized application program that utilizes multiple cores efficiently
     • How can we realize transparency?
       • Hiding the number of cores from the application program
     Multiple-core architecture [xu2012low]
     Our proposed scheduler
     [xu2012low] H. Xu, et al., "A low power many-core SoC with two 32-core clusters connected by tree based NoC for multimedia applications," VLSI Symposium 2012.

  4. Our approach
     A simple multiple-core architecture
     + An application program independent of the number of cores
     + An efficient parallel processing scheme
     → Achieving scalable performance

  5. Strategy to realize our approach
     • Strategy
       • Develop an application independent of the number of cores → transparency
       • Run the developed application on a multiple-core processor and achieve performance proportional to the number of cores → scalable performance
     • Scheme
       • Designed an efficient thread scheduler
         • Efficient management of threads may achieve scalable performance
         • The number of cores may be hidden if a thread scheduler abstracts the cores
     • Challenges
       • Minimizing execution overheads
       • Hiding the number of cores from the application program

  6. How to minimize overheads
     • Defined unique properties for threads
       • A thread never suspends to wait for data
         • This eliminates the overhead of thread switching
       • A thread becomes ready to run when all of its necessary data are available
     • Managed thread status using simple counters
       • Simplify each thread's dependencies into "the number of dependencies"
         • This can be realized with simple operations
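The counter scheme on this slide can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Task` class, the `needs` method, `run_to_completion`, and the three-task example are hypothetical names chosen for illustration. Each task holds a count of unmet dependencies; finishing a task decrements the count of every task that consumes its output, and a task becomes ready exactly when its count reaches zero. Tasks run from start to finish, so nothing is ever suspended or switched out.

```python
from collections import deque


class Task:
    """A run-to-completion unit of work with a dependency counter (hypothetical)."""

    def __init__(self, name, body):
        self.name = name
        self.body = body          # callable; assumed never to block waiting for data
        self.pending = 0          # "number of dependencies" still unmet
        self.dependents = []      # tasks that consume this task's output

    def needs(self, *producers):
        """Declare that this task needs the output of each producer."""
        for producer in producers:
            producer.dependents.append(self)
            self.pending += 1


def run_to_completion(tasks):
    """Run every task once and return the task names in execution order."""
    ready = deque(t for t in tasks if t.pending == 0)
    order = []
    while ready:
        task = ready.popleft()
        task.body()                      # runs start to finish, no suspension
        order.append(task.name)
        for dependent in task.dependents:
            dependent.pending -= 1       # one counter update per dependency
            if dependent.pending == 0:   # all necessary data are now available
                ready.append(dependent)
    return order


# Hypothetical example: "merge" needs the results of both "left" and "right".
left = Task("left", lambda: None)
right = Task("right", lambda: None)
merge = Task("merge", lambda: None)
merge.needs(left, right)
order = run_to_completion([merge, left, right])
```

The sketch is single-threaded on purpose: it isolates the readiness rule. Because a task is only started once its counter is zero, there is no waiting state to save or restore, which is the property the slide credits with removing thread-switching overhead.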

  7. How to hide the number of cores
     • Designed a distributed scheduler with a shared queue
       • ONLY ready threads are placed in the shared queue
       • A Thread Dispatcher runs on each core
         • The dispatcher fetches a thread from the shared queue and executes it
     • To reduce access conflicts on the shared queue
       • We use the CAS (Compare And Swap) instruction
     [Figure: several cores, each running a Thread Dispatcher that searches the shared queue of threads, then fetches and executes one]
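A rough sketch of how dispatchers might claim work from a shared queue with CAS follows. Several things here are assumptions rather than details from the slides: Python has no CAS instruction, so the hypothetical `AtomicInt` class emulates one with a lock (on real hardware this would be a single atomic instruction); the queue is a fixed list that is fully populated before the dispatchers start; and `dispatcher` and `run` are illustrative names.

```python
import threading


class AtomicInt:
    """Emulated atomic integer; the lock stands in for a hardware CAS."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Set the value to `new` only if it still equals `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def dispatcher(ready_queue, head, results):
    """Per-core loop: claim the next ready thread with CAS, then execute it."""
    while True:
        index = head.load()
        if index >= len(ready_queue):
            return                                  # no ready threads left
        if head.compare_and_swap(index, index + 1):
            results[index] = ready_queue[index]()   # this core owns slot `index`
        # On CAS failure another core claimed the slot first; just retry.


def run(ready_queue, num_cores):
    """Execute the same queue on any number of cores."""
    head = AtomicInt(0)
    results = [None] * len(ready_queue)
    cores = [
        threading.Thread(target=dispatcher, args=(ready_queue, head, results))
        for _ in range(num_cores)
    ]
    for core in cores:
        core.start()
    for core in cores:
        core.join()
    return results


# Hypothetical workload: 100 independent ready threads.
work = [lambda i=i: i * i for i in range(100)]
results_by_cores = {n: run(work, n) for n in (1, 2, 4, 8)}
```

The workload never mentions `num_cores`; only `run` does. Each slot is executed exactly once because the head index only moves forward and a CAS on a given value can succeed for a single dispatcher.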

  8. Implemented thread scheduler
     • Our Thread Scheduler consists of three components
       • Dependency Controller, Thread Pool, and Thread Dispatcher
     • Our Thread Scheduler ...
       • is low overhead, for scalable performance
       • hides the number of cores from the application, for transparency
     [Figure: Thread Scheduler block diagram showing the application (Appl.), the Dependency Controller with per-thread counters, the Thread Pool of ready threads, and a Thread Dispatcher on each core; labels: register, necessary, available, ready, fetch & execute]
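One way the three components could fit together is sketched below. This is an assumption-laden illustration, not the implemented scheduler: the `ThreadScheduler` class, its methods, and the `decode_frame` example with its task names are all hypothetical, Python's `queue.Queue` stands in for the Thread Pool, a lock stands in for whatever atomic operations the hardware provides, and the dependency graph is assumed to be acyclic and fully registered before `run` is called.

```python
import queue
import threading


class ThreadScheduler:
    """Hypothetical sketch: Dependency Controller + Thread Pool + Dispatchers."""

    def __init__(self):
        self._pool = queue.Queue()      # Thread Pool: holds only ready threads
        self._lock = threading.Lock()   # guards the counters and the log
        self._counter = {}              # name -> number of unmet dependencies
        self._waiters = {}              # name -> names that depend on it
        self._body = {}                 # name -> function to execute
        self.finished = []              # completion order, for inspection

    def register(self, name, body, needs=()):
        """Dependency Controller: record a thread and what it depends on."""
        self._body[name] = body
        self._counter[name] = len(needs)
        self._waiters.setdefault(name, [])
        for producer in needs:
            self._waiters.setdefault(producer, []).append(name)

    def _retire(self, name, num_cores):
        """Update dependents' counters; move newly ready threads to the pool."""
        with self._lock:
            self.finished.append(name)
            for waiter in self._waiters[name]:
                self._counter[waiter] -= 1
                if self._counter[waiter] == 0:
                    self._pool.put(waiter)
            if len(self.finished) == len(self._body):
                for _ in range(num_cores):
                    self._pool.put(None)    # tell every dispatcher to stop

    def _dispatch(self, num_cores):
        """Thread Dispatcher: fetch a ready thread, run it to completion."""
        while True:
            name = self._pool.get()
            if name is None:
                return
            self._body[name]()
            self._retire(name, num_cores)

    def run(self, num_cores):
        if not self._body:
            return
        for name, count in list(self._counter.items()):
            if count == 0:
                self._pool.put(name)
        cores = [
            threading.Thread(target=self._dispatch, args=(num_cores,))
            for _ in range(num_cores)
        ]
        for core in cores:
            core.start()
        for core in cores:
            core.join()


def decode_frame(num_cores):
    """Hypothetical application: identical registration for any core count."""
    sched = ThreadScheduler()
    sched.register("header", lambda: None)
    sched.register("slice0", lambda: None, needs=["header"])
    sched.register("slice1", lambda: None, needs=["header"])
    sched.register("deblock", lambda: None, needs=["slice0", "slice1"])
    sched.run(num_cores)
    return sched.finished
```

The application side (`decode_frame`) only registers threads and dependencies. The core count appears solely as the argument to `run`, which is the kind of separation the slide calls transparency.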

  9. Designing a many-core processor
     • Design goals for a many-core processor
       • Achieve scalable performance
       • Reuse existing software written for a multi-core processor
         • A many-core processor has to execute existing software efficiently
         • Knowledge of the software is absolutely necessary
     Software engineers and hardware engineers collaborated closely to design the many-core processor
     • Design cycles
       • Use a "Plan – Evaluate – Analyze – Improve" cycle
         • Existing software is used throughout evaluation
       • 1st cycle: detect issues in the existing architecture
       • 2nd cycle: improve and optimize
     • Main design features from our development cycle
       • CAS instruction, multi-bank L2 cache, tree-based network on chip
     [Figure: design cycle of Plan, Evaluate using simulation, Analyze, Improve]

  10. Evaluation results
     • Used the SAME application binary even when the number of cores is changed
     These results confirm that the proposed thread scheduler achieves scalable performance with transparency!
     [Figures: performance plots for H.264 decoding (1080p) and super resolution (full HD to 4K2K); annotations: "Scalable performance" on both plots, and "Lack of READY threads: # of ready threads < # of MPEs"]
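The "lack of READY threads" annotation can be illustrated with a toy model. This is not the authors' measurement data. The model assumes every thread takes one time unit, that the workload is a sequence of stages with a fixed number of ready threads each, and that a stage finishes before the next one starts. The functions `makespan` and `speedup` and the stage widths are hypothetical.

```python
import math


def makespan(stage_widths, cores):
    """Time units needed when each stage offers `width` unit-time ready threads."""
    return sum(math.ceil(width / cores) for width in stage_widths)


def speedup(stage_widths, cores):
    """Speedup over a single core under the same toy model."""
    return makespan(stage_widths, 1) / makespan(stage_widths, cores)


# Hypothetical workload: 10 stages, each with 16 ready threads.
widths = [16] * 10
speedups = {cores: speedup(widths, cores) for cores in (1, 2, 4, 8, 16, 32)}
```

In this model the speedup grows with the core count only while each stage has at least as many ready threads as cores. With 16 ready threads per stage, going from 16 to 32 cores changes nothing, because the extra cores find nothing in the shared queue to fetch.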

  11. Conclusions
     • Proposed a low-overhead thread scheduler
       • It achieves scalable performance and transparency
     • Reduces thread execution overheads
       • Defined unique properties for a thread
         • A thread never suspends
         • A thread becomes ready when all necessary data are available
       • Managed thread status by the number of dependencies
     • Hides the number of cores
       • Designed a distributed scheduler with a shared queue
     • Confirmed performance scalability and transparency
       • Evaluated on a real 32-core many-core processor
       • Scalable performance is achieved without modifying the application program
     Our scheduler contributes to reducing software development cost
