Multi-Core

• Background of Multi-Core
• CMOS transistor
• Typical structures of Multi-Core
• Technical issues
• Cache coherence control
• The difficulty of parallelization
Chikara Fukunaga (福永 力)

Background of Multi-Core
• Limits of transistor miniaturization
• Leak current inherent in the CMOS structure grows to a level that cannot be ignored, which sets an upper limit on the drive frequency
• Core 2 was made with CMOS of gate length 45 nm (→ 22 nm)
• Limits of power consumption
• TDP (Thermal Design Power, the maximum heat dissipation) reaches its limit at present drive frequencies
• Such a processor can no longer be adopted for mobile devices
• Heat generation >> heat dissipation capacity
• Problems of single-core design
• Limit of hardware design complexity beyond super-scalar/pipeline
• IPC is unlikely to exceed four (is IPC > 4 impossible?)

CMOS structure and principle
• CMOS = Complementary Metal Oxide Semiconductor
• Logic circuits are built from pMOS and nMOS transistors together
[Figure: cross-section of a pMOS and an nMOS transistor. Labels: metal wiring, gate (poly-silicon), guard, insulating oxide, well, substrate, source, drain]
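How pMOS and nMOS complement each other can be sketched with a switch-level model. This is only an illustration, assuming ideal switches (a pMOS conducts when its gate is 0, an nMOS when its gate is 1); the function names are hypothetical and do not come from the slides.

```python
# Switch-level sketch of CMOS logic (ideal switches, no timing or leak current).

def pmos_on(gate: int) -> bool:
    """A pMOS transistor conducts when its gate input is 0."""
    return gate == 0

def nmos_on(gate: int) -> bool:
    """An nMOS transistor conducts when its gate input is 1."""
    return gate == 1

def cmos_inverter(a: int) -> int:
    pull_up = pmos_on(a)    # pMOS connects the output to the supply
    pull_down = nmos_on(a)  # nMOS connects the output to ground
    assert pull_up != pull_down  # exactly one of the two networks conducts
    return 1 if pull_up else 0

def cmos_nand(a: int, b: int) -> int:
    pull_up = pmos_on(a) or pmos_on(b)     # two pMOS in parallel
    pull_down = nmos_on(a) and nmos_on(b)  # two nMOS in series
    assert pull_up != pull_down
    return 1 if pull_up else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, cmos_nand(a, b))
```

The assertions express the complementary property: for every input, either the pull-up or the pull-down network conducts, never both.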

Towards Multi-Core
Source: Nikkei Electronics (日経エレクトロニクス), 2004.8.30
• A smaller design rule allows many cores to be built into one chip.
• With multi-core, performance can keep improving at the same pace as before.
• The computing performance required of a single core can be held down to 1/(number of cores).
• Low supply voltage and low power consumption
• The process gains some margin. For example, the gate oxide can be made thicker to reduce leak current.
[Figure: processor performance versus design rule (gate width). The single-core curve speeds up with the design rule until the design rule no longer helps; the multi-core curves are labeled "Parallel & freq. 30% lower" and "Parallel & freq. 50% lower". Reference points: Pentium 4, 180 nm (2000); Pentium D, 90 nm (2005)]
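The trade-off between lower frequency and more cores can be made concrete with a rough estimate. The sketch below assumes the textbook dynamic power relation P ∝ C·V²·f, assumes that the supply voltage can be lowered in proportion to the frequency, and assumes a perfectly parallel workload. None of these assumptions is stated on the slide, and leak current is ignored.

```python
# Rough estimate: N cores at reduced frequency versus one core at full frequency.
# Assumptions (not from the slides): P ~ C * V^2 * f, V scales linearly with f,
# and the workload parallelizes perfectly.

def relative_throughput(cores: int, freq_scale: float) -> float:
    """Throughput relative to one core at full frequency."""
    return cores * freq_scale

def relative_power(cores: int, freq_scale: float) -> float:
    """Dynamic power relative to one core at full frequency."""
    voltage_scale = freq_scale  # assumption: voltage follows frequency
    return cores * voltage_scale ** 2 * freq_scale

for freq_scale in (0.7, 0.5):  # frequency 30% lower and 50% lower
    print(freq_scale,
          round(relative_throughput(2, freq_scale), 3),
          round(relative_power(2, freq_scale), 3))
```

Under these assumptions, two cores at a 30% lower frequency deliver 1.4 times the throughput at roughly 0.69 times the dynamic power of the single full-speed core.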

Issues of Multi-Core Implementation
• Besides the advantages, many technical issues still need attention in multi-core design.
• Programming for multi-core (parallelization) also has many open problems.
Original: Nikkei Electronics (日経エレクトロニクス), 30 August 2004

Multi-Core configuration examples
• Shared bus coupling
• Centralized shared memory type
• Distributed memory type
• Interconnection network
• For example, the TPcore network. TPcore is the flagship processor developed by Fukunaga's lab since 2005.

Shared bus coupling (1)
• Centralized shared memory type
• Programming is relatively easy because data are shared
• The increased bus load makes scheduling and the arbitration of bus mastership among cores difficult
• Cache coherency (agreement of data among the caches of the cores and the shared memory) is difficult to maintain
• If MPU1…n are processors of the same kind, this is called a Symmetric Multi-Processor (SMP) configuration, or UMA (Uniform Memory Architecture)
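Bus arbitration can be illustrated with a round-robin arbiter, one common fairness policy. The slide does not say which policy is used, so the choice of round-robin and the class name `RoundRobinArbiter` are assumptions made for this sketch.

```python
from typing import Optional, Set

class RoundRobinArbiter:
    """Grants the shared bus to one requesting master per cycle, in rotation."""

    def __init__(self, n_masters: int) -> None:
        self.n_masters = n_masters
        self.last = n_masters - 1  # so that master 0 is checked first

    def grant(self, requests: Set[int]) -> Optional[int]:
        # Start searching just after the master that was granted last time.
        for offset in range(1, self.n_masters + 1):
            candidate = (self.last + offset) % self.n_masters
            if candidate in requests:
                self.last = candidate
                return candidate
        return None  # nobody requested the bus

arbiter = RoundRobinArbiter(3)
print([arbiter.grant({0, 2}) for _ in range(4)])
```

With masters 0 and 2 requesting continuously, the grants alternate between them, so neither master can starve the other.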

Shared bus coupling (2)
• Distributed memory type
• Each core has its own memory space, which reduces access conflicts on the shared bus
• The programming load increases somewhat. The distributed memories are handled as if they were virtually one unified memory.
• Also called NUMA (Non-Uniform Memory Architecture)
• Most actual chips are a mixture of the shared memory and distributed memory types
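One way to picture distributed memories that are handled as a single virtual memory is a global address that splits into a home node and a local offset. The sizes and access costs below are hypothetical values chosen only for illustration.

```python
from typing import Tuple

LOCAL_SIZE = 1024   # words of local memory per core (hypothetical)
LOCAL_COST = 1      # cost of accessing the core's own memory (hypothetical)
REMOTE_COST = 10    # cost of accessing another core's memory (hypothetical)

def locate(global_addr: int) -> Tuple[int, int]:
    """Split a global address into (home core, offset in that core's memory)."""
    return divmod(global_addr, LOCAL_SIZE)

def access_cost(core: int, global_addr: int) -> int:
    """Access time is non-uniform: it depends on where the address lives."""
    home, _offset = locate(global_addr)
    return LOCAL_COST if home == core else REMOTE_COST

print(locate(2050), access_cost(2, 2050), access_cost(0, 2050))
```

The program sees one address space, but the cost of an access depends on which core issues it, which is the "non-uniform" part of NUMA.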

Examples of Multi-Core Bus Architecture (1)
• Renesas SH-4 (RISC) multi-core SH7786: SH-4A core × 2 (configurable as SMP or non-SMP)
• Mixed architecture of local memory and shared memory
[Figure labels: 26-bit address and 32-bit data bus; 533 MHz; external memories]

Examples of Multi-Core Bus Architecture (2)
• CELL chip (IBM, Toshiba, Sony, Sony Computer Entertainment; SCEI)
• PowerPC Processor Element; PPE (main) × 1
• Synergistic Processor Element; SPE (sub) × 8
• Asymmetric Multi-Processor (ASMP) configuration
• EIB (Element Interconnect Bus), 128 bit × 4

CELL chip processor elements
• PPE (64-bit PowerPC)
• Executes the OS and the main part of applications
• Controls the external main memory, I/O and the SPEs
• In-order 2-way super-scalar, 2-way multi-thread
• SPE: for arithmetic calculation and multimedia
• 128-bit SIMD-type RISC, in-order 2-way
[Figure labels: 32 kB, 32 kB, 512 kB, 256 kB; local memory for access to other SPE data]

Cache problem with Multi-Core
Assume m1 and m2 are the cache copies, in their respective processors, of a block m in main storage (MS).
(1) MPU1 changes m1 to a (store). What should happen to m2 in MPU2?
(2) MPUn wants to read the address of the original m from shared memory into its own cache. After (1), which copy should it refer to?
(3) MPUn has a cache miss on a write access to the address of the original m and, because the cache is write-through, wants to rewrite the data directly in (shared) memory. What should it do?
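Situation (1) can be reproduced with a toy model in which each cache is a plain dictionary and nothing keeps the copies consistent. The variable names are hypothetical and the model assumes a write-back cache, so the store does not reach main storage.

```python
# Toy model of two private caches with no coherence mechanism at all.
main_storage = {"m": 0}
cache1 = {"m": main_storage["m"]}  # m1: MPU1's copy of m
cache2 = {"m": main_storage["m"]}  # m2: MPU2's copy of m

cache1["m"] = "a"  # (1) MPU1 stores a into m1; write-back, so MS is unchanged

# Without any coherence action, both m2 and the original m are now stale.
stale_in_cache2 = cache2["m"] != cache1["m"]
stale_in_memory = main_storage["m"] != cache1["m"]
print(stale_in_cache2, stale_in_memory)
```

Both flags come out true: MPU2 would keep reading the old value, and a refill from main storage, as in (2), would fetch the old value as well.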

Cache Coherency
• It is essential that a processor always obtains the latest data when it makes a read access to any memory (shared or distributed).
• This must be guaranteed by the processor hardware.
Cache write control mechanisms:
• Write-back is normally adopted for multi-core caches because it puts less load on the shared bus.
• At refill on a cache read/write miss:
• Write update: an "update" request is sent to all processors that hold the block in their caches.
• Write invalidate: an "invalidate" request is sent to all processors that hold the block in their caches.
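The difference between the two policies can be sketched with caches modeled as dictionaries. This is a simplified illustration: the function `store` and its arguments are hypothetical, and real hardware acts on cache blocks through bus requests rather than on Python objects.

```python
from typing import Dict, List

def store(caches: List[Dict[str, object]], writer: int, block: str,
          value: object, policy: str) -> int:
    """Write `value` to `block` in the writer's cache and apply the policy to
    every other cache holding the block. Returns the number of requests sent."""
    caches[writer][block] = value
    requests = 0
    for cpu, cache in enumerate(caches):
        if cpu == writer or block not in cache:
            continue
        if policy == "update":
            cache[block] = value   # write update: refresh the remote copy
        elif policy == "invalidate":
            del cache[block]       # write invalidate: drop the remote copy
        else:
            raise ValueError("unknown policy: " + policy)
        requests += 1
    return requests

caches = [{"m": 0}, {"m": 0}, {}]
store(caches, 0, "m", "a", "update")
print(caches)

caches = [{"m": 0}, {"m": 0}, {}]
store(caches, 0, "m", "a", "invalidate")
print(caches)
```

With update, the second cache ends up holding the new value; with invalidate, it loses the block and must refill on its next access.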

Communicating and confirming cache changes with a directory
• Directory method (centralized management)
• Each processor has a table that registers which processors share a copy of its own memory blocks.
• If a processor modifies a block, it can quickly identify the processors to which the modification must be reported.
• However, this method is effective mainly for distributed memory architectures. Shared bus architectures mainly use distributed management by snooping, described next.
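A directory can be sketched as a table from block address to the set of processors that hold a copy. The class below is a minimal illustration with hypothetical names, and it assumes an invalidate-style protocol in which the writer becomes the only holder after a modification.

```python
from collections import defaultdict
from typing import List

class Directory:
    """Tracks, per memory block, which processors hold a cached copy."""

    def __init__(self) -> None:
        self.sharers = defaultdict(set)  # block address -> set of processor ids

    def register(self, block: int, cpu: int) -> None:
        """Record that `cpu` has loaded a copy of `block`."""
        self.sharers[block].add(cpu)

    def on_modify(self, block: int, writer: int) -> List[int]:
        """Return the processors that must be notified of the modification.
        Assumption: the others are invalidated, so only the writer remains."""
        targets = sorted(self.sharers[block] - {writer})
        self.sharers[block] = {writer}
        return targets

directory = Directory()
directory.register(0x40, 0)
directory.register(0x40, 2)
directory.register(0x80, 1)
print(directory.on_modify(0x40, writer=0))
```

Only the processors listed for the block are contacted, so no broadcast over a shared medium is required.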

Checking the cache state with snoop
• Broadcast snoop (snoop = to pry, to nose around)
• On a cache read/write miss, a coherent request is issued to all processors through the shared bus.
• Every processor snoops its own cache to check whether it holds the block to be refilled and, if so, whether the block is clean or dirty.
• If the block is found dirty, its data should be returned by write-back, and the block moves to the owner state.
• If the block is found clean, it is put into the invalid or shared state.
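The reaction of one cache to a snooped request can be sketched as a small function. The state names and the rule that a snooped write request invalidates the local copy are simplifying assumptions made for this sketch; the slide only states the dirty and clean cases.

```python
from typing import Dict, Optional, Tuple

Line = Tuple[str, Optional[int]]  # (state, data)

def snoop(cache: Dict[int, Line], block: int, request: str) -> Optional[int]:
    """React to a coherent request seen on the bus.
    `request` is "read" or "write". Returns data supplied to the requester,
    or None when this cache has nothing newer than main storage."""
    if block not in cache or cache[block][0] == "invalid":
        return None                       # this cache does not hold the block
    state, data = cache[block]
    is_dirty = state in ("dirty", "owner")
    supplied = data if is_dirty else None
    if request == "write":
        cache[block] = ("invalid", None)  # assumption: requester takes over
    elif is_dirty:
        cache[block] = ("owner", data)    # dirty copy answers and becomes owner
    else:
        cache[block] = ("shared", data)   # clean copy is now shared
    return supplied

cache = {1: ("dirty", 7), 2: ("clean", 3)}
print(snoop(cache, 1, "read"), cache[1])
print(snoop(cache, 2, "read"), cache[2])
```

A dirty copy answers the request itself because main storage is out of date, while a clean copy only changes its state.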

State transitions of a store-in (write-back) cache (single core)
[Figure: cache with direct-mapped architecture, showing the local cache and main storage. Dark blue cells are dirty, light blue cells are clean]
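The behavior the figure describes can be sketched as a direct-mapped write-back cache with one word per line. The class name and the one-word line size are simplifications assumed for this sketch, not details taken from the slide.

```python
from typing import List, Optional

class WriteBackCache:
    """Direct-mapped, write-back cache with one word per line (single core)."""

    def __init__(self, memory: List[int], n_lines: int) -> None:
        self.memory = memory
        self.n_lines = n_lines
        # Each line is [tag, data, dirty] or None when empty.
        self.lines: List[Optional[list]] = [None] * n_lines

    def _line_for(self, addr: int) -> list:
        index, tag = addr % self.n_lines, addr // self.n_lines
        line = self.lines[index]
        if line is None or line[0] != tag:            # cache miss
            if line is not None and line[2]:          # victim is dirty:
                victim_addr = line[0] * self.n_lines + index
                self.memory[victim_addr] = line[1]    # write it back first
            line = [tag, self.memory[addr], False]    # refill as clean
            self.lines[index] = line
        return line

    def read(self, addr: int) -> int:
        return self._line_for(addr)[1]

    def write(self, addr: int, value: int) -> None:
        line = self._line_for(addr)
        line[1], line[2] = value, True                # line becomes dirty

memory = list(range(8))
cache = WriteBackCache(memory, n_lines=4)
cache.write(1, 100)
print(memory[1])   # main storage is not updated by the write itself
cache.read(5)      # address 5 maps to the same line and evicts address 1
print(memory[1])   # the dirty data has now been written back
```

Main storage is updated only when the dirty line is evicted, which is why write-back puts less load on the bus than write-through.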

Management of multi-core caches with a state transition diagram
• M (Modified): data mismatch between MS (Main Storage) and cache (dirty); the block is not found in the caches of other processors
• S (Shared): data match between MS and cache (clean); the block is found in the caches of other processors
• E (Exclusive): data match between MS and cache (clean); the block is not found in the caches of other processors
• I (Invalid): the state right after reset or after an "invalidate" command (no data available)
• O (Owner or Owned): data mismatch between MS and cache (dirty); the block is found in the caches of other processors
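The five states follow from three yes/no properties of a cached block, which the following sketch makes explicit. The function name `classify` and its parameters are hypothetical.

```python
from enum import Enum

class State(Enum):
    M = "Modified"
    O = "Owned"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"

def classify(valid: bool, dirty: bool, in_other_caches: bool) -> State:
    """Map the properties listed above to a state name."""
    if not valid:
        return State.I
    if dirty:   # cache and main storage differ
        return State.O if in_other_caches else State.M
    # clean: cache and main storage agree
    return State.S if in_other_caches else State.E

for dirty in (True, False):
    for shared in (True, False):
        print(dirty, shared, classify(True, dirty, shared).value)
```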

MSI protocol
• The clean (Shared) state does not distinguish whether or not a memory block is shared with the caches of other processors.
• Bus snoop is necessary at both read and write misses.
• Even when only this processor holds the block and no other cache has a copy, a bus snoop is still issued, which generates unnecessary snoops (bus traffic).
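The processor-side transitions of MSI fit in a small table. The event and transaction names (`PrRd`, `PrWr`, `BusRd`, `BusRdX`) follow common textbook notation and are not taken from the slides; transitions caused by snooped requests from other processors are left out to keep the sketch short.

```python
from typing import List, Tuple

# (current state, processor event) -> (next state, bus transaction or None)
MSI = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),  # issued even if no other cache has a copy
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
}

def run(events: List[str], state: str = "I") -> Tuple[str, List[str]]:
    """Apply processor events to one block; return final state and bus traffic."""
    bus = []
    for event in events:
        state, transaction = MSI[(state, event)]
        if transaction is not None:
            bus.append(transaction)
    return state, bus

print(run(["PrRd", "PrWr"]))
```

A read followed by a write to a block that no other processor holds still costs two bus transactions, because the Shared state cannot tell a private block from a shared one.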

MESI protocol
• The clean state is split into two: Shared and Exclusive. Most blocks are expected to be Exclusive; in that case the processor can write to the block without a snoop, so no unnecessary traffic is generated on the bus.
• Adopted by many multi-core processors at present, e.g. PowerPC and Intel Core 2.
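The effect of the Exclusive state can be sketched for the processor-side transitions, following the usual textbook formulation of MESI, which is an assumption here since the slides give no transition table. The names `mesi_step`, `PrRd`, `PrWr`, `BusRd` and `BusRdX` are illustrative.

```python
from typing import Optional, Tuple

def mesi_step(state: str, event: str,
              others_have_copy: bool = False) -> Tuple[str, Optional[str]]:
    """Return (next state, bus transaction or None) for one processor event.
    `others_have_copy` is the snoop result and matters only on a read miss."""
    if event == "PrRd":
        if state == "I":  # read miss: the snoop result selects S or E
            return ("S" if others_have_copy else "E"), "BusRd"
        return state, None                # read hit in M, E or S
    if event == "PrWr":
        if state in ("M", "E"):
            return "M", None              # silent upgrade, no bus traffic
        return "M", "BusRdX"              # from S or I: other copies must go
    raise ValueError("unknown event: " + event)

state, bus = "I", []
for event in ("PrRd", "PrWr"):
    state, transaction = mesi_step(state, event)
    if transaction is not None:
        bus.append(transaction)
print(state, bus)
```

For a block that no other cache holds, the read miss is the only bus transaction; the following write moves the block from Exclusive to Modified silently.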

Review: Simultaneous Multi-Thread (SMT)
• In SMT, the OS or dedicated hardware controls the execution of multiple threads within a single core.
• SMT makes more effective use of a super-scalar core: resources that would otherwise sit idle can run simultaneously and independently, which improves IPC.
• Progress in thread-level parallelization (TLP) is expected to raise the efficiency of SMT processors further.

From SMT to Multi-Core
• Multi-process (multi-task) systems have traditionally been controlled by the OS, and techniques that prevent processes from contending for resources have been developed there (spin lock, semaphore, CSP, etc.).
• The issue is whether this technology can be moved down from the OS level to the hardware level, so that many threads are properly distributed and assigned to the processors of a multi-core system and an optimized parallel processing environment is realized.
• Developing this technology naturally requires solving the various problems of parallel processing systems, which are both old and new.
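As a software-level illustration of one of the techniques named above, the following sketch uses a counting semaphore from Python's standard `threading` module to limit how many threads use a resource at once. The limit of two and the bookkeeping dictionary `stats` are hypothetical choices for the example.

```python
import threading

slots = threading.Semaphore(2)  # at most two threads may use the resource
guard = threading.Lock()        # protects the bookkeeping below
stats = {"in_use": 0, "peak": 0}

def worker() -> None:
    with slots:                 # blocks while two threads are already inside
        with guard:
            stats["in_use"] += 1
            stats["peak"] = max(stats["peak"], stats["in_use"])
        # ... the shared resource would be used here ...
        with guard:
            stats["in_use"] -= 1

threads = [threading.Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(stats["peak"] <= 2, stats["in_use"])
```

However the eight threads are scheduled, the recorded peak never exceeds the semaphore's initial count.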

The many problems of parallel programming
• From the slides of Y. Lin of Sun Microsystems at Hot Chips 2006: the issues he pointed out for multi-threaded parallel programs, which are under discussion:
• How to find or create tasks that can run in parallel
• How to map tasks onto threads
• How to achieve scalability
[Figure: photo of the original English slide]
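The first two issues can be illustrated in miniature with Python's standard `concurrent.futures` module: a sum of squares is split into independent tasks, which are then mapped onto a pool of threads. The workload, the chunking scheme and the worker count of four are hypothetical choices for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

def task(chunk: List[int]) -> int:
    """One independent unit of work: it shares no state with other tasks."""
    return sum(x * x for x in chunk)

data = list(range(1000))
n_workers = 4

# Finding parallel tasks: split the data into independent chunks.
chunks = [data[i::n_workers] for i in range(n_workers)]

# Mapping tasks to threads: the pool assigns each chunk to a worker thread.
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partial_sums = list(pool.map(task, chunks))

total = sum(partial_sums)
print(total == sum(x * x for x in data))
```

The sketch shows the mapping only. In CPython, threads do not speed up CPU-bound work like this because of the global interpreter lock, which is itself a small example of the scalability issue.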
