Which applications will become mainstream by incorporating 3D-TSV technology? (How to make true 3D-TSV IC application)
Kanji Otsuka, Meisei University Collaborative Research Center
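Who was the real culprit that held back system performance in the past?
In the brain, the neuron is a computing element that also serves as a memory element: the brain is a logic-memory conjugate system. In a computer, by contrast, logic elements and memory are separate. As logic got faster, the time needed to move data in and out of memory lagged behind and became a drag on the logic. The workaround was to place memory next to the logic, starting with cache memory systems. The schemes grew ever more complex, and still more logic was needed to control them. The picture resembles a large corporation whose head-office functions have become bloated.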
Intel has also said that this is a big problem. (Source: Intel Developer Forum, Spring 2003)
The gap between the processing capability of the processor chip and the speed at which it can take in data keeps widening, and it is holding back system performance. (Source: Nikkei Microdevices, April 2009, pp. 17-21)
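Increasing complexity of Intel microarchitectures
[Figure: changes in Intel architecture, MIPS (log scale, 1 to 1,000,000) versus year (1980 to 2010). Labels: microcode parallelism era; instruction-level parallelism era; task-level parallelism era; Pentium Architecture (superscalar); Pentium Pro Architecture (speculative, out of order); Pentium 4 Architecture (trace cache); Pentium 4 and Xeon Architecture with HT (multi-threaded); multi-threaded, multi-core. Source: Intel Developer Forum, Spring 2003.]
Most of this complexity is a countermeasure against insufficient bandwidth. It is a distortion that results from the worldwide failure to invest in packaging. I assert that there are two causes: the standing of the packaging field was so low that its voice never reached management, and packaging engineers, marginalized and cut off from information, remained ignorant.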
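Key technology for More than Moore: 3D-TSV
3D-TSV raises high expectations because it is three-dimensional and because it looks able to widen the bandwidth discussed above. 3D-TSV is now recognized worldwide, and processes, design methods, and test methods are being developed for it. Here we examine whether there are applications that can actually use it effectively.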
We have not yet found a major application for the TSV interconnection structure. As we understand it, the main figure of merit of the TSV structure is that 3-D interconnection frees us from 2-D restrictions. Is this figure of merit correct? We should re-examine it with a view to creating major applications.
1. TSV diameter: still very large for an interconnection. Even if we choose a TSV diameter of 2 um, the TSV wastes active area and 2D wiring area. [Figure: 2D interconnection and TSV in the Si substrate]
TSV size will not scale down to the wiring limit. The advantage of TSV lies rather in the 3D structure: TSVs can provide the equivalent of roughly two more wiring layers while preventing wiring length from growing. Current technology: 6 or 10 metal layers. [Figure: metal layers on the Si substrate with TSV]
2. Trade-off between TSV aspect ratio and the intrinsic gettering (IG) layer. In the via-last case, the intrinsic gettering layer is lost when the wafer is thinned to 50 um or less. [Figure: Si substrate, TSV, IG layer, thinning edge]
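To make the trade-off concrete, the sketch below computes the aspect ratio (depth over diameter) of a via. The 2 um diameter comes from the earlier slide and the 50 um limit from this one. The other thickness values, the function name, and the assumption that the via depth equals the full wafer thickness are illustrative only.

```python
def tsv_aspect_ratio(wafer_thickness_um: float, via_diameter_um: float) -> float:
    """Aspect ratio of a TSV assumed to span the full wafer thickness."""
    if via_diameter_um <= 0:
        raise ValueError("via diameter must be positive")
    return wafer_thickness_um / via_diameter_um


# From the slide: the IG layer is lost at a wafer thickness of 50 um or less.
IG_LOSS_THICKNESS_UM = 50.0

# Hypothetical thickness values, chosen only to show both sides of the limit.
for thickness_um in (100.0, 50.0, 20.0):
    ratio = tsv_aspect_ratio(thickness_um, via_diameter_um=2.0)
    ig_kept = thickness_um > IG_LOSS_THICKNESS_UM
    print(f"{thickness_um:5.0f} um wafer, 2 um via: "
          f"aspect ratio {ratio:.0f}, IG layer kept: {ig_kept}")
```

Under these assumptions a thicker wafer keeps the gettering layer but pushes the aspect ratio of a narrow via up, which is the tension the slide points at.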
3. The known-good-die (KGD) issue is difficult to solve in wafer-to-wafer (W2W) stacking, so redundancy must be implemented. [Figure: failed die inside a stack]
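A rough way to see why W2W stacking calls for redundancy: if dies cannot be pre-selected, every die in a stack must be good. The sketch below assumes independent defects and a hypothetical per-die yield of 0.9; neither the number nor the function names come from the slides.

```python
from math import comb


def stack_yield(die_yield: float, layers: int) -> float:
    """Probability that every die in a blindly stacked column is good."""
    return die_yield ** layers


def stack_yield_with_spares(die_yield: float, needed: int, spares: int) -> float:
    """Probability that at least `needed` of `needed + spares` dies are good."""
    total = needed + spares
    return sum(
        comb(total, good) * die_yield ** good * (1 - die_yield) ** (total - good)
        for good in range(needed, total + 1)
    )


# Hypothetical 90% die yield; one spare layer per stack.
for layers in (2, 4, 8):
    plain = stack_yield(0.9, layers)
    spared = stack_yield_with_spares(0.9, layers, spares=1)
    print(f"{layers} layers: no spare {plain:.3f}, one spare {spared:.3f}")
```

The binomial model treats any layer as replaceable by the spare, which is optimistic for real designs; it is meant only to show the direction of the effect.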
4. Thermal issues are difficult in structures with many stacked layers, so power saving is required. [Figure: stacked Si substrates with TSVs; integrated thermal energy]
5. An effective function is needed to overcome the cost issue.
6. Many other restrictions arise in process and design technology: complexity increases.
Summary of 3D-TSV restrictions
Several solutions have been announced, but the trend still seems insufficient.
• (1) Tile or small-block arrays interconnected through TSVs are good for memory or image sensor systems with wide-band interconnection through several thousand TSVs.
• (2) Cache DRAM facing the CPU provides a large cache while saving area.
• (3) Stacked closed function blocks, including FPGAs and cores, make a scalable system with redundancy.
• (4) A silicon interposer with TSVs gives higher-performance 2D wiring.
• (5) A stacked memory module and a stacked module of many small cores are connected through a diagnosis-restoration and dynamic-reconfiguration wiring module. This is close to an ideal system, but none has been specified so far.
[Figure labels: redundant memory, core, CPU or bus controller, memory, FPGA, diagnostic-restoration, many-core, Si interposer]
In all published examples, the TSVs sit on the periphery of the active area of a tile or block. A TSV array inside the active area is not feasible. [Figure: active area; TSV and interconnection pad]
An example of a processor structure that improves bandwidth
[Figure: stacked chips with TSVs and active areas, from top to bottom: attached heat sink; many-core processor; crossbar switch and cache memory; DRAMs; I/O interface; LCR-embedded interposer]
• Many-core processor clock frequency: 500 MHz
• Via pitch: 20 μm
• Die size: 10 mm
• Total stacked chip thickness: up to 0.8 mm
• I/O interface chip handles up to 5 GHz
• LCR-embedded interposer for higher-frequency I/O signals and power distribution
• Maximum wiring length that can be handled as a lumped model at up to 500 MHz: 15 mm
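The slide does not say how the 15 mm lumped-model limit at 500 MHz was obtained. One common rule of thumb treats a wire as lumped while it is shorter than a small fraction of the signal wavelength. The sketch below assumes a relative permittivity of 4 and a limit of one twentieth of a wavelength; both values are assumptions, chosen because they land near the quoted figure.

```python
from math import sqrt

C0_M_PER_S = 299_792_458.0  # speed of light in vacuum


def max_lumped_length_mm(
    freq_hz: float,
    rel_permittivity: float = 4.0,        # assumed dielectric constant
    wavelength_fraction: float = 1 / 20,  # assumed rule-of-thumb fraction
) -> float:
    """Longest wire (mm) still treated as lumped under the assumed rule."""
    velocity_m_per_s = C0_M_PER_S / sqrt(rel_permittivity)
    wavelength_m = velocity_m_per_s / freq_hz
    return wavelength_m * wavelength_fraction * 1e3


print(f"{max_lumped_length_mm(500e6):.1f} mm at 500 MHz")
print(f"{max_lumped_length_mm(5e9):.1f} mm at 5 GHz")
```

At the 5 GHz handled by the I/O interface chip the same rule gives a limit ten times shorter, which is consistent with those signals being routed as transmission lines in the interposer.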
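An example in which the many-core processor is split across multiple chips
[Figure: stacked chips with TSVs and active areas, from top to bottom: attached heat sink; many-core processor chips; crossbar switch and cache memory; DRAMs; I/O interface; LCR-embedded interposer]
• Many-core processor clock frequency: 500 MHz
• Via pitch: 20 μm
• Die size: 10 mm
• Total stacked chip thickness: up to 1.0 mm
• LCR-embedded interposer for higher-frequency I/O signals and power distribution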
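Example chip layout dimensions
[Figure: chip with active area in the center and TSVs and interconnection pads around it]
• Via pitch: 20 um
• Via diameter: 10 um
• Number of vias: 10,000
• Share of a 10 mm square chip taken by vias: 4.4%
• Ratio of signal pins to power/ground pins: 5,000 to 5,000
• Clock frequency: 500 MHz
The signal bandwidth is calculated from the pin count, frequency, and data rate as follows:
$(\text{number of signal pins}) \times (\text{clock frequency}) \times (\text{data rate}) = 5{,}000 \times 500 \times 10^{6} \times 2 = 5 \times 10^{12}\ \mathrm{bit/s} = 5\ \mathrm{Tbps}$
Bandwidth of a current high-end bus: up to 10 Gbps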
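The 5 Tbps figure can be checked with a few lines of arithmetic. In this sketch the "data rate" factor of 2 is read as two bits per pin per clock (double data rate), which is an interpretation rather than something the slide states, and the function name is illustrative.

```python
def aggregate_bandwidth_bps(signal_pins: int, clock_hz: float, bits_per_clock: int) -> float:
    """Total signal bandwidth in bit/s across all signal pins."""
    return signal_pins * clock_hz * bits_per_clock


total_bps = aggregate_bandwidth_bps(signal_pins=5_000, clock_hz=500e6, bits_per_clock=2)
print(f"{total_bps / 1e12:.1f} Tbps")

# Ratio to a 10 Gbps bus, the comparison figure used in the talk.
print(f"{total_bps / 10e9:.0f}x a 10 Gbps bus")
```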
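Example interposer cross-section
[Figure: interposer, 10 mm, with LCR-embedded distributed-constant wiring and power wiring. Labels: I/O interface; differential transmission line from the I/O driver; power supply distributed to a specific chip; differential signal transmission line; LCR equalizer; decoupling capacitor; low-impedance power/ground pair transmission line]
For the differential transmission line between driver and receiver, no equalizer is needed at 500 MHz when the DC resistance is within 2 Ω and the wiring length is at most 400 mm. An equalizer is needed when the resistance exceeds 2 Ω or the frequency is higher (for example, 6.4 Gbps), and in that case the receiver gain must be switched to high sensitivity.
The power/ground pair is routed, vias included, so that its characteristic impedance per chip is 0.2 Ω or less.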
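5 Tbps is impressive
• Compared with the highest bandwidth available today, 5 Tbps / 10 Gbps = 500.
• If the performance of a current core is taken as equivalent to 10 Gbps, a system of 500 cores can be built.
• This solution exists only in the 3D-SiP structure.
• The result is a robust, flexible system that can be used in any equipment.
• Because of the redundancy of the system, KGD is not strictly necessary, though having KGD would be even better.
• However, the architecture cannot be simplified.
• Reducing the drivability of the I/O drivers cuts power substantially. However, power concentrates in 3D, and the heat-dissipation problem cannot be solved for now.
• Applications that need this much performance are limited, so this cannot be called the savior that brings TSV into general use.
Can a simple system be built with a simple methodology?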
A small number of TSVs in each tile or small block would make the most effective structure. However, tiles with different functions have different sizes and different connection requirements, so they cannot be stacked and interconnected efficiently. A natural idea is a unified circuit across the whole system; then the tile structure can be made efficient. The neuron in our brain is a unified function that conjugates logical processing and memory. Can we make such a circuit from CMOS unit gates? [Figure: neuron and axon network]
Dynamic reconfiguration algorithm with unified function blocks
• Efficient communication between neighboring blocks, with high bandwidth and high processing rate.
• The cache grows and shrinks depending on the cache hit ratio; cache is added by newly generated logic.
[Figure: array of mats. Labels: logic; cache surrounding the logic; when the job size increases, the logic expands and the cache surrounds the expanded logic; multitasking with shared cache]
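The slide gives no concrete policy, so the following is only a toy sketch of the idea that identical blocks move between logic and cache roles according to the cache hit ratio. The class name `MatArray`, the thresholds, and the one-block-at-a-time step are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MatArray:
    """Toy pool of identical blocks, each acting as logic, cache, or idle."""
    total_blocks: int
    logic_blocks: int = 1
    cache_blocks: int = 1

    @property
    def idle_blocks(self) -> int:
        return self.total_blocks - self.logic_blocks - self.cache_blocks

    def rebalance(self, hit_ratio: float, low: float = 0.90, high: float = 0.99) -> None:
        """Grow the cache on a poor hit ratio, shrink it when the ratio is ample."""
        if hit_ratio < low and self.idle_blocks > 0:
            self.cache_blocks += 1   # turn an idle block into cache
        elif hit_ratio > high and self.cache_blocks > 1:
            self.cache_blocks -= 1   # release a cache block back to the pool

    def expand_logic(self) -> bool:
        """Claim an idle block as logic when the job grows; report success."""
        if self.idle_blocks > 0:
            self.logic_blocks += 1
            return True
        return False


mats = MatArray(total_blocks=9)
for ratio in (0.80, 0.85, 0.95, 0.995):
    mats.rebalance(ratio)
    print(f"hit ratio {ratio}: cache={mats.cache_blocks}, idle={mats.idle_blocks}")
```

Because every block is the same unified circuit, the reassignment here is only a change of role counters; in hardware it would correspond to rewriting the mode of the affected blocks.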
Unified circuit! It is easy to build with the following configuration: an SRAM can be changed to any function, even wiring connection. [Figure: SRAM used for memory or for logic, switched by a mode selector]
A unified-like scheme is already in use in FPGAs.
FPGA:
○ Logic block: LUT (SRAM) and simple logic with a relatively small driver
○ Switching block: FF + switch
○ Connecting block: wiring
This is not a true unified block, because it is composed of primitive logic plus additional memory (both are hard structures).
Toward a unified circuit (previous slide):
○ Logic block: SRAM with mode selector
○ Memory block: SRAM with mode selector
○ Switching block: SRAM
○ Basic cell connection (wiring): SRAM
Unified! However, implementing the switching block and the wiring in SRAM is inefficient.
Therefore, choose the optimum basic cell size and cluster size:
○ Logic block: SRAM with mode selector and a relatively small driver
○ Memory block: SRAM with mode selector and a relatively small driver
○ Cluster connection: bus with driver (through TSV)
[Figure: FPGA basic cell with logic block, switching block (FF; 0: off, 1: on), connecting block, and I/O. LUT architecture of the Xilinx Virtex-5: a 6-LUT built from two 5-LUTs and a MUX, with signals B1 to B6, BX, CIN, COUT, BMUX, B, BQ, and an FF]
We now introduce our memory-logic conjugate system.
SRAM-based 8-bit processor: an application of the Memory-Logic Conjugate System (MLCS) in its smallest model
Meisei University: Yoichi Sato, Kanji Otsuka
Hitachi ULSI Systems: Masahiro Yoshida
Outlook of the Memory-Logic Conjugate System (MLCS)
1. Solves the problem of low bandwidth between memory and logic (because the memory is the logic itself).
2. Effective architecture: dynamic reconfiguration is done just by rewriting registers (because the memory is the logic itself).
3. High-speed operation: the various registers in a basic cell can be used through dynamic reconfiguration (the basic cell itself is programmable).
4. Suitable for 3D-TSV assembly, and scalable because it is built from small blocks.
5. Low power: no I/O circuits are needed between the logic circuits and the SRAMs, and access paths can be saved.
Structure of the basic cell
[Figure: block diagram of the basic cell. SRAM (LUT), 256 words × 8 bits, with DIN, ADD, ADD (write), CK, CE, and R/W inputs and D output; input control circuit (mode change control and channel control); output control circuit (register, switch, etc. control) with 4-bit REG × 8; channel set register; mode set register. Buses shown with widths of 4 bit × 2 and 4 bit × 4: control bus (CY etc.); sub control bus (8 bit). Legend: outputs of the Route Configuration register or Mode register; reconfiguration bus (4 bit each); control signal (1 bit each); address and data (4 bit each); write command bus]
Simple operations can be programmed by using the rich set of internal registers.
Bus wiring can be routed over the memory area (about 70%), which saves area.
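A small software model may help show what "memory that is the logic itself" means for a 256-word × 8-bit cell. In this sketch the 8-bit address is split into two 4-bit operands and the stored byte is the function result. That split, the class name `BasicCell`, and the two-mode simplification are assumptions for illustration, not the authors' circuit.

```python
class BasicCell:
    """Toy model of a 256-word x 8-bit SRAM used as memory or as a LUT."""

    WORDS = 256

    def __init__(self) -> None:
        self.sram = [0] * self.WORDS
        self.mode = "memory"

    def write(self, address: int, value: int) -> None:
        self.sram[address & 0xFF] = value & 0xFF

    def read(self, address: int) -> int:
        return self.sram[address & 0xFF]

    def load_function(self, func) -> None:
        """Store func(a, b) for every pair of 4-bit operands, then enter logic mode."""
        for a in range(16):
            for b in range(16):
                self.write((a << 4) | b, func(a, b))
        self.mode = "logic"

    def evaluate(self, a: int, b: int) -> int:
        """In logic mode, a lookup is the computation."""
        if self.mode != "logic":
            raise RuntimeError("cell is in memory mode")
        return self.read(((a & 0xF) << 4) | (b & 0xF))


cell = BasicCell()
cell.load_function(lambda a, b: a + b)   # 4-bit adder; the sum fits in 5 bits
print(cell.evaluate(9, 8))
cell.load_function(lambda a, b: a ^ b)   # the same cell reconfigured as XOR
print(cell.evaluate(9, 8))
```

Reconfiguration in this model is just a rewrite of the table contents, which mirrors the claim that dynamic reconfiguration needs only register or memory writes.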
Operation modes of the basic cell (memory-logic conjugate cell)
Rich operation modes make it possible to construct flexible and variable systems.
[Figure: tree of operation modes. Labels: Through Access mode (= initial mode); S/R = "L" (reset mode): External memory mode, Route Configuration Register mode (making LUT); S/R = "H": System mode, Memory mode, External memory mode, Internal memory mode, Route Configuration Register mode (making LUT for dynamic reconfiguration), Logic mode, Arithmetic operation mode, Logic library mode (macro-cell), Combinational Circuit mode; for dynamic reconfiguration: Information Update mode for the Route Configuration Register, Route Configuration mode by the Mode Register]
Outlook of the MLCS structure
The cluster size is allocated to match the operation and the logic density.
[Figure: basic cell array of m rows × n columns with control circuit, bus interface, and decoders; cluster memory; addresses CX and CY (address space of the cluster memory); multiple bus carrying clock, control signals, and data (8 bit × n); other systems (including cluster memory); address fields of q bit and 8 bit labeled memory address of basic cell (B.C.) and extension address]
Memory-Logic Conjugate System (MLCS): the total system, including several cluster memories.
Actual design of a four-basic-cell configuration
[Figure: layout showing the area for TSVs, the four basic cells, and the memory (SRAM) for testing, 256 words × 8 bits × 4 cells]
Cluster memory layout example for a single 8-bit ALU
● Area is about 330 × 330 um² in a 90 nm process (one cluster)
[Figure: cluster layout. Labels: PC adder and 8-bit ALUs (one resource shared, decoder control); logical judgment circuit; basic cell; decoder; shifter (8 bit); X and Y selectors 00, 01, 10, 11; reserve part; instruction decoder; program memory (512 words × 8 bits); basic cell array]
Notes:
(1) Program counter: 16 bit. Using the 8-bit ALU, an address operation takes 1 cycle without overflow and 2 cycles when it overflows.
(2) Structure of the 8-bit ALU: to enable 2-cycle 16-bit addition, a new type of adder with carry code input is introduced (it uses 4 basic cells).
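The 1-cycle versus 2-cycle behavior of the 16-bit program counter can be illustrated with an 8-bit adder that is reused for the high byte only when the low byte overflows. This is a behavioral sketch under that reading; the function names are hypothetical and the internals of the authors' "adder with carry code input" are not modeled.

```python
def add8(a: int, b: int, carry_in: int = 0) -> tuple[int, int]:
    """8-bit add with carry in; returns (sum, carry_out)."""
    total = (a & 0xFF) + (b & 0xFF) + (carry_in & 1)
    return total & 0xFF, total >> 8


def advance_pc(pc: int, step: int = 1) -> tuple[int, int]:
    """Advance a 16-bit PC with an 8-bit adder; returns (new_pc, cycles)."""
    low, carry = add8(pc & 0xFF, step)
    high = (pc >> 8) & 0xFF
    cycles = 1
    if carry:
        # The low byte overflowed: a second pass propagates the carry.
        high, _ = add8(high, 0, carry)
        cycles = 2
    return (high << 8) | low, cycles


for start in (0x1234, 0x12FF):
    new_pc, cycles = advance_pc(start)
    print(f"{start:#06x} -> {new_pc:#06x} in {cycles} cycle(s)")
```

For sequential code only one increment in 256 crosses a byte boundary, so the shared 8-bit resource costs the second cycle rarely.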
Performance comparison between pure logic and MLCS
[Table: values not preserved. Rows: operation speed in processor mode (program command + data; four-thread multi-thread processing; rearrangement); power consumption for the same logic with one thread; area consumption for the same logic with different peripheral circuits. Legend: constant size with some design allowance; dynamic size with minimum design]
Notes: * In the case of 50% independence among the four threads. ** With one thread, pure logic is superior to the SRAM-based MLCS.
Pure logic would be the best for processing; however, MLCS can also operate in dynamic reconfiguration mode and as memory.
Configuring from cluster to mat structure, controlled by a synchronous clock
[Figure: a mat (unit processor element) built from four clusters (basic cell arrays), each with decoders and a control circuit, plus cluster memory. Labels: position of clock supply; space for wiring and TSVs connecting the clusters in a mat]
Clock timing image for synchronous and asynchronous operation
[Figure: sub-processor cluster; master clock]
• Mat-to-mat operation is asynchronous.
• Dynamic access between mats uses an asynchronous clock, with dynamic reconfiguration.
• A hit signal arrives from the neighboring mat in the header of a packet.
• The clock-synchronous cube is what we call a mat.
Dynamic reconfiguration algorithm
• The cache grows and shrinks depending on the cache hit ratio; cache is added by newly generated logic.
• Adjacent addressing keeps the latency within 1 clock inside a synchronous cube.
• Of course, a mat itself can dynamically set its number of registers depending on requirements. A mat can also include penetrated caches inside.
[Figure: array of mats. Labels: logic; cache surrounding the logic; when the job size increases, the logic expands and the cache surrounds the expanded logic; multitasking with shared cache]
Other approaches in technical papers
Memory-structured LUT, presented by Masayuki Sato, RECONF Symposium, September 2006.
One idea introduced there is a memory-based logic circuit with half quadrate interconnection in a random array; however, memories are still consumed for interconnection and switching. Rearrangement of the unit tile is now being developed by Mr. Sato and Prof. Hironaka of Hiroshima City University.
The next significant issue is power saving. Is there a drastic power-saving method? Yes, we have one idea. [Figure: start, stop, radiation of heat]
Physics of power consumption
[Figure: RC delay circuit with CT, Ron, RI, CL, and CI; voltage and current waveforms between start and stop; power consumption of a unit circuit, showing on current, off current, wasted current, and radiation of heat]
We should recover the wasted current.
Huge power! [Figure: power supply building of the K computer] K computer: performance 10 PFLOPS, currently the largest computer in the world.
One solution can be found in the operation of an electric vehicle: the battery is charged by braking and discharged when driving. [Figure: sports EV and battery; charge by brake; discharge]
We would all think, however, that a transistor cannot recover the energy of its active carriers. Is that true?
[Figure: nMOS cross-sections with G, S, D, N-type source and drain in a P-type substrate. On state: active carriers in the conduction band. Gate at 0 V: depletion layer; carriers diffuse and drop to the valence band by recombination, generating heat]
Method for recovering signal energy: active carriers reused in a differential CMOS circuit
The key structure is that the differential MOS transistors are positioned in the same well.
[Figure: layout of the differential pair (source, gate, drain) in the same well, with dimensions 4.3 um, 1 um (space), 11.5 um, 2 um, 7.2 um, and 5 um; input characteristic impedance Z0 = 100 Ω; output characteristic impedance Z0 = 100 Ω]
Method for recovering signal energy: active carriers reused in a differential CMOS I/O driver
The differential transistors are arranged in the same well.
[Figure: driver circuit with VDD, INP, INN, OUTP, OUTN, VRF, input ESD, output ESD, current control, and inverter; cross-section with IN-positive and IN-negative devices, n+ and p+ regions, N-well and P-well on P_SUB]
Unit cell layout configuration [Figure: ESD, inverter, ESD]
Active carrier reuse model
[Figure: capacitance profile of an nMOS transistor as a function of bias (capacitance axis marked 0.5 and 1, bias axis marked 0 V), with the transient inversion region indicated]
[Figure: paired switches in the same well in three states: initial, transient, and after inversion. Carriers are forcibly released by the capacitance change, and the free carriers move to the other capacitance through the voltage sink. An inductance limits discharge when carriers are rejected through the source or drain]
Assume a hole mobility of $\mu_h = 4 \times 10^{2}\ \mathrm{cm^2/(V\,s)}$ at 300 K for a carrier density of $10^{14}$ to $10^{15}\ \mathrm{cm^{-3}}$, and $V_{dd} = 1.8\ \mathrm{V}$. The drift term is then $D = \mu_h V_{dd} = 7.2 \times 10^{2}\ \mathrm{cm^2/s}$. For a carrier travel length of 10 μm, $0.001\ \mathrm{cm} = \sqrt{D t} = \sqrt{7.2 \times 10^{2} \cdot t}$, so $t \approx 1.4 \times 10^{-9}\ \mathrm{s} = 1.4\ \mathrm{ns}$. This is longer than our target pulse rise time of 100 ps (3 GHz equivalent). The electron travel time, however, is 130 ps, which is of the order of our rise time.
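The hole travel-time estimate follows from the relation L = sqrt(D t) with D taken as mobility times supply voltage. The sketch below evaluates it with the values quoted on the slide; the function name is illustrative, and the electron case is not computed because the slide does not state the electron mobility it used.

```python
def transit_time_s(length_cm: float, mobility_cm2_per_vs: float, voltage_v: float) -> float:
    """Solve L = sqrt(D * t) for t, with D = mobility * voltage (in cm^2/s)."""
    d_cm2_per_s = mobility_cm2_per_vs * voltage_v
    return length_cm ** 2 / d_cm2_per_s


# Slide values: 10 um travel length, hole mobility 4e2 cm^2/(V s), Vdd = 1.8 V.
t_hole_s = transit_time_s(length_cm=1e-3, mobility_cm2_per_vs=4e2, voltage_v=1.8)
print(f"hole travel time: {t_hole_s * 1e9:.2f} ns")

# Compare with the 100 ps target rise time quoted on the slide.
print(f"ratio to 100 ps rise time: {t_hole_s / 100e-12:.1f}")
```

The result is more than ten times the target rise time, which is why the slide turns to the electron travel time instead.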
Carrier reuse driver chip
We can save power with the carrier reuse circuit.
Measurement setup: IC chip in a conventional 0.18 um node CMOS process, flip-chip bonded. Differential input: Z0 = 100 ohm, 0.25 mm length. Substrate wiring for the differential output: 8 mm long, Z0 = 100 ohm, with a 100 ohm terminator and differential probing. Capacitances: Cip = 0.47 pF, Cwel = 1.56 pF, Cin = 0.45 pF. The supply current is measured from the voltage drop across a 4.7 ohm series resistor.
[Figure: differential inverter current versus frequency, two panels. Axes: current (mA) versus frequency (GHz, 0.001 to 10, log scale), at Vdd = 1.8 V. Curves and annotations: measured current at Vdd 1.8 V; current calculated from the capacitance; ohmic current; DC current of the current control transistors and clamping drivers; depressed swing height region; reduction]
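Two quantities in this measurement are simple to reproduce: the supply current inferred from the voltage drop across the 4.7 ohm resistor, and a capacitive switching current estimated as C × Vdd × f. The sketch below shows both. The I = C·V·f model, the choice of Cwel alone as the switched capacitance, and the 47 mV example drop are assumptions for illustration, not values or methods confirmed by the slide.

```python
SENSE_RESISTOR_OHM = 4.7  # series resistor named on the slide


def supply_current_ma(voltage_drop_mv: float) -> float:
    """Ohm's law on the sense resistor: mV / ohm gives mA."""
    return voltage_drop_mv / SENSE_RESISTOR_OHM


def capacitive_current_ma(capacitance_pf: float, vdd_v: float, freq_ghz: float) -> float:
    """Average charging current I = C * V * f; pF * V * GHz gives mA."""
    return capacitance_pf * vdd_v * freq_ghz


# Hypothetical 47 mV drop across the sense resistor.
print(f"{supply_current_ma(47.0):.1f} mA from a 47 mV drop")

# Assumed switched capacitance: Cwel = 1.56 pF at Vdd = 1.8 V.
for freq_ghz in (0.1, 1.0, 3.0):
    current = capacitive_current_ma(1.56, 1.8, freq_ghz)
    print(f"{freq_ghz} GHz: {current:.2f} mA")
```

A conventional driver would follow the linear C·V·f trend; the slide's claim is that the measured current falls below that calculated line when carriers are reused.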