Faster unicores are still needed • André Seznec, INRIA/IRISA
DAL: Defying Amdahl’s Law • ERC advanced grant to A. Seznec (2011-2016) • DAL objective: « Given that Amdahl’s Law is forever, propose (impact) the microarchitecture of the 2020 general-purpose manycore »
Multicores are everywhere • Multicores in servers, desktops, laptops • 2-4-8-12 out-of-order cores • Multicores in smartphones, tablets • 2-4 (not that simple) cores • Manycores for niche markets • 48-80-100 simple cores • Tilera, Intel Xeon Phi
Multicore/multithread for everyone • End user: improved usage comfort • Can surf the web and listen to MP3s at the same time • Parallel performance for the masses? • Very few (scalable) mainstream parallel apps • Graphics • Niche market segments
No parallel software bonanza in the near future • Inheritance of sequential legacy codes • Parallelism is not cost-effective for most apps • Sequential programming will remain dominant
Inheritance of sequential legacy codes • Software is more resilient than hardware • Apps survive and evolve for years, often decades • Very few parallel apps now • Redevelopment of parallel apps from scratch is unlikely • Compute-intensive sections will be parallelized • But significant code sections will remain sequential
Parallelism is not cost-effective for most apps • Why parallelism? Only for performance • But costly: difficult, labor- and time-consuming, error-prone • Poorly portable: both functionality and performance
Sequential programming will remain dominant • Just easier • The « Joe » programmer • Portability, maintenance, debugging • + compilers to parallelize • + parallel libraries • + software components (developed by experts)
2002: The End of the Uniprocessor Road • Power and temperature walls stopped the frequency increase • 2x transistors: 5 %? 10 %? performance gain (if any) • The economical logic would be: buy smaller chips! • But the IC industry needs to sell new (expensive) chips • Marketing: « You need hyperthreading, 2, 4, 8 cores »
Marketing multicores to the masses (2002-…) • SMT • Dual-core • Quad-core • … GREAT !!
And now? The end user is not such a fool…
Following the trend: 2020 • Silicon area, power envelope • ≈ 100 Nehalem class cores or • ≈ 1,000 simple cores (VLIW, in-order superscalar)
Amdahl’s Law • [figure: execution time split into a sequential section and a parallel section] • Speedup(N) = 1 / (SEQ + (1 − SEQ)/N) • “Cannot run faster than the sequential part”
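The bound on the slide can be sketched numerically; this is a minimal model (the function name and the example numbers are mine, not from the talk):

```python
# Amdahl's Law: with a sequential fraction SEQ of the work, speedup on
# N cores is 1 / (SEQ + (1 - SEQ)/N), hence bounded by 1/SEQ.
def amdahl_speedup(seq, n):
    """Speedup of a program whose fraction `seq` is sequential, on n cores."""
    return 1.0 / (seq + (1.0 - seq) / n)

# Even 1 % of sequential code caps a 1000-core machine below 100x.
print(round(amdahl_speedup(0.01, 1000), 1))   # 91.0
```

However large n grows, `amdahl_speedup(0.01, n)` never reaches 100: the sequential part dominates.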
OK, parallel applications do not scale • Our recent study on parallel application scaling models execution time as a function of the input set and of the processor number • In general, bp > -1: sublinear scaling • Sometimes, bs > 0: the sequential part increases with the processor count
But let us use a naive (overoptimistic) model • A parallel application: the parallel section can use the 1000 processors, the sequential section runs on a single processor • SEQ: constant fraction of sequential code • Linear speed-up in the parallel section
Complex cores against simple cores • CC: 100 complex cores vs SC: 1000 simple cores, with a complex core 2x faster than a simple one • If SEQ > 0.8 %, then CC > SC
And a hybrid SC + CC? CC_SC: • 50 complex cores • 500 simple cores • If SEQ > 0.2 %, then CC_SC > SC
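The two thresholds above can be checked against the naive model. Core counts and the 2x complex-core speed come from the slides; the assumption that CC_SC runs its parallel section on the simple cores only is mine, chosen because it reproduces the quoted 0.2 % break-even:

```python
# Naive model: a simple core's sequential time is normalized to 1, a
# "complex" core is 2x faster, and the parallel section scales linearly
# on the cores it is given.
def t_sc(seq):        # SC: 1000 simple cores
    return seq + (1 - seq) / 1000

def t_cc(seq):        # CC: 100 complex cores (each worth 2 simple cores)
    return seq / 2 + (1 - seq) / 200

def t_cc_sc(seq):     # CC_SC: 50 complex + 500 simple; parallel section
    return seq / 2 + (1 - seq) / 500   # on the simple cores (assumption)

# Break-even sequential fractions match the slides: ~0.8 % and ~0.2 %.
print(t_cc(0.008) < t_sc(0.008), t_cc_sc(0.002) < t_sc(0.002))   # True True
```

Just below those fractions the ordering flips, which is the point of the slide: the more sequential code remains, the more the complex cores pay off.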
And if… • Use a huge amount of resources for a single core: 10x the area of the complex core, 10x the power of the complex core • Use all the uniprocessor techniques: very wide issue (8-16?), ultimate frequency (« heat and run »), helper threads, value prediction • Invent new techniques • → Ultra-complex cores
DAL architecture proposition • Heterogeneous architecture: • A few ultra complex cores • to enable performance on sequential codes and/or critical sections • A « sea » of simple cores • for parallel sections
For the naive model, « DAL »: UC_SC = 5 ultra-complex cores + 500 simple cores • If SEQ > 0.13 %, then « DAL » > SC • « DAL » is always better than UC, CC, CC_SC
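The slides do not state how fast an ultra-complex core is. The 4x figure below (2x a complex core) is purely hypothetical, picked because it reproduces the quoted ≈0.13 % threshold under the same naive model:

```python
# "DAL" configuration: 5 ultra-complex cores + 500 simple cores.
def t_sc(seq):                       # baseline: 1000 simple cores
    return seq + (1 - seq) / 1000

def t_dal(seq, ultra_speed=4.0):     # ultra_speed is an assumed value:
    # sequential section on one ultra-complex core, parallel section on
    # the 500 simple cores
    return seq / ultra_speed + (1 - seq) / 500

# Just above ~0.13 % sequential code, the DAL layout beats 1000 simple cores.
print(t_dal(0.0014) < t_sc(0.0014))   # True
```

Solving t_dal(SEQ) = t_sc(SEQ) with this assumption gives SEQ = 1/751 ≈ 0.133 %, matching the slide.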
Need for research on faster unicores • Silicon area is a 2nd-order issue: can use the area of 10 complex cores • Power/energy is a 2nd-order issue: can use the power of 10 complex cores
Ongoing work: Revisiting Value Prediction (with Arthur Pérais)
Value prediction? (Lipasti et al., Gabbay and Mendelson, 1996) • Basic idea: eliminate (some) true data dependencies by predicting instruction results • [figure: predicted values +1, +2, +3 breaking the dependency chain I0 → I1 → I3 → I4 → I5]
Value prediction: • Large body of research, 1996-2002 • Quite efficient: surprisingly high number of predictable instructions • Not implemented so far • High cost: is it still relevant now? • High misprediction penalty: don’t lose all the benefit
Last Value Predictor • Just predict the last produced value • Set-associative table indexed by the PC • Use confidence counters • Analogy with PC-based branch prediction
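A last-value predictor along these lines might look like the following toy sketch; the table organization, size, and threshold are illustrative choices of mine, not the talk’s design (a direct-mapped table stands in for the set-associative one):

```python
# Toy last-value predictor: a PC-indexed table holding the last result
# plus a 3-bit saturating confidence counter; only confident entries
# produce a prediction.
class LastValuePredictor:
    def __init__(self, size=1024, threshold=3):
        self.size, self.threshold = size, threshold
        self.table = {}            # pc index -> [last_value, confidence]

    def predict(self, pc):
        entry = self.table.get(pc % self.size)
        if entry and entry[1] >= self.threshold:
            return entry[0]        # confident: predict the last value
        return None                # low confidence: no prediction

    def update(self, pc, value):   # called at commit with the real result
        idx = pc % self.size
        last, conf = self.table.get(idx, [None, 0])
        conf = min(conf + 1, 7) if last == value else 0
        self.table[idx] = [value, conf]

vp = LastValuePredictor()
for _ in range(4):                 # instruction produces 42 four times
    vp.update(0x400, 42)
print(vp.predict(0x400))           # 42
```

Before confidence builds up (or after any change of value) the predictor stays silent, which is exactly the role of the confidence counters on the slide.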
Stride value predictor • Predict last value + stride (the last observed difference), indexed by the PC • Analogy with a stride prefetcher, but also with a loop predictor
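A minimal stride-predictor sketch (again with illustrative, hypothetical sizes and thresholds): it tracks the last value and last difference per PC and predicts their sum once the stride repeats.

```python
# Toy stride value predictor: predicts last_value + stride, where the
# stride is the last observed difference between successive results.
class StridePredictor:
    def __init__(self):
        self.table = {}            # pc -> (last_value, stride, confidence)

    def predict(self, pc):
        if pc in self.table:
            last, stride, conf = self.table[pc]
            if conf >= 2:          # only predict once the stride repeats
                return last + stride
        return None

    def update(self, pc, value):
        last, stride, conf = self.table.get(pc, (value, 0, 0))
        new_stride = value - last
        conf = min(conf + 1, 3) if new_stride == stride else 0
        self.table[pc] = (value, new_stride, conf)

sp = StridePredictor()
for v in (10, 14, 18, 22):         # e.g. a loop index advancing by 4
    sp.update(0x500, v)
print(sp.predict(0x500))           # 26
```

This is why the analogy with a loop predictor holds: a loop induction variable is the textbook case of a constant-stride result.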
Finite Context Method predictors • Use the history of the last values produced by the instruction, indexed by the PC • Analogy with a local-history branch predictor
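An FCM predictor can be sketched as two levels: a per-PC history of recent values, and a second table mapping a hash of that history to the value that followed it. Table shapes and the order-3 history are illustrative assumptions:

```python
# Toy Finite Context Method predictor: level 1 keeps the last values
# produced per static instruction; level 2 maps hash(history) -> next
# value, like a local-history branch predictor maps history -> direction.
from collections import defaultdict, deque

class FCMPredictor:
    def __init__(self, order=3):
        self.order = order
        self.history = defaultdict(lambda: deque(maxlen=order))
        self.values = {}                     # hash(pc, history) -> value

    def _ctx(self, pc):
        return hash((pc, tuple(self.history[pc])))

    def predict(self, pc):
        return self.values.get(self._ctx(pc))    # None if context unseen

    def update(self, pc, value):
        self.values[self._ctx(pc)] = value       # learn: context -> value
        self.history[pc].append(value)

fcm = FCMPredictor()
for v in (1, 2, 3, 1, 2, 3, 1, 2):   # a repeating value sequence
    fcm.update(0x600, v)
print(fcm.predict(0x600))            # 3: follows the context (3, 1, 2)
```

Repeating non-stride sequences like this are exactly what FCM captures and last-value/stride predictors miss.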
And a global value history? • Just no sense! Would need the results of the immediately preceding instructions: too late!! • But global branch history!?! • ITTAGE is the state-of-the-art indirect branch predictor!! • And it predicts values!
VTAGE • Derived from ITTAGE: a tagless base predictor plus tagged components indexed by the PC hashed with increasing lengths of global branch history (h[0:L1], h[0:L2], h[0:L3]) • A tag comparison (=?) identifies the hitting components • The longest matching component provides the prediction
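A heavily simplified VTAGE-style lookup might look like this sketch (history lengths, table size, and the hashing are my placeholders; real VTAGE uses geometric history lengths, partial tags, and confidence/usefulness counters omitted here):

```python
# Simplified VTAGE lookup: a tagless PC-indexed base table, plus tagged
# tables indexed by hashes of (PC, global branch history) over growing
# history lengths; the hitting component with the longest history wins.
class VTAGE:
    def __init__(self, hist_lengths=(2, 4, 8)):
        self.base = {}                                  # pc -> value
        self.tagged = [dict() for _ in hist_lengths]    # idx -> (tag, value)
        self.hist_lengths = hist_lengths

    def _index_tag(self, pc, ghist, length):
        h = tuple(ghist[-length:])                      # last `length` branches
        return hash((pc, h)) % 1024, hash((h, pc, length))

    def predict(self, pc, ghist):
        pred = self.base.get(pc)                        # default prediction
        for table, L in zip(self.tagged, self.hist_lengths):
            idx, tag = self._index_tag(pc, ghist, L)
            entry = table.get(idx)
            if entry and entry[0] == tag:
                pred = entry[1]         # longer history overrides shorter
        return pred

    def train(self, pc, ghist, value):
        self.base[pc] = value
        for table, L in zip(self.tagged, self.hist_lengths):
            idx, tag = self._index_tag(pc, ghist, L)
            table[idx] = (tag, value)

vt = VTAGE()
vt.train(0x700, [1, 0, 1, 1, 0, 1, 0, 1], 99)
print(vt.predict(0x700, [1, 0, 1, 1, 0, 1, 0, 1]))   # 99
```

The key idea carried over from ITTAGE is that the branch path leading to an instruction correlates with the value it produces, so longer-history components get priority when they hit.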
The repair issue on misprediction • [figure: a value misprediction on I0 propagates down the dependent chain I1, I3, I4, I5]
Pipeline squash • Acts as on an exception or a branch misprediction: squash everything after the mispredicted instruction • Very high penalty
Selective replay • Cancel all dependent instructions, but save the others • Very complex to implement: unlimited dependence chains
Critical path • The predicted value is needed late in the pipeline: dispatch time is sufficient • Except that…
An FCM implementation issue • The PC-indexed first-level table must capture the last local values, including those of still-in-flight instructions: requires a speculative window • Might be a critical path
Critical path on the stride value predictor • last value + stride can be reused on the next cycle • But the speculative last value must be high confidence • Requires a speculative window
Experiments • 8-way superscalar, deep pipeline • Use a prediction only on high confidence • 3-bit saturating confidence counters • + reset on misprediction
High confidence through probabilistic counters • Need for very high confidence: 95 % accuracy is insufficient, >> 99 % is needed • TRADING ACCURACY AGAINST COVERAGE • The counter saturates with only a very low probability (1/32, 1/256)
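The probabilistic-saturation idea can be sketched as follows (function name, counter width, and the 1/32 choice are illustrative; the slide also mentions 1/256):

```python
# Probabilistic confidence counter: the final "use the prediction" step
# is only taken with low probability on a correct outcome, so only
# instructions that are correct very many times in a row get used --
# trading coverage for >>99 % accuracy.
import random

def update_confidence(conf, correct, max_conf=7, p_saturate=1 / 32):
    if not correct:
        return 0                         # any misprediction resets
    if conf < max_conf - 1:
        return conf + 1                  # ordinary saturating increment
    # last step taken with probability 1/32: on average ~38 consecutive
    # correct results are needed before predictions are actually used
    return max_conf if random.random() < p_saturate else conf

random.seed(0)
conf, hits = 0, 0
while conf < 7:                          # count correct results needed
    conf = update_confidence(conf, True)
    hits += 1
print("saturated after", hits, "correct predictions")
```

With a plain 3-bit counter, 7 correct results would suffice; the probabilistic step stretches that to tens of consecutive hits, which is what pushes accuracy far past 99 %.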
Current status • All value predictors amenable to very high confidence • No complex selective repair needed • No need for local value prediction • No complex critical path in the local value predictor
Ongoing work: Selective Prediction of Predicated Instructions (with Nathanael Prémillieu)
Who cares about predicated instructions? • CMOV in all ISAs • ARM, Itanium: all instructions are predicated • Out-of-order execution: just a nightmare
The multiple definition problem • Before renaming: I1: R1 ← op R2, R3 (p); I2: R4 ← op R1, R2 • After renaming through the mapping table: I1: P1 ← op P15, P22 (p); I2: P13 ← op ???, P15 • If predicate p is false, R1 keeps its previous physical register rather than P1, so I2’s first source is ambiguous at rename time
Expansion/Serialization • After renaming: I1a: P1 ← op P15, P22; I1b: P27 ← (p) ? P1 : P11; I2: P13 ← op P27, P15 • Creates an extra instruction (a select on the predicate) • Forces the I1b → I2 dependency
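The expansion fix can be sketched with a toy renamer (all names hypothetical; physical-register numbering and the initial mapping table are mine, chosen to echo the slide’s P15/P22/P11 example): a predicated write is split into an unconditional compute into a fresh register plus a select micro-op choosing between the new result and the destination’s previous mapping.

```python
# Toy renamer for predicated instructions: "R1 <- op R2, R3 (p)" becomes
# an unconditional compute plus a select micro-op, so later readers of
# R1 depend on a single, well-defined physical register.
class Renamer:
    def __init__(self):
        self.map = {"R1": "P11", "R2": "P15", "R3": "P22"}
        self.next_phys = 100

    def fresh(self):
        self.next_phys += 1
        return f"P{self.next_phys}"

    def rename_predicated(self, dest, srcs, pred):
        tmp = self.fresh()                      # unconditional result
        old = self.map[dest]                    # dest's previous mapping
        compute = (tmp, "op", [self.map[s] for s in srcs])
        new = self.fresh()
        select = (new, "select", [pred, tmp, old])   # the extra micro-op
        self.map[dest] = new                    # single definition point
        return [compute, select]

r = Renamer()
for uop in r.rename_predicated("R1", ["R2", "R3"], "p"):
    print(uop)
# ('P101', 'op', ['P15', 'P22'])
# ('P102', 'select', ['p', 'P101', 'P11'])
```

The select micro-op plays the role of I1b on the slide: whatever the predicate resolves to, consumers of R1 have exactly one producer, at the cost of the extra instruction and the forced dependency.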
Aggressive serialization • I1: P18 ← (p) ? (op P15, P22) : P23; I2: P13 ← op P18, P15 • No expansion, but an extra operand on I1: complexity in the register file, issue logic, and bypass network • Forces the I1 → I2 dependency
Predicting the predicates • Use branch history, or branch + predicate history, to predict the predicates • Eliminates the multiple definitions • Predicate mispredictions become branch-like mispredictions