Faster unicores are still needed

Faster unicores are still needed. André Seznec, INRIA/IRISA. DAL: Defying Amdahl’s Law. ERC advanced grant to A. Seznec (2011-2016). DAL objective: « Given that Amdahl’s Law is Forever, propose (impact) the microarchitecture of the 2020 General Purpose manycore ».


Presentation Transcript


  1. Faster unicores are still needed • André Seznec, INRIA/IRISA

  2. DAL: Defying Amdahl’s Law • ERC advanced grant to A. Seznec (2011-2016) • DAL objective: « Given that Amdahl’s Law is Forever, propose (impact) the microarchitecture of the 2020 General Purpose manycore »

  3. Multicores are everywhere • Multicores in servers, desktops, laptops • 2-4-8-12 out-of-order (OoO) cores • Multicores in smartphones, tablets • 2-4 (not that simple) cores • Manycores for niche markets • 48-80-100 simple cores • Tilera, Intel Xeon Phi

  4. Multicore/multithread for everyone • End user: improved usage comfort • Can surf the web and listen to MP3s at the same time • Parallel performance for the masses? • Very few (scalable) mainstream parallel apps • Graphics • Niche market segments

  5. No parallel software bonanza in the near future • Inheritance of sequential legacy codes • Parallelism is not cost-effective for most apps • Sequential programming will remain dominant

  6. Inheritance of sequential legacy codes • Software is more resilient than hardware • Apps survive and evolve for years, often decades • Very few parallel apps now • Redeveloping apps as parallel from scratch is unlikely • Compute-intensive sections will be parallelized • But significant code sections will remain sequential

  7. Parallelism is not cost-effective for most apps • Why parallelism? • Only for performance • But costly: • Difficult, time-consuming, error-prone • Poorly portable, in both functionality and performance

  8. Sequential programming will remain dominant • Just easier • The « Joe » programmer • Portability, maintenance, debugging • + compiler to parallelize • + parallel libraries • + software components (developed by experts)

  9. Looking backwards

  10. 2002: The End of the Uniprocessor Road • Power and temperature walls: • Stopped the frequency increase • 2x transistors: 5%? 10%? more performance (if any) • Economic logic: buy smaller chips! • The IC industry needs to sell new (expensive) chips • Marketing: « You need hyperthreading, 2, 4, 8 cores »

  11. Marketing multicores to the masses, 2002-... • [Figure labels: SMT, Dual-core, Quad-core, SMT, SMT. « GREAT !! »]

  12. And now? The end user is not such a fool...

  13. Following the trend: 2020 • Within the same silicon area and power envelope: • ≈ 100 Nehalem-class cores, or • ≈ 1,000 simple cores (VLIW, in-order superscalar)

  14. Amdahl’s Law • [Figure labels: seq., parallel] • “Cannot run faster than the sequential part”

  15. OK, parallel applications do not scale • Our recent study on parallel application scaling: • In general: b_p > -1: sublinear scaling • Sometimes: b_s > 0: sequential part increases • [Figure labels: execution time, input set, processor number]

  16. But let us use a naive (overoptimistic) model • A parallel application: • Parallel section: can use 1,000 processors, with linear speed-up • Sequential section: runs on a single processor • SEQ: constant fraction of sequential code

  17. Complex cores against simple cores • CC: 100 complex cores vs. SC: 1,000 simple cores • A complex core is 2x faster than a simple core • If SEQ > 0.8% then CC > SC (CC outperforms SC)

  18. And hybrid SC + CC? • CC_SC: 50 complex cores + 500 simple cores • If SEQ > 0.2% then CC_SC > SC
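
The thresholds on slides 17 and 18 can be checked with a few lines of arithmetic. The sketch below is not from the talk. It assumes that execution time is normalized to one simple core, that the sequential section runs on the fastest core available, and that in the hybrid CC_SC the parallel section uses only the 500 simple cores (an assumption chosen because it reproduces the 0.2% figure). The names `exec_time` and `break_even_seq` are illustrative.

```python
def exec_time(seq, seq_speed, par_throughput):
    """Normalized execution time under the naive model.

    seq: sequential fraction of the work (SEQ), measured on one simple core.
    seq_speed: speed of the core running the sequential section (simple = 1).
    par_throughput: aggregate throughput available to the parallel section.
    """
    return seq / seq_speed + (1.0 - seq) / par_throughput


def break_even_seq(cfg_a, cfg_b):
    """SEQ at which configurations A and B take the same time.

    Each configuration is (seq_speed, par_throughput). Solves
    s/a_s + (1-s)/a_p = s/b_s + (1-s)/b_p for s.
    """
    a_s, a_p = cfg_a
    b_s, b_p = cfg_b
    d_par = 1.0 / a_p - 1.0 / b_p  # extra parallel-section cost of A vs. B
    d_seq = 1.0 / b_s - 1.0 / a_s  # sequential-section saving of A vs. B
    return d_par / (d_par + d_seq)


SC = (1.0, 1000 * 1.0)     # 1,000 simple cores
CC = (2.0, 100 * 2.0)      # 100 complex cores, each 2x a simple core
CC_SC = (2.0, 500 * 1.0)   # ASSUMPTION: parallel section uses the 500 simple cores only

print(f"CC beats SC when SEQ > {break_even_seq(CC, SC):.2%}")
print(f"CC_SC beats SC when SEQ > {break_even_seq(CC_SC, SC):.2%}")
```

Under these assumptions the break-even points come out close to the 0.8% and 0.2% quoted on the slides.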

  19. And if... • Use a huge amount of resources for a single core: • 10x the area of the complex core • 10x the power of the complex core • Use all the uniprocessor techniques: • Very wide issue (8-16?), ultimate frequency (« heat and run »), helper threads, value prediction • Invent new techniques • Ultra complex cores

  20. DAL architecture proposal • Heterogeneous architecture: • A few ultra complex cores • to enable performance on sequential codes and/or critical sections • A « sea » of simple cores • for parallel sections

  21. For the naive model • « DAL »: UC_SC = 5 ultra complex cores + 500 simple cores • If SEQ > 0.13% then « DAL » > SC • « DAL » is always better than UC, CC, CC_SC
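
The comparison on slide 21 can be explored with the same naive model. The talk does not state how fast an ultra complex core is. The sketch below assumes 4x a simple core (2x a complex core), a hypothetical value picked because it gives a break-even near the quoted 0.13%. It also assumes the parallel section runs on the simple cores only in the hybrid configurations. `CONFIGS`, `exec_time`, and `ranking` are illustrative names.

```python
SIMPLE_SPEED = 1.0
COMPLEX_SPEED = 2.0  # from the talk: a complex core is 2x a simple core
ULTRA_SPEED = 4.0    # ASSUMPTION: not stated in the talk

# name -> (speed of the core running sequential code, parallel throughput)
CONFIGS = {
    "SC": (SIMPLE_SPEED, 1000 * SIMPLE_SPEED),
    "CC": (COMPLEX_SPEED, 100 * COMPLEX_SPEED),
    "CC_SC": (COMPLEX_SPEED, 500 * SIMPLE_SPEED),  # ASSUMPTION: simple cores only
    "UC_SC": (ULTRA_SPEED, 500 * SIMPLE_SPEED),    # ASSUMPTION: simple cores only
}


def exec_time(seq, name):
    """Normalized time for sequential fraction `seq` on configuration `name`."""
    seq_speed, par_throughput = CONFIGS[name]
    return seq / seq_speed + (1.0 - seq) / par_throughput


def ranking(seq):
    """Configuration names sorted from fastest to slowest for this SEQ."""
    return sorted(CONFIGS, key=lambda name: exec_time(seq, name))


for seq in (0.001, 0.002, 0.01, 0.05):
    print(f"SEQ = {seq:.1%}: fastest to slowest = {ranking(seq)}")
```

With these values, UC_SC is never slower than CC or CC_SC, and it overtakes SC once SEQ exceeds roughly 0.13%.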

  22. Need for research on faster unicores • Silicon area is a 2nd-order issue • can use the area of 10 complex cores • Power/energy is a 2nd-order issue • can use the power of 10 complex cores

  23. Ongoing work: Revisiting Value Prediction, with Arthur Pérais

  24. Value prediction? (Lipasti et al.; Gabbay and Mendelson, 1996) • Basic idea: • Eliminate (some) true data dependencies by predicting instruction results • [Figure: execution timing of instructions I0, I1, I3, I4, I5]

  25. Value prediction • Large body of research, 1996-2002 • Quite efficient: • Surprisingly high number of predictable instructions • Not implemented so far: • High cost: is it still relevant now? • High penalty on misprediction: must not lose all the benefit

  26. Last value predictor • Just predict the last produced value • Set-associative table • Use confidence counters • Analogy with PC-based branch prediction
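
A last value predictor with confidence counters can be sketched in a few lines. This is an illustrative model, not the design evaluated in the talk: it uses a direct-mapped tagged table instead of a set-associative one, and the table size and 3-bit counter width are assumed defaults.

```python
class LastValuePredictor:
    """Predicts that an instruction produces the same value as last time."""

    def __init__(self, entries=1024, conf_max=7):
        self.entries = entries    # ASSUMPTION: direct-mapped, 1024 entries
        self.conf_max = conf_max  # saturation value of a 3-bit counter
        self.table = {}           # index -> [tag, last_value, confidence]

    def _index_tag(self, pc):
        return pc % self.entries, pc // self.entries

    def predict(self, pc):
        """Return the predicted value, or None when confidence is too low."""
        idx, tag = self._index_tag(pc)
        entry = self.table.get(idx)
        if entry is not None and entry[0] == tag and entry[2] == self.conf_max:
            return entry[1]
        return None

    def update(self, pc, actual):
        """Train with the value the instruction actually produced."""
        idx, tag = self._index_tag(pc)
        entry = self.table.get(idx)
        if entry is None or entry[0] != tag:
            self.table[idx] = [tag, actual, 0]  # allocate, zero confidence
        elif entry[1] == actual:
            entry[2] = min(entry[2] + 1, self.conf_max)
        else:
            entry[1], entry[2] = actual, 0      # new value, reset confidence
```

A prediction is only returned once the same value has been seen often enough to saturate the counter, which trades coverage for accuracy.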

  27. Stride value predictor • Predict last value + last difference (the stride) • [Figure labels: PC, +] • Analogy with the stride prefetcher, but also with the loop predictor
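
The stride scheme can be sketched as follows. This is a simplified model under stated assumptions: the table is an unbounded dictionary keyed by PC (real hardware would use a finite tagged table), and confidence uses a 3-bit saturating counter that is reset whenever the stride changes.

```python
class StridePredictor:
    """Predicts last value + last observed difference for each PC."""

    def __init__(self, conf_max=7):
        self.conf_max = conf_max
        self.table = {}  # pc -> [last_value, stride, confidence]

    def predict(self, pc):
        """Return last_value + stride, or None when confidence is too low."""
        entry = self.table.get(pc)
        if entry is not None and entry[2] == self.conf_max:
            return entry[0] + entry[1]
        return None

    def update(self, pc, actual):
        """Train with the value the instruction actually produced."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [actual, 0, 0]
            return
        stride = actual - entry[0]
        if stride == entry[1]:
            entry[2] = min(entry[2] + 1, self.conf_max)
        else:
            entry[1], entry[2] = stride, 0  # new stride, reset confidence
        entry[0] = actual
```

A loop induction variable such as 10, 14, 18, ... becomes predictable once the stride of 4 has repeated enough times to saturate the counter. A constant value is the special case of stride 0.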

  28. Finite Context Method (FCM) predictors • Use the history of the last values produced by the instruction (indexed by PC) • Analogy with the local-history branch predictor
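
A two-level FCM predictor can be modeled as below. This is a sketch under assumptions: the first level keeps the last `order` values per PC, the second level maps that value history directly to the value that followed it (real designs hash the history into a finite table), and a small saturating counter gates predictions.

```python
class FCMPredictor:
    """Order-k finite context method value predictor (illustrative)."""

    def __init__(self, order=2, conf_max=3):
        self.order = order
        self.conf_max = conf_max
        self.history = {}  # level 1: pc -> tuple of the last `order` values
        self.values = {}   # level 2: value history -> [next_value, confidence]

    def predict(self, pc):
        """Return the value that followed this history before, or None."""
        hist = self.history.get(pc, ())
        entry = self.values.get(hist) if len(hist) == self.order else None
        if entry is not None and entry[1] == self.conf_max:
            return entry[0]
        return None

    def update(self, pc, actual):
        """Train level 2 with the outcome, then shift the local history."""
        hist = self.history.get(pc, ())
        if len(hist) == self.order:
            entry = self.values.get(hist)
            if entry is None or entry[0] != actual:
                self.values[hist] = [actual, 0]
            else:
                entry[1] = min(entry[1] + 1, self.conf_max)
        self.history[pc] = (hist + (actual,))[-self.order:]
```

Unlike the last value and stride schemes, this captures repeating patterns such as 1, 2, 3, 1, 2, 3, ..., at the cost of needing the most recent local values at prediction time.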

  29. And global value history? • Just makes no sense! • Need the history of the last instructions • Too late!! • But global branch history!?! • ITTAGE is the state-of-the-art indirect branch predictor!! • And it predicts values!

  30. From ITTAGE to VTAGE • [Figure: tagless base predictor indexed by pc, plus tagged components indexed by pc and global history h[0:L1], h[0:L2], h[0:L3]; tag comparisons (=?) select the prediction] • Longest matching component provides the prediction
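
The selection rule "longest matching component provides the prediction" can be illustrated with a heavily simplified model. Everything below is an assumption made for illustration: the history lengths, the index and tag hash, the allocation policy (allocate in the next longer component on a misprediction), and the class name `VTAGESketch`. It is not the VTAGE design from the talk.

```python
class VTAGESketch:
    """Toy value predictor indexed by PC and global branch history."""

    def __init__(self, history_lengths=(2, 4, 8), table_bits=10, conf_max=7):
        self.history_lengths = history_lengths
        self.size = 1 << table_bits
        self.conf_max = conf_max
        self.base = {}                               # tagless: index -> [None, value, conf]
        self.tagged = [{} for _ in history_lengths]  # index -> [tag, value, conf]
        self.ghist = 0                               # newest branch outcome in bit 0

    def record_branch(self, taken):
        mask = (1 << max(self.history_lengths)) - 1
        self.ghist = ((self.ghist << 1) | int(taken)) & mask

    def _slot(self, pc, comp):
        length = self.history_lengths[comp]
        hist = self.ghist & ((1 << length) - 1)
        key = pc * 31 + hist * 1000003 + length      # ASSUMPTION: ad hoc hash
        return key % self.size, (key // self.size) % 256

    def _provider(self, pc):
        """Longest-history tagged hit, falling back to the base table (-1)."""
        for comp in reversed(range(len(self.tagged))):
            idx, tag = self._slot(pc, comp)
            entry = self.tagged[comp].get(idx)
            if entry is not None and entry[0] == tag:
                return comp, entry
        return -1, self.base.get(pc % self.size)

    def predict(self, pc):
        _, entry = self._provider(pc)
        if entry is not None and entry[2] == self.conf_max:
            return entry[1]
        return None

    def update(self, pc, actual):
        comp, entry = self._provider(pc)
        if entry is None:
            self.base[pc % self.size] = [None, actual, 0]
        elif entry[1] == actual:
            entry[2] = min(entry[2] + 1, self.conf_max)
        else:
            entry[1], entry[2] = actual, 0
            if comp + 1 < len(self.tagged):          # allocate with longer history
                idx, tag = self._slot(pc, comp + 1)
                self.tagged[comp + 1][idx] = [tag, actual, 0]
```

In this toy model, a value that depends on the direction of the preceding branch becomes predictable once the tagged entries for each history pattern are trained, even though a last value predictor would keep missing it.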

  31. The repair issue on misprediction • [Figure: instructions I0, I1, I3, I4, I5, with a misprediction marked]

  32. Pipeline squash • Handled like an exception or a branch misprediction • Very high penalty • [Figure: instructions I0, I1, I3, I4, I5]

  33. Selective replay • Cancel all dependent instructions, but save the others • Very complex to implement: • Unlimited dependence chains • [Figure: instructions I0, I1, I3, I4, I5]

  34. Critical path • Predicted value is needed late in the pipeline: • Dispatch time is sufficient • Except that:

  35. An FCM implementation issue • Must take the last local values • Might be a critical path • [Figure labels: PC, speculative window]

  36. Critical path on the stride value predictor • Can be reused on the next cycle • Stride AND speculative last value must be high confidence • [Figure labels: PC, +, speculative window]

  37. Experiments • 8-way superscalar, deep pipeline • Use the prediction only on high confidence • 3-bit confidence counters: prediction used only when saturated, reset on misprediction

  38. Squashing

  39. Selective replay

  40. High confidence through probabilistic counters • Need for very high confidence: • 95% accuracy is insufficient • >> 99% needed • Trading accuracy against coverage • Saturation with only very low probability • 1/32, 1/256
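
A probabilistic confidence counter can be sketched as below. The talk only states that saturation happens with a very low probability (1/32, 1/256). The exact transition scheme here is an assumption: every increment is deterministic except the final one into the saturated state, which succeeds with probability `p_saturate`. The injectable `rng` parameter is a testing convenience.

```python
import random


class ProbabilisticCounter:
    """Small confidence counter whose last increment is probabilistic."""

    def __init__(self, bits=3, p_saturate=1 / 32, rng=None):
        self.max = (1 << bits) - 1
        self.p_saturate = p_saturate  # ASSUMPTION: applies to the last step only
        self.rng = rng if rng is not None else random.Random()
        self.value = 0

    @property
    def saturated(self):
        """True when the prediction may be used."""
        return self.value == self.max

    def update(self, correct):
        """Reset on a wrong prediction, otherwise try to move forward."""
        if not correct:
            self.value = 0
        elif self.value < self.max - 1:
            self.value += 1
        elif self.value == self.max - 1 and self.rng.random() < self.p_saturate:
            self.value += 1
```

With a 3-bit counter and a probability of 1/32, an instruction needs on the order of dozens of consecutive correct outcomes before its prediction is used, which mimics a much wider counter at the storage cost of 3 bits.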

  41. Squashing

  42. And hybrids

  43. Current status • All value predictors amenable to very high confidence • No complex selective repair needed • No need for local value prediction • No complex critical path in the local value predictor

  44. Ongoing work: Selective Prediction of Predicated Instructions, with Nathanael Prémillieu

  45. Who cares about predicated instructions? • CMOV in all ISAs • ARM, Itanium: • All instructions are predicated • Out-of-order execution: just a nightmare

  46. The multiple definition problem • Before renaming: • I1: (p) R1 ← R2, R3 • I2: R4 ← R1, R2 • After renaming (mapping table): • I1: (p) P1 ← P15, P22 • I2: P13 ← ???, P15

  47. Expansion/serialization • After renaming: • I1a: P1 ← P15, P22 • I1b: P27 ← (p) ? P1 : P11 • I2: P13 ← P27, P15 • Create an extra instruction • Force the I1b → I2 dependency

  48. Aggressive serialization • I1: P18 ← (p) ? (op P15, P22) : P23 • I2: P13 ← P18, P15 • No expansion, but an extra operand on I1: • complexity on register file, issue logic, bypass network • Force the I1 → I2 dependency
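
The renaming on slide 48 can be reproduced with a toy renamer. This is a sketch under assumptions: the initial mapping table and free list are chosen to match the physical register numbers on the slide, the mapping for R4 (`P7`) is hypothetical, and a predicated instruction simply carries the previous mapping of its destination as an extra operand.

```python
def rename(instructions, mapping, free_list):
    """Rename (dst, srcs, predicate) tuples; predicate is None if unpredicated.

    A predicated instruction reads the old physical register of its
    destination so that it can forward it when the predicate is false.
    """
    mapping = dict(mapping)
    free = list(free_list)
    renamed = []
    for dst, srcs, predicate in instructions:
        phys_srcs = ", ".join(mapping[s] for s in srcs)
        old_dst = mapping[dst]     # value to keep if the predicate is false
        new_dst = free.pop(0)
        mapping[dst] = new_dst     # later readers always see a single definition
        if predicate is None:
            renamed.append(f"{new_dst} <- {phys_srcs}")
        else:
            renamed.append(f"{new_dst} <- ({predicate}) ? (op {phys_srcs}) : {old_dst}")
    return renamed


program = [
    ("R1", ["R2", "R3"], "p"),   # I1: (p) R1 <- R2, R3
    ("R4", ["R1", "R2"], None),  # I2: R4 <- R1, R2
]
initial_mapping = {"R1": "P23", "R2": "P15", "R3": "P22", "R4": "P7"}  # P7 is hypothetical
for line in rename(program, initial_mapping, ["P18", "P13"]):
    print(line)
```

I2 now has exactly one producer for R1 (P18), at the price of the extra source operand on I1 and a forced I1 → I2 dependency.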

  49. Predicting the predicates • Use branch history or branch + predicate history to predict the predicates • Eliminate multiple definitions • Predicate mispredictions become branch mispredictions

  50. Not that convincing !
