FP6-2004-IST-4 FET Proactive Initiative ACA
SUPERcomputing on a CHIP: SUPERCHIP
Proposal Number 26888
• Jesper Larsson Träff, Senior Principal Researcher, NEC Europe
• Ian Phillips, Prof., Principal Staff Engineer, ARM
• Ben Juurlink, Professor, Delft University of Technology
• Kari Tiensyrjä, Senior Research Scientist, VTT
1. Paths to exploitation
• FET project with potential for application breakthroughs on a 10+ year horizon
• Industrial partners (NEC, ARM, Intel) cover a wide spectrum of application domains and provide:
  • Steering of scientific and technological research
  • Transfer of knowledge and results to, and interplay with, company design groups
  • Proposals to standardization bodies, where relevant (B.3.6)
• Active promotion of results (T6.1 and T6.2):
  • High-profile scientific and applied conferences and journals
  • Organization of workshops
  • PhD courses and summer schools, incorporation into advanced curricula
  • Links to NoEs
• WP6 (led by Intel): dissemination and exploitation (also: B.3.3, B.4.1.7, and B.8.2.6)
  • T6.3 for technology transfer
  • T6.4 for exploitation
2. Target applications
• A wide range of applications with high computational requirements will be considered
• WP4 will analyse and identify applications; selected sample applications will be implemented as proof of concept
• Initial set of applications considered:
  • Desktops and servers (versatility from high-performance/single-application to high-throughput application suites)
    • Streaming and DSP applications, e.g. video in bandwidth-constrained active networks and embedded 3D graphics
    • Real-time speech recognition and videoconferencing
    • Database applications, string processing, geographical information processing
  • Mobile devices (energy efficiency)
    • PDA, HDTV
    • Games, virtual reality
  • Supercomputer (high performance)
    • Vectorised CFD Boltzmann automata
    • MPI-parallelised finite element methods
    • Quantum chromodynamics
3. Leading contenders within the proposal
• Objectives:
  • Boost performance by 2-3 orders of magnitude (at the same transistor count)
  • Exploit parallelism at all levels
  • Realise an easy-to-use, strong model of computing
  • Provide scalability, a wide application area and power-saving techniques
3. Leading contenders within the proposal (cont)
• Initial choice of architectures is partially guided by application requirements:
  • Eclipse and XMT: general-purpose computing, embedded computing
  • Advanced CMP: high-throughput desktop and server machines
  • TTA/PISMA: streaming/DSP
  • TRIPS: HPC, streaming/DSP, threaded servers
• Procedure to choose the initial SUPERCHIP architecture:
  1. Develop an architecture evaluation framework (T1.1)
  2. Develop semi-analytical power/performance/cost models (T5.1)
  3. Develop/modify existing simulators for the architectures (T5.2)
  4. Design benchmark programs for the architectures (T4.1)
  5. Perform the evaluation, identify strong and weak points, and select (T1.1)
• Preliminary criteria:
  • Power, performance, cost (silicon area)
  • Estimated scalability, PRAM-like model support, ease of programming
  • Estimated coverage of the targeted application area, TLP-ILP co-exploitation
  • Potential for solving the remaining problems
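Step 5 of the procedure above does not say how the preliminary criteria are combined into a selection. One common option is a normalised weighted sum, sketched below. This is an illustrative assumption, not the evaluation framework of T1.1: the function name `rank_architectures`, the candidate names and every weight and score are hypothetical placeholders, not results for any of the listed architectures.

```python
def rank_architectures(scores, weights):
    """Rank candidates by a normalised weighted sum; higher is better.

    scores:  {candidate: {criterion: value in [0, 1]}}
    weights: {criterion: non-negative weight}
    Both the scale and the weighting scheme are assumptions of this sketch.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    ranked = []
    for name, per_criterion in scores.items():
        missing = set(weights) - set(per_criterion)
        if missing:
            raise ValueError(f"{name} lacks scores for {sorted(missing)}")
        value = sum(weights[c] * per_criterion[c] for c in weights) / total_weight
        ranked.append((name, value))
    # Best candidate first.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Placeholder inputs only: these are not evaluation results.
weights = {"power": 1.0, "performance": 3.0}
scores = {
    "candidate_A": {"power": 1.0, "performance": 0.0},
    "candidate_B": {"power": 0.0, "performance": 1.0},
}
ranking = rank_architectures(scores, weights)
```

A weighted sum makes the trade-off between criteria explicit but hides weak points behind a single number, so in practice the per-criterion scores would be reported alongside the ranking.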
4. Ensuring HW implementation technologies impact on choice of scalable architecture
• Scalability issues are taken into account in the initial selection of candidate architectures
  • Mesh-like topologies (providing constant wire length links): Eclipse, CMP, TTA, TRIPS
  • Regular structures: Eclipse, CMP, TTA, TRIPS
  • No forwarding networks (Eclipse) or multistage forwarding networks (TRIPS)
  • No cache coherency mechanisms: Eclipse
  • Multithreading: Eclipse, XMT
  • Decentralized structure: Eclipse, CMP, TTA, TRIPS
• Semi-analytical modeling of the architectures and candidate techniques (T5.1)
  • Analytical parametric power/performance/cost estimation models
  • Hardware implementation parameters are extracted from
    • Technology roadmaps, e.g. ITRS
    • Pragmatic experience and knowledge of industrial partners
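The slides do not give the form of the parametric power/performance/cost models of T5.1. As a rough sketch of what such a model can look like, the code below combines two textbook relations: dynamic switching power (activity × capacitance × V² × f) and Amdahl's law for speedup, with silicon area as the cost proxy. The choice of these two relations, all function names and all parameters are assumptions made for illustration; real parameter values would come from the roadmaps and partner data mentioned above.

```python
def dynamic_power_w(activity, capacitance_f, vdd_v, freq_hz):
    """Switching power of one core in watts: activity * C * V^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's law: speedup over one core for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)


def estimate(n_cores, core_area_mm2, activity, capacitance_f,
             vdd_v, freq_hz, parallel_fraction):
    """Hypothetical chip-level estimate; ignores leakage and interconnect."""
    power = n_cores * dynamic_power_w(activity, capacitance_f, vdd_v, freq_hz)
    speedup = amdahl_speedup(parallel_fraction, n_cores)
    return {
        "power_w": power,
        "speedup": speedup,
        "area_mm2": n_cores * core_area_mm2,  # cost proxy: silicon area
        "speedup_per_watt": speedup / power,
    }
```

Sweeping `n_cores` with such a model gives a first estimate of where added cores stop paying off in speedup per watt, before any simulator is run.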
4. Ensuring HW implementation technologies impact on choice of scalable architecture (cont)
• Architectural simulation (T5.2)
  • Develop/modify existing simulators
  • Benchmarks
  • Sample applications
  • Information on execution time, resource utilization and power consumption is extracted
• Modeling of the critical parts of architectures
  • Feasibility analysis of candidate architectures
  • Studies on fault tolerance, clocking schemes, on-chip/off-chip communication, power saving and other implementation-related issues for the SUPERCHIP architecture (T5.3)
  • Detailed modeling and feasibility assessment of critical parts of the SUPERCHIP architecture (T5.4)
5. Evolvement of the PRAM model for the candidate architectures
• For ease of programming, the SUPERCHIP programming model will be based on a PRAM-like model, considering
  • Relaxed synchronization (BSP-like)
  • Strong memory semantics (CRCW-like, built-in operators)
  • Potential for locality exploitation (memory, Hierarchical-PRAM)
• Architectural requirements:
  • Synchronization: implicit after each instruction
  • Bandwidth: high bisection bandwidth to handle random communication
  • Latency: communication/memory access latency should be hidden
• SUPERCHIP will develop the necessary architectural support for this model
• SUPERCHIP will not investigate PRAM implementation on distributed-memory architectures in general
• Long-term research issue: evolution of the programming model and architecture to SUPERCHIP constellations
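To make "CRCW-like, built-in operators" and "implicit synchronization after each instruction" concrete, the sketch below simulates one lockstep step of a combining CRCW PRAM: every processor reads the memory as it was before the step, and concurrent writes to the same address are merged with an operator. It is a toy software model under stated assumptions, not the SUPERCHIP programming model; `pram_step` and its calling convention are hypothetical.

```python
import operator
from collections import defaultdict
from functools import reduce


def pram_step(memory, n_procs, instruction, combine=operator.add):
    """Run one lockstep step of a combining CRCW PRAM on `memory` (a list).

    instruction(pid, snapshot) returns (address, value) or None.
    Concurrent writes to one address are merged with `combine`.
    """
    snapshot = list(memory)  # all reads see the pre-step memory
    pending = defaultdict(list)
    for pid in range(n_procs):
        result = instruction(pid, snapshot)
        if result is not None:
            address, value = result
            pending[address].append(value)
    for address, values in pending.items():
        memory[address] = reduce(combine, values)
    # Returning here plays the role of the implicit barrier after the step.
    return memory


# Cell 0 receives the result; cells 1..8 hold the inputs.
data = [3, 1, 4, 1, 5, 9, 2, 6]
mem_sum = pram_step([0] + data, len(data), lambda pid, m: (0, m[pid + 1]))
mem_max = pram_step([0] + data, len(data), lambda pid, m: (0, m[pid + 1]),
                    combine=max)
```

With a combining write, a sum or maximum over all processors takes a single step in the model, which is why the architectural requirements above stress bandwidth and latency hiding: the hardware has to deliver that concurrent access pattern.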
5. Evolvement of the PRAM model for the candidate architectures (cont)
6. Validation and assessment of the performance scalability of the final choice of HW/SW architecture
• Analytically, through parametric power/performance/cost models
• Empirically, through simulations
  • Benchmark kernels and sample applications
    • Scalable benchmark suite for fine-grained shared-memory architectures
    • Standard benchmark suites
    • Sample applications
  • Parametric architecture simulations
• By comparing to future alternative approaches (e.g. advanced CMPs) and theoretical machines (e.g. ideal PRAM) using the applications and benchmarks
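The empirical part of this assessment comes down to a few standard metrics computed from simulated cycle counts: speedup, parallel efficiency, and slowdown relative to an ideal machine. A minimal sketch follows; the function names and the input format are assumptions, and no measured data is implied.

```python
def scaling_table(cycles_by_procs):
    """Return (processors, speedup, efficiency) rows from cycle counts.

    cycles_by_procs: {processor_count: simulated cycles}; must contain key 1,
    which serves as the single-processor baseline.
    """
    baseline = cycles_by_procs[1]
    rows = []
    for procs in sorted(cycles_by_procs):
        speedup = baseline / cycles_by_procs[procs]
        rows.append((procs, speedup, speedup / procs))
    return rows


def slowdown_vs_ideal(measured_cycles, ideal_pram_steps):
    """How many times slower the simulated machine is than an ideal PRAM."""
    return measured_cycles / ideal_pram_steps
```

Efficiency close to 1 as the processor count grows indicates good scalability; the slowdown figure shows how much of the ideal PRAM's step count the architecture gives up to real latency and bandwidth limits.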
7. Plan for identifying the requirements for the OS within the resources of the work plan
• The goal is to identify requirements and implement core OS services to demonstrate the validity of the architectural approach, not to develop a full-fledged OS (as stated in B.4.1.5):
  • Requirements from underlying architecture and applications
  • Resource management (process, thread and memory)
  • Runtime functions and services for applications
• Input for identifying requirements will come from several other tasks, including T1.2, T1.3, T2.2 and T3.3
• The OS is not in charge of supporting distributed shared memory
• Certain OS functionality will be covered by the compiler's run-time system
• The leader of the OS task (T4.3, ULM) has developed a distributed operating system (Plurix), which provides an excellent basis
7. Plan for identifying the requirements for the OS within the resources of the work plan (cont)
• Preliminary anticipated OS requirements
  • Dynamic process/thread scheduling
  • Memory management (physical and virtual)
  • Synchronization, including inter-process communication
  • Support for power management and I/O
• Definition
  • A coarse-grain functional model of the OS will be developed and validated through simulation
  • Definition of the API in the SUPERCHIP language (or a pseudo-language in the early phase)
• Implementation
  • Using the SUPERCHIP language and compiler (from T2.2 and T3.3)
  • Testing with architecture simulation tools (from T5.2)
• Feasible with the allocated resources and partners