The apeNEXT software environment: application development and execution

The apeNEXT software environment: application development and execution
Alessandro Lonardo
I.N.F.N. Roma - APE group*
alessandro.lonardo@roma1.infn.it
*http://apegate.roma1.infn.it/APE
Alessandro Lonardo - 18/12/06

Index
• Machine Architecture
• Software Areas
• Programming Model
• Languages
• Example Applications
• Development Tools
• Execution Environment

apeNEXT Architecture: The Network
• 3D mesh of computing nodes
• Vertices are processors
• Each processor hosts its local memory
• Each processor supports 64-bit complex, vector2, double and integer types
• Edges are 3D torus network channels
• 6 bidirectional channels per processor
• Basic communication primitive is first-neighbour send-recv
• Processors synchronize on communications (send starts when recv is issued)

apeNEXT Architecture: The J&T Processor

apeNEXT Architecture: Very Long Instruction Word
[Figure: diagram with the labels Control Word, Decoder & Scheduler, and devices D1, D2, D3, S1]

apeNEXT Architecture: The J&T FILU
FILU is the floating-point, integer and logical unit:
• MAC operation: A*B+C
• Fully pipelined (1 result per cycle)
• ~12 cycles latency
• Synthesizes to 200 MHz
• 4 multipliers, 4 adders
• 1.6 GFlops on complex MAC
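The 1.6 GFlops figure can be checked by counting the real operations in one complex MAC. The following is a minimal sketch, assuming the usual convention that a complex A*B+C costs 4 real multiplications and 4 real additions (which matches the 4 multipliers and 4 adders listed above) and that one complex MAC completes per cycle at 200 MHz. The type and function names are illustrative and are not apeNEXT APIs.

```c
/* Illustrative names only; not part of any apeNEXT library. */
typedef struct { double re, im; } cplx;

/* Complex multiply-accumulate a*b + c written out in real arithmetic:
   4 multiplications and 4 additions/subtractions, i.e. 8 flops. */
cplx complex_mac(cplx a, cplx b, cplx c)
{
    cplx r;
    r.re = a.re * b.re - a.im * b.im + c.re;
    r.im = a.re * b.im + a.im * b.re + c.im;
    return r;
}

/* Peak rate in GFlops for a unit that retires flops_per_cycle
   floating-point operations every cycle at clock_mhz MHz. */
double peak_gflops(int flops_per_cycle, double clock_mhz)
{
    return flops_per_cycle * clock_mhz / 1000.0;
}
```

With 8 flops per cycle and a 200 MHz clock, `peak_gflops(8, 200.0)` gives 1.6.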

apeNEXT Software Areas
• Architecture design, development and validation: simulators, non-regression tools, …
• Application development: compilation chain, libraries, profiler, …
• Execution environment: operating system, batch system, …
• Applications
• System administration

apeNEXT SW development team
• ~5 people on average
• People in all the collaboration sites:
  • INFN Roma & Ferrara
  • DESY Zeuthen
  • Univ. Bielefeld
  • INRIA (France)

apeNEXT Programming Model
• Single Program Multiple Data: each node executes the same program, but on its own data.
• Synchronization barriers at global condition evaluations, at an explicit statement, or at I/O operations.
• Node-to-node synchronization at remote communications.
• Nodes are connected by a 3D network; each node can efficiently transfer data with its first, second and third neighbours.
=> well suited for homogeneous problems with short-range interactions.

apeNEXT Programming Model: Data Decomposition (1)
• The application's discretized D-dimensional lattice domain is decomposed onto a 3-dimensional processor mesh (D may differ from 3)
=> each node has a subset of the lattice sites in its own memory
• In other cases no decomposition is done: the simulation is replicated in parallel without communications, purely to obtain better statistics (FARM).
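The decomposition can be illustrated with a small mapping from a global lattice site to its owning node and node-local coordinates. This is a minimal sketch, assuming a plain block decomposition in which each lattice extent divides evenly by the mesh extent along the same axis; only three decomposed dimensions are shown (any further dimension would stay entirely local). `SiteLocation` and `locate_site` are hypothetical names, not apeNEXT library calls.

```c
/* Hypothetical helper types and names, for illustration only. */
typedef struct {
    int node[3];   /* coordinates of the owning node in the processor mesh */
    int local[3];  /* coordinates of the site inside that node's sub-lattice */
} SiteLocation;

/* Block decomposition: along each axis the lattice extent is split into
   mesh[d] equal chunks. Returns 0 on success, -1 on invalid input. */
int locate_site(const int global[3], const int lattice[3],
                const int mesh[3], SiteLocation *out)
{
    for (int d = 0; d < 3; ++d) {
        if (mesh[d] <= 0 || lattice[d] <= 0 || lattice[d] % mesh[d] != 0)
            return -1;                      /* extents must divide evenly */
        if (global[d] < 0 || global[d] >= lattice[d])
            return -1;                      /* site outside the lattice */
        int chunk = lattice[d] / mesh[d];   /* local extent along axis d */
        out->node[d]  = global[d] / chunk;
        out->local[d] = global[d] % chunk;
    }
    return 0;
}
```

For an 8x8x8 lattice on a 4x2x2 board, each node holds a 2x4x4 sub-lattice, and site (5, 6, 3) lands on node (2, 1, 0) at local coordinates (1, 2, 3).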

apeNEXT Programming Model: Data Decomposition (2)
For each lattice site, and on each node in parallel, the program performs an “evolutionary step”.
Short-range interactions => first-neighbour inter-node communication.
[Figure: 2x2 grid of nodes labelled 00, 01, 10, 11 along the x and y axes]

apeNEXT Programming Model: Programming Languages (1)
TAO: dedicated parallel language
• Fortran-like base syntax.
• Dynamic language: the (experienced) programmer can freely extend the syntax with new statements, data types and operators => libraries configure the language for specific application domains (LQCD, Spin Glass, …)
• Allows writing highly efficient code by exposing the features of the hardware architecture (registers, prefetch queues, cache)

apeNEXT Programming Model: Programming Languages (2)
C99 language
• Few extensions to the standard language.
• Eases the porting of applications and standard libraries.
• Allows writing highly efficient code by exposing the features of the hardware architecture (registers, prefetch queues, cache)

apeNEXT Programming Model: Parallel Language Constructs (1)
Few parallel language constructs (the same in C99 and TAO):
• Conditional execution on a subset of nodes, based on node-local conditions (where).
• Boolean operators that promote local conditions to global ones, for use in flow control statements (any, all, none).
• Communications between nodes in the 3D mesh are expressed as variable assignments; directions are specified by means of magic constants in the source address (X_PLUS, X_MINUS, …, Z_MINUS).

apeNEXT Programming Model: Parallel Language Constructs (2)
where statement: conditional execution on a mesh subset.

    where (x>=y)
      max_xy=x
      min_xy=y
    elsewhere
      max_xy=y
      min_xy=x
    endwhere
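The semantics of where/elsewhere can be mimicked on an ordinary host by treating each array index as one node. This is a serial emulation written for illustration, assuming each node holds one x and one y value; it is not apeNEXT syntax, and `where_max_min` is a hypothetical name.

```c
/* Serial emulation of the where/elsewhere example: index n plays the
   role of a node, and each node evaluates the condition on its own data. */
void where_max_min(const int *x, const int *y,
                   int *max_xy, int *min_xy, int n_nodes)
{
    for (int n = 0; n < n_nodes; ++n) {
        if (x[n] >= y[n]) {        /* nodes selected by "where (x>=y)" */
            max_xy[n] = x[n];
            min_xy[n] = y[n];
        } else {                   /* remaining nodes take "elsewhere" */
            max_xy[n] = y[n];
            min_xy[n] = x[n];
        }
    }
}
```

On the real machine all nodes follow the same instruction stream and the condition only selects which nodes take effect in each branch; the loop above reproduces the resulting per-node values.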

apeNEXT Programming Model: Parallel Language Constructs (3)
Inter-node communications:

    integer i
    real u[1024]
    register real rd1, rd2, rd3, rloc
    ...
    rd1 = u[i+Z_PLUS]                  !! loads u[i] from node [x, y, z+1] into rd1
    rd2 = u[i+Y_PLUS+Z_PLUS]
    rloc = u[i+X_PLUS+X_MINUS]
    rd3 = u[i+X_PLUS+Y_MINUS+Z_PLUS]
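The way direction constants combine can be pictured as adding unit displacements on the torus. The sketch below is an assumption-laden illustration: it models each direction as a (dx, dy, dz) step and applies periodic wrap-around, as implied by the 3D torus network; how the real magic constants are encoded in the address is not shown in the slides. `NodeCoord` and `torus_neighbour` are hypothetical names.

```c
/* Hypothetical types and names, for illustration only. */
typedef struct { int x, y, z; } NodeCoord;

/* Wrap coordinate c into [0, size) for size > 0 (torus periodicity). */
static int wrap(int c, int size)
{
    return ((c % size) + size) % size;
}

/* Node reached from n by the displacement (dx, dy, dz) on a torus whose
   extents are given by size. */
NodeCoord torus_neighbour(NodeCoord n, int dx, int dy, int dz, NodeCoord size)
{
    NodeCoord r;
    r.x = wrap(n.x + dx, size.x);
    r.y = wrap(n.y + dy, size.y);
    r.z = wrap(n.z + dz, size.z);
    return r;
}
```

In this picture `u[i+Z_PLUS]` corresponds to the displacement (0, 0, +1), `u[i+X_PLUS+Y_MINUS+Z_PLUS]` to (+1, -1, +1), and `u[i+X_PLUS+X_MINUS]` to (0, 0, 0), that is, the local node.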

apeNEXT Programming Model: Parallel Language Constructs (4)
any()/all()/none() boolean operators

    !! evaluation of mesh size along X
    !! with systolic algorithm
    sum_ix=1
    sum_r[0]=node_abs_x
    sum_r[0]=sum_r[X_PLUS]      !! internode communication
    while(any(sum_r[0]!=node_abs_x))
      sum_ix=sum_ix+1
      sum_r[0]=sum_r[X_PLUS]
    endwhile
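The systolic algorithm can be traced on a host by emulating a ring of nodes along X. This is a serial sketch, assuming one value per node and a periodic X direction; the array index stands in for node_abs_x, and `mesh_size_along_x` and `MAX_NODES` are hypothetical names.

```c
#define MAX_NODES 64   /* arbitrary limit for this emulation */

/* Serial emulation of the systolic mesh-size evaluation: every node
   starts with its own X coordinate and repeatedly receives the value
   held by its X_PLUS neighbour until all nodes see their own
   coordinate again. Returns the ring length, or -1 on invalid input. */
int mesh_size_along_x(int n_nodes)
{
    int sum_r[MAX_NODES], shifted[MAX_NODES];
    int sum_ix = 1;
    int any_differs;

    if (n_nodes < 1 || n_nodes > MAX_NODES)
        return -1;

    for (int x = 0; x < n_nodes; ++x)
        sum_r[x] = x;                          /* sum_r[0] = node_abs_x */

    do {
        /* sum_r[0] = sum_r[X_PLUS] executed on every node at once */
        for (int x = 0; x < n_nodes; ++x)
            shifted[x] = sum_r[(x + 1) % n_nodes];
        for (int x = 0; x < n_nodes; ++x)
            sum_r[x] = shifted[x];

        /* any(sum_r[0] != node_abs_x): true if at least one node differs */
        any_differs = 0;
        for (int x = 0; x < n_nodes; ++x)
            if (sum_r[x] != x)
                any_differs = 1;

        if (any_differs)
            ++sum_ix;
    } while (any_differs);

    return sum_ix;
}
```

Each shift moves the values one step around the ring, so they return to their starting nodes after exactly as many shifts as there are nodes along X.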

Example 2D application kernel: C function

    T datain[LVOL], dataout[LVOL];        // LVOL is the node-local volume
    // precalculated neighbourhood tables
    int neighp[LVOL][2], neighm[LVOL][2];
    ...
    void kernel_fun() {
      register T res, d, dp0, dp1, dm0, dm1;
      int i;
      for (i = 0; i < LVOL; ++i) {        // i is a linearized index
        d   = datain[i];                  // always a local access
        dp0 = datain[neighp[i][0]];       // local or remote access
        dp1 = datain[neighp[i][1]];
        dm0 = datain[neighm[i][0]];
        dm1 = datain[neighm[i][1]];
        res = calc(d, dp0, dp1, dm0, dm1); // big & inline
        dataout[i] = res;
      }
    }

Example 2D application kernel: domain decomposition
[Figure: domain decomposition of the datain[] array along x and y, showing the datain[LVOL] local domain of node xy and the boundary of the local domain]

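Example 2D application kernel: First Neighbour Systolic Communication

    dm0 = datain[neighm[i][0]];

• Site on the boundary of the local domain: neighm[i][0] = local displacement + X_MINUS (remote access)
• Interior site: neighm[i][0] = local displacement (local access)
[Figure: 2x2 grid of nodes labelled 00, 01, 10, 11 along the x and y axes]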
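One possible way to precalculate the neighbourhood tables is sketched below. It assumes a small LX x LY local lattice and represents a remote access by adding a direction tag to the wrapped local index. The `DEMO_*` tag values, `LX`, `LY` and `build_neighbour_tables` are hypothetical and chosen only for this illustration; the real X_PLUS, X_MINUS, … constants are supplied by the apeNEXT compilers and their values are not given in the slides.

```c
#define LX 4
#define LY 4
#define LVOL (LX * LY)

/* Hypothetical direction tags standing in for the apeNEXT magic constants. */
enum {
    DEMO_X_PLUS  = 1 << 16,
    DEMO_X_MINUS = 2 << 16,
    DEMO_Y_PLUS  = 3 << 16,
    DEMO_Y_MINUS = 4 << 16
};

int neighp[LVOL][2], neighm[LVOL][2];

/* Fill the tables: for interior sites the entry is the local index of the
   neighbour; for boundary sites it is the index of the matching site on
   the opposite face plus a direction tag, marking a remote access. */
void build_neighbour_tables(void)
{
    for (int y = 0; y < LY; ++y) {
        for (int x = 0; x < LX; ++x) {
            int i  = y * LX + x;              /* linearized site index */
            int xp = (x + 1) % LX;
            int xm = (x + LX - 1) % LX;
            int yp = (y + 1) % LY;
            int ym = (y + LY - 1) % LY;

            neighp[i][0] = y * LX + xp + (x == LX - 1 ? DEMO_X_PLUS  : 0);
            neighm[i][0] = y * LX + xm + (x == 0      ? DEMO_X_MINUS : 0);
            neighp[i][1] = yp * LX + x + (y == LY - 1 ? DEMO_Y_PLUS  : 0);
            neighm[i][1] = ym * LX + x + (y == 0      ? DEMO_Y_MINUS : 0);
        }
    }
}
```

With the tables built once, the kernel loop needs no boundary tests: the same indexed load is a local access for interior sites and a remote access for boundary sites.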

Example: Monte Carlo Pi Calculation
• Estimate π by throwing darts at a unit square (r = 1)
• Calculate the fraction that falls inside the quadrant of the unit circle contained in the square
• Area of square = r² = 1
• Area of circle quadrant = ¼ π r² = π/4
• Randomly throw darts at (x, y) positions
• If x² + y² < 1, the point is inside the circle
• Compute ratio = # points inside / # points total
• π = 4 * ratio
• Replicate the calculation on N nodes in parallel to obtain better statistics

Example: Monte Carlo Pi Calculation, C+OpenMP Code

    #include <stdio.h>
    #include <math.h>
    #include <stdlib.h>
    #include <omp.h>

    #define FIRST_SEED 3374

    /* One dart: returns 1 if it lands inside the circle quadrant.
       rand_r() (POSIX) keeps the generator state in *seed, so each thread
       has its own stream; rand() would share one global state. */
    static inline int hit(unsigned int *seed) {
      double x = (double) rand_r(seed) / (double) RAND_MAX;
      double y = (double) rand_r(seed) / (double) RAND_MAX;
      if ((x*x + y*y) <= 1.0) return(1);
      else return(0);
    }

    int main(int argc, char **argv) {
      int i, hits = 0, trials = 0;
      const int max_threads = omp_get_max_threads();
      unsigned int seeds[max_threads];
      double pi;

      printf("MAX_THREADS = %d\n", max_threads);
      if (argc != 2) trials = 1000000;
      else trials = atoi(argv[1]);

      srand(FIRST_SEED);
      for (i = 0; i < max_threads; i++) {   /* decorrelate the seeds */
        seeds[i] = rand();
        printf("seed%d=%u\n", i, seeds[i]);
      }

      #pragma omp parallel private(i) shared(seeds, hits, trials)
      {
        unsigned int seed = seeds[omp_get_thread_num()];  /* per-thread state */
        #pragma omp for reduction(+:hits)
        for (i = 0; i < trials; i++)
          hits += hit(&seed);
      }

      pi = 4.0*(double)hits/(double)trials;
      printf("PI estimated to %.10g\n", pi);
      return 0;
    }

Example: Monte Carlo Pi Calculation, apeNEXT C Code

    #include <stdio.h>
    #include <math.h>
    #include <stdlib.h>
    #include <sysvars.h>
    #include <topology.h>

    #define FIRST_SEED 3374

    /* One dart: returns 1 if it lands inside the circle quadrant. */
    static inline int hit() {
      double x = (double) rand() / (double) RAND_MAX;
      double y = (double) rand() / (double) RAND_MAX;
      if ((x*x + y*y) <= 1.0) return(1);
      else return(0);
    }

    int main(int argc, char **argv) {
      int i, hits = 0, trials = 0;
      /* number of nodes in the partition */
      const int max_threads = *_mem_imachine_size_x_p *
                              *_mem_imachine_size_y_p *
                              *_mem_imachine_size_z_p;
      const int node_index = *_mem_inode_abs_id_p;
      unsigned int seeds[max_threads];
      double pi;

      printf("MAX_THREADS = %d\n", max_threads);
      if (argc != 2) trials = 1000000;
      else trials = atoi(argv[1]);

      srand(FIRST_SEED);
      for (i = 0; i < max_threads; i++) {
        seeds[i] = rand();
        printf("seed%d=%d\n", i, seeds[i]);
      }

      srand(seeds[node_index]);      /* each node uses its own seed */
      for (i = 0; i < trials; i++)
        hits += hit();

      hits = global_sum(hits);       /* sum of the hits over all nodes */
      trials *= max_threads;

      pi = 4.0*(double)hits/(double)trials;
      printf("PI estimated to %.10g\n", pi);
      return 0;
    }

Example: Monte Carlo Pi Calculation, Results on an Intel P4 Dual Core

    lonardo@marlin>env OMP_NUM_THREADS=16 ./monte_pi-gcc.o
    MAX_THREADS = 16
    seed0=1396293760
    seed1=1488115307
    seed2=1303873515
    seed3=37393359
    seed4=824846176
    seed5=1138759395
    seed6=1184683763
    seed7=1884735975
    seed8=443160774
    seed9=326610858
    seed10=878347714
    seed11=501308535
    seed12=1066424433
    seed13=1420631951
    seed14=391631339
    seed15=1730610200
    PI estimated to 3.14108
    lonardo@marlin>

Example: Monte Carlo Pi Calculation, Results on an apeNEXT Board (16 Nodes)

    lonardo@ant>nrun -hib -board 033 -minit0 monte-api.mem
    MAX_THREADS = 16
    seed0=6556077425992558173
    seed1=4923530068770806084
    seed2=4637196908100545377
    seed3=6221712952809700854
    seed4=279065984179923185
    seed5=7751953660738243840
    seed6=7614450982016732205
    seed7=1120288809807653798
    seed8=4640801604175907269
    seed9=4885633457180056444
    seed10=905770433927994553
    seed11=1598073754810041858
    seed12=7232028785291230425
    seed13=6726612558212505416
    seed14=3567338195430110971
    seed15=5194800804163472670
    PI estimated to 3.13989775
    lonardo@ant>

apeNEXT Compilation Chain: rtc TAO compiler
• Retargetable TAO Compiler: produces an intermediate pseudo-assembly file, which is further translated into assembly for APEmille or apeNEXT.
• Based on the Zz dynamic parser.
• Relies on a separate module for assembly code optimizations.
• Stable, production-quality compiler.

apeNEXT Compilation Chain: nlcc C compiler
• Port of the lcc 4.2 compiler to the apeNEXT architecture.
• Few optimizations.
• C99 + apeNEXT syntax extensions.
• Low rate of bug reports.

apeNEXT Compilation Chain: ngcc C compiler
• Port of the GNU C compiler (GCC) to the apeNEXT architecture
• Based on gcc version 4.1
• Optimization passes performed on the compiler’s internal representation of code (tree-SSA, RTL)
• Source language: C99 and GNU extensions to C99, plus apeNEXT extensions for parallel programming
• Possibility to integrate frontends for other source languages (C++, Fortran, TAO)
• Target language: apeNEXT user-level assembly (SASM)

apeNEXT Compilation Chain: ngcc status (work in progress)
• Single-node C compiler: DONE
• Vector data types and arithmetic: ALMOST DONE
• Exploitation of native complex types and arithmetic: TO DO
• Remote memory accesses implementation: DONE
• Prefetch instructions: ALMOST DONE
• Cache handling: TO DO
• where(), any(), all(), none() constructs: TO DO
• libc adaptation: JUST STARTED

apeNEXT Compilation Chain: mpp macro-assembler
Translates a “user-friendly” assembly into a micro-assembly representation:
• macro expansion
• label analysis
• emission of masm instructions for cache handling

apeNEXT Compilation Chain: sofan micro-assembly optimizer
Based on the salto (INRIA) optimization toolkit.
Transforms the micro-assembly code in order to perform a series of optimizations, such as:
• mul-add fusion
• dead code removal
• copy propagation
• address generation optimization
• instruction pre-scheduling

apeNEXT Compilation Chain: shaker microcode scheduler
Generation of optimized microcode to exploit the pipelined Very Long Instruction Word processor architecture:
• scheduling
• register renaming
• register allocation
• microcode compression
• optional generation of an executable for the functional simulator

apeNEXT Compilation Chain: shaker microcode scheduler
• Generation of microcode patterns: t_exec = t_max
• “Shake up” phase: try to schedule each pattern as early as possible, respecting:
  • dependencies between instructions
  • device occupation at each cycle
• After shake up: t_exec = t_su
[Figure: devices-by-cycles occupation grid before and after the shake up, with the schedule length shrinking from t_max to t_su]
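The idea of the “shake up” phase can be illustrated with a toy as-early-as-possible scheduler. This is a simplified sketch, not the shaker algorithm itself: it assumes every pattern occupies one device for one cycle, has unit latency, lists at most two producers, and appears after its producers in program order. `Instr`, `shake_up` and the size limits are hypothetical.

```c
#define MAX_CYCLES  64   /* arbitrary limits for this toy example */
#define MAX_DEVICES 8

typedef struct {
    int device;   /* device (functional unit) the pattern occupies */
    int dep[2];   /* indices of producer instructions, -1 if unused */
} Instr;

/* Place each instruction at the earliest cycle that respects its
   dependencies and finds its device free. Writes the chosen cycle of
   instruction i into cycle[i] and returns the schedule length in
   cycles, or -1 if the program does not fit the limits above. */
int shake_up(const Instr *prog, int n, int *cycle)
{
    int busy[MAX_CYCLES][MAX_DEVICES] = {{0}};
    int length = 0;

    if (n < 0 || n > MAX_CYCLES)
        return -1;

    for (int i = 0; i < n; ++i) {
        int t = 0;
        if (prog[i].device < 0 || prog[i].device >= MAX_DEVICES)
            return -1;

        /* dependencies: start after every producer has issued */
        for (int k = 0; k < 2; ++k) {
            int d = prog[i].dep[k];
            if (d >= i)
                return -1;               /* producers must come earlier */
            if (d >= 0 && cycle[d] + 1 > t)
                t = cycle[d] + 1;
        }

        /* device occupation: skip cycles where the device is taken */
        while (busy[t][prog[i].device])
            ++t;

        busy[t][prog[i].device] = 1;
        cycle[i] = t;
        if (t + 1 > length)
            length = t + 1;
    }
    return length;
}
```

Starting from one instruction per cycle (a schedule of n cycles), independent instructions on different devices end up sharing a cycle, which is where the reduction from t_max to t_su comes from in this toy model.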

apeNEXT Compilation Chain: shaker microcode scheduler
• “Shake down” phase: try to schedule each pattern as late as possible, respecting:
  • dependencies between instructions
  • device occupation at each cycle
• After shake down: t_exec = t_su - t_sd
• Typically t_max / t_exec ~ 10 in compute-intensive code sections
[Figure: devices-by-cycles occupation grid before and after the shake down, with t_sd and t_su marked]

apeNEXT Compilation Chain: sf functional simulator
• Micro-assembly instruction-level simulator.
• Support for single-node and multi-node simulations (1x1x1, 2x2x2, 4x2x2).
• Fast simulation (multithreaded), not cycle accurate.
• Bit-exact arithmetic (microcode scheduling may give differences).

apeNEXT Execution Environment: OS distributed architecture (1)
• 7thLink:
  • program loading
  • I/O operations
  • 1 channel per unit
  • 200 MB/s per channel
• I2C: bootstrap, exception handling, debugging (1.5 MB/s)

apeNEXT Execution Environment: OS distributed architecture (2)
• Master
  • resides on the front-end Linux PC
  • user interface (shell commands)
  • partitioning
  • dispatches I/O requests to the slaves
• Slave
  • resides on the blade PCs
  • handles communication with apeNEXT on the I2C and 7thLink PCI boards
• Tiny kernel of routines embedded in the apeNEXT program
  • loader
  • I/O (routing of data to and from the interface node)
  • system services (time counters, etc.)

apeNEXT Execution Environment
• Programs can be loaded and executed on a machine partition:
  • node (1x1x1)
  • board = 16 nodes (4x2x2)
  • unit = 4 boards (4x2x8)
  • crate = 4 units (4x8x8)
  • rack = 2 crates (8x8x8)
• The partition is reserved until the program execution finishes (no multitasking!)
• Single process
• No virtual memory
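The partition sizes are consistent with one another, as a short calculation shows. The sketch below simply multiplies the mesh extents listed above; the `Partition` type and `partition_nodes` function are illustrative names, not part of the apeNEXT system software.

```c
/* Illustrative table of the partition geometries listed in the slide. */
typedef struct {
    const char *name;
    int x, y, z;      /* mesh extents of the partition */
} Partition;

static const Partition partitions[] = {
    {"node",  1, 1, 1},
    {"board", 4, 2, 2},
    {"unit",  4, 2, 8},
    {"crate", 4, 8, 8},
    {"rack",  8, 8, 8}
};

/* Number of nodes in a partition: the product of its mesh extents. */
int partition_nodes(const Partition *p)
{
    return p->x * p->y * p->z;
}
```

This gives 16 nodes per board, 64 per unit (4 boards), 256 per crate (4 units) and 512 per rack (2 crates), the same product that the Monte Carlo example computes from the machine size variables.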

Batch system
• Torque/OpenPBS
• Currently FIFO scheduling; a scheduler based on user-group quotas is being implemented.
• Queues:
  • rack
  • crate
  • unit

Batch System: Job Submission
• nsub: wrapper of the qsub command

    Usage: nsub [OPTIONS] script
    Submits a apeNEXT job
    where OPTIONS are:
      -a date_time   Declares the time after which the job is eligible for execution
                     the format is: [[[[CC]YY]MM]DD]hhmm[.SS]
      -c conf        chooses among available apeNEXT configurations
                     conf=board|unit|unit[01][0-3]|crate|crate[01]|rack (default=crate)
      -m host_name   requests a particular host
      -g group_name  overrides user group
      -o logfile     overrides logfile name
      -V             dumps version information
      -v             be verbose
      -h             shows this help

Batch System: Job submission example

    lonardo@theboss>nsub -c crate 7h_test.sh
    15942.theboss.ape
    lonardo@theboss>qstat -an1

    theboss.ape:
                                                                         Req'd  Req'd   Elap
    Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
    -------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
    15813.theboss.ape    orifici  crate    stoc2        1826     1  --     -- 24:00 R 21:52   rack10/1
    15860.theboss.ape    simula   crate    run_tdilu   29170     1  --     -- 24:00 R 14:34   rack4/1
    15877.theboss.ape    zeidlew  crate    mu.056.cjo   5925     1  --     -- 24:00 R 11:38   rack8/0
    15880.theboss.ape    delia    crate    run.sh       6386     1  --     -- 24:00 R 10:59   rack8/1
    15896.theboss.ape    delia    unit     rum0175.sh  32291     1  --     -- 24:00 R 08:47   rack7/5
    15900.theboss.ape    frezzott rack     RUN_Rack5.  18099     1  --     -- 24:00 R 07:56   rack5/0
    15906.theboss.ape    delia    unit     run0175.sh   1072     1  --     -- 24:00 R 06:49   rack7/0
    15918.theboss.ape    frezzott rack     RUN_Rack2.  15890     1  --     -- 24:00 R 04:30   rack2/0
    15926.theboss.ape    simula   crate    run1_tdilu   4409     1  --     -- 24:00 R 02:47   rack4/0
    15927.theboss.ape    delia    unit     run0200.sh   3596     1  --     -- 24:00 R 02:47   rack7/4
    15928.theboss.ape    lacagnin crate    run.5.7.sh   4772     1  --     -- 24:00 R 02:36   rack1/0
    15930.theboss.ape    lacagnin crate    run.5.6.sh   2787     1  --     -- 24:00 R 02:34   rack9/0
    15932.theboss.ape    cosmai   crate    b5.450_n0.   2994     1  --     -- 24:00 R 02:02   rack9/1
    15933.theboss.ape    delia    unit     rum0200.sh   4216     1  --     -- 24:00 R 01:53   rack7/2
    15934.theboss.ape    delia    unit     run0225.sh   4552     1  --     -- 24:00 R 01:24   rack7/1
    15935.theboss.ape    devitiis crate    theboss.sh  28504     1  --     -- 24:00 R 01:22   rack3/0
    15936.theboss.ape    orifici  crate    stoc        28592     1  --     -- 24:00 R 00:57   rack3/1
    15937.theboss.ape    devitiis crate    theboss.sh  18065     1  --     -- 24:00 R 00:57   rack6/0
    15939.theboss.ape    cosmai   crate    b5.450_n1.   5845     1  --     -- 24:00 R 00:54   rack1/1
    15940.theboss.ape    devitiis crate    theboss.sh  18472     1  --     -- 24:00 R 00:22   rack6/1
    15941.theboss.ape    delia    unit     rum0225.sh   5290     1  --     -- 24:00 R 00:14   rack7/3
    15942.theboss.ape    lonardo  crate    7h_test.sh  11226     1  --     -- 24:00 R    --   rack10/0
