Parallel I/O Basics
Claudio Gheller, CINECA, c.gheller@cineca.it
Reading and writing data is a problem that is usually underestimated. However, it can become crucial for:
• Performance
• Porting data to different platforms
• Parallel implementation of I/O algorithms
Performance
Disk bandwidth: approx. 10-100 Mbyte/sec
Memory bandwidth: approx. 1-10 Gbyte/sec
THEREFORE
When reading/writing on disk, a code is about 100 times slower than when it works in memory.
Optimization is platform dependent. In general: write large amounts of data in single shots.
Performance
For example: avoid looped read/write. The loop

    do i=1,N
       write (10) A(i)
    enddo

is VERY slow, because it issues N separate write operations, one per array element.
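The difference can be measured with a small timing test. The following is only a sketch: the array size, the file names and the use of `system_clock` are illustrative assumptions, and the actual ratio depends on the platform and the compiler.

```fortran
program write_single_shot
  implicit none
  integer, parameter :: n = 1000000      ! illustrative array size
  real(kind=4), allocatable :: a(:)
  integer :: i, t0, t1, rate

  allocate(a(n))
  a = 1.0

  ! Slow variant: one write statement (one record) per element.
  open(unit=10, file='looped.bin', form='unformatted', status='replace')
  call system_clock(t0, rate)
  do i = 1, n
     write(10) a(i)
  end do
  call system_clock(t1)
  close(10)
  print *, 'looped write (s):', real(t1 - t0) / real(rate)

  ! Fast variant: the whole array in a single write statement.
  open(unit=11, file='single.bin', form='unformatted', status='replace')
  call system_clock(t0)
  write(11) a
  call system_clock(t1)
  close(11)
  print *, 'single write (s):', real(t1 - t0) / real(rate)
end program write_single_shot
```

The two files also differ in size, since the looped variant stores one record per element.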
Data portability
• This is a subtle problem, which becomes crucial only at the end, when you try to use the data on a different platform.
• For example: unformatted data written by an IBM system cannot be read by an Alpha station or by a Linux/MS Windows PC.
• There are two main problems:
  • Data representation
  • File structure
Data portability: number representation
There are two different representations. The 4-byte word Byte3 Byte2 Byte1 Byte0 is arranged in memory as follows:

    Address           Little Endian    Big Endian
    Base Address+0    Byte0            Byte3
    Base Address+1    Byte1            Byte2
    Base Address+2    Byte2            Byte1
    Base Address+3    Byte3            Byte0

    Platforms         Alpha, PC        Unix (IBM, SGI, SUN…)
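The byte order of the machine can be checked directly from Fortran. The sketch below is not part of the original slides: it assumes a Fortran 2003 compiler (for `iso_fortran_env` kinds and BOZ constants inside `int`), and the test value `0A0B0C0D` is arbitrary.

```fortran
program endian_demo
  use, intrinsic :: iso_fortran_env, only: int8, int32
  implicit none
  integer(int32) :: word, swapped
  integer(int8)  :: bytes(4)

  ! A 4-byte word whose bytes are all different.
  word = int(z'0A0B0C0D', int32)

  ! Reinterpret the word as 4 bytes, in the order they sit in memory.
  bytes = transfer(word, bytes)

  if (bytes(1) == int(z'0D', int8)) then
     print *, 'little endian: least significant byte at the base address'
  else
     print *, 'big endian: most significant byte at the base address'
  end if

  ! Reversing the byte order converts between the two representations.
  swapped = transfer(bytes(4:1:-1), swapped)
  print '(a,z8.8)', 'byte-swapped word: ', swapped
end program endian_demo
```

Reversing the bytes of every number in this way is what is needed when a file written on one kind of machine is read on the other.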
Data portability: file structure
For performance reasons, Fortran organizes sequential binary (unformatted) files in BLOCKS (records).
Each block is delimited by a marker, typically a few bytes that hold the record length.
Unfortunately, each Fortran compiler has its own block size and separators!!!
Notice that this problem is typical of Fortran and does not affect C / C++.
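The extra bytes can be made visible by comparing the size of the data with the size of the file. This is a sketch under assumptions: it needs a compiler that supports the `size=` specifier of `inquire` (Fortran 2003) and `storage_size` (Fortran 2008), and the file name is illustrative. The overhead it reports is compiler dependent, which is exactly the portability problem.

```fortran
program record_markers
  implicit none
  real(kind=4) :: x(3) = [1.0, 2.0, 3.0]
  integer :: payload, filesize

  ! Write a single record to a sequential unformatted file.
  open(unit=10, file='seq.bin', form='unformatted', access='sequential', &
       status='replace')
  write(10) x
  close(10)

  ! Bytes of actual data versus bytes in the file.
  payload = size(x) * storage_size(x) / 8
  inquire(file='seq.bin', size=filesize)

  print *, 'data bytes:        ', payload
  print *, 'file bytes:        ', filesize
  print *, 'bytes of separators:', filesize - payload
end program record_markers
```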
Data portability: Compiler solutions
• Some compilers allow you to overcome these problems with specific options
• However, this leads you to
  • Spend a lot of time re-configuring the compilation on each different system
  • Have less portable code (the results depend on the compiler)
Data portability: Compiler solutions
For example, the Alpha Fortran compiler allows you to use Big-Endian data through the -convert big_endian option.
However, this option is compiler specific: other compilers spell it differently or do not offer it at all. Furthermore, data produced with this option are incompatible with the native format of the system that wrote them!!!
Fortran offers a possible solution to both the performance and the portability problems: DIRECT ACCESS files.

    Open(unit=10, file='datafile.bin', form='unformatted', access='direct', recl=N)

The result is a binary file with no blocks and no control characters. Any Fortran compiler writes (and can read) it in THE SAME WAY.
Notice, however, that the endianness problem is still present: the file is portable only between platforms with the same endianness.
Direct Access Files
• The keyword recl is the basic quantum of written data. It is usually expressed in bytes (except on Alpha, which expresses it in words).
• Example 1

    Real*4 x(100)
    Inquire(IOLENGTH=IOL) x(1)
    Open(unit=10, file='datafile.bin', access='direct', recl=IOL)
    Do i=1,100
       write(10,rec=i)x(i)
    Enddo
    Close (10)

• Portable but not performing !!!
• (Notice that this is precisely the C fread/fwrite style of I/O)
Direct Access Files
• Example 2

    Real*4 x(100)
    Inquire(IOLENGTH=IOL) x
    Open(unit=10, file='datafile.bin', access='direct', recl=IOL)
    write(10,rec=1)x
    Close (10)

• Portable and Performing !!!
Direct Access Files
• Example 3

    Real*4 x(100),y(100),z(100)
    Open(unit=10, file='datafile.bin', access='direct', recl=4*100)
    write(10,rec=1)x
    write(10,rec=2)y
    write(10,rec=3)z
    Close (10)

• The same result can be obtained as

    Real*4 x(100),y(100),z(100)
    Open(unit=10, file='datafile.bin', access='direct', recl=4*100)
    write(10,rec=2)y
    write(10,rec=3)z
    write(10,rec=1)x
    Close (10)

• Order is not important!!!
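Example 3 can be turned into a complete program that also reads the records back and checks them. This is a minimal sketch: the array contents, the buffer `buf` and the checks are illustrative additions, and `recl` is obtained with `inquire(iolength=...)` as in Example 2 instead of being hard-coded in bytes.

```fortran
program direct_access_roundtrip
  implicit none
  integer, parameter :: n = 100
  real(kind=4) :: x(n), y(n), z(n), buf(n)
  integer :: iol, i

  x = [(real(i), i = 1, n)]
  y = 2.0 * x
  z = 3.0 * x

  ! Record length of one array, in the units used by this compiler.
  inquire(iolength=iol) x

  ! Write the three records out of order: the record number decides
  ! where each one lands in the file.
  open(unit=10, file='datafile.bin', form='unformatted', access='direct', &
       recl=iol, status='replace')
  write(10, rec=2) y
  write(10, rec=3) z
  write(10, rec=1) x
  close(10)

  ! Read two records back, again in arbitrary order, and verify them.
  open(unit=10, file='datafile.bin', form='unformatted', access='direct', &
       recl=iol, status='old')
  read(10, rec=3) buf
  if (any(buf /= z)) error stop 'record 3 does not contain z'
  read(10, rec=1) buf
  if (any(buf /= x)) error stop 'record 1 does not contain x'
  close(10)

  print *, 'records read back correctly'
end program direct_access_roundtrip
```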
Parallel I/O
• I/O is not a trivial issue in parallel
• Example

    Program Scrivi
    Write(*,*)' Hello World'
    End program Scrivi

Execute in parallel on 4 processors (Pe 0, Pe 1, Pe 2, Pe 3); each processor prints its own line:

    $ ./Scrivi
     Hello World
     Hello World
     Hello World
     Hello World
Parallel I/O
Goals:
• Improve the performance
• Ensure data consistency
• Avoid communication
• Usability
Parallel I/O
Solution 1: Master-Slave
Only 1 processor performs I/O
Goals:
• Improve the performance: NO
• Ensure data consistency: YES
• Avoid communication: NO
• Usability: YES (but in general not portable)
[Figure: Pe 0, Pe 1, Pe 2, Pe 3 and a single data file]
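One possible realization of this scheme with MPI is sketched below. It is an assumption, not the author's code: the slides do not say how the data reach the processor that writes, so `MPI_Gather` is used here as one option, and the local size `nloc` and the file name are illustrative.

```fortran
program master_io
  use mpi
  implicit none
  integer, parameter :: nloc = 4          ! elements per processor (illustrative)
  double precision :: local(nloc)
  double precision, allocatable :: global(:)
  integer :: ierr, mype, npes, iol

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, mype, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, npes, ierr)

  local = dble(mype)

  ! Only processor 0 needs a buffer large enough for all the data.
  if (mype == 0) then
     allocate(global(nloc * npes))
  else
     allocate(global(1))
  end if

  ! Communication step: every processor sends its part to processor 0.
  call MPI_Gather(local, nloc, MPI_DOUBLE_PRECISION, &
                  global, nloc, MPI_DOUBLE_PRECISION, &
                  0, MPI_COMM_WORLD, ierr)

  ! I/O step: processor 0 alone writes the file, in a single shot.
  if (mype == 0) then
     inquire(iolength=iol) global
     open(unit=10, file='gathered.bin', form='unformatted', access='direct', &
          recl=iol, status='replace')
     write(10, rec=1) global
     close(10)
  end if

  call MPI_Finalize(ierr)
end program master_io
```

The gather is the communication cost listed among the goals above, and the single writer is the reason performance does not improve.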
Parallel I/O
Solution 2: Distributed I/O
All the processors read/write their own files
Goals:
• Improve the performance: YES (but be careful)
• Ensure data consistency: YES
• Avoid communication: YES
• Usability: NO
[Figure: Pe 0, Pe 1, Pe 2, Pe 3, each with its own file: Data File 0, Data File 1, Data File 2, Data File 3]
Warning: Do not parametrize with processors!!!
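A minimal sketch of this scheme follows, assuming MPI is used to obtain the processor index. The file naming pattern `data_NNNN.bin` and the local size `nloc` are illustrative choices, not taken from the slides.

```fortran
program distributed_io
  use mpi
  implicit none
  integer, parameter :: nloc = 4          ! elements per processor (illustrative)
  double precision :: local(nloc)
  character(len=32) :: fname
  integer :: ierr, mype, iol

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, mype, ierr)

  local = dble(mype)

  ! Build a file name that is unique for each processor, e.g. data_0002.bin.
  write(fname, '(a,i4.4,a)') 'data_', mype, '.bin'

  ! Each processor writes its own file; no communication is needed.
  inquire(iolength=iol) local
  open(unit=10, file=trim(fname), form='unformatted', access='direct', &
       recl=iol, status='replace')
  write(10, rec=1) local
  close(10)

  call MPI_Finalize(ierr)
end program distributed_io
```

The result is one file per processor, which is why usability is rated NO: the data set has to be reassembled before it can be used elsewhere.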
Parallel I/O
Solution 3: Distributed I/O on a single file
All the processors read/write on a single ACCESS=DIRECT file
Goals:
• Improve the performance: YES for read, NO for write
• Ensure data consistency: NO
• Avoid communication: YES
• Usability: YES (portable !!!)
[Figure: Pe 0, Pe 1, Pe 2, Pe 3 all accessing a single data file]
Parallel I/O
Solution 4: MPI2 I/O
MPI functions perform the I/O. These functions are not part of the Fortran standard (they belong to MPI-2). Asynchronous I/O is supported.
Goals:
• Improve the performance: YES (strongly!!!)
• Ensure data consistency: NO
• Avoid communication: YES
• Usability: YES
[Figure: Pe 0, Pe 1, Pe 2, Pe 3 accessing a single data file through MPI]
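As a hedged illustration of what MPI-2 I/O looks like, the sketch below has every processor write its own slice of one shared file with the collective call `MPI_File_write_at_all`. The file name, the local size `nloc` and the layout (slices stored in processor order) are assumptions made for the example.

```fortran
program mpiio_write
  use mpi
  implicit none
  integer, parameter :: nloc = 4          ! elements per processor (illustrative)
  double precision :: local(nloc)
  integer :: ierr, mype, fh
  integer :: status(MPI_STATUS_SIZE)
  integer(kind=MPI_OFFSET_KIND) :: offset

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, mype, ierr)

  local = dble(mype)

  ! All processors open the same file together.
  call MPI_File_open(MPI_COMM_WORLD, 'mpiio.bin', &
                     ior(MPI_MODE_CREATE, MPI_MODE_WRONLY), &
                     MPI_INFO_NULL, fh, ierr)

  ! Byte offset of this processor's slice (default file view counts bytes).
  offset = int(mype, MPI_OFFSET_KIND) * nloc * (storage_size(local) / 8)

  ! Collective write: each processor writes nloc values at its own offset.
  call MPI_File_write_at_all(fh, offset, local, nloc, MPI_DOUBLE_PRECISION, &
                             status, ierr)

  call MPI_File_close(fh, ierr)
  call MPI_Finalize(ierr)
end program mpiio_write
```

The file contains raw data with no record markers, so the endianness caveat discussed for direct access files applies here as well.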
Case Study
Data analysis: case 1
How many clusters are there in the image?
Cluster finding algorithm
• Input = the image
• Output = a number
Case Study
Case 1: Parallel implementation
Parallel cluster finding algorithm
• Input = a fraction of the image
• Output = a number for each processor
[Figure: the image divided between Pe 0 and Pe 1]
All the parallelism is in the setup of the input. Then all processors work independently!!!!
Case Study
Case 1: Setup of the input
Each processor reads its own part of the input file

    ! The image is NxN pixels, using 2 processors
    Real*4 array(N,N/2)
    Open (unit=10, file='image.bin',access='direct',recl=4*N*N/2)
    Startrecord=mype+1
    read(10,rec=Startrecord)array
    Call Sequential_Find_Cluster(array, N_cluster)
    Write(*,*)mype,' found', N_cluster, ' clusters'

[Figure: the image divided between Pe 0 and Pe 1]
Case Study
Case 1: Boundary conditions
Boundaries must be treated in a specific way

    ! The image is NxN pixels, using 2 processors
    Real*4 array(0:N+1,0:N/2+1)
    ! Set boundaries on the image side
    array(0,:) = 0.0
    array(N+1,:)= 0.0
    jside= mod(mype,2)*N/2+mod(mype,2)
    array(:,jside)=0.0
    ! One record per image column (N values), read into the interior rows 1..N
    Open (unit=10, file='image.bin',access='direct',recl=4*N)
    Do j=1,N/2
       record=mype*N/2+j
       read(10,rec=record)array(1:N,j)
    Enddo
    ! Read the neighbouring column that belongs to the other processor
    If(mype.eq.0)then
       record=N/2+1
       read(10,rec=record)array(1:N,N/2+1)
    else
       ! last column of processor 0 (record N/2)
       record=N/2
       read(10,rec=record)array(1:N,0)
    endif
    Call Sequential_Find_Cluster(array, N_cluster)
    Write(*,*)mype,' found', N_cluster, ' clusters'

[Figure: two ways of dividing the image between Pe 0 and Pe 1, one labelled "suggested" and one labelled "avoid"]
Case Study
Data analysis: case 2
From observed data… to the sky map
Case Study
Data analysis: case 2
Each map pixel is measured N times. The final value for each pixel is an "average" of all the corresponding measurements.
[Figure: arrays of values and of map pixel ids, mapped onto the MAP]
Case Study
Case 2: parallelization
• Values and ids are distributed between processors in the data input phase (just like case 1)
• Calculation is performed independently by each processor
• Each processor produces its own COMPLETE map (which is small and can be replicated)
• The final map is the SUM OF ALL THE MAPS calculated by the different processors
Case Study
Case 2: parallelization

    ! N Data, M pixels, Npes processors (M << N)
    ! Define basic arrays
    Real*8 value(N/Npes)
    Real*8 map(M)
    Integer id(N/Npes)
    ! Read data in parallel (boundaries are neglected)
    ! value is Real*8 (8 bytes per element), id is Integer (4 bytes per element)
    Open(unit=10,file='data.bin',access='direct',recl=8*N/Npes)
    Open(unit=20,file='ids.bin',access='direct',recl=4*N/Npes)
    record=mype+1
    Read(10,rec=record)value
    Read(20,rec=record)id
    ! Calculate local maps
    Call Sequential_Calculate_Local_Map(value,id,map)
    ! Synchronize processes
    Call BARRIER
    ! Parallel calculation of the final map
    Call Calculate_Final_Map(map)
    ! Print final map
    Call Print_Final_Map(map)
Case Study
Case 2: calculation of the final map

    Subroutine Calculate_Final_Map(map)
    Real*8 map(M)
    Real*8 map_aux(M)
    ! Processor 0 already holds its own map: it only receives from 1..npes-1
    Do i=2,npes
       If(mype.eq.0)then
          call RECV(map_aux,i-1)
          map=map+map_aux
       Else if (mype.eq.i-1)then
          call SEND(map,0)
       Endif
       Call BARRIER
    enddo
    return
    End

Calculate the final map processor by processor.
However, MPI offers a MUCH BETTER solution (we will see it tomorrow).
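The slides do not name the better solution. One MPI collective that performs this kind of element-wise sum is `MPI_Reduce`; whether this is the call the author has in mind is an assumption. In the sketch below the map size `m` and the map contents are illustrative.

```fortran
program reduce_map
  use mpi
  implicit none
  integer, parameter :: m = 8             ! number of map pixels (illustrative)
  double precision :: map(m), final_map(m)
  integer :: ierr, mype, npes

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, mype, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, npes, ierr)

  ! Stand-in for the local map computed by each processor.
  map = 1.0d0

  ! Sum the local maps element by element; the result lands on processor 0.
  call MPI_Reduce(map, final_map, m, MPI_DOUBLE_PRECISION, MPI_SUM, &
                  0, MPI_COMM_WORLD, ierr)

  if (mype == 0) then
     ! Each pixel received a contribution of 1 from every processor.
     if (any(final_map /= dble(npes))) error stop 'unexpected sum'
     print *, 'final map pixel value:', final_map(1)
  end if

  call MPI_Finalize(ierr)
end program reduce_map
```

A single collective call replaces the explicit loop of sends, receives and barriers, and leaves the MPI library free to combine the maps in an efficient order.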
Case Study
Case 2: print the final map
At this point ONLY processor 0 has the final map and can print it out

    Subroutine Print_Final_Map(map)
    Real*8 map(M)
    ! Only one processor writes the result
    If(mype.eq.0)then
       do i=1,M
          write(*,*)i,map(i)
       enddo
    Endif
    return
    End