Tuning DiFX2 for performance Adam Deller ASTRON 6th DiFX workshop, CSIRO ATNF, Sydney AUS
Outline • I/O bottlenecks and solutions • Communication with the real world (reading raw data, writing visibilities) • Interprocess communication • Keeping out of memory trouble • Minimizing CPU load in various corners of parameter space • For more information and pictures: http://cira.ivec.org/dokuwiki/doku.php/difx/mpifxcorr/
Getting data into DiFX [Diagram: source baseband data flows into each DataStream node's large, segmented ring buffer; the master node assigns each chunk a timerange and destination; Core nodes receive the data into processing buffers; visibilities flow back to the master node's visibility buffers.]
Getting data into DiFX • How to test? neutered_difx, with a small number of channels • Fundamental limit: native transfer speed (disk read, network pipe) • If this is the problem, buy a RAID or get InfiniBand or … • Potential troublemaker: CPU utilisation on the Datastream node (competition for the CPU) • Can come from Tsys estimation • Tweaking: the Datastream databuffer
The Datastream databuffer and the “subint” [Diagram: the databuffer, divided into segments, with one “subint” of data as the unit passed on for processing.] • Key parameters: dataBufferFactor, nDataSegments, subintNS • Only real potential problem I/O-wise: buffer too short (dataBufferFactor)
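These buffer knobs all live in the vex2difx .v2d file. A minimal sketch, assuming typical usage (the parameter names are real vex2difx settings; the values here are invented for illustration, not recommendations):

```
# Buffer sizing is global in the .v2d; the subint length sits in a SETUP block.
dataBufferFactor = 48     # default is 32; how many send-sizes the ring buffer holds
nDataSegments    = 12     # how many segments the ring buffer is divided into

SETUP default
{
  subintNS = 20000000     # subint length in nanoseconds (20 ms here)
}
```

A longer buffer smooths over bursty disk or network reads; the cost is Datastream memory, as discussed later.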
Getting visibilities out of DiFX [Same dataflow diagram, now highlighting the path from the master node's visibility buffers to disk.]
Getting visibilities out of DiFX • FxManager writes the visibilities to disk • This is very rarely a problem unless you have a dying disk or very large and/or frequent visibility dumps • Testing: neutered_difx + fake data source (ensures good input speeds) • Tweaking: none • If you want to write out visibilities faster, put a fast disk (probably RAID) on the manager node!
Interprocess @ the Datastream [Same dataflow diagram, highlighting the links from each DataStream node's ring buffer to the Core nodes.]
Interprocess @ the Datastream • Generally not a problem • Tweaking: dataBufferFactor; ensure a reasonable size (avoids latency issues) • Default (32) generally OK, but could usually be bigger without problems (increase nDataSegments also)
Interprocess @ the Core [Same dataflow diagram, highlighting the subints arriving at the Core processing buffers and the visibilities sent onward.] • Tweaking: • subintNS • Output visibility size (nChan / nBaselines)
Interprocess @ the Core • In terms of reducing data transmission, increasing subintNS is the only real knob to turn • Unimportant for continuum, single-phase-centre work - it’s only at very high spectral resolution and/or with multiple phase centres that this becomes relevant • In those cases, bigger is better; but be careful about memory (later)
Interprocess @ the FxManager [Same dataflow diagram, highlighting the visibility streams converging on the master node.] The most common trouble point! FxManager must aggregate data from all Core nodes, which can lead to high data rates.
Interprocess @ the FxManager • To calculate the rate into FxManager, work out the rate for one Core node and scale • Tweaking: maximise subintNS! Or (although this is usually not possible) reduce the visibility size (via nChan or the number of phase centres)
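A rough worked example, with all numbers invented for illustration: 10 stations give 45 baselines; with 4 polarisation products, 4096 channels and 8 bytes per complex visibility, one dump is about 45 × 4 × 4096 × 8 B ≈ 5.9 MB. If the cluster as a whole keeps up with real time at subintNS = 20 ms, FxManager must absorb roughly 5.9 MB / 0.02 s ≈ 300 MB/s; doubling subintNS halves the message rate. A hedged .v2d sketch of that tweak:

```
# Illustrative only: longer subints mean fewer, larger messages into
# FxManager (at the cost of Core memory, as discussed later).
SETUP highres
{
  nChan    = 4096         # usually fixed by the science
  subintNS = 40000000     # 40 ms instead of 20 ms: half the messages per second
}
```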
Memory @ the Datastream • Just don’t make the combination of dataBufferFactor and subintNS too big (can also control via “sendSize”)
Memory @ the Core • Usually the biggest problem, memory-wise • Never used to be a problem, but multiple-phase-centre jobs hit hard • A bigger subint means more memory (for storing Datastream baseband data) • More threads mean more memory - at the pre-average spectral resolution • Buffering more FFTs costs more (multiplied by the number of threads, too!)
Memory @ the Core • Tweaking: • subintNS • nThreads (threads file) • numBufferedFFTs • And be aware of: • nFFTChans (for multiple phase centres / high spectral resolution) • Number of phase centres
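As a sketch of what reining in Core memory might look like (illustrative values; the right numbers depend entirely on your hardware and experiment):

```
# Illustrative .v2d settings for a memory-constrained Core node:
SETUP default
{
  subintNS        = 10000000   # shorter subints: less baseband held at once
  numBufferedFFTs = 4          # fewer buffered FFTs: less scratch space per thread
}
```

The per-node thread counts live in the .threads file (a sketch appears under nThreads below); fewer threads also means proportionally less memory held at the pre-average spectral resolution.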
Memory @ the FxManager • Tweaking: visBufferLength • Multiplies the size of a single visibility (nChan, nBaselines, nPhaseCentres) • Generally not a problem • Note: visBufferLength should not be too short, especially if you have many (and especially heterogeneous) Core nodes, as the subints can arrive out of order
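For illustration, visBufferLength is a single global setting in the .v2d (the value here is invented):

```
# Illustrative: a longer visibility buffer tolerates subints arriving
# out of order from many or heterogeneous Core nodes.
visBufferLength = 80
```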
CPU @ the Datastream • Loading of the Datastream is usually pretty light • But the Datastream often runs on old hardware (e.g. Mk5 units) with limited CPU capacity • A couple of options can cause problematically high loads: • Tsys extraction (.v2d: tcalFreq = xx) • Interlaced VDIF formats (used with multi-thread VDIF data, e.g. the phased EVLA) • More efficient implementations are coming; for now, buy a faster CPU if needed!
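As a hypothetical example of the Tsys knob (the antenna name is invented, and 80 Hz is merely a typical switched-power cal rate; check your own system):

```
# Tsys extraction is enabled per antenna via tcalFreq in the .v2d;
# it costs Datastream CPU, so enable it only where needed.
ANTENNA BR
{
  tcalFreq = 80   # switched-power cal frequency in Hz (illustrative)
}
```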
CPU @ the Core • Many considerations here, including parameters usually fixed by the science • Number of phase centres • Spectral resolution (nChan/nFFTChan) • Plus several on array management • strideLength • numBufferedFFTs • xmacLength • And then a few others as well: • nThreads • fringe rotation order
CPU @ the Core • Number of phase centres • For each phase centre: a phase rotation, plus separate accumulation from the thread buffer to the main buffer • That costs CPU (proportional to the number of baselines and the number of phase centres), but also ensures that the results don’t fit in cache (more later)
CPU @ the Core • Spectral resolution • More channels mean a bigger FFT, and that costs CPU • Doesn’t typically follow a log N law like it should - bigger gets worse quickly beyond ~1024 channels due to cache performance • Really big (>= 8192 channels per subband) gets very expensive • Worst thing: it typically comes in combination with multiple phase centres! (required to avoid bandwidth smearing)
CPU @ the Core • Array management • #1: strideLength (auto setting usually best) [Diagram: fringe rotation phase sweeping from 180° to -180° across one FFT of data; sin/cos is evaluated explicitly only for the first strideLength samples and for every strideLength'th sample after that.]
CPU @ the Core • Array management • #2: numBufferedFFTs (auto = 10 usually OK) • Mitigates the cache-miss problem by ~10x [Diagram: FFT results for numBufferedFFTs FFTs are precomputed one station (Mode 1 … Mode N) at a time; the visibility buffer is too big for cache, but one slot of it fits.]
CPU @ the Core • Array management • #3: xmacLength (auto setting of 128 usually fine; further subdivides the XMAC step) [Same diagram as #2.]
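If you do want to pin these knobs rather than take the automatic choices, a hedged .v2d sketch (the values echo the defaults quoted on these slides; whether 0 means "automatic" may depend on your vex2difx version, so verify before relying on it):

```
# Illustrative: explicit array-management settings in a SETUP block.
SETUP default
{
  strideLength    = 0     # 0 requests the automatic choice (usually best)
  numBufferedFFTs = 10    # the automatic value quoted on the slide
  xmacLength      = 128   # the automatic value quoted on the slide
}
```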
CPU @ the Core • nThreads • Usually, set nThreads = n(CPU cores) - 1 • Occasionally it can be advantageous to use fewer threads (avoiding swap memory / cache contention)
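For reference, a minimal sketch of a .threads file for two 8-core Core nodes, following the "cores minus one" rule (the format is from memory; compare against a file generated by your own DiFX installation):

```
NUMBER OF CORES:    2
7
7
```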
CPU @ the Core • Fringe rotation order • Default is 1, and this is almost always fine • 2nd order is only ever needed for very high fringe rates with very long FFTs (space VLBI of masers?) • BUT: 0th order could often be used, and almost never is: it can be about 25% faster • .v2d: fringeRotOrder = [0, 1, 2] [Plots: fringe rotation phase vs. time across the 1st and 2nd FFTs; at high fringe rate the phase changes too much within one FFT for 0th order, but at low fringe rate the 0th-order (constant-per-FFT) approximation can be acceptable.]
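A hypothetical example of opting in to the cheaper mode for a low-fringe-rate experiment (illustrative; check on your own data that the approximation holds):

```
# Illustrative: 0th-order fringe rotation, roughly 25% faster per the
# slide, but valid only when the fringe rate is low across one FFT.
SETUP default
{
  fringeRotOrder = 0
}
```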
CPU @ the FxManager • CPU load at the FxManager is typically light - it only does low-cadence accumulation and scaling of visibilities • Very short subintNS can potentially lead to problems (although network issues are more likely)