Explore efficient data compression techniques for multidimensional nuclear spectra storage with preservation of information and minimal data distortion. Learn about methods such as symmetry removal, fast orthogonal transformation, and binning for compression.
Efficient algorithms of multidimensional γ-ray spectra compression • V. Matoušek and M. Morháč • Institute of Physics, Slovak Academy of Sciences, Bratislava, Slovakia • Vladislav.Matousek@savba.sk Miroslav.Morhac@savba.sk • ACAT 2005, Zeuthen May 22 - 27, 2005
Measurements in nuclear physics experiments are oriented towards gathering large amounts of multidimensional data. • The data are collected in the form of events. • In a typical experiment with spectrometers (Gammasphere, Euroball), each coincidence event consists of a set of n integers (e1, e2, …, en), which are proportional to the energies of the coincident γ-rays. • Such a coincidence specifies a point in an n-dimensional hypercube. • Storing multidimensional data very frequently exceeds the capacity of the available storage media.
Multiparameter nuclear data taken from experiments are typically: • stored directly event by event, indexing the coincidences - list mode storage; • analyzed and stored as multidimensional histograms (hypercubes) - nuclear spectra. • List mode storage has several disadvantages: • an enormous amount of information has to be written onto storage media (primarily tapes), • a long time is needed to process the data.
Multidimensional histograms - nuclear spectra: • Advantages: • Data can be handled interactively. • Slices of lower dimensionality can be created easily. • Disadvantages: • Multidimensional amplitude analysis must be carried out. • Storage requirements for multidimensional hypercubes are enormous, • e.g. a 3-D γ-γ-γ coincidence nuclear spectrum with a resolution of 14 bits (16 384 channels) per axis and 2 Bytes per channel requires 8 TB of memory. • Data often need to be stored in RAM for interactive handling. • The multidimensional nuclear spectra therefore have to be compressed to the size of the available memory.
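The 8 TB figure follows directly from the channel count. A minimal sketch of the arithmetic (the function name `hypercube_bytes` is illustrative and not part of the original talk):

```python
def hypercube_bytes(bits_per_axis, dims, bytes_per_channel):
    """Size in bytes of a full (uncompressed) multidimensional histogram."""
    channels_per_axis = 2 ** bits_per_axis
    return channels_per_axis ** dims * bytes_per_channel

# 3-fold spectrum, 14-bit axes (16 384 channels), 2 Bytes per channel.
size = hypercube_bytes(14, 3, 2)
print(size, "bytes =", size / 2 ** 40, "TiB")
```

With 16 384 = 2^14 channels per axis, the cube holds 2^42 channels, i.e. 2^43 bytes at 2 Bytes per channel.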
Suitable data compression techniques must satisfy these requirements: • Less storage space after compression of the multidimensional nuclear spectra • Preservation of as much information as possible - minimum data distortion • Fast enough to be suitable for on-line compression during the experiment • Constraints: • The size of the original multidimensional spectrum goes beyond the capacity of available memory. • Data from nuclear experiments arrive as a train of events, so they need to be analyzed and compressed separately, event by event. • Thus, the multidimensional amplitude analysis must be performed together with compression, event by event in on-line acquisition mode.
Suitable methods widely used: • Binning - neighboring channels are summed together - loss of information. • Employing natural properties of data - e.g. symmetry removal from the multidimensional γ-ray spectra from Gammasphere - no loss of information. • Use of fast orthogonal transformation algorithms. • Storing the descriptors of events with counts of occurrences.
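Binning can be applied event by event. A minimal sketch, assuming the binning factor is a power of two, so that summing 2^k neighboring channels is the same as dropping the k lowest bits of each coordinate (the function and parameter names are illustrative):

```python
def bin_event(event, k):
    """Map each coordinate to its bin when 2**k neighboring channels are summed."""
    return tuple(e >> k for e in event)

# Hypothetical 3-fold event on 14-bit axes, binned by a factor of 8 (k = 3).
binned = bin_event((100, 4097, 16383), 3)
print(binned)
```

The information lost is exactly the k discarded bits per axis: all events that share a bin become indistinguishable.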
Symmetry removal: • For instance, in multidimensional γ-ray spectra from Gammasphere one can utilize the symmetry of the data. It holds: • for 2-dimensional spectra: E(1, 2) = E(2, 1) • for 3-dimensional spectra: E(1, 2, 3) = E(1, 3, 2) = E(2, 1, 3) = E(2, 3, 1) = E(3, 1, 2) = E(3, 2, 1)
Principle of storage of 2-dimensional symmetrical data: • The size of the reduced space can be expressed simply as $R(R+1)/2$. • Figure: two-dimensional symmetrical spectrum with resolution R = 4.
By composing triangles of the sizes R, R-1, ..., 2, 1 we get the geometrical shape called a tetrahedron. • An example of storing 3-dimensional symmetrical data in the form of a tetrahedron. • The size (volume) of the reduced space of the tetrahedron is $R(R+1)(R+2)/6$.
In the case of 4-dimensional data, by composing tetrahedrons we obtain a hyperhedron of the 4th order (illustrated for R = 4). The volume of the hyperhedron of the 4th order can be expressed as $R(R+1)(R+2)(R+3)/24$.
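The reduced sizes for 2, 3 and 4 dimensions are all instances of one binomial coefficient: the number of non-decreasing D-tuples over R channels. The sketch below computes that size and one possible linear address for a symmetric event. The ranking formula is the standard combinatorial number system, used here as an assumption; the talk does not state which addressing the authors used.

```python
from math import comb
from itertools import product

def reduced_size(R, D):
    """Number of channels with e1 <= e2 <= ... <= eD, each in 0..R-1."""
    return comb(R + D - 1, D)

def simplex_index(coords):
    """Linear address of an event inside the triangle/tetrahedron/hyperhedron.

    The coordinates are sorted first, so all permutations of the same
    energies share one address (this is the symmetry removal).
    """
    s = sorted(coords)
    return sum(comb(c + k, k + 1) for k, c in enumerate(s))

R = 4
print([reduced_size(R, D) for D in (2, 3, 4)])
# Every cell of the full 4 x 4 x 4 cube maps into the reduced space.
addresses = {simplex_index(t) for t in product(range(R), repeat=3)}
print(len(addresses), reduced_size(R, 3))
```

For large R the saving approaches D! (a factor of about 6 for three-fold and about 24 for four-fold coincidences), with no loss of information.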
The achievable compression ratios and storage requirements for typical spectra (14 bit ADCs and 2 Bytes per channel): • Radware package - the author combines symmetry removal with binning. Three-fold coincidences are stored in the form of cubes of size 8 x 8 x 8. Inside each cube the data are binned so that they span the entire resolution of 8192 channels in each dimension.
Compression methods using orthogonal transformations: • The multidimensional array (hypercube) is transformed into a new data array in the transform domain, where the maximum amount of information is concentrated into a smaller number of elements. • The basic premise is that the transformation of a signal has an energy distribution more amenable to retaining the shape of the data than the spatial domain representation. • Because of the inherent element-to-element correlation, the energy of the signal in the transform domain tends to be clustered in a relatively small number of transform coefficients.
The advantages of using fast orthogonal transforms: • Fast algorithms exist, which allows on-line implementation. • Linearity of the transforms. The signal being compressed need not be stored statically in memory. Each event can be transformed separately as it arrives, and the predetermined transform coefficients are summed (analysis with on-line compression).
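Linearity is what makes event-by-event compression possible: the transform of a histogram equals the sum of the transforms of its individual events. A minimal one-dimensional sketch using the plain (non-adaptive) Walsh-Hadamard transform; the event list and the set of kept coefficients are made-up illustration values:

```python
def fwht(vec):
    """Unnormalised fast Walsh-Hadamard transform; len(vec) must be a power of 2."""
    a = list(vec)
    h = 1
    while h < len(a):
        for start in range(0, len(a), 2 * h):
            for i in range(start, start + h):
                a[i], a[i + h] = a[i] + a[i + h], a[i] - a[i + h]
        h *= 2
    return a

N = 8
events = [3, 3, 5, 1, 3, 5, 5, 7]   # hypothetical channel numbers
kept = [0, 1, 2, 3]                 # hypothetical predetermined coefficients

# (a) Off-line: build the whole histogram, then transform it.
hist = [0] * N
for e in events:
    hist[e] += 1
full = fwht(hist)

# (b) On-line: transform each event (a unit impulse) and sum only the
#     kept coefficients; the histogram itself is never stored.
acc = {k: 0 for k in kept}
for e in events:
    impulse = [0] * N
    impulse[e] = 1
    t = fwht(impulse)
    for k in kept:
        acc[k] += t[k]

print([full[k] for k in kept], [acc[k] for k in kept])
```

In a real implementation the transform of a unit impulse is just one column of the transform matrix, so only the kept entries of that column need to be evaluated per event.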
Fixed kernel orthogonal transforms usually employed in data compression: • Discrete Cosine, Walsh-Hadamard, Fourier, Hartley and other transforms. • Haar transform - the first and simplest scaling function of the mother wavelet suitable for generating an orthonormal wavelet basis. • The use of classical orthogonal transforms is very efficient provided that the form of compressed data resembles the form of the transform base functions. • The efficiency of the compression strongly depends on the nature of the experimental data. • Fourier transform and DCT are well suited to compress cosine and sine data shapes, whereas the Walsh-Hadamard transform is suitable to compress rectangular shapes in the input data.
The idea arose to modify the shape of the base functions of the orthogonal transform so that the maximum possible compression of the multidimensional spectra can be achieved: • We have proposed a fast orthogonal transform with a transform kernel adaptable to the reference vectors representing the processed data. • The structure of the signal flow graph is of the Cooley-Tukey type. • The principle of the method consists in direct modification of the multiplicative coefficients a, b, c, d of the signal flow graph in such a way that the base functions approximate the shape of the reference vector.
Let us illustrate the method for the case of size of the transform N = 4. • Signal flow graph of the fast adaptive orthogonal transform.
Basic element of the signal flow graph. • The coefficients a, b, c, d of the basic element of the signal flow graph are calculated from x0, x1, the values of the reference vector. • The basic element produces the values y0, y1 at its output.
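One way to picture such a basic element is as a 2 x 2 orthogonal matrix built from the reference pair (x0, x1). The coefficient formulas below are an assumption chosen so that the element is orthonormal and concentrates the reference pair into the single output y0; they are not necessarily the exact definitions used by the authors.

```python
import math

def butterfly_coefficients(x0, x1):
    """Assumed coefficients a, b, c, d derived from the reference pair."""
    norm = math.hypot(x0, x1)
    if norm == 0.0:
        return 1.0, 0.0, 0.0, 1.0      # nothing to adapt to: identity element
    a = x0 / norm
    b = x1 / norm
    c = x1 / norm
    d = -x0 / norm
    return a, b, c, d

def butterfly(a, b, c, d, u0, u1):
    """Apply the basic element to an input pair (u0, u1)."""
    return a * u0 + b * u1, c * u0 + d * u1

a, b, c, d = butterfly_coefficients(3.0, 4.0)
print(butterfly(a, b, c, d, 3.0, 4.0))   # the reference pair itself
print(butterfly(a, b, c, d, 1.0, 0.0))   # any other input
```

Under this assumption the rows (a, b) and (c, d) are orthonormal, so the element preserves signal energy, and data shaped like the reference vector end up in one output while the other stays close to zero.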
The transform coefficients are calculated in such a way that the reference vector at the input is transformed into a single point at the output. • We have proposed a fast algorithm of on-line multidimensional amplitude analysis with compression using the adaptive Walsh transform: • it removes the necessity to store the whole spectrum before compression, since compression is performed event by event, • it is optimized so that only a minimum number of operations is needed. • The above mentioned principle of adaptability can also be applied to other transform structures. • The compression is achieved by discarding pre-selected elements in the transformed multidimensional array. • Two basic methods for element selection: • zonal sampling • threshold sampling.
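The two selection methods differ in what decides which coefficients survive: position or magnitude. A minimal one-dimensional sketch with made-up coefficient values (function names are illustrative):

```python
def zonal_sampling(coeffs, zone):
    """Keep the coefficients at predetermined positions, whatever their values."""
    return {i: coeffs[i] for i in zone}

def threshold_sampling(coeffs, threshold):
    """Keep the coefficients whose magnitude reaches the threshold."""
    return {i: c for i, c in enumerate(coeffs) if abs(c) >= threshold}

coeffs = [8.0, -1.0, 0.5, 3.0, 0.0, -4.0, 0.2, 0.1]   # hypothetical transform output
print(zonal_sampling(coeffs, range(4)))
print(threshold_sampling(coeffs, 3.0))
```

Zonal sampling fixes the kept positions in advance, which fits event-by-event accumulation; threshold sampling needs the coefficient values before it can decide and must also store the positions of the kept elements.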
Block data compression using orthogonal transforms with symmetry removal: • In the case of 3-dimensional space, the space is divided into cubes. Each cube of size S × S × S will be compressed to a cube of size C × C × C. • We assume: • The sizes of cubes are equal in all dimensions. • The number of cubes in each dimension and their sizes S, C are powers of 2.
The number of cubes in the tetrahedron is $N(N+1)(N+2)/6$ with $N = R/S$, where R is the number of channels (e.g. resolution of ADC) and S is the size of the cube before compression. • For each cube we have to define an adaptive transform and consequently we need to store its coefficients. • The number of transform coefficients for one dimension is
The elements are stored in the float format (4 Bytes). The transform coefficients must be stored for each dimension, thus to store 3-dimensional compressed data we need • Bytes of storage media. • Then in general for D-dimensional data the size of needed memory is • We have to adhere to the following rules: • the size of the cube of original data, S, should be as small as possible, • the size of the cube after compression, C, should be as large as possible (C ≤ S), i.e., we desire the smallest possible compression, • the data volume for the chosen combination C, S must fit the size of memory available.
We have compressed histograms for 3-, 4-, 5-fold γ-ray coincidences of the event data from Gammasphere. The following sizes of cubes were chosen for block transform compression of multidimensional γ-ray spectra of 16 384 channels per axis and 4 Bytes for each channel.
Examples achieved by employing compression on 3-fold γ-ray spectra with symmetry removal: • Slice from original data (thin line) and decompressed slice (thick line) from data compressed by employing binning operation (Radware)
Slice from original data (thin line) and decompressed slice (thick line) from data compressed by employing adaptive Walsh transform.
Two-dimensional slice from data compressed by employing binning operation (Radware).
Two-dimensional decompressed slice from data compressed via adaptive Walsh transform.
Three-dimensional original spectrum (sizes of spheres are proportional to counts the channels contain).
Three-dimensional spectrum decompressed from data compressed via adaptive Walsh transform. Due to the smoothing effect of the adaptive transform some information is lost.
Similar experiments were done with 4-fold coincidence γ-ray spectra. • One-dimensional slice from original 4-dimensional spectrum (thin line) and the same slice decompressed from data compressed via adaptive Walsh transform (thick line). Due to enormous compression ratio the distortion of data in some regions is considerable. On the other hand in some regions the fidelity of the method is satisfactory.
Two-dimensional slice decompressed from 4-dimensional data compressed via adaptive Walsh transform
Compression of multidimensional γ-ray coincidence spectra using a list of descriptors. • The input data describing an external event can be expressed using a descriptor. Each descriptor fully describes the event. • This method is based on maintaining the list of descriptors. • The number of different descriptors that actually occurred during an experiment is much smaller than the number of all possible descriptors. • So, the multidimensional space has empty regions. • Conventional analyzer - The descriptor defines the location in memory at which the count (number of occurrences of the descriptor) is stored. The range of descriptors is defined by the size of the memory.
An alternative technique - Store only those descriptors that actually occurred in the experiment. • Since the correspondence between the location and the descriptor is lost, it is necessary to store the descriptor as well as the associated counts. • When a new event comes, it must be sorted into its channel in a list by using its descriptor: • The problem is to devise a procedure for assigning the descriptor location number so that the time needed to store or read out a descriptor is minimized. • There exist several retrieval algorithms:
Sequential method: An obvious routine for searching the list in memory is to compare the descriptor of a new event with the descriptor in each location, starting at the first one. When a match is found, the associated count is increased by one. Such an algorithm is time consuming and is not acceptable for on-line applications. • Sequential retrieval of events
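The sequential method can be sketched in a few lines. Appending an unseen descriptor at the end of the list is an assumption here; the slide only describes the matching step. The function name is illustrative.

```python
def sequential_store(descriptors, counts, descriptor):
    """Scan the list from the first location; return the location used."""
    for i, stored in enumerate(descriptors):
        if stored == descriptor:
            counts[i] += 1               # match found: increase its count
            return i
    descriptors.append(descriptor)       # assumed: new descriptor goes to the end
    counts.append(1)
    return len(descriptors) - 1

descriptors, counts = [], []
for event in [(1, 2), (3, 4), (1, 2)]:   # hypothetical 2-fold events
    sequential_store(descriptors, counts, event)
print(descriptors, counts)
```

Each event costs a number of comparisons proportional to the length of the list, which is why the method does not scale to on-line rates.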
Tree method: A considerable reduction of access time can be achieved by using a tree search algorithm. The descriptor of a new event is compared repeatedly with descriptors arranged in a tree. The main disadvantage of this technique is its complexity and amount of redundant information given by address pointers. • Tree search algorithm of event retrieval.
Partitioning and indexing method - a combination of the two previous methods, implemented e.g. in the database Blue for high-fold γ-ray coincidence data*. • The hypercube is partitioned into high and low density regions. Each node of the tree represents a subvolume of the n-dimensional hypercube. The left and right child nodes represent the bisected volume of the parent. Associated with each leaf node is a sublist of descriptors falling into the appropriate geometric volume. They are arranged according to the sequential retrieval algorithm. • [*] Cromaz M. et al.: Blue: a database for high-fold γ-ray coincidence data, NIM A 462 (2001) 519.
Pseudo-random transformation of addresses of locations of descriptors. Requirements: • Uniform (or quasi-uniform) distribution of descriptors over memory addresses for any shape of multidimensional spectra. • Clusters of descriptors in physical field, hypercube, must be spread over the whole range of possible addresses and adjacent descriptors must go to addresses far away from each other. • Transformation must be fast, so that it can be applied on-line for high-rate experiments. • Unlimited number of methods of generation of pseudorandom numbers: • residues of modulo operations, Hamming’s code technique, transformation through the division of polynomials, etc.
One of the methods that satisfies the above stated criteria and gives a pseudorandom distribution is based on assigning an inverse number (in the sense of modulo arithmetic) to each address in the original space: $a \cdot a^{-1} \equiv 1 \pmod{M}$, where M is prime. • This operation can be carried out through the use of a look-up table of pre-computed inverse numbers. • Through the transformation each descriptor uniquely derives its storage address. • It is possible that several descriptors are transformed to the same address. To overcome this serious limitation, the transformation is used only to generate the address at which to start searching in the bucket of descriptors.
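A look-up table of modulo inverse numbers is cheap to pre-compute. The sketch below uses Fermat's little theorem (for prime M, the inverse of a is a^(M-2) mod M) and takes M = 601, the value quoted in this talk for the 3-fold example; the variable names are illustrative.

```python
M = 601   # prime module quoted in the talk for the 3-fold example

# inverse[a] * a == 1 (mod M) for every a in 1..M-1; index 0 is unused.
inverse = [0] * M
for a in range(1, M):
    inverse[a] = pow(a, M - 2, M)

# Distances between the inverses of adjacent numbers show the scattering.
distances = [abs(inverse[a + 1] - inverse[a]) for a in range(1, M - 1)]
print(inverse[1:6], distances[:5])
```

Because M is prime, the table is a permutation of 1..M-1: every address has exactly one inverse, and neighboring addresses are typically sent far apart.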
A list of d successive locations, where d is the depth of searching, is checked: • If the descriptor in a location coincides with the read-out descriptor, the count in this location is incremented. • If no descriptor coincides with the read-out descriptor and there is an empty location within the search depth, the descriptor is written to this location and its count is set to 1. • If there is no empty location within the search depth and no descriptor coincides with the read-out descriptor, additional processing is done. • During the experiment, the events with higher counts (statistics) tend to occur earlier, and therefore there is a higher probability that free positions will be occupied by statistically relevant events. • One can utilize additional information to store the events with the highest weights, i.e., the highest probability of occurrence.
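The three cases above can be sketched as a bounded search that begins at the transformed address. The table layout (a list of `[descriptor, count]` slots, `None` for empty) and the wrap-around at the end of the table are assumptions made for illustration; how the start address is derived is left to the caller.

```python
def store_event(table, start, descriptor, depth):
    """Check `depth` successive locations beginning at `start`.

    Returns "incremented", "inserted", or "full" (no match and no empty
    location within the search depth; additional processing is needed).
    """
    n = len(table)
    first_empty = None
    for step in range(depth):
        pos = (start + step) % n         # assumed: wrap around the table end
        slot = table[pos]
        if slot is None:
            if first_empty is None:
                first_empty = pos        # remember it, but keep looking for a match
        elif slot[0] == descriptor:
            slot[1] += 1
            return "incremented"
    if first_empty is not None:
        table[first_empty] = [descriptor, 1]
        return "inserted"
    return "full"

table = [None] * 10
print(store_event(table, 8, (1, 2, 3), depth=3))
print(store_event(table, 8, (1, 2, 3), depth=3))
print(store_event(table, 8, (4, 5, 6), depth=3))
```

The search depth d bounds the work per event, which keeps the method usable on-line; the price is the "full" case, where an event may have to be dropped or has to displace another one.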
Provided that all locations within the depth d are occupied and the descriptor did not occur in this region, we scan the region once more and find the event with the smallest probability of occurrence, pj. • Then we compare the probability of the processed event pk with pj. If pk > pj we replace the descriptor in position j with the descriptor of the processed event and set the count of the event to 1. Otherwise, the processed event is ignored. • How to determine the probabilities of occurrence of events? • Several approaches are possible in practice. • One of them is to utilize marginal (projection) spectra for each dimension. Then for an n-dimensional event with the event values (e1, e2, …, en)
this probability can be defined in terms of si, the marginal spectrum for dimension i. • However, many other definitions and approaches are possible. • Example of storing 3-fold coincidence γ-ray spectra: • The descriptor of each event contains the addresses x, y, z and the counts (short integers), i.e., each event takes 8 Bytes. • We again utilize the symmetry of the multidimensional γ-ray spectra. The chosen prime module then has to satisfy the condition
For the 384 MB memory we have chosen the prime module M = 601. Assignment between numbers from ring <1,600> and their modulo inverse numbers.
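The replacement rule for a fully occupied search region, described earlier, can be sketched as follows. The probability used here, a product of normalised marginal counts, is only one plausible choice and is an assumption; the exact definition is not reproduced in this transcript. The table layout (`[descriptor, count]` slots) and all names are illustrative.

```python
def event_probability(descriptor, marginals):
    """Assumed weight: product over dimensions of normalised marginal counts."""
    p = 1.0
    for axis, channel in enumerate(descriptor):
        spectrum = marginals[axis]
        p *= spectrum[channel] / sum(spectrum)
    return p

def replace_weakest(table, start, depth, descriptor, marginals):
    """Region is full and holds no match: keep whichever event is more probable."""
    n = len(table)
    region = [(start + k) % n for k in range(depth)]
    j = min(region, key=lambda pos: event_probability(table[pos][0], marginals))
    p_j = event_probability(table[j][0], marginals)
    p_k = event_probability(descriptor, marginals)
    if p_k > p_j:
        table[j] = [descriptor, 1]       # replace and restart the count at 1
        return True
    return False                         # the processed event is ignored

# Hypothetical 2-fold example with 3 channels per axis.
marginals = [[1, 10, 100], [1, 10, 100]]
table = [[(0, 0), 5], [(1, 1), 2], [(2, 2), 7]]
print(replace_weakest(table, 0, 3, (1, 2), marginals), table)
```

Note that the counts already accumulated for the displaced descriptor are lost, so the rule trades a small error in rare channels for keeping the statistically relevant ones.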
Spectrum of distances between two adjacent modulo inverse numbers. • One can observe great scattering in these distances. This allows quasi-uniform distribution of descriptors in the transformed area.
We utilize the property of symmetry in γ-ray coincidence spectra. • The algorithm for calculating the address of an event in the transformed area: • arrange the coordinates so that x ≤ y ≤ z • calculate the intermediate values for the three coordinates (two successive steps) • calculate the address in the transformed area • This defines the starting position of the search for a given descriptor.
The whole linear array of descriptors (36 361 808 items) has been mapped to a 16 384-channel spectrum. One can observe a quasi-constant distribution, which indicates a quasi-uniform distribution of descriptors over all memory addresses in the transform domain. • Distribution of descriptor counts in the transformed domain.
Prime module M, memory requirements and achieved compression ratio for 3-, 4- and 5-fold γ-ray spectra (16 384 channels in each dimension): • The searching depth for all cases is 1000 events.
Three-fold coincidence spectra. • High counts region of a 1-dimensional slice from original data (thick line) and the corresponding region from compressed data (thin line).
Low counts region of a slice from original data (thick line) and the corresponding region from compressed data (thin line).
Influence of searching depth on quality of decompressed spectra • Increasing the length of searching in the buffer of compressed events improves the preservation of the peak heights. • In all spectra we subtracted background.