Dr. John R. Jensen Department of Geography University of South Carolina Columbia, SC 29208

Image Quality Assessment and Statistical Evaluation
Jensen, 2004

Image Quality Assessment and Statistical Evaluation
• Many remote sensing datasets contain high-quality, accurate data. Unfortunately, error (or noise) is sometimes introduced into the remote sensor data by:
  • the environment (e.g., atmospheric scattering),
  • random or systematic malfunction of the remote sensing system (e.g., an uncalibrated detector creates striping), or
  • improper airborne or ground processing of the remote sensor data prior to actual data analysis (e.g., inaccurate analog-to-digital conversion).

Image Quality Assessment and Statistical Evaluation
• Therefore, the person responsible for analyzing the digital remote sensor data should first assess its quality and statistical characteristics. This is normally accomplished by:
  • looking at the frequency of occurrence of individual brightness values in the image displayed in a histogram,
  • viewing on a computer monitor individual pixel brightness values at specific locations or within a geographic area,
  • computing univariate descriptive statistics to determine if there are unusual anomalies in the image data, and
  • computing multivariate statistics to determine the amount of between-band correlation (e.g., to identify redundancy).

Image Processing Mathematical Notation
• The following notation is used to describe the mathematical operations applied to the digital remote sensor data:
  • $i$ = a row (or line) in the imagery
  • $j$ = a column (or sample) in the imagery
  • $k$ = a band of imagery
  • $l$ = another band of imagery
  • $n$ = total number of picture elements (pixels) in an array
  • $BV_{ijk}$ = brightness value in row $i$, column $j$, of band $k$
  • $BV_{ik}$ = $i$th brightness value in band $k$

Image Processing Mathematical Notation
  • $BV_{il}$ = $i$th brightness value in band $l$
  • $\min_k$ = minimum value of band $k$
  • $\max_k$ = maximum value of band $k$
  • $\text{range}_k$ = range of actual brightness values in band $k$
  • $\text{quant}_k$ = quantization level of band $k$ (e.g., $2^8$ = 0 to 255; $2^{12}$ = 0 to 4095)
  • $\mu_k$ = mean of band $k$
  • $\text{var}_k$ = variance of band $k$
  • $s_k$ = standard deviation of band $k$

Image Processing Mathematical Notation
  • $\text{skewness}_k$ = skewness of a band $k$ distribution
  • $\text{kurtosis}_k$ = kurtosis of a band $k$ distribution
  • $\text{cov}_{kl}$ = covariance between pixel values in two bands, $k$ and $l$
  • $r_{kl}$ = correlation between pixel values in two bands, $k$ and $l$
  • $X_c$ = measurement vector for class $c$ composed of brightness values ($BV_{ijk}$) from row $i$, column $j$, and band $k$

Image Processing Mathematical Notation
  • $M_c$ = mean vector for class $c$
  • $M_d$ = mean vector for class $d$
  • $\mu_{ck}$ = mean value of the data in class $c$, band $k$
  • $s_{ck}$ = standard deviation of the data in class $c$, band $k$
  • $v_{ckl}$ = covariance matrix of class $c$ for bands $k$ through $l$; shown as $V_c$
  • $v_{dkl}$ = covariance matrix of class $d$ for bands $k$ through $l$; shown as $V_d$

Remote Sensing Sampling Theory
A population is an infinite or finite set of elements. An infinite population could be all possible images that might be acquired of the Earth in 2004. All Landsat 7 ETM+ images of Charleston, S.C., acquired in 2004 constitute a finite population.
A sample is a subset of the elements taken from a population and used to make inferences about certain characteristics of the population. For example, we might decide to analyze a June 1, 2004, Landsat image of Charleston. If observations with certain characteristics are systematically excluded from the sample, either deliberately or inadvertently (such as by selecting only images obtained in the spring of the year), it is a biased sample.
Sampling error is the difference between the true value of a population characteristic and the value of that characteristic inferred from a sample.

Remote Sensing Sampling Theory
• Large samples drawn randomly from natural populations usually produce a symmetrical frequency distribution. Most values are clustered around some central value, and the frequency of occurrence declines away from this central point. A graph of the distribution appears bell shaped and is called a normal distribution.
• Many statistical tests used in the analysis of remotely sensed data assume that the brightness values recorded in a scene are normally distributed. Unfortunately, remotely sensed data may not be normally distributed, and the analyst must be careful to identify such conditions. In such instances, nonparametric statistical theory may be preferred.

Common Symmetric and Skewed Distributions in Remotely Sensed Data Jensen, 2004

Remote Sensing Sampling Theory
• The histogram is a useful graphic representation of the information content of a remotely sensed image.
• It is instructive to review how a histogram is constructed for a single band of imagery, $k$, composed of $i$ rows and $j$ columns with a brightness value $BV_{ijk}$ at each pixel location.
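The construction of a single-band histogram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 3 x 4 image `band_k` and the function name `histogram` are hypothetical and do not come from the lecture, and 8-bit quantization is assumed.

```python
from collections import Counter

# Hypothetical 3 x 4 single-band image (illustrative values, not from the source).
band_k = [
    [10, 12, 12, 15],
    [10, 12, 15, 15],
    [12, 12, 15, 20],
]

def histogram(band, quant_levels=256):
    """Count how often each brightness value 0 .. quant_levels - 1 occurs."""
    counts = Counter(bv for row in band for bv in row)
    return [counts.get(bv, 0) for bv in range(quant_levels)]

hist = histogram(band_k)

# Print a small text histogram, skipping brightness values that never occur.
for bv, freq in enumerate(hist):
    if freq:
        print(f"BV {bv:3d}: {'*' * freq} ({freq})")
```

Every pixel is visited once, so the frequencies sum to the number of pixels in the band.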

Histogram of A Single Band of Landsat Thematic Mapper Data of Charleston, SC Jensen, 2004

Histogram of Thermal Infrared Imagery of a Thermal Plume in the Savannah River Jensen, 2004

Remote Sensing Metadata
Metadata is “data or information about data”. Most quality digital image processing systems read, collect, and store metadata about a particular image or sub-image. It is important that the image analyst have access to this metadata. In the most fundamental instance, metadata might include:
  • the file name,
  • date of last modification,
  • level of quantization (e.g., 8-bit),
  • number of rows and columns,
  • number of bands,
  • univariate statistics (minimum, maximum, mean, median, mode, standard deviation),
  • perhaps some multivariate statistics,
  • geo-referencing performed (if any), and
  • pixel size.

Viewing Individual Pixels
• Viewing individual pixel brightness values in a remotely sensed image is one of the most useful methods for assessing the quality and information content of the data. Virtually all digital image processing systems allow the analyst to:
  • use a mouse-controlled cursor (cross-hair) to identify a geographic location in the image (at a particular row and column or geographic x,y coordinate) and display its brightness value in n bands, and
  • display the individual brightness values of an individual band in a matrix (raster) format.
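Outside an interactive image processing system, the same two views can be approximated in code. The sketch below assumes a hypothetical three-band, 2 x 2 image stored as `bands[k][i][j]`; the values and the helper names `pixel_values` and `raster_display` are illustrative only.

```python
# Hypothetical 3-band image stored as bands[k][i][j] (band, row, column).
bands = [
    [[51, 53], [60, 58]],   # band 1
    [[17, 19], [25, 22]],   # band 2
    [[14, 15], [30, 27]],   # band 3
]

def pixel_values(bands, i, j):
    """Return the brightness value at row i, column j in every band."""
    return [band[i][j] for band in bands]

def raster_display(band):
    """Format one band as a matrix of right-aligned brightness values."""
    return "\n".join(" ".join(f"{bv:4d}" for bv in row) for row in band)

print(pixel_values(bands, 1, 0))   # cursor-style query at row 1, column 0
print(raster_display(bands[0]))    # raster-style listing of the first band
```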

Cursor and Raster Display of Brightness Values Jensen, 2004

Individual Pixel Display of Brightness Values Jensen, 2004

Raster Display of Brightness Values Jensen, 2004

Two- and Three-Dimensional Evaluation of Pixel Brightness Values within a Geographic Area Jensen, 2004

Univariate Descriptive Image Statistics
Measures of Central Tendency in Remote Sensor Data
• The mode is the value that occurs most frequently in a distribution and is usually the highest point on the curve (histogram). It is common, however, to encounter more than one mode in a remote sensing dataset. The histograms of the Landsat TM image of Charleston, SC and the predawn thermal infrared image of the Savannah River have multiple modes. They are nonsymmetrical (skewed) distributions.
• The median is the value midway in the frequency distribution. One-half of the area below the distribution curve is to the right of the median, and one-half is to the left.

Common Symmetric and Skewed Distributions in Remotely Sensed Data Jensen, 2004

Univariate Descriptive Image Statistics
The mean is the arithmetic average and is defined as the sum of all brightness value observations divided by the number of observations. It is the most commonly used measure of central tendency. The mean ($\mu_k$) of a single band of imagery composed of $n$ brightness values ($BV_{ik}$) is computed using the formula:
$$\mu_k = \frac{\sum_{i=1}^{n} BV_{ik}}{n}$$
The sample mean, $\mu_k$, is an unbiased estimate of the population mean. For symmetrical distributions, the sample mean tends to be closer to the population mean than any other unbiased estimate (such as the median or mode). Unfortunately, the sample mean is a poor measure of central tendency when the set of observations is skewed or contains an outlier.
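The sensitivity of the mean to an outlier is easy to see numerically. The following sketch uses a hypothetical list of eight brightness values (not the lecture's sample dataset) in which the last value mimics a data spike; it relies only on Python's standard `statistics` module.

```python
import statistics

# Hypothetical brightness values; the final value mimics a data spike.
bv = [20, 21, 21, 22, 23, 23, 24, 255]

mean_k = sum(bv) / len(bv)           # arithmetic average of all observations
median_k = statistics.median(bv)     # midpoint of the ordered values
modes_k = statistics.multimode(bv)   # every value tied for most frequent

print("mean  :", mean_k)
print("median:", median_k)
print("modes :", modes_k)
```

Here the single spike pulls the mean to 51.125, well above every other observation, while the median (22.5) and the two modes (21 and 23) stay within the main cluster of values.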

Hypothetical Dataset of Brightness Values Jensen, 2004

Univariate Statistics for the Hypothetical Sample Dataset Jensen, 2004

Remote Sensing Univariate Statistics - Variance
Measures of Dispersion
Measures of the dispersion about the mean of a distribution provide valuable information about the image. For example, the range of a band of imagery ($\text{range}_k$) is computed as the difference between the maximum ($\max_k$) and minimum ($\min_k$) values; that is,
$$\text{range}_k = \max_k - \min_k$$
Unfortunately, when the minimum or maximum values are extreme or unusual observations (i.e., possibly data blunders), the range could be a misleading measure of dispersion. Such extreme values are not uncommon because the remote sensor data are often collected by detector systems with delicate electronics that can experience spikes in voltage and other unfortunate malfunctions. When unusual values are not encountered, the range is a very important statistic often used in image enhancement functions such as min–max contrast stretching.
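A short sketch shows how one spike distorts the range and any stretch built on it. The linear min–max stretch below is an assumption (a common textbook form that maps the band minimum to 0 and the maximum to 255); the lecture names the technique but does not give its formula here. The sample values and function names are hypothetical.

```python
def band_range(bv):
    """range_k = max_k - min_k."""
    return max(bv) - min(bv)

def min_max_stretch(bv, out_max=255):
    """Linearly rescale values so min maps to 0 and max maps to out_max.

    Assumed form of a min-max contrast stretch, not taken from the source.
    """
    lo, hi = min(bv), max(bv)
    if hi == lo:                      # flat band: nothing to stretch
        return [0 for _ in bv]
    return [round((v - lo) / (hi - lo) * out_max) for v in bv]

clean = [40, 45, 50, 55, 60]          # hypothetical well-behaved values
spiked = clean + [255]                # same values plus one voltage spike

print(band_range(clean), band_range(spiked))
print(min_max_stretch(clean))
print(min_max_stretch(spiked))
```

With the spike present, the range grows from 20 to 215, and the stretch compresses the five legitimate values into the darkest tenth of the output scale.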

Remote Sensing Univariate Statistics - Variance
Measures of Dispersion
The variance of a sample is the average squared deviation of all possible observations from the sample mean. The variance of a band of imagery, $\text{var}_k$, is computed using the equation:
$$\text{var}_k = \frac{\sum_{i=1}^{n} (BV_{ik} - \mu_k)^2}{n}$$
The numerator of the expression is the corrected sum of squares (SS). If the sample mean ($\mu_k$) were actually the population mean, this would be an accurate measurement of the variance.

Remote Sensing Univariate Statistics
Unfortunately, there is some underestimation because the sample mean was calculated in a manner that minimized the squared deviations about it. Therefore, the denominator of the variance equation is reduced to $n - 1$, producing a larger, unbiased estimate of the sample variance:
$$\text{var}_k = \frac{\sum_{i=1}^{n} (BV_{ik} - \mu_k)^2}{n - 1} = \frac{SS}{n - 1}$$

Univariate Statistics for the Hypothetical Example Dataset Jensen, 2004

Remote Sensing Univariate Statistics
The standard deviation is the positive square root of the variance. The standard deviation of the pixel brightness values in a band of imagery, $s_k$, is computed as
$$s_k = \sqrt{\text{var}_k}$$
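The variance and standard deviation definitions can be checked with a minimal sketch. The five brightness values are hypothetical, and the comparison against Python's `statistics.pvariance`, `statistics.variance`, and `statistics.stdev` is included only to show which denominator each library call uses.

```python
import math
import statistics

bv = [16, 18, 20, 22, 24]                  # hypothetical brightness values
n = len(bv)
mean_k = sum(bv) / n

ss = sum((v - mean_k) ** 2 for v in bv)    # corrected sum of squares (SS)
var_n = ss / n                             # denominator n (underestimates)
var_k = ss / (n - 1)                       # denominator n - 1 (unbiased)
s_k = math.sqrt(var_k)                     # standard deviation

print("SS =", ss)
print("SS / n       =", var_n, "| statistics.pvariance:", statistics.pvariance(bv))
print("SS / (n - 1) =", var_k, "| statistics.variance :", statistics.variance(bv))
print("s_k          =", s_k, "| statistics.stdev    :", statistics.stdev(bv))
```

For these values SS is 40, so dividing by n gives 8 and dividing by n - 1 gives the larger estimate of 10.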


Univariate Statistics for the Hypothetical Example Dataset Jensen, 2004

Measures of Distribution (Histogram) Asymmetry and Peak Sharpness
Skewness is a measure of the asymmetry of a histogram and is computed using the formula:
$$\text{skewness}_k = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{BV_{ik} - \mu_k}{s_k} \right)^3$$
A perfectly symmetric histogram has a skewness value of zero.

Measures of Distribution (Histogram) Asymmetry and Peak Sharpness
A histogram may be symmetric but have a peak that is very sharp or one that is subdued when compared with a perfectly normal distribution. A perfectly normal distribution (histogram) has zero kurtosis. The greater the positive kurtosis value, the sharper the peak in the distribution when compared with a normal histogram. Conversely, a negative kurtosis value suggests that the peak in the histogram is less sharp than that of a normal distribution. Kurtosis is computed using the formula:
$$\text{kurtosis}_k = \left[ \frac{1}{n} \sum_{i=1}^{n} \left( \frac{BV_{ik} - \mu_k}{s_k} \right)^4 \right] - 3$$
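A rough numerical illustration of skewness and kurtosis follows. It assumes the common moment-based definitions: standardized deviations raised to the third and fourth power, averaged over n, with 3 subtracted from the kurtosis so that a normal distribution scores zero. It also assumes that the standard deviation uses the n - 1 denominator. Other texts apply different small-sample corrections, so treat the exact values as indicative only. Both sample lists and the function name are hypothetical.

```python
import math

def skewness_kurtosis(bv):
    """Moment-based skewness and excess kurtosis (assumed definitions)."""
    n = len(bv)
    mean_k = sum(bv) / n
    s_k = math.sqrt(sum((v - mean_k) ** 2 for v in bv) / (n - 1))
    z = [(v - mean_k) / s_k for v in bv]        # standardized deviations
    skew = sum(t ** 3 for t in z) / n
    kurt = sum(t ** 4 for t in z) / n - 3       # zero for a normal distribution
    return skew, kurt

symmetric = [10, 20, 20, 30, 30, 30, 40, 40, 50]    # mirror image about 30
right_tail = [10, 10, 10, 11, 11, 12, 13, 20, 40]   # long tail of bright values

skew_sym, kurt_sym = skewness_kurtosis(symmetric)
skew_right, kurt_right = skewness_kurtosis(right_tail)

print("symmetric :", skew_sym, kurt_sym)
print("right tail:", skew_right, kurt_right)
```

The symmetric list yields a skewness of zero (up to floating-point rounding), while the list with a tail of bright values yields a positive skewness.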

Remote Sensing Multivariate Statistics
Remote sensing research is often concerned with the measurement of how much radiant flux is reflected or emitted from an object in more than one band (e.g., in red and near-infrared bands). It is useful to compute multivariate statistical measures such as covariance and correlation among the several bands to determine how the measurements covary. Later it will be shown that variance–covariance and correlation matrices are used in remote sensing principal components analysis (PCA), feature selection, classification, and accuracy assessment.

Remote Sensing Multivariate Statistics
The different remote-sensing-derived spectral measurements for each pixel often change together in some predictable fashion. If there is no relationship between the brightness value in one band and that of another for a given pixel, the values are mutually independent; that is, an increase or decrease in one band’s brightness value is not accompanied by a predictable change in another band’s brightness value. Because spectral measurements of individual pixels may not be independent, some measure of their mutual interaction is needed. This measure, called the covariance, is the joint variation of two variables about their common mean.

Remote Sensing Multivariate Statistics
To calculate covariance, we first compute the corrected sum of products (SP) defined by the equation:
$$SP_{kl} = \sum_{i=1}^{n} (BV_{ik} - \mu_k)(BV_{il} - \mu_l)$$

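Remote Sensing Multivariate Statistics
It is computationally more efficient to use the following formula to arrive at the same result:
$$SP_{kl} = \sum_{i=1}^{n} (BV_{ik} \times BV_{il}) - \frac{\left( \sum_{i=1}^{n} BV_{ik} \right) \left( \sum_{i=1}^{n} BV_{il} \right)}{n}$$
The first term, $\sum_{i=1}^{n} (BV_{ik} \times BV_{il})$, is called the uncorrected sum of products.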

Remote Sensing Multivariate Statistics
Just as simple variance was calculated by dividing the corrected sum of squares (SS) by ($n - 1$), covariance is calculated by dividing SP by ($n - 1$). Therefore, the covariance between brightness values in bands $k$ and $l$, $\text{cov}_{kl}$, is equal to:
$$\text{cov}_{kl} = \frac{SP_{kl}}{n - 1}$$
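The two routes to the corrected sum of products, and the covariance that follows from it, can be compared directly. The two five-pixel bands below are hypothetical values chosen so the arithmetic is easy to follow by hand; they are not the lecture's sample data.

```python
# Hypothetical brightness values for the same five pixels in two bands.
band_k = [10, 12, 14, 16, 18]
band_l = [20, 25, 27, 33, 35]
n = len(band_k)

mean_k = sum(band_k) / n
mean_l = sum(band_l) / n

# Definition: sum of products of deviations from each band's mean.
sp_definition = sum((a - mean_k) * (b - mean_l) for a, b in zip(band_k, band_l))

# Shortcut: uncorrected sum of products minus a correction term.
uncorrected = sum(a * b for a, b in zip(band_k, band_l))
sp_shortcut = uncorrected - sum(band_k) * sum(band_l) / n

cov_kl = sp_definition / (n - 1)

print(sp_definition, sp_shortcut, cov_kl)
```

Both forms give SP = 76 for these values (2036 - 1960 in the shortcut form), so the covariance is 76 / 4 = 19. The shortcut needs only running sums, which is why it suits a single pass over a large image.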

Format of a Variance-Covariance Matrix Jensen, 2004

Computation of Variance-Covariance Between Bands 1 and 2 of the Sample Data Jensen, 2004

Variance-Covariance Matrix of the Sample Data Jensen, 2004

Correlation between Multiple Bands of Remotely Sensed Data
To estimate the degree of interrelation between variables in a manner not influenced by measurement units, the correlation coefficient, $r$, is commonly used. The correlation between two bands of remotely sensed data, $r_{kl}$, is the ratio of their covariance ($\text{cov}_{kl}$) to the product of their standard deviations ($s_k s_l$); thus:
$$r_{kl} = \frac{\text{cov}_{kl}}{s_k s_l}$$

Correlation between Multiple Bands of Remotely Sensed Data
If we square the correlation coefficient ($r_{kl}$), we obtain the sample coefficient of determination ($r^2$), which expresses the proportion of the total variation in the values of “band l” that can be accounted for or explained by a linear relationship with the values of the random variable “band k.” Thus a correlation coefficient ($r_{kl}$) of 0.70 results in an $r^2$ value of 0.49, meaning that 49% of the total variation of the values of “band l” in the sample is accounted for by a linear relationship with values of “band k.”

Correlation Matrix of the Sample Data Jensen, 2004

Univariate and Multivariate Statistics of Landsat TM Data of Charleston, SC (Jensen, 2004)

Univariate Statistics
Band  Min  Max        Mean  Standard Deviation
1      51  242   65.163137           10.231356
2      17  115   25.797593            5.956048
3      14  131   23.958016            8.469890
4       5  105   26.550666           15.690054
5       0  193   32.014001           24.296417
6       0  128   15.103553           12.738188
7     102  124  110.734372            4.305065

Covariance Matrix
Band      Band 1      Band 2      Band 3      Band 4      Band 5      Band 6      Band 7
1     104.680654   58.797907   82.602381   69.603136  142.947000   94.488082   24.464596
2      58.797907   35.474507   48.644220   45.539546   90.661412   57.877406   14.812886
3      82.602381   48.644220   71.739034   76.954037  149.566052   91.234270   23.827418
4      69.603136   45.539546   76.954037  246.177785  342.523400  157.655947   46.815767
5     142.947000   90.661412  149.566052  342.523400  590.315858  294.019002   82.994241
6      94.488082   57.877406   91.234270  157.655947  294.019002  162.261439   44.674247
7      24.464596   14.812886   23.827418   46.815767   82.994241   44.674247   18.533586

Correlation Matrix
Band    Band 1    Band 2    Band 3    Band 4    Band 5    Band 6    Band 7
1     1.000000  0.964874  0.953195  0.433582  0.575042  0.724997  0.555425
2     0.964874  1.000000  0.964263  0.487311  0.626501  0.762857  0.577699
3     0.953195  0.964263  1.000000  0.579068  0.726797  0.845615  0.653461
4     0.433582  0.487311  0.579068  1.000000  0.898511  0.788821  0.693087
5     0.575042  0.626501  0.726797  0.898511  1.000000  0.950004  0.793462
6     0.724997  0.762857  0.845615  0.788821  0.950004  1.000000  0.814648
7     0.555425  0.577699  0.653461  0.693087  0.793462  0.814648  1.000000
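The Charleston statistics make it possible to check the link between covariance and correlation by hand. The sketch below copies the band 1 to band 3 portion of the covariance matrix from the table, takes each standard deviation as the square root of the corresponding diagonal variance, and divides each covariance by the product of the two standard deviations. The function name `correlation_matrix` is hypothetical.

```python
import math

# Variance-covariance values for bands 1-3, copied from the Charleston table.
cov = [
    [104.680654, 58.797907, 82.602381],
    [58.797907, 35.474507, 48.644220],
    [82.602381, 48.644220, 71.739034],
]

def correlation_matrix(cov):
    """r_kl = cov_kl / (s_k * s_l), with s_k = sqrt(cov_kk)."""
    size = len(cov)
    s = [math.sqrt(cov[k][k]) for k in range(size)]
    return [[cov[k][l] / (s[k] * s[l]) for l in range(size)] for k in range(size)]

r = correlation_matrix(cov)
for row in r:
    print("  ".join(f"{value:.6f}" for value in row))

# Coefficient of determination between bands 1 and 2.
r2_12 = r[0][1] ** 2
print(f"r^2 for bands 1 and 2: {r2_12:.2f}")
```

The computed coefficients agree with the tabulated correlation matrix to about six decimal places. Squaring the band 1 and band 2 coefficient gives roughly 0.93, so about 93% of the variation in one band is accounted for by a linear relationship with the other, which points to considerable redundancy between these two bands.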

Feature Space Plots
The univariate and multivariate statistics discussed provide accurate, fundamental information about the individual band statistics, including how the bands covary and correlate. Sometimes, however, it is useful to examine statistical relationships graphically. Individual bands of remotely sensed data are often referred to as features in the pattern recognition literature. To truly appreciate how two bands (features) in a remote sensing dataset covary and whether they are correlated, it is often useful to produce a two-band feature space plot.

Feature Space Plots
A two-dimensional feature space plot extracts the brightness value for every pixel in the scene in two bands and plots the frequency of occurrence in a 256 by 256 feature space (assuming 8-bit data, with brightness values from 0 to 255 on each axis). The greater the frequency of occurrence of unique pairs of values, the brighter the feature space pixel.
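The counting step behind a feature space plot can be sketched as a two-dimensional frequency grid. This is an illustration only: the six pixel pairs are hypothetical, the function name `feature_space` is made up for the example, and the choice to put one band on the row axis and the other on the column axis is an assumption about layout. Rendering the counts as display brightness is left out.

```python
def feature_space(band_k, band_l, quant_levels=256):
    """Count co-occurring (BV_k, BV_l) pairs in a quant_levels x quant_levels grid."""
    grid = [[0] * quant_levels for _ in range(quant_levels)]
    for bv_k, bv_l in zip(band_k, band_l):
        # Band l indexes the row (y axis); band k indexes the column (x axis).
        grid[bv_l][bv_k] += 1
    return grid

# Hypothetical brightness values for the same six pixels in two bands.
band_3 = [23, 23, 24, 30, 23, 24]
band_4 = [26, 26, 27, 90, 26, 27]

fs = feature_space(band_3, band_4)

# List the occupied cells; larger counts would be drawn brighter.
for bv_l, row in enumerate(fs):
    for bv_k, count in enumerate(row):
        if count:
            print(f"band 3 = {bv_k:3d}, band 4 = {bv_l:3d}: {count} pixel(s)")
```

Each pixel contributes exactly one count, so the grid total equals the number of pixels, and clusters of frequently occurring pairs show up as the brightest cells.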

Two-dimensional Feature Space Plot of Landsat Thematic Mapper Band 3 and 4 Data of Charleston, SC obtained on November 11, 1982 Jensen, 2004

Geostatistical Analysis of Remote Sensor Data
The Earth’s surface has distinct spatial properties. The brightness values in imagery constitute a record of these spatial properties. The spatial characteristics may take the form of texture or pattern. Image analysts often try to quantify the spatial texture or pattern. This requires looking at a pixel and its neighbors and trying to quantify the spatial autocorrelation relationships in the imagery. But how do we measure autocorrelation characteristics in images?

Geostatistical Analysis of Remote Sensor Data
A random variable distributed in space (e.g., spectral reflectance) is said to be regionalized. We can use geostatistical measures to extract the spatial properties of regionalized variables. Once quantified, the regionalized variable properties can be used in many remote sensing applications such as image classification and the allocation of spatially unbiased sampling sites during classification map accuracy assessment.
Another application of geostatistics is the prediction of values at unsampled locations. Geostatistical interpolation techniques could be used to evaluate the spatial relationships associated with the existing data to create a new, improved systematic grid of elevation values.
