
DATA TRANSFORMATION and NORMALIZATION


cuellara

Presentation Transcript


  1. DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

  2. DATA PRE-PROCESSING • TRANSFORMATION • NORMALIZATION • SCALING

  3. DATA TRANSFORMATION • The difference between raw fluorescence intensities is not a meaningful number • Data are transformed: • Ratio: allows immediate visualization of the fold change • Log

  4. Why Log 2? • Differences in expression intensity exist on a multiplicative scale; log transformation brings them onto an additive scale, where a linear model may apply. • Ex. 4-fold repression = 0.25 (log2 = -2) • Ex. 4-fold induction = 4 (log2 = 2) • Ex. 16-fold induction = 16 (log2 = 4) • Ex. 16-fold repression = 0.0625 (log2 = -4) • Evens out highly skewed distributions • Makes variation of intensities…independent of absolute magnitude
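The symmetry of the log2 scale can be checked directly on the four examples from this slide. This is a minimal sketch; the vector names are illustrative only.

```r
# Expression ratios (treatment / control) from the slide's examples
ratio <- c(repress4 = 0.25, induce4 = 4, induce16 = 16, repress16 = 0.0625)

# On the log2 scale, induction and repression of equal size are symmetric about 0
log_ratio <- log2(ratio)
print(log_ratio)
```

On the raw scale a 4-fold repression (0.25) and a 4-fold induction (4) sit at very different distances from 1, while on the log2 scale they sit at equal distances from 0.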

  5. Log Transformation: Makes the distribution less skewed

  6. Example 2

  7. Non-parametric Regression: the Loess Method • LOWESS (also written LOESS) is an acronym for LOcally WEighted Scatterplot Smoothing (Cleveland). • For i = 1 to n, the ith measurement $y_i$ of the response y and the corresponding measurement $x_i$ of the vector x of p predictors are related by • $y_i = g(x_i) + e_i$ • where g is the regression function and $e_i$ is a random error. • Idea: g(x) can be locally approximated by a parametric function. • The estimate is obtained by fitting a regression surface to the data points within a chosen neighborhood of the point x.

  8. LOESS contd… • In the LOESS (LOWESS) method, weighted least squares is used to fit linear or quadratic functions of the predictors at the centers of neighborhoods. • The radius of each neighborhood is chosen so that the neighborhood contains a specified percentage of the data points. The fraction of the data, called the smoothing parameter, in each local neighborhood controls the smoothness of the estimated surface. • Data points in a given local neighborhood are weighted by a smooth decreasing function of their distance from the center of the neighborhood.

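  9. Distance metrics used • Finding the distance between the ith and hth points with 2 predictors: • Distance between $(X_{i1}, X_{i2})$ and $(X_{h1}, X_{h2})$: $d_{ih} = \sqrt{(X_{i1} - X_{h1})^2 + (X_{i2} - X_{h2})^2}$ • Generally Euclidean distance is used, • and weights are defined by a tri-cube function: $w_i = [1 - (d_i/d_q)^3]^3$ if $d_i < d_q$ and $w_i = 0$ otherwise, where $d_i$ is the distance of point i from the center of the neighborhood and $d_q$ is the distance to the farthest point in the neighborhood • The choice of q (the proportion of points in each neighborhood) is between 0 and 1, often between 0.4 and 0.6. • Large q: smoother, but maybe too smooth • Small q: too rough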
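The neighborhood weights can be sketched in a few lines of R. The helper `tricube_weights` and the simulated predictors are hypothetical, and the sketch assumes Euclidean distance with $d_q$ taken as the distance to the ceiling(q·n)-th nearest point. R's own `loess()` handles scaling and ties in its own way, so this is an illustration of the idea rather than a reimplementation.

```r
# Tricube weights for a local fit centred at x0 (hypothetical helper).
# X: n x p matrix of predictors, x0: length-p point,
# q: fraction of the data placed in the neighbourhood
tricube_weights <- function(X, x0, q = 0.5) {
  X <- as.matrix(X)
  # Euclidean distance from every row of X to x0
  d <- sqrt(rowSums(sweep(X, 2, x0)^2))
  # d_q: distance to the ceiling(q * n)-th nearest point
  k <- ceiling(q * nrow(X))
  dq <- sort(d)[k]
  # Smoothly decreasing weight inside the neighbourhood, zero outside
  ifelse(d < dq, (1 - (d / dq)^3)^3, 0)
}

set.seed(1)
X <- cbind(x1 = runif(20), x2 = runif(20))   # simulated predictors
w <- tricube_weights(X, x0 = c(0.5, 0.5), q = 0.5)
round(w, 3)
```

Points near the center get weights close to 1, and the point that defines the edge of the neighborhood gets weight 0, so a larger q spreads non-zero weight over more points and gives a smoother fit.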

  10. Comments on LOESS • Fitting is done at each point at which the regression surface is to be estimated • A faster computational procedure is to perform the local fitting at a selected sample of points and then blend the local polynomials to obtain a regression surface • The LOESS procedure can be used for statistical inference provided the errors are i.i.d. normal random variables with mean 0 • With iterative reweighting, LOESS can also provide statistical inference when the error distribution is symmetric but not necessarily normal • With iterative reweighting, the LOESS procedure can also be used for robust fitting in the presence of outliers in the data
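The effect of iterative reweighting can be seen by comparing the two `family` settings of R's `loess()`. This is a sketch on simulated data (the sine curve, the noise level, and the two planted outliers are all assumptions made for illustration): `family = "gaussian"` is the plain least-squares local fit, and `family = "symmetric"` applies the robust reweighting.

```r
set.seed(5)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)
y[c(50, 120)] <- y[c(50, 120)] + 5        # two gross outliers

# Least-squares local fit versus robust fit with iterative reweighting
fit_ls     <- loess(y ~ x, span = 0.3, degree = 2, family = "gaussian")
fit_robust <- loess(y ~ x, span = 0.3, degree = 2, family = "symmetric")

plot(x, y, pch = 16, cex = 0.5)
lines(x, fitted(fit_ls), col = "red", lwd = 2)       # pulled towards the outliers
lines(x, fitted(fit_robust), col = "blue", lwd = 2)  # downweights the outliers
```

The `span` argument plays the role of the smoothing parameter q, and `degree` selects local linear (1) or local quadratic (2) fitting.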

  11. Data “Normalization” • To biologists, data normalization means “eliminating systematic noise” from the data • Here noise means systematic variation: experimental variation, human error, variation in scanner technology, etc. • It is variation in which we are NOT interested • We are interested in measuring the true biological variation of genes across experiments, over time, etc. • Normalization plays an important role in the early stages of microarray data analysis. • Subsequent analyses are highly dependent on normalization. • NORMALIZATION: adjusts for any bias that arises from the microarray technology rather than from the biology

  12. Normalization: Age-old Statistical Idea • Stands for removing bias that results from experimental artifacts in the data. • Goes back to Fisher’s idea (1923) in setting up ANOVA. • There is a push to use ANOVA for normalization, but for the most part normalization is still a stage-wise approach rather than a single model that takes out all sources of variation at once. • We will need to look at: • Spatial correction • Background correction • Dye-effect correction • Within replicate rescaling • Across replicate rescaling • Within slide normalization • Paired slide normalization for dye swap • Multiple slide normalization

  13. M vs A plots • Used to look at the agreement of variables intended to measure the same response. • Suppose y1 and y2 measure the same variable (two replicates of the same variable) • M or Minus = (y1 - y2) • A or Average = (y1 + y2)/2 • Often done on the log scale: M = log(y1/y2), A = (log(y1) + log(y2))/2 • If we plot M on the y-axis and A on the x-axis, we expect to see a flat line at M = 0 if the two variables do indeed measure the same thing.
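On the log scale, M and A can be computed as in the following sketch. The intensities are simulated (an assumption, not the lecture's data), with two replicates that share the same underlying signal.

```r
set.seed(42)
# Simulated intensities for two replicate measurements (assumed data)
truth <- rlnorm(500, meanlog = 8, sdlog = 1)
y1 <- truth * rlnorm(500, 0, 0.1)
y2 <- truth * rlnorm(500, 0, 0.1)

M <- log2(y1) - log2(y2)          # log ratio, same as log2(y1 / y2)
A <- (log2(y1) + log2(y2)) / 2    # average log intensity

plot(A, M, pch = 16, cex = 0.5)
abline(h = 0, col = "red")        # expected flat line when the replicates agree
```

Because the two replicates differ only by random multiplicative noise, the cloud of points scatters evenly around M = 0 at every value of A.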

  14. Code
      # read the two slides and compute M and A
      ma.data=read.csv("MA.csv",header=TRUE)
      head(ma.data)
      y1=ma.data$slide1
      y2=ma.data$slide2
      M=y1-y2
      A=(y1+y2)/2
      plot(y1,y2)
      plot(A,M)
      # loess trend of M against A
      lw1 <- loess(M ~ A,data=ma.data,span=0.10)
      plot(M ~ A, data=ma.data)
      j <- order(A)
      lines(A[j],lw1$fitted[j],col="red",lwd=3)
      #idea of normalization: subtract the fitted trend from M
      fit=lw1$fitted
      newM=M-fit
      lw2 <- loess(newM ~ A,data=ma.data,span=0.10)
      plot(newM ~ A, data=ma.data)
      j <- order(A)
      lines(A[j],lw2$fitted[j],col="green",lwd=3)
      #plotting all on the same plot
      plot(M ~ A, data=ma.data)
      lines(A[j],lw1$fitted[j],col="red",lwd=3)
      lines(A[j],lw2$fitted[j],col="green",lwd=3)

  15. Array 1: pre and post norm

  16. Comments: • Print-tip normalization is generally a good proxy for spatial effects • Instead of LOESS one can use SPLINE to estimate the trend to subtract from the raw data.
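The spline alternative can be sketched with `smooth.spline()` from base R's stats package. The MA data are simulated with an artificial intensity-dependent trend, which is an assumption made only to have something to remove.

```r
set.seed(7)
# Simulated MA data with an intensity-dependent trend (assumed, for illustration only)
A <- runif(400, 6, 14)
M <- 0.3 * sin(A / 2) + rnorm(400, sd = 0.2)

# Smoothing spline estimate of the trend in M as a function of A
sp <- smooth.spline(A, M)
trend <- predict(sp, x = A)$y

# Normalized log ratios: raw M minus the estimated trend
M_norm <- M - trend

plot(A, M_norm, pch = 16, cex = 0.5)
abline(h = 0, col = "green", lwd = 2)
```

Here the amount of smoothing is chosen automatically by generalized cross-validation, whereas with LOESS it is set through the span.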

  17. BACKGROUND CORRECTION • Idea: • Signal = True Signal + Background • So an attractive idea is to subtract the BACKGROUND from the signal to get to the “TRUE” signal. • The problem is that the actual BACKGROUND in a spot cannot be measured; what is measured is really an “estimate” of the background from places NEAR the spot. • Criticism: the assumption in these models is that the background is additive • OFTEN WE SEE HIGH CORRELATION BETWEEN FOREGROUND AND BACKGROUND. • GENERAL CONSENSUS THESE DAYS: NOT TO SUBTRACT LOCAL BACKGROUND, BUT POSSIBLY SUBTRACT A GLOBAL BACKGROUND (FROM EMPTY SPOTS OR BUFFERS).

  18. Background Correction: more thoughts • McClure and Wit (2004) suggest calculating the mean or median of the empty spots and estimating the signal as: • Signal = max(observed signal - center(empty spots), 0) • This ensures that the “corrected signals” are never negative.
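A small sketch of this truncated subtraction, using made-up intensities (both vectors are hypothetical) and taking the median as the center of the empty spots:

```r
# Hypothetical observed spot intensities and empty-spot intensities
observed <- c(120, 95, 410, 88, 1500)
empty    <- c(90, 100, 110, 95, 105)

# center() is taken here to be the median; mean() is the other option
bg <- median(empty)

# Truncate at zero so that no corrected signal is negative
signal <- pmax(observed - bg, 0)
signal
```

With these numbers the global background is 100, so the two spots below background (95 and 88) are set to 0 rather than to negative values.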

  19. Background Correction: Probabilistic Idea • Irizarry et al. (2003) • Looks at finding the conditional expectation of the TRUE signal given the observed signal (which is assumed to be the true signal plus noise) • $E(s_i \mid s_i + b_i)$ • Here, $s_i$ is assumed to follow an Exponential distribution with parameter $\theta$. • $b_i$ is assumed to follow $N(\mu_e, \sigma_e^2)$ • Estimate $\mu_e$ and $\sigma_e$ as the mean and standard deviation of the empty spots

  20. Irizarry Approach contd… • This allows the formula to be approximated by the following, where $\Phi$, $\phi$ are the CDF and pdf of the standard normal distribution:
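Under the stated assumptions the conditional expectation has a closed form, sketched below. If the exponential parameter is read as a rate (an assumption), then given the observed value o the signal follows a normal distribution with mean a = o - mu_e - sigma_e^2 * theta and standard deviation sigma_e, truncated to be non-negative, and its mean is a + sigma_e * phi(a / sigma_e) / Phi(a / sigma_e). This is not necessarily the exact approximation used in the lecture or in the published method, which may carry additional terms. The helper `bg_correct` and all numbers are hypothetical.

```r
# Hypothetical helper: E(s | o) when s ~ Exponential(rate = theta)
# and b ~ N(mu_e, sigma_e^2), with o = s + b
bg_correct <- function(o, theta, mu_e, sigma_e) {
  a <- o - mu_e - sigma_e^2 * theta   # mean of the untruncated normal
  # Mean of that normal truncated to s >= 0
  a + sigma_e * dnorm(a / sigma_e) / pnorm(a / sigma_e)
}

empty   <- c(90, 100, 110, 95, 105)   # assumed empty-spot intensities
mu_e    <- mean(empty)
sigma_e <- sd(empty)

o <- c(80, 100, 150, 1000)            # assumed observed intensities
s_hat <- bg_correct(o, theta = 0.01, mu_e = mu_e, sigma_e = sigma_e)
s_hat
```

Unlike plain subtraction, the estimate stays positive even when the observed intensity is below the estimated background, and for bright spots it approaches simple subtraction of the background mean.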

  21. Normalization Approaches • GLOBAL Normalization (G): Global (ARRAY) Mean or Median. • NOT USED VERY OFTEN ANYMORE • Intensity dependent linear Normalization (L): by least square estimation • AGAIN NOT USED AS MUCH • Intensity dependent non-linear Normalization (N): Lowess curve (Robust scatter plot smoother) • Under ideal experimental conditions: M=0 for the selected genes used for normalization • THE MOST COMMONLY USED IDEA THESE DAYS.
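The three approaches (G, L, N) can be compared side by side on the same log ratios. The data are simulated with a curved intensity-dependent bias (an assumption for illustration), and the span of 0.4 is an arbitrary choice.

```r
set.seed(3)
# Simulated log ratios M with an intensity-dependent bias in A (assumed data)
A <- runif(500, 6, 14)
M <- 0.5 + 0.025 * (A - 10)^2 + rnorm(500, sd = 0.2)

# (G) global: subtract the array-wide median of M
M_global <- M - median(M)

# (L) intensity-dependent linear: subtract a least-squares line in A
M_linear <- residuals(lm(M ~ A))

# (N) intensity-dependent non-linear: subtract a robust loess curve
fit <- loess(M ~ A, span = 0.4, family = "symmetric")
M_loess <- residuals(fit)

# Spread left after each method; leftover trend inflates it
sapply(list(global = M_global, linear = M_linear, loess = M_loess), sd)
```

Global centering only shifts the cloud to M = 0 on average, while the loess fit also removes the curvature in A, which is why it leaves the smallest spread here.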

  22. Normalization: Historical Approaches • Global normalization • Sum method: Norm coef. (kj) = Where Imi = intensity of gene i on Array m, m = 1, 2; Bm = background intensity on Array m, m = 1, 2; n = number of genes on the array • Problem: validity of the assumption; stronger signals dominate the summation. • Median (robust with respect to outliers): Normalization coefficient (kj) =

  23. Normalization continued • Housekeeping gene normalization • Housekeeping genes are a set of genes whose expression levels are not affected by the treatment. • The normalization coefficient is the ratio mC/mT, where mC and mT are the means of the selected housekeeping genes for control and treatment respectively. • Problem: housekeeping genes sometimes change their expression level, so the assumption does not always hold. • Trimmed mean normalization (adjusted global method): trim off the 5% highest and lowest extreme values, then globally normalize the data. The normalization coefficient is: where are the trimmed means for the ith treatment and control respectively.
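Both coefficients can be sketched on simulated data. The intensities, the 1.5-fold brightness difference, and the choice of the first 50 genes as housekeeping genes are all assumptions. The trimmed-mean coefficient is taken here, by analogy with mC/mT, to be the ratio of the control trimmed mean to the treatment trimmed mean, which is also an assumption.

```r
set.seed(11)
# Simulated intensities for control and treatment arrays (assumed data)
control   <- rlnorm(1000, meanlog = 7, sdlog = 1)
treatment <- 1.5 * control * rlnorm(1000, 0, 0.1)  # treatment array 1.5x brighter

# Housekeeping normalization: ratio of means over the housekeeping genes
hk   <- 1:50                                       # hypothetical housekeeping gene indices
k_hk <- mean(control[hk]) / mean(treatment[hk])

# Trimmed mean normalization: drop the 5% highest and lowest values on each array
k_trim <- mean(control, trim = 0.05) / mean(treatment, trim = 0.05)

# Rescale the treatment array so that it is comparable with the control
treatment_hk   <- k_hk * treatment
treatment_trim <- k_trim * treatment
c(housekeeping = k_hk, trimmed = k_trim)
```

Both coefficients come out close to 1/1.5, the factor needed to undo the simulated brightness difference; the trimmed version uses all genes but is protected from the extreme tails.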

  24. Ideal Control Spots that should be on an array As we saw in the previous slide, there can be many special probes spotted onto an array during its manufacture, collectively called control probes. These include • Blanks: places where water or nothing is spotted. • Buffer: where the buffer solution without DNA is spotted. • Negative: here there are DNA probes, but they shouldn’t be complementary to any target cDNA. • Calibration: probes corresponding to DNA put in the hyb mix which should have equal signals in the two channels. • Ratio: probes corresponding to DNA put in the hyb mix which should have known ratios between the two channels (e.g. 3:1,1:3, 10:1, 1:10).

  25. Normalization Within and Across Conditions • Normalization WITHIN conditions is more common • Idea: we want all the arrays that represent the SAME condition to be comparable. • In other words, take out the array effect. • Many models for this: • Factorial model (Kerr et al, Wolfinger et al) • Location Scale Model (Yang et al) • Scaling (Affymetrix) • Consider the data to be $x_{ijk}$: ith spot, jth color, kth array

  26. Quantile Normalization Idea • Ideally “replicate” microarrays should be similar • In real life they are often NOT identically distributed • Quantile normalization FORCES the same distribution on all the arrays for the same condition

  27. Mathematical details: Quantile Normalization • Let {x} represent the matrix of all p spot intensities on the n replicate arrays. • Here, $x_{ik}$ is the intensity of the ith spot on the kth array (i = 1, …, p; k = 1, …, n). • Let $x_{(j)}$ = vector of the jth smallest spot intensities across the arrays • Let $\bar{x}_{(j)}$ be the mean/median of $x_{(j)}$ • The vector $(\bar{x}_{(1)}, \ldots, \bar{x}_{(p)})$ represents the compromise distribution. • Let {r} be the matrix of ranks associated with the matrix {x}, where $r_{ik}$ is the rank of spot i within array k • Then the quantile normalized values are $x^{norm}_{ik} = \bar{x}_{(r_{ik})}$

  28. Numerical Example • Let us consider a situation where we have 5 spots on an array and two replicate arrays (numbers in brackets are the ranks within each array)
      Spot       1       2       3       4       5
      Array 1    16(5)   0(1)    9(3)    11(4)   7(2)
      Array 2    13(4)   3(1)    5(2)    14(5)   8(3)
      • Order each array:
      Array 1    0       7       9       11      16
      Array 2    3       5       8       13      14
      • Average these:
      Mean       1.5     6       8.5     12      15
      • Replace the ranks by these averages. The normalized arrays are:
      Array 1    15      1.5     8.5     12      6.0
      Array 2    12      1.5     6.0     15      8.5
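The same steps can be traced in base R on the numbers from this example. This is a minimal sketch that uses the mean as the compromise and assumes no ties within an array; the object names are illustrative.

```r
# Spots in rows, arrays in columns (same numbers as the example)
x <- cbind(array1 = c(16, 0, 9, 11, 7),
           array2 = c(13, 3, 5, 14, 8))

# Sort each array, then average across arrays at each rank:
# this is the compromise distribution
compromise <- rowMeans(apply(x, 2, sort))

# Rank of every spot within its own array (no ties in this example)
r <- apply(x, 2, rank)

# Replace each value by the compromise value at its rank
x_norm <- apply(r, 2, function(rk) compromise[rk])
x_norm
```

After normalization both columns contain exactly the same set of values (1.5, 6, 8.5, 12, 15), only in different orders, which is what forcing a common distribution means.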

  29. R code
      #need to install files from Bioconductor
      #For R version 3.6 onwards we need to do the following:
      if (!requireNamespace("BiocManager", quietly = TRUE))
          install.packages("BiocManager")
      BiocManager::install()
      BiocManager::install(c("GenomicFeatures", "AnnotationDbi"))
      BiocManager::install("affy")
      BiocManager::install("affyPLM")
      library(affyPLM) #load package
      library(preprocessCore)
      #create a matrix using the same example
      mat1=matrix(c(16,0,9,11,7,13,3,5,14,8),ncol=2)
      normalize.quantiles(mat1)

  30. Conclusion • No unique normalization method for the same data. It depends on what kind of experiment you have and what the data look like. • No absolute criteria for normalization. Basically, the normalized log ratio should be centered around 0. • Nowadays the focus IS on using Nonparametric Regression methods to remove trend or spatial artifacts from the data • Quantile normalization (though not liked by BIOLOGISTS) is catching on as well.
