Group Meeting Presentation Eason Cheng Sep 18, 2014
CrossNorm: a novel normalization strategy for microarray data in cancer
Outline • Background and Introduction • Method and Datasets • Results • Conclusion and Discussion
Background • Purpose of preprocessing/normalization: to remove system variation (noise) while keeping biological variation (information for analysis) • [Diagram labels: Gene chip; System variation; Biological variation; Gene expression profile]
Background 3 steps of preprocessing (noise removal): • Background Correction • Remove local artifacts and “noise” (within arrays) • measurements are not so affected by neighboring measurements • Normalization • Remove array effects (among arrays) • measurements from different arrays are comparable • Summarization • Combine probe (gene segment) intensities across arrays • final measurement represents gene expression level
Background Choice makes a difference: • MAS 5.0 • dChip • GCRMA • RMA, which consists of: convolution background correction; quantile normalization (http://en.wikipedia.org/wiki/Quantile_normalization); Tukey’s median polish
Background Assumptions: • Only a few genes are differentially expressed (DE) • Upward and downward expression level changes are balanced • All arrays are forced to have the same probe intensity distribution. Complicated disease? Cancer?
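The last assumption on this slide is what quantile normalization enforces. The following is a minimal NumPy sketch, not the implementation used in the talk. It assumes rows are probes and columns are arrays, uses simulated log2 intensities, and adds a hypothetical global shift of +1 to the disease group. Ties are broken by sort order rather than averaged. It illustrates how a genuine global shift between groups vanishes once every array is forced onto the same distribution:

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns of x (rows = probes, columns = arrays)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0, kind="stable")   # row indices that sort each column
    reference = np.sort(x, axis=0).mean(axis=1)    # mean of the k-th smallest value across arrays
    out = np.empty_like(x)
    for j in range(x.shape[1]):
        out[order[:, j], j] = reference            # k-th smallest entry gets the k-th reference value
    return out

# Simulated data (assumption): 3 control arrays and 3 disease arrays,
# with every disease intensity shifted up by 1 on the log2 scale.
rng = np.random.default_rng(0)
control = rng.normal(8.0, 1.0, size=(1000, 3))
disease = rng.normal(8.0, 1.0, size=(1000, 3)) + 1.0
raw = np.hstack([control, disease])
normalized = quantile_normalize(raw)

raw_gap = np.median(raw[:, 3:]) - np.median(raw[:, :3])
norm_gap = np.median(normalized[:, 3:]) - np.median(normalized[:, :3])
print(f"median gap before: {raw_gap:.3f}, after: {norm_gap:.3f}")
```

After normalization every column holds the same set of values, so the between-group gap in medians collapses to zero regardless of how large it was in the raw data.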
Background Is the assumption valid for cancer? Figure 1. Box plot of sample median values before normalization in the control (white) and cancer (grey) sample groups for each dataset.
Background Is the assumption valid for cancer? Table 1. Comparison of sample medians of raw signal intensities between the cancer and normal groups. (10/18)
Background The influence of over-normalization
Background We should note that: • Gene expression tends to be extensively up-regulated in cancers. • Effective signals naturally exist in the raw data. • The assumptions underlying most current normalization algorithms may not hold true.
Background (Motivation) Assumptions: • Only a few genes are differentially expressed (DE) ✗ • Upward and downward expression level changes are balanced ✗ • All arrays are forced to have the same probe intensity distribution ✗ Complicated disease? Cancer? The assumptions are NOT reasonable for cancer studies. We propose a novel normalization strategy.
Outline • Background and Introduction • Method and Datasets • Results • Conclusion and Discussion
Method Requirements for a novel method: • Keep the property of extensive up-regulation • Do not over-normalize Methods considered: • CrossNorm: Cross Normalization • LVS: Least-Variant Set normalization
Method: CrossNorm [Figure: control (C) and disease (D) profiles before and after cross quantile normalization (CrossQuan)] • Keep the rank order within an individual; avoid over-normalization between conditions.
Method • Let $C_1, \dots, C_n$ be the expression profiles of the $n$ control arrays, and let $D_1, \dots, D_m$ be the expression profiles of the $m$ disease arrays. The $C_i$'s and $D_j$'s have the same length $p$ (the number of genes). CrossNorm for the paired case, where $n = m$: • Form a matrix of $n$ columns $(C_i^\top, D_i^\top)^\top$, $i = 1, \dots, n$, each of length $2p$; • Normalize the columns with any approach you intend, such as quantile normalization, to obtain a matrix with columns $(C_i'^\top, D_i'^\top)^\top$; • Obtain the final normalized control arrays as $C_i'$ and the disease ones as $D_i'$.
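The paired procedure on this slide can be sketched as follows. This is one reading of the slide, not the authors' code: it assumes each individual's control and disease profiles are stacked into a single column of twice the gene count, that the column-wise normalizer is a simple quantile normalization (ties broken by sort order), and that the data are simulated with a hypothetical +1 shift in the disease group. The names `cross_norm_paired` and `quantile_normalize` are illustrative only.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns of x (rows = values, columns = arrays)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0, kind="stable")
    reference = np.sort(x, axis=0).mean(axis=1)
    out = np.empty_like(x)
    for j in range(x.shape[1]):
        out[order[:, j], j] = reference
    return out

def cross_norm_paired(control, disease, normalize=quantile_normalize):
    """Paired CrossNorm sketch: control and disease are (genes x individuals)."""
    control = np.asarray(control, dtype=float)
    disease = np.asarray(disease, dtype=float)
    if control.shape != disease.shape:
        raise ValueError("paired case needs matching control/disease shapes")
    p = control.shape[0]
    stacked = np.vstack([control, disease])   # column i holds C_i on top of D_i
    normed = normalize(stacked)               # normalize across individuals only
    return normed[:p], normed[p:]             # split back into control and disease

# Simulated paired data (assumption): 500 genes, 4 individuals,
# disease profiles shifted up by 1 on the log2 scale.
rng = np.random.default_rng(1)
control = rng.normal(8.0, 1.0, size=(500, 4))
disease = rng.normal(9.0, 1.0, size=(500, 4))
c_norm, d_norm = cross_norm_paired(control, disease)

shift_after = d_norm.mean() - c_norm.mean()
print(f"disease minus control after CrossNorm: {shift_after:.3f}")
```

Because control and disease values from the same individual sit in one column, the normalizer only aligns individuals with each other. The ordering of values within an individual, and hence the disease-versus-control shift, is left in place.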
Data sets Affymetrix spike-in data sets: • Spike-in Human Genome U133 data set • Spike-in DrosGenome1 data set Real-world cancer data sets: 18 cancer data sets collected from
Data sets Affymetrix Spike-in Data Set: 1) Spike-in Human Genome U133 data set • based on a Latin square experiment with 42 arrays • 42 spiked-in genes overall, at concentrations ranging from 0.0 to 512 pM • each concentration was run with three replicates • each array contains 22,283 probes.
Data sets Affymetrix Spike-in Data Set: 2) Spike-in DrosGenome1 data set • A set of 14,010 probe sets • 3,866 were assigned a known concentration fold change (FC): • 2,535 probe sets with unchanged concentration (FC = 1) • 1,331 probe sets with FC greater than 1, ranging from 1.2 to 4 (FC > 1) • 10,144 empty probe sets with nothing spiked in (removed in this project).
Data sets Cancer Datasets:
Outline • Background and Introduction • Method and Datasets • Results • Conclusion and Discussion
Result HG U133
Result DrosGenome1
Result DrosGenome1
Result Figure 1. Box plot of sample median values after CrossNorm in the control (white) and cancer (grey) sample groups for each dataset.
Result Identifying differentially expressed (DE) genes: • Fold change (FC) with different thresholds. Assessment of reproducibility: • Percentage of Overlapping Genes (POG): a score measuring the overlapping genes as a percentage of the total number of genes in the two gene sets. • Direction Consistency (DC) ratio: the ratio of genes that have the same regulation direction in both gene sets.
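The slide does not give exact formulas, so the sketch below fixes one plausible reading and labels it as an assumption: POG is taken as twice the number of shared genes divided by the combined size of the two DE lists, and the DC ratio is taken over the shared genes only. DE genes are called by a fold-change cutoff on log2-scale group means. All function names and the toy values are hypothetical.

```python
import numpy as np

def de_genes(control, disease, gene_ids, fc_threshold=2.0):
    """Map gene id -> +1 (up) or -1 (down) for genes passing the FC cutoff.

    control and disease are (genes x samples) arrays on the log2 scale.
    """
    log_fc = np.asarray(disease).mean(axis=1) - np.asarray(control).mean(axis=1)
    cutoff = np.log2(fc_threshold)
    return {g: int(np.sign(d)) for g, d in zip(gene_ids, log_fc) if abs(d) >= cutoff}

def pog(set_a, set_b):
    """Assumed definition: 2 * |shared| / (|A| + |B|)."""
    total = len(set_a) + len(set_b)
    shared = set_a.keys() & set_b.keys()
    return 2 * len(shared) / total if total else 0.0

def dc_ratio(set_a, set_b):
    """Assumed definition: share of overlapping genes with the same direction."""
    shared = set_a.keys() & set_b.keys()
    if not shared:
        return 0.0
    return sum(set_a[g] == set_b[g] for g in shared) / len(shared)

# Toy expression values (log2 scale) for three genes and two samples per group.
control = np.array([[8.0, 8.2], [5.0, 5.0], [7.0, 7.0]])
disease = np.array([[10.1, 10.1], [5.1, 4.9], [5.5, 5.5]])
calls = de_genes(control, disease, ["A", "B", "C"], fc_threshold=2.0)

# Toy DE lists from two hypothetical data sets of the same cancer type.
study_1 = {"G1": 1, "G2": 1, "G3": -1, "G4": 1}
study_2 = {"G2": 1, "G3": 1, "G4": 1, "G5": -1}
print(calls, pog(study_1, study_2), dc_ratio(study_1, study_2))
```

In the toy lists, three genes are shared and two of them agree in direction, so under these assumed definitions POG is 0.75 and the DC ratio is 2/3.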
Result Table 2. (a) The consistency statistic of data sets for ESCC and Pancreatic cancer. • (b) The consistency statistic of data sets for ESCC and Pancreatic cancer. • DC: Direction Consistency; POG: Percent of Overlap Gene
Outline • Background and Introduction • Method and Datasets • Results • Conclusion and Discussion
Conclusion • CrossNorm is a modification of existing normalization methods for processing microarray data sets with global shifts across samples. • It makes the most of the raw signal and maintains the regulation direction. • CrossNorm outperforms global normalizations, as well as the already well-performing LVS normalization approach, in differential analysis with a high degree of biological variation.
Conclusion • CrossNorm fully utilizes the biological signal in the raw data rather than artificially presetting parameters or predefining the proportion of assumed housekeeping genes, as LVS does. • Its application is not restricted to cancer studies; it also suits research comparing tissues or developmental stages, where genes are expected to vary widely. • The strategy could also be extended to all sorts of baseline normalizations.
Discussion Identifying the regulation direction of genes is of vital importance for subsequent biological analysis: • expression correlation of gene products • regulatory relations between miRNAs and their target mRNAs • detecting the regulation direction of oncogenes and tumor suppressor genes.
Future work CrossNorm is a robust and unbiased procedure that could help us better understand expression differences among samples. • Correlation study • miRNA data • RNA-seq data • Preprocessing of published data sets
Q & A THANK YOU!