400 likes | 702 Views
Microarrays. Naomi Altman Dept. of Statistics and PSU March 7, 2007. Microarrays 100 A Statistician’s Simplification.
E N D
Microarrays Naomi Altman Dept. of Statistics and PSU March 7, 2007
Microarrays 100 A Statistician’s Simplification A microarray is a piece of glass or polymer with several thousand spots, each of which contains thousands of copies of a short piece of 1 strand of the "double helix" of DNA or of cDNA (to be explained). The rungs of the DNA ladder consists of 2 bound codons which are designated C, G, A, T. These are called base pairs. Each spot on a microarray consists of a piece of one side of the ladder with the attached base. C binds only to G. A binds only to T. A sample containing an unknown number of the complementary strands is labeled and hybridized to the array. The response is a measure of the quantity of label for each spot, which should be proportional to the number of complementary strands in the sample. http://www.bioteach.ubc.ca/MolecularBiology/AMonksFlourishingGarden/
Agenda • DNA 100 - what we need to know to understand what a microarray can measure. • What can a microarray measure? • Where does the material printed on the microarray come from? • What does a microarray experiment "look like" and where do statistical methods fit in? • (Time permitting) Gene expression experiments and the p >> n problem.
DNA 100 A Statistician’s Simplification Every cell in an organism has the same genetic material, stored in the double helix of DNA. In a diploid population, most cells have 2 copies of each chromosome. Genes are the part of the DNA that code for proteins, but there are many other important features that interest biologists. http://www.accessexcellence.org/RC/VL/GG/chromosome.html
Transcription (Making RNA) • transcription factors bind to the promoter and bind RNA polymerase • DNA strands separate and transcription is initiated • transcription continues in the 3'-5' direction until the stop codons are reached • The completed RNA strand is released for post-processing www.csu.edu.au/faculty/health/biomed/subjects/molbol/basic.htm
Introns and Exons promoter In "higher" organisms, the gene contains noncoding regions, called introns, and coding regions called exons. The introns are spliced out of the mRNA before translation into protein. "Splicing variants" can be formed by the cell selecting combinations of the exons. The resulting spliced strand is the mRNA. We can "predict" exons using statistical algorithms, but the gold standard is that only exons match mRNA sequences At each end of the mRNA is an untranslated region (UTR) which is unique to the gene. Chromosome http://biology.unm.edu/ccouncil/Biology_124/Summaries/T&T.html
cDNA RNA is much less stable than DNA. To preserve the exon sequence, and for printing microarrays, reverse transcription is used in the lab to convert the RNA into the complementary cDNA. cDNA can be preserved by inserting it into the genome of a living microbe (cDNA library).
DNA 100 A Statistician’s Simplification DNA is complicated stuff. Protein-coding regions are called genes. There are also other functional parts to the DNA, some of which code for RNA and some of which are regulatory regions - i.e. they help control how the coding regions are used - e.g. promoters The supercoiling of the DNA may also control how the coding regions are used. As well, there is a lot of DNA which appears to be "junk" - i.e. to date no function is known. But we keep making new discoveries - e.g. some of the "junk" codes for small RNA pieces that are functional.
What can be measured on a microarray? 1. Amount of mRNA expressed by a gene. 2. Amount of mRNA expressed by an exon. 3. Amount of RNA expressed by a region of DNA. 4. Which strand of DNA is expressed. 5. Which of several similar DNA sequences is present in the genome. 6. How many copies of a gene is present in the genome. 7. Where a known protein has bound to the DNA. (ChIP on chip)
What can be measured on a microarray? 1. Amount of mRNA expressed by a gene. gene expression array, exon array, tiling array 2. Amount of mRNA expressed by an exon. exon array, tiling array 3. Amount of RNA expressed by a region of DNA. tiling array 4. Which strand of DNA is expressed. exon array, tiling array 5. Which of several similar DNA sequences is present in the genome. SNP array 6. How many copies of a gene is present in the genome. gene expression array, exon array, tiling array 7. Where a known protein has bound to the DNA. (ChIP on chip) promoter array, tiling array
Types of Microarrays cDNA sequence UTR Exon 1 Exon 2 Exon 3 UTR oligo exon exon exon cDNA chromosome sequence CCGTTCACATTAGGATACCAGTTCAAGGCCGTTCACATTAGGATACCAGTTCAAGGAGGCCGTTCAGTTCACATTA CCGTTCACA AAGGCCGTT SNP CCGTGCACA AAGGACGTT tile tile tile tile tile tile tile tile tile tile tile tile tile tile tile tile tile tile tile promoter A cDNA microarray can be made from the unsequenced cDNA library. All the other types require that the sequence be available.
Print Technology The cDNA or oligo can be: 1. Printed on the slide using an "arraying robot" which deposits a drop of liquid containing the material at each spot. (gene expression only) 40,000+ spots 2. Oligos (all the same length) can be synthesized on the slide using: i) inkjet technology ii) photolithography 1,000,000+ spots 3. There are other technologies that give similar types of results (e.g. "beads").
Spotted 2-Channel Array Spotted arrays are printed on coated microscope slides. 2 RNA samples are converted to cDNA. Each is labelled with a different dye. http://www.anst.uu.se/frgra677/bilder/micro_method_large.jpg
Format of an Affymetrix Array http://cnx.rice.edu/content/m12388/latest/figE.JPG
Microarray experiments Print or buy the microarray Obtain sequence info select oligos Print microarray
Microarray experiments Print or buy the microarray sequencing error assembly error contamination Obtain sequence info unique similar hybridization rates select oligos Print microarray
Microarray experiments Create the labeled samples Print or buy the microarray Obtain sequence info obtain tissue sample select oligos extract RNA extract mRNA Print microarray normalize mRNA label
Microarray experiments Create the labeled samples Print or buy the microarray experimental design -number of biological replicates -technical replicates blocks sample pooling Obtain sequence info obtain tissue sample select oligos extract RNA extract mRNA Print microarray normalize mRNA label
Microarray experiments Create the labeled samples Print or buy the microarray Obtain sequence info obtain tissue sample select oligos extract RNA extract mRNA Print microarray normalize mRNA label hybridize
Microarray experiments Create the labeled samples Print or buy the microrray Obtain sequence info obtain tissue sample select oligos extract RNA extract mRNA Print microarray normalize mRNA label hybridize hybridization design (multichannel)
Microarray experiments scan process image hybridize detect spots detect background compute spot summary detect bad spots remove array specific noise
Microarray experiments scan process image using multiple scans hybridize detect spots spot detection software detect background pixel mean, median ... background correction compute spot summary detection limit background > foreground badly printed spots flaws detect bad spots remove array specific noise normalization
Spot Detection Rafael A Irizarry,Department of Biostatistics JHUrafa@jhu.eduhttp://www.biostat.jhsph.edu/~ririzarrhttp://www.bioconductor.org nci 2002 Adaptive segmentation Fixed circle segmentation ---- GenePix ---- QuantArray ---- ScanAnalyze Spot uses morphological opening
Gene Expression Microarray experiments obtain numerical summary for each gene or exon on each array differential expression analysis sample classification clustering genes and samples
Gene Expression Microarray experiments obtain numerical summary for each gene or exon on each array robust methods to downweight outliers data imputation (if needed) t-tests, ANOVA Bayesian versions of above Fourier analysis of time series False discovery and nondiscovery rates differential expression analysis sample classification clustering genes and samples unsupervised learning hierarchical clustering k-means clustering heatmaps discriminant analysis support vector machines supervised learning
sample clusters show that different regions of the brain cluster more closely than different species A heatmap samples of different regions of the brain in humans and chimpanzees gene clusters show that some genes differentiate among brain regions while other differentiate the 2 species ☺
SNP Microarray experiments obtain numerical summary for each SNP estimate SNP frequency haplotyping association with subpopulation
SNP Microarray experiments obtain numerical summary for each SNP estimate SNP frequency binomial distribution determine which sets of SNPs come from each of the 2 chromosomes haplotyping association with disease, ecotype, etc multivariate analysis mixture models association with subpopulation
Tiling Microarray experiments obtain numerical summary for each codon on the chromosome visualization
Tiling Microarray experiments obtain numerical summary for each codon on the chromosome nonparametric smoothing visualization
p >> n n=#samples Usually, we have some type of response on the samples which may be quantitative (e.g. body mass index, HDL) or categorical (cancer type, growth stage ...) p=#measurements which may be the intensity per gene, exon, locus, promoter region ...
p >> n n=#samples Call the response Y typically a n x 1 vector (e.g. BMI) p=#measurements Call the measurements X, an n x p matrix
p >> n n=#samples Call the response Y typically a n x 1 vector (e.g. BMI) p=#measurements Call the measurements X, an n x p matrix. Typically we might think that Y is related to a small set of the X measures, e.g. k<<n of the p variables. e.g. Y=Ub + e where U is the n x k matrix of measurements, b is an unknown vector of constants and e is random.
p >> n n=#samples Call the response Y typically a n x 1 vector (e.g. BMI) p=#measurements Call the measurements X, an n x p matrix. Typically we might think that Y is related to a small set of the X measures, e.g. k<<n of the p variables. e.g. Y=Ub + e where U is the n x k matrix of measurements, b is an unknown vector of constants and e is random. If we try to solve Y=Xb and X has rank n, we will always find an exact solution. In fact, if we select any submatrix of columns of X with rank n, we will always find an exact solution, even if those columns are completely independent of U.
p >> n Another approach is to try to predict using 1 column of X at a time. If none of the columns are in U (so that the corresponding coefficients are 0), then, if we do any statistical test for b=0 and reject for p-value <a, we will reject ap of the tests and conclude that the corresponding spots are associated with Y. Because usually p>10000, we will make a lot of mistakes unless a is extremely small.
p >> n All of the special methods for analysis of gene expression data are developed to solve the p >> n problem.
p >> n e.g. 2-sample problem 2 conditions: e.g. cancer normal For gene (exon, locus ...) i, we have n samples with p genes and observe Yijk i = gene id j= condition id k=sample id Usual method: 2-sample t-test:
p >> n e.g. 2-sample problem Some ideas to improve selection of differentially expressed genes: 1) Force all genes to have the same variability (2-way ANOVA) by the normalization step. 2) assume that there is a distribution of gene means known in advance or estimated from the data (Bayesian or empirical Bayes methods). 3) Use the data to estimate the number of inference errors. 4) Force the data to be normally distributed (within gene) in the normalization step or use bootstrap or permutation methods (suitable for fairly large sample size).
Unsolved Problems People are still working on normalization, differential expression analysis, clustering and classification for gene expression arrays. There are also problems in combining data from other sources including measurements from other platforms, meta-analysis, and data from the literature. These problems are not dead, but it will be increasingly difficult to find new problems without a paradigm change. The new arrays (exon, SNP, tiling) will need more new methodology.
Acknowledgements Bioinformatics Consulting Center Huck Institute Diya Zhang Wenlei Liu Qing Zhang Allison Lab (UAB) Francesca Chiaromonte Floral Genome Project dePamphilis Lab Ma Lab Carlson Lab McNellis Lab Pugh Lab Fedorov Lab Tony Hua Xianyun Mao