Gene Array Analysis. Statistical genetics - Class 10 Gene array description Normalization Data Analysis Multiple measurements. What is a gene array.
What is a gene array • Gene arrays are solid supports upon which a collection of gene-specific nucleic acids has been placed at defined locations, either by spotting or by direct synthesis. • In array analysis, a nucleic acid-containing sample is labeled and then allowed to hybridize with the gene-specific targets on the array. • Based on the amount of probe hybridized to each target spot, information is gained about the specific nucleic acid composition of the sample. • The major advantage of gene arrays is that they can provide information on thousands of targets in a single experiment.
Nomenclature • Many terms exist for naming gene arrays, including: • biochip • DNA chip • GeneChip (a registered trademark of Affymetrix, Inc.) • DNA array • microarray • macroarray • The terms microarray and macroarray may be used to distinguish arrays by spot size or by the number of spots on the support. (Figure: glass support.)
Experiment • A typical gene array experiment involves: • Isolating RNA from the samples to be compared • Converting the RNA samples to labeled cDNA via reverse transcription; this step may be combined with aRNA amplification • Hybridizing the labeled cDNA to identical membrane or glass slide arrays • Removing the unhybridized cDNA • Detecting and quantitating the hybridized cDNA • Comparing the quantitative data from the various samples
Choosing Cell Populations • The goal of comparative cDNA hybridization is to compare gene transcription in two or more different kinds of cells. For example: • Tissue-specific Genes - Cells from two different tissues (say, cardiac muscle and prostate epithelium) are specialized for performing different functions in an organism. Although we can recognize cells from different tissues by their phenotypes, it is not known just what makes one cell function as smooth muscle, another as a neuron, and still another as prostate. • Ultimately, a cell's role is determined by the proteins it produces, which in turn depend on its expressed genes. Comparative hybridization experiments can reveal genes which are preferentially expressed in specific tissues.
Choosing Cell Populations • Genetic disease is often caused by genes which are inappropriately transcribed -- either too much or too little -- or which are missing altogether. • Such defects are especially common in cancers, which can occur when regulatory genes are deleted, inactivated, or become constitutively active. • Unlike some genetic diseases (e.g. cystic fibrosis) in which a single defective gene is always responsible, cancers which appear clinically similar can be genetically heterogeneous. • For example, prostate cancer (prostatic adenocarcinoma) may be caused by several different, independent regulatory gene defects even in a single patient.
Choosing Cell Populations • Cell Cycle Variations • Cells undergo DNA replication, mitosis, and eventually death. These activities require quite different gene products, such as DNA polymerases for genome replication or microtubule spindle proteins for mitosis. A cell's genes encode the "programs" for these activities, and gene transcription is required to execute those programs. Comparative hybridization can be used to distinguish genes that are expressed at different times in the cell cycle. In this way, the pathways responsible for controlling basic life processes can be uncovered.
mRNA Extraction • Genes which code for protein are transcribed into messenger RNA's (mRNA's) in the cell nucleus. The mRNA's in turn are translated into proteins by ribosomes in the cytoplasm. The transcription level of a gene is taken to be the amount of its corresponding mRNA present in the cell. Comparative hybridization experiments compare the amounts of many different mRNA's in two cell populations.
mRNA Extraction • To prepare mRNA for use in a microarray assay, it must be purified from total cellular contents. mRNA accounts for only about 3% of all RNA in a cell. • Common mRNA isolation methods take advantage of the fact that most mRNA's have a poly-adenine (poly(A)) tail. These poly(A)+ mRNA's can be purified by capturing them using complementary oligodeoxythymidine (oligo(dT)) molecules bound to a solid support.
Reverse transcription • Captured mRNA's are still difficult to work with because they are prone to being destroyed. • The environment is full of RNA-digesting enzymes, so free RNA is quickly degraded. To prevent the experimental samples from being lost, they are reverse-transcribed back into more stable DNA form. The products of this reaction are called complementary DNA's (cDNA's) because their sequences are the complements of the original mRNA sequences.
Reverse transcription • A problem with cDNA production is that not all mRNA's are reverse-transcribed with the same efficiency. This leads to reverse transcription bias, which can change the relative amounts of different cDNA's measured by the microarray assay. • Reverse transcription bias is not a problem when comparing the same mRNA across two cell populations, unless it prevents the mRNA from being reverse-transcribed at all. • However, the bias does preclude quantitative comparison between different mRNA's on one array.
Fluorescent labeling of cDNA's • In order to detect cDNA's bound to the microarray, we must label them with a reporter molecule that identifies their presence. The reporters currently used in comparative hybridization to microarrays are fluorescent dyes (fluors). • A differently-colored fluor is used for each sample so that we can tell the two samples apart on the array. The labeled cDNA samples are called probes because they are used to probe the collection of spots on the array. • Fluors do not show their colors unless stimulated with a specific frequency of light by a laser. Even then, the colors are not directly observed; rather, the wavelength of the emitted light is used to tune a detector which measures the fluorescence.
Normalization • The number of fluor molecules which label each cDNA depends on its length and possibly its sequence composition, both of which are often unknown. • This is one more reason that fluorescent intensities for different cDNA's cannot be quantitatively compared. However, identical cDNA's from the two probes are still comparable as long as the same number of label molecules are added to the same DNA sequence in each probe.
Normalization • To equalize the total concentrations of the two cDNA probes before applying them to the array, the probe solutions are diluted to have the same overall fluorescent intensity. • This procedure makes two possibly unjustified assumptions: • that the total amount of mRNA in each cell type being tested is identical • that each fluor emits the same amount of light relative to its concentration.
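The same idea can be applied numerically to scanned intensities rather than to the probe solutions themselves. The sketch below is only an illustration under that assumption: the function name `scale_to_equal_total` and the intensity values are made up, and the approach inherits both assumptions listed on the slide.

```python
def scale_to_equal_total(red, green):
    """Rescale the green channel so both channels have the same total intensity.

    Hypothetical helper: a numerical analogue of diluting the two probe
    solutions to the same overall fluorescence.
    """
    total_red = sum(red)
    total_green = sum(green)
    if total_green == 0:
        raise ValueError("green channel has zero total intensity")
    factor = total_red / total_green
    return [g * factor for g in green], factor


# Made-up spot intensities for four spots.
red = [1200.0, 300.0, 4500.0, 800.0]
green = [600.0, 160.0, 2100.0, 540.0]

green_scaled, factor = scale_to_equal_total(red, green)
print(factor)             # scaling factor applied to the green channel
print(sum(green_scaled))  # now matches sum(red)
```

A single global factor cannot correct intensity-dependent dye effects, which is why regression-based and curvilinear normalization are also used.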
Hybridization to a DNA Microarray • The two cDNA probes are tested by hybridizing them to a DNA microarray. • The array holds hundreds or thousands of spots, each of which contains a different DNA sequence. • In this way, every spot on an array is an independent assay for the presence of a different cDNA. There is enough DNA on each spot that both probes can hybridize to it at once without interference. • Microarrays are made from a collection of purified DNA's. A drop of each type of DNA in solution is placed onto a specially prepared glass microscope slide by an arraying machine. The arraying machine can quickly produce a regular grid of thousands of spots in a square about 2 cm on a side.
Scanning the Hybridized Array • Once the cDNA probes have been hybridized to the array and any loose probe has been washed off, the array must be scanned to determine how much of each probe is bound to each spot. • The probes are tagged with fluorescent reporter molecules which emit detectable light when stimulated by a laser. • The emitted light is captured by a detector, usually a charge-coupled device (CCD). • Spots with more bound probe will have more reporters and will therefore fluoresce more intensely. • The scanner also records light from a few molecules that hybridized either to the wrong spot or nonspecifically to the glass slide. This extra light becomes the background of the scanned array image.
Affymetrix arrays • 10^7 copies per oligo in a 24 x 24 µm square • Use 20 pairs of different 25-mers per gene • Each pair consists of a perfect match and a mismatch oligo
Data Analysis • Normalization • Detection of outliers • Clustering • Multiple measurements
False color images of spotted array • Overlay of two scans of the slide • Compares the two samples • Green = less relative expression • Red = more relative expression • Yellow = equal expression • Dimmer colors = lower expression levels.
Normalizing two-color arrays • The signals for the two colors are rarely “balanced”. (Figure panels: before, after.)
Normalization (Figure: scatter plot; axes: Cy5 signal (log2), Cy3 signal (log2).)
Normalization by iterative linear regression • Fit a line (y = mx + b) to the data set • Set aside outliers (residuals > 2 x s.e.) • Repeat until r^2 changes by < 0.001 • Then apply the slope and intercept to the original data set. D Finkelstein et al. http://www.camda.duke.edu/CAMDA00/abstracts.asp
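The loop on this slide can be sketched in a few lines. This is an illustration, not the implementation of Finkelstein et al., and it rests on several assumptions: "s.e." is read as the residual standard error of the fit, x and y are taken to be the log2 Cy3 and Cy5 signals, and "applying" the fit is read as mapping y to (y - b) / m. The function names and the data are made up.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b, r2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b, (sxy * sxy) / (sxx * syy)


def iterative_regression(x, y, k=2.0, tol=0.001, max_iter=50):
    """Refit after setting aside outliers until r2 changes by less than tol."""
    keep = list(range(len(x)))
    prev_r2 = None
    for _ in range(max_iter):
        m, b, r2 = fit_line([x[i] for i in keep], [y[i] for i in keep])
        if prev_r2 is not None and abs(r2 - prev_r2) < tol:
            break
        prev_r2 = r2
        resid = [y[i] - (m * x[i] + b) for i in keep]
        # Assumption: "s.e." is the residual standard error of the fit.
        se = math.sqrt(sum(r * r for r in resid) / (len(keep) - 2))
        kept = [i for i, r in zip(keep, resid) if abs(r) <= k * se]
        if len(kept) < 3:  # too few points left to refit
            break
        keep = kept
    return m, b


# Made-up log2 signals: cy5 = 0.8 * cy3 + 1.5, small noise, two outliers.
cy3 = [6.0 + 0.5 * i for i in range(20)]
cy5 = [0.8 * v + 1.5 + 0.05 * ((7 * i) % 5 - 2) for i, v in enumerate(cy3)]
cy5[3] += 3.0
cy5[15] -= 3.0

m, b = iterative_regression(cy3, cy5)
# Apply the slope and intercept to the ORIGINAL data set, outliers included.
cy5_norm = [(v - b) / m for v in cy5]
print(round(m, 2), round(b, 2))
```

The outliers are excluded only from the fit. They remain in the normalized data, since they may be the differentially expressed genes of interest.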
Normalization (Linear) (Figure: scatter plots; axes: Cy5 signal (log2), Cy3 signal (log2).)
Normalization (Curvilinear) G Tseng et al., NAR 2001
LOESS function To use LOESS, the user must specify the degree, d, of the local polynomial to be fit to the data, and the fraction of the data, q, to be used in each fit. In this case, the simplest possible initial function specification is d=1 and q=1. While it is relatively easy to understand how the degree of the local polynomial affects the simplicity of the initial model, it is not as easy to determine how the smoothing parameter affects the function.
LOESS function The weight function gives the most weight to the data points nearest the point of estimation and the least weight to the data points that are furthest away. The use of the weights is based on the idea that points near each other in the explanatory variable space are more likely to be related to each other in a simple way than points that are further apart. Following this logic, points that are likely to follow the local model best influence the local model parameter estimates the most. Points that are less likely to actually conform to the local model have less influence on the local model parameter estimates. The traditional weight function used for LOESS is the tri-cube weight function, w(x) = (1 - |x|^3)^3 for |x| < 1 and w(x) = 0 for |x| >= 1.
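To make the roles of d, q and the weights concrete, here is a minimal sketch of a LOESS estimate at a single point, with the degree fixed at d = 1 (local straight line). It is a bare-bones illustration: the names `tricube` and `loess_point` are made up, there are no robustness iterations, and a statistics package would normally be used instead.

```python
import math


def tricube(u):
    """Tri-cube weight: (1 - |u|^3)^3 for |u| < 1, otherwise 0."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_point(x0, x, y, q=0.5):
    """Local linear (d = 1) estimate at x0 using the nearest fraction q of points."""
    n = len(x)
    k = max(3, int(math.ceil(q * n)))
    # Bandwidth: distance from x0 to its k-th nearest neighbour.
    h = sorted(abs(xi - x0) for xi in x)[min(k, n) - 1]
    if h == 0.0:
        same = [yi for xi, yi in zip(x, y) if xi == x0]
        return sum(same) / len(same)
    w = [tricube((xi - x0) / h) for xi in x]
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    if sxx == 0.0:
        return mean_y
    slope = sum(wi * (xi - mean_x) * (yi - mean_y)
                for wi, xi, yi in zip(w, x, y)) / sxx
    return mean_y + slope * (x0 - mean_x)


# On exactly linear data a local linear fit reproduces the line.
xs = [float(i) for i in range(11)]
ys = [2.0 * v + 1.0 for v in xs]
estimate = loess_point(4.5, xs, ys, q=0.5)
print(estimate)
```

For array normalization, one common use is to fit the log ratio of the two channels against the average log intensity and subtract the fitted curve. A smaller q follows local curvature more closely, while a larger q gives a smoother curve.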
Image Analysis • 2 images per array • Super-imposing • Grid on image
Gene Ratios • Gene expression levels determined by intrinsic properties of each gene (Figure: Gene A and Gene B on an expression-level axis running from low to high.)
Statistical Analysis • Differences in ratios due to • random variation • meaningful changes • Hypothesis testing, with H0: no systematic differences between ratios
Most Basic Statistical Analysis • Assumptions • ‘red’ and ‘green’ intensities at a given gene are independent, identically and normally distributed, with common variance • constant coefficient of variation over the whole gene set
Statistical Analysis • According to Chen et al. 1997 (J Biomedical Optics, 2(4):364) • Ratio for gene k: T_k = R_k / G_k • c: coefficient of variation, estimated from the data
Statistical Analysis • Classification with hypothesis testing • 3 classes of genes: under-expressed (lower tail, α/2), unchanged, and over-expressed (upper tail, α/2)
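The three-way classification can be illustrated by simulation. The sketch below does not use the closed-form ratio density of Chen et al. Instead it draws red and green intensities under the stated assumptions (independent, normal, same mean, constant coefficient of variation c) and reads the α/2 cut-offs off the simulated ratios. The function names, the label "unchanged" for the middle class, and the value c = 0.1 are all assumptions made for the example.

```python
import random


def ratio_thresholds(c, alpha=0.05, n_sim=20000, seed=0):
    """Monte Carlo cut-offs for T = R / G under H0 (no systematic difference).

    R and G are drawn as Normal(1, c): under a constant coefficient of
    variation the common mean cancels in the ratio, so it is set to 1.
    """
    rng = random.Random(seed)
    ratios = sorted(rng.gauss(1.0, c) / rng.gauss(1.0, c) for _ in range(n_sim))
    lower = ratios[int(n_sim * alpha / 2)]
    upper = ratios[int(n_sim * (1 - alpha / 2)) - 1]
    return lower, upper


def classify(t, lower, upper):
    """Assign a ratio to one of the three classes."""
    if t < lower:
        return "under-expressed"
    if t > upper:
        return "over-expressed"
    return "unchanged"


lower, upper = ratio_thresholds(c=0.1)
for t in (0.3, 1.0, 3.0):
    print(t, classify(t, lower, upper))
```

The interval is not symmetric around 1 on the ratio scale, which is one reason ratios are usually examined on a log scale.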
Fold Change Graphs • How many times did the expression of this gene change in the treated tissue versus the control? • comparison analysis • requires experiment vs control • does not apply to absolute analysis • parameter value in one vs another • Avg diff (perfect match vs mismatch)
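As a small worked example, the average difference of a probe set can be computed as the mean of (perfect match - mismatch) over its probe pairs, and the fold change as the ratio of that value between experiment and control. The sketch below assumes this reading of "Avg diff"; the function names and intensities are made up, and only four probe pairs are used for brevity.

```python
def average_difference(pm, mm):
    """Mean of (perfect match - mismatch) over the probe pairs of one gene."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


def fold_change(experiment, control):
    """Ratio of experiment to control; undefined if the control is not positive."""
    if control <= 0:
        raise ValueError("control value must be positive")
    return experiment / control


# Made-up intensities for one gene, four probe pairs per sample.
pm_control, mm_control = [300, 280, 320, 300], [200, 190, 210, 200]
pm_treated, mm_treated = [450, 430, 470, 450], [200, 180, 220, 200]

avg_control = average_difference(pm_control, mm_control)
avg_treated = average_difference(pm_treated, mm_treated)
fold = fold_change(avg_treated, avg_control)
print(avg_control, avg_treated, fold)
```

Because the mismatch signal can exceed the perfect-match signal, the average difference can be zero or negative, in which case a fold change is not meaningful and the sketch raises an error.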
Noise and Repeats • log-log plot • >90% 2 to 3 fold • Multiplicative noise • Repeat experiments • Log scale: dist(4,2) = dist(2,1)
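The last bullet can be checked directly: on a log scale, equal fold changes are equal distances, and multiplicative noise becomes additive. A minimal sketch, with the hypothetical helper `log_dist`:

```python
import math


def log_dist(a, b):
    """Distance between two intensities on the log2 scale."""
    return abs(math.log2(a) - math.log2(b))


d_high = log_dist(4, 2)  # a 2-fold change at higher intensity
d_low = log_dist(2, 1)   # a 2-fold change at lower intensity
print(d_high, d_low)     # both equal 1.0

# A doubling and a halving are symmetric around 0 on the log scale.
print(math.log2(2.0), math.log2(0.5))
```

On the raw scale the same two pairs differ by 2 and by 1, so distances would be dominated by highly expressed genes.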