Basic Introduction to Microarrays
Chitta Baral, Arizona State University
Feb 3, 2003
Basics of Microarrays
• From: The magic of microarrays, S. Friend and R. Stoughton (Scientific American, Feb 2002)
• A microarray has thousands of spots (or array elements).
• Each spot has thousands to millions of copies of a particular single-stranded DNA representing a particular gene.
• Thousands of genes are each assigned a unique spot.
• From two cell samples (say, one treated with a drug and one untreated), collect mRNA, make more stable cDNA from it, and add fluorescent labels (green to untreated, red to treated).
• Apply the labeled cDNAs to the chip.
• Binding occurs when cDNA from a sample finds its complementary sequence of bases on the chip.
• Such binding means that the gene represented by the chip DNA was active, or expressed, in the sample.
• Put the chip in a scanner.
• Calculate the ratio of red to green at each spot and generate a color-coded readout:
  • Red: gene whose activity strongly increased in treated cells
  • Green: gene whose activity strongly decreased in treated cells
  • Yellow: gene that was equally active in treated and untreated cells
  • Black: gene that was inactive in both groups
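The color-coded readout described in this slide can be sketched as a small function. This is only an illustration: the detection threshold, the two-fold cutoff, the gene names, and the intensity values are assumptions made for the example, not values from the slides or from any scanner software.

```python
import math


def classify_spot(red, green, detect=100.0, fold=2.0):
    """Return a readout color for one spot from its red and green intensities.

    `detect` and `fold` are illustrative thresholds chosen for this sketch.
    """
    if red < detect and green < detect:
        return "black"  # inactive in both samples
    # Floor each channel at 1.0 so the ratio stays finite.
    log_ratio = math.log2(max(red, 1.0) / max(green, 1.0))
    if log_ratio >= math.log2(fold):
        return "red"  # more active in treated cells
    if log_ratio <= -math.log2(fold):
        return "green"  # less active in treated cells
    return "yellow"  # about equally active in both


# Hypothetical (red, green) intensities for four made-up genes.
spots = {
    "geneA": (5200.0, 640.0),
    "geneB": (310.0, 2900.0),
    "geneC": (1500.0, 1400.0),
    "geneD": (40.0, 55.0),
}
readout = {gene: classify_spot(r, g) for gene, (r, g) in spots.items()}
```

Working with the log of the ratio makes increases and decreases symmetric: a two-fold increase and a two-fold decrease sit at +1 and -1.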
Determining the impact of a new drug
• Suppose the previous experiment was meant to determine quickly whether a potential new drug is likely to harm the liver.
• Do the red or green genes correspond to gene functions that reflect liver damage? That is, do those genes make proteins whose concentration (high for the red ones, low for the green ones) reflects liver damage?
• Alternatively, compare the overall expression pattern with the patterns produced when cells react to known liver toxins.
• Close similarity would indicate that the new drug is probably toxic as well.
Applications of comparative cDNA hybridization
• Tissue-specific genes
  • Comparative hybridization experiments (CHE) can reveal genes that are preferentially expressed in specific tissues. Some of these genes implement the behaviors that distinguish the cell’s tissue type, while other, controlling genes make sure that the cell performs only the functions for its type.
• Regulatory gene defects in cancer
  • CHE can pinpoint the transcription differences responsible for the change from normal to cancerous cells.
  • CHE can distinguish different patterns of abnormal transcription in heterogeneous cancers.
• Cellular responses to the environment
  • CHE can point to genes whose transcription changes in response to an environmental stimulus.
  • Temporal studies can also identify the order of changes, providing evidence about which genes control the response directly and which are only indirectly affected by it.
• Cell cycle variations
  • CHE can be used to distinguish genes that are expressed at different times in the cell cycle, so that pathways responsible for basic life processes can be uncovered.
Microarray applications and uses
• Characterization of the temporal order of gene expression within a cell
• Determination of the cellular location of gene products
• Prediction of the function of resulting proteins
• Prediction of how perturbing the cellular environment, by changing environmental conditions or administering drugs, affects the program of gene expression
Computational challenges: interpreting the scanned image
• The end point of a CHE is a scanned array image.
• Intensities can be quantified by measuring the average or integrated intensities of the spots.
  • These measurements are subject to noise from irregular spots, dust on the slide, and nonspecific hybridization.
  • Deciding the threshold between spots and background can be difficult, especially when the spots fade gradually around the edges.
  • Detection efficiency may not be uniform across the slide, leading to excessive red intensity on one side and excessive green on the other.
• The ratio of fluorescent intensities for a spot is interpreted as the ratio of concentrations of its corresponding mRNA in the two cell populations.
  • Low levels of cDNA due to reverse transcription bias, sample loss, or an inherently rare mRNA can cause large uncertainties in these ratios.
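As a rough illustration of how spot intensities become ratios, the sketch below subtracts a local background from each channel, marks spots whose net signal is too weak to give a trustworthy ratio, and centers the remaining log ratios on their median. The function name, the `min_signal` cutoff, and the tuple layout are assumptions for this example. Median centering only corrects an overall imbalance between the two channels; it does not address detection efficiency that varies with position on the slide.

```python
import math
from statistics import median


def log_ratios(spots, min_signal=50.0):
    """Compute median-centered log2(red/green) ratios.

    `spots` is a list of (red_fg, red_bg, green_fg, green_bg) tuples, where
    fg is the spot intensity and bg the local background. Spots whose
    background-subtracted signal falls below `min_signal` in either channel
    are reported as None, since their ratios are unreliable.
    """
    raw = []
    for red_fg, red_bg, green_fg, green_bg in spots:
        red = red_fg - red_bg
        green = green_fg - green_bg
        if red < min_signal or green < min_signal:
            raw.append(None)
        else:
            raw.append(math.log2(red / green))
    usable = [value for value in raw if value is not None]
    center = median(usable) if usable else 0.0
    return [None if value is None else value - center for value in raw]


# Made-up intensities: the red channel reads about twice as bright overall.
example = [
    (1100.0, 100.0, 600.0, 100.0),
    (2100.0, 100.0, 1100.0, 100.0),
    (4100.0, 100.0, 1100.0, 100.0),
    (130.0, 100.0, 900.0, 100.0),
]
centered = log_ratios(example)
```

Here the first two spots end up at 0 after centering, the third stays one unit above them, and the fourth is set aside because its red signal is barely above background.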
A typical microarray data set
• Includes expression levels for thousands of genes across hundreds of conditions, such as:
  • Cells of different cell lines
  • Cells under different conditions
  • Pathological tissue specimens from different patients
  • Serial time points following a stimulus to a cell or organism
• Imagine a 2D array of measurements:
  • Rows: measurements associated with individual genes
  • Columns: measurements associated with conditions
  • Profile: the list of measurements along a row or column
  • Features: the individual expression measurements within a profile (some features are more valuable than others, and sometimes focusing on a subset improves results)
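The 2D layout can be made concrete with a tiny made-up matrix. The gene names, condition names, and values below are placeholders; the point is only that the same numbers give one set of profiles when read by row and another when read by column.

```python
# Rows are genes, columns are conditions; all values are invented.
genes = ["g1", "g2", "g3"]
conditions = ["c1", "c2"]
matrix = [
    [2.1, -0.3],
    [0.0, 1.4],
    [-1.2, 0.8],
]

# Reading by row: one profile per gene, one feature per condition.
gene_profiles = dict(zip(genes, matrix))

# Reading by column (the transpose): one profile per condition,
# one feature per gene.
condition_profiles = dict(zip(conditions, (list(col) for col in zip(*matrix))))
```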
Analyzing microarray data
• Any data set can be analyzed in two ways.
  • E.g., 47 expression profiles of 4026 genes collected from lymphoma specimens (Nature 403, 503-511, 2000) can be read as:
    • 47 cancer profiles with 4026 available features
    • 4026 gene profiles with 47 available features
• Supervised and unsupervised methods
  • Supervised: the genes or conditions are associated with labels, coming from outside the experiment, that provide information about a preexisting classification. The information may include knowledge of gene function or regulation, disease subtype, or tissue origin of a cell type.
    • Classification information is used to drive the analysis.
    • Used to predict accurate labels for new genes.
  • Unsupervised: no additional information is used.
    • Geared towards the discovery of patterns in the data, unbiased by outside knowledge.
    • Used for exploratory tasks.
    • Example: clustering
Unsupervised grouping: clustering
• Goal: simplify large gene expression data sets.
• Approach: group similar profiles together based on a distance metric (a formula for calculating how similar two profiles are).
• Distance metrics: the distance between two lists of numbers
  • Euclidean distance (square root of the sum of squared differences)
  • Statistical correlation coefficient (-1 to +1)
• Clustering strategies
  • Hierarchical clustering: calculate the distances between individual data points, then group together those that are close.
    • Distances between groups are then computed and used to create groups of groups.
    • Easy to implement, but suffers because the decisions about where to create branches and in what order to present them are often arbitrary.
  • K-means clustering: requires a parameter k (the expected number of clusters).
    • Initially, cluster centers are selected randomly.
    • In each iteration of the algorithm, every profile is assigned to the cluster whose center it is nearest to, and then each cluster center is recalculated from the profiles within that cluster.
  • Self-organizing maps:
    • Instead of partitioning, they organize the clusters into a map where similar clusters are close to each other.
    • The number and topological configuration of the clusters are pre-specified.
    • Cluster centers are recalculated in each iteration using both the profiles within the cluster and the profiles in adjacent clusters.
• Clustering is sensitive to the features used to compute the distance metric.
• Applications: identifying co-regulated genes, genes with related functions, signatures of individual signalling pathways within the data set, etc.
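The two distance metrics and the k-means loop described in this slide can be sketched in plain Python. This is a minimal illustration under simple assumptions: initial centers are drawn from the profiles themselves, the loop runs for a fixed number of iterations rather than testing for convergence, and the four example profiles are invented.

```python
import math
import random


def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(a, b):
    """Correlation coefficient between two profiles, from -1 to +1."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def k_means(profiles, k, iterations=20, seed=0):
    """Cluster profiles into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    # Pick k distinct profiles at random as the starting centers.
    centers = [list(p) for p in rng.sample(profiles, k)]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest center.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centers[c]))
            for p in profiles
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Four invented profiles: two mostly positive, two mostly negative.
profiles = [
    [1.0, 1.1, 0.9],
    [1.2, 0.9, 1.0],
    [-1.0, -1.1, -0.9],
    [-0.8, -1.0, -1.2],
]
labels, centers = k_means(profiles, k=2)
```

The choice of metric matters: `pearson([1, 2, 3], [2, 4, 6])` is 1.0 because the two profiles have the same shape, while their Euclidean distance is well above zero because their magnitudes differ.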
Supervised grouping: classification
• Approach: take known groupings and create rules for reliably assigning genes or conditions to these groups.
  • E.g., the problem of classifying unknown genes as ribosomal or non-ribosomal
• Success depends on whether high-quality labeled sets are available.
• Examples of methods (machine learning)
  • Logistic regression
    • Uses the feature values for the different groups to estimate the parameters of a predictor function (a model that is linear in the log-odds of group membership).
  • Neural networks
    • Use a set of known examples to create a multi-layered computational network.
  • Linear discriminant analysis
    • Uses the labeled examples from each set of classified cases to estimate a probability distribution for the values of the features in that set.
    • Given a new example, it determines the closest distribution and assigns the example to that set.
  • Inductive logic programming
  • Decision trees
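The idea of assigning a new example to the closest labeled group can be sketched with a nearest-centroid classifier. This is a deliberately simplified stand-in, not full linear discriminant analysis: it summarizes each group only by its mean profile, which matches discriminant analysis only if one assumes equal, uncorrelated feature variances and equally likely groups. The training profiles and the "other" label are invented for the example.

```python
def fit_centroids(profiles, labels):
    """Return the mean profile of each labeled group."""
    groups = {}
    for profile, label in zip(profiles, labels):
        groups.setdefault(label, []).append(profile)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in groups.items()
    }


def predict(centroids, profile):
    """Assign a profile to the group whose mean profile is closest."""
    def squared_distance(center):
        return sum((x - c) ** 2 for x, c in zip(profile, center))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Invented training set: expression profiles with known labels.
train_profiles = [
    [2.0, 1.8, 2.2],
    [1.9, 2.1, 2.0],
    [-0.5, 0.1, 0.3],
    [0.2, -0.4, 0.0],
]
train_labels = ["ribosomal", "ribosomal", "other", "other"]

centroids = fit_centroids(train_profiles, train_labels)
first_guess = predict(centroids, [1.7, 1.9, 2.1])
second_guess = predict(centroids, [0.0, 0.1, -0.2])
```

As the slide notes, the result is only as good as the labeled set: mislabeled training genes shift the group means and every prediction that depends on them.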
Dimension reduction
• Involves removing features from the data set.
  • Features are removed because they do not provide significant incremental information and can confuse the analysis or make it unnecessarily complex.
  • Feature selection often attempts to identify a minimal set of non-redundant features that are useful for classification.
  • E.g., don’t select several co-regulated genes, since they carry redundant information.
• Unsupervised dimension reduction: pruning uninformative features; several methods exist, such as:
  • Principal component analysis (PCA)
    • Automatically detects redundancies in the data and determines a new set of hybrid features (each condensing multiple original features) that are guaranteed to be non-redundant.
    • Advantage: makes the outliers and clusters in a data set apparent and reduces the noise in the data set.
    • Disadvantage: throws away weak signals that could be important.
  • Independent component analysis
• Supervised dimension reduction: feature selection
  • Goals: selecting relevant underlying features and reducing the number of features necessary to classify correctly.
  • A straightforward method:
    • Iteratively apply a supervised classification algorithm that reports weights on all features.
    • After running the classification algorithm the first time, remove the feature with the lowest weight from the data set.
    • Run the algorithm again to determine the second least important feature.
    • Repeat this process while monitoring the classification performance on known examples.
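The iterative feature-removal procedure at the end of this slide can be sketched as follows. The slides do not say which classifier to use, so this example assumes a small logistic regression trained by gradient descent, takes the absolute value of each coefficient as the feature's weight, and measures performance on the training examples themselves. The feature names and data are invented.

```python
import math


def train_logistic(X, y, steps=500, rate=0.1):
    """Fit logistic regression by batch gradient descent."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, target in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, row))
            error = 1.0 / (1.0 + math.exp(-z)) - target
            grad_b += error
            for j, xi in enumerate(row):
                grad_w[j] += error * xi
        w = [wi - rate * g / len(X) for wi, g in zip(w, grad_w)]
        b -= rate * grad_b / len(X)
    return w, b


def accuracy(X, y, w, b):
    """Fraction of examples whose predicted class matches the label."""
    hits = sum(
        (b + sum(wi * xi for wi, xi in zip(w, row)) > 0) == (target == 1)
        for row, target in zip(X, y)
    )
    return hits / len(X)


def eliminate_features(X, y, names, keep=1):
    """Repeatedly drop the lowest-weight feature, recording accuracy."""
    names = list(names)
    history = []
    while True:
        w, b = train_logistic(X, y)
        history.append((list(names), accuracy(X, y, w, b)))
        if len(names) <= keep:
            return history
        worst = min(range(len(names)), key=lambda j: abs(w[j]))
        names.pop(worst)
        X = [row[:worst] + row[worst + 1:] for row in X]


# Invented data: g1 separates the classes strongly, g2 weakly, g3 not at all.
X = [
    [2.0, 0.5, 0.1], [1.8, 0.4, -0.1], [2.2, 0.6, 0.1], [2.0, 0.5, -0.1],
    [-2.0, -0.5, 0.1], [-1.8, -0.4, -0.1], [-2.2, -0.6, 0.1], [-2.0, -0.5, -0.1],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]
history = eliminate_features(X, y, ["g1", "g2", "g3"])
```

Each entry in `history` pairs the surviving feature names with the accuracy at that stage, which is the monitoring step the slide describes: features can keep being removed for as long as accuracy holds up. In practice, accuracy would be measured on examples held out from training.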
Further details
• Types of microarrays: spotted cDNA microarrays, high-density oligonucleotide microarrays
• Fluorescent dyes (for glass arrays) vs. radioactive isotopes (for membrane arrays)
• Interpretations
  • Identifying individual genes (whose regulated expression can explain particular biological phenomena) or assigning potential functions to new genes.
  • Co-regulated genes (often identified using cluster analysis) allow functional classification (they may participate in similar cellular processes or pathways) and potential identification of common regulatory elements (DNA motifs) in promoter sequences.
    • Assumption: genes with closely related expression patterns may be controlled by the same regulatory mechanism.
  • When researchers see differential expression of a gene, they may know its probable function (from NCBI databases) and can form a hypothesis about the role that gene plays in their system.
  • One cluster of genes may relate to one pathway and another cluster to another pathway, hinting at interconnections between such pathways.
  • Gene regulatory networks
• Comparing a large number of samples for a global view
  • The common control can be a mixture of all the samples used.
  • Combining many data sets and analyzing the whole set is very useful.
  • By comparing the expression profiles of tumour samples across many genes, it is possible to identify the genes whose expression characterizes a particular tumour type.
  • Compare the expression signature of a particular tumour type to data generated by measuring the responses of closely related cell lines in culture to many different stimuli, such as hormones and growth factors. Using this strategy, one can draw conclusions about which signalling pathways are activated in a particular tumour type, leading to the identification of pathways that might provide therapeutic targets.
References: main sources used in this presentation
• Basic microarray analysis: grouping and feature reduction. Raychaudhuri et al. Trends in Biotechnology, Vol 19, No 5, May 2001, 189-193.
• Gene expression microarrays and the integration of biological knowledge. Noordewier and Warren. Trends in Biotechnology, Vol 19, No 10, Oct 2001, 412-415.
• http://www.cs.wustl.edu/~jbuhler/research/array
• The magic of microarrays. Friend and Stoughton. Scientific American, Feb 2002.
Other sources that I read
• Navigating gene expression using microarrays: a tech review. Schulze & Downward. Nature Cell Biology, Vol 3, Aug 2001, E190.
• http://www.fargo.ars.usda.gov/ps/micr_the.htm
• Analysis of gene expression by microarrays: cell biologist’s gold mine or minefield? Schulze & Downward. J. of Cell Science 113, 2000, 4151-4156.
• Microarrays: handling the deluge of data and extracting reliable information. Hess et al. Trends in Biotechnology, Vol 19, No 11, Nov 2001, 463-468.