DNA Chips and Their Analysis. Comp. Genomics: Lectures 10-11, based on many sources, primarily Zohar Yakhini
DNA Microarrays: Basics • What are they? • Types of arrays (cDNA arrays, oligo arrays). • What is measured using DNA microarrays? • How are the measurements done?
DNA Microarrays: Computational Questions • Design of arrays. • Techniques for analyzing experiments. • Detecting differential expression. • Similar expression: Clustering. • Other analysis techniques (many, many). • Machine learning techniques, and applications for advanced diagnosis.
What is a DNA Microarray (I) • A surface (nylon, glass, or plastic). • Containing hundreds to thousands of pixels. • Each pixel has copies of a sequence of single stranded DNA (ssDNA). • Each such sequence is called a probe.
What is a DNA Microarray (II) • An experiment with 500-10k elements. • A way to concurrently explore the function of multiple genes. • A snapshot of the expression level of 500-10k genes under given test conditions.
Some Microarray Terminology • Probe: ssDNA printed on the solid substrate (nylon or glass). These are short substrings of the genes we are going to be testing • Target: cDNA which has been labeled and is to be washed over the probe
Back to Basics: Watson and Crick James Watson and Francis Crick discovered, in 1953, the double helix structure of DNA. From Zohar Yakhini
Watson-Crick Complementarity • A binds to T • C binds to G • Perfect match: AATGCTTAGTC / TTACGAATCAG • One-base mismatch: AATGCGTAGTC / TTACGAATCAG From Zohar Yakhini
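The pairing rule can be checked mechanically. The sketch below is only an illustration: the function names `complement` and `mismatches` are made up here, and the strands are compared position by position exactly as they are laid out on the slide (no strand reversal is modeled).

```python
# Watson-Crick pairing rules from the slide: A binds to T, C binds to G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base Watson-Crick complement (no reversal)."""
    return "".join(PAIR[base] for base in strand)


def mismatches(strand_a: str, strand_b: str) -> int:
    """Count positions where strand_b does not pair with strand_a."""
    if len(strand_a) != len(strand_b):
        raise ValueError("strands must have equal length")
    return sum(PAIR[a] != b for a, b in zip(strand_a, strand_b))


# The two duplexes shown on the slide.
perfect = mismatches("AATGCTTAGTC", "TTACGAATCAG")
one_off = mismatches("AATGCGTAGTC", "TTACGAATCAG")
```

Counting mismatches rather than returning a yes/no answer makes it easy to see how far a target is from a perfect match with a probe.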
Array Based Hybridization Assays (DNA Chips) • Array of probes • Thousands to millions of different probe sequences per array. • Unknown sequence or mixture (target). Many copies. From Zohar Yakhini
Array Based Hyb Assays • Target hybridizes to Watson-Crick (WC) complementary probes only • Therefore, the fluorescence pattern is indicative of the target sequence. From Zohar Yakhini
Central Dogma of Molecular Biology (reminder) • Gene (DNA) → (transcription) → mRNA → (translation) → Protein • Cells express different subsets of the genes in different tissues and under different conditions From Zohar Yakhini
Expression Profiling on MicroArrays • Differentially label the query sample and the control (1-3). • Mix and hybridize to an array. • Analyze the image to obtain expression-level information. From Zohar Yakhini
Microarray: 2 Types of Fabrication • cDNA Arrays: deposition of DNA fragments, either (a) deposition of PCR-amplified cDNA clones or (b) printing of already synthesized oligonucleotides • Oligo Arrays: in situ synthesis by (a) photolithography, (b) ink jet printing, or (c) electrochemical synthesis From Steve Hookway’s lecture and Sorin Draghici’s book “Data Analysis Tools for DNA Microarrays”
cDNA Microarrays vs. Oligonucleotide Probes and Cost From Steve Hookway’s lecture and Sorin Draghici’s book “Data Analysis Tools for DNA Microarrays”
Photolithography (Affymetrix) • Similar to process used to generate VLSI circuits • Photolithographic masks are used to add each base • If base is present, there will be a “hole” in the corresponding mask • Can create high density arrays, but sequence length is limited [Figure labels: photodeprotection, mask, C] From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
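The mask idea can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not the actual Affymetrix process: the function name `mask_schedule` is hypothetical, bases are offered in a fixed A, C, G, T cycle, and steps whose mask would have no holes are simply skipped.

```python
from typing import List, Tuple


def mask_schedule(probes: List[str], cycle: str = "ACGT") -> List[Tuple[str, List[int]]]:
    """Simulate base-by-base synthesis with one mask per step.

    Each step offers one base; the mask has a "hole" at every probe
    position whose next unsynthesized base equals the offered base.
    """
    grown = [0] * len(probes)  # number of bases built so far, per probe
    steps = []
    while any(g < len(p) for g, p in zip(grown, probes)):
        progressed = False
        for base in cycle:
            holes = [i for i, p in enumerate(probes)
                     if grown[i] < len(p) and p[grown[i]] == base]
            if holes:
                steps.append((base, holes))
                for i in holes:
                    grown[i] += 1
                progressed = True
        if not progressed:
            raise ValueError("a probe contains a base outside the cycle")
    return steps


# Three toy probes; each step is (offered base, probe indices with a hole).
schedule = mask_schedule(["ACG", "CCA", "GAT"])
```

The number of steps grows with probe length, which is one way to see why sequence length is limited in this fabrication method.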
Photolithography (Affymetrix) From Zohar Yakhini
Ink Jet Printing • Four cartridges are loaded with the four nucleotides: A, G, C, T • As the printer head moves across the array, the nucleotides are deposited in pixels where they are needed. • This way (many copies of) a 20-60 base long oligo is deposited in each pixel. From Steve Hookway’s lecture and Sorin Draghici’s book “Data Analysis Tools for DNA Microarrays”
Ink Jet Printing (Agilent) The array is a stack of images in the colors A, C, G, T. … From Zohar Yakhini
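One way to picture the "stack of images" is to slice the per-pixel oligos into layers, where layer k holds the k-th base of every pixel. The sketch below is illustrative only; `layer_images` is a hypothetical name, and `None` is used here to mark a pixel whose oligo is already complete.

```python
def layer_images(oligos):
    """Split per-pixel oligos into layers: layer k holds base k of every pixel."""
    depth = max(len(o) for o in oligos)
    layers = []
    for k in range(depth):
        # None marks a pixel whose oligo has no base at this layer.
        layers.append([o[k] if k < len(o) else None for o in oligos])
    return layers


# Three pixels with oligos of different lengths.
layers = layer_images(["ACGT", "TTA", "GC"])
```

Each layer is then a four-color image that the printer head can deposit in one pass.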
Inkjet Printed Microarrays Inkjet head, squirting phosphoramidites From Zohar Yakhini
Electrochemical Synthesis • Electrodes are embedded in the substrate to manage individual reaction sites • Electrodes are activated in necessary positions in a predetermined sequence that allows the sequences to be constructed base by base • Solutions containing specific bases are washed over the substrate while the electrodes are activated From “Data Analysis Tools for DNA Microarrays” by Sorin Draghici
Expression Profiling: a FLASH Demo URL: http://www.bio.davidson.edu/courses/genomics/chip/chip.html
Expression Profiling – Probe Design Issues • Probe specificity and sensitivity. • Special designs for splice variations or other custom purposes. • Flat thermodynamics. • Generic and universal systems From Zohar Yakhini
Hybridization Probes • Sensitivity: Strong interaction between the probe and its intended target, under the assay's conditions. How much target is needed for the reaction to be detectable or quantifiable? • Specificity: No potential cross hybridization. From Zohar Yakhini
Specificity • Symbolic specificity • Statistical protection in the unknown part of the genome. Methods, software and application in collaboration with Peter Webb, Doron Lipson. From Zohar Yakhini
Reading Results: Color Coding • Numeric tables are difficult to read • Data is presented with a color scale • Coding scheme: Green = repressed (less mRNA) gene in experiment; Red = induced (more mRNA) gene in experiment; Black = no change (1:1 ratio) • Or: Green = control condition (e.g. aerobic); Red = experimental condition (e.g. anaerobic) • We usually use the ratio. Campbell & Heyer, 2003
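The first coding scheme can be sketched as a small function of the experiment-to-control ratio. The function name `ratio_color`, the use of a log2 ratio, and the tolerance `tol` around the 1:1 ratio are assumptions made for this illustration; real viewers use a continuous color scale rather than three discrete colors.

```python
import math


def ratio_color(experiment: float, control: float, tol: float = 0.1) -> str:
    """Map an experiment/control intensity ratio to the slide's color code."""
    log_ratio = math.log2(experiment / control)
    if log_ratio > tol:
        return "red"    # induced: more mRNA in the experiment
    if log_ratio < -tol:
        return "green"  # repressed: less mRNA in the experiment
    return "black"      # roughly a 1:1 ratio, no change


colors = [ratio_color(4.0, 1.0), ratio_color(1.0, 4.0), ratio_color(1.0, 1.0)]
```

Working with the logarithm of the ratio treats a 4-fold induction and a 4-fold repression symmetrically.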
Thermal Ink Jet Arrays, by Agilent Technologies In-Situ synthesized oligonucleotide array. 25-60 mers. cDNA array, Inkjet deposition
Application of Microarrays • We know the function of only about 30% of the 30,000 genes in the Human Genome • Gene exploration • Functional Genomics • DNA microarrays are just the first among many high throughput genomic devices (first used approx. 1996) http://www.gene-chips.com/sample1.html From Steve Hookway’s lecture and Sorin Draghici’s book “Data Analysis Tools for DNA Microarrays”
A Data Mining Problem • On a given microarray, we test on the order of 10k elements at one time. • The number of microarrays used in a typical experiment is no more than 100. • Insufficient sampling. • Data is obtained faster than it can be processed. • High noise. • Algorithmic approaches to work through this large data set and make sense of the data are desired.
Informative Genes in a Two-Class Experiment • Differentially expressed in the two classes. • Identifying (statistically significant) informative genes: - Provides biological insight - Indicates promising research directions - Reduces data dimensionality - Enables a diagnostic assay From Zohar Yakhini
Scoring Genes • Expression pattern and pathological diagnosis information (annotation), for a single gene:
annotation: + + - - + + + - - + - - + + -
expression: a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15
• Permute the annotation by sorting the expression pattern (ascending, say).
• Informative genes:
+ + + + + + + + - - - - - - -
- - - - - - - + + + + + + + +
- - - - + - - + + - + + + + +
etc
• Non-informative genes:
+ - + - + + + + - - + + - - -
- + + - + - - + + - + + - - +
+ - - - + + - + + - + + - + -
etc
From Zohar Yakhini
Separation Score • Compute a Gaussian fit for each class: $(\mu_1, \sigma_1)$, $(\mu_2, \sigma_2)$. • The Separation Score is $(\mu_1 - \mu_2)/(\sigma_1 + \sigma_2)$.
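A minimal sketch of this score, reading the two fitted pairs as the per-class mean and standard deviation. The function name `separation_score` is hypothetical, and using the population standard deviation (the maximum-likelihood Gaussian fit) is an assumption; the sample standard deviation is an equally common choice.

```python
from statistics import mean, pstdev


def separation_score(class1, class2):
    """(mean1 - mean2) / (sd1 + sd2) for one gene's expression in two classes."""
    mu1, mu2 = mean(class1), mean(class2)
    # pstdev is the maximum-likelihood estimate of a Gaussian's sigma (assumption).
    sigma1, sigma2 = pstdev(class1), pstdev(class2)
    return (mu1 - mu2) / (sigma1 + sigma2)


# Toy expression values for one gene in tumors vs. normals.
score = separation_score([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])
```

The score is undefined when both classes have zero spread, so a real pipeline would need to guard against a zero denominator.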
Threshold Error Rate (TNoM) Score • Find the threshold that best separates tumors from normals, count the number of errors committed there. • Ex 1: - + + - + - - + + - + + - - + [figure labels: 6 7] # of errors = min(7,8) = 7. Not informative. • Ex 2: + + + + + + + + - - - - - - - A perfect single gene classifier gets a score of 0. Very informative. From Zohar Yakhini
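The threshold search can be sketched as a single pass over the labels after sorting by expression. This is one plausible reading of the score, not necessarily the exact definition used in the lecture: the function name `tnom` is hypothetical, both orientations (which side of the threshold is called "+") are tried, and expression values are assumed to be distinct.

```python
def tnom(expression, labels):
    """Fewest misclassifications achievable by any single expression threshold."""
    # Sort the labels by expression value (ascending).
    ordered = [lab for _, lab in sorted(zip(expression, labels), key=lambda p: p[0])]
    total_pos = ordered.count("+")
    total_neg = len(ordered) - total_pos
    # A threshold below all samples puts everything on one side.
    best = min(total_pos, total_neg)
    pos_left = neg_left = 0
    for lab in ordered:
        if lab == "+":
            pos_left += 1
        else:
            neg_left += 1
        # Either call the left side "-" and the right side "+", or the reverse.
        errors_left_neg = pos_left + (total_neg - neg_left)
        errors_left_pos = neg_left + (total_pos - pos_left)
        best = min(best, errors_left_neg, errors_left_pos)
    return best


# A perfectly separating gene: 8 "+" samples followed by 7 "-" samples.
perfect_score = tnom(range(1, 16), "+" * 8 + "-" * 7)
```

A low score means a single threshold on this gene's expression classifies the samples almost perfectly.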
p-Values • Relevance scores are more useful when we can compute their significance: • p-value: The probability of finding a gene with a given score if the labeling is random • p-Values allow for higher level statistical assessment of data quality. • p-Values provide a uniform platform for comparing relevance, across data sets. • p-Values enable class discovery From Zohar Yakhini
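When no closed form is at hand, a p-value of this kind can be estimated by shuffling the labels. The sketch below is a generic permutation estimate, not the exact computation used in the lecture; `threshold_errors`, `permutation_p_value`, the trial count, and the seed are all choices made for this illustration, and lower scores are treated as better.

```python
import random


def threshold_errors(expression, labels):
    """Error count of the best single threshold (lower is better)."""
    ordered = [lab for _, lab in sorted(zip(expression, labels), key=lambda p: p[0])]
    n_pos = ordered.count("+")
    n_neg = len(ordered) - n_pos
    best = min(n_pos, n_neg)
    for k in range(1, len(ordered) + 1):
        pos_left = ordered[:k].count("+")
        neg_left = k - pos_left
        best = min(best, pos_left + n_neg - neg_left, neg_left + n_pos - pos_left)
    return best


def permutation_p_value(score_fn, expression, labels, trials=2000, seed=0):
    """Estimate P(score at least as good as observed) under random labeling."""
    rng = random.Random(seed)
    observed = score_fn(expression, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if score_fn(expression, shuffled) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)


values = list(range(15))
p_informative = permutation_p_value(threshold_errors, values, "+" * 8 + "-" * 7)
p_alternating = permutation_p_value(threshold_errors, values, "+-" * 7 + "+")
```

Because the estimate depends only on the score function, the same routine gives comparable p-values for different relevance scores and data sets.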
BRCA1 Differential Expression (collaboration with NIH, NEJM 2001) [Figure labels: genes over-expressed in BRCA1 mutants; genes under-expressed in BRCA1 mutants; BRCA1 mutants; BRCA1 wildtype; sporadic sample s14321 with BRCA1-mutant expression profile] From Zohar Yakhini
Lung Cancer Informative Genes Data from Naftali Kaminski’s lab, at Sheba. • 24 tumors (various types and origins) • 10 normals (normal edges and normal lung pools) From Zohar Yakhini
And Now: Global Analysis of Gene Expression Data Most common tasks: • Construct gene network from experiments. • Cluster - either genes, or experiments
Pearson Correlation Coefficient, r. Values are in the [-1,1] interval • Gene expression over d experiments is a vector in R^d, e.g. for gene C: (0, 3, 3.58, 4, 3.58, 3) • Given two vectors X and Y that contain N elements, we calculate r as follows: Cho & Won, 2003
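The standard definition of the Pearson coefficient, written with the slide's X, Y and N (the bar notation for the means is an assumption about notation, not taken from the cited source):

```latex
r \;=\; \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}
             {\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\;\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}},
\qquad
\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i,\quad
\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i .
```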
Intuition for Pearson Correlation Coefficient r(v1,v2) close to 1: v1, v2 highly correlated. r(v1,v2) close to -1: v1, v2 anti correlated. r(v1,v2) close to 0: v1, v2 not correlated.
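The three cases can be reproduced with a few lines of code. This is a plain-Python sketch of the standard Pearson formula; the function name `pearson` and the toy vectors are chosen here for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


gene_c = [0, 3, 3.58, 4, 3.58, 3]
correlated = pearson(gene_c, [2 * v + 1 for v in gene_c])  # scaled and shifted copy
anti = pearson(gene_c, [-v for v in gene_c])               # sign-flipped copy
unrelated = pearson([1, 2, 3, 4], [1, -1, -1, 1])          # no linear trend
```

Note that r is unchanged by scaling and shifting a profile, so two genes with very different absolute expression levels can still be highly correlated.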
Pearson Correlation and p-Values When the entries of v1, v2 are normally distributed, we can assign (and efficiently compute) p-values for a given result. These p-values are determined by the Pearson correlation coefficient, r, and the dimension, d, of the vectors. For the same r, vectors of higher dimension are assigned a more significant (smaller) p-value.
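One standard way to obtain such p-values, assuming the null hypothesis of zero correlation between normally distributed entries, is through a t-statistic with d - 2 degrees of freedom (the symbol t is introduced here and does not appear on the slide):

```latex
t \;=\; r\,\sqrt{\frac{d-2}{1-r^{2}}}, \qquad t \sim \text{Student's } t_{\,d-2} \ \text{under the null.}
```

For a fixed r, |t| grows like the square root of d - 2, which is why longer vectors yield smaller p-values. For example, r = 0.6 gives t = 1.5 at d = 6 but t = 6 at d = 66.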
Spearman Rank Order Coefficient (a close relative of Pearson, non parametric) • Replace each entry xi by its rank in vector x. • Then compute Pearson correlation coefficients of rank vectors. • Example: X = Gene C = (0, 3.00, 3.41, 4, 3.58, 3.01) Y = Gene D = (0, 1.51, 2.00, 2.32, 1.58, 1) • Ranks(X) = (1,2,4,6,5,3) • Ranks(Y) = (1,3,5,6,4,2) • Ties should be taken care of, but: (1) rare (2) can randomize (small effect)
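The two steps (rank, then Pearson) can be sketched as follows, using the Gene C and Gene D example. The function names are hypothetical, and ties are broken by position rather than randomized, which is a simplification.

```python
import math


def ranks(values):
    """Rank of each entry (1 = smallest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


gene_c = [0, 3.00, 3.41, 4, 3.58, 3.01]
gene_d = [0, 1.51, 2.00, 2.32, 1.58, 1]
rho = spearman(gene_c, gene_d)
```

Because only the ordering of the values matters, the result is insensitive to outliers and to any monotone rescaling of the expression levels.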
From Pearson Correlation Coefficients to a Gene Network • Compute correlation coefficient for all pairs of genes (what about missing data?) • Choose p-value threshold. • Put an edge between gene i and gene j iff the p-value is below the threshold (i.e. the correlation is significant).
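A minimal sketch of the network construction. To stay self-contained it thresholds |r| instead of the p-value; for vectors of a fixed dimension d the p-value decreases as |r| grows, so the two cutoffs select the same edges. The names `correlation_network` and `min_abs_r`, the toy profiles, and the absence of missing-data handling are assumptions of this illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_network(profiles, min_abs_r):
    """Return undirected edges between genes whose |r| reaches the cutoff."""
    edges = set()
    for gene_i, gene_j in combinations(sorted(profiles), 2):
        r = pearson(profiles[gene_i], profiles[gene_j])
        # Negative correlations also count, since only significance matters here.
        if abs(r) >= min_abs_r:
            edges.add((gene_i, gene_j))
    return edges


profiles = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8.5],
    "C": [1, -1, -1, 1],
}
network = correlation_network(profiles, min_abs_r=0.9)
```

A gene with a constant profile has zero variance and would cause a division by zero, so such genes need to be filtered out first.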
Things May Get Messy • What to do with significant yet negative correlation coefficients? Usually care only about the p-value and put a “normal edge” • Cases composed of multiple experiments where distribution is far from normal.
What to do when things Get Messy 1) Create a single vector of all experiments per gene. Compute correlations based on these vectors. This is the common approach. Disadvantage: Outcome is dominated by the larger experiments.
What to do when things Get Messy 2) For each edge, count the no. of experiments where it appears significantly. Take edges exceeding some threshold. Disadvantage: Outcome is somewhat dominated by experiments with many significant correlations.
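The counting approach can be sketched as a vote over per-experiment edge sets. The function name `vote_edges`, the `min_votes` parameter, and the use of "at least" rather than "strictly more than" the threshold are choices made for this illustration.

```python
from collections import Counter


def vote_edges(per_experiment_edges, min_votes):
    """Keep edges that are significant in at least min_votes experiments."""
    votes = Counter()
    for edges in per_experiment_edges:
        # Normalize so (a, b) and (b, a) count as the same undirected edge,
        # and so each experiment votes at most once per edge.
        votes.update({tuple(sorted(edge)) for edge in edges})
    return {edge for edge, count in votes.items() if count >= min_votes}


# Significant edges found separately in three hypothetical experiments.
experiments = [
    {("A", "B"), ("B", "C")},
    {("B", "A")},
    {("A", "B"), ("A", "C")},
]
consensus = vote_edges(experiments, min_votes=2)
```

Each experiment contributes one vote per edge regardless of its size, which avoids the dominance of large experiments but still favors experiments that report many significant correlations.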