Introduction to Microarray Dr G. P. S. Raghava
Molecular Biology Overview [Figure labels: Cell, Nucleus, Chromosome, Gene (DNA), Gene (mRNA), single strand, Protein]
Measuring Gene Expression Idea: measure the amount of mRNA to see which genes are being expressed in (used by) the cell. Measuring protein would be more direct, but is currently harder.
The Goals • Basic Understanding • Arrays can take a snapshot of which subset of genes in a cell is actively making proteins • Heat shock experiments • Medical diagnosis • Microarrays can indicate where mutations lie that might be linked to a disease. Still others are used to determine if a person’s genetic profile would make him or her more or less susceptible to drug side effects • 1999 – A genechip containing 6800 human genes was used to distinguish between myeloid leukemia and lymphoblastic leukemia using a set of 50 genes that have different activity levels • Drug design • Pharmaceutical firms are in a rush to translate the human genome results into new products • Potential profits are huge • First, though, they must figure out what the genes do, how they interact, and how they relate to diseases. • Evaluation, Specificity, Response
Microarray Potential Applications • Biological discovery • new and better molecular diagnostics • new molecular targets for therapy • finding and refining biological pathways • Recent examples • molecular diagnosis of leukemia, breast cancer, ... • appropriate treatment for genetic signature • potential new drug targets
History • 1980s: antibody-based assay (protein chip?) • ~1991: high-density DNA-synthetic chemistry (Affymetrix/oligo chips) • ~1995: microspotting (Stanford Univ/cDNA chips) • replacing porous surface with solid surface • replacing radioactive label with fluorescent label • improvement on sensitivity
What is a DNA Microarray? • Genes or gene fragments attached to a substrate (glass) • Tens of thousands of spots/genes = entire genome in 1 experiment • A Revolution in Biology [Figure labels: hybridized slide, two dyes, image analyzed]
Gene Expression Microarrays The main types of gene expression microarrays: • Short oligonucleotide arrays (Affymetrix) • cDNA or spotted arrays (Brown/Botstein) • Long oligonucleotide arrays (Agilent Inkjet)
Terms/Jargon
Stanford/cDNA chip: one slide per experiment • one spot • 1 gene => one spot or a few spots (replicates) • control: control spots • control: two fluorescent dyes (Cy3/Cy5)
Affymetrix/oligo chip: one chip per experiment • one probe/feature/cell • 1 gene => many probes (20~25 mers) • control: match and mismatch cells
Affymetrix Microarrays [Figure labels: raw image, 50 µm, 1.28 cm] • ~10^7 oligonucleotides, half Perfectly Match mRNA (PM), half have one Mismatch (MM) • Raw gene expression is intensity difference: PM - MM
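The PM - MM difference can be sketched in a few lines of Python. This is only an illustration: the slide gives the per-probe difference, while averaging the differences over all probe pairs of a gene is an assumption made here, and the function name and intensity values are hypothetical.

```python
from statistics import mean

def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one gene.

    Averaging across probe pairs is an assumption for this sketch;
    the slide only states the per-probe difference PM - MM.
    """
    if len(pm) != len(mm):
        raise ValueError("PM and MM must contain the same number of probes")
    return mean([p - m for p, m in zip(pm, mm)])

# Hypothetical intensities for four probe pairs of a single gene.
pm = [1200.0, 950.0, 1430.0, 1100.0]
mm = [300.0, 410.0, 380.0, 290.0]
signal = average_difference(pm, mm)
```

A mismatch probe that binds more strongly than its perfect-match partner gives a negative difference, which is one reason the raw value needs further processing.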
GeneChip DNA Microarrays • Each probe consists of thousands of strands of identical oligonucleotides • The DNA sequences at each probe represent important genes (or parts of genes) • Printing Systems • Ex: HP, Corning Inc. • Printing systems can build lengths of DNA up to 60 nucleotides long • 1.28 x 1.28+ cm glass wafer • Each “print head” has a diameter of ~100 µm, and heads are separated by ~100 µm (5,000 – 20,000 probes) • Photolithographic Chips • Ex: Affymetrix • 1.28 x 1.28 cm glass/silicon wafer • 24 x 24 µm probe site (500,000 probes) • Lengths of DNA up to 25 nucleotides long • Requires a new set of masks for each new array type
The Process (In-vitro Transcription) [Figure: cells → poly-A RNA → cDNA → IVT with 10% biotin-labeled uracil → antisense cRNA → fragment (heat, Mg2+) → labeled fragments → hybridize → wash/stain → scan]
Hybridization and Staining [Figure: biotin-labeled cRNA + GeneChip → hybridized array; hybridized array + SAPE (streptavidin-phycoerythrin)]
Microarray Data • First, the Problems: • The fabrication process is not error free • Probes have a maximum length of 25-60 nucleotides • Biological processes such as hybridization are stochastic • Background light may skew the fluorescence • How do we decide if/how strongly a particular gene is being expressed? • Solutions to these problems are still in their infancy
Affymetrix “Gene chip” system • Uses 25 base oligos synthesized in place on a chip (20 pairs of oligos for each gene) • RNA labeled and scanned in a single “color” • one sample per chip • Can have as many as 20,000 genes on a chip • Arrays get smaller every year (more genes) • Chips are expensive • Proprietary system: “black box” software, can only use their chips
cDNA Microarray Technologies • Spot cloned cDNAs onto a glass microscope slide • usually PCR amplified segments of plasmids • Label 2 RNA samples with 2 different colors of fluorescent dye - control vs. experimental • Mix two labeled RNAs and hybridize to the chip • Make two scans - one for each color • Combine the images to calculate ratios of amounts of each RNA that bind to each spot
cDNA microarrays • PRINT: cDNA from one gene on each spot • SAMPLES: cDNA labelled red/green • Compare the genetic expression in two samples of cells, e.g. treatment/control, normal/tumor tissue
• HYBRIDIZE: Add equal amounts of labelled cDNA samples to microarray. • SCAN [Figure labels: laser, detector]
“Long Oligos” • Like cDNAs, but instead of using a cloned gene, design a 40-70 base probe to represent each gene • Relies on genome sequence database and bioinformatics • Reduces cross hybridization • Cheaper and possibly more sensitive than Affy. system
Images from scanner • Resolution • standard 10 µm [currently, max 5 µm] • 100 µm spot on chip = 10 pixels in diameter • Image format • TIFF (tagged image file format) 16 bit (65,536 levels of grey) • 1 cm x 1 cm image at 16 bit = 2 MB (uncompressed) • other formats exist, e.g. SCN (used at Stanford University) • Separate image for each fluorescent sample • channel 1, channel 2, etc.
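The roughly 2-megabyte figure and the 10-pixel spot diameter follow directly from the scan resolution. A small sketch of the arithmetic (the function name is hypothetical):

```python
def image_size_bytes(width_cm, height_cm, pixel_um, bits_per_pixel=16):
    """Uncompressed size of one scanned channel, in bytes."""
    um_per_cm = 10_000
    pixels_wide = round(width_cm * um_per_cm / pixel_um)
    pixels_high = round(height_cm * um_per_cm / pixel_um)
    return pixels_wide * pixels_high * bits_per_pixel // 8

# 1 cm x 1 cm at 10 um per pixel: 1000 x 1000 pixels, 2 bytes each.
size_standard = image_size_bytes(1, 1, 10)

# The same area at 5 um per pixel has four times as many pixels.
size_fine = image_size_bytes(1, 1, 5)

# A 100 um spot scanned at 10 um per pixel spans 10 pixels.
spot_diameter_px = 100 // 10
```

Because each fluorescent channel is stored as a separate image, a two-dye slide needs twice this amount.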
Processing of images • Addressing or gridding • Assigning coordinates to each of the spots • Segmentation • Classification of pixels either as foreground or as background • Intensity determination for each spot • Foreground fluorescence intensity pairs (R, G) • Background intensities • Quality measures
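As a rough illustration of segmentation and intensity determination, the sketch below classifies the pixels of one small image patch with a fixed circle around an assumed spot centre. Fixed-circle segmentation, the mean foreground and the median background are choices made for this example only; the slide does not name a specific method, and all function names are hypothetical.

```python
from statistics import mean, median

def segment_spot(patch, cx, cy, radius):
    """Split the pixels of a patch into foreground (inside the circle)
    and background (outside the circle)."""
    foreground, background = [], []
    for y, row in enumerate(patch):
        for x, value in enumerate(row):
            inside = (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
            (foreground if inside else background).append(value)
    return foreground, background

def spot_intensities(patch, cx, cy, radius):
    """Return (foreground intensity, background intensity) for one spot."""
    foreground, background = segment_spot(patch, cx, cy, radius)
    return mean(foreground), median(background)

# Hypothetical 5 x 5 patch: a bright spot on a dim, uniform background.
patch = [[100] * 5 for _ in range(5)]
for x, y in [(2, 2), (1, 2), (3, 2), (2, 1), (2, 3)]:
    patch[y][x] = 1000

fg, bg = spot_intensities(patch, cx=2, cy=2, radius=1)
```

Running this once per channel gives the (Rfg, Rbg) and (Gfg, Gbg) pairs that later steps use.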
Images in analysis software • The two 16-bit images (Cy3, Cy5) are compressed into 8-bit images • Display fluorescence intensities for both wavelengths using a 24-bit RGB overlay image • RGB image : • Blue values (B) are set to 0 • Red values (R) are used for Cy5 intensities • Green values (G) are used for Cy3 intensities • Qualitative representation of results
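The overlay construction can be sketched as follows. The slide does not say how the 16-bit values are compressed to 8 bits; keeping the top 8 bits (a linear scaling) is an assumption made here, and the function names are hypothetical.

```python
def to_8bit(value16):
    """Compress a 16-bit intensity (0..65535) to 8 bits (0..255).

    Assumption: plain linear scaling by dropping the low 8 bits.
    Analysis software may use a different transform.
    """
    return value16 >> 8

def overlay_pixel(cy5, cy3):
    """One 24-bit RGB pixel: R from Cy5, G from Cy3, B fixed at 0."""
    return (to_8bit(cy5), to_8bit(cy3), 0)

def overlay_image(cy5_image, cy3_image):
    """Combine two same-sized 16-bit images into one RGB overlay."""
    return [
        [overlay_pixel(r, g) for r, g in zip(row5, row3)]
        for row5, row3 in zip(cy5_image, cy3_image)
    ]

# Equal signal in both channels appears yellow; Cy5 only appears red.
yellow = overlay_pixel(65535, 65535)
red = overlay_pixel(65535, 0)
```

The compression discards most of the dynamic range, which is why the overlay is only a qualitative view of the results.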
Pseudo-colour overlay Cy3 Cy5 Images : examples
Quantification of expression For each spot on the slide we calculate Red intensity = Rfg - Rbg (fg = foreground, bg = background) and Green intensity = Gfg - Gbg and combine them in the log (base 2) ratio Log2(Red intensity / Green intensity)
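A minimal sketch of this calculation for a single spot. Returning None when a background-corrected intensity is not positive is an assumption of this example (the logarithm is undefined there); the slide does not say how such spots are handled.

```python
import math

def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Log2(Red intensity / Green intensity) for one spot.

    Returns None if either background-corrected intensity is <= 0
    (an assumption of this sketch, since the log is undefined there).
    """
    red = r_fg - r_bg
    green = g_fg - g_bg
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)

# Red is four times green after background correction: log ratio of 2.
up = log_ratio(r_fg=2100, r_bg=100, g_fg=600, g_bg=100)

# Swapping the channels gives the same magnitude with opposite sign.
down = log_ratio(r_fg=600, r_bg=100, g_fg=2100, g_bg=100)
```

The symmetry around zero is the practical reason for using the log ratio: a fourfold increase and a fourfold decrease sit at +2 and -2.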
Gene Expression Data
Gene expression data on p genes for n slides: p is O(10,000), n is O(10-100), but growing.

Genes   slide 1   slide 2   slide 3   slide 4   slide 5   ...
1          0.46      0.30      0.80      1.51      0.90   ...
2         -0.10      0.49      0.24      0.06      0.46   ...
3          0.15      0.74      0.04      0.10      0.20   ...
4         -0.45     -1.03     -0.79     -0.56     -0.32   ...
5         -0.06      1.06      1.35      1.09     -1.09   ...

Gene expression level of gene 5 in slide 4 = Log2(Red intensity / Green intensity)
These values are conventionally displayed on a red (>0), yellow (0), green (<0) scale.
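One way to turn a log ratio into a colour on the red/yellow/green scale is sketched below. The linear blend and the saturation limit of 2 are assumptions chosen for illustration; display software may use a different mapping.

```python
def ratio_colour(m, limit=2.0):
    """Map a log2 ratio to an (R, G, B) colour.

    m > 0 shades from yellow towards red, m < 0 from yellow towards
    green, and m == 0 is yellow. Values beyond +/- limit are clamped.
    The limit of 2.0 is a hypothetical choice for this sketch.
    """
    t = max(-limit, min(limit, m)) / limit  # fraction in [-1, 1]
    if t >= 0:
        return (255, round(255 * (1 - t)), 0)
    return (round(255 * (1 + t)), 255, 0)

unchanged = ratio_colour(0.0)   # yellow
induced = ratio_colour(2.0)     # pure red
repressed = ratio_colour(-2.0)  # pure green
```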
Biological question (differentially expressed genes, sample class prediction, etc.) → Experimental design → Microarray experiment → 16-bit TIFF files → Image analysis → (Rfg, Rbg), (Gfg, Gbg) → Normalization → R, G → Estimation, Testing, Clustering, Discrimination → Biological verification and interpretation
Quality control (-> Flag) • How good are foreground and background measurements? • Variability measures in pixel values within each spot mask • Spot size • Circularity measure • Relative signal to background intensity • Dapple: • b-value: fraction of background intensities less than the median foreground intensity • p-score: extent to which the position of a spot deviates from a rigid rectangular grid • Flag spots based on these criteria
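The b-value as defined on this slide can be computed directly from the pixel lists of one spot. This is a sketch of the stated definition only, not Dapple's actual code, and the function name and pixel values are hypothetical.

```python
from statistics import median

def b_value(foreground, background):
    """Fraction of background intensities that are less than the
    median foreground intensity (definition taken from the slide)."""
    median_fg = median(foreground)
    below = sum(1 for value in background if value < median_fg)
    return below / len(background)

# Hypothetical pixel intensities for one spot.
foreground = [500, 600, 700]
background = [100, 200, 650, 900]
b = b_value(foreground, background)
```

A value near 1 means the spot stands clearly above its surroundings. The cut-off below which a spot is flagged is a choice left to the analyst.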
Replication • Why? • To reduce variability • To increase generalizability • What is it? • Duplicate spots • Duplicate slides • Technical replicates • Biological replicates
Practical Application of DNA Microarrays • DNA Microarrays are used to study gene activity (expression) • What proteins are being actively produced by a group of cells? • “Which genes are being expressed?” • How? • When a cell is making a protein, it transcribes the genes (made of DNA) which code for the protein into RNA used in its production • The RNA present in a cell can be extracted • If a gene has been expressed in a cell • RNA will bind to its complementary sequence on the array • RNA with no complementary site will wash off the array • The RNA can be “tagged” with a fluorescent dye to determine its presence • DNA microarrays provide a high throughput technique for quantifying the presence of specific RNA sequences
Analysis and Management of Microarray Data • Magnitude of Data • Experiments • 50 000 genes in human • 320 cell types • 2000 compounds • 3 time points • 2 concentrations • 2 replicates • Data Volume • 4 × 10^11 data points • 10^15 bytes = 1 petabyte of data
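The data-point count is the product of the experimental factors listed above, which a few lines of Python confirm. The slide does not explain how it gets from data points to a petabyte, so the bytes-per-data-point figure below is only what those two numbers imply together, not a stated fact.

```python
genes = 50_000
cell_types = 320
compounds = 2_000
time_points = 3
concentrations = 2
replicates = 2

# One measurement per gene for every combination of conditions.
data_points = (genes * cell_types * compounds
               * time_points * concentrations * replicates)

# Storage per data point implied by the slide's 1 petabyte total.
petabyte = 10 ** 15
implied_bytes_per_point = petabyte / data_points

print(f"{data_points:.2e} data points")
print(f"about {implied_bytes_per_point:.0f} bytes per data point implied")
```

The exact product is 3.84 × 10^11, which the slide rounds to 4 × 10^11.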