1 / 58

Microarray - Introduction

Microarray - Introduction. Ka-Lok Ng Asia University. Topics to be covered. Introduction - RNA expression   Experimental design, image processing, Microarray databases   Data normalization, filter and analysis MATLAB   Statistical analysis of gene expression data   Clustering methods

strom
Download Presentation

Microarray - Introduction

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Microarray - Introduction Ka-Lok Ng Asia University

  2. Topics to be covered • Introduction - RNA expression   • Experimental design, image processing, Microarray databases   • Data normalization, filter and analysis • MATLAB   • Statistical analysis of gene expression data   • Clustering methods • Time series data (cell cycle) and dynamics programming • Gene regulatory networks  • Gene regulatory networks and protein-protein interaction networks

  3. 40% on quiz, classwork, homework, class attendance  • Mid-term – 30%,final exam. – 30%, oral presentation

  4. References • Causton H., Quackenbush J., and Brazma A. Microarray Gene Expression Data Analysis. A Beginner’s Guide. Blackwell (2003) • Baxevanis A. and Ouellette B.F. Francis. Bioinformatics Ch. 16. J. Wiley (2005) • Knudsen S. A Biologist’s Guide to Analysis of DNA Microarray Data. J. Wiley (2002) • Benfey P. and Protopapas A.D. Genomics Ch. 5. Prentice Hall (2005). • Setubal J. and Meidanis J. Introduction to computational molecular biology. PWS publishing. (1997). • A. Gu´enoche (2005). “about the design of oligo-chips”, Discrete Applied Mathematics, v147(1), pp.57-67.

  5. Contents • Introduction – the central dogma of molecular biology, applications, data analysis, Microarray slide surface • Printing technologies – spotting, photolithography, ink-jet • Selection of genes for spotting on arrays • Selection of primers for PCR – suffix tree • Microarray application - four different types of brain tumors • Gene co-expression and gene expression profile • Data management

  6. Knowledge is the process of piling up facts; wisdom lies in their simplification. Martin H. Fischer

  7. Introduction • The last 10 years have brought spectacular achievements in genome sequencing (such as the HGP) • It took >1000 years for science to progress from human anatomy to understand how genomes function) • Even if we assume all the genes have correctly identified, the results represents only sequence • High throughput DNA sequencing technology created a system approach to biology

  8. The central dogma of molecular biologyhttp://www.hort.purdue.edu/hort/courses/HORT250/lecture%2004 • Glossary • Transcripts – mRNA • Transcriptome – the complete set of transcripts • Hybridization

  9. Microarray technology allow one to identify the genes that are expressed in different cell types, to learn how their expression levels change in different developmental stages or disease states, and to identify the cellular processes in which they participate • Microarray technology provide clues about how genes and gene products interact and their interaction networks Microarray gene expression data analysis • Experimental design  data transformations from raw data to gene expression matrices  data mining and analysis of gene expression matrices

  10. What are microarrays and how do they work ? • A microarray is typically a glass or polymer slide • DNA molecules are attached at fixed locations called spots or features

  11. Smooth surface enables even deposition of surface chemistries and perfect spot morphology.

  12. What are microarrays and how do they work ? • ~10,000 spots on an array • each spot contains ~107 of identical DNA of lengths from 10s to 100s of bp • spots are either printed on the microarrays by a robot or jet, or synthesised by photolithography (石版影印術) or by inkjet printing Principle of cDNA microarraysEST fragments arrayed in 96- or 384-well plates are spotted at high density onto a glass microarray slide. Subsequently, two different fluorescently labeled cDNA populations derived from independent mRNA samples are hybridized to the array.

  13. (A) Tweezer or split-pin designs transfer low nanoliter (10-9 liter) amounts of DNA to the array by capillary action as the tip strikes the solid surface.(B) TeleChemTMtips and pins apply small droplets by contact between the pin and substrate.(C) The pin-and-loop design picks up the DNA in a small loop, and a pin stamps solution on a slide at a uniform density.(D) Ink jetsspray picoliter (10-12 liter) droplets of liquid under pressure.Robotic spotting, capillary action, the DNA sticks through hydrostatic interactionsThe spacing between spot centers is specified form 120-250 mm according to the density required. The entire microarray usually covers an area 2.5x5.0 cm, though shorter grids can be printed when fewer clones are to be represented. Types of printing pins

  14. DNA spotting I • DNA spotting usually uses multiple pins • DNA in microtiter plate • DNA usually PCR amplified • Oligonucleotides can also be spotted

  15. Commercial DNA spotter

  16. Oligonucleotide microarrays – pioneered by AffymetrixAffymetrix GeneChips • Oligonucleotides • Usually at least 20–25 bases in length, optimal with 45~60 bp long • 10–20 different oligonucleotides for each gene • Oligonucleotides for each gene selected by computer program to be the following: • Unique in genome (4 (20 to 25) =2(40to 50) >> 3*109 = 230), not likely to appear twice • Non-overlapping (if the sequence length is too short then specificity is low, whereas if the length is too long, self-hybridization could happen) • Composition based on design rules • Empirically derived rules (ratio of G-C pairs vs. A-T pairs which could affect the melting temperature of the seq., ie. Tm = 64.9+0.41*(GC%)-675/L)

  17. Construction of oligonucleotide arrays.Oligonucleotide are synthesized in situ in the silicon chip. (A) In each step, a flash of light “deprotects” the oligonucleotides at the desired location on the chip; then “protected” nucleotides of one of the four types (A, C, G or T) are added so that a single nucleotide can add to the desired chains. There are four types of masks according to the added nucleotide. Oligonucleotide microarrays – pioneered by Affymetrix

  18. Oligonucleotide microarrays Construction of oligonucleotide arrays.The light flash is produced by photolithography using a mask to allow light to strike only the required features on the surface of the chip.

  19. lamp mask chip Photolithography • Light-activated chemical reaction • For addition of bases to growing oligonucleotide • Custom masks • Prevent light from reaching spots where bases not wanted • Mirrors also used • NimbleGen™ uses this approach

  20. light Example: building oligonucleotides by photolithography • Want to add nucleotide G • Mask all other spots on chip • Light shines only where addition of G is desired (light “deprotects” the oligonucleotides at the desired location on the chip) • G added and reacts • Now G is on subset of oligonucleotides

  21. Design of oligonucleotides by photolithography • There are four types of masks according to the added nucleotide. Given a set of oligos to synthesize, the mask is a common supersequence of the oligo set or, in other words, each oligo is a subsequence of the mask sequence (characters may be separated, but they remain in the same order. • To minimize the number of masks necessary to build a supersequence of a given set of words, so-called the shortest common supersequence problem, or SCS-problem, is a NP-hard problem. • We call realization of an oligo a sequence of masks capable to synthesis it. The number of realizations • Count the realizations of the probe sequence GTATC (L=5) in the mask sequence GGTTATC (L=7). • It is found that the following four sets of positions can match the probe sequences; (1,3,5,6,7), (1,4,5,6,7), (2,3,5,6,7) and (2,4,5,6,7).

  22. Design of oligonucleotides by photolithography • Count the realizations of the probe sequence ATTAC in the mask sequence ATTATTACAC. The left and right copies are indicated by sign + and -. The instances of identical characters in these intervals are marked by a x. • The realizations (1,2,3,4,8), (1,3,5,7,8), …. Total number of possible paths from Start to End is 23. • Circle denotes the possible position of probe sequence within mask sequence. • Edge denotes consecutive positions in the probe sequence. 吳哲賢 生物晶片之探針辨識數目問題 第二十四屆組合數學與計算理論研討會

  23. light Example: adding a second base • Want to add T • New mask covers spots where T not wanted • Light shines on mask • T added • Continue for all four bases • Need 80 masks for total 20-mer oligonucleotide

  24. Ink-jet printer microarrays • Ink-jet printhead draws up DNA • Printhead moves to specific location on solid support • DNA ejected through small hole • Used to spot DNA or synthesize oligonucleotides directly on glass slide • Use pioneered by Agilent Technologies, Inc.

  25. Comparisons of microarrays Photolithograhy Mechanical printing Ink-jet printing

  26. Comparison of microarray hybridization • Spotted microarrays • Competitive hybridization • Two labeled cDNAs hybridized to same slide measure the relative difference between the signal intensity of two targets binding to the same spot of DNA • Affymetrix GeneChips • One labeled RNA population per chip • Comparison made between hybridization intensities of same oligonucleotides on different chips

  27. Selection of genes for spotting on arrays • Suppose you are interested in a family of proteins, say a particular class of receptors • To identify all the genes that are part of the family, you can do a homology search (PSI-BLAST) or a PubMed keywords search • PSI-BLAST http://www.ncbi.nlm.nih.gov/BLAST/ • Another way is to use a commercial Affymetrix array • In the context of spotted arrays, the term probe often refers to the labelled population of nucleic acid in solution, while in connection with GeneChipsTM it is used to refer to the nuclei acid attached to the array. • In the MIAME convention probe is referring to the mobile population of nucleic acid as the labelled extract and the nucleic acid attached to the array as the reporter, feature or spot Target - labeled RNA or cDNA Probe – the bound DNA

  28. Selection of regions within genes • Once you have the list of genes you wish to spot on the array • The next question is cross-hybridization • How can you prevent spotting probes (similar) that are complementary to more than one gene (target mRNA or cDNA seq.) if you are working with a gene family with similarities in sequence (such as > 70% similarity) ? • That is a probe could cross-hybridized with different mRNA • or a gene’s mRNA could cross-hybridized with different probe non-specific  not a true expression level of the gene under study • Solve this problem by using  ProbeWiz Server • Use Blast to find regions in those genes that are the least homologous to other genes • ProbeWiz - http://www.cbs.dtu.dk/services/DNAarray/probewiz.php

  29. Selection of primers for PCR • Once those unique regions have been identified, the probe needs to be designed  use PCR amplification of a probe • Solve this problem by using  ProbeWiz or OligoArray Servers • ProbeWiz • predicts optimal PCR primer pairs for generation of probes for cDNA arrays • avoid self-hybridization  hairpin structure  high specificity • http://www.cbs.dtu.dk/services/DNAarray/probewiz.php • OligoArray • Genome-scale oligonucleotide design for microarrays • http://berry.engin.umich.edu/oligoarray2/ • Other option - By using long oligonucleotides (50 to 70 bps) instead of PCR primers • Other complicated issues: alternative splicing, SNP

  30. Selection of primers for PCR Minimal primer set (MPS) problem • Given a set of ORF sequences S = {S1, S2, …Sn}, L is the length of the primer, one needs to find the minimal set of primer P = {P1, P2, …Pk} , such that for every i, Si contains at least one sequence from P. • In other words, identify a set of primers P, which is common among the set of ORF sequences S • Then selected highly specific primers (dissimilar to the complementary strand of the template, other they will hybridize to a lot of positions along the template) from P Example • S = {ATTC, GATT, TTAC}, • L = 3  MPS = {ATT, TTA} or {ATT, TAC} • if L = 2  MPS = {TT}

  31. Probe pre-selection Hybridization prediction Probe selection Selection of whole genome oligonucleotide or cDNA primers • Automatic generation of whole genome oligonucleotide or cDNA probes • Probe pre-selection • by suffix tree algorithm, size of memory spaceing O(n) ~ 40n, where n is the length of the input seq. (e.g. 10000 Hs gene seqs. is about 35MB in length, 39000 human gene seqs.  memory space ~ 40*35*3.9 = 5460 MB !! • Probes are filtered for length, GC content and not contain self complementary regions >4bp • Hybridization prediction • The most time-consuming part • Need to predicts melting temperatures Tm for all probes (on average 4 probes/gene  do a 4*39000 vs. 39000 Tm calculations (i.e. 6,084,000,000 Mfold) • Probe selection • Select the probe-target vs. probe-non-target seqs.

  32. Suffix tree - Basic notation • Concatenation (串聯) of two strings s and t is denoted by st and is formed by appending all characters of t after s, in the order they appear in t, for instance, if s =GGCTA and t=CAAC, then st=GGCTACAAC. The length of st is |s|+|t|. • A prefix of s is any substring of s of the form s[1….j] for 0≦j≦ |s|. It is admit j=0 and define s[1….0] as being the empty string, which is a prefix of s as well. Note that t is a prefix of s if and only if there is another string u such that s=tu. Sometimes one needs to refer to the prefix of s with exactly k characters, with 0≦k≦|s|, and we use the notation prefix(s,k) to denote this string. • prefix(s,3) ATT is a prefix of ATTCGATTTTAC • A suffixof s is a substring of the form s[i….|s|] for a certain i such that 1≦i≦ |s|+1. one admit i=|s|+1, in which case s[|s|+1….|s|] denotes the empty string. A string t is a suffix of s if and only if there is another string u such that s=ut. The notation suffix(s,k) denotes the unique suffix of s with k characters, for 0≦k≦|s|. • suffix(s,3) TAC is a suffix of ATTCGATTTTAC

  33. Suffix tree • Given a set of three ORF sequences S = {S1,S2,S3}, S1= {AATG}, S2={TTTG}, and S3 ={TTTC}. • Merging S1 S2 S3 together to form AATG$1TTTG$2TTTC$3, with a total length of 15. • Leaf A • AATG$1TTTG$2TTTC$3 with a length of 15 • AATG$1TTTG$2TTTC$3 with a length of 14 • Leaf C • AATG$1TTTG$2TTTC$3 with a length of 2 • Leaf G • AATG$1TTTG$2TTTC$3 with a length of 12 • AATG$1TTTG$2TTTC$3 with a length of 7 • Leaf T • AATG$1TTTG$2TTTC$3 with a length of 3 • AATG$1TTTG$2TTTC$3 with a length of 13 • AATG$1TTTG$2TTTC$3 with a length of 8 • AATG$1TTTG$2TTTC$3 with a length of 4 • AATG$1TTTG$2TTTC$3 with a length of 9 • AATG$1TTTG$2TTTC$3 with a length of 5 • AATG$1TTTG$2TTTC$3 with a length of 10 • Leaf $1 • AATG$1TTTG$2TTTC$3 with a length of 11 • Leaf $2 • AATG$1TTTG$2TTTC$3 with a length of 6 • Leaf $3 • AATG$1TTTG$2TTTC$3 with a length of 1 H. Chen and Y.-S. Hou, A study on specific primer selection algorithms using suffix trees, Journal of information technology and applications, Vol. 1, No. 1, 25-30, 2006.

  34. Suffix tree • Edges are directed away from the root, and each edge is labeled by a substring from S. • All edges coming out of a given vertex have different labels, and all such labels bhave different prefixes (not counting the empty prefix). • To each leaf there corresponds a suffix from S, and this suffix is obtained by concatenating all labels on all edges on the path from the root to the leaf.

  35. cDNA microarrays Microarrays are used to measure gene expression levels in two different conditions. Greenlabel for the control sample and a red one for the experimental sample. DNA-cDNA or DNA-mRNA hybridization.The hybridised microarray is excited by a laser and scanned at the appropriate wavelenghts for the red and green dyesAmount of fluorescence emitted (intensity) upon laser excitation ~ amount of mRNA bound to each spotIf the sample in control/experimental condition is in abundance  green/red, which indicates the relative amount of transcript for the mRNA (EST) in the samples. If both are equal  yellowIf neither are present  black

  36. Scanning of microarrays • Confocal laser scanning microscopy • Laser beam excites each spot of DNA • Amount of fluorescence detected • Different lasers used for different wavelengths • Cy3 • Cy5 laser detection

  37. Analysis of hybridization • Results given as ratios • Images use colors: Cy3 = Green Cy5 = red Yellow • Yellow is equal intensity or no change in expression

  38. Example of spotted microarray • RNA from irradiated cells (red) • Compare with untreated cells (green) • Most genes have little change (yellow) • Gene CDKN1A:red = increase in expression • Gene Myc: green = decrease in expression CDKNIA MYC

  39. Microarray images produced with a pin-and-loop arrayer. (A) Two common undesirable features are indicated, namely high local background (arrow head) and scratches (two arrows) that would suggest “flagging” of the associated spots. (B) A close-up of a portion of the array demonstrates the uniformity of relative hybridization within each spot and differences in the red:green ratio of reach clone. Visualizing the hybridized target on a microarray can be performed by using either a confocal detector or a charge couple detector (CCD) camera.

  40. Probe genes Target cDNA labeled by Cy5 (Red) cDNA labeled by Cy3 (Green) By Hanne Jarmer, BioCentrum-DTU, Technical University of Denmark Microarray – overview

  41. What can we learn from the microarray data ? • Microarray permits an integrated approach to biology, in which genetic regulation can be examined  allows us to build a gene network • Classification of disease, diagnosis, prognostic (judgment of the likely or expected development of a disease) prediction and pharmaceutical applications

  42. Co-expression of gene expression • Co-expressed genes  genes involved in common processes  clustering of genes Examples • Genes required for nutrition and stress responses • Genes whose products encode components of metabolic pathways • Genes encoding subunits of multi-subunit complexes such as the ribosome, the proteasome and the nucleosome are coordinately expressed • Ribosome - site of cellular protein synthesis  • Proteasome - large multi-enzyme complexes that digest proteins • Nucleosome– A length of DNA consisting of about 140 base pairs • makes two turns around the histone core thus forming a nucleosome. • Animation - http://www.johnkyrk.com/index.html http://fig.cox.miami.edu/~cmallery/150/cells/organelle.htm

  43. Co-expression of gene expression • Waves of co-expressed temporally regulated genes has been observed during the development of the rat spinal cord • the expression levels of 112 genes at nine different time points are measured during the development of rat cervical spinal cord, and 70 genes during development and following injury of the hippocampus) http://www.cs.unm.edu/~patrik/networks/data.html

  44. Gene expression profile and phenotype • Profile or so-called signature • the combination of the mRNAs (representing a subset of the total genotype) being expressed by the cell [Thomas A. Houpt, Nutrition, 827 (2000)] • Can be thought of a s a precise molecular definition of the cell in a specific state • Expression profile is a way to describe a phenotype, and can be used to characterize a wide variety of samples • Example • human cancer cell lines treated with 70000 agents independently or in combinations have been used to link drug activity with its mode of action • genes and putative drug targets

  45. Affymetrix GeneChip experiment • RNA from four different types of brain tumors extracted • Extracted RNA hybridized to GeneChips containing approximately 6,800 human genes • Identified gene expression profiles specific to each type of tumor

  46. Affymetrix GeneChip experiment - Profiling tumors • Image portrays gene expression profiles showing differences between four different types of brain tumors • Tumors: MD (medulloblastoma) Mglio (malignant glioma) Rhab (rhabdoid) PNET (primitive neuroectodermal tumor) • Ncer: normal cerebella

  47. Affymetrix GeneChip experiment - Cancer diagnosis by microarray • Gene expression differences for medulloblastoma correlated with response to chemotherapy • Those who failed to respond had a different profile from survivors • Can use this approach to determine which tumors are likely to respond to different treatment 60 different samples

  48. Microarray data generation, processing and analysis Two parts • Material processing and data collection • Information processing Five steps - Material processing and data collection • Array fabrication • Preparation of the biological samples to be studied • Extraction and labeling of the RNA from the samples • Hybridization of the labeled extracts to the array • Scanning of the hybridized array

  49. Microarray data generation, processing and analysis Four steps - Information processing • Image quantitation – locating the spots and measuring their fluorescence intensities • Data normalization and integration – construction of the gene expression matrix from sets of spot • Gene expression data analysis and mining – finding differentially expressed genes or clusters of similarly expressed genes • Generation from these analyses of new hypotheses about the underlying biological processes  stimulates new hypotheses that in turn should be tested in follow-up experiments Image analysis Data analysis clustering http://www.mathworks.com/company/pressroom/image_library/biotech.html

  50. Microarray data processing and analysis Microarray experimental raw data (image data)  spot quantitation matrices (row = spot on array, column = quantitation of that spot, i.e. mean, median, background)  gene expression matrix  data analysis (clustering or classification (SVD or PCA, see http://public.lanl.gov/mewall/kluwer2002.html)) http://www.ebi.ac.uk/microarray/biology_intro.html

More Related