620 likes | 780 Views
Microarray - Introduction. Ka-Lok Ng Department of Bioinformatics Asia University http://ppi.bioinfo.asia.edu.tw/klng. Graduate syllabus. Topics to be covered. Introduction - RNA expression Experimental design, image processing, Microarray databases
E N D
Microarray - Introduction Ka-Lok Ng Department of Bioinformatics Asia University http://ppi.bioinfo.asia.edu.tw/klng
Topics to be covered • Introduction - RNA expression • Experimental design, image processing, Microarray databases • Data normalization, filter and analysis • Statistical analysis of gene expression data • Clustering methods • Time series data (cell cycle) and reverse engineering • Gene regulatory networks • Gene regulatory networks and protein-protein interaction networks
Classwork, homework, class attendance • Mid-term, final exam. or SCI paper/thesis presentation (for graduate students)
References • Gibson G., Muse Spencer. A primer of Genome Science, Ch. 4. 2nd edition, Sinauer (2004) • Causton H., Quackenbush J., and Brazma A. Microarray Gene Expression Data Analysis. A Beginner’s Guide. Blackwell (2003) • Baxevanis A. and Ouellette B.F. Francis. Bioinformatics Ch. 16. J. Wiley (2005) • Knudsen S. A Biologist’s Guide to Analysis of DNA Microarray Data. J. Wiley (2002) • Benfey P. and Protopapas A.D. Genomics Ch. 5. Prentice Hall (2005). • Setubal J. and Meidanis J. Introduction to computational molecular biology. PWS publishing. (1997). • A. Gu´enoche (2005). “About the design of oligo-chips”, Discrete Applied Mathematics, v147(1), pp.57-67.
Contents • Introduction – the central dogma of molecular biology, applications, data analysis, Microarray slide surface • Printing technologies – spotting, photolithography, ink-jet • Selection of genes for spotting on arrays • Selection of primers for PCR – suffix tree • Microarray application - four different types of brain tumors • Gene co-expression and gene expression profile • Data management
Knowledge is the process of piling up facts; wisdom lies in their simplification. Martin H. Fischer
Introduction • The last 10 years have brought spectacular achievements in genome sequencing (such as the HGP) • It took >1000 years for science to progress from human anatomy to understand how genomes function) • Even if we assume all the genes have correctly identified, the results represents only sequence • High throughput DNA sequencing technology created a system approach to biology
The central dogma of molecular biologyhttp://www.hort.purdue.edu/hort/courses/HORT250/lecture%2004 • Glossary • Transcripts – mRNA • Transcriptome – the complete set of transcripts • Hybridization
Microarray technology allow one to identify the genes that are expressed in different cell types, to learn how their expression levels change in different developmental stages or disease states, and to identify the cellular processes in which they participate • Microarray technology provide clues about how genes and gene products interact and their interaction networks Microarray gene expression data analysis • Experimental design data transformations from raw data to gene expression matrices data mining and analysis of gene expression matrices
What are microarrays and how do they work ? • A microarray is typically a glass or polymer slide • DNA molecules are attached at fixed locations called spots or features
Smooth surface enables even deposition of surface chemistries and perfect spot morphology.
What are microarrays and how do they work ? • ~10,000 spots on an array • each spot contains ~107 of identical DNA of lengths from 10s to 100s of bp • spots are either printed on the microarrays by a robot or jet, or synthesised by photolithography (石版影印術) or by inkjet printing Principle of cDNA microarraysEST fragments arrayed in 96- or 384-well plates are spotted at high density onto a glass microarray slide. Subsequently, two different fluorescently labeled cDNA populations derived from independent mRNA samples are hybridized to the array.
Ink-jet printer microarrays • Ink-jet printhead draws up DNA • Printhead moves to specific location on solid support • DNA ejected through small hole • Used to spot DNA or synthesize oligonucleotides directly on glass slide • Use pioneered by Agilent Technologies, Inc.
(A) Tweezer or split-pin designs transfer low nanoliter (10-9 liter) amounts of DNA to the array by capillary action as the tip strikes the solid surface.(B) TeleChemTMtips and pins apply small droplets by contact between the pin and substrate.(C) The pin-and-loop design picks up the DNA in a small loop, and a pin stamps solution on a slide at a uniform density.(D) Ink jetsspray picoliter (10-12 liter) droplets of liquid under pressure.Robotic spotting, capillary action, the DNA sticks through hydrostatic interactionsThe spacing between spot centers is specified from 120-250 mm according to the density required. The entire microarray usually covers an area 2.5x5.0 cm, though shorter grids can be printed when fewer clones are to be represented. Types of printing pins
DNA spotting I • DNA spotting usually uses multiple pins • DNA in microtiter plate • DNA usually PCR amplified • Oligonucleotides can also be spotted
Oligonucleotide microarrays – pioneered by AffymetrixAffymetrix GeneChips • Oligonucleotides • Usually at least 20–25 bases in length, optimal with 45~60 bp long • 10–20 different oligonucleotides for each gene • Oligonucleotides for each gene selected by computer program to be the following: • Unique in genome (4 (20 to 25) =2(40to 50) >> 3*109 = 230), not likely to appear twice • Non-overlapping (if the sequence length is too short then specificity is low, whereas if the length is too long, self-hybridization could happen) • Composition based design rules • Empirically derived rules (ratio of G-C pairs vs. A-T pairs which could affect the melting temperature of the seq., that is, Tm = 64.9+0.41*(GC%)-675/L, where L = length of the oligonucleotide
Construction of oligonucleotide arrays.Oligonucleotide are synthesized in situ in the silicon chip. (A) In each step, a flash of light “deprotects” the oligonucleotides at the desired location on the chip; then “protected” nucleotides of one of the four types (A, C, G or T) are added so that a single nucleotide can add to the desired chains. Oligonucleotide microarrays – pioneered by Affymetrix
Oligonucleotide microarrays Construction of oligonucleotide arrays.The light flash is produced by photolithography using a mask to allow light to strike only the required features on the surface of the chip.
lamp mask chip Photolithography • Light-activated chemical reaction • For addition of bases to growing oligonucleotide • Custom masks • Prevent light from reaching spots where bases not wanted • Mirrors also used • NimbleGen™ uses this approach
light Example: building oligonucleotides by photolithography • Want to add nucleotide G • Mask all other spots on chip • Light shines only where addition of G is desired (light “deprotects” the oligonucleotides at the desired location on the chip) • G added and reacts • Now G is on subset of oligonucleotides
light Example: adding a second base • Want to add T • New mask covers spots where T not wanted • Light shines on mask • T added • Continue for all four bases • Need 80 masks for total 20-mer oligonucleotide
Comparisons of microarrays Photolithograhy Mechanical printing Ink-jet printing
Design of oligonucleotides by photolithography • There are four types of masks according to the added nucleotide. Given a set of oligos to synthesize, the mask is a common supersequence of the oligo set or, in other words, each oligo is a subsequence of the mask sequence (characters may be separated, but they remain in the same order. • To minimize the number of masks necessary to build a supersequence of a given set of words, so-called the shortest common supersequence problem, or SCS-problem, is a NP-hard problem. • We call realization of an oligo a sequence of masks capable to synthesis it. The number of realizations • Count the number of realizations of the probe sequence GTATC (L=5) in the mask sequence GGTTATC (L=7). • It is found that the following four sets of positions can match the probe sequences; (1,3,5,6,7), (1,4,5,6,7), (2,3,5,6,7) and (2,4,5,6,7). The left copies are indicated by sign +. The instances of identical characters (repeated) in these intervals are marked by a ‘X’.
Design of oligonucleotides by photolithography • Count the realizations of the probe sequence ATTAC in the mask sequence ATTATTACAC. The left and right copies are indicated by sign + and -. The instances of identical characters in these intervals are marked by a ‘X’. • 23 realizations: (1,2,3,4,8), (1,3,5,7,8), (1,5,6,7,8,), (4,5,6,7,8) ….etc. too short Total number of possible paths from Start (S) to End (E) is 23. • Circle denotes the possible position of probe sequence within mask sequence. • Edge denotes consecutive positions in the probe sequence. 吳哲賢 生物晶片之探針辨識數目問題 第二十四屆組合數學與計算理論研討會
Comparison of microarray hybridization • Spotted microarrays • Competitive hybridization • Two labeled cDNAs hybridized to same slide measure the relative difference between the signal intensity of two targets binding to the same spot of DNA • Affymetrix GeneChips • One labeled RNA population per chip • Comparison made between hybridization intensities of same oligonucleotides on different chips
Selection of genes for spotting on arrays • Suppose you are interested in a family of proteins, say a particular class of receptors • To identify all the genes that are part of the family, you can do a homology search (PSI-BLAST) or a PubMed keywords search • PSI-BLAST http://www.ncbi.nlm.nih.gov/BLAST/ • Another way is to use a commercial Affymetrix array • In the context of spotted arrays, the term probe often refers to the labelled population of nucleic acid in solution, while in connection with GeneChipsTM it is used to refer to the nuclei acid attached to the array. • In the MIAME convention probe is referring to the mobile population of nucleic acid as the labelled extract and the nucleic acid attached to the array as the reporter, feature or spot GeneChipsTM Target - labeled cDNA or RNA Spotted probe, MIAME probe GeneChipsTM Probe – the bound DNA Spotted array – target
Selection of regions within genes • Once you have the list of genes you wish to spot on the array • The next question is cross-hybridization • How can you prevent spotting probes that are complementary to more than one gene (target mRNA or cDNA seq.) if you are working with a gene family (many similar genes) with similarities in sequence (such as > 70% similarity) ? • That is a probe could cross-hybridized with different mRNA • That is there are probes appear to bemore abundant than they really are • or a gene’s mRNA (alternative splicing mechanism could generate different mRNAs) could cross-hybridized with different probe non-specific not a true expression level of the gene under study • Not always can find a solution to the cross-hybridization problem • Solve this problem by using ProbeWiz Server • Use Blast to find regions in those genes that are the least homologous to other genes • ProbeWiz - http://www.cbs.dtu.dk/services/DNAarray/probewiz.php
Selection of primers for PCR • Once those unique regions have been identified, the probe needs to be designed use PCR amplification of a probe • Solve this problem by using ProbeWiz or OligoArray Servers • ProbeWiz • predicts optimal PCR primer pairs for generation of probes for cDNA arrays • avoid self-hybridization hairpin structure high specificity • http://www.cbs.dtu.dk/services/DNAarray/probewiz.php • OligoArray • Genome-scale oligonucleotide design for microarrays • http://berry.engin.umich.edu/oligoarray2/ • Other option - By using long oligonucleotides (50 to 70 bps) instead of PCR primers • Other complicated issues: alternative splicing, SNP
Selection of primers for PCR Minimal primer set (MPS) problem • Given a set of ORF sequences S = {S1, S2, …Sn}, L is the length of the primer, one needs to find the minimal set of primer P = {P1, P2, …Pk} , such that for every i, Si contains at least one sequence from P. • In other words, identify a set of primers P, which is common among the set of ORF sequences S • Then selected highlyspecific primers (dissimilar to the complementary strand of the template, otherwise they will hybridize to a lot of positions along the template) from P Example • S = {ATTC, GATT, TTAC}, • L = 3 P = {ATT, TTA}, P ={ATT, TAC} or {ATT, TTA} • if L = 2 P = {TT}
Probe pre-selection Hybridization prediction Probe selection Selection of whole genome oligonucleotide or cDNA primers • Automatic generation of whole genome oligonucleotide or cDNA probes • Probe pre-selection • by suffix tree algorithm, size of memory spacing O(n) ~ 40n, where n is the length of the input seq. (e.g. 10000 Hs gene seqs. is about 35MB in length, 30000 human gene seqs. memory space ~ 40*35*3.0 MB=4200 MB ! • Probes are filtered for length, GC content and not contain self complementary regions >4bp • Hybridization prediction • The most time-consuming part • Need to predicts melting temperatures Tm for all probes (on average 4 probes/gene do a 4*30000 vs. 30000 Tm calculations (i.e. 4*30000*30000 = 1.2*109 times of using the tool Mfold) • Probe selection • Select the probe-target vs. probe-non-target seqs.
Suffix tree - Basic notation • Concatenation (串聯) of two strings s and t is denoted by st and is formed by appending all characters of t after s, in the order they appear in t, for instance, if s =GGCTA and t=CAAC, then st=GGCTACAAC. The length of st is |s|+|t|. • A prefixof s is any substring of s of the form s[1….j] for 0≦j≦ |s|. It is admit j=0 and define s[1….0] as being the empty string, which is a prefix of s as well. Note that t is a prefix of s if and only if there is another string u such that s=tu. Sometimes one needs to refer to the prefix of s with exactly k characters, with 0≦k≦|s|, and we use the notation prefix(s,k) to denote this string. • prefix(s,3) ATT is a prefix of ATTCGATTTTAC • A suffixof s is a substring of the form s[i….|s|] for a certain i such that 1≦i≦ |s|+1. one admit i=|s|+1, in which case s[|s|+1….|s|] denotes the empty string. A string t is a suffix of s if and only if there is another string u such that s=ut. The notation suffix(s,k) denotes the unique suffix of s with k characters, for 0≦k≦|s|. • suffix(s,3) TAC is a suffixof ATTCGATTTTAC
Suffix tree • Suffix tree – contains all suffixes of a string factoring out common prefixes as much as possible in the tree structure • Edges are directed away from the root, and each edge is labeled by a substring from S. • All edges coming out of a given vertex have different labels, and all such labels have different prefixes (not counting the empty prefix). • To each leaf there corresponds a suffix from S, and this suffix is obtained by concatenating all labels on all edges on the path from the root to the leaf.
Suffix tree • More example, X = AATAATGC$, where $ signals the end of the sequence • Let the substring S be the shortest substring beginning at i which does not occur elsewhere in X • The longest repeat within the string is AAT Suffix tree of X, where () denotes position A ATA (1) ATG (4) TA (2) TG (5) C (8) G (7) TA (3) TG (6) $ (9)
Suffix tree • Given a set of three ORF sequences S = {S1,S2,S3}, S1= {AATG}, S2={TTTG}, and S3 ={TTTC}. • Merging S1 S2 S3 together to form AATG$1TTTG$2TTTC$3, with a total length of 15. • Non-overlap Longest Repeat among S1 and S2 is TG, and among is S2, and S3 is TTT • Leaf A • AATG$1TTTG$2TTTC$3 with a length of 15 • AATG$1TTTG$2TTTC$3 with a length of 14 • Leaf C • AATG$1TTTG$2TTTC$3 with a length of 2 • Leaf G • AATG$1TTTG$2TTTC$3 with a length of 12 • AATG$1TTTG$2TTTC$3 with a length of 7 • Leaf T • AATG$1TTTG$2TTTC$3 with a length of 3 • AATG$1TTTG$2TTTC$3 with a length of 13 • AATG$1TTTG$2TTTC$3 with a length of 8 • AATG$1TTTG$2TTTC$3 with a length of 4 • AATG$1TTTG$2TTTC$3 with a length of 9 • AATG$1TTTG$2TTTC$3 with a length of 5 • AATG$1TTTG$2TTTC$3 with a length of 10 • Leaf $1 • AATG$1TTTG$2TTTC$3 with a length of 11 • Leaf $2 • AATG$1TTTG$2TTTC$3 with a length of 6 • Leaf $3 • AATG$1TTTG$2TTTC$3 with a length of 1 H. Chen and Y.-S. Hou, A study on specific primer selection algorithms using suffix trees, Journal of information technology and applications, Vol. 1, No. 1, 25-30, 2006.
cDNA microarrays Microarrays are used to measure gene expression levels in two different conditions. Greenlabel for the control sample and a red one for the experimental sample. DNA-cDNA or DNA-mRNA hybridization.The hybridised microarray is excited by a laser and scanned at the appropriate wavelenghts for the red and green dyesAmount of fluorescence emitted (intensity) upon laser excitation ~ amount of mRNA bound to each spotIf the sample in control/experimental condition is in abundance green/red, which indicates the relative amount of transcript for the mRNA (EST) in the samples. If both are equal yellowIf neither are present black
Scanning of microarrays • Confocal laser scanning microscopy • Laser beam excites each spot of DNA • Amount of fluorescence detected • Different lasers used for different wavelengths • Cy3 • Cy5 laser detection
Analysis of hybridization • Results given as ratios • Images use colors: Cy3 = Green Cy5 = red Yellow • Yellow is equal intensity or no change in expression
Example of spotted microarray • RNA from irradiated cells (red) • Compare with untreated cells (green) • Most genes have little change (yellow) • Gene CDKN1A:red = increase in expression • Gene Myc: green = decrease in expression CDKNIA MYC
Microarray images produced with a pin-and-loop arrayer. (A) Two common undesirable features are indicated, namely high local background (arrow head) and scratches (two arrows) that would suggest “flagging” of the associated spots. (B) A close-up of a portion of the array demonstrates the uniformity of relative hybridization within each spot and differences in the red:green ratio of reach clone. Visualizing the hybridized target on a microarray can be performed by using either a confocal detector or a charge couple detector (CCD) camera.
Probe genes Target cDNA labeled by Cy5 (Red) cDNA labeled by Cy3 (Green) By Hanne Jarmer, BioCentrum-DTU, Technical University of Denmark Microarray – overview
What can we learn from the microarray data ? • Microarray permits an integrated approach to biology, in which genetic regulation can be examined allows us to build a gene network • Classification of disease, diagnosis, prognostic (judgment of the likely or expected development of a disease) prediction and pharmaceutical applications
Co-expression of gene expression • Co-expressed genes genes involved in common processes clustering of genes Examples • Genes required for nutrition and stress responses • Genes whose products encode components of metabolic pathways • Genes encoding subunits of multi-subunit complexes such as the ribosome, the proteasome and the nucleosome are coordinately expressed • Ribosome - site of cellular protein synthesis • Proteasome - large multi-enzyme complexes that digest proteins • Nucleosome– A length of DNA consisting of about 140 base pairs • makes two turns around the histone core thus forming a nucleosome. • Animation - http://www.johnkyrk.com/index.html http://fig.cox.miami.edu/~cmallery/150/cells/organelle.htm
Co-expression of gene expression • Waves of co-expressed temporally regulated genes has been observed during the development of the rat spinal cord • the expression levels of 112 genes at nine different time points are measured during the development of rat cervical spinal cord, and 70 genes during development and following injury of the hippocampus 海馬體) http://www.cs.unm.edu/~patrik/networks/data.html
Gene expression profile and phenotype • Profile or so-called signature • the combination of the mRNAs (representing a subset of the total genotype) being expressed by the cell [Thomas A. Houpt, Nutrition, 827 (2000)] • Can be thought of as a precise molecular definition of the cell in a specific state • Expression profile is a way to describe a phenotype, and can be used to characterize a wide variety of samples Example • human cancer cell lines treated with 70000 agents independently or in combinations have been used to link drug activity with its mode of action • genes and putative drug targets
Affymetrix GeneChip experiment - Profiling tumors • RNA from four different types of brain tumors extracted • Extracted RNA hybridized to GeneChips containing approximately 6,800 human genes • Identified gene expression profiles specific to each type of tumor • Image portrays gene expression profiles showing differences between four different types of brain tumors • Tumors: MD (medulloblastoma) Mglio (malignant glioma) Rhab (rhabdoid) PNET (primitive neuroectodermal tumor) • Ncer: normal cerebella
Affymetrix GeneChip experiment - Cancer diagnosis by microarray • For a single type of tumor, medulloblastoma (MD), RNA from 60 different tumor samples was analyzed • Response to chemotherapy was known for each of the tumor samples • Gene expression differences for MD correlated with response to chemotherapy • Patients who failed to respond had a different profile from survivors (who did respond and survived longer) • Can use this approach to determine which tumor samples are likely to respond to treatment 60 different samples
Microarray data generation, processing and analysis Two parts • Material processing and data collection • Information processing Five steps - Material processing and data collection • Array fabrication • Preparation of the biological samples to be studied • Extraction and labeling of the RNA from the samples • Hybridization of the labeled extracts to the array • Scanning of the hybridized array