
Genomics and Bioinformatics The "new" biology

tannar

Presentation Transcript


  1. Genomics and Bioinformatics The "new" biology

  2. What is genomics • Genome • All the DNA contained in the cell of an organism • Genomics • The comprehensive study of the interactions and functional dynamics of whole sets of genes and their products. (NIAAA, NIH) • A "scaled-up" version of genetics research in which scientists can look at all of the genes in a living creature at the same time. (NIGMS, NIH) • Which organism’s genome was sequenced first?

  3. Genome sequencing chronology http://www.ncbi.nlm.nih.gov/ICTVdb/Images/Ackerman/Phages/Microvir/238-27_1.jpg http://www.alsa.org/research/article.cfm?id=822 http://www.waterscan.co.yu/images/virusi-bakterije/Haemophilus%20influenzae.jpg http://www.biochem.wisc.edu/yeastclub/buddingyeast(color).jpg

  4. Genome sequencing chronology http://www.sih.m.u-tokyo.ac.jp/chem1.gif http://lter.kbs.msu.edu/Biocollections/Herbarium/Images/ARBTH3H.jpg

  5. Genome sequencing projects (as of 1/26/2007)

  6. Sequencing strategies: Hierarchical shotgun sequencing http://www.bio.davidson.edu/courses/GENOMICS/method/shotgun.html

  7. Genome size range • [Figure: genome size ranges on a log scale from 10^4 to 10^11 for plasmids, viruses, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals] • What is in the genomes? Why is there such a big difference?

  8. Information contents in a genome • Gene • Protein coding genes • RNA genes • Regulatory elements • Gene expression control • Chromatin remodeling • Matrix attachment sites • “Non-functional” elements • Selfish elements • “Junk” DNA • ??

  9. The “central dogma” of molecular biology • Central dogma • DNA (replication) → transcription → RNA → translation → Protein

  10. Expanded “central dogma” of molecular biology • A more comprehensive view • DNA (replication) → transcription → RNA → translation → Protein → Metabolite, with Phenotype shown alongside

  11. New disciplines due to advances in genomics • Omics • Structural genomics (DNA, replication): genomic DNA sequences • Transcriptomics (RNA, transcription): transcript sequences, microarray data, cis-elements, TF binding sites, epigenetic regulation • Proteomics (protein, translation): shotgun protein sequencing, subcellular location, post-translational modification, protein interaction, protein structure • Phenotype: genetic interactions, systematic KO, disease information • Metabolomics (metabolite): metabolite concentration, metabolic flux

  12. Nature omics gateway http://www.nature.com/omics/subjects/index.html

  13. Three perspectives of our biological world • The cellular level, the individual, the tree of life • 2-100 × 10^6 species • ~10^14 cells per individual • ~3 × 10^4 genes • Rosenzweig et al., 2002. Conservation Biol. • Image: http://www.tolweb.org/tree/ • Image: http://www.olympusfluoview.com/gallery/cells/hela/helacells.html

  14. Further complications • Cell-cell interactions • Cell types • Environmental conditions • Developmental programming • Interactions at the organismal level • Interactions at the population, ecosystem level

  15. Definition of bioinformatics • Bioinformatics • Research, development, or application of • Computational tools and approaches for expanding the use of • Biological, medical, behavioral or health data, including those to • Acquire, store, organize, archive, analyze, or visualize such data. • Computational biology • The development and application of • Data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to • The study of biological, behavioral, and social systems • Q: What kinds of data are we talking about? http://www.bisti.nih.gov/

  16. Example: Sequence assembly • Cut the genome into ~150 kb pieces • Clone into Bacterial Artificial Chromosomes (BACs) • Map to determine the order of the BAC clones (golden/tiling path) • Shear a BAC clone randomly • Sequence the fragments • Assemble the sequence reads http://www.bio.davidson.edu/courses/GENOMICS/method/shotgun.html

  17. Sequence assembly • Challenges • The presence of gaps • Due to incomplete coverage • Sequencing errors and quality issues: worse at the end of reactions • So we cannot rely on perfectly identical sequences all the time • Sequences are derived from one strand of DNA • Need to take the orientation of reads into account • Non-random sequencing of DNA • Presence of repeats • [Figure: correct layout vs. mis-assembly] http://www.cbcb.umd.edu/research/assembly_primer.shtml
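The orientation challenge listed above can be made concrete with a short Python sketch. It is only an illustration under simple assumptions: reads are plain strings over A, C, G, T, and the helper names `reverse_complement` and `same_fragment` are hypothetical, not taken from any assembler.

```python
# Translation table mapping each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(read: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return read.translate(COMPLEMENT)[::-1]

def same_fragment(read_a: str, read_b: str) -> bool:
    """True if read_b equals read_a on either strand (exact match only)."""
    return read_b == read_a or read_b == reverse_complement(read_a)

print(reverse_complement("ATGC"))      # opposite-strand reading of ATGC
print(same_fragment("ATGC", "GCAT"))   # the two reads cover the same fragment
```

A real assembler would compare each read in both orientations while tolerating sequencing errors; exact string equality is used here only to keep the idea visible.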

  18. Overlap-layout-consensus • The relationships between reads can be represented as a graph • Nodes (vertices): reads • Edges (lines): connecting “overlapping reads” • Goal: identifying a path through that graph that visits each node exactly once (a Hamiltonian path) • [Figure: reads 1 to 4 laid out along the genome and the corresponding overlap graph] http://en.wikipedia.org/wiki/Image:Hamilton_path.gif
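As a rough illustration of the graph view described on this slide, the Python sketch below builds an overlap graph and searches for a path that visits every read exactly once. It rests on several assumptions that are not in the slides: reads are error-free, all come from the same strand, and the minimum overlap of 3 bases and the toy reads are made up for the example. The brute-force search over permutations is only workable for a handful of reads.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def build_overlap_graph(reads, min_len=3):
    """Edges (a, b) -> overlap length, for every ordered pair of distinct reads that overlap."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap(a, b, min_len)
                if k:
                    graph[(a, b)] = k
    return graph

def hamiltonian_layout(reads, min_len=3):
    """Try every read order; return the merged sequence for the first order in which
    each consecutive pair overlaps, or None if no such order exists."""
    graph = build_overlap_graph(reads, min_len)
    for order in permutations(reads):
        if all((a, b) in graph for a, b in zip(order, order[1:])):
            contig = order[0]
            for a, b in zip(order, order[1:]):
                contig += b[graph[(a, b)]:]   # append only the non-overlapping tail
            return contig
    return None

reads = ["GGCTAC", "ATGGCT", "TACGTA"]   # hypothetical reads from ATGGCTACGTA
print(hamiltonian_layout(reads))
```

The factorial cost of trying every order is the reason practical assemblers rely on heuristics rather than an exhaustive Hamiltonian path search.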

  19. Example: Gene prediction • How can we identify functional elements in the genomes? • How can we assign functions to these elements? • How can we determine/predict the structures of these elements? • How can we reconstruct networks describing the relationships and dynamics between these elements? • How can we link genotypes to phenotypes?

  20. Characteristics of protein coding genes • Similarity to other genes • Assuming there is some level of conservation. • Substitutions that change amino acids vs. those that do not. http://www.mun.ca/biology/scarr/MGA2_03-20.html
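The distinction between substitutions that change an amino acid and those that do not can be sketched in Python using the standard genetic code. The function name `classify_substitution` is hypothetical, and the example only compares two whole codons; it does not estimate substitution rates.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_substitution(codon_a, codon_b):
    """Label a codon change as synonymous (same amino acid) or nonsynonymous."""
    return "synonymous" if CODON_TABLE[codon_a] == CODON_TABLE[codon_b] else "nonsynonymous"

print(classify_substitution("GAA", "GAG"))   # Glu -> Glu
print(classify_substitution("GAA", "GAT"))   # Glu -> Asp
```

In a conserved protein-coding region one would expect changes of the first kind to be tolerated more often than the second, which is one signal gene finders can exploit.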

  21. Hidden Markov Model and gene finding • Goal: • Choose a path that maximizes the probability that you will enjoy the trip (or the other way around if you wish) • How is the probability determined? p = p(EL-CHI) × p(CHI-MAD) = 0.5 × 0.4 = 0.2
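The path probability on this slide is a product of transition probabilities, which the following Python sketch reproduces. Only the two probabilities quoted on the slide are used; the dictionary layout and the function name `path_probability` are assumptions made for the illustration.

```python
from math import prod

# Transition probabilities quoted on the slide (EL -> CHI and CHI -> MAD).
transition = {("EL", "CHI"): 0.5, ("CHI", "MAD"): 0.4}

def path_probability(path, transition):
    """Multiply the transition probabilities along consecutive stops of a path."""
    return prod(transition[(a, b)] for a, b in zip(path, path[1:]))

print(path_probability(["EL", "CHI", "MAD"], transition))
```

In a gene-finding HMM the stops would be hidden states such as exon or intron, each state would also emit bases with its own probabilities, and the path with the highest overall probability would typically be found with dynamic programming rather than by enumerating paths.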

  22. Example: Sequence alignment • Align retinol-binding protein (RBP) and β-lactoglobulin
      >RBP
      MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIV
      >lactoglobulin
      MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

        1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
        1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin

       51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
       45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin

       98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
       94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

      137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
      136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin

  23. Goal of pairwise sequence alignment (PSA) • Find an alignment between 2 sequences with the maximum score
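One common way to find a maximum-scoring alignment of two whole sequences is global dynamic programming in the style of Needleman and Wunsch. The slides do not say which algorithm or scoring scheme they use, so the Python sketch below is an assumption throughout: match +1, mismatch -1, and a linear gap penalty of -2 are illustrative values, and protein alignments would normally use a substitution matrix instead.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b under a simple linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):              # a prefix of a aligned against gaps only
        score[i][0] = i * gap
    for j in range(1, cols):              # a prefix of b aligned against gaps only
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap    # gap in b
            left = score[i][j - 1] + gap  # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]

print(global_alignment_score("GATTACA", "GATACA"))
```

Each cell holds the best score for aligning two prefixes, so the final cell is the maximum over all possible alignments without listing them one by one.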

  24. Extreme value distribution • Normal vs. extreme value distribution • [Figure: probability (0 to 0.40) plotted against x (-5 to 5) for the normal distribution and the extreme value distribution]
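To make the comparison on this slide tangible, the Python sketch below evaluates the standard normal density and the standard Gumbel (type I extreme value) density. The slide does not state which parameterization it plots, so the use of the standard forms with location 0 and scale 1 is an assumption.

```python
import math

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def gumbel_pdf(x):
    """Standard Gumbel (extreme value, maxima) density: exp(-x - exp(-x))."""
    return math.exp(-x - math.exp(-x))

for x in (-2, 0, 3):
    print(x, round(normal_pdf(x), 4), round(gumbel_pdf(x), 4))
```

The extreme value curve is skewed with a heavier right tail, so a high score is less surprising under it than under a normal curve. This presumably explains its place next to the maximum-score alignment slide, since the best score among many random alignments behaves like a maximum.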

  25. Example: Microarray • A solid support (e.g. a membrane or glass slide) on which DNA of known sequence is deposited in a grid-like fashion http://shadygrove.umbi.umd.edu/microarray/Microarray.gif

  26. Microarray data analysis • A simplified pipeline http://www.microarray.lu/images/overview_1.jpg

  27. What’s in the CEL files • Intensities of perfect and mismatch probes
      library(affy)        # assumed: M is an AffyBatch, e.g. M <- ReadAffy()
      #### Dimension of the data matrix
      nrow(M); ncol(M)
      ### Perfect match
      pm <- pm(M)          # perfect match intensities
      dim(pm)              # dimension of the pm matrix
      pm[1:5,]             # the first five rows
      summary(pm)          # summary statistics for the pm matrix

      Output of pm[1:5,]:
           GSM131151.CEL GSM131152.CEL GSM131153.CEL GSM131160.CEL GSM131161.CEL GSM131162.CEL
      [1,]         252.5         267.0         349.0         424.8         213.5         237.8
      [2,]         138.0         129.8         147.5         335.5         215.3         142.3
      [3,]         172.3         155.5         174.8         411.8         241.0         128.3
      [4,]         163.3         142.8         155.5         494.3         225.5         119.5
      [5,]         259.5         257.3         245.3         505.5         308.8         217.0

      Output of summary(pm) (first four arrays):
       GSM131151.CEL     GSM131152.CEL     GSM131153.CEL     GSM131160.CEL
       Min.   :   56.3   Min.   :   67.5   Min.   :   69.5   Min.   :   96.0
       1st Qu.:  144.3   1st Qu.:  143.3   1st Qu.:  157.3   1st Qu.:  303.6
       Median :  212.5   Median :  215.0   Median :  234.8   Median :  414.5
       Mean   :  423.1   Mean   :  437.5   Mean   :  458.4   Mean   :  648.2
       3rd Qu.:  383.5   3rd Qu.:  397.8   3rd Qu.:  426.0   3rd Qu.:  637.0
       Max.   :39818.5   Max.   :39268.0   Max.   :28628.0   Max.   :24854.5

  28. Probe intensity behaviors between arrays • Distributions vary widely between experiments
      ### Summarize the intensity
      par(mfrow=c(1,2))    # get a plotting region with 1 row, 2 col
      hist(M)              # generate log2 histograms
      boxplot(M)           # generate log2 boxplots
      [Figure: histograms and boxplots of log intensity for each array]

  29. Example: Identification of cis-elements • The on-off switches and rheostats of a cell operating at the gene level. • They control whether and how vigorously that genes will be transcribed into RNAs. http://genomicsgtl.energy.gov/science/generegulatorynetwork.shtml

  30. Motif model: Position Frequency Matrix (PFM) • f_{b,i}: frequency of base b occurring at the i-th position D’haeseleer (2006) Nature Biotech. 24:423
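A position frequency matrix can be computed directly from a set of aligned binding sites, as in the Python sketch below. The four example sites are made up for illustration, and the function name `position_frequency_matrix` is hypothetical.

```python
def position_frequency_matrix(sites, alphabet="ACGT"):
    """Return {base: [frequency at position 0, 1, ...]} for equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    n = len(sites)
    return {
        base: [sum(site[i] == base for site in sites) / n for i in range(length)]
        for base in alphabet
    }

sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]   # hypothetical aligned sites
pfm = position_frequency_matrix(sites)
for base in "ACGT":
    print(base, pfm[base])
```

Each column of the matrix sums to 1, so entry f_{b,i} reads as the fraction of sites with base b at position i.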

  31. Motif model: Position Weight Matrix (PWM) • Suppose p_A = p_T = 0.32 and p_G = p_C = 0.18 (Arabidopsis thaliana) • [Figure: Position Frequency Matrix converted to Position Weight Matrix]
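The slide's own conversion formula did not survive in the transcript. A common definition takes the weight as the log-odds log2(f_{b,i} / p_b) against the background frequencies, and the Python sketch below follows that definition as an assumption. The background values are the ones quoted on the slide; the small pseudocount (which avoids taking the log of zero) and the toy PFM are additions made for this example.

```python
import math

# Background base frequencies quoted on the slide for Arabidopsis thaliana.
BACKGROUND = {"A": 0.32, "T": 0.32, "G": 0.18, "C": 0.18}

def position_weight_matrix(pfm, background=BACKGROUND, pseudocount=0.01):
    """Convert a PFM into log2-odds weights, smoothing frequencies toward the background."""
    pwm = {}
    for base, freqs in pfm.items():
        p = background[base]
        pwm[base] = [
            math.log2(((f + pseudocount * p) / (1 + pseudocount)) / p) for f in freqs
        ]
    return pwm

def score_site(site, pwm):
    """Sum the weights of the bases observed at each position of a candidate site."""
    return sum(pwm[base][i] for i, base in enumerate(site))

# Hypothetical 3-position PFM: T fixed, A fixed, then background-like frequencies.
pfm = {"A": [0.0, 1.0, 0.32], "C": [0.0, 0.0, 0.18],
       "G": [0.0, 0.0, 0.18], "T": [1.0, 0.0, 0.32]}
pwm = position_weight_matrix(pfm)
print(score_site("TAA", pwm), score_site("CCA", pwm))
```

Positive weights mark bases that are enriched relative to the background, negative weights mark depleted ones, and a position whose frequencies equal the background contributes about zero to the score.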

  32. Example: Cis-regulatory logic • Based on a high confidence set of binding sites: • 3,353 interactions between • 116 regulators and • 1,296 promoters Harbison et al. (2004) Nature 431:99

  33. Identification of putative cis elements • Pearson's correlation coefficient as the similarity measure. • k-means clustering to identify co-regulated genes. • Motifs identified only with AlignACE Beer and Tavazoie (2004) Cell 117:185
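The similarity measure named on this slide can be sketched in a few lines of Python, together with the assignment step of a k-means style clustering that uses it. The expression profiles and centroids below are invented numbers, and `assign_to_cluster` is a hypothetical helper showing one step only, not the full procedure of the cited study.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def assign_to_cluster(profile, centroids):
    """Index of the centroid most correlated with the profile (one k-means assignment step)."""
    return max(range(len(centroids)), key=lambda k: pearson(profile, centroids[k]))

centroids = [[3, 2, 1], [1, 2, 4]]          # hypothetical cluster centers
print(pearson([1, 2, 3], [2, 4, 6]))        # profiles with the same shape
print(assign_to_cluster([1, 2, 3], centroids))
```

Because the coefficient ignores scale and offset, genes whose expression rises and falls together are grouped even when their absolute levels differ.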

  34. Bayesian network • Bayes' theorem • Bayesian network Charniak (1991) Bayesian networks without tears
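The two formulas behind this slide were not preserved in the transcript. In their standard textbook form they read as follows, where the event names $A$, $B$ and the notation $\mathrm{Pa}(X_i)$ for the parents of node $X_i$ in the network are notational assumptions:

```latex
% Bayes' theorem: update the probability of A after observing B
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

% Bayesian network: the joint distribution factorizes over the graph,
% each variable depending only on its parents
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

For cis-regulatory logic, one can picture the $X_i$ as the presence of motifs and the expression state of a gene, with edges encoding which motifs influence expression.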

  35. Final example: Relationships between sequences • Sanger and colleagues (1950s): first protein sequence • Insulin from various mammals

  36. Trees • An acyclic, undirected graph with nodes and edges • [Figure: example trees with operational taxonomic units (A to E) at the tips, ancestral taxonomic units (F to I) at the internal nodes, external and internal branches labeled with branch lengths, and a time axis marked in units of one] Li 1997. Molecular Evolution. p101

  37. Enumerating trees • Suppose there are n OTUs (n ≥ 3) • Bifurcating rooted trees: $N_R = \frac{(2n-3)!}{2^{n-2}(n-2)!}$ • Unrooted trees: $N_U = \frac{(2n-5)!}{2^{n-3}(n-3)!}$ • For 10 OTUs • 3.4 × 10^7 possible rooted trees • 2.0 × 10^6 possible unrooted trees http://w3.uniroma1.it/cogfil/philotrees.jpg
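The counts quoted for 10 OTUs can be checked with the standard formulas for bifurcating trees, (2n-3)! / (2^(n-2) (n-2)!) rooted and (2n-5)! / (2^(n-3) (n-3)!) unrooted. The Python sketch below assumes those formulas; the function names are hypothetical.

```python
from math import factorial

def rooted_trees(n):
    """Number of bifurcating rooted trees for n OTUs (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 OTUs")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    """Number of bifurcating unrooted trees for n OTUs (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (3, 4, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
```

For 10 OTUs this gives 34,459,425 rooted and 2,027,025 unrooted trees, matching the rounded figures on the slide, and the rapid growth shows why exhaustive tree search becomes impractical for larger data sets.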

  38. Impacts of genomics and bioinformatics • New ways to ask and answer questions? • Hypothesis driven vs. data driven • A matter of scale • A matter of integration • Quantitative emphasis • Multi-disciplinary approaches • How is genomics different from genetics? • Whole genome approach versus a few genes • Investigations into the structure and function of very large numbers of genes undertaken in a simultaneous fashion. • Genetics looks at single genes, one at a time, as a snapshot. • Genomics tries to look at all the genes as a dynamic system, over time, and to determine how they interact and influence biological pathways and physiology in a much more global sense.

  39. The END • ...
