91.350/580 Topics in Bioinformatics

evonne


  1. 91.350/580 Topics in Bioinformatics • What is bioinformatics? • DNA sequences • Protein sequences/structures • Modeling/inference • Intersection of • biology • statistics • computer science • algorithms • machine learning

  2. DNA and RNA • DNA (deoxyribonucleic acid) and RNA (ribonucleic acid) are linear chains of monomeric units called nucleotides • A nucleotide has three parts: a sugar, a phosphate and a base • Four bases

  3. Base Types • Nucleic acid bases are of two types • Pyrimidine [pairímədì:n]– C, T, U (two nitrogens in 6-member ring at positions 1 and 3) • Purine – A, G (pyrimidine ring fused to an imidazole ring (C3H4N2))

  4. Primary Structure of DNA and RNA • Nucleotides are joined by phosphodiester bonds and form a sugar-phosphate backbone • Sugar is deoxyribose in DNA (left) and ribose in RNA (right) • Nitrogen-containing nucleobases are bonded to the sugar

  5. Online course on Biology • Educational Portal • DNA chemical structure • http://education-portal.com/academy/lesson/dna-and-the-chemical-structure-of-nucleic-acids.html

  6. Secondary Structure • Double helix: proposed in 1953 by Watson and Crick using X-ray diffraction data • Sugar-phosphate backbone is the outer part of the helix • Two strands run in antiparallel directions • Dimensions • Inside diameter of backbone: 11 Å (1.1 nm) • Outer diameter: 20 Å (1 Å = 10^-10 m = 0.1 nm) • Length of one complete turn: 34 Å, 10 base pairs • Major and minor grooves: sites where drugs or polypeptides bind to DNA

  7. Secondary Structure of DNA • Two strands are complementary • Base pairing: A-T; G-C • Each pair joins a pyrimidine and a purine through complementary hydrogen bonds

  8. Monomer counts in DNA • In double strands • # of A = # of T; # of G = # of C • Erwin Chargaff’s 1st Parity Rule, 1951 • In a single strand? • # of A ≈ # of T; # of G ≈ # of C (approximately, for sufficiently long strands) • Erwin Chargaff’s 2nd Parity Rule
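
The first parity rule follows directly from base pairing: every A on one strand faces a T on the other, and every G faces a C. A minimal Python sketch, using a made-up example sequence (the sequence and the helper names are illustrative assumptions, not part of the slides):

```python
from collections import Counter

# Watson-Crick pairing: A<->T, G<->C
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read 5'->3' (strands are antiparallel)."""
    return strand.translate(COMPLEMENT)[::-1]


def duplex_counts(strand: str) -> Counter:
    """Count bases over both strands of the double helix."""
    return Counter(strand) + Counter(reverse_complement(strand))


strand = "AAAGGCTA"  # hypothetical sequence, deliberately unbalanced
counts = duplex_counts(strand)
print(Counter(strand))  # the single strand need not satisfy A = T
print(counts)           # the duplex always does
```

The single strand here has four A's and one T, yet the duplex counts balance exactly. The second parity rule is an empirical observation about long natural sequences and does not follow from pairing alone.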

  9. Importance of Hydrogen Bonding • Many consider the hydrogen bond essential to the evolution of life • An individual hydrogen bond is weak, but many H bonds collectively exert a very strong force • The orderly, repetitive arrangement of H bonds in polymers determines their shape

  10. Online course on Biology • Educational Portal • Four bases • http://education-portal.com/academy/lesson/dna-adenine-guanine-cytosine-thymine-complementary-base-pairing.html

  11. Chromosome Length • 3.4 Å per base pair • 3 billion bases • 1.8 meters of DNA • 0.09 mm of chromatin after being wound on histones • Five families of histones • H1/H5, H2A, H2B, H3, and H4
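
As a back-of-envelope check, the contour length of the DNA can be estimated from the rise per base pair. The sketch below uses the 3.4 Å and 3 billion figures from the slide; doubling for a diploid cell is an added assumption, and the function name is hypothetical.

```python
ANGSTROM_IN_METERS = 1e-10


def contour_length_m(base_pairs: float, rise_angstrom: float = 3.4) -> float:
    """Length of fully extended double-stranded DNA, in meters."""
    return base_pairs * rise_angstrom * ANGSTROM_IN_METERS


haploid_m = contour_length_m(3e9)  # one copy of the genome
diploid_m = 2 * haploid_m          # assumption: two copies per cell
print(f"haploid: {haploid_m:.2f} m, diploid: {diploid_m:.2f} m")
```

This gives roughly 1 m per genome copy and roughly 2 m per diploid cell, the same order of magnitude as the 1.8 m quoted on the slide.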

  12. RNA • Sugar in an RNA nucleotide is ribose rather than 2’-deoxyribose • Thymine is replaced by uracil (U) • RNA polymers are usually a few thousand nucleotides or shorter • RNA in cells is usually single-stranded • RNA is considered to be the original gene-coding material, and it still encodes genes in some viruses

  13. RNA Types • Four RNAs are involved in protein synthesis

  14. Online course on Biology • Educational Portal • RNA • http://education-portal.com/academy/lesson/differences-between-rna-and-dna-types-of-rna-mrna-trna-rrna.html

  15. Gene to Protein: Transcription & Translation

  16. Gene to Protein

  17. Gene to Protein • [Figure: gene structure diagram. Labels: protein-coding region, non-protein-coding regions, 5’ UTR, 3’ UTR, exon, intron, intergenic region, Protein 1, Protein 2]

  18. Example

  19. Translation • Genetic Code • A triplet of bases (called a codon) specifies one amino acid • Ribosome moves along mRNA 3 bases at a time • Degenerate coding • 4x4x4 = 64 possible triplets map onto 20 amino acids (plus stop signals) • For 8 amino acids the 3rd base is irrelevant, so mutations at that position do not change the protein • Anti-codon: reverse complement of a codon
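
The points on this slide can be sketched in a few lines of Python. The table below is the standard genetic code written in the usual compact form (bases ordered U, C, A, G; `*` marks a stop codon). The function names and the example mRNA are illustrative assumptions, and the sketch ignores everything a real ribosome does beyond reading triplets.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' = stop
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def anticodon(codon: str) -> str:
    """Reverse complement of a codon, written 5'->3'."""
    return codon.translate(RNA_COMPLEMENT)[::-1]


# Amino acids with a codon family in which the third base is irrelevant
fourfold = {
    CODON_TABLE[prefix + "U"]
    for prefix in ("".join(p) for p in product(BASES, repeat=2))
    if len({CODON_TABLE[prefix + b] for b in BASES}) == 1
}

print(len(CODON_TABLE), "codons")
print(translate("AUGGCUUAA"))  # hypothetical mRNA: Met, Ala, stop
print(anticodon("AUG"))
print(sorted(fourfold))
```

The `fourfold` set recovers the eight amino acids mentioned above (A, G, L, P, R, S, T, V in one-letter code); some of them, such as leucine, also have additional codons outside the fourfold family.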

  20. Genetic Code

  21. Amino Acids • General structure of amino acids • an amino group • a carboxyl group • α-carbon bonded to a hydrogen and a side-chain group, R • R determines the identity of particular amino acid • R: large white and gray • C: black • Nitrogen: blue • Oxygen: red • Hydrogen: white

  22. Genome • Genome • The entire DNA of a cell is its genome • Individual units coding for proteins or RNA are genes • A protein-coding gene starts with ATG and ends with one or two stop codons • Called an ORF (Open Reading Frame) • Biological Info • Contained in the genome • Encoded in nucleotide sequences of DNA or RNA • Partitioned into discrete units, genes
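
A minimal sketch of how this ORF definition can be turned into a search, assuming the standard stop codons TAA, TAG and TGA, scanning only the forward strand, and ignoring minimum-length filters that real gene finders apply. The function name and the test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of ATG...stop stretches.

    Scans the three forward reading frames; `end` is exclusive and
    includes the stop codon, so dna[start:end] is the whole ORF.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the frame
            elif start is not None and codon in STOP_CODONS:
                orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


dna = "CCATGAAATGGTAAGG"  # hypothetical sequence
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

In this example the ATG at index 7 lies in a different frame with no in-frame stop codon, so only the ORF starting at index 2 is reported.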

  23. Cell • Different levels of cells • Prokaryote /proʊˈkærioʊt/ (pro for “before”; karyon, “kernel” in Greek) • Eukaryote (eu for “true”) • Main difference is the presence of organelles, especially the nucleus, in eukaryotes

  24. [Figures: animal cell, plant cell, prokaryotic cell]

  25. Three Domains • Classification purely based on biochemistry (RNA) • C. Woese, 1981 • Eubacteria (true bacteria) • Archaea (archaebacteria, early bacteria) • Eukarya (eukaryotes)

  26. Genome Sequencing Projects • Major genome sequencing centers • U.S. Dept. of Energy Joint Genome Institute (435 projects) • J. Craig Venter Institute (302) • The Institute for Genomic Research (TIGR) (206) • Washington Univ. (184) • Institut Pasteur, Univ. of Tokyo • www.ncbi.nlm.nih.gov/genomes/static/lcenters.html • Completely sequenced genomes include • Several hundred bacteria, over 20 archaea, and over 30 eukarya • Human (Homo sapiens), chimpanzee (Pan troglodytes), mouse (Mus musculus), brown rat (Rattus norvegicus), dog (Canis familiaris), thale cress (Arabidopsis thaliana), rice (Oryza sativa), fruit fly (Drosophila melanogaster), yeast (Saccharomyces cerevisiae) • http://www.ebi.ac.uk/2can/genomes/genomes.html has descriptions of species and their clinical and scientific significance • http://www.genomesonline.org has the current status of genome projects

  27. Genome Databases • Completed genomes • ftp site -- ftp://ftp.ncbi.nlm.nih.gov/genomes/ • http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/allorg.html • http://www.ebi.ac.uk/genomes/mot/index.html • http://pir.georgetown.edu/pirwww/search/genome.html • Organism-specific databases • http://www.unl.edu/stc-95/ResTools/biotools/biotools10.html • http://www.fp.mcs.anl.gov/~gaasterland/genomes.html • http://www.hgmp.mrc.ac.uk/GenomeWeb/genome-db.html • http://www.bioinformatik.de/cgi-bin/browse/Catalog/Databases/Genome_Proejcts

  28. Genomes of Prokaryotes • Circular double-stranded DNA • Protein-coding regions do not contain introns • Protein-coding regions are partially organized into operons: tandem genes transcribed into a single mRNA molecule • The density of coding regions is high • ~89% in E. coli • Example: the trp operon in E. coli begins with a control region, followed by genes (trpE, trpD, and others) performing successive steps in the synthesis of the amino acid tryptophan

  29. Genome of E. coli • Many E. coli proteins were known before the sequencing (1853 proteins) • Genome of Escherichia coli, strain MG1655, published in 1997 • By F. Blattner at Univ. Wisconsin • 4.64 Mbp • 4284 protein-coding genes, 122 structural RNA genes, non-coding repeat sequences, regulatory elements, etc. • Average size of ORF is 317 AA • Average inter-genic gap is 118 bp • About ¾ of transcription units contain a single gene, and the rest are operons (gene clusters) • Functions are known for 60% of proteins • http://wishart.biology.ualberta.ca/BacMap/index.html contains an atlas of bacterial genome diagrams (2005)
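
The numbers on this slide can be cross-checked against each other with simple arithmetic. The sketch below uses only the slide's figures; treating every gene as having the average ORF length and the average gap is a simplifying assumption, and the variable names are hypothetical.

```python
GENOME_BP = 4_640_000   # 4.64 Mbp
PROTEIN_GENES = 4284
MEAN_ORF_AA = 317       # average ORF size in amino acids
MEAN_GAP_BP = 118       # average intergenic gap

mean_orf_bp = MEAN_ORF_AA * 3                 # three bases per amino acid
coding_bp = PROTEIN_GENES * mean_orf_bp
coding_fraction = coding_bp / GENOME_BP

# Genes plus gaps should roughly tile the whole chromosome
tiled_bp = PROTEIN_GENES * (mean_orf_bp + MEAN_GAP_BP)

print(f"coding: {coding_bp:,} bp ({coding_fraction:.1%} of genome)")
print(f"genes + gaps: {tiled_bp:,} bp of {GENOME_BP:,} bp")
```

The estimate comes out near 88%, in line with the ~89% coding density quoted for E. coli on the previous slide, and genes plus gaps account for almost the entire 4.64 Mbp.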

  30. Genome of Archaea • Microorganism Methanococcus jannaschii • Thrives in hydrothermal vents at temperatures from 48 to 94 °C • Capable of self-reproduction from inorganic components • Metabolism is to synthesize methane from H2 and CO2 • Sequenced in 1996 by TIGR • 1.665 Mbp chromosome consisting of a circular DNA molecule, plus two extra-chromosomal elements • 1,784 protein-coding regions • Proteins in archaea for transcription and translation are closer to those in eukaryotes • Proteins involved in metabolism are closer to those of bacteria

  31. Genomes of Eukarya • Majority of DNA is in the nucleus • Organized into chromosomes, each containing a single DNA molecule • Smaller amount of DNA in organelles such as mitochondria and chloroplasts • Organelles originated as intra-cellular parasites • Organelle genomes usually have circular forms, but are sometimes linear or multi-circular • Their genetic code is different from the one for nuclear genes • Diverse among species • Humans have 23 pairs of chromosomes, chimpanzees have 24 pairs • Human chromosome #2 is equivalent to a fusion of chimpanzee chromosomes 12 and 13 • List of genome sequences • http://en.wikipedia.org/wiki/List_of_sequenced_eukaryotic_genomes

  32. Genome of Saccharomyces cerevisiae (Yeast) • One of the simplest eukaryotic organisms • Sequencing by about 100 labs completed in 1996 • 12.06 Mbp • 16 chromosomes • 6,172 protein-coding genes • Dense: only 231 genes contain introns

  33. Genome of Caenorhabditis elegans (C. elegans) • Completed in 1998 • First full DNA sequence of a multi-cellular organism • 97 Mbp • Paired chromosomes • XX for a self-fertilizing hermaphrodite (simultaneously male and female) • XO for male • Avg. 5 introns per gene • Proteins • 42% have homologues in other species • 34% specific to nematodes (round worms) • 24% no known homologues

  34. Genome of Drosophila melanogaster (Fruit fly) • Completed in 1999 by Celera Genomics and Berkeley • 180 Mbp • Five chromosomes: 3 large autosomes, Y, and a tiny fifth • 13,601 genes, 1 gene/8 kbp • Has 289 homologues of human genes • Including genes implicated in cancer, cardiovascular disease, neurological disease, etc. • There are fly models for Parkinson's disease and malaria

  35. Genome of Arabidopsis thaliana • Relatively small genome, 146 Mbp, completed in 2000 • Five chromosomes • 25,498 predicted genes; 1 gene/4.6 kbp • Proteins • Most A. thaliana proteins have homologues in animals • 60% of genes have human homologues, e.g., BRCA2 • Gene distribution • Nucleus: genome size (125 Mbp), genes (25,500) • Chloroplast: genome (154 Kbp), genes (79) • Mitochondrion: genome (367 Kbp), genes (58)

  36. 20 of 54 genes in a 340-kbp stretch of the rice genome (top) are conserved and retain the same order in five A. thaliana strands

  37. Human Genome • Human Genome Project • Conceived in 1984, begun in 1990, draft sequence published in 2001, completed in 2003, ahead of schedule • What did the sequence reveal? • 3 Bbp (billion base pairs) • 24 distinct chromosomes • 22 autosomes plus two sex chromosomes (X, Y) • Longest 250 Mbp, shortest 55 Mbp • Mitochondrial genome • Circular DNA molecule of 16.569 kbp • ~10^13 cells • How many is 3 Bbp? • A typical 11-pt font prints 60 nucleotides in about 10 cm (~4 in) • In this format, 3 Bbp writes out to about 5,000 km (~3,100 mi)
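
The "how many is 3 Bbp" estimate is plain unit arithmetic. A minimal sketch, assuming 60 nucleotides per 10 cm of printed line (the metric figure given on the slide); the constant names are hypothetical.

```python
GENOME_BP = 3e9            # 3 billion base pairs
NUCLEOTIDES_PER_LINE = 60
LINE_LENGTH_CM = 10.0      # assumption: 60 nucleotides span about 10 cm
KM_PER_MILE = 1.609344

total_cm = GENOME_BP / NUCLEOTIDES_PER_LINE * LINE_LENGTH_CM
total_km = total_cm / 100 / 1000  # cm -> m -> km
total_miles = total_km / KM_PER_MILE

print(f"{total_km:,.0f} km, about {total_miles:,.0f} miles")
```

Under that assumption the printed genome stretches about 5,000 km, or a little over 3,100 miles, as a single line of text.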

  38. Genome of Homo sapiens • 22 autosomes plus X (163 Mbp) and Y (51 Mbp) • Web resources • Interactive access to DNA and protein sequences • http://www.ensembl.org • Images of chromosomes, maps, loci • http://www.ncbi.nlm.nih.gov/projects/genome/guide/ • Gene map 99 • http://www.ncbi.nlm.nih.gov/genemap99 • Overview of human genome structure • http://www.ims.u-tokyo.ac.jp/imsut/en • SNP (single nucleotide polymorphism) • http://snp.cshl.org • Human genetic diseases • http://www.ncbi.nlm.nih.gov/Omim (Online Mendelian Inheritance in Man) • http://www.geneclinics.org/profiles/all-html

  39. Human Genome Insights (ENCODE) • Majority of genome is transcribed • ~50% transposons • ~25% protein coding genes/1.3% exons • ~23,700 protein coding genes • ~160,000 transcripts • Average Gene ~ 36,000 bp • 7 exons @ ~ 300 bp • 6 introns @ ~5,700 bp • 7 alternatively spliced products (95% of genes) • RefSeq: ~34,600 “reference sequence” genes (includes pseudogenes, known RNA genes)
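
The average-gene figures on this slide are consistent with one another, which a short calculation makes visible. The sketch uses the slide's per-gene numbers plus the 3 billion bp genome size quoted earlier in the deck; applying the averages uniformly to every gene is a simplifying assumption.

```python
GENOME_BP = 3e9            # genome size quoted earlier in the deck
PROTEIN_CODING_GENES = 23_700
EXONS_PER_GENE, EXON_BP = 7, 300
INTRONS_PER_GENE, INTRON_BP = 6, 5_700

exon_bp_per_gene = EXONS_PER_GENE * EXON_BP
gene_bp = exon_bp_per_gene + INTRONS_PER_GENE * INTRON_BP

gene_fraction = PROTEIN_CODING_GENES * gene_bp / GENOME_BP
exon_fraction = PROTEIN_CODING_GENES * exon_bp_per_gene / GENOME_BP

print(f"average gene: {gene_bp:,} bp")
print(f"genes: {gene_fraction:.1%} of genome, exons: {exon_fraction:.1%}")
```

Seven exons and six introns add up to 36,300 bp, matching the ~36,000 bp average gene. The genome-wide shares come out near 29% for genes and 1.7% for exons, the same order as the ~25% and 1.3% on the slide.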

  40. Genome of Homo sapiens (cont’d) • Repeat sequences >50% of the genome • Short interspersed nuclear elements (SINEs): 13%, long interspersed nuclear elements (LINEs): 21% • Simple stutters (repeats of short oligomers including mini- and micro-satellites) • Triplet repeats such as CAG are implicated in numerous diseases (e.g., glutamine repeats in the huntingtin protein) • SNP (pronounced snip) • A->T mutation in beta-globin changes Glu -> Val, creating a sticky surface on haemoglobin molecules => sickle-cell anaemia • Progeria • Avg 1 SNP/kbp (100 SNPs per 100 kbp) • Many 100-kbp regions tend to remain intact, with fewer than five SNPs • Discrete combinations of SNPs define an individual’s haplotype (haploid genotype) • Individual genomes are characterized by a distribution of genetic markers including SNPs • Int’l HapMap Consortium

  41. Genome of Homo sapiens (cont’d) • SNP consortium • Collects human SNPs, nearly 5 million SNPs • Findings • Most variations appear in all populations • However, a few SNPs are unique to particular populations • Genomes of individuals from Japan and China are very similar • Chromosome X varies more than other chromosomes (X is more subject to selective pressure) • Mitochondrial DNA • Double-stranded closed circular molecule of 16,569 bp • Inherited almost exclusively through maternal lines • Not subject to recombination, and changes only by mutation • About 1 mutation every 25,000 years

  42. mtDNA and Y • mtDNA is inherited through maternal lines • Both sons and daughters get it from their mother • All existing sequence variants are traced back to a single woman (Mitochondrial Eve) in Africa roughly 200,000 years ago • Supports the “out of Africa” hypothesis • Avg difference in mtDNA between pairs of individuals is 61.1, between Africans is 76.7, between non-Africans is 38.5 • Populations have been diverging in Africa for much longer than in the rest of the world • Y chromosome • Most recent common male ancestor (Y-chromosome Adam) lived around 59,000 years ago • Most divergent sequences are found among Africans

  43. Other Species

  44. Prog. 1 • In the Scientific American article (www.cs.uml.edu/~kim/580/SA_genetic_code.pdf) Douglas Hofstadter claims that no biological function would be affected if we magically replaced the known Genetic Code with a new one (the hypothesis in the middle column on page 28). • What is the basis of his claim? • State whether you agree or disagree with his claim and justify your argument. • Include references if you used any. • Due in one week on 2/3
