91.350/580 Topics in Bioinformatics • What is bioinformatics ? • DNA sequences • Protein sequences/structures • Modeling/inference • Intersection • biology • statistics • computer science • algorithms • machine learning
DNA and RNA • DNA (deoxyribonucleic acid) and RNA (ribonucleic acid) are linear chains of monomeric units called nucleotides • A nucleotide has three parts: a sugar, a phosphate, and a base • Four bases
Base Types • Nucleic acid bases are of two types • Pyrimidine [pairímədì:n]– C, T, U (two nitrogens in 6-member ring at positions 1 and 3) • Purine – A, G (pyrimidine ring fused to an imidazole ring (C3H4N2))
Primary Structure of DNA and RNA • Nucleotides are joined by phosphodiester bonds and form a sugar-phosphate backbone • Sugar is deoxyribose in DNA (left) and ribose in RNA (right) • Nitrogen-containing nucleobases are bonded to the sugar
Online course on Biology • Educational Portal • DNA chemical structure • http://education-portal.com/academy/lesson/dna-and-the-chemical-structure-of-nucleic-acids.html
Secondary Structure • Double helix – proposed in 1953 by Watson and Crick from X-ray diffraction data • Sugar-phosphate backbone is the outer part of the helix • Two strands run in antiparallel directions • Dimensions • Inside diameter of backbone: 11 Å (1.1 nm) • Outer diameter: 20 Å (1 Å = 10^-10 m = 0.1 nm) • Length of one complete turn: 34 Å, 10 base pairs • Major and minor grooves – sites where drugs or polypeptides bind to DNA
Secondary Structure of DNA • Two strands are complementary • Base pairing: A-T; G-C • Each pair joins a purine to a pyrimidine through complementary hydrogen bonds
Monomer counts in DNA • In double strands • # of A = # of T; # of G = # of C • Erwin Chargaff’s 1st Parity Rule, 1951 • In a single strand? • # of A ≈ # of T; # of G ≈ # of C (holds only approximately) • Erwin Chargaff’s 2nd Parity Rule
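The first parity rule follows directly from base pairing, which a short Python sketch can illustrate. The example strand and the helper names `reverse_complement` and `duplex_counts` are made up for illustration; they are not from the course material.

```python
from collections import Counter

# Watson-Crick pairing for DNA: A-T and G-C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand):
    """Return the antiparallel partner strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

def duplex_counts(strand):
    """Count bases over both strands of the double helix."""
    return Counter(strand + reverse_complement(strand))

strand = "ATGGCGTACGCTTAA"          # hypothetical example sequence
counts = duplex_counts(strand)

# 1st parity rule: exact in the duplex, by construction of the pairing.
print(counts["A"] == counts["T"], counts["G"] == counts["C"])

# In the single strand the counts need not match exactly.
print(Counter(strand))
```

In the duplex the equalities hold for any input, because every A on one strand is matched by a T on the other. The single-strand counts in the last line are only approximately balanced in real genomes, and need not be balanced at all in a short toy sequence.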
Importance of Hydrogen Bonding • Many consider the hydrogen bond essential to the evolution of life • An individual hydrogen bond is weak, but many H bonds collectively exert a very strong force • Orderly repetitive arrangement of H bonds in polymers determines their shape
Online course on Biology • Educational Portal • Four bases • http://education-portal.com/academy/lesson/dna-adenine-guanine-cytosine-thymine-complementary-base-pairing.html
Chromosome Length • 3.4 Å per base • 3 billion bases • 1.8 meters of DNA • 0.09 mm of chromatin after being wound on histones • Five families of histones • H1/H5, H2A, H2B, H3, and H4
RNA • Sugar in an RNA nucleotide is ribose rather than 2’-deoxyribose • Thymine is replaced by uracil (U) • RNA polymers are usually a few thousand nucleotides or shorter • RNA in cells is usually single-stranded • RNA is considered to be the original gene-coding material, and it still encodes genes in a few viruses
RNA Types • Four types of RNA are involved in protein synthesis
Online course on Biology • Educational Portal • RNA • http://education-portal.com/academy/lesson/differences-between-rna-and-dna-types-of-rna-mrna-trna-rrna.html
Gene to Protein • [Figure: gene structure diagram with two genes (Protein 1, Protein 2), labelling protein-coding and non-protein-coding regions, 5’UTR, 3’UTR, exons, introns, and intergenic regions]
Translation • Genetic Code • A triplet (called a codon) • Ribosome moves along mRNA 3 bases at a time • Degenerate coding • 4 × 4 × 4 = 64 possible triplets map onto 20 amino acids • For 8 amino acids the 3rd base is irrelevant, so a mutation at that position leaves the protein unchanged • Anti-codon – reverse complement of a codon
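The counts on this slide can be reproduced from the standard genetic code. The Python sketch below builds the table in the usual TCAG ordering and uses the DNA alphabet (T instead of U) for convenience; the function names are illustrative and not part of the course material.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def anticodon(codon):
    """Reverse complement of a codon (DNA alphabet)."""
    return codon.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Read a coding sequence 3 bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def third_base_irrelevant():
    """Amino acids with a codon family where any 3rd base gives the same residue."""
    found = set()
    for first, second in product(BASES, repeat=2):
        encoded = {CODON_TABLE[first + second + third] for third in BASES}
        if len(encoded) == 1:
            found.add(encoded.pop())
    return found

print(len(CODON_TABLE))                    # number of possible triplets
print(len(set(AMINO_ACIDS) - {"*"}))       # number of amino acids
print(sorted(third_base_irrelevant()))     # fourfold-degenerate families
print(translate("ATGGCCTAA"), anticodon("ATG"))
```

Note that serine, leucine, and arginine each have additional codons outside their fourfold-degenerate family, so the third base is irrelevant only within that family.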
Amino Acids • General structure of amino acids • an amino group • a carboxyl group • α-carbon bonded to a hydrogen and a side-chain group, R • R determines the identity of particular amino acid • R: large white and gray • C: black • Nitrogen: blue • Oxygen: red • Hydrogen: white
Genome • Genome • The entire DNA of a cell is its genome • Individual units coding for proteins or RNA are genes • A gene starts with ATG and ends with one or two stop codons • Called an ORF (Open Reading Frame) • Biological Info • Contained in the genome • Encoded in nucleotide sequences of DNA or RNA • Partitioned into discrete units, genes
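The ORF definition above (ATG, then in-frame codons up to a stop codon) can be sketched as a naive scan in Python. This is a minimal illustration under simplifying assumptions: it searches the forward strand only, reports every ATG that reaches an in-frame stop, and uses a hypothetical `min_codons` cutoff; real gene finders are considerably more elaborate.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ORFs on the forward strand.

    An ORF here is ATG followed by in-frame codons up to and including
    the first stop codon; `end` is exclusive, so dna[start:end] is the ORF.
    `min_codons` counts the codons before the stop, including ATG.
    """
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs

sequence = "CCATGAAATGACC"          # hypothetical example sequence
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

The second ATG in the example (at index 7) is skipped because no stop codon follows it in its own reading frame.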
Cell • Two kinds of cells • Prokaryote (pro, “before”; karyon, “kernel” in Greek) • Eukaryote (eu, “true”) • Main difference is the presence of organelles, especially the nucleus, in eukaryotes
[Figure: animal cell, plant cell, prokaryotic cell]
Three Domains • Classification purely based on biochemistry (RNA) • C. Woese, 1981 • Eubacteria (true bacteria) • Archaea (archaebacteria, early bacteria) • Eukarya (eukaryotes)
Genome Sequencing Projects • Major genome sequencing centers • U.S. Dept. of Energy Joint Genome Institute (435 projects) • J. Craig Venter Institute (302) • The Institute for Genomic Research (TIGR) (206) • Washington Univ. (184) • Institut Pasteur, Univ. of Tokyo • www.ncbi.nlm.nih.gov/genomes/static/lcenters.html • Completely sequenced genomes include • Several hundred bacteria, over 20 archaea, and over 30 eukarya • Human (Homo sapiens), chimpanzee (Pan troglodytes), mouse (Mus musculus), brown rat (Rattus norvegicus), dog (Canis familiaris), thale cress (Arabidopsis thaliana), rice (Oryza sativa), fruit fly (Drosophila melanogaster), yeast (Saccharomyces cerevisiae) • http://www.ebi.ac.uk/2can/genomes/genomes.html has descriptions of species and their clinical and scientific significance • http://www.genomesonline.org has the current status of genome projects
Genome Databases • Completed genomes • ftp site -- ftp://ftp.ncbi.nlm.nih.gov/genomes/ • http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/allorg.html • http://www.ebi.ac.uk/genomes/mot/index.html • http:/pir.goergetown.edu/pirwww/search/genome.html • Organism-specific databases • http://www.unledu/stc-95/ResTools/biotools/biotools10.html • http://www.fp.mcs.anl.gov/~gaasterland/genomes.html • http://www.hgmp.mrc.ac.uk/GenomeWeb/genome-db.html • http://www.bioinformatik.de/cgi-bin/browse/Catalog/Databases/Genome_Proejcts
Genomes of Prokaryotes • Circular double-stranded DNA • Protein-coding regions do not contain introns • Protein-coding regions are partially organized into operons – tandem genes transcribed into a single mRNA molecule • The density of coding regions is high • ~89% in E. coli • [Figure: trp operon, showing genes trpE and trpD] The trp operon in E. coli begins with a control region, followed by genes that perform successive steps in the synthesis of the amino acid tryptophan
Genome of E. coli • Many E. coli proteins were known before the sequencing (1853 proteins) • Genome of Escherichia coli, strain MG1655, published in 1997 • By F. Blattner at Univ. Wisconsin • 4.64 Mbp • 4284 protein-coding genes, 122 structural RNA genes, non-coding repeat sequences, regulatory elements, etc. • Average size of ORF is 317 AA • Average inter-genic gap is 118 bp • ¾ of transcription units contain a single gene, and the rest are operons (gene clusters) • Functions of 60% of the proteins are known • http://wishart.biology.ualberta.ca/BacMap/index.html contains an atlas of bacterial genome diagrams (2005)
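As a rough consistency check, the gene count and average ORF size on this slide imply a coding density close to the figure quoted for prokaryotes. The Python sketch below assumes every protein-coding gene has the average length and ignores stop codons and RNA genes, so it gives only an estimate.

```python
# Figures taken from the slide; the calculation itself is a back-of-envelope estimate.
GENOME_BP = 4_640_000        # 4.64 Mbp
PROTEIN_GENES = 4284
AVG_ORF_AA = 317             # average ORF length in amino acids

def coding_fraction(genes, avg_aa, genome_bp):
    """Fraction of the genome covered by protein-coding sequence (3 bp per residue)."""
    return genes * avg_aa * 3 / genome_bp

fraction = coding_fraction(PROTEIN_GENES, AVG_ORF_AA, GENOME_BP)
print(f"{fraction:.1%}")     # 87.8%
```

The result is a little under 88%, in line with the roughly 89% coding density usually quoted for E. coli once RNA genes are included.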
Genome of Archaea • Microorganism Methanococcus jannaschii • Thrives in hydrothermal vents at temperatures from 48 to 94 °C • Capable of self-reproduction from inorganic components • Metabolism synthesizes methane from H2 and CO2 • Sequenced in 1996 by TIGR • 1.665 Mbp in a chromosome containing a circular DNA molecule, plus two extra-chromosomal elements • 1,784 protein-coding regions • Proteins in archaea for transcription and translation are closer to those in eukaryotes • Proteins involved in metabolism are closer to those of bacteria
Genomes of Eukarya • Majority of DNA is in the nucleus • Organized into chromosomes, each containing a single DNA molecule • Smaller amount of DNA in organelles such as mitochondria and chloroplasts • Organelles originated as intra-cellular parasites • Organelle genomes are usually circular, but sometimes linear or multi-circular • Organelle genetic code differs from the one used for nuclear genes • Diverse among species • Humans have 23 pairs of chromosomes, chimpanzees have 24 pairs • Human chromosome #2 is equivalent to a fusion of chimpanzee chromosomes 12 and 13 • List of genome sequences • http://en.wikipedia.org/wiki/List_of_sequenced_eukaryotic_genomes
Genome of Saccharomyces cerevisiae (Yeast) • Simplest eukaryotic organism • Sequencing by about 100 labs completed in 1996 • 12.06 Mbp • 16 chromosomes • 6,172 protein-coding genes • Dense: only 231 genes contain introns
Genome of Caenorhabditis elegans (C. elegans) • Completed in 1998 • First full DNA sequence of a multi-cellular organism • 97 Mbp • Paired chromosomes • XX for a self-fertilizing hermaphrodite (simultaneously male and female) • XO for male • Avg. 5 introns per gene • Proteins • 42% have homologues in other species • 34% specific to nematodes (round worms) • 24% no known homologues
Genome of Drosophila melanogaster (Fruit fly) • Completed in 1999 by Celera Genomics and Berkeley • 180 Mbp • Five chromosomes: 3 large autosomes, Y, and tiny fifth • 13,601 genes, 1 gene/8 kbp • Has 289 homologues to human disease genes • Such as cancer, cardiovascular, neurological, etc. • There are fly models for Parkinson’s disease and malaria
Genome of Arabidopsis thaliana • Relatively small genome, 146 Mbp, completed in 2000 • Five chromosomes • 25,498 predicted genes; 1 gene/4.6 kbp • Proteins • Most A. thaliana proteins have homologues in animals • 60% of genes have human homologues, e.g., BRCA2 • Gene distribution • Nucleus: genome size (125 Mbp), genes (25,500) • Chloroplast: genome (154 Kbp), genes (79) • Mitochondrion: genome (367 Kbp), genes (58)
20 of 54 genes in a 340-kbp stretch of the rice genome (top) are conserved and retain the same order in five A. thaliana strands
Human Genome • Human Genome Project • Conceived in 1984, begun in 1990, completed in 2001 ahead of 2003 schedule • What did the sequence reveal? • 3 Bbp (billion base pairs) • 24 distinct chromosomes • 22 autosomes plus two sex chromosomes (X, Y) • Longest 250 Mbp, shortest 55 Mbp • Mitochondrial genome • Circular DNA molecule of 16,569 bp • ~10^13 cells • How much is 3 Bbp? • A typical 11-pt font prints 60 nucleotides in about 4 in (~10 cm) • In this format, 3 Bbp writes out to about 5,000 km (~3,100 mi)
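The print-out estimate is simple arithmetic, sketched below in Python. It assumes 60 nucleotides per 10 cm of printed line and treats the genome as one continuous line of text; the variable and function names are illustrative only.

```python
GENOME_BP = 3_000_000_000     # 3 billion base pairs
BASES_PER_SEGMENT = 60        # nucleotides printed per segment of line
SEGMENT_CM = 10               # printed length of that segment, in cm
KM_PER_MILE = 1.609344

def printed_length_km(genome_bp, bases_per_segment, segment_cm):
    """Length of the genome written out as a single line of text, in km."""
    segments = genome_bp // bases_per_segment
    return segments * segment_cm / 100_000    # 100,000 cm per km

length_km = printed_length_km(GENOME_BP, BASES_PER_SEGMENT, SEGMENT_CM)
length_mi = length_km / KM_PER_MILE
print(f"{length_km:,.0f} km = {length_mi:,.0f} mi")
```

Under these assumptions the text runs to about 5,000 km, or roughly 3,100 miles.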
Genome of Homo sapiens • 22 chromosomes plus X (163 Mbp) and Y (51 Mbp) • Web resources • Interactive access to DNA and protein sequences • http://www.ensembl.org • Images of chromosomes, maps, loci • http://www.ncbi.nlm.nih.gov/projects/genome/guide/ • Gene map 99 • http://www.ncbi.nlm.nih.gov/genemap99 • overview of human genome structure • http://www.ims.u-tokyo.ac.jp/imsut/en • SNP (Single nucleotide polymorphisms) • http://snp.cshl.org • Human genetic diseases • http://www.ncbi.nlm.nih.gov/Omim (Online Mendelian Inheritance in Man ) • http://www.geneclinics.org/profiles/all-html
Human Genome Insights (ENCODE) • Majority of genome is transcribed • ~50% transposons • ~25% protein coding genes/1.3% exons • ~23,700 protein coding genes • ~160,000 transcripts • Average Gene ~ 36,000 bp • 7 exons @ ~ 300 bp • 6 introns @ ~5,700 bp • 7 alternatively spliced products (95% of genes) • RefSeq: ~34,600 “reference sequence” genes (includes pseudogenes, known RNA genes)
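The average-gene figures above are mutually consistent, as a quick Python check shows. The numbers are the rounded values from the slide, so the result is approximate.

```python
# Rounded averages from the slide.
EXONS, EXON_BP = 7, 300
INTRONS, INTRON_BP = 6, 5700

exon_total = EXONS * EXON_BP            # total exonic length per gene
intron_total = INTRONS * INTRON_BP      # total intronic length per gene
gene_bp = exon_total + intron_total

print(gene_bp)                                        # 36300
print(f"exons are {exon_total / gene_bp:.1%} of an average gene")
```

Seven exons and six introns give about 36,300 bp, matching the ~36,000 bp average gene, with exons making up under 6% of the gene's length.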
Genome of Homo sapiens (cont’d) • Repeat sequences >50% of the genome • Short interspersed nuclear elements (SINEs): 13%, long interspersed nuclear elements (LINEs): 21% • Simple stutters (repeats of short oligomers including mini- and micro-satellites) • Triplet repeats such as CAG are implicated in numerous diseases (e.g., glutamine repeats in the encoded protein) • SNP (pronounced snip) • A->T mutation in beta-globin changes Glu -> Val, creating a sticky surface on haemoglobin molecules => sickle-cell anaemia • Progeria • Avg 1 SNP/Kbp (100 SNPs per 100 Kbp) • Many 100-Kbp regions tend to remain intact, with fewer than five SNPs • Discrete combinations of SNPs define an individual’s haplotype (haploid genotype) • Individual genomes are characterized by a distribution of genetic markers including SNPs • Int’l HapMap Consortium
Genome of Homo sapiens (cont’d) • SNP consortium • Collects human SNPs, nearly 5 million so far • Findings • Most variations appear in all populations • However, a few SNPs are unique to particular populations • Genomes of individuals from Japan and China are very similar • Chromosome X varies more than other chromosomes (X is more subject to selective pressure) • Mitochondrial DNA • Double-stranded closed circular molecule of 16,569 bp • Inherited almost exclusively through maternal lines • Not subject to recombination, and changes only by mutation • About 1 mutation every 25,000 years
mtDNA and Y • mtDNA is inherited through maternal lines • Both sons and daughters get it from their mother • All existing sequence variants are traced back to a single woman (Mitochondrial Eve) in Africa roughly 200,000 years ago • Supports the “out of Africa” hypothesis • Avg difference in mtDNA between pairs of individuals is 61.1, between Africans is 76.7, between non-Africans is 38.5 • Populations have been diverging in Africa for much longer than in the rest of the world • Y chromosome • Most recent common male ancestor (Y-chromosome Adam) lived around 59,000 years ago • Most divergent sequences are found among Africans
Prog. 1 • In the Scientific American article (www.cs.uml.edu/~kim/580/SA_genetic_code.pdf) Douglas Hofstadter claims that no biological function would be affected if we magically replaced the known Genetic Code by a new one (the hypothesis in the middle column on page 28). • What is the basis of his claim? • State whether you agree or disagree with his claim and justify your argument. • Include references if you used any. • Due in one week on 2/3