Introduction to Biology: Cell Structure and Central Dogma

Water seeking head group Cholesterol water fatty chains Introduction to Biology Saleet Jafri GMU EUKARYA BACTERIA ARCHAEA Presumed common progenitor of all extant organism Presumed common progenitor of archaebacteria and eukaryotes

Cell wall (outer membrane) Cell wall (inner membrane) Ribosome DNA RNA Periplasmic space and cell wall Nucleoid DNA mesosome septum inner (plasma) membrane Cell wall Periplasmic space Outer membrane nucleoid outer Membrane inner membrane 0.5 m |_____|

Eucharyotic cell (organelles) Nuclear membrane Plasma (cell) membrane Gogli vesicles Mitochondrion Peroxisome lysosome Nucleus Secretory vesicle Rough endoplasmic reticulum 1 m |____| Cell membrane Nucleus Cytoplasm Endoplasmic Reticulum (ER) – rough and smooth – A membranous organelle system in the cytoplasm. The outer surface may be ribosome-studded (rough) or not (smooth). Gogli apparatus –receives newly formed proteins from ER; modifies; directs them to final destination. Mitochondria – respiratory centers, have their own circular DNA, of bacterial origin. Chromosomes – chromatin, histones, centromeres and arms (2 pair in Eukaryotes). Lysosomes – contain acid hydrolases – nucleases, proteases, glycodidases, lipases, phosphatases, sulfatases, phosopholipases. Peroxisomes – use oxygen to remove hydrogen from substrates forming H2O2, abundant in kidney and liver detoxification. Cytoskeleton – an internal array of microtubules, microfilaments, and intermediate filaments that confer shape and the ability to move on a Eukaryote cell.

Eukaryote Membrane Exterior oligosaccharide glycoprotien glycolipid leaflets Fatty acyi tails Integral protein Hydrophilic polar head Peripheral proteins Phospholipid bilayer Hydrophobic core phospholipids Interior 2 kinds of Nucleic Acids (RNA = ribonucleic acid and DNA = deoxyribonucleic acid) Nucleic acid structure: purines: adenine=A guanine=G pyrimidines: uracil=U thymine=T cytosine=C A always pairs with T and C always pairs with G (each pair is called a base pair in double helix DNA) DNA may consist of millions of base pairs A short sequence (<100) is called an oligonucleotide RNA: different sugar (ribose instead of 2’-deoxyribose) Uracil (U) instead of thymine (U binds with A) RNA does not form a complex 3-D structure (like DNA and other protein) Protein = functional and structural units of the cell Central Dogma: DNA  RNA  protein (flow of information is unidirectional)

Gene or DNA transcription RNA molecules synthesized by RNA polymerase. RNA polymerase binds very tightly to promoter. region on DNA. Promoter region contains start site. Transcription ends at termination signal site. Primary transcript: direct coding of DNA  RNA. RNA splicing: introns removed to make mRNA. mRNA has codon sequence that codes for a protein. Uracil replaces thymine Splicing and alternative splicing happens. Translation Transfer RNA (tRNA) makes connection between specific codons in mRNA and amino acids. As tRNA binds to the next codon in mRNA, its amino acid is bound to the last amino acid in the protein chain. When a STOP codon is encountered, the ribosome releases the mRNA and synthesis ends. tRNA links an amino acid to the codon on the mRNA via the anti-codon. rRNA = RNA found in ribosomes Ribosomes = large and small subunit, made of protein and rRNA Initiator tRNA always carries methionine Initiation factors=proteins catalyzing start of transcription Endoplasmic reticulum Post-transcriptional modification .------Chromosomic DNA (gene)--------------------------. ||||||||||||| exon1 |||||||||| exon2 |||||||||||||| exon3 ||||||||||| Promoter intron1 intron2 intron3 transcription . Nuclear RNA . | exon1 |||||||||| exon2 |||||||||||||| exon3| intron1 intron2 translation . mRNA . | exon1 | exon2 | exon3|

Central Dogma DNA RNA Protein deoxyribonucleic acid In eukaryotes, 1 mRNA = 1 protein. (in bacteria, 1 mRNA can be polycistronic, or code for several proteins) DNA in eukaryotes forms a stable, compacted complex with histones (in bacteria, DNA is not in a permanently condensed state Eukaryotic DNA contains large regions of repetitive DNA. (in bacteria, DNA rarely contains any "extra" DNA) Much of eukaryotic DNA does not code for proteins (~98% is non-coding in humans; in bacteria, often less than 5% of genome) Sometimes, eukaryotes can use controlled gene rearrangement for increasing number of specific genes. (in bacteria, happens rarely) Eukaryotic genes are split into exons and introns. (in bacteria, genes are almost never split) In eukaryotes, mRNA is synthesized in nucleus, then processed and exported to cytoplasm. (in bacteria, transcription and translation can take place simultaneously off same piece of DNA Proposed by Francis Crick in 1958 to describe the flow of information in a cell. Information stored in DNA is transferred residue-by-residue to RNA which in turn transfers the information residue-by residue to protein. The Central Dogma was proposed by Crick to help scientists think about molecular biology. It has undergone numerous revisions in the past 45 years. ribonucleic acid Concept of gene is historically defined on basic of genetic inheritance of phenotype (Mendellian Inheritance) DNA of an organism encodes genetic info. It’s made up of double stranded helix composed of ribose sugars Adenine(A), Citosine (C), Guanine (G) and Thymine (T). [note that only 4 values nees be encode ACGT.. Which can be done using 2 bits.. But to allow redundant letter combinations (like N means any 4 nucleotides), one usually resorts to a 4 bit alphabet.]

DNA DNA: terminology base: thymine (pyrimidine) monophosphate base sugar: 2’-deoxyribose  sugar 5’ 4’ 1’ (5’ to 3’) 2’ 3’ 3’ linkage nucleoside base:adenine (purine) 5’ linkage no 2’-hydroxyl base phosphate(s) sugar nucleotides (nucleoside mono-, di-, and triphosphates)

DNA: structure DNA is double stranded DNA strands are antiparallel G-C pairs have 3 hydrogen bonds A-T pairs have 2 hydrogen bonds One strand is the complement of the other Major and minor grooves present different surfaces Cellular DNA is almost exclusively B-DNA B-DNA has ~10.5 bp/turn of the helix

Base Nucleoside (RNA) Deoxynucleoside (DNA) Adenine Adenosine Deoxyadenosine Guanine Guanosine Deoxyguanosine Cytosine Cytidine Deoxycytidine Uracil Uridine (not usually found) Thymine (not usually found) (Deoxy)thymidine base sugar nucleoside RNA: terminology RNA can be single or double stranded G-C pairs have 3 hydrogen bonds A-U pairs have 2 hydrogen bonds Single-stranded, double-stranded, and loop RNA present different surfaces

carboxyl group Protein amino group 20 amino acids Peptide bond

Protein structure -helix antiparallel -sheet

DNA RNA Protein The Central Dogma ATGAGTAACGCG TACTCATTGCGC (gene) Replication duplication of DNA using DNA as the template ATGAGTAACGCG TACTCATTGCGC + (nontemplate, antisense) ATGAGTAACGCG TACTCATTGCGC (template, sense) Transcription synthesis of RNA using DNA as the template (mRNA) AUGAGUAACGCG codon tRNA ribosomes Translation (protein) MetSerAsnAla synthesis of proteins using RNA as the template

DNA RNA Protein The Central Dogma Replication Repair and recombination DNA pol and  2. DNA pol and  RNA pol I-ribosomal RNA (rRNA) RNA pol II-messenger RNA (mRNA) RNA pol III-5S rRNA, snRNA, tRNA Transcription RNA processing mRNA splicing rRNA and tRNA processing capping and polyadenylation Translation Post-translational modification phosphorylation methylation ubiquitination

Compartmentalization of processes (transport is important) replication Splicing out introns?

Regulation occurs at each step of a process Initiation (starting) -what is the signal that initiates the process? -what are the factors involved in initiation (cis-and trans-acting)? 2. Elongation (continuation) -how is the process maintained with high fidelity once initiated? -what are the factors involved in elongation (cis- and trans-acting)? 3. Termination (ending) -what is the signal that stops the process? -what are the factors involved in termination (cis- and trans-acting)? Other general regulatory considerations How is the rate of a process regulated? 2. How are the steps regulated in a cell, tissue, or gene-specific manner? 3. Stability of biomolecules 4. Cellular localization of biomolecules

DNA Exceptions to the Central Dogma Nobel Prizes Epigenetic marks, such as patterns of DNA methylation, can be inherited and provide information other than the DNA sequence retroviruses use reverse transcriptase to replicate their genome (David Baltimore and Howard Temin) mRNA introns (splicing) (Philip Sharp and Richard Roberts) RNA editing (deamination of cytosine to yield uracil in mRNA) RNA interference (RNAi) a mechanism of post-transcriptional gene silencing utilizing double-stranded RNA RNAs (ribozymes) can catalyze an enzymatic reaction (Thomas Cech and Sidney Altman) RNA viruses RNA Prions are heritable proteins responsible for neurological infectious diseases (e.g. scrapie and mad cow) (Stanley Pruisner) Protein

The Flow of Biotechnology Information Gene Function > DNA sequence AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG TAAGAAGATCGCGAACATCTAGTAGA > Protein sequence MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNI DELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGK KVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE PDEAEQDCIEFGKKIANI

Upstream (5’) Gene region promoter Downstream (3’) TAC DNA Transcription(gene is encoded on minus strand .. And the reverse complement is read into mRNA) prokaryote (operon structure): Sometimes genes that are part of same operational pathway are grouped together under single promoter - then produce a pre-mRNA which eventually produces 3 separates mRNAs ATG mRNA upstream promoter downstream 5´ UTR 3´ UTR CoDing Sequence (CDS) ATG Translation:tRNA reads off each codon (3 bases at a time) starting at start codon until it reaches a STOP codon. Gene 1 Gene 2 Gene 3 protein Prokaryotes (intronless protein coding genes) • Why does nature bother with mRNA? Why would the cell want to have an intermediate between DNA and the proteins it encodes? • Gene information can be amplified by having many copies of an RNA made from one copy of DNA. • Regulation of gene expression can be effected by having specific controls at each element of the pathway between DNA and proteins. The more elements there are in the pathway, the more opportunities there are to control it in different circumstances. • In Eukaryotes, DNA can then stay pristine and protected, away from caustic chemistry of cytoplasm.

Bacterial Gene Structure of signals - Transcription factor binding site. - Promoters -35 sequence (T82T84G78A65C54A45) 15-20 bases -10 sequence (T80A95T45A60A50T96) 5-9 bases • Start of transcription : initiation start: Purine90 (sometimes it’s the “A” in CAT) - translation binding site (shine-dalgarno 10 bp upstream of AUG (AGGAGG) • - One or more Open Reading Frame • - start-codon (unless sequence is partial) - until next in-frame stop codon on that strand .. • Separated by intercistronic sequences. Genetic Code:How does an mRNA specify an amino acid seq? It would be impossible for each amino acid to be specified by one nucleotide, because there are only 4 nucleotides and 20 amino acids. 2 nucs could specify 16; 3 ~ 64. Each amino acid is specified by up to 6 different combos of 3 nucleotides, called codons, each coding for one amino acid. 1st codon is START, and usually coincides with Methionine. (M which has codon code ‘ATG’) Last codon is STOP, and does NOT code for an amino acid. It is sometimes represented by ‘*’ CoDing region (CDS) starts at START codon and ends at STOP. Different organisms have different frequencies of codon usage. A handful of species vary from this codon association and use different codons for different amino acids. How do tRNAs recognize to which codon to should an amino acid? tRNA has anticodon on its mRNA binding end, complementary to the codon on the mRNA. Each tRNA only binds appropriate amino acid for its anticodon. Bacterial genomes have simple gene structure. - Termination

tRNA( transfer RNA) is a small RNA that has a very specific secondary and tertiary structure such that it can bind an amino acid at one end, and mRNA at the other end. It acts as an adaptor to carry the amino acid elements of a protein to the appropriate place as coded for by the mRNA. T 3-D Tertiary structure tRNA Secondary structure RNA RNA has the same primary structure as DNA. It consists of a sugar-phosphatebackbone, with nucleotides attached to the 1' carbon of the sugar. DNA/RNA differnces are: RNA has a hydroxyl group on the 2' carbon of the sugar (thus, the difference between deoxyribonucleic acid and ribonucleic acid. Instead of using the nucleotide thymine, RNA uses another nucleotide called uracil: Because of the extra hydroxyl group on the sugar, RNA is too bulky to form a stable double helix. RNA exists as a single-stranded molecule. However, regions of double helix can form where there is some base pair complementation (U and A , G and C), resulting in hairpin loops. The RNA molecule with its hairpin loops is said to have a secondary structure. Because the RNA molecule is not restricted to a rigid double helix, it can form many different stable three-dimensional tertiary structures.

Bacterial Gene Prediction Most of the consensus sequences are known from ecoli studies. So for each bacteria the exact distribution of consensus will change. Most modern gene prediction programs need to be “trained”. E.g. they find their own consensus and assembly rules given a few examples genes. A few programs find their own rules from a completely unannotated bacterial genome by trying to find conserved patterns. This is feasible because ORF’s restrict the search space of possible gene candidates. E.g. selfid program(selfid@igs.cnrs-mrs.fr) OPEN READING FRAME: On a given piece of DNA, there can be 6 possible frames. The ORF can be either on + or minus strand and on any of 3 possible frames Frame 1: 1st base of start codon can either start at base 1,4,7,10,... Frame 2: 1st base of start codon can either start at base 2,5,8,11,... Frame 3: 1st base of start codon can either start at base 3,6,9,12,... (frame –1,-2,-3 are on minus strand) Some progs have other conventions for naming frames (0..5, 1-6..) Gene finding in eukaryotic cDNA uses ORF finding +blastx as well. http://www.ncbi.nlm.nih.gov/gorf/gorf.html try with gi=41 ( or your own piece of DNA)

Eukaryotic Central Dogma In Eukaryotes ( cells where the DNA is sequestered in a separate nucleus) The DNA does not contain a duplicate of the coding gene, rather exons must be spliced. ( many eukaryotes genes contain no introns! .. Particularly true in ´lower´ organisms) mRNA – (messenger RNA) Contains the assembled copy of the gene. The mRNA acts as a messenger to carry the information stored in the DNA in the nucleus to the cytoplasm where the ribosomes can make it into protein.

Eukaryotic Nuclear Gene Structure • Gene prediction for Pol II transcribed genes. • Upstream Enhancer elements. • Upstream Promoter elements. • GC box(-90nt) (20bp), CAAT box(-75 nt)(22bp) • TATA promoter (-30 nt) (70%, 15 nt consensus (Bucher et al (1990)) • 14-20 nt spacer DNA • CAP site (8 bp) • Transcription Initiation. • Transcript region, interrupted by introns. Translation Initiation (Kozak signal 12 bp consensus) 6 bp prior to initiation codon. • polyA signal (AATAAA 99%,other)

introns • Transcript region, interrupted by introns. Each introns • starts with a donor site consensus (G100T100A62A68G84T63..) • Has a branch site near 3’ end of intron (one not very conserved consensus UACUAAC) • ends with an acceptor site consensus. (12Py..NC65A100G100) UG UACUAAC AG

Exons • The exons of the transcript region are composed of: • 5’UTR (mean length of 769 bp) with a specific base composition, that depends on local G+C content of genome) • AUG (or other start codon) • Remainder of coding region • Stop Codon • 3’ UTR (mean length of 457, with a specific base composition that depends on local G+C content of genome)

Structure of the Eukaryotic Genome ~6-12% of human DNA encodes proteins(higher fraction in nematode) ~10% of human DNA codes for UTR ~90% of human DNA is non-coding.

Non-Coding Eukaryotic DNA • Untranslated regions (UTR’s) • introns (can be genes within introns of another gene!) • intergenic regions. • - repetitive elements • - pseudogenes (dead • genes that may(or not) have been retroposed back in the genome as a single-exon “gene”

Pseudogenes Pseudogenes: Dna sequence that might code for a gene, but that is unable to result in a protein. This deficiency might be in transcription (lack of promoter, for example) or in translation or both. Processed pseudogenes: Gene retroposed back in the genome after being processed by the splicing apperatus. Thus it is fully spliced and has polyA tail. Insertion process flanks mRNA sequence with short direct repeats. Thus no promoters.. Unless is accidentally retroposed downstream of the promoter sequence. Do not confuse with single-exon genes.

Repeats • Each repeat family has many subfamilies. • - ALU: ~ 300nt long; 600,000 elements in human genome. can cause false homology with mRNA. Many have an Alu1 restriction site. • - Retroposons. ( can get copied back into genome) • - Telltale sign: Direct or inverted repeat flank the repeated element. That repeat was the priming site for the RNA that was inserted. • LINEs (Long INtersped Elements) • L1 1-7kb long, 50000 copies • Have two ORFs!!!!! Will cause problems for gene prediction programs. • SINEs (Short Intersped Elements)

Low-Complexity Elements • When analyzing sequences, one often rely on the fact that two stretches are similar to infer that they are homologous (and therefore related).. But sequences with repeated patterns will match without there being any philogenetic relation! • Sequences like ATATATACTTATATA which are mostly two letters are called low-complexity. • Triplet repeats (particularly CAG) have a tendency to make the replication machinery stutter.. So they are amplified. • The low-complexity sequence can also be hidden at the translated protein level.

Masking • To avoid finding spurious matches in alignment programs, you should always mask out the query sequence. • Before predicting genes it is a good idea to mask out repeats (at least those containing ORFs). • Before running blastn against a genomic record, you must mask out the repeats. • Most used Programs: • CENSOR: • Repeat Masker: • http://ftp.genome.washington.edu/cgi-bin/RepeatMasker

More Non-Protein genes • rRNA - ribosomal RNA • is one of the structural components of the ribosome. It has sequence complementarity to regions of the mRNA so that the ribosome knows where to bind to an mRNA it needs to make protein from. • snRNA - small nuclear RNA • is involved in the machinery that processes RNA's as they travel between the nucleus and the cytoplasm. • hnRNA – hetero-nuclear RNA. • small RNA involved in transcription.

Protein Processing & localization. • The protein as read off from the mRNA may not be in the final form that will be used in the cell. Some proteins contains • Signal Peptide (located at N-terminus (beginning)), this signal peptide is used to guide the protein out of the nucleus towards it´s final cellular localization. This signal peptide is cleaved-out at the cleavage site once the protein has reach (or is near) it´s final destination. • Various Post-Translational modifications (phosphorylation) • The final protein is called the “mature peptide”

Convention for nucleotides in database Because the mRNA is actually read off the minus strand of the DNA, the nucleotide sequence are always quoted on the minus strand. In bioinformatics the sequence format does NOT make a difference between Uracil and Thymine. There is no symbol for Uracil.. It is always represented by a ´T´ Even genomic sequence follows that convention. A gene on the ´plus´ strand is quoted so that it is in the same strand as it´s product mRNA.

Change DNA Sequence Change RNA Sequence Change Amino Acid Sequence mRNA Reading Direction Corresponds to Protein Chemical Directionality 3’ 5’ mRNA NH2-terminus COOH-terminus Protein Engineering

Backbone Torsion AnglesDetermine Secondary Structure

Protein Tertiary Structure Tied to Function BiomolecularEnergetics Electrostatic Interactions COO- +H3N Hydrophobic/van der Waals Interactions CH3 H3C OH N Hydrogen Bonding Interactions

Biology Information on the Internet

Biology Information on the Internet • Introduction to Databases • Searching the Internet for Biology Information. • General Search methods • Biology Web sites • Introduction to Genbank file format. • Introduction to Entrez and Pubmed • Ref: Chapters 1,2,5,6 of “Bioinformatics”

Spread-sheet Flat-file version of a database. • Databases: • A collection of Records. • Each record has many fields. • Each field contain specific information. • Each field has a data type. • E.g. money, currency,Text Field, Integer, date,address(text field) ,citation (text field) • Each record has a primary key. A UNIQUE identifier that unambiguously defines this record.

Gi = Genbank Identifier: Unique Key : Primary Key GI Changes with each update of the sequencerecord. Accession Number: Secondary key: Points to same locus and sequence despite sequence updates. Accession + Version Number equivalent to Gi

Relational Database (Normalizing a database for repeated sub-elements of a database.. Splitting it into smaller databases, relating the sub-databases to the first one using the primary key.)

Types of Relational databases. • The Internet can be though of as one enormous relational database. • The “links”/URL are the primary keys. • SQL (Standard Query Language) • Sybase; Oracle ; Access; (Databases systems) • Sybase used at NCBI. • SRS(One type of database querying system of use in Biology)

Indexed searches. • To allow easy searching of a database, make an index. • An index is a list of primary keys corresponding to a key in a given field (or to a collection of fields)

Indexed searches. • Boolean Query: Merging and Intersecting lists: • AND (in both lists) (e.g. human AND genome) • +human +genome • human && genome • OR (in either lists) (e.g. human OR genome) • human || genome

Search strategies • Search engines use complex strategies that go beyond Boolean queries. • Phrases matching: • human genome -> “human genome” • togetherness: documents with human close to genome are scored higher. • Term expansion & synomyms: • human -> homo sapiens • neigbours: • human genome-> genome projects, chromosomes,genetics • Frequency of links (www.google.com) • To avoid these term mapping, enclose your queries in quotes: “human” AND “genome”

Search strategies • Search engines use complex strategies that go beyond Boolean queries. • To avoid these term mapping, enclose your queries in quotes: “human” AND “genome” • To require that ALL the terms in your query be important, precede them with a “+” . This also prevents term mapping. • To force the order of the words to be important, group sentences within strings. “biology of mammals”.

Introduction to Biology: Cell Structure and Central Dogma

Introduction to Biology: Cell Structure and Central Dogma

Presentation Transcript

Introduction to Biology

Introduction to Biology

Introduction to Biology

Introduction to Biology

Introduction to biology

Introduction to Biology

Introduction to Biology

Introduction to Biology

Introduction to Biology

Introduction to biology

Introduction to Biology

Introduction to Biology

Introduction to biology

Introduction to Biology

Introduction to Biology!

Introduction to Biology

Introduction to Biology

Introduction to Biology

Introduction to Biology

Introduction to Biology

Introduction to Biology