Genome organization

Eukaryotic genomes are complex and DNA amounts and organization vary widely between species. Genome organization

Genome Organization G

C value paradox: The amount of DNA in the haploid cell of an organism is not related to its evolutionary complexity or number of genes.

Highly Repeated Sequences

There are different classes of eukaryotic DNA based on sequence complexity.

Amount of DNA in a Genome Does Not Correlate with Complexity 105 106 107 108 109 1010 1011 1012 basepairs

How many genes do humans have? Original estimate was between 50,000 to 100,000 genes We now think humans have ~ 20,000 genes How does this compare to other organisms? Mice have ~30,000 genes Pufferfish have ~35,000 Nematodes (C. elegans), have ~19,000 Yeast (S. cerevisiae) has ~6,000 The microbe responsible for tuberculosis has ~4,000

Single Copy SequencesExome

Even the Amount of DNA a Gene Spans Differs Among Species

Problems? Some gene products are RNA (tRNA, rRNA, others) instead of protein Some nucleic acid sequences that do not encode gene products (noncoding regions) are necessary for production of the gene product (protein or RNA). Eukaryotic genes are complex!

Gene Identification • Open reading frames • Sequence conservation • Database searches • Synteny • Sequence features • CpG islands • Evidence for transcription • ESTs, microarrays • Gene inactivation • Transformation, RNAi

Unique genes

Noncoding regions • Regulatory regions • RNA polymerase binding site • Transcription factor binding sites • Introns • Polyadenylation [poly(A)] sites

Splice Sites Eukaryotes only Removal of internal parts of the newly transcribed RNA. Takes place in the cell nucleus Splice sites difficult to predict

One gene, many proteins via alternative splicing , 3’ cleavage and polyadenlyation

Exon Shuffling

Trans-Splicing in Higher Eukaryotes Gingeras, Nature (2009) 461, 206-211

Non-contiguous Transcription Generates An Enormous Number of Possible Transcripts • Trans-splicing exists in higher eukaryotes as well as in lower ones like Trypanosomes Blue: only co-linear Red: all combinations Six 2-exon co-linear combinations from four exons 325 combinations of 3-exons, non-colinear • Reassortment of exons coding for ncRNA or protein domains could dramatically increase number of functional products beyond the number of ‘genes’ Gingeras, Nature (2009) 461, 206-211

Why genome size isn’t the only concern (size doesn’t matter?) • More sophisticated regulation of expression? • Proteome vastly larger than genome? • Alternate splicing • RNA editing • Postranslational modifications? • Cellular location? • Moonlighting

Gene families E.g. globins, actin, myosin Clustered or dispersed Pseudogenes

Pseudogenes Nonfunctional copies of genes Formed by duplication of ancestral gene, or reverse transcription (and integration) Not expressed due to mutations that produce a stop codon (nonsense or frameshift) or prevent mRNA processing, or due to lack of regulatory sequences

Duplicated genes Encode closely related (homologous) proteins Formed by duplication of an ancestral gene followed by mutation • Five functional genes and two pseudogenes

ParalogsvsOrthologs Different members of the globin gene family are paralogs, having evolved one from another through gene duplication. Paralogs are separated by a gene duplication event. Each specific gene family member (e.g. a specific gene in human) is an ortholog of the same family member in another species (e.g. mouse). Both evolved from an ancestral globingene. Orthologs are separated by a speciation event. It is not always easy to distinguish true orthologs from paralogs , especially in polyploid organisms!

Protein - coding sequences less than 1.5% of the genome in humans!

Noncoding RNAs (ncRNA) Do not have translated ORFs Small Not polyadenylated

Functions of Known lncRNAs • Transcriptional interference -lncRNA transcription turns off transcription of nearby gene • Initiation of chromatin remodeling - lncRNA transcription turns on transcription of nearby gene • Promoter inactivation - lncRNA binds to TFIIB and to promoter DNA • Activation of an accessory protein - lncRNA binds to allosteric effector protein TLS and inhibits histone acetyltransferase, decreasing transcription Ponting et al, Cell (2009) 136, 629-641

Functions of Known lncRNAs • Activation of transcription factors - binding of lncRNA to Dlx2 activates Dlx5/6 activity • Oligomerization of an accessory protein - lncRNA induces heat shock factor trimerization • Transport of transcription factors -lnRNA NRON keeps NFAT out of nucleus • Epigenetic silencing of gene clusters -Xist RNA inactivates X chromosome • Epigenetic repression of genes in trans -HOTAIR binds PRC2, leading to methylation and silencing of several genes in HOXD locus Ponting et al, Cell (2009) 136, 629-641

ncRNA • ~97-98% of the transcriptional output of the human genome is ncRNA • Introns • Transfer RNAs (tRNA) • ~ 500 tRNA genes in human genome • Ribosomal RNAs • Tandem arrays on several chromosomes • 150-200 copies of 28S – 5.8S – 18S cluster • 200-300 copies of 5S cluster

Genome Organization - ncRNA The level of transcription from human chromosomes 21 and 22 is an order of magnitude higher than can be accounted for by known or predicted exons Almost half of all transcripts from well-constructed mouse cDNA libraries are ncRNAs (identified because they do not code for an open reading frame of larger than 100 codons)

Repeat sequences – 50% or more of the genome

Repetitive DNA • Moderately repeated DNA • Tandemly repeated rRNA, tRNA and histone genes (gene products needed in high amounts) • Large duplicated gene families • Mobile DNA • Simple-sequence DNA • Tandemly repeated short sequences • Found in centromeres and telomeres (and others) • Used in DNA fingerprinting to identify individuals

Segmental duplications Found especially around centromeres and telomeres Often come from nonhomologous chromosomes Many can come from the same source Tend to be large (10 to 50 kb) Unique to humans?

Repetitive DNA - Segmental duplications

Mobile DNA • Moves within genomes • Most of the moderately repeated DNA sequences found throughout higher eukaryotic genomes • L1 LINE is ~5% of human DNA (~50,000 copies) • Alu is ~5% of human DNA (>500,000 copies) • Some encode enzymes that catalyze movement

Repetitive DNA – Highly repetitive satellite DNA

Genome organization