Chap. 6 Genes, Genomics, and Chromosomes (Part A)

Chap. 6 Genes, Genomics, and Chromosomes (Part A) • Topics • Eukaryotic Gene Structure • Chromosomal Organization of Genes and Noncoding DNA • Transposable (Mobile) DNA Elements • Goals • Learn how genes encoded by complex transcription units are expressed. • Learn the origin, types, and functions of DNA in higher organisms. • Learn the properties of transposons and their roles in gene evolution. RxFISH-painted human chromosomes.

Overview of Human Genes & Chromosomes Human diploid genomic DNA contains ~109 bp divided among 22 autosomes and 2 sex chromosomes. The longest autosome (#1) contains 280 x 106 bp. Only 1.5% of human DNA encodes proteins or functional RNA products. The expressed, coding segments of genes are called exons. Exons are highly conserved in sequence. Noncoding DNA consists of spacer DNA between genes and intron DNA within genes. Noncoding DNA is not strongly conserved and accounts for most of the variations in sequences between individual humans. As discussed later, DNA is highly condensed (overall ~105-fold in mitotic chromosomes) by protein-nucleic acid complexes called nucleosomes and other higher-order structures (Fig. 6.1).

Simple Transcription Units Eukaryotic genes are monocistronic in that only one protein is produced from a given mRNA. However, multiple forms of mRNAs, and therefore proteins, are produced from many genes. Simple gene transcription units produce only one type of mRNA and protein (Fig. 6.3a). Mutations at sites a & b often reduce or prevent transcription. Mutations at site c can change the amino acid sequence of the protein and interfere with its function. Mutations at site d affecting the selection of the exon 2/3 splice site can result in an abnormally spliced mRNA and nonfunctional protein.

Complex Transcription Units Complex gene transcription units produce several species of mRNAs, and thus proteins (Fig. 6.3b). The exon content of mRNAs and domain composition of proteins are varied by selection of alternative splice sites (Top), polyadenylation sites (Middle), and even promoter sites (Bottom). Site selection may vary in different cell types and during different stages of development. The effects of mutations (e.g., c & d) on the gene products synthesized from these transcription units will be discussed in class. About 60% of humans genes are contained in complex transcription units.

Alternative Splicing & Gene Regulation Protein domains can be encoded by a single exon or by a small collection of exons within a larger gene. The coding regions for domains can be spliced in or out of the primary transcript by the process of alternative splicing. The resulting mRNAs encode different forms of the protein, known as isoforms. Alternative splicing is an important method for regulation of gene expression in different tissues and different physiological states. It is estimated that 60% of all human genes are expressed as alternatively spliced mRNAs. Alternative splicing is illustrated in Fig. 4.16 for the fibronectin gene. The fibroblast and hepatocyte isoforms differ in their content of the EIIIA and EIIIB domains which mediate cell surface binding.Twenty different isoforms of fibronectin produced by alternative splicing have been identified.

Human Genomic DNA: Protein-coding Genes Genomic DNA of higher eukaryotes contains 4 main classes of DNA--1) protein-coding genes, 2) tandemly repeated genes, 3) repetitious DNA, and 4) unclassified spacer DNA (Table 6.1). Protein coding genes are grouped into the categories known as solitary genes, and duplicated or diverged genes belonging to gene families. In humans, roughly equal numbers of protein-coding genes occur in these two categories. Groups of homologous duplicated genes form gene and protein families, such as the ß-globin family. (25-30%)

The Human ß-globin Gene Family The ß-globin gene cluster on chromosome 11 is shown in Fig. 6.4a. The ß-globin genes are expressed in different stages of life. , Ag, and Gg are expressed during different trimesters of fetal development (next slide). ß expression begins around birth & continues throughout adult life. Fetal hemoglobin molecules made with the d and G or A polypeptides have a higher affinity for O2 than maternal hemoglobin, facilitating O2 transfer to the fetus. The 5 ß-globin genes are derived from an ancestral ß-globin gene via gene duplication. Over time, these genes accumulated adaptive mutations via sequence drift resulting in the specialized species of ß-globin proteins. Genomic DNA also contains nonfunctional DNA sequences called pseudogenes that are derived from gene duplication or reverse transcription and integration of cDNA sequences made from mRNA (covered below). ß-globin pseudogenes contain introns and thus were derived by gene duplication. Over time these genes became nonfunctional also due to sequence drift. Because they are not harmful, pseudogenes remain in the genome, marking a gene duplication event in an earlier ancestor.

Expression of Human Globin Genes

Exon and Gene Duplication from Unequal Crossing Over Fig. 6.2 illustrates how duplication of genes (e.g., the ß-globins) and exons can occur via unequal crossing over during meiosis and formation of gametes. Exon duplication results in proteins containing repeated domains (e.g., the EGF precursor, Fig. 3.11). In the examples shown, recombination is shown to occur between L1 retrotransposon sequences which are common in genomic DNA.

Modular Domain Structure of Proteins Domains are independently folding and functionally specialized tertiary structure units within a protein. The respective globular and fibrous structural domains of the hemagglutinin monomer (which happen to be individual polypeptide chains) are illustrated above in Fig. 3.10a. Domains (such as the EGF domain) also may be encoded within a single polypeptide chain, as illustrated in Fig. 3.11. Domains still perform their standard functions although fused together in a longer polypeptide (e.g., DNA binding and ATPase domains of a transcription factor). The modular domain structure of many proteins has resulted from the shuffling and splicing together of their coding sequences within longer genes. Epidermal growth factor (EGF) domain

Gene Density in Genomic DNA Higher eukaryotes contain far more noncoding DNA between genes than bacteria and simple eukaryotes (Fig. 6.4). The region of human genomic DNA containing the ß-globin gene cluster shown in the figure actually is a relatively "gene-rich" region of human DNA. Some regions known as gene-poor "deserts" also occur. Higher eukaryotes also contain a larger amount of intron DNA. Although one-third of human DNA is transcribed into pre-mRNA, 95% ends up being degraded after RNA splicing reactions. On average, the typical exon is 50-200 bp in length, while the median length of introns is 3.3 kb in human genes.

Human Genomic DNA: Tandemly Repeated Genes Tandemly repeated genes also are derived by gene duplication. Unlike gene families, the sequences of these duplicated genes are identical or strongly conserved. In addition, they commonly are arranged in a head-to-tail fashion in tandem arrays over a long stretch of DNA. rRNAs and snRNAs (used in splicing reactions, Chap. 8) are representative of this group (Table 6.1). Multiple copies of these genes are needed due to the requirement for vast amounts of these RNAs in the cell. tRNA and histone genes are included in this category, but these genes typically occur in clusters and not true tandem arrays.

Nonprotein-coding Genes in Human Genomic DNA Thousands of genes in the human genome encode functional RNAs (Table 6.2). The functions of several of these are covered in later chapters.

Repetitious DNA Two main categories of repetitious DNA--simple-sequence DNA and interspersed repeats--occur in eukaryotic genomes (Table 6.1). Interspersed repeats are more common and are derived largely from transposons. Simple-sequence DNA is less prevalent, accounting for ~ 6% of human genomic DNA. Simple-sequence DNA is also known as satellite DNA, due to its formation of satellite bands during cesium chloride density gradient ultracentrifugation. The function of this DNA is mostly obscure. It is commonly found at the centromere and telomere regions of chromosomes. (25-30%)

Properties of Satellite DNA Satellite DNA is classified into 3 types based on length. True satellite DNA consists of 14-500 bp sequence units that tandemly repeat over 20-100 kb lengths of genomic DNA. Minisatellite DNA consists of 15-100 bp sequence units that tandemly repeat over 1-5 kb stretches of DNA. Microsatellite DNA consists of 1-13 bp units that can repeat up to 150 times. Microsatellite DNA is thought to originate from “backward slippage” of a growing daughter strand on its template strand during DNA replication (Fig. 6.5).The sequences of repeat units are highly conserved which suggests they perform important functions. Each category of satellite DNA contains a number of different repeat sequences. Simple-sequence DNAs can serve as DNA markers due to variations in repeat number. Satellite DNAs are exploited in FISH (fluorescence in situ hybridization) chromosome staining (Fig. 6.6).

DNA Fingerprinting DNA fingerprinting is a method for identifying individuals based on their minisatellite DNA (Fig. 6.7). It was developed in the mid-80s and is widely used in forensics, paternity analysis, and for research purposes. In the method, minisatellite DNA from a genomic DNA specimen is amplified by PCR using primers that bind to unique sequences flanking minisatellite repeat units. Bands corresponding to each minisatellite locus then are separated on gels. Although satellite DNA is highly conserved in sequence, the number of tandem copies at each loci is highly variable between individuals. This results from unequal crossing over during formation of gametes in meiosis. Due to the variation in the number of repeats at each locus, different individuals can be readily distinguished based on banding patterns.

Chap. 6 Problem 3 Satellite DNA is classified into 3 categories based on length. Satellite DNA consists of 14-500 bp sequence units that tandemly repeat over 20-100 kb lengths of genomic DNA. Minisatellite DNA consists of 15-100 bp sequence units that tandemly repeat over 1-5 kb stretches of DNA. Microsatellite DNA consists of 1-13 bp units that can repeat up to 150 times. Although the sequences of satellite DNA are highly conserved, the number of tandem copies at each locus is highly variable between individuals. This originates due to unequal crossing over during formation of gametes in meiosis (Upper figure). DNA fingerprinting is a method for identifying individuals based on variations in minisatellite DNA (Fig. 6.7). In the method, minisatellite DNA is amplified by PCR using unique primers flanking repeat regions, and the collection of fragments is run on a gel. Due to the variation in the number of repeats at different loci, different individuals can be readily distinguished.

Interspersed Repeats Interspersed repeat DNA comprises the largest fraction of repetitious DNA in eukaryotic genomes. This DNA, which is also called moderately repeated DNA makes up ~45% of human genomic DNA. Interspersed repeat DNA is composed of partial and complete transposon sequences or "mobile DNA". Mobile DNAs were discovered by Barbara McClintock in the 1940s. These sequences move by "transposition". Transpositions in germ line cells are inheritable and occur at a rate of one transposition per 8 individuals. In somatic cells they can cause somatic cell mutations. Mobile DNA has been very important in genome evolution. (25-30%)

Mobile DNA Elements Mobile DNA elements are grouped into two classes, DNA transposons and retrotransposons (Fig. 6.8). DNA transposons move directly as DNA via a "cut-and-paste" mechanism. Retrotransposons move via an RNA intermediate and a "copy-and-paste" mechanism, wherein the original copy of the transposon is preserved. Retroviruses, like HIV, formally are a subclass of retrotransposons that can move between cells because they encode viral coat proteins. DNA transposons predominate in bacteria; retrotransposons are more prevalent in eukaryotes.

Mobile DNA in Prokaryotes Bacteria contain DNA transposons called insertion sequences (Fig. 6.9). IS elements are 1-2 kb DNAs that transpose within the bacterial genome to random locations. Transposition ("jumping") is mediated by an encoded transposase protein. Insertion usually causes gene inactivation and is harmful. Nonetheless, E. coli encodes ~20 types of IS elements. They are tolerated in part due to their low transposition rate (1 in 105 - 107 cells per generation). This rate is set by the low rate of transcription of the transposase gene. IS elements contain inverted repeat sequences of ~50 bp at each end of the protein-coding region that are crucial for transposition.

Mechanism of IS Element Transposition Transposition occurs in 3 main steps, as summarized in Fig. 6.10. The excision of the IS element and its cutting-and-pasting into the target sequence is mediated by the transposase (Steps 1 & 2). The single-stranded DNA regions remaining at the insertion site after transposase action are filled-in and the nicks sealed by cellular DNA polymerase and DNA ligase (Step 3). All transposases we will cover produce staggered cuts at their target sites. This leads to production of short direct repeat sequences immediately flanking the sites of insertion. Eukaryotic DNA transposons jump in genomic DNA by a similar mechanism.

Mechanism of DNA Transposon Copy Number Increase About 3 x 105 copies of full-length and truncated DNA transposons occur in human genomic DNA (3% of DNA). Although DNA transposons move via a cut-and-paste mechanism, their copy number in the genome will increase if they transpose during DNA synthesis preceding the first meiotic division of gametogenesis (Fig. 6.11).

LTR Retrotransposons Eukaryotic retrotransposons fall into two major groups--LTR retrotransposons and non-LTR retrotransposons. Together, these sequences account for 42% of human genomic DNA. LTRs stand for long direct terminal repeats. LTRs consist of 250-600 bp direct repeat sequences located at the ends of the retrotransposon coding region (Fig. 6.12). LTR retrotransposons share many features with retroviruses. They both encode LTRs, reverse transcriptase, and DNA integrase. However, LTR retrotransposons lack coat proteins that allow retroviruses to move between cells. Transposition occurs via an RNA intermediate that is transcribed from a promoter in the left LTR (Fig. 6.13). The primary transcript is polyadenylated, forming the retroviral genomic RNA.

Retroviral & LTR-retrotransposon DNA Synthesis The mechanism by which retroviral and LTR retrotransposon DNA is synthesized prior to integration into genomic DNA is shown in Fig. 6.14. DNA integrase inserts the completed retroviral DNA into genomic DNA via a mechanism similar to that described for bacterial IS elements. Namely, a short direct repeat is produced at each end of the integrated DNA. On the order of 4.4 x 105 LTR retrotransposon sequences occur in human DNA. Most of these are non-functional due to recombination between LTR sequences and deletion of the intervening DNA.

Non-LTR Retrotransposons Even more abundant in human genomic DNA are non-LTR retrotransposon sequences. There are two main classes of non-LTR retrotransposons, known as long interspersed elements (LINEs, ~6 kb), and short interspersed elements (SINEs, ~300 bp). LINEs encode a reverse transcriptase (ORF2) needed for transposition (Fig. 6.16), whereas SINEs do not. Instead SINEs are thought to rely on LINE-encoded enzymes for transposition. LINEs are grouped into L1, L2, and L3 families, of which only L1 is active today. LINE sequences occur at ~9 x 105 copies per human genome. SINEs occur at ~1.6 x 106 copies. The most abundant SINE is the Alu element, which is named based on the fact that it encodes an AluI restriction site. Alu elements were important for gene duplications at the ß-globin locus (Figs. 6.4). poly(A) site promoter site

L/SINE Transposition (I) The mechanism of LINE (and SINE) transposition is illustrated in Fig. 6.17. In summary, LINE primary transcripts are translated into the ORF1 and ORF2 gene products in the cytosol. The RNA then returns to the nucleus with the ORF1 & 2 proteins. These enzymes catalyze reverse transcription and integration of the LINE element at AT-rich regions of genomic DNA. The poly(A) tail of the LINE RNA is used for selection of integration sites. SINE element retrotransposition is thought to be mediated by the ORF1 & 2 proteins encoded by LINEs. 1 Nicking

L/SINE Transposition (II) Many LINEs are truncated at the 5' end due to incomplete reverse transcription of the LINE RNA. For this reason, and sequence drift, only 0.01% of LINE elements are functional today (~100 per genome). It is further thought that LINE & SINE transpositions occur at a rate of ~1 in 8 individuals in the population. LINE transpositions have been implicated in human disease. About 1/600 mutations causing disease can be traced to LINE transposition. However, LINE & SINE transpositions have been crucial in the evolution of the human genome, as discussed in the remaining slides. Lastly, the ORF1 and 2 LINE proteins are thought to be responsible for insertion of processed pseudogenes into genomic DNA.

Exon Shuffling via Recombination Between Homologous Interspersed Repeats We previously have noted that gene evolution has involved exon shuffling between protein-coding genes in the genome. A large amount of shuffling has occurred due to the prevalence of interspersed repeats in the genome. Due to sequence conservation within these regions, crossover events can take place at these sites (Fig. 6.18). This results in exon shuffling between nonhomologous genes and the formation of new genes with new combinations of protein domains. As illustrated in Fig. 6.2, such events also have been important in exon and gene duplications.

Exon Shuffling via Transposition Exon shuffling can also occur via cut-and-paste transpositions mediated by DNA transposons. The mechanism by which this occurs is illustrated in Fig. 6.19a. It requires that two copies of the transposon flank the target exon. Both DNA transposons and the exon will move as one piece of DNA if the transposase happens to cleave DNA at the left inverted repeat of the upstream transposon and at the right inverted repeat of the downstream transposon. Gene 1 ends up losing the exon, and Gene 2 acquires the exon

Exon Shuffling via Transposition Exons can move along with a LINE element when it transposes via its copy-and-paste mechanism (Fig. 6.19b). When a LINE element has a weak poly(A) signal, RNA polymerase II continues to transcribe downstream, potentially through an exon. If this exon has a strong poly(A) signal, then transcription stops and the RNA is polyadenylated. Then following the mechanism in Fig. 6.17, DNA encoding the exon and the LINE element can be incorporated into another gene. The spliced mRNA produced from the acceptor gene may contain the newly introduced exon. Exon shuffling is supported by experimental evidence and the enormous amount of interspersed repeat DNA in genomes. Over billions of years, it has played a major role in evolution of genomes.

Chap. 6 Genes, Genomics, and Chromosomes (Part A)