Chapter 9: Organization of the Human Genome
• General organization of the human genome:
• Nuclear genome:
- 3200 Mb, ~30,000 genes. 4.5% is highly conserved, comprising 1.5% coding DNA and 3% conserved untranslated and regulatory sequences. 90%-95% of the coding DNA is protein-coding, while the remaining 5%-10% is untranslated (RNA genes).
- Coding sequences often belong to families of related sequences generated by gene duplication, a process that has also produced pseudogenes and gene fragments.
- The remaining 95.5% of the genome is poorly conserved noncoding DNA, much of it made up of tandem repeats (head to tail) or of dispersed repeats resulting from retrotransposition of RNA transcripts.
Mitochondrial genome:
- 16,569 bp, 37 genes, 44% G+C. The heavy (H) strand is rich in G and the light (L) strand is rich in C. A small section of the genome is triple-stranded because a short segment (7S DNA) is synthesized repeatedly.
• Human cells vary in the number of mtDNA molecules (typically thousands of copies per cell).
• Sperm do not contribute mtDNA to the zygote, so inheritance is strictly maternal. During mitosis, mitochondria are passed on to daughter cells by random assortment.
mtDNA contains 37 genes: 28 use the H strand (rich in G) as their sense strand and 9 use the L strand (rich in C).
• Of the 37 mt genes, 22 are tRNA genes, 2 are rRNA genes (12S rRNA and 16S rRNA), and 13 encode polypeptides (oxidative phosphorylation).
• Because mtDNA encodes only 13 proteins, its genetic code has been able to drift from the universal genetic code.
• 93% of mtDNA is coding. All genes lack introns, some coding sequences overlap, and some lack stop codons (these are added post-transcriptionally).
• Replication of the H strand starts in the D loop and proceeds unidirectionally. About two-thirds of the way around the mtDNA, replication of the L strand begins from a second origin and proceeds in the opposite direction.
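The drift of the mitochondrial genetic code can be illustrated with a short translation sketch. The slide does not list the codon changes; the four reassignments used here (UGA read as Trp, AUA as Met, AGA and AGG as stop) are the commonly cited vertebrate mitochondrial differences. The function name and the test sequence are hypothetical and for illustration only.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (third base varies fastest).
BASES = "TCAG"
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}

# Assumed vertebrate mitochondrial reassignments (not taken from the slide).
MITO = dict(STANDARD)
MITO.update({"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"})


def translate(dna, table):
    """Translate a DNA coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = table[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical sequence: ATG ATA TGA AGA TTT
seq = "ATGATATGAAGATTT"
print(translate(seq, STANDARD))  # MI  (TGA is read as stop)
print(translate(seq, MITO))      # MMW (ATA = Met, TGA = Trp, AGA = stop)
```

The same reading frame gives different products under the two tables, which is why mitochondrial genes cannot simply be translated with the universal code.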
The human nuclear genome consists of 24 different DNA molecules, one per chromosome (22 autosomes, X and Y).
- Chromosome content: DNA, RNA, histones, and non-histone proteins.
- Divided into a euchromatic portion (3000 Mb), which was the target of the Human Genome Project, and constitutive heterochromatin (200 Mb), which is transcriptionally inactive. Constitutive heterochromatin is found at centromeres, on the long arm of Y, on the short arms of the acrocentric chromosomes (13, 14, 15, 21, & 22), and at the secondary constrictions on the long arms of chromosomes 1, 9, & 16.
- Base composition: average GC = 41%, variable by chromosome. Giemsa bands: dark bands have low GC (37%); light bands have high GC (45%).
- CpG dinucleotides: why are they depleted from vertebrate DNA?
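Base composition and CpG depletion can both be measured directly from sequence. The sketch below computes the GC fraction and the observed/expected CpG ratio, where the expected count assumes C and G are placed independently. A ratio well below 1 indicates depletion, which a later slide attributes to methylation of CpG followed by deamination. The function names and toy sequences are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected from C and G frequencies.

    expected = (#C * #G) / length, so the ratio is #CG * length / (#C * #G).
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


# Two toy sequences with the same GC fraction but very different CpG content.
cpg_rich = "CGCGATAT"
cpg_poor = "CAGTCTGA"
for s in (cpg_rich, cpg_poor):
    print(s, gc_fraction(s), cpg_obs_exp(s))
```

The two toy sequences share a GC fraction of 0.5, which shows that CpG depletion is a property of dinucleotide order rather than of base composition alone.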
Human gene number:
- 30,000-35,000 genes, i.e. roughly 1,400 genes per chromosome on average.
- Most are polypeptide-coding, but 5%-10% encode untranslated RNA.
- C. elegans (a 1 mm long worm) has 959 somatic cells and a genome 1/30 the size of the human genome, yet it contains 19,099 protein-coding genes & >1000 RNA-coding genes. Therefore, genome complexity (as measured by gene number) does not parallel biological complexity.
• Human gene distribution:
- Mapped by hybridizing CpG islands to metaphase chromosomes. The results showed that gene density is high in subtelomeric regions and that some chromosomes (19 & 22) are gene-rich while others (X & 18) are gene-poor.
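The averages implied by these figures are easy to check. The sketch below takes the 3200 Mb genome size and the 24 chromosomes quoted on earlier slides and derives genes per chromosome and average spacing between genes. The function names are hypothetical, and the results are simple averages that ignore the uneven gene distribution described above.

```python
GENOME_MB = 3200   # nuclear genome size quoted earlier in the chapter
CHROMOSOMES = 24   # 22 autosomes plus X and Y


def genes_per_chromosome(gene_count):
    """Average number of genes per chromosome."""
    return gene_count / CHROMOSOMES


def kb_per_gene(gene_count):
    """Average amount of genomic DNA per gene, in kb."""
    return GENOME_MB * 1000 / gene_count


for n in (30_000, 35_000):
    print(n, round(genes_per_chromosome(n)), round(kb_per_gene(n)))
```

For 30,000-35,000 genes this gives about 1,250-1,460 genes per chromosome and roughly one gene per 90-110 kb.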
2. Organization, distribution & function of human RNA genes
• Nuclear genome: about 3000 RNA (non-coding) genes.
• mt genome: 24 out of 37 genes are RNA-coding.
- Types of RNA genes (Fig. 9.4 & Table 9.3):
- rRNA & tRNA: involved in translation.
- Other types are involved in RNA maturation (cleavage & base-specific modification of other RNA molecules such as mRNA, tRNA, and rRNA).
rRNA:
- 700-800 rRNA genes in tandem repeat clusters, plus many related pseudogenes.
- 12S & 16S rRNA in mitochondria.
- 4 types of cytoplasmic rRNA: 3 are associated with the large ribosomal subunit (28S, 5.8S & 5S rRNA) & one with the small subunit (18S rRNA).
- 28S, 5.8S & 18S rRNA are encoded by a single transcription unit, organized in 5 clusters, each with 30-40 tandem repeats, located on the short arms of chromosomes 13, 14, 15, 21, and 22.
- About 200-300 5S rRNA genes occur in tandem arrays (other copies are pseudogenes).
tRNA:
• Nuclear: 497 cytoplasmic tRNA genes & 324 tRNA pseudogenes.
• Mitochondrial: 22.
• Humans have fewer tRNA genes than C. elegans (although more than Drosophila). Therefore, organismal complexity is not related to tRNA gene number.
• The 497 tRNA genes are grouped into 49 families according to anticodon specificity.
Small nuclear RNA (snRNA genes):
• Encoded by families totalling close to 100 genes.
• Many are uridine-rich.
• Many are spliceosomal RNAs, required for the functioning of spliceosomes.
Small nucleolar RNA (snoRNA genes):
• snoRNAs are employed in the nucleolus to guide site-specific base modifications in rRNA and snRNA.
• Two subfamilies:
- C/D box snoRNAs guide site-specific 2'-O-ribose methylation in rRNA.
- H/ACA snoRNAs guide site-specific pseudouridylation (conversion of uridine to pseudouridine) in rRNA.
MicroRNA (miRNA):
- About 22 nt long, derived from a 70 nt precursor containing an inverted repeat. The repeat permits formation of a double-stranded RNA hairpin, which is cleaved by a ribonuclease III known as Dicer.
• Function as antisense regulators of other genes by binding to complementary sequences in the 3' UTR of target mRNAs, inhibiting translation.
• miRNAs are developmentally regulated & they themselves control developmental programs.
Genes encoding moderate- to large-sized regulatory RNA molecules:
• 7SK RNA: a non-coding RNA that negatively regulates transcriptional elongation by RNA polymerase II.
• SRA1 RNA (steroid receptor RNA activator): a co-activator of several steroid receptors.
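The antisense binding of a miRNA to a 3' UTR amounts to searching the UTR for the reverse complement of part of the miRNA. The sketch below is a deliberately simplified model: it requires perfect Watson-Crick pairing over a fixed window at the miRNA's 5' end. The window length, function names and sequences are all illustrative assumptions, not details from the slide.

```python
# RNA complement table (A-U, G-C pairing only; no G-U wobble in this sketch).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence."""
    return rna.upper().translate(COMPLEMENT)[::-1]


def find_target_sites(mirna, utr, window=7):
    """Return 0-based UTR positions that pair perfectly with the first
    `window` nucleotides of the miRNA (an assumed, simplified criterion)."""
    site = reverse_complement(mirna[:window])
    utr = utr.upper()
    hits = []
    start = utr.find(site)
    while start != -1:
        hits.append(start)
        start = utr.find(site, start + 1)
    return hits


# Hypothetical 22 nt miRNA and a toy 3' UTR carrying two matching sites.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
utr = "AAGCUACCUCAGGGUACCUCAUU"
print(find_target_sites(mirna, utr))  # [4, 14]
```

Real miRNA-target pairing is usually partial, so this exact-match search is a teaching aid rather than a target predictor.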
3. Organization, distribution & function of human polypeptide-encoding genes (Fig. 9.7)
• Human genes show enormous variation in size and internal organization. E.g. the dystrophin gene spans 2.4 Mb and takes about 16 hours to transcribe.
- Diversity in exon-intron organization: a very small number of genes lack introns. For intron-containing genes, there is an inverse correlation between gene size and the fraction of coding DNA.
- Diversity in repetitive DNA content: genes have repetitive DNA in introns, in flanking sequences, and to different extents in coding sequences (see Table 9.7).
Functionally similar genes are occasionally clustered in the human genome, but are more often dispersed over different chromosomes.
- Functionally identical genes: often the products of recent gene duplication, e.g. the α-globin genes. Very occasionally, genes on different chromosomes encode identical polypeptides. Examples:
- Histone genes: a total of 86 genes distributed over 10 chromosomes, albeit with 2 large clusters on the short arm of chromosome 6.
- Ubiquitin genes: encode a highly conserved 76-amino-acid ubiquitin involved in protein degradation and the cellular stress response. The genes are distributed over several chromosomes. Some consist of tandem repeats of the full coding sequence that are co-transcribed as a polycistronic transcription unit; others occur as monomers.
- Functionally similar genes: closely related but not identical in sequence. The genes are clustered and have arisen by tandem gene duplication, e.g. the α-globin and β-globin gene clusters (see Fig. 9.11).
- Functionally related genes: encode products that may not be closely related in sequence but are functionally related, e.g. subunits of the same protein or components of the same metabolic or developmental pathway. The genes are not clustered and are found on different chromosomes (Table 9.8).
Overlapping genes, genes-within-genes and polycistronic transcription units are occasionally found in the human genome:
- Bidirectional gene organization: on average there is one gene per 100 kb of the human nuclear genome. Occasionally, however, neighboring genes have their 5' ends separated by only a few hundred nucleotides and are transcribed from opposite strands, e.g. DNA repair genes.
- Partially overlapping genes: for example, the class III region of the HLA complex at 6p21.3 has an average gene density of about one gene per 15 kb and contains several examples of overlapping genes (Fig. 9.8A).
- Genes-within-genes: within the NF1 (neurofibromatosis type I) gene there are three small internal genes transcribed from the opposite strand (Fig. 9.8B).
- Polycistronic transcription units: examples are the human mt genome and the major rRNA gene clusters.
Polypeptide-encoding gene families can be classified according to the degree and extent of sequence relatedness in family members
- Classical gene families: high degree of sequence homology over most of the gene length, or at least over the coding sequences, e.g. histone gene families and the α- and β-globin gene families.
- Gene families encoding products with large, highly conserved domains: see Table 9.9 for examples of human genes with sequence motifs that encode highly conserved domains.
- Gene families encoding products with short conserved amino acid motifs: members of some gene families may not be related at the DNA sequence level but encode polypeptides that have a common general function and contain very short conserved sequence motifs, such as the DEAD (Asp-Glu-Ala-Asp) and WD (tryptophan-aspartate) motifs (Fig. 9.9).
- Gene superfamilies: members of a superfamily are more distantly related (no significant conserved amino acid motifs) than those in a classical or conserved-motif gene family, yet they share general structural features and a generally related function. Examples include:
- The immunoglobulin superfamily: includes the immunoglobulin (Ig) genes, T-cell receptor genes, and HLA genes (Fig. 9.10).
- The globin superfamily: in addition to the α- and β-globin gene families, which are involved in oxygen transport, it includes the genes encoding the muscle and brain globins, myoglobin and neuroglobin (Fig. 9.11).
- The G protein-coupled receptor superfamily: a large and diverse family of receptors that mediate ligand-induced signaling between the extracellular and intracellular environments via interaction with intracellular G proteins. These receptors have low sequence similarity to each other but share a common structure of seven α-helical transmembrane segments.
Pseudogenes, truncated gene copies and gene fragments are commonly found in multigene families
Families of polypeptide-encoding genes or RNA genes often contain defective copies: pseudogenes (defective copies of the full coding sequence), truncated copies (lacking the 5' or 3' end), and internal fragments (e.g. a single exon). Examples of defective gene copies in different gene families:
- Nonprocessed pseudogenes in a gene cluster: these are copied at the level of genomic DNA by tandem gene duplication. They contain all the elements of a gene but have inappropriate termination codons in exons, e.g. the pseudogenes in the α- and β-globin gene clusters (Fig. 9.11).
- Truncated genes and internal gene fragments in a gene cluster: the class I HLA gene family at 6p21.3 contains such defective gene copies. The number of class I HLA genes varies between different copies of chromosome 6. Analysis of one copy identified 17 family members clustered over 2 Mb: 6 expressed genes, 4 conventional full-length pseudogenes, 5 truncated gene copies, and 2 internal gene fragments. The family originated by tandem gene duplication, and the fragmented gene copies arose by unequal crossover or unequal sister chromatid exchange (Fig. 9.12).
- Nonprocessed pseudogenes in a dispersed gene family: examples exist for the NF1 (neurofibromatosis type I) gene, whose defective copies are located close to centromeres (pericentromeric), and the PKD1 (adult polycystic kidney disease) gene, which is located close to a telomere (subtelomeric). Human pericentromeric regions contain sequences that have been copied recently during evolution and are located on several chromosomes. Subtelomeric regions are also unstable and prone to duplication (Fig. 9.13).
- Processed pseudogenes in a dispersed polypeptide-encoding gene family: interspersed gene families often include defective gene copies that contain the exons but no introns and carry an oligo(dA)/(dT) sequence at one end. They originate by retrotransposition: an mRNA is copied by reverse transcriptase, and the cDNA is integrated into chromosomal DNA using the LINE1 transposition machinery (Fig. 9.14). Processed pseudogenes are typically not expressed because they lack a promoter. Sometimes, however, a copy integrates next to a promoter and is expressed, and selection pressure may then ensure its continued expression. Such expressed copies are known as retrogenes.
4. Tandemly repeated noncoding DNA
Occurs in arrays (or blocks) of tandem repeats. The repeat unit may be simple (1-10 nucleotides) or moderately complex (tens to hundreds of nucleotides). Arrays occur at a few or many chromosomal locations.
• Satellite DNA:
- Very large arrays of tandemly repeated DNA that are not transcribed.
- Each repeat may be a simple sequence or a moderately complex one (Table 9.14).
- Makes up most of the heterochromatin, is found in the vicinity of the centromeres (pericentromeric heterochromatin), and is transcriptionally inactive.
- Satellites I, II, and III are short repeats with a base composition different from that of the rest of the genome.
- Alpha satellite (alphoid DNA) consists of a 171-bp repeat and makes up the bulk of the centromeric heterochromatin.
- Centromeric DNA largely consists of various families of satellite DNA (Fig. 9.16).
- Alpha satellite is present on all chromosomes and contains a binding site for a specific centromere protein, CENP-B.
- Alpha satellite plays an important role in centromere function.
• Minisatellite DNA:
- Moderately sized arrays of tandemly repeated DNA sequences, dispersed over a considerable portion of the nuclear genome (Table 9.14).
- Not normally transcribed.
- Hypervariable minisatellite DNA: highly polymorphic and organized in over 1000 arrays (from 0.1 to 20 kb long) of short tandem repeats. The common core repeat is GGGCAGGAXG (X = any nucleotide). Found mainly near telomeres but also at other locations. Very few are transcribed. Believed to serve as hot spots for homologous recombination and used in DNA fingerprinting, where the core sequence is used as a probe that hybridizes to multiple loci on different chromosomes, generating a complex individual-specific pattern.
- Telomeric DNA: 3-20 kb of tandem hexanucleotide repeats (TTAGGG). Maintained by telomerase, it protects the ends of chromosomes from degradation and loss.
• Microsatellite DNA or simple sequence repeats (SSR):
- Small arrays of tandem repeats of a simple sequence (repeat unit usually <10 bp).
- Interspersed throughout the genome, accounting for >60 Mb (2% of the genome).
- Originated by replication slippage.
- Arrays of dinucleotide repeats are the most common & account for 0.5% of the genome.
- CA/TG repeats are very common (1 per 36 kb) and highly polymorphic.
- AT/TA (1 per 50 kb) & AG/CT (1 per 125 kb) repeats are also common.
- CG/GC repeats are very rare (1 per 10 Mb) because CpG dinucleotides are prone to methylation and subsequent deamination.
- Mononucleotide repeats of A and of T are very common, but those of G or of C are not.
- Trinucleotide and tetranucleotide repeats are rare but highly polymorphic and are used to develop highly polymorphic markers.
- Function: not well known, but CA/TG repeats can adopt an altered DNA conformation (Z-DNA) in vitro.
- Generally found in intergenic DNA or within introns. However, some are found within coding sequences and act as mutation hot spots because they are prone to replication slippage and unstable expansion.
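Because a microsatellite is simply a short unit repeated head to tail, candidate arrays can be found with a regular-expression backreference. The sketch below reports runs of a repeat unit of a given length. The function name, the default minimum of five repeats and the test sequence are illustrative assumptions, not thresholds from the slide.

```python
import re


def find_microsatellites(seq, unit_len=2, min_repeats=5):
    """Return (start, unit, repeat_count) for each tandem array of a
    `unit_len`-nucleotide unit repeated at least `min_repeats` times.

    Runs of a single base (e.g. AAAA...) are skipped when unit_len > 1,
    since they are mononucleotide repeats rather than true longer units.
    """
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        if unit_len > 1 and len(set(unit)) == 1:
            continue
        hits.append((m.start(), unit, len(m.group(0)) // unit_len))
    return hits


# Toy sequence: a (CA)8 array, plus an (AT)3 run that is too short to report.
seq = "GGT" + "CA" * 8 + "TTG" + "AT" * 3 + "CC"
print(find_microsatellites(seq))  # [(3, 'CA', 8)]
```

Comparing the repeat count at the same locus between individuals is what makes such arrays useful as polymorphic markers.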
Interspersed repetitive noncoding DNA
- Transposon-derived repeats make up >40% of the human genome and mostly arose through RNA intermediates. The elements they derive from are known as transposable elements (transposons).
- There are four classes of transposon-derived repeat, but only a very small number of elements are still actively transposing. They fall into two groups according to the method of transposition:
1. Retrotransposons (retroposons): transpose via RNA transcripts and reverse transcriptase (replicative transposition). Three types: long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), and retrovirus-like elements containing long terminal repeats.
2. DNA transposons: migrate by conservative transposition. The sequence is excised and re-inserted elsewhere in the genome.
- Transposable elements can be autonomous or non-autonomous (see Fig. 9.17).
- LINEs and SINEs predominate.
• Human LTR transposons:
- Include autonomous and non-autonomous retrovirus-like elements flanked by long terminal repeats (LTRs) containing the necessary transcriptional elements.
- The autonomous members are called endogenous retroviral sequences (ERV) and contain gag and pol genes, which encode protease, reverse transcriptase, RNase H and integrase. Many of these elements are defective, and transposition has been extremely rare in the last several million years.
- Non-autonomous retroviral elements lack the pol gene and often the gag gene as well. See Table 9.15 and Fig. 9.17.
• Human DNA transposon fossils:
- DNA transposons have terminal inverted repeats and encode a transposase, which controls transposition.
- No human DNA transposon family is still active, so all are transposon fossils (see Fig. 9.17).
Some human LINE1 elements are actively transposing and enable transposition of SINEs, processed pseudogenes and retrogenes:
• LINEs:
- Very successful, with a long evolutionary history. Encode a reverse transcriptase that drives their transposition.
- 3 families: LINE1, LINE2, and LINE3. Collectively they constitute 20% of the genome.
- Located in euchromatic regions, preferentially in the dark, AT-rich G bands of metaphase chromosomes.
- LINE1 is the only family that is still actively transposing and is predominant (17% of the genome).
- Full-length LINE1 (L1) is 6.1 kb long and encodes an RNA-binding protein and a protein with both endonuclease and reverse transcriptase activity (Fig. 9.18A). Unusually, the promoter is internal, located within the 5' UTR, so copies carry their own promoter when they transpose.
- Of the 6000 or so full-length LINE1 sequences, about 60-100 are still capable of transposing, and they can cause disease by disrupting gene function on insertion.
Alu repeats occur more than once every 3 kb in the human genome and may be subject to positive selection
• SINEs:
- 100-400 bp long; have colonized mammalian genomes.
- Some SINEs are primate-specific, such as the Alu family.
- Other SINEs, such as MIR (mammalian-wide interspersed repeat), are found more widely across mammals, including marsupials.
- SINEs do not encode any proteins and are not autonomous.
- LINEs & SINEs share sequences at their 3' ends, and SINEs have been shown to be mobilized by neighboring partner LINEs, which has contributed to the wide spread of SINEs.
- Mammalian SINEs originated from copies of tRNA or, as in Alu repeats, SRP (7SL) RNA. tRNA genes, like the 7SL RNA gene, are transcribed by RNA polymerase III and are unusual in having internal promoters. However, this internal promoter is not sufficient for active transcription in vivo; appropriate flanking sequences are also required. A newly transposed Alu will therefore be inactive unless it lands in a region that enables its promoter to be active (see Fig. 9.18).
- The full-length Alu repeat is about 280 bp long and consists of two tandem repeats, each about 120 bp, followed by a short sequence rich in A on one strand and T on the complementary strand. One of the repeats contains an internal 32 bp sequence that is lacking in the other (Fig. 9.18B).
• Alu repeats have a high GC content and, although dispersed throughout the euchromatic regions of the genome, are preferentially located in GC-rich, gene-rich R chromosome bands.
• Within genes, Alu repeats are confined to introns and untranslated regions.
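The Alu frequency quoted above can be turned into a rough copy-number estimate. The sketch below combines the 3200 Mb genome size, the "once every 3 kb" spacing and the 280 bp full-length size given on these slides. It assumes, for illustration only, that every copy is full length, so the genome fraction is an upper-bound style estimate rather than a measured value. The function names are hypothetical.

```python
GENOME_BP = 3200 * 10**6  # 3200 Mb nuclear genome
ALU_SPACING_BP = 3000     # "more than once every 3 kb", taken as exactly once
ALU_LENGTH_BP = 280       # full-length Alu repeat


def min_alu_copies():
    """Lower bound on Alu copy number implied by one copy per 3 kb."""
    return GENOME_BP / ALU_SPACING_BP


def alu_genome_fraction():
    """Fraction of the genome occupied if every counted copy were full length."""
    return min_alu_copies() * ALU_LENGTH_BP / GENOME_BP


print(round(min_alu_copies()))           # about 1.07 million copies
print(round(alu_genome_fraction(), 3))   # about 0.093
```

On these assumptions the genome carries on the order of a million Alu copies, occupying somewhat under a tenth of its length.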