Introduction

Uncertainty in Homology Inferences: Assessing and Improving Genomic Sequence Alignment
Gerton Lunter, Andrea Rocco, Naila Mimouni, Andreas Heger, Alexandre Caldeira, and Jotun Hein
• Sequence evolution and alignment • Context dependent substitution models • Indels • Statistical alignment

Non-coding DNA • Identification • Estimation • Evolution (adaptation?) • Expression (where? function?)

Evolution of expression levels How do gene expression levels evolve over time? Is selection or drift the main factor? Collaboration with Philipp Khaitovich (Shanghai); Raphaelle Chaix (Paris) • Structured Coalescent Population genetics with geographical structure (e.g. island populations; HIV transmission in different risk groups) Collaboration with Chris Holmes (Oxford), Oliver Pybus (Oxford), Alexei Drummond (New Zealand), Andrew Rambaut (Edinburgh)

New sequencing technologies • ABI: 1100 bases / read, 1 Mb / day • Roche / 454: 250 bases / read, 100 Mb / 7.5 hours • Solexa / Illumina: 35-50 bases / read, 2 Gb / 6.5 hours

This week
Genome structure and evolution -- general introduction, biological motivation
Bayesian statistics; Stochastic processes; substitutions
Pairwise alignments; probabilistic alignments; hidden Markov models
Applications: Neutral indel model; amount of functional sequence in human genome; positive selection on non-coding DNA
Population genetics; Lewontin’s “paradox of variation”; modelling geographic population structure

Genome structure and evolution

1. Genome structure – Double helix

1. Genome structure - Chromatin DNA is packed into chromosomes in a hierarchical way: DNA double helix is coiled around histone octamers. About 165 nucleotides wrap around a single octamer (wrapping 2.85 times). These “beads” are separated by a 50 nt long spacer sequence. Histone beads pack into 30 nm fibers. Fibers are tied up into scaffolds. Condensed scaffolds make up the macroscopic form of chromosomes.

1. Genome structure - Chromosomes Human cells have 46 chromosomes: 22 normal chromosomes (autosomes), in pairs (from father and mother), and two sex chromosomes (X from the mother, X or Y from the father). In preparation for normal cell division (mitosis), chromosomes are replicated, but remain joined at their centromere (prophase). This gives the chromosomes their “X” shape. Both “halves” of the X are called (sister) chromatids. When cells are not replicating, this is the usual form of chromosomes. http://biology.unm.edu/ccouncil/Biology_124/Summaries/Sex.html

1. Genome structure – DNA methylation • DNA methylation is an epigenetic marker that controls / regulates many biological functions: • Control of gene expression • Control of DNA replication • Control of the cell cycle • and more • Cytosines are methylated by an enzyme (C-methyltransferase), which targets CpG pairs (Cytosine, phosphate, Guanine). • Methylation patterns are established during early development, and maintained over many generations by maintenance methyltransferases copying the methylation status to a newly synthesized strand (note: CpG is its own reverse complement).

1. Genome structure – Histone modifications • Histones have tails which can be modified in various ways, and at several locations. Each (combination of) modifications has a different biological function (“Histone code”). • Histones are involved in many essential biological processes including • Gene regulation • DNA repair • Chromosome condensation / mitosis • “Until the early 1990s, histones were dismissed as merely packing material for nuclear DNA” (Wikipedia). Extreme conservation of histone proteins (found all the way back to archaea) suggests that they are involved in important biological pathways. Nature 447, 433-440 (24 May 2007) http://chemistry.gsu.edu/faculty/Zheng/

1. Genome structure – Sequence structure What does the human genome sequence consist of? Total size: 2,858,160,000 bp • Protein-coding genes: About 20,500 • Protein-coding exons: About 220,000; cover 1.2% of genome • Transposable elements: About 45% of genome • Tandem repetitive sequence: Few % • Heterochromatin: Few % • Unknown: About half. Conserved: About 5% Biologically functional: ? (>5%)

1. Genome structure – GC content GC content of mammalian genomes is variable, and shows long-distance structure. Regions of “fairly homogeneous” GC content are called “isochores”. Cannot be defined exactly, but reflect the fact that the genome does show compositional discontinuities. (Clay & Bernardi, Trends in Biotechn 2002, 20(6), p. 237.)

1. Genome structure - Genes Darwin (1809-1882) used the term “gemmule” to denote a microscopic unit of inheritance. Major problem in his day: why do traits not “blend out” by mixing? Mendel (1822-1884) was the first to suggest the existence of factors conveying traits from parent to offspring, and the pattern of their inheritance (e.g., two copies per individual, one from each parent; segregation during gamete production; different traits segregate independently), solving the problem of blending. 1889: Hugo de Vries coined the term “pangen”, later shortened to “gene”. 1910: Thomas Hunt Morgan: genes reside on specific chromosomes. 1941: Specific genes code for specific proteins. “One gene one enzyme” hypothesis. 1977: Roberts and Sharp discover introns. 2003: Genes often overlap; single genes have multiple products.

1. Genome structure - Genes
Eukaryotic protein-coding genes consist of:
• Upstream region (with regulatory signals)
• Promoter region, with transcription initiation site (e.g. TATA box)
• 5’ untranslated region (5’ UTR)
• Translation initiation site (includes start codon)
• Alternating sequence of exons (protein-coding) and introns
• Translation stop site (stop codon)
• 3’ UTR
• Polyadenylation (poly-A) signal
[Figure labels: Upstream region, Promoter, 3’ UTR]

1. Genome structure – Transposable elements TEs are “selfish genes” which when activated can insert copies of themselves into the genome. When this happens in the germline, these insertions are transmitted to the next generation. The vast majority of TEs can be classified into four families, based on the mechanism by which they copy themselves: • LINEs (Long Interspersed Nuclear Elements, autonomous) • SINEs (Short Interspersed Nuclear Elements, use LINE proteins for life cycle) • LTR elements (Long Terminal Repeats; derived from retroviruses) • DNA transposons (replicate without RNA intermediary)

1. Genome structure – Transposable elements TEs were discovered by Barbara McClintock in the 1950s, in maize where they are very active. In human somatic cells, TE insertions can cause disease. TEs are mostly neutral or deleterious. Although most are not useful (for us), so that there is no selection pressure to keep them (in the human population), many have remained just by chance. They are useful as a proxy for neutrally evolving sequence. A small proportion of TE-derived sequence has in fact been recruited into useful biological roles, and is now highly conserved.

1. Genome structure – Transposable elements The age of a TE can be determined (approximately) by counting the average number of substitutions from the consensus sequence, which is assumed to be the ancestral state. A histogram of TEs versus age shows the activity over time. Alus have been very active, but recently things have quieted down in human.
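The age estimate described on this slide can be sketched in a few lines. This is a minimal illustration, not the method used for the slide's histogram: it assumes the TE copy has already been aligned to its family consensus (equal length, gaps written as "-"), and the Jukes-Cantor correction and the `rate_per_site_per_year` parameter are assumptions introduced here for illustration.

```python
import math


def divergence(copy: str, consensus: str) -> float:
    """Fraction of aligned, ungapped sites where the copy differs from the consensus."""
    if len(copy) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore alignment columns that contain a gap in either row.
    pairs = [(a, b) for a, b in zip(copy.upper(), consensus.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped aligned sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Substitutions per site from observed divergence p (Jukes-Cantor, an assumption here)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def approximate_age(copy: str, consensus: str, rate_per_site_per_year: float) -> float:
    """Rough age in years; the neutral rate is a hypothetical input supplied by the caller."""
    return jukes_cantor(divergence(copy, consensus)) / rate_per_site_per_year
```

The correction matters mainly for old elements, where repeated substitutions at the same site make the raw count an underestimate.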

2. Genome evolution - introduction
In the course of a human lifetime, the genome is used, damaged, repaired, copied and handed down to offspring cells dozens of times. In the process, the genome is changed. Such a change is called a mutation. At first this involves just a single individual. If the change has no phenotypic consequence, no selection acts against (or for) the mutation, and chance determines whether the individual’s offspring will carry the mutation, and so on. The process by which the frequency of the mutation changes in the population is called (random) genetic drift.
Of all new neutral mutations in a population of 2N haploid genomes, a fraction 1/(2N) will eventually spread through the entire population. The mutation is then said to have gone to fixation. Mutations that have a beneficial effect have a (much) larger probability of getting fixed (once they reach a non-negligible population frequency), while deleterious mutations have almost no chance of going to fixation. Mutations that have become fixed in the population are called substitutions. (Note: “substitutions” usually refer to single nucleotide substitutions, but the term “indel substitution” is also used.)
When comparing genomes from different species, what you see are all the fixed mutations (substitutions) that have occurred since the two species split. Mutations that are not fixed but reside in either of the two individuals whose genomes were sequenced are called polymorphisms, and will also be included. Usually these form a small proportion and the distinction is ignored (but note that polymorphisms may well be deleterious, while substitutions rarely are).
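The 1/(2N) fixation probability for a new neutral mutation can be checked with a small simulation. The sketch below assumes a Wright-Fisher model (constant population size, non-overlapping generations, random resampling of 2N genomes), which the slide does not name explicitly; the function names are illustrative.

```python
import random


def fixes(two_n: int, rng: random.Random) -> bool:
    """Follow one new neutral mutation until it is lost (False) or fixed (True)."""
    count = 1  # a new mutation starts as a single copy among 2N genomes
    while 0 < count < two_n:
        freq = count / two_n
        # Wright-Fisher resampling: every genome in the next generation
        # copies a parent genome chosen uniformly at random.
        count = sum(rng.random() < freq for _ in range(two_n))
    return count == two_n


def fixation_fraction(two_n: int, trials: int, seed: int = 0) -> float:
    """Fraction of independent new mutations that reach fixation."""
    rng = random.Random(seed)
    return sum(fixes(two_n, rng) for _ in range(trials)) / trials
```

For example, `fixation_fraction(20, 20000)` should come out close to 1/20, with some sampling noise. Small 2N keeps the run time short; the same logic applies to larger populations.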

2. Genome evolution – nucleotide substitutions
Basically two causes: damage, and copy errors during replication. The two causes can be teased apart by comparing species with different generation times. More generations per unit of time mean more copying errors, while the rate of damage might stay relatively constant.
Errors are recognized and repaired by specific and highly efficient repair mechanisms. The resulting error rate is low: about 3 x 10^-8 per nucleotide per generation in humans. The repair mechanism is extremely important: damage to this system increases the likelihood of getting cancer.
The rate of mutagenesis is higher in males than in females (see e.g. Berlin et al., J Molec Evol 62(2) 226-233), probably due to more cell divisions in the male germline. This results in low mutation rates on the X, and high mutation rates on the Y chromosome.
In mammals, the rate of transitions (pyrimidine-to-pyrimidine or purine-to-purine) is about twice as high as the rate of transversions (pyrimidine-to-purine or vice versa).
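The transition / transversion distinction is easy to make concrete. The following sketch classifies single-nucleotide differences between two aligned sequences; the function names are illustrative, and columns containing gaps or ambiguous bases are simply skipped (an assumption made here for simplicity).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify(a: str, b: str) -> str:
    """Classify a pair of bases as 'identical', 'transition' or 'transversion'."""
    a, b = a.upper(), b.upper()
    if a not in BASES or b not in BASES:
        raise ValueError("expected one of A, C, G, T")
    if a == b:
        return "identical"
    # A transition stays within the same chemical class (purine or pyrimidine).
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"


def ti_tv_counts(seq1: str, seq2: str) -> tuple[int, int]:
    """Count (transitions, transversions) between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    ti = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in BASES or b not in BASES:
            continue  # skip gaps and ambiguous bases
        kind = classify(a, b)
        ti += kind == "transition"
        tv += kind == "transversion"
    return ti, tv
```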

2. Genome evolution – CpG mutation rate Methylation of Cytosine (mC) involves adding a methyl group (CH3) on to the C5 carbon. Accidental de-amination of the C4 carbon turns a mC into a normal Thymine. This results in a mismatch, but the “wrong” base cannot be identified, since both are in the “alphabet”. Result: substitution rate on CpG dinucleotides is about 15x higher than for ordinary C’s or G’s. (The same process on the reverse-strand mC causes a high mutation rate on the “G”). Over time, this causes CpGs to be about 4x underrepresented compared to the expectation based on C and G frequencies. For sequences that are not methylated (in the germline), this mechanism does not apply, resulting in “high” (i.e. normal) levels of CpG in so-called “CpG islands”. These are often promoters of ubiquitously expressed genes.
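The under-representation of CpGs can be measured as a ratio of observed to expected CpG counts, where the expectation comes from the C and G frequencies alone. The sketch below uses a commonly used form of this ratio; the function name is illustrative, and any threshold for calling a CpG island is left out because the slide does not give one.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed / expected CpG ratio for one strand of a sequence.

    Expected count assumes C and G occur independently:
    expected = (#C * #G) / length.
    """
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0  # no CpG is possible without both C and G
    observed = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    expected = n_c * n_g / len(seq)
    return observed / expected
```

On this measure, a ratio near 1 corresponds to the "normal" CpG levels of unmethylated regions, while the roughly fourfold depletion mentioned above corresponds to a ratio of about 0.25.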

2. Genome evolution – Transcription-coupled repair • When RNA polymerase II encounters a mutated nucleotide, it stops. This triggers the TCR pathway which repairs the mutation. • Failure of TCR leads to Cockayne syndrome, an extreme form of accelerated aging. • TCR is strand-asymmetric (mutations in the untranscribed strand are not corrected by TCR), and leads to asymmetric mutation rates in transcribed regions.

2. Genome evolution - Indels
CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA
CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA
(The three gapped regions in the alignment are the indels.)
When the ancestral sequence is not known, insertions and deletions cannot be distinguished, and are often referred to as “indels”. Indels form an important source of sequence change – more on this later. Most small indels are in fact deletions (by a factor 3 in human). Indels can have any size, up to several Mb. The majority are 1 nt indels.
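Reading indels off a pairwise alignment amounts to finding runs of gap characters in either row. The sketch below does this for the example alignment on this slide; the function names are illustrative, and, as the slide notes, without the ancestral sequence a gap only says that an indel occurred, not whether it was an insertion or a deletion.

```python
import re

ROW1 = "CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA"
ROW2 = "CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA"


def gap_runs(row: str) -> list[tuple[int, int]]:
    """Return (start column, length) for every run of '-' in an alignment row."""
    return [(m.start(), len(m.group())) for m in re.finditer(r"-+", row)]


def indels(row1: str, row2: str) -> list[tuple[int, int, int]]:
    """Return (start column, length, gapped row number) for each indel, by position."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    events = [(pos, length, 1) for pos, length in gap_runs(row1)]
    events += [(pos, length, 2) for pos, length in gap_runs(row2)]
    return sorted(events)


for start, length, row in indels(ROW1, ROW2):
    print(f"indel of {length} nt at column {start} (gap in row {row})")
```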

2. Genome evolution – Indel mechanisms During replication, the template and copy can become separated. If this happens in a tandem-repetitive region, there is a possibility of incorrect re-pairing (slippage). This can lead to both short insertions and deletions. Long stretches of short-period tandem repeats (microsatellites) are particularly prone to slippage. This is the reason behind the fast evolution of microsatellite length. The gene encoding huntingtin contains a repeat region of CAG triplets. Expansion of the number of CAG units beyond 36 causes Huntington’s disease. http://www.sci.sdsu.edu/~smaloy/
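Microsatellite length is simply the number of consecutive copies of the repeat unit. As a toy illustration (not a diagnostic tool), the sketch below finds the longest uninterrupted run of a unit such as CAG; the function names are hypothetical, the threshold of 36 is the figure quoted on the slide, and real repeat tracts may contain interruptions that this simple matcher does not handle.

```python
import re

CAG_THRESHOLD = 36  # number of CAG units quoted on the slide


def longest_repeat_run(seq: str, unit: str = "CAG") -> int:
    """Length, in repeat units, of the longest uninterrupted tandem run of `unit`."""
    pattern = f"(?:{re.escape(unit.upper())})+"
    runs = re.finditer(pattern, seq.upper())
    return max((len(m.group()) // len(unit) for m in runs), default=0)


def is_expanded(seq: str) -> bool:
    """True if the longest CAG run exceeds the threshold from the slide."""
    return longest_repeat_run(seq, "CAG") > CAG_THRESHOLD
```

For example, `longest_repeat_run("ATG" + "CAG" * 5 + "TT" + "CAG" * 2)` returns 5.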

2. Genome evolution – indel mechanisms Recombination between direct repeats in a single chromosome leads to a (potentially Mb size) deletion. Recombination requires (near) sequence identity over a fairly large region (100s nt?), so these deletions are mostly not very small. Unequal recombination (involving similar or identical regions at different chromosomal locations) can also lead to insertions (segmental duplications). In the picture, unequal recombination between sister chromatids at replication is shown. The same process may also happen between parental (homologous) chromosomes. http://www.sci.sdsu.edu/~smaloy/

2. Genome evolution – Recombination
Mechanism of recombination:
1. Double-stranded break (DSB) formation
2. Broken ends get digested
3. Single strands invade region with high sequence similarity
4. Repair and re-synthesis → Holliday junction
5. Holliday junction resolution: crossing over (black arrows), or no crossing over (grey arrows)
Gabriel Marais, Trends Genet, 19(6) 2003

2. Genome evolution - types of recombination • Double-stranded breaks appear: • Accidentally (somatic & germ cells) • “Repair” recombination • Deliberately (germ cells at meiosis) • “Sexual” (or “meiotic”) recombination • Different (but overlapping) pathways • Preference for: • sister chromatid in repair recombination • parental chromosome in sexual recombination • Recombination is obligatory during meiosis. Rate of recombination is >1 per generation per chromosome.

2. Genome evolution – Gene conversion • Gene conversion = copying of one stretch of DNA into another • Single-stranded DNA can invade sister chromatid • Identical DNA, so no mutations • If single strand invades parental chromosome: • Without crossing over: gene conversion • With crossing over: gene conversion + recombination • When the nicked strand invades a non-homologous but sequence-similar region (as in unequal recombination), gene conversion causes “sideways copying” of genetic material. This causes similarities to increase / persist. • The effect of gene conversion (without recombination) on the genome sequence is equivalent to two recombination events happening close to each other (order 1 kb).

2. Genome evolution – Biased Gene Conversion Mutation bias for GC as a side effect of gene conversion • Two repair mechanisms: • Base Excision Repair (BER) • Targets “hetero-mismatches”, AG, TG, AC, TC • Efficient; replaces just one base • Favours GC • Nucleotide Excision Repair (NER) • Targets AA, CC, TT, GG mismatches • Digests ~1kb, resynthesizes • Favours unbroken strand, no nucleotide bias • Second source of biased gene conversion • AT sites seem to be target for DSBs in sexual recombination • DSB strand gets digested, copied back from other allele • Result: bias towards GC

2. Genome evolution - Recombination hotspots • Rate of recombination is measured in centiMorgans (cM). Two genetic loci are 1 cM apart if 1 recombination per 100 generations occurs between them. • Recombination rate not uniform: • Background rate ~0.04 cM/Mb • Average rate ~1 cM/Mb • 0.5% of genome >15 cM/Mb • Recombination hotspot = gene conversion hotspot • Cause of hotspots not known: • CCTCCCT motif? • Bias for high GC • One mutation can change hotspot activity • “DNA2” locus in MHC region, CT suppresses hotspot • Perhaps differences in recombination rates have, over time, caused the current isochore structure through biased gene conversion. Myers, Bottolo, Freeman, McVean, Donnelly, Science 310 Oct 2005
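The centiMorgan definition translates directly into arithmetic: a region's genetic length in cM is its rate (cM/Mb) times its physical length (Mb), and 1 cM corresponds to an expected 0.01 crossovers per generation. The sketch below applies this to the rates quoted on the slide; the function names are illustrative and the rate is assumed constant across the region.

```python
def genetic_length_cm(rate_cm_per_mb: float, length_mb: float) -> float:
    """Genetic length of a region, assuming a constant recombination rate."""
    return rate_cm_per_mb * length_mb


def expected_crossovers(rate_cm_per_mb: float, length_mb: float) -> float:
    """Expected crossovers per generation in the region (1 cM = 0.01 crossovers)."""
    return genetic_length_cm(rate_cm_per_mb, length_mb) / 100.0


# Rates taken from the slide, applied to a 1 Mb region.
for label, rate in [("background", 0.04), ("average", 1.0), ("hotspot", 15.0)]:
    print(f"{label}: {expected_crossovers(rate, 1.0):.4f} crossovers per generation per Mb")
```

At the average rate, a chromosome of 100 Mb or more therefore expects at least one crossover per generation, in line with the rate of >1 per generation per chromosome stated earlier.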

2. Genome evolution – double stranded break repair Accidental breaks are also repaired through the non-homologous end joining (NHEJ) pathway. It does not require homologous sequence. Evolutionarily very old pathway: yeast and some bacterial species have NHEJ. It repairs most breaks correctly, but is also able to induce translocations (chromosome rearrangements). Gill and Fast, BMC Molecular Biology 2007, 8:24, doi:10.1186/1471-2199-8-24

2. Genome evolution – chromosomal rearrangements Mouse chromosomes (1-19 and X) coloured according to homology with human chromosomes (1-22 and X). In the roughly 2 x 80 million years that separate humans and mice, many chromosomal rearrangements have occurred.
