The Lion, the Leopard, the Wolf, or the Boar… Why not more?

The Lion, the Leopard, the Wolf, or the Boar… Why not more? Comparative Genomics

Bud Mishra Professor of Computer Science, Mathematics and Cell Biology ¦ Courant Institute, NYU School of Medicine, Tata Institute of Fundamental Research, and Mt. Sinai School of Medicine

Outline • Biology • Evolution by Duplication • Segmental Duplications in mammalian genomes • Mathematical Challenges in Systems Biology • Compare Genomes All-vs-All • Models: • Blast and SWAT: • MUMMER: • PRIZM • Nondeterminism, Probability and Determinism: Computing the Priors and Mixed Strategies. • Demos: IT WORKS!

Introduction to Biology

Introduction to Biology • Genome: • Hereditary information of an organism is encoded in its DNA and enclosed in a cell (unless it is a virus). All the information contained in the DNA of a single organism is its genome. • DNA molecule can be thought of as a very long sequence of nucleotidesor bases: S = {A, T, C, G}

Complementarity • DNA is a double-stranded polymer and should be thought of as a pair of sequences over S. However, there is a relation of complementarity between the two sequences: • A , T, C , G • That is if there is an A (respectively, T, C, G) on one sequence at a particular position then the other sequence must have a T (respectively, A, G, C) at the same position. • We will measure the sequence length (or the DNA length) in terms of base pairs (bp): for instance, human (H. sapiens) DNA is 3.3 £ 109 bp measuring about 6 ft of DNA polymer completely stretched out!

Inert & Rigid • Complementary base pairs (A-T and C-G) are connected by hydrogen bonds and the base-pair forms a coplanar “rung” connecting the two strands. • Cytosine and thymine are smaller (lighter) molecules, called pyrimidines • Guanine and adenine are bigger (bulkier) molecules, called purines. • Adenine and thymine allow only for double hydrogen bonding, while cytosine and guanine allow for triple hydrogen bonding. • Thus the chemical (through hydrogen bonding) and the mechanical (purine to pyrimidine) constraints on the pairing lead to the complementarity and makes the double stranded DNA both chemically inert and mechanically quite rigid and stable.

J.B.S. Haldane • “If I were compelled to give my own appreciation of the evolutionary process…, I should say this: In the first place it is very beautiful. In that beauty, there is an element of tragedy…In an evolutionary line rising from simplicity to complexity, then often falling back to an apparently primitive condition before its end, we perceive an artistic unity … • “To me at least the beauty of evolution is far more striking than its purpose.” • J.B.S. Haldane, The Causes of Evolution. 1932.

Human Condition

Mer-scape… • Overlapping words of different sizes are analyzed for their frequencies along the whole human genome • Red: 24-mers, • Green: 21-mers • Blue:18 mers • Gray:15 mers • To the very left is a ubiquitous human transposon Alu. The high frequency is indicative of its repetitive nature. • To the very right is the beginning of a gene. The low frequency is indicative of its uniqueness in the whole genome.

Doublet Repeats • Serendipitous discovery of a new uncataloged class of short duplicate sequences; doublet repeats. • almost always < 100 bp • appear to use duplicative machinery that does not have the hallmarks of transposition, segmental duplication, or pseudogene formation. • (Top) . The distance between the two loci of a doublet is plotted versus the chromosomal position of the first locus. • (Bottom) : Distribution of doublets (black) and segmental duplications (red) across human chromosome 2

J.B.S. Haldane (1932): “A redundant duplicate of a gene may acquire divergent mutations and eventually emerge as a new gene.” Susumu Ohno (1970): “Natural selection merely modified, while redundancy created.” EBD

Evolution By Duplication • Tandem: • Polymerase slippage • Unequal crossover: γ-globin duplication mediated by L1. • Interspersed: • Transposons • Exon shuffling • Segmental duplication • Whole genome • S. cerevisiae; early vertebrates; A. thaliana, etc

Duplicated Genes • Mutation during nonfunctionalization (MDN) • Gene sharing, Duplication-degeneration-complementary (DDC) • Dosage compensation • Multi-function constraint

Chromosomal Aberrations • Breakage • Translocation (Among non-homologous chromosomes.) • Formation of acentric and dicentric chromosomes. • Gene Conversions • Amplifications and deletions • Point mutations • Jumping genes a Transposition of DNA segments • Programmed rearrangements a E.g., antibody responses.

Recent Segmental Duplications Human • 3.5% ~ 5% of the human genome is found to contain • segmental duplications, with length > 5 or 1kb, identity > 90%. • These duplications are estimated to have emerged about 40Mya under neutral assumption. • The duplications are mostly interspersed (non-tandem), and happen both inter- and intra-chromosomally. From [Bailey, et al. 2002]

f - - f - - deletion or mutation insertion f + - f + - Duplication by recombination between other repeats or other mechanisms deletion or mutation insertion f ++ f ++ Duplication by recombination between repeats Mutation accumulation in the duplicated sequences The Model

The Mathematical Model Time after duplication 1-α-2β 1-α-2β 1-α-2β h0-- α α α α f - - h1-- γ 2β γ 2β 2β γ h0+- h0 α α α α H0 f + - 1-α-β/2-γ 1-α-β/2-γ 1-α-β/2-γ 2γ β/2 2γ β/2 2γ β/2 h0++ α α α α H1 f ++ h1 h1++ 1-α-2γ 1-α-2γ 1-α-2γ 0 ≤ d < ε ε ≤ d < 2ε (k-1)ε ≤ d < kε h1: proportion of duplications by repeat recombination; h1++: proportion of duplications by recombination of the specific repeat; h1- -: proportion of duplications by recombination of other repeats; h0: proportion of duplications by other repeat-unrelated mechanism; h0++: proportion of h0 with common specific repeat in the flanking regions; h0+-: proportion of h0 with no common specific repeat in the flanking regions; h0- -: proportion of h0 with no specific repeat in the flanking regions; α: mutation rate in duplicated sequences; β: insertion rate of the specific repeat; γ: mutation rate in the specific repeat; d: divergence level of duplications; ε: divergence interval of duplications.

Mechanisms ofSegmental Duplications in Mammalian Genomes • To test and quantify the hypothesis, we designed a Markov model that incorporates: • evolutionary dynamics of the sequence elements • interactions between different elements • Segmental duplication is a multi-mechanism process. • 12% was caused by recombination between the most active interspersed repeats. • Physical instability in the DNA sequences is also involved in duplication mechanism. • The results showed dynamic interaction between duplications at different scales.

TASKS • Paralogs: • Compare one genome against itself • Orthologs: • Compare one genome against another • Compare multiple sets of genomes • Metagenomes:

Grand Challenges in Computational Systems Biology

Mathematical challenges in Systems Biology • How to fully decipher the (digital) information content of the genome • How to do all-vs-all comparisons of 1000s of genomes • How to extract protein and gene regulatory networks from 1 & 2 • How to integrate multiple high-throughout data types dependably • How to visualize & explore large- scale, multi- dimensional data • How to convert static network maps into dynamic mathematical models • How to predict protein function ab initio • How to identify signatures for cellular states (e. g. healthy vs. diseased) • How to build hierarchical models across multiple scales of time & space • How to reduce complex multi- dimensional models to underlying principles

Evolution by Mutation • Mutant gene or DNA sequence: • Substitution, Insertion/Deletion, Recombination, Duplication, Gene Conversion • Spread through a population by genetic drift and/or natural selection and eventually is fixed in a species • Nucleotide substitution can be divided into two classes: • Transitions: Substitution of a purine by purine (A, G) or a pyrimidine by a pyrimidine (C,T) • Transversions: The other types. • More specific properties: • Frameshift mutation, nonsense mutation, synonymous or silent substitutions, non-synonymous or amino acid replacement substitution • Transposons, gene conversions, horizontal gene transfer.

Jukes Cantor • The nucleotide substitution occurs at any site with equal frequency, and that at each site a nucleotide changes to of the three remaining nucleotides with a probability of a per year. • Probability of a change of a nucleotide to another is r = 3 a • qt = Proportion of identical nucleotides between two sequences qt+1 = (1-2r) qt +(2/3) r (1-qt) q(t) = 1 – ¾(1-e-8rt/3)

Kimura 2 Parameters • The rate of transitional substitution per site per year = a • The rate of transversional substitution per site per year = 2b • Total substitution rate, r = a + 2 b • Pt is the proportion of identical transition-type pairs AG, GA, CT, TC • Qt is the proportion of identical transversion-type pairs AG, GA, CT, TC P(t) = (1/4)(1- 2 e-4(a+b)t +e-8b t) Q(t) = (1/2)(1 –e-8b t)

Other Models • Tajima & Nei: • Substitutions that seem to be rather insesnsitive to various disturbing factors. • Tamura: • Takes into account varying GC content • Hasegawa et al. (HKY) • Rzhetsky & Nei • Tamura & Nei

Blast and SWAT:Blast, and the world Blasts with you; SWAT, and you SWAT alone…

Pair-wise Alignment:Task Definition • Given • a pair of sequences (DNA or protein) • a method for scoring the similarity of a pair of characters • Do • determine the correspondences between substrings in the sequences such that the similarity score is maximized

DP Alignment Algorithms • Needleman & Wunsch • Improvement for Affine Gap-penalty, by Smith & Waterman (Swat) • Dynamic programming: • Determine alignment of two sequences by determining alignment of all prefixes of the sequences

Comparing Syntenic Regions Comparison of Human and Fugu orthologous genomic sequence using Plains and EMBOSS. This sequence contains six exon regions. Both Plains and EMBOSS correctly identify the exon region 2 in Fugu with 3 in Human, but only Plains correctly aligns the exon region 5 in Fugu with 5 in Human.

Heuristic Alignment • FASTA [Pearson & Lipman, 1988] • BLAST [Altschul et al., 1990] • BLAST heuristically finds high scoring segment pairs (HSPs): • identical length segments from 2 sequences with statistically significant match scores • i.e. ungapped local alignments • key tradeoff: sensitivity vs. speed

The MUMmer System • Delcher et al., Nucleic Acids Research, 1999 • Given genomes A and B • Find all maximal, unique, matching subsequences (MUMs) • Extract the longest possible set of matches that occur in the same order in both genomes • Close the gaps • Output the alignment

Find Longest Subsequence • Sort MUMs according to position in genome A • Solve variation of Longest Increasing Subsequence (LIS) problem to find sequences in ascending order in both genomes • Requires O(k log k) time where k is number of MUMs

Algorithm Homology Seed Indexing Reference BLAST exact k-mer Scanned with DFA* [31] WABA wobble base degenerate k-mer Array [32] LSH-ALL-PAIRS randomly projected k-mer with <d mismatches Sorted Array [33] BLASTZ discontinuous exact k-mer Hash Table [34][35] PatternHunter discontinuous exact k-mer Hash Table [36] BLAT exact or inexact k-mer Hash Table [37] CHAOS exact or inexact k-mer T-Trie [38] PASH discontinuous exact k-mer Hash Table [39] REPuter maximal exact repeat Suffix Tree [40] FORRepeats exact repeat Factor Oracle [41] The More the Merrier

Summary • Many innovative sequence alignment tools are available for detailed comparative genomic studies. • Recent segmental duplications in mammalian genomes (with identity level >90%) can be detected using BLAST and many other tools. • They use exact or inexact k-mers as homology seeds. • As homology levels become lower, they encounter a dilemma between sensitivity and computational efficiency • homologous segments, • segmental duplications, or • homology-based phylogentic distances.

Summary • To improve sensitivity they rely on exhaustive searches of exact matches with short mers or inexact matches with long mers. • They encounter too many false-positives, to be later filtered through an expensive post-processing step. • Instead, more stringent search criteria (longer mers with more exact matches) may be used to improve efficiency. • Then these algorithms fail to detect low-homology regions, such as ancient duplication events. • In order to detect less-recent duplications, orthologous genes have been used as “anchors” to map out the duplication blocks. But, for obvious reasons, they are unsuitable for identifying duplications that are not subject to strong selection process.

#2 PRIZM(Paxia, Rastogi, Zhou, Mishra) How to do all-vs-all comparisons of 1000s of genomes

PRIZM • It uses a Bayesian scheme. • It is efficient…Linear time. • It computes homologous regions between two genomes even when the homology level drops to a value around 65%. • Incorporates background knowledge about genome evolution, by experimenting with several priors (noninformative improper prior, exponential and Gamma priors, and priors based on Juke-Cantor one parameter and Kimura’s two parameters models of evolution). • The results appear to be unaffected by these choices, while the computational efficiency is only mildly affected by what prior is employed.

A Simple Observation • Homology is hard to compute but easy to verify. Quadratic vs. Linear time. • Can a probabilistic approach replace nondeterminism? If so, we can expect a probabilistic linear time algorithm. • Unlikely! But if we can use priors based on the underlying distributions, there is hope. • Many computational biology problems share this feature!

#2 The genomic sequences under comparison: A) M. genitalium (Y-axis) and M. pneumoniae (X-axis), computed using 300 probes from six iterations, taking about 5 seconds using a non-optimized algorithm

#2 The mouse chromosome 10 and rat chromosome 1 share a syntenic region about 20Mb at the beginning of the two chromosomes (arrow).

#2 Alignment between Mouse Chromosome 8 andhuman chromosome 4. In silico experiment uses 21 mers (5,105,3), and takes about 45 seconds.

Demo Human vs. Mouse M. genitalium vs. M. pneumoniae

Homology Curve • An m-mer is a word of length m, selected from either genome. • Consider a location in the first genome, G1[a] and a short window, starting at a. W1,a = G1[a, a+m−1] • Compare this window with a word of equal length from the second genome starting at G2[b]: W2,b = G2[b, b+m − 1]. • Define the homology level for the locations G1[a] and G2[b] and a window of size m as h(2)(G1[a], G2[b], m) = (1/m) åg=1m-1IG1[a+g] = G2[a + g].

Homology Curve • Let us define, h(a) to denote the highest homology level for genome G1 at position a and computed with respect to G2: h(a) = max1 ·b <|G2| - m h(2)(G1[a], G2[b], m) • The “homology curve” for the first genome, G1 with the respect to the second genome G2 is then defined as: h : [1..G1 −m − 1] → [0, 1] : aa h(a)

PRIZM • Replacing non-determinism with a probabilistic guessing scheme. • The probability distributions are based on biologically meaningful priors. • Using these priors it guesses a local homology curve, and designs and performs an in silico experiment. • It uses the results to verify its guess (in linear time) and refines the local homology curve and the probability distributions for the next iteration.

Probabilistic guesses • Use a Bayesian scheme and a boosting approach to modify the probability distributions of the “guessing experiments” from one iteration to next. • At each iteration, a sequence of words with a specific distribution is selected from one genome, and is optimally partitioned into groups for “in silico experiments” involving • exact-match search, • inexact-match search with one error, • inexact match search with two errors, etc.

Probabilistic guesses • These searches can be efficiently conducted over the second genome, • Assuming that the other genome has been preprocessed and stored in an efficient data structure (e.g., suffix arrays or hash table). • From the results of the experiments, a Bayesian estimator can compute the local homology levels for the genome, and use it to verify and sharpen the probabilistic distributions for the next iteration. • The algorithm converges to the true local homology levels after a few iterations.

In Silico Experiments • Let b, B = | G2 |/b, w, W, m, N0, N1, . . ., Nk (k ≤ m, and in our applications usually k = 2) be some pre-specified parameters. • Choose k+1 random subsets, S0, S1, . . ., Sk, of words each of length m, randomly (i.i.d. uniform) from G1[a, a+w−1], such that |S0| = N0, |S1| = N1, . . . , and |Sk| = Nk. • Consider a block in the second genome of length b: Bb = G2 [b, b+b−1]. Let X0 (X1, . . ., Xk, respectively) be defined as the number of m-mers from set S0 (S1, . . .,Sk, respectively) that match exactly (with one, two and so on up to k mismatches, respectively) to an m-mer in G2[b, b +b − 1].

Experiment Design • By examining the sensitivity (∂(Xi/Ni)/∂h = a′i[h]) we can divide the interval for h into three intervals: [1/4, θ1] ≈ [1/4, (m−2)/m], (θ1, θ2] ≈ [(m−2)/m, (m−1)/m] and (θ1, 1] ≈ ((m−1)/m, 1], such that the choices of (N0,N1,N2) are based on the following mixed strategies: N0 = (K/b) sq21 pi,I(h) dh N1 = (K/3bm) sq1q2 pi,I(h) dh N2 = (2K/9b(m2 –m)) s1/4q1 pi,I(h) dh ….

The Lion, the Leopard, the Wolf, or the Boar… Why not more?