Comparative Genome Maps

CSCI 7000-005: Computational Genomics • Debra Goldberg • debg@hms.harvard.edu

What is a comparative map?

Why construct comparative maps? • Identify & isolate genes • Crops: drought resistance, yield, nutrition... • Human: disease genes, drug response,… • Infer ancestral relationships • Discover principles of evolution • Chromosome • Gene family • “key to understanding the human genome”

Why automate? • Time consuming, laborious • Needs to be redone frequently • Codify a common set of principles • Nadeau and Sankoff: warn of “arbitrary nature of comparative map construction”

Definitions • Marker: identifiable chromosomal locus • Homology: genes with a common ancestor • Homeology: chromosomal regions derived from a common ancestral linkage group • Synteny: loci on the same chromosome • Colinearity: syntenic regions with conserved gene order

Input/Output • Input: • genetic maps of 2 species • marker/gene correspondences (homologs) • Output: • a comparative map • homeologies identified

Map construction • Go from this to this [Figure: Maize 1 (target) labeled with Rice (base) segments 3S, 8L, 10L, 3L. Wilson et al., Genetics 1999]

Chromosome labeling [Figure: Maize 1 (target) labeled with Rice (base) segments 3S, 8L, 10L, 3L. Wilson et al., Genetics 1999]

A natural model? [Figure: Maize 1 (target) labeled with Rice (base) segments 3S, 8L, 10L, 3L. Wilson et al., Genetics 1999]

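Scoring • $m$: penalty for each mismatched marker • $s$: penalty for each segment boundary [Figure: adjacent segments 10L and 3L]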
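The scoring rule can be sketched in a few lines, assuming a labeling assigns one base-chromosome label to every target marker. The function name `linear_score`, the marker data, and the default values `m=1`, `s=2` are illustrative assumptions, not taken from the slides.

```python
def linear_score(homologs, labels, m=1, s=2):
    """Score one labeling: m per mismatched marker, s per segment boundary.

    homologs[i] is the base location of the homolog of marker i;
    labels[i] is the label assigned to marker i.
    """
    if len(homologs) != len(labels):
        raise ValueError("need exactly one label per marker")
    # A marker is mismatched when its homolog is not on the labeled segment.
    mismatches = sum(1 for h, a in zip(homologs, labels) if h != a)
    # A boundary occurs wherever two adjacent markers carry different labels.
    boundaries = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return m * mismatches + s * boundaries


# Hypothetical markers: one stray 3L homolog inside a 10L segment.
homologs = ["10L", "10L", "3L", "10L", "3L", "3L"]
labels = ["10L"] * 4 + ["3L"] * 2
print(linear_score(homologs, labels))  # one mismatch + one boundary = 3
```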

Assumptions • Accept published marker order • All linkage groups of base are unique • Simplistic homeology criteria • At least one homeologous region

A natural model?

Dynamic programming • $l_i$ = location of the homolog to marker $i$ • $S[i,a]$ = penalty (score) for an optimal labeling of the submap from marker $i$ to the end, when labeling begins with label $a$ [Figure: markers 1 ... i ... n, with label a starting at marker i]

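Recurrence relation
$$S[n,a] = m\,\delta(a, l_n)$$
$$S[i,a] = m\,\delta(a, l_i) + \min_{b \in L}\bigl(S[i+1,b] + s\,\delta(a,b)\bigr)$$
where $\delta(x,y) = 0$ if $x = y$ and 1 otherwise, and $L$ is the set of labels. [Figure: marker $i$ labeled $a$, marker $i+1$ labeled $b$, with homolog locations $l_i, l_{i+1}, \ldots, l_n$]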
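A minimal sketch of this linear recurrence, filled in from the last marker backward. The function name `linear_dp` and the defaults `m=1`, `s=2` are assumptions for illustration. The mismatch test `a != homologs[i]` is used as the 0/1 indicator in the recurrence.

```python
def linear_dp(homologs, labels, m=1, s=2):
    """Return the optimal linear-model score over all labelings.

    homologs[i] is the base location of the homolog of marker i;
    labels is the set of candidate labels.
    """
    if not homologs:
        return 0
    labels = list(labels)
    # Base case: S[n, a] is the mismatch penalty at the last marker.
    S = {a: m * (a != homologs[-1]) for a in labels}
    # S[i, a] = mismatch at i + best continuation, paying s when the label changes.
    for i in range(len(homologs) - 2, -1, -1):
        S = {
            a: m * (a != homologs[i])
            + min(S[b] + s * (a != b) for b in labels)
            for a in labels
        }
    # The first label is unconstrained, so take the best starting label.
    return min(S.values())


print(linear_dp(list("aaabbbccc"), "abc"))  # 4: two boundaries
print(linear_dp(list("aaabbbaaa"), "ab"))   # 3: three mismatches beat two boundaries
```

Only the column for marker `i+1` is kept while computing the column for marker `i`, so memory is proportional to the number of labels.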

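Problem with linear model ($s = 2$, $m = 1$) • a-b-c motif: markers a a a b b b c c c, labeled a, b, c; score: $2s = 4$ • a-b-a motif: markers a a a b b b a a a, labeled a throughout; score: $3m = 3$ • In the a-b-a motif, calling the three b markers mismatches is cheaper than opening and closing a b segment ($2s = 4$), so the inserted segment is lost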

The stack model • Segment at top of the stack can be: • pushed (remembered), later popped • replaced • Push and replace cost $s$; pop is free. [Figure: example stacks of segment labels]
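The cost rules can be made concrete by replaying a sequence of stack operations. This is a sketch only: the operation names, the free `"start"` operation that opens the first segment, and the default `s=2` are assumptions introduced for illustration.

```python
def stack_cost(ops, s=2):
    """Replay stack operations; return (total cost, top label after each op)."""
    stack, cost, trace = [], 0, []
    for op, label in ops:
        if op != "start" and not stack:
            raise ValueError("the first operation must be 'start'")
        if op == "start":        # assumed free: opens the first segment
            stack = [label]
        elif op == "push":       # remember the current segment, open a new one
            stack.append(label)
            cost += s
        elif op == "replace":    # forget the current segment, open a new one
            stack[-1] = label
            cost += s
        elif op == "pop":        # return to the remembered segment at no cost
            stack.pop()
        else:
            raise ValueError(f"unknown operation: {op}")
        if not stack:
            raise ValueError("no segment left on the stack")
        trace.append(stack[-1])
    return cost, trace


# a-b-a: push b onto a, then pop back to a for free.
print(stack_cost([("start", "a"), ("push", "b"), ("pop", None)]))
# a-b-c: each new segment replaces the previous one.
print(stack_cost([("start", "a"), ("replace", "b"), ("replace", "c")]))
```

With `s = 2`, the a-b-a sequence costs 2 and the a-b-c sequence costs 4, so an inserted segment is cheaper than two unrelated segment changes.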

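Scoring [Figure: maize markers with rice homolog locations, in map order: uaz265a (7L), isu136 (2L), isu151 (7L), rz509b (7L), cdo59c (7L), rz698c (9L), bcd1087a (9L), rz206b (9L), bcd1088c (9L), csu40 (3S), cdo786a (9L), csu154 (7L), isu113a (7L), csu17 (7L), cdo337 (3L), rz530a (7L). Labeled 7L, then 9L (cost $s$), then 7L again (“free” pop); three mismatched markers cost $m$ each.]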

Dynamic programming • $S[i,j,a]$ = score for an optimal labeling of: • the submap from marker $i$ to marker $j$ • when labeling begins with label $a$, i.e., marker $i$ is labeled $a$ [Figure: markers 1 ... i ... j ... n, with label a at marker i]

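Recurrence relation
$$S[i,i,a] = m\,\delta(a, l_i)$$
$$S[i,j,a] = \min \begin{cases} m\,\delta(a, l_i) + \min_{b \in L}\bigl(S[i+1,j,b] + s\,\delta(a,b)\bigr) \\ \min_{i<k<j}\bigl(S[i,k,a] + S[k+1,j,a]\bigr) \end{cases}$$
where $\delta(x,y) = 0$ if $x = y$ and 1 otherwise. The first case continues from marker $i+1$ with label $b$; the second splits the submap after marker $k$ and resumes label $a$ at marker $k+1$.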
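The stack recurrence can be sketched with memoized recursion. The function name `stack_dp`, the defaults `m=1`, `s=2`, and the choice of candidate labels (every location that appears among the homologs) are assumptions for illustration. The marker locations are those listed on the Scoring slide for the 7L / 9L example.

```python
from functools import lru_cache


def stack_dp(homologs, labels, m=1, s=2):
    """Return the optimal stack-model score over all labelings."""
    homologs = tuple(homologs)
    labels = tuple(labels)
    if not homologs:
        return 0

    @lru_cache(maxsize=None)
    def S(i, j, a):
        mismatch = m * (a != homologs[i])
        if i == j:
            return mismatch
        # Case 1: continue from marker i+1 with label b, paying s if b differs from a.
        best = mismatch + min(S(i + 1, j, b) + s * (a != b) for b in labels)
        # Case 2: split after marker k; label a resumes at k+1 at no extra cost.
        for k in range(i + 1, j):
            best = min(best, S(i, k, a) + S(k + 1, j, a))
        return best

    return min(S(0, len(homologs) - 1, a) for a in labels)


# Rice locations of the homologs of the maize markers, in map order.
maize = ["7L", "2L", "7L", "7L", "7L", "9L", "9L", "9L", "9L", "3S",
         "9L", "7L", "7L", "7L", "3L", "7L"]
print(stack_dp(maize, sorted(set(maize))))  # 5 = 3m + s with m=1, s=2
print(stack_dp(list("aaabbbaaa"), "ab"))    # 2: one paid segment, free return
```

The table has one entry per (i, j, a) triple and each entry scans all labels and all split points, so this sketch is cubic in the number of markers. That is fine for a single chromosome map but is not tuned for large inputs.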

Stack Results: infers evolutionary events Wilson et al. Maize 1 (target) Rice (base)

Problem: Incomplete input • Gene order not always fully resolved. • Co-located genes can be ordered to give the most parsimonious labeling. [Figure: co-located markers with homologs on 8p and 19p]

The reordering algorithm • Uses a compression scheme • Within a megalocus, group genes by location of related gene. • Order these groups • First, last groups interact with nearby genes • Any ordering of internal groups is equally parsimonious
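A minimal sketch of the grouping step, assuming a megalocus is given as a list of (marker name, homolog location) pairs and that the first and last groups have already been chosen. The function name `order_megalocus` and the marker names are hypothetical. Choosing which groups go first and last is the job of the full algorithm and is not shown here.

```python
def order_megalocus(markers, first, last):
    """Group co-located markers by homolog location and order the groups.

    The group for `first` is placed at the start and the group for `last`
    at the end; internal groups keep their order of first appearance,
    since any internal order is equally parsimonious.
    """
    groups = {}
    for name, location in markers:
        groups.setdefault(location, []).append(name)
    if first not in groups or last not in groups:
        raise ValueError("first and last must be locations present in the megalocus")
    if first == last and len(groups) > 1:
        raise ValueError("this sketch needs distinct first and last groups")
    internal = [loc for loc in groups if loc not in (first, last)]
    order = [first] + internal + ([last] if last != first else [])
    return [name for loc in order for name in groups[loc]]


# Hypothetical megalocus: four markers mapped to the same position.
markers = [("g1", "8p"), ("g2", "19p"), ("g3", "8p"), ("g4", "4q")]
print(order_megalocus(markers, first="8p", last="19p"))
# ['g1', 'g3', 'g4', 'g2']
```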

The reordering algorithm

Definitions • $\delta$ extended to the distance to a set $A$ of labels: $\delta(a, A) = 0$ if $a \in A$, 1 otherwise • $S$ = the set of indices of supernode start elements • For simplicity, call $i \in S$ a supernode

Definitions • For $i \in S$: • $n_i$ = number of markers in $i$ • $n_i(a)$ = number of markers in $i$ with a homolog on $a$ • $l_i$ = set of labels matching markers in $i$: $l_i = \{a \in L \mid n_i(a) \ge 1\}$
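These per-supernode quantities are simple counts. The sketch below assumes a supernode is given as the list of homolog locations of its markers; the function name `supernode_stats` and the example data are hypothetical.

```python
from collections import Counter


def supernode_stats(homolog_locations):
    """Return (n_i, n_i(a) as a Counter, l_i) for one supernode."""
    counts = Counter(homolog_locations)  # counts[a] plays the role of n_i(a)
    n_i = sum(counts.values())           # total number of markers in the supernode
    l_i = {a for a, c in counts.items() if c >= 1}  # labels with a matching marker
    return n_i, counts, l_i


n_i, counts, l_i = supernode_stats(["8p", "19p", "8p"])
print(n_i, counts["8p"], counts["4q"], sorted(l_i))
```

A `Counter` returns 0 for labels that never occur, which matches the count for a label with no marker in the supernode.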

Definitions • $p_i(c)$ gives mismatched marker and segment boundary penalties for label $c$:
$$p_i(c) = \begin{cases} s & \text{if } m\,n_i(c) \ge s \\ m\,n_i(c) & \text{if } m\,n_i(c) < s \end{cases}$$
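Reading the two cases as "pay whichever is smaller", the penalty for the markers of one label hidden inside a supernode can be sketched as below. The function name `hidden_penalty` and the defaults `m=1`, `s=2` are assumptions for illustration.

```python
def hidden_penalty(n_ic, m=1, s=2):
    """Penalty for the n_ic markers of one label inside a supernode.

    Either give them their own segment (cost s) or leave them all as
    mismatches (cost m * n_ic), whichever is cheaper.
    """
    return min(s, m * n_ic)


for n in range(4):
    print(n, hidden_penalty(n))  # 0 -> 0, 1 -> 1, 2 -> 2, 3 -> 2
```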

Definitions • $p(i,a,b)$ gives the total mismatched marker and segment boundary penalties attributed to “hidden markers”:
$$p(i,a,b) = \begin{cases} \sum_{c \ne a,b} p_i(c) + m\,\varepsilon_i(a,b) & \text{for } i \in S,\ a \ne b \\ \sum_{c \ne a} m\,n_i(c) + m\,\varepsilon_i(a,b) & \text{for } i \in S,\ a = b \\ 0 & \text{otherwise} \end{cases}$$

Definitions • For $i \in S$: • $\eta_i(a,b)$ = number of labels in $\{a,b\}$ without a matching marker in $i$ • $\eta_i(a,b) = \delta(a, l_i) + \delta(b, l_i)$ • $\eta_i(a,b) \in \{0,1,2\}$

Definitions • $\varepsilon_i(a,b)$ corrects for mismatched marker penalties assigned twice to the same marker: once in the recurrence and once in $p(i,a,b)$ • For example: • $\varepsilon_i(a,b) = 0$ if $\eta_i(a,b) = 0$ (if $a$, $b$ are both represented in the supernode) • $\varepsilon_i(a,a) = -2$ if $\eta_i(a,a) > 0$ (if $a$ is not represented in the supernode)
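The set distance and the count of labels in {a, b} with no matching marker in a supernode can be sketched directly from their definitions. The function names `delta` and `unmatched_count` and the example labels are hypothetical. The correction term is not implemented here, because the slides give it only by example.

```python
def delta(a, label_set):
    """Distance from label a to a set of labels: 0 if a is in the set, else 1."""
    return 0 if a in label_set else 1


def unmatched_count(a, b, supernode_labels):
    """Number of labels among a and b with no matching marker in the supernode."""
    return delta(a, supernode_labels) + delta(b, supernode_labels)


l_i = {"8p", "19p"}  # labels represented in a hypothetical supernode
print(unmatched_count("8p", "19p", l_i))  # 0: both represented
print(unmatched_count("8p", "4q", l_i))   # 1: 4q is not represented
print(unmatched_count("4q", "4q", l_i))   # 2: counted once for a and once for b
```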

Recurrence relation
$$S[i,i,a] = m\,\delta(a, l_i)$$
$$S[i,j,a] = \min \begin{cases} m\,\delta(a, l_i) + \min_{b \in L}\bigl(S[i+1,j,b] + s\,\delta(a,b) + p(i,a,b)\bigr) \\ \min_{i<k<j,\ k \notin S}\bigl(S[i,k,a] + S[k+1,j,a]\bigr) \end{cases}$$

Results: Fewer mismatches [Figure: stack vs. reordering labelings. Mouse 5 (target), Human (base)]

Results: Mismatches placed between segments [Figure: stack vs. reordering labelings. Mouse 8 (target), Human (base)]

Results: Detects new segments [Figure: stack vs. reordering labelings. Mouse 13 (target), Human (base)]

Summary • Finds optimal comparative map • Arranges markers in most parsimonious way • First algorithm to use megalocus data • Fast, objective, simple to use • Biologically meaningful results

Summary • Global view • Biologically meaningful results • Provides testable hypotheses • Robust • not species-specific • high/low resolution, genetic/physical maps • stable to errors in marker order

Future Directions • Algorithmic extensions • 3rd species • polyploidy • search for ancient duplications • Deduce history of evolutionary events • makes genome rearrangement measures tractable and robust • infer common ancestor

Future Directions • Block-segmental sequence comparisons • non-local sequence alignment • protein domains • 2D block-segmental comparisons • comparison of regulatory networks • image processing

Acknowledgments • NSF • AAUW • David and Lucile Packard Foundation • USDA • Cooperative State Research Education and Extension Service • ONR • Jon Kleinberg • Susan McCouch • Chris Pelkie • Sandra Harrington • Sam Cartinhour • Dave Schneider
