RECOMBINOMICS: Myth or Reality? Laxmi Parida, IBM Watson Research, New York, USA. RoadMap: Motivation; Reconstructability (Random Graphs Framework); Reconstruction Algorithm (DSR Algorithm); Conclusion. www.nationalgeographic.com/genographic, www.ibm.com/genographic
RECOMBINOMICS: Myth or Reality? Laxmi Parida, IBM Watson Research, New York, USA
RoadMap • Motivation • Reconstructability (Random Graphs Framework) • Reconstruction Algorithm (DSR Algorithm) • Conclusion
Five-year study, launched in April 2005, to address anthropological questions on a global scale using genetics as a tool • Although fossil records fix human origins in Africa, little is known about the great journey that took Homo sapiens to the far reaches of the earth. How did we, each of us, end up where we are? • Samples from all around the world are being collected, and the mtDNA and Y-chromosome are being sequenced and analyzed • Phylogeographic question
DNA material in use under unilinear transmission • mtDNA: 16,000 bp • Y-chromosome: 58 million bp • 0.38%
Missing information in unilinear transmissions past present
Paradigm Shift in Locus & Analysis: Using recombining DNA sequences • Why? • Nonrecombining gives a partial story • represents only a small part of the genome • behaves as a single locus • unilinear (exclusively male or female) transmission • Recombining: towards more complete information • Challenges • Computationally very complex • How to comprehend complex reticulations?
RoadMap • Motivation • Reconstructability (Random Graphs Framework) • Reconstruction Algorithm (DSR Algorithm) • Conclusion L Parida, Pedigree History: A Reconstructability Perspective using Random-Graphs Framework, Under preparation.
The Random Graphs Framework GRAPH DEF: • Infinite number of vertices arranged in finite-sized rows • Edges introduced via a random process across immediate rows PROPERTIES: Address some topological questions • First, identify a Probability Space • Then, pose and address specific questions (such as expected depth of LCA, etc.)
The Random Graphs Framework • Wright-Fisher Model: constant population; non-overlapping generations; panmictic • Infinite number of vertices with a specific organization • Edges introduced via a random process satisfying specific rules • Address some topological questions • Define a Probability Space • Pose and answer specific questions (such as expected depth of LCA, etc.)
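The row-by-row random process can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the construction used in the talk: every row is assumed to hold 2N vertices, N of each color, and every vertex is assumed to pick one parent of each color uniformly at random from the row above. The function name `shallowest_ca_depth` and the `max_depth` cutoff are hypothetical.

```python
import random


def shallowest_ca_depth(K, N, rng, max_depth=10_000):
    """Depth of the first row holding a common ancestor of all K extant units.

    Assumed model: each row has 2N vertices; indices 0..N-1 have one color,
    N..2N-1 the other. Each vertex picks one parent of each color uniformly
    at random from the row above. Returns None if max_depth is exhausted.
    """
    if not 1 <= K <= 2 * N:
        raise ValueError("need 1 <= K <= 2N")
    if K == 1:
        return 0  # a single extant unit is trivially its own common ancestor
    full = (1 << K) - 1
    # masks[v] = bitmask of the extant units that descend from vertex v
    masks = {v: 1 << i for i, v in enumerate(rng.sample(range(2 * N), K))}
    for depth in range(1, max_depth + 1):
        parents = {}
        for mask in masks.values():
            # one parent of each color, drawn independently
            for p in (rng.randrange(N), N + rng.randrange(N)):
                parents[p] = parents.get(p, 0) | mask
        if any(m == full for m in parents.values()):
            return depth
        masks = parents
    return None


def mean_depth(K, N, trials, seed=0):
    """Monte Carlo average of the shallowest common-ancestor depth."""
    rng = random.Random(seed)
    return sum(shallowest_ca_depth(K, N, rng) for _ in range(trials)) / trials
```

Calling `mean_depth(4, 8, 1000)` gives an empirical estimate of the expected depth for one choice of K and N; the bitmask per vertex keeps only the information needed to detect a common ancestor, which matches the vertex-centric view of the graph.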
Properties of this Pedigree Graph • DAG Directed Acyclic Graph • |E| = O(|V|) for any finite fragment; sparse graph…Vertex-centric view.. • Focus on the flow of genetic material: relevant pedigree graph
Pedigree Graph: GPG(K,N) • K no of extant units • 2N population size/generation • Can the model ignore color of vertex?
Pedigree Graph: GPG(K,N) • K no of extant units • 2N population size/generation • Can the model ignore color of vertex? Forbidden Structure
Probability Space • Space is non-enumerable • Uniform probability measure? WF pop • Probability of some event F(h) for a fixed depth, h, & take limit:
Topological Property of GPG(K,N) • Least Common Ancestor (LCA) of ALL (K) extant vertices (TMRCA or GMRCA) • How many LCAs? • Expected Depth of the shallowest LCA
Infinite No. of LCAs in a GPG(4,3) instance … In fact, there exist infinitely many such instances!
Topological Property of GPG(K,N) • Least Common Ancestor (LCA) (TMRCA or GMRCA) • How many LCAs? • Expected Depth of the shallowest “LCA”: MEASURE OF RECONSTRUCTABILITY
(Genetic Exchange) Sexual Reproduction vs Graph Model Ancestor without ancestry
Graph Theory vis-à-vis Population Genetics • Graph Theoretic (topological): • CA: common ancestor • LCA: Least CA or Shallowest CA; MRCA: Most Recent CA; TMRCA: The MRCA • Graph Theoretic + Biology (Genetic Exchange): • CAA: common ancestor-&-ancestry • LCAA: Least CAA; GMRCA: Grand MRCA • Unilinear Transmission
Different Models as Subgraphs • Pedigree Graph GPG(K,N): each vertex has 2 parents • Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N): each vertex has 1 parent (mtDNA Tree, NRY Tree) • Mixed Subgraph GPGE(K,N,M): no. of vertices/row no more than KM; each vertex has 1 OR 2 parents; M is no. of completely linked segments in each extant unit (Genetic Exchange Model, ARG)
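For the one-parent subgraphs, the depth at which the K extant lineages merge into a single vertex can be sampled directly. The sketch below assumes N vertices of the relevant color per row and a uniformly random parent for each vertex; the name `tmrca_depth` is hypothetical and the code is an illustration, not the analysis behind the talk.

```python
import random


def tmrca_depth(K, N, rng):
    """Rows needed for K lineages to merge when each vertex has one parent.

    Assumed model: N same-colored vertices per row, each picking its single
    parent uniformly at random from the N vertices of the row above.
    """
    if not 1 <= K <= N:
        raise ValueError("need 1 <= K <= N")
    lineages = set(rng.sample(range(N), K))  # the K extant vertices
    depth = 0
    while len(lineages) > 1:
        # lineages that draw the same parent merge into one
        lineages = {rng.randrange(N) for _ in lineages}
        depth += 1
    return depth
```

Averaging `tmrca_depth` over many runs, and comparing it with the two-parent case, gives a feel for how differently the tree and pedigree models behave.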
Different Models GPG(4,8) GPGE(4,8,2) GPTY(4,8)
Different Models as Subgraphs • Pedigree Graph GPG(K,N) • Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N) • Mixed Subgraph GPGE(K,N,M) LCAg GMRCA LCA h TMRCA LCAg GMRCA
GPGE(K,N,M) hARG • Ancestral Recombinations Graph (Griffiths & Marjoram ’97) • Embellish GPGE(K,N,M) with Genetic Exchanges (GE) • Each extant unit has M segments • No vertex with zero ancestral segments (to extant units)
Mixed Subgraph GPGE(K,N,M) • Plausible GE assignment? • Can GPGE(K,N,M) go colorless? • Yes, through algorithmic subsampling…
Algorithm: Embellish GPGE(K,N,M) • Assign sequence, s, to an instance, e.g. s = K, (2K), (2K-7), (2K-15), … • Construct M sequences si • Each si is monotonically decreasing • si[j] no bigger than s[j] • Associate each si with a segment, and each element si[j] = k with k randomly selected vertices at depth j
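The embellishment steps can be sketched as follows. Several choices here are assumptions made for illustration: s is taken to list the number of vertices in each row of the instance (with s[0] = K), "monotonically decreasing" is read as non-increasing, every extant unit is assumed to carry all M segments, and the counts are drawn uniformly. The sketch does not check that the chosen vertices are connected by edges, nor that every vertex ends up with at least one segment. The name `embellish` is hypothetical.

```python
import random


def embellish(s, M, rng):
    """Assign M segments to randomly chosen vertices, row by row.

    s[j] is the (assumed) number of vertices in row j, with s[0] = K.
    Returns a list with one entry per segment; entry i is a list of sets,
    where set j holds the row-j vertices chosen to carry segment i.
    """
    K = s[0]
    carriers = []
    for _ in range(M):
        counts = [K]  # assumption: each extant unit carries every segment
        for j in range(1, len(s)):
            # keep the sequence non-increasing and no bigger than s[j]
            upper = min(counts[-1], s[j])
            counts.append(rng.randint(1, upper))
        carriers.append(
            [set(rng.sample(range(s[j]), k)) for j, k in enumerate(counts)]
        )
    return carriers


# Example with K = 8 and the slide's pattern s = K, 2K, 2K-7, 2K-15
example = embellish([8, 16, 9, 1], M=3, rng=random.Random(0))
```

Each of the M sequences is generated independently, so different segments may be carried by different vertices at the same depth, which is what lets the embellished graph contain both one-parent and two-parent vertices.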
“Topological” Defn of LCAA in GPGE(K,N,M) • Input: GPGE(K,N,M) with GE embellishment • LCAA • CA in all M subgraphs (trees) • Least such CA
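Read literally, this definition suggests a simple procedure: collect the common ancestors of the extant units in each of the M trees, intersect those sets, and take the shallowest survivor. The sketch below assumes each tree is given as a child-to-parent dictionary over shared vertex labels and that `depth` maps each vertex to its row; the names `ancestors` and `lcaa` are hypothetical.

```python
def ancestors(parent, v):
    """Vertices on the path from v to the root of one tree (v included)."""
    chain = [v]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def lcaa(trees, extant, depth):
    """Shallowest vertex that is a common ancestor in every tree.

    trees  : list of child -> parent dicts, one per segment (assumed acyclic)
    extant : the K extant vertices
    depth  : vertex -> row index (0 for extant units)
    Returns None when no vertex is a common ancestor in all trees.
    """
    common = None
    for parent in trees:
        for v in extant:
            anc = set(ancestors(parent, v))
            common = anc if common is None else common & anc
    if not common:
        return None
    return min(common, key=lambda v: depth[v])


# Two segments over the same vertices: the first tree alone has LCA "c",
# but "c" is not an ancestor of "b" in the second tree.
t1 = {"a": "c", "b": "c", "c": "e"}
t2 = {"a": "c", "b": "d", "c": "e", "d": "e"}
rows = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
```

With these inputs, `lcaa([t1], ["a", "b"], rows)` returns `"c"` while `lcaa([t1, t2], ["a", "b"], rows)` returns `"e"`, showing how requiring a common ancestor in all M trees can push the LCAA deeper than the LCA of any single tree.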
Different Models as Subgraphs • Pedigree Graph GPG(K,N) • Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N) • Mixed Subgraph GPGE(K,N,M) LCAAh GMRCA LCA h TMRCA LCAAh GMRCA
Probability of Instances with Unique LCA/LCAA • Pedigree Graph GPG(K,N) • Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N) • Mixed Subgraph GPGE(K,N,M)
“Topological” Defns of LCAA GMRCAhLCAAlLCA & lone pair TMRCA h LCA GMRCA hLCAAlLCA & lone node • Pedigree Graph GPG(K,N) • Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N) • Mixed Subgraph GPGE(K,N,M)
Expected Depth E(D) of LCA/LCAA • Pedigree Graph GPG(K,N) • Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N) • Mixed Subgraph GPGE(K,N,M) O(N2) O(K) O(KM)
RECONSTRUCTABILITY • Pedigree Graph GPG(K,N) • Red Subgraph GPTX(K,N), Blue Subgraph GPTY(K,N) • Mixed Subgraph GPGE(K,N,M) O(N2) O(K) O(KM)
Summary: History Reconstruction? • Mixed Subgraph models recombinations • Only fragments of the chromosome • In reality, only a minimal structure (HUD) of the GPGE(K,N,M) or ARG can be estimated • Forbidden structures …
RoadMap • Motivation • Reconstructability (Random Graph Framework) • Reconstruction Algorithm (DSR Algorithm) • Conclusion
L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns, Journal of Computational Biology, vol 15(9), pp 1-22, 2008
L Parida, A Javed, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Minimizing Recombinations in Consensus Networks for Phylogeographic Studies, BMC Bioinformatics, 2009
INPUT: Chromosomes (haplotypes) OUTPUT: Recombinational Landscape (Recotypes)
Our Approach • Granularity g (statistical) • Acceptable p-value? NO: back to granularity g; YES: continue • IRiS (combinatorial) • Analyze Results (statistical)
M Mele, A Javed, F Calafell, L Parida, J Bertranpetit and Genographic Consortium, Recombination-based genomics: a genetic variation analysis in human populations, under submission.
Preprocess: Dimension reduction via Clustering [figure: clustering diagram with numbered nodes]
Analysis Flow • Granularity g (statistical) • Acceptable p-value? NO: back to granularity g; YES: continue • IRiS (combinatorial) • Analyze Results (statistical)
IRiS (Identifying Recombinations in Sequences) • Stage Haplotypes: use SNP block patterns (biological insights) • Segment along the length: infer trees (computational insights) • Infer network (ARG)
L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium, Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns, Journal of Computational Biology, vol 15(9), pp 1-22, 2008
Segmentation [figure: positions along the sequence partitioned into consecutive segments labeled 1 to 5]