750 likes | 948 Views
Optimal k -mer superstrings for protein identification and DNA assay design. Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park. k -mer (Sub-)Problems. Enumerate: For all (distinct) k -mers, do Existence:
E N D
Optimal k-mer superstrings for protein identification and DNA assay design. Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park
k-mer (Sub-)Problems • Enumerate: • For all (distinct) k-mers, do • Existence: • ...with respect to exact (& inexact) count ¸x • Uniqueness: • ...with respect to exact & inexact match • Near-neighbors: • ...with respect to inexact match • Representation: • Represent (distinct) k-mers for other tools • Fast annotation of k-mer counts on original sequences
Applications of k-mer sets • Peptide Identification • Represent all amino-acid 30-mers • ...that occur at least twice in human dbEST • PCR Primer Design: • Test DNA 20-mer primers for uniqueness • What does it mean to be unique? • DNA sequencing error / repeat detection • Eliminate mers that are too rare or too frequent • Pathogen signatures • Near-neighbors imply potential false-positives
k-mer Superstring Problem Given • A set of sequences S = { S1, ..., Sn } • Sequence database • Word size k Find • A new set of sequences T = { T1, ..., Tm } Such that • Total length of T is minimized, and • T is complete and correct w.r.t. k-mers of S
k-mer Superstring Problem • Completeness • All of the k-mers of S are represented • Correctness • No additional k-mers are present • Minimize the total representation length • Correlates with running time
Shortest (common) superstring problem • General strings (arbitrary length) • Single output string • Completeness for input sequences only • Classical NP-hard problem • Garey and Johnson • Approximate within ~ 2.5*OPT • Max-SNP hard • One of the first algorithmic approaches to genome assembly
de Bruijn Sequences de Bruijn sequences represent all words of length k from some alphabet A. A = {0,1}, k = 3: s = 0001110100 A = {0,1}, k = 4: s = 0000111101011001000
110 011 100 001 000 010 111 101 de Bruijn Graph: A = {0,1}, k = 4 1 1 0 1 0 1 1 0 1 0 1 0 1 0 0 0
de Bruijn Sequences & Graphs de Bruijn graphs (k,A): • Edges represent length k words from A • Each node has • in degree |A| • out degree |A| Eulerian tour constructs de Bruijn sequence.
Sequencing-by-Hybridization-graph ACDEFGI, ACDEFACG, DEFGEFGI
Compressed SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI
Sequence Databases & CSBH-graphs • Original sequences correspond to paths ACDEFGI, ACDEFACG, DEFGEFGI
C3 Enumeration • Complete • All k-mers are present • Correct • No other k-mers are present • Compact • No k-mer is present more than once
Correct, Complete, Compact (C3) Enumeration • Set of paths that use each edge exactly once ACDEFGEFGI, DEFACG
Correct, Complete (C2) Enumeration • Set of paths that use each edge at least once ACDEFGEFGI, DEFACG
Patching the CSBH-graph • Use artificial edges to fix unbalanced nodes
Patching the CSBH-graph • Use matching-style formulations to choose artificial edges • Optimal C2/C3 enumeration in polynomial time. • Chinese Postman Problem • Edmonds and Johnson, ’73 • l-tuple DNA sequencing • Pevzner, ’89 • Shortest (Common) Superstring • MAX-SNP-hard, 2.5 approx algorithm
Related work • Chinese Postman Problem • Undirected graph, weighted edges • Shortest path that uses all the edges • Solvable in polynomial time • Construct minimum weighted matching between nodes of odd-degree • Add matching to graph and find Eulerian path • Minimize weight of extra edges used
C2 Enumeration • Chinese postman problem, except: • Directed graph • Add edges from nodes with surplus in-degree to nodes with surplus out-degree • Fixed cost teleportation option • Can always “start” a new sequence • Find optimal set of additional edges • Transportation problem / min cost flow instance
-2 -1 -4 -1 -2 1 3 2 1 3 C3 Enumeration #in-#out #in-#out Cost: k
EHAC FHAC ACD HAC GHAC D Reusing Edges • ACDEHAC, ACDFHAC, ACDGHACD
$ACD EHAC ACD HAC FHAC GHAC D Reusing Edges • C3: ACDEHACDFHAC, ACDGHACD
D EHAC ACD HAC FHAC GHAC D Reusing Edges • C2: ACDEHACDFHACDGHAC
-4 -2 -1 -2 -1 3 2 1 1 3 C2 Enumeration #in-#out #in-#out 4 10 “Shortcut paths” 7
-4 -2 -1 -1 -2 0 0 3 1 2 3 1 C3 Enumeration #in-#out #in-#out Cost: 0 Cost: 0 Cost: k
Enzymatic Digest and Fractionation Sample Preparation for Peptide Identification
Single Stage MS MS m/z
Tandem Mass Spectrometry(MS/MS) m/z Precursor selection m/z
Tandem Mass Spectrometry(MS/MS) Precursor selection + collision induced dissociation (CID) m/z MS/MS m/z
Peptide Identification • For each (likely) peptide sequence 1. Compute fragment masses 2. Compare with spectrum 3. Retain those that match well • Peptide sequences from protein sequence databases • Swiss-Prot, IPI, NCBI’s nr, ... • Automated, high-throughput peptide identification in complex mixtures
Novel Splice Isoform • Human Jurkat leukemia cell-line • Lipid-raft extraction protocol, targeting T cells • von Haller, et al. MCP 2003. • LIME1 gene: • LCK interacting transmembrane adaptor 1 • LCK gene: • Leukocyte-specific protein tyrosine kinase • Proto-oncogene • Chromosomal aberration involving LCK in leukemias. • Multiple significant peptide identifications
Novel Mutation • HUPO Plasma Proteome Project • Pooled samples from 10 male & 10 female healthy Chinese subjects • Plasma/EDTA sample protocol • Li, et al. Proteomics 2005. (Lab 29) • TTR gene • Transthyretin (pre-albumin) • Defects in TTR are a cause of amyloidosis. • Familial amyloidotic polyneuropathy • late-onset, dominant inheritance
Novel Mutation Ala2→Pro associated with familial amyloid polyneuropathy
Compressed EST Peptide Sequence Database • For all ESTs mapped to a UniGene gene: • Six-frame translation • Eliminate ORFs < 30 amino-acids • Eliminate amino-acid 30-mers observed once • Compress to C2 FASTA database • Complete, Correct for amino-acid 30-mers • Inclusive gene-centric peptide sequence database: • Size: < 3% of naïve enumeration, 20774 FASTA entries • Running time: ~ 1% of naïve enumeration search • E-values: ~ 2% of naïve enumeration search results
Compressed EST Peptide Sequence Database • For all ESTs mapped to a UniGene gene: • Six-frame translation • Eliminate ORFs < 30 amino-acids • Eliminate amino-acid 30-mers observed once • Compress to C2 FASTA database • Complete, Correct for amino-acid 30-mers • Gene-centric peptide sequence database: • Size: < 3% of naïve enumeration, 20774 FASTA entries • Running time: ~ 1% of naïve enumeration search • E-values: ~ 2% of naïve enumeration search results
Sequence Databases & CSBH-graphs • All k-mers represented by an edge have the same count 1 2 2 1 2
CSBH-graph subgraphs • Quickly determine those that occur twice 2 2 1 2
k-mer (Sub-)Problems • Enumerate: • For all (distinct) k-mers, do • Existence: • ...with exact (& inexact) count ¸x • Uniqueness: • ...exact & inexact match • Near-neighbors: • ...inexact match • Representation: • Represent (distinct) k-mers for other tools • Fast annotation of k-mer counts on original sequences
Large scale instances! • CSBH-graph instances • Partition set of all k-mers, determine non-trivial nodes • Days on condor grid (250 CPUs) to construct • ¸ 100,000,000 nodes and edges (sparse & dense) • Min-cost flow instances • ¸ 500,000 nodes and edges • Algorithms must be linear in problem size • Out-of-core Eulerian path algorithm? • Currently testing out-of-core connected-components
Grid computing • Heterogeneous machines • Varying disk/memory/MHz/cores capabilities • Centralized scheduler • Jobs started asynchronously • Other jobs may preempt current job • Input files may need to be staged • 250 simultaneous requests for a 3Gb file? • How to guarantee integrity of input files? • Problem decomposition may be non-trivial • Jobs sizes need to fit the least capable machine • Sometimes need to “game” the scheduler • Need to ensure the integrity of job output
Uniqueness Oracles • Oracle for uniqueness of 20-mers in the Human genome (size: 3Gb) • Count occurrences in the genome: 0,1,2+ • Construct 20-mer superstring for 20-mers with count 1 • Construct 20-mer superstring for 20-mers with count > 1 • Easy(-ish) for exact sequence match: O(n) • Fast automata, hash tables, suffix trees.
Inexact sequence match • Inexact sequence matching O(n*m*k) • Errors/Mismatches (k): 1,2,3 • # distinct 20-mers (m): O(n) • Achieve expected linear time using a hybrid approach (blastn): • Exact search for short chunks of primers • Expensive alignment only where chunks match • Large chunks ) Fast, but miss occurrences • Small chunks ) Slow, find all matches
Inexact sequence match Baeza-Yates Perleberg: • Correct and O(n) for small k • At least 1 chunk is observed with no error. • Small k → Large chunks → Fast and correct • Form of locality sensitive hashing g ≠ = ≠ q
Locality Sensitive Hashing • For each primer: • store a (set of) hash(es) in hash-table • At each position in the genome: • look-up a (set of) hash(es) in hash-table • if any hash is found, do more expensive check • Need to weigh • sensitivity (false negatives) vs • specificity (false positives) • Our application requires speed andno false negatives!
Random Projection • Choose T templates of l random “care” positions g q