10/19: “Multiple indexes and multiple alignments” Presenting: Siddharth Jonathan. Scribing: Susan Tang. DFLW: Neda Nategh. Upcoming: 10/24: “Evolution of Multidomain Proteins” (Wissam Kazan), “Human Migrations” (Anjalee Sujanani). 10/26: “Comparison of Networks Across Species” (Chuan Sheng Foo), “Repetitive DNA Detection and Classification” (Vijay Krishnan).
CS374: Algorithms in Biology. Searching Biological Sequence Databases. Siddharth Jonathan. CS374 Presentation - Searching Biological Sequence Databases
Outline • Background • Problem • Typhon Overview • Typhon Components • Results
Background • Sequence Alignment • Multiple Alignment Databases • Probabilistic Profile • Phylogenetic Tree
Sequence Alignment • Identifying regions of similarity in the genome, proteins etc. • Types • Global • Local • Seeded • Non-seeded • Why is it important? • Comparative analysis of genomes • Producing Phylogenetic trees • Understanding newly sequenced genomes
Seeds – A Review. A seed P is an ordered list of w positions, i.e. P = {x1, x2, …, xw}. w = weight of P = |P|. s = span of P = xw – x1 + 1. Ex: P = {0, 1, 4, 5}, w = 4, s = 5 – 0 + 1 = 6
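The weight and span definitions above can be checked with a few lines of Python. This is a minimal sketch; the function names `seed_weight` and `seed_span` are illustrative and do not come from the slides.

```python
def seed_weight(seed):
    """Weight w of a seed: the number of positions it samples."""
    return len(seed)


def seed_span(seed):
    """Span s of a seed: last position minus first position, plus one."""
    return max(seed) - min(seed) + 1


P = [0, 1, 4, 5]  # the example seed from the slide
print(seed_weight(P), seed_span(P))
```

For the slide's example seed this gives a weight of 4 and a span of 6.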
Indexing in Seeded Local Alignment algorithms. Gene Sequence S: …G A T T A C C A G A T T A C C A G A T T A … Seed A = {0,1,2,3}. Index entries: GATT → (S,0), ATTA → (S,1). Average number of seeds indexed per position is called the Budget. The same idea holds for non-contiguous seeds as well!
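The indexing idea can be sketched as a hash table from sampled characters to (sequence, position) pairs. This is an illustrative sketch only: `build_index` is a hypothetical helper, it assumes seed offsets start at 0, and it indexes every position with a single seed (a budget of 1), which is not how Typhon chooses seeds.

```python
from collections import defaultdict


def build_index(name, sequence, seed):
    """Map each seed-sampled word to the (sequence name, start) pairs where it occurs."""
    span = max(seed) - min(seed) + 1
    index = defaultdict(list)
    for start in range(len(sequence) - span + 1):
        # Read only the characters at the seed's offsets from this start.
        key = "".join(sequence[start + offset] for offset in seed)
        index[key].append((name, start))
    return index


S = "GATTACCAGATTACCAGATTA"
contiguous = build_index("S", S, [0, 1, 2, 3])  # Seed A from the slide
spaced = build_index("S", S, [0, 1, 4, 5])      # a non-contiguous seed
print(contiguous["GATT"], contiguous["ATTA"])
print(spaced["GAAC"])
```

The same function handles contiguous and spaced seeds because the only thing that changes is which offsets are read.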
Seeded Local Alignment Algorithms • BLAST • BLAT • BLASTZ • Exonerate • Usage of multiple seeds, spaced seeds • What do they have in common? • Indexing!
Multiple alignment (figure labels: Species 1, Species 2)
Phylogenetic Tree
Probabilistic Profile. Each cell corresponds to one position in the alignment… We’ll learn what information it carries very shortly!
Regions
The Problem. Say we have a database of multiple alignments and a set of candidate seeds. Find local alignments for the query. So what’s the challenge?
The Problem Statement. Given a budget, can we do better? Make use of the information implicit in the multiple alignment to select which seeds to index at a given position.
The Problem Statement - Typhon. Given: • Budget • Candidate Seeds • Probabilistic Profile. Output: an indexing scheme that indexes only a subset of the candidate seeds at each position.
Overall Architecture of Typhon
Step 1: Probabilistic Profile Construction
• A 6-tuple for each position in the multiple alignment:
• Ppresent – existence probability
• PA, PC, PT, PG – conditional probability that the homologous position contains A, C, T or G, given that a homologous position exists. The nucleotide with the highest such value is called the consensus character.
• Pid – probability that the corresponding query position has the consensus character
Calculation of Probabilistic Profile. [Figure: phylogenetic tree over Human, Chimp, Rat and Pig with an example alignment column.] PPresent = 100%, PA = 75%, PC = 25%, PG = 0%, PT = 0%. Propagation of values up the tree to the root is a tricky problem!
Calculating probabilistic profile • PPresent and PN calculated independently • PPresent: weighted average of the children’s PPresent values • Weights proportional to the inverse of the branch length • PN calculated through Felsenstein’s algorithm with a Kimura matrix • Pid = max(PN) (this is calculated at the root)
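The PPresent rule and the Pid rule can be sketched in Python as follows. This covers only the weighted average and the final max; the Felsenstein/Kimura computation of PN is not shown. The function names, the example branch lengths, and the convention that a leaf has PPresent 1 for a nucleotide and 0 for a gap are assumptions made for illustration.

```python
def combine_p_present(children):
    """Weighted average of the children's PPresent values.

    children: list of (p_present, branch_length) pairs, with branch_length > 0.
    Each weight is proportional to the inverse of the branch length.
    """
    weights = [1.0 / length for _, length in children]
    total = sum(weights)
    return sum(w * p for w, (p, _) in zip(weights, children)) / total


def p_id(p_nucleotides):
    """Return the consensus character and Pid = max(PN) from the root's PN values."""
    consensus = max(p_nucleotides, key=p_nucleotides.get)
    return consensus, p_nucleotides[consensus]


# Hypothetical node: one child has the position (short branch), the other has a gap.
print(combine_p_present([(1.0, 0.1), (0.0, 0.3)]))
print(p_id({"A": 0.75, "C": 0.25, "G": 0.0, "T": 0.0}))
```

With inverse-branch-length weights, the closer child dominates: in the example the present child on the short branch pulls the parent's value to about 0.75.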
Overall Architecture of Typhon
Region Decomposition
ATTGGAACCCAGGCCA----AATT-GCGCC-----AA-TT------G----C-----ATGG-G-----ATGCCCAAAAAAT
ATTGGAACTCAGGCCA----AATT--CGCC-----AA-T-------G----C-----AT--G------ATGCCCATAAAAT
ATTGGAACCCAGGCCA----AATT-CG--C-----A-TT-------G----T-----A-GGG------ATGCCCAAAAAAT
ATTGGAACCCAGGCCA----A-TTGC-G-C-----AAT-T------G-----C----ATGGGG-----ATGCCCATAAAAT
Region labels (left to right): 1 2 3 2 1
Each region is characterized by a PPresent and a Pid. How do we come up with these regions?
Hidden Markov Models (HMM). Given an observation sequence, predict the sequence of hidden states.
Region Decomposition – Simple Method • Come up with a set of region classes (states) • Construct an HMM • Looking at the observation sequence, try to determine the most likely parse • Viterbi algorithm • Problem – Need to determine classes at the beginning
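The "most likely parse" step is the standard Viterbi algorithm, sketched here in Python in log space. The two states, the "hi"/"lo" observation alphabet and all probabilities are made-up illustrative values, not Typhon's actual model, which works on profile values rather than two discrete symbols.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for a discrete HMM (log space)."""
    # best[t][s] = log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]


states = ["conserved", "variable"]  # hypothetical region classes
start = {"conserved": 0.5, "variable": 0.5}
trans = {"conserved": {"conserved": 0.9, "variable": 0.1},
         "variable": {"conserved": 0.1, "variable": 0.9}}
emit = {"conserved": {"hi": 0.9, "lo": 0.1},
        "variable": {"hi": 0.2, "lo": 0.8}}
print(viterbi(["hi"] * 4 + ["lo"] * 4, states, start, trans, emit))
```

Consecutive positions that receive the same state form one region, which is how a state path turns into a region decomposition.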
Alternative • Split the Profile into 2 classes at a time • Use 2 stage HMM • Stop when the bound on the number of region classes is reached
Region Decomposition with HMM
Overall Architecture of Typhon
Step 3: Seed Indexing. What are we trying to do? [Figure: a sequence of regions labelled 1, 2, 1, 3, each assigned a subset of the candidate seeds A, B, C, D, E.]
The Goal • Maximize expected number of regions matched to a homologue
Seed Assignment • 2 Approaches: • General Method • Greedy Approximation
General Method - Terminology Size of the candidate set i Region Classes j Object[i][j]
Calculation of number of matching regions (done for each cell in the previous table)
Expected number of matching regions = |C| × PPresent × Phit
• |C| – number of regions
• PPresent – probability that a region matches a homologue
• Phit – conditional probability that the seeds match the region and its homologue, given that it exists
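As a quick arithmetic check of the product above, here is a small Python sketch. The function name and the numbers are hypothetical and only illustrate how one table cell would be filled in.

```python
def expected_matched_regions(num_regions, p_present, p_hit):
    """Expected number of regions matched: |C| * PPresent * Phit."""
    return num_regions * p_present * p_hit


# Hypothetical region class: 40 regions, homologue exists with probability 0.9,
# and the chosen seeds hit an existing homologue with probability 0.5.
print(expected_matched_regions(40, 0.9, 0.5))
```

With these made-up inputs the class contributes about 18 expected matched regions.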
General Method - Explained. [Table: rows are Region Classes 1 to 4, columns are Number of Candidate Seeds 1 to 5.]
Some Terminology • Weight: total length of all regions in a region class × number of seeds indexed at each position (sort of like the Budget for a region) • Value: expected number of regions matched (previous calculation)
Solving the Seed Assignment Problem. Budget = 112. [Table: rows are Region Classes 1 to 4, columns are Number of Candidate Seeds 1 to 5.]
Looks Familiar? • Closely related to the Knapsack Problem, a well-studied problem in Computer Science
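One way to see the knapsack connection is a dynamic program in which each region class contributes at most one object. This is a sketch of that textbook formulation under the assumption of integer weights; the function name and the example table are invented, and it is not necessarily the exact procedure Typhon uses.

```python
def assign_seeds(table, budget):
    """Choose at most one (weight, value) option per region class within the budget.

    table[r] lists the options for region class r; option k stands for
    "index k + 1 seeds per position in class r". Weights are integers.
    Returns (best total value, chosen option index per class, None = no seeds).
    """
    # best[b] = (value, choices) achievable with at most b units of budget
    best = [(0.0, [])] * (budget + 1)
    for options in table:
        new_best = []
        for b in range(budget + 1):
            value, choices = best[b]
            candidate = (value, choices + [None])  # index nothing for this class
            for k, (weight, gain) in enumerate(options):
                if weight <= b:
                    prev_value, prev_choices = best[b - weight]
                    if prev_value + gain > candidate[0]:
                        candidate = (prev_value + gain, prev_choices + [k])
            new_best.append(candidate)
        best = new_best
    return best[budget]


# Hypothetical table: two region classes, two options each, as (weight, value).
table = [[(10, 6.0), (20, 9.0)],
         [(5, 4.0), (10, 5.0)]]
print(assign_seeds(table, 25))
```

In this example the best choice spends the whole budget of 25 on option 1 of the first class and option 0 of the second. The table has one entry per budget unit, so time and space grow with the budget, which is the motivation for a faster approximation.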
Approximate Solution • Faster • Space Efficient • New Terminology: • Density of an object = Value/Weight
Approximate Solution – General Intuition • Select objects in order of decreasing density • Disallow more than one object per row
Approximate Method in Action
Candidate Set:
• Object[1,1] Density = V/W = 3
• Object[2,1] Density = V/W = 2
• Object[3,1] Density = V/W = 5
• Object[3,2] Density = V/W = 6
• Object[4,1] Density = V/W = 4
What are the new values of Weight, Value and Density?
• Value = additional number of regions matched
• Weight = amount of budget used by this one seed
And keep track of the Budget!
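One possible reading of the greedy method, sketched in Python: each region class offers its next seed as an object whose value is the additional regions matched and whose weight is the budget that one seed uses, and the densest affordable object is taken first. The function name, the data layout and the example numbers are assumptions for illustration; the slides do not spell out the exact procedure.

```python
import heapq


def greedy_assign(increments, budget):
    """Greedy seed assignment by decreasing density (value / weight).

    increments[r] lists (weight, extra_value) for the 1st, 2nd, ... seed added
    to region class r; weights must be positive.
    Returns (seeds assigned per class, unused budget).
    """
    seeds_per_class = [0] * len(increments)
    heap = []  # entries are (-density, class), so the densest object pops first
    for r, steps in enumerate(increments):
        if steps:
            weight, gain = steps[0]
            heapq.heappush(heap, (-gain / weight, r))
    remaining = budget
    while heap:
        _, r = heapq.heappop(heap)
        k = seeds_per_class[r]
        weight, gain = increments[r][k]
        if weight > remaining:
            continue  # the next seed of this class does not fit; drop the class
        remaining -= weight  # keep track of the budget
        seeds_per_class[r] = k + 1
        if k + 1 < len(increments[r]):
            next_weight, next_gain = increments[r][k + 1]
            heapq.heappush(heap, (-next_gain / next_weight, r))
    return seeds_per_class, remaining


# Hypothetical increments with first-seed densities 3, 2 and 5.
increments = [[(10, 30.0), (10, 10.0)],
              [(20, 40.0)],
              [(5, 25.0), (5, 30.0)]]
print(greedy_assign(increments, 30))
```

Only one object per class is in the heap at any time, which is one way to honour the "no more than one object per row" rule while still letting a class receive several seeds over time.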
Results • Considerations • Sensitivity • Speed • Space
Sensitivity Results • Experimental Setup • Detection of Hypothetical Homologous Alignments (HHA) • Typhon Vs Standard
Sensitivity Comparison
Effect of Multiple Alignment on Sensitivity
Running time Comparison • Time spent building the index • Typhon takes longer • Time spent scanning the index • Typhon 3-4 times slower at run time which is reasonable
Scanning time
Conclusion • Information implicit in Multiple Alignments helps search sensitivity • Variable allocation of seeds by region classes helps (Typhon) • Space and time complexities of Typhon comparable to STANDARD • Most effective for queries far from each species in the alignment
Questions?
Acknowledgements • Serafim Batzoglou, George Asimenos, Jason Flannick