“Multiple indexes and multiple alignments” Presenting: Siddharth Jonathan Scribing: Susan Tang

10/19 “Multiple indexes and multiple alignments” Presenting: Siddharth Jonathan Scribing: Susan Tang DFLW: Neda Nategh
Upcoming:
10/24: “Evolution of Multidomain Proteins” Wissam Kazan; “Human Migrations” Anjalee Sujanani
10/26: “Comparison of Networks Across Species” Chuan Sheng Foo; “Repetitive DNA Detection and Classification” Vijay Krishnan

CS374 Algorithms in Biology Searching Biological Sequence Databases Siddharth Jonathan CS374 Presentation - Searching Biological Sequence Databases

Outline
• Background
• Problem
• Typhon Overview
• Typhon Components
• Results

Background
• Sequence Alignment
• Multiple Alignment Databases
• Probabilistic Profile
• Phylogenetic Tree

Sequence Alignment
• Identifying regions of similarity between sequences (genomes, proteins, etc.)
• Types
  • Global
  • Local
  • Seeded
  • Non-seeded
• Why is it important?
  • Comparative analysis of genomes
  • Producing phylogenetic trees
  • Understanding newly sequenced genomes

Seeds – A Review
A seed P is an ordered list of w positions, i.e. P = {x1, x2, …, xw}
w = weight of P = |P|
s = span of P = xw – x1 + 1
Ex: P = {0, 1, 4, 5}, so w = 4 and s = 5 – 0 + 1 = 6
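
The definitions above can be checked with a small sketch. The function names `seed_weight`, `seed_span` and `apply_seed` are illustrative only and do not come from the slides; the seed is assumed to be given as a list of offsets.

```python
def seed_weight(positions):
    """Weight w = number of positions the seed samples."""
    return len(positions)


def seed_span(positions):
    """Span s = last position - first position + 1."""
    ordered = sorted(positions)
    return ordered[-1] - ordered[0] + 1


def apply_seed(sequence, start, positions):
    """Characters the seed samples when it is placed at offset `start`."""
    return "".join(sequence[start + p] for p in positions)


P = [0, 1, 4, 5]                      # the example seed from the slide
print(seed_weight(P), seed_span(P))   # weight and span of P
print(apply_seed("GATTACCA", 0, P))   # characters sampled at offset 0
```

A contiguous seed such as {0, 1, 2, 3} has span equal to its weight; a spaced seed such as P covers a longer window with the same number of sampled characters.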

Indexing in Seeded Local Alignment algorithms
Gene Sequence S: …G A T T A C C A G A T T A C C A G A T T A …
Seed A = {0,1,2,3}
Index entries: GATT → (S,0); ATTA → (S,1)
The average number of seeds indexed per position is called the Budget
The same idea holds for non-contiguous seeds as well!
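
A minimal sketch of this kind of index, assuming every candidate seed is indexed at every position (the standard scheme, not Typhon's selective one). The function `build_index` and the dictionary layout are assumptions made for illustration; seed offsets are assumed to start at 0.

```python
from collections import defaultdict


def build_index(name, sequence, seeds):
    """Map (seed id, sampled characters) -> list of (sequence name, offset)."""
    index = defaultdict(list)
    for seed_id, positions in seeds.items():
        reach = max(positions) + 1          # window length the seed needs
        for start in range(len(sequence) - reach + 1):
            word = "".join(sequence[start + p] for p in positions)
            index[(seed_id, word)].append((name, start))
    return index


S = "GATTACCAGATTA"
index = build_index("S", S, {"A": [0, 1, 2, 3]})
print(index[("A", "GATT")])   # offsets in S where seed A samples GATT
print(index[("A", "ATTA")])   # offsets in S where seed A samples ATTA
```

With one seed indexed at every position the Budget is 1; indexing k seeds everywhere gives a Budget of k, which is the cost Typhon tries to spend more selectively.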

Seeded Local Alignment Algorithms
• BLAST
• BLAT
• BLASTZ
• Exonerate
• Usage of multiple seeds, spaced seeds
• What do they have in common?
  • Indexing!

Multiple alignment (figure labels: Species 1, Species 2)

Phylogenetic Tree

Probabilistic Profile
Each cell corresponds to one position in the alignment… We’ll learn what information it carries very shortly!

Regions

The Problem
Say we have a database of multiple alignments
Candidate seeds
Find local alignments for the query
So what’s the challenge?

The Problem Statement
Budget
Can we do better?
Make use of the information implicit in the multiple alignment when selecting which seeds to index at a given position

The Problem Statement - Typhon
Given:
• Budget
• Candidate Seeds
• Probabilistic Profile
Find: an indexing scheme that indexes only a subset of the candidate seeds at each position

Overall Architecture of Typhon

Step 1: Probabilistic Profile Construction
• 6-tuple for each position in the multiple alignment
  • PPresent – existence probability
  • PA, PC, PT, PG – conditional probability that the homologous position contains A, C, T, G, given that a homologous position exists. The nucleotide with the highest such value is called the consensus character
  • Pid – probability that the corresponding query position has the consensus character

Calculation of Probabilistic Profile
(Figure: alignment columns for Human, Chimp, Rat and Pig at the leaves of the phylogenetic tree; example column: PPresent=100% PA=75% PC=25% PG=0% PT=0%)
Propagation of values up the tree to the root is a tricky problem!

Calculating probabilistic profile
• PPresent and PN calculated independently
• PPresent: weighted average of the children’s PPresent values
  • Weights proportional to the inverse of the branch length
• PN calculated through Felsenstein’s algorithm with a Kimura matrix
• Pid = max(PN) (this is calculated at the root)
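
The PPresent rule above can be sketched as a recursion over the tree. This is only an illustration under stated assumptions: the tree encoding (a dict with a `children` list of `(branch_length, child)` pairs), the leaf values (1.0 for a character, 0.0 for a gap) and the branch lengths are all made up here, and the PN calculation with Felsenstein's algorithm is not shown.

```python
def p_present(node):
    """Existence probability at `node`, propagated up from the leaves."""
    children = node.get("children")
    if not children:
        # Leaf (assumption): 1.0 if the species has a character here, 0.0 for a gap.
        return 1.0 if node["present"] else 0.0
    # Weights are proportional to the inverse of each child's branch length.
    weights = [1.0 / branch_length for branch_length, _ in children]
    values = [p_present(child) for _, child in children]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical three-species column: human has a base, chimp a gap, rat a base.
human = {"present": True}
chimp = {"present": False}
rat = {"present": True}
primates = {"children": [(0.1, human), (0.1, chimp)]}
root = {"children": [(0.2, primates), (0.4, rat)]}
print(round(p_present(root), 3))
```

Because the weights use inverse branch length, the closer subtree (here the primates) pulls the root value toward its own PPresent more strongly than the distant rat leaf does.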

Region Decomposition
ATTGGAACCCAGGCCA----AATT-GCGCC-----AA-TT------G----C-----ATGG-G-----ATGCCCAAAAAAT
ATTGGAACTCAGGCCA----AATT--CGCC-----AA-T-------G----C-----AT--G------ATGCCCATAAAAT
ATTGGAACCCAGGCCA----AATT-CG--C-----A-TT-------G----T-----A-GGG------ATGCCCAAAAAAT
ATTGGAACCCAGGCCA----A-TTGC-G-C-----AAT-T------G-----C----ATGGGG-----ATGCCCATAAAAT
Region classes along the alignment: 1 2 3 2 1
Each region is characterized by a PPresent and a Pid
How do we come up with these regions?

Hidden Markov Models (HMM)
Given an observation sequence, predict the sequence of hidden states

Region Decomposition – Simple Method
• Come up with a set of region classes (states)
• Construct an HMM
• Looking at the observation sequence, try to determine the most likely parse
  • Viterbi algorithm
• Problem – Need to determine classes at the beginning
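
The most likely parse can be found with the Viterbi algorithm, sketched below in log space. The two region classes ("conserved", "gappy"), the discretized observations ("H" for a high PPresent column, "L" for a low one) and all probabilities are hypothetical values chosen for illustration; they are not the parameters used by Typhon.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for the observation sequence."""
    # best[t][s] = log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return path[::-1]


states = ("conserved", "gappy")
start_p = {"conserved": 0.5, "gappy": 0.5}
trans_p = {"conserved": {"conserved": 0.9, "gappy": 0.1},
           "gappy": {"conserved": 0.1, "gappy": 0.9}}
emit_p = {"conserved": {"H": 0.9, "L": 0.1},
          "gappy": {"H": 0.2, "L": 0.8}}
path = viterbi("HHHLLLHH", states, start_p, trans_p, emit_p)
print(path)
```

Consecutive positions with the same state form one region, so the parse of "HHHLLLHH" yields three regions drawn from two region classes.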

Alternative
• Split the profile into 2 classes at a time
• Use 2 stage HMM
• Repeat until the bound on the number of region classes is reached

Region Decomposition with HMM

Step 3: Seed Indexing
What are we trying to do?
(Figure: regions labeled 1 2 1 3, each assigned a subset of the candidate seeds A, B, C, D, E)

The Goal
• Maximize the expected number of regions matched to a homologue

Seed Assignment
• 2 Approaches:
  • General Method
  • Greedy Approximation

General Method - Terminology
• i: size of the candidate set
• j: region class
• Object[i][j]

Calculation of number of matching regions (done for each cell in the previous table)
Expected number of matching regions = |C| × PPresent × Phit
• |C| – number of regions
• PPresent – probability that a region matches a homologue
• Phit – conditional probability that the seeds match the region and its homologue, given that it exists

General Method - Explained
(Table: rows are Region Classes 1 to 4, columns are the Number of Candidate Seeds 1 to 5)

Some Terminology
• Weight
  • Total length of all regions in a region class × number of seeds indexed at each position
  • Sort of like the Budget for a region
• Value
  • Expected number of regions matched (previous calculation)
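
Weight and Value for one cell of the table can be computed directly from these definitions. In this sketch `object_weight` and `object_value` are illustrative names, Phit is taken as a given input (the slides do not say how it is derived), and the numbers are made up.

```python
def object_weight(total_region_length, seeds_per_position):
    """Index space used: total length of the class's regions x seeds per position."""
    return total_region_length * seeds_per_position


def object_value(num_regions, p_present, p_hit):
    """Expected number of regions matched: |C| x PPresent x Phit."""
    return num_regions * p_present * p_hit


# Hypothetical region class: 40 regions covering 500 positions, 2 seeds per position.
weight = object_weight(500, 2)
value = object_value(40, 0.9, 0.5)
print(weight, value, value / weight)   # weight, value and value per unit of budget
```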

Solving the Seed Assignment Problem
(Table: rows are Region Classes 1 to 4, columns are the Number of Candidate Seeds 1 to 5)

Solving the Seed Assignment Problem
Budget = 112
(Table: rows are Region Classes 1 to 4, columns are the Number of Candidate Seeds 1 to 5)

Looks Familiar?
• Closely related to the Knapsack Problem, a well-studied problem in Computer Science
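
One way to read the table is as a multiple-choice knapsack: pick at most one object per region class so that the total Weight stays within the Budget and the total Value is maximal. The dynamic program below is a sketch of that textbook formulation, not necessarily the exact procedure in Typhon; it assumes integer weights, and `assign_seeds`, the `objects[j][i]` layout and the numbers are hypothetical.

```python
def assign_seeds(objects, budget):
    """objects[j] = list of (weight, value) options for region class j.

    Returns (best total value, chosen option index per class or None).
    """
    best = [0.0] * (budget + 1)          # best[b]: max value with budget at most b
    choices = []
    for options in objects:
        new_best = best[:]               # default: index nothing for this class
        picked = [None] * (budget + 1)
        for b in range(budget + 1):
            for i, (w, v) in enumerate(options):
                if w <= b and best[b - w] + v > new_best[b]:
                    new_best[b] = best[b - w] + v
                    picked[b] = i
        best = new_best
        choices.append(picked)
    # Walk back through the classes to recover which option was chosen.
    selection, b = [None] * len(objects), budget
    for j in range(len(objects) - 1, -1, -1):
        i = choices[j][b]
        selection[j] = i
        if i is not None:
            b -= objects[j][i][0]
    return best[budget], selection


objects = [[(2, 3.0), (4, 5.0)],         # region class 0: two candidate-set sizes
           [(3, 4.0), (6, 9.0)]]         # region class 1
total, selection = assign_seeds(objects, budget=8)
print(total, selection)
```

The table has one row per region class and one column per budget value, so time and space grow with the Budget, which is what motivates a faster approximation.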

Approximate Solution
• Faster
• Space efficient
• New terminology:
  • Density of an object = Value/Weight

Approximate Solution – General Intuition
• Select objects in order of decreasing density
• Disallow more than one object per row
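
A minimal sketch of this greedy intuition, assuming each row is a region class and each object is a (weight, value) pair. The function `greedy_assign` and the example numbers are hypothetical, and the result is an approximation that is not guaranteed to be optimal.

```python
def greedy_assign(objects, budget):
    """objects[j] = list of (weight, value) options for region class (row) j."""
    # Rank every object by density = value / weight, highest first.
    ranked = sorted(
        ((v / w, j, i, w, v)
         for j, options in enumerate(objects)
         for i, (w, v) in enumerate(options)),
        reverse=True,
    )
    selection, remaining, total = {}, budget, 0.0
    for density, j, i, w, v in ranked:
        if j in selection or w > remaining:
            continue                      # row already used, or does not fit
        selection[j] = i
        remaining -= w                    # keep track of the Budget
        total += v
    return total, selection


objects = [[(2, 3.0), (4, 5.0)],          # region class 0
           [(3, 4.0), (6, 10.0)]]         # region class 1
total, selection = greedy_assign(objects, budget=8)
print(total, selection)
```

Sorting dominates the cost, and no table indexed by budget is needed, which is why this version is faster and more space efficient than the exact method.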

Approximate Method in Action
Candidate Set
What are the new values of Weight, Value and Density?
• Value = additional number of regions matched
• Weight = amount of budget used by this one seed
Object[1,1] Density=V/W=3
Object[2,1] Density=V/W=2
Object[3,1] Density=V/W=5
Object[3,2] Density=V/W=6
Object[4,1] Density=V/W=4
And keep track of the Budget!

Results
• Considerations
  • Sensitivity
  • Speed
  • Space

Sensitivity Results
• Experimental Setup
  • Detection of Hypothetical Homologous Alignments (HHA)
  • Typhon vs. STANDARD

Sensitivity Comparison

Effect of Multiple Alignment on Sensitivity

Running time Comparison
• Time spent building the index
  • Typhon takes longer
• Time spent scanning the index
  • Typhon is 3-4 times slower at run time, which is reasonable

Scanning time

Conclusion
• Information implicit in multiple alignments helps search sensitivity
• Variable allocation of seeds by region class helps (Typhon)
• Space and time complexities of Typhon are comparable to STANDARD
• Most effective for queries far from each species in the alignment

Questions?

Acknowledgements
• Serafim Batzoglou, George Asimenos, Jason Flannick
