1 / 77

EECS 800 Research Seminar Mining Biological Data

EECS 800 Research Seminar Mining Biological Data. Instructor: Luke Huan Fall, 2006. Mining Sequence Patterns in Biological Data. A brief introduction to biology and b ioinformatics Alignment of biological sequences Hidden Markov model for biological sequence analysis Summary. DNA Structure.

romanjacobs
Download Presentation

EECS 800 Research Seminar Mining Biological Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. EECS 800 Research SeminarMining Biological Data Instructor: Luke Huan Fall, 2006

  2. Mining Sequence Patterns in Biological Data • A brief introduction to biology and bioinformatics • Alignment of biological sequences • Hidden Markov model for biological sequence analysis • Summary

  3. DNA Structure • DNA: helix-shaped molecule whose constituents are two parallel strands of four nucleotides: A, C, G, and T. • This assumes only one strand is considered; the second strand is always derivable from the first by • pairing A with T and • C with G • Nucleotides (bases) • Adenine (A) • Cytosine (C) • Guanine (G) • Thymine (T)

  4. Genome • Gene: Contiguous subparts of single strand DNA that are templates for producing proteins. • Chromosomes: compact chains of coiled DNA • Genome: The set of all genes in a given organism. • Noncoding part: The function of DNA material between genes is largely unknown. Source: www.mtsinai.on.ca/pdmg/Genetics/basic.htm

  5. Grand Challenges: Functional Genomics • The function of a protein is its role in the cellular environment. • Function is closely related to tertiary structure • Functional genomics: studies the function of all the proteins of a genome Source: fajerpc.magnet.fsu.edu/Education/2010/Lectures/26_DNA_Transcription.htm

  6. Grand Challenges: Systems Biology • A cell is made up of molecular components that can be viewed as 3D-structures of various shapes • In a living cell, the molecules interact with each other (w. shape and location). An important type of interaction involve catalysis (enzyme) that facilitate interaction. • The interactions of molecules are regulated temporally and spatially. • The overall interaction and the regulation of the molecules is studied by systems biology. Source: www.mtsinai.on.ca/pdmg/images/pairscolour.jpg Human Genome—23 pairs of chromosomes

  7. Grand Challenges: Evolution • Develop a detailed understanding of the heritable variation in the human genome • Understand evolutionary variation across species and the mechanisms underlying it

  8. Grand Challenges: from -omics to Health • Disease detection: • Monitoring the biological system at the molecular level • No intrusive • Fast, reliable, & cost-effective

  9. Grand Challenges: from -omics to Health • Develop novel medicines • Select drug target • Develop robust strategies for identifying the genetic contributions to disease and drug response • Use new understanding of genes and pathways to develop powerful new therapeutic approaches to disease • Drug screening • Drug delivery • Side effects and other safety issues

  10. Data Management & Analysis in Bioinformatics • Computational management and analysis of biological information • Interdisciplinary Field (Molecular Biology, Statistics, Computer Science, Genomics, Genetics, Databases, Chemistry, Radiology …) • Bioinformatics vs. computational biology (more on algorithm correctness, complexity and other themes central to theoretical CS)

  11. Data Mining & Bioinformatics: Why? • Biological data is abundant and information-rich • Genomics & proteomics data (sequences), microarray and protein-arrays, protein database (PDB), bio-testing data • Huge data banks, rich literature, openly accessible • Largest and richest scientific data sets in the world • Data Mining: transform raw datatoknowledge • Mining for correlations, linkages between disease and gene sequences, protein networks, classification, clustering, outliers, ... • Find correlations among linkages in literature and heterogeneous databases

  12. Data Integration in Bioinformatics • Data Integration: Handling heterogeneous bio-data • Build Web-based, interchangeable, integrated, multi-dimensional genome databases • Data cleaning and data integration methods becomes crucial • Mining correlated information across multiple databases itself becomes a data mining task • Typical studies: mining database structures, information extraction from data, reference reconciliation, document classification, clustering and correlation discovery algorithms, ...

  13. Data Mining & Bioinformatics: Current Picture • Pattern discovery • Comparative study of biomolecules and biological networks • Similarity search • Drug design • Clustering • gene expression profiles • Phylogeny tree construction • Classification • Visualization

  14. Future • Research and development of new tools for bioinformatics • Similarity search and comparison between classes of genes (e.g., diseased and healthy) by finding and comparing frequent patterns • New clustering and classification methods for micro-array data and protein-array data analysis • Path analysis: linking genes/proteins to different disease development stages • Develop pharmaceutical interventions that target the different stages separately • High-dimensional analysis and OLAP mining • Visualization tools and genetic/proteomic data analysis • Knowledge integration • You tell me!

  15. Mining Sequence Patterns in Biological Data • A brief introduction to biology and bioinformatics • Alignment of biological sequences • Hidden Markov model for biological sequence analysis • Summary

  16. Comparing Sequences • Sequences to be compared: either nucleotides (DNA/RNA) or amino acids (proteins) • Nucleotides: identical • Amino acids: identical, or if one can be derived from the other by substitutions that are likely to occur in nature • Alignment: Lining up sequences to achieve the maximal level of identity • Local—only portions of the sequences are aligned. • Global—align over the entire length of the sequences • Use gap “–” to indicate preferable not to align two symbols

  17. Comparing Sequences • Two sequences are homologous if they share a common ancestor • Percent identity: ratio between the number of columns containing identical symbols vs. the number of symbols in the longest sequence • Score of alignment: summing up the matches and counting gaps as negative HEAGAWGHEE Input sequences: PAWHEAE HEAGAWGHE-E One alignment --P-AW-HEAE HEAGAWGHE-E Another alignment P-A--W-HEAE

  18. Sequence Alignment: Problem Definition • Goal: • Given two or more input sequences • Identify similar sequences with long conserved subsequences • Method: • Use substitution matrices (probabilities of substitutions of nucleotides or amino-acids and probabilities of insertions and deletions) • Optimal multiple alignment problem: NP-hard • Heuristic method to find good alignments

  19. Edit Distance Model AGGCACA--CAAGGCACACA | |||| || or | || || A--CACATTCAACACATTCA • Minimal weighted sum of insertions, deletions & mutations required to transform one string into another Levenshtein 1966

  20. HEAGAWGHE-E --P-AW-HEAE Pair-wise Sequence Alignment: Scoring Matrix • insertion/Deletion (indel) penalty • Gap penalty: -8 • Gap extension: -8 Mutation penalty (-8) + (-8) + (-1) + 5 + 15 + (-8) + 10 + 6 + (-8) + 6 = 9 HEAGAWGHE-E Exercise: Calculate for P-A--W-HEAE

  21. Complete DNA Sequences nearly 200 complete genomes have been sequenced

  22. Whole Genome Similarity Map

  23. Map of Rearrangements

  24. Alignment of a Syntenic Area

  25. 84790 84800 84810 84820 84830 84840 seq1 -AGGACTTCTTCCTTTCACCATATACAAAAATCAACCCAAGATTGATACAATACTTTAAT |||| || ||||| ||||||||||||||| ||||| || | ||||||| seq2 CAGGATCGACGTCTCTCACCTTATACAAAAATCAACTGAAGATGCATCACAGACTTTAA- 13770 13780 13790 13800 13810 13820 84850 84860 84870 84880 84890 84900 seq1 GTAAAATAAAAAACTGTAAAACTCTAGAAGAAAATGTAGAAACA-----CCATCTGGACA ||| ||| | | || |||||||| ||| | |||| | || ||| |||| seq2 ---AAACTGAAACCATAACAATTCTAGAAGTAAACATTGAAAAAAAAACCCTTCTAGACA 13830 13840 13850 13860 13870 13880 84910 84920 84930 84940 84950 84960 seq1 TCAGCCTGGGCAGAGAATTTATGACTAAGTCCTCAAAAGTAATTTCAACAAAAATAAACA | || | |||| ||| || ||||| ||| |||||| | seq2 TTTGCTTAGGCAAAGACTTCATGACAAAGAATGCAAAAGCAG------------------ 13890 13900 13910 13920 Local Area of Similarity

  26. Formal Description • Problem:PairSeqAlign • Input: Two sequences x, y Scoring matrix s Gap penalty d Gap extension penalty e • Output: The optimal sequence alignment • Difficulty: If x, y are of size n then the number of possible  global alignments is

  27. HEAG --P- -25 HEAGAWGHE-E Add score from table --P-AW-HEAE HEAG- --P-A -33 HEAGA --P-A -20 HEAGA --P—- -33 Gap with bottom Top and bottom Gap with top Global Alignment: Needleman-Wunsch • Needleman-Wunsch Algorithm (1970) • Uses weights for the outmost edges that encourage the best overall (global) alignment • Idea: Build up optimal alignment from optimal alignments of subsequences

  28. Pair-Wise Sequence Alignment Alignment: F(0,0) – F(n,m) We can vary both the model and the alignment strategies

  29. yj aligned to gap F(i-1,j-1) F(i,j-1) s(xi,yj) d F(i-1,j) F(i,j) d xi aligned to gap While building the table, keep track of where optimal score came from, reverse arrows Global Alignment • Uses recursion to fill in intermediate results table • Uses O(nm) space and time • O(n2) algorithm • Feasible for moderate sized sequences, but not for aligning whole genomes.

  30. Dot Matrix Alignment Method • Dot Matrix Plot: Boolean matrices representing possible alignments that can be detected visually • Extremely simple but • O(n2) in time and space • Visual inspection

  31. Heuristic Alignment Algorithms • Motivation: Complexity of alignment algorithms: O(nm) • Current protein DB: 100 million base pairs • Matching each sequence with a 1,000 base pair query takes about 3 hours! • Heuristic algorithms aim at speeding up at the price of possibly missing the best scoring alignment • Two well known programs • BLAST: Basic Local Alignment Search Tool • FASTA: Fast Alignment Tool • Both find high scoring local alignments between a query sequence and a target database • Basic idea: first locate high-scoring short stretches and then extend them

  32. BLAST (Basic Local Alignment Search Tool) • Approach (BLAST) (Altschul et al. 1990, developed by NCBI) • View sequences as sequences of short words (k-tuple) • DNA: 11 bases, protein: 3 amino acids • Create hash table of neighborhood (closely-matching) words • Use statistics to set threshold for “closeness” • Start from exact matches to neighborhood words • Motivation • Good alignments should contain many close matches • Statistics can determine which matches are significant • Much more sensitive than % identity • Hashing can find matches in O(n) time • Extending matches in both directions finds alignment • Yields high-scoring/maximum segment pairs (HSP/MSP)

  33. BLAST (Basic Local Alignment Search Tool)

  34. Multiple Sequence Alignment • Alignment containing multiple DNA / protein sequences • Look for conserved regions → similar function • Example: #Rat ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT #Mouse ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT #Rabbit ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC #Human ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC #OppossumATGGTGCACTTGACTTTT---GAGGAGAAGAACTG #Chicken ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT #Frog ---ATGGGTTTGACAGCACATGATCGT---CAGCT

  35. Mining Sequence Patterns in Biological Data • A brief introduction to biology and bioinformatics • Alignment of biological sequences • Hidden Markov model for biological sequence analysis • Summary

  36. From Data to Model • Given a group of sequence, what is the underlying model for the group of sequences? • Markov models are simple but widely used model to describe a sequence family.

  37. A Markov Chain Model • Transition probabilities • Pr(xi=a|xi-1=g)=0.16 • Pr(xi=c|xi-1=g)=0.34 • Pr(xi=g|xi-1=g)=0.38 • Pr(xi=t|xi-1=g)=0.12 • Exercise • Can “acgt” be generated from the MC model? • How about “cagt”, “gcat”, “tcga”?

  38. Definition of Markov Chain Model • A Markov chain model is defined by • a set of states • some states emit symbols • other states (e.g., the begin state) are silent • a set of transitions with associatedprobabilities • the transitions emanating from a given state define a distribution over the possible next states

  39. Markov Chain Models: Properties • Given some sequence x of length L, we can ask the probability of the sequence with the given model • For any probabilistic model of sequences, we can write thisprobability as • key property of a (1st order) Markov chain: the probabilityof each xidepends only on the value of xi-1

  40. The Probability of a Sequence for a Markov Chain Model Pr(cggt)=Pr(c)Pr(g|c)Pr(g|g)Pr(t|g)

  41. Example Application • CpG islands • CG dinucleotides are rarer in eukaryotic genomes thanexpected given the marginal probabilities of C and G • but the regions upstream of genes are richer in CGdinucleotides than elsewhere – CpG islands • useful evidence for finding genes • Application: Predict CpG islands with Markov chains • one to represent CpG islands • one to represent the rest of the genome

  42. Markov Chains for Discrimination • Suppose we want to distinguish CpG islands from othersequence regions • Given sequences from CpG islands, and sequences fromother regions, we can construct • a model to represent CpG islands • a null model to represent the other regions • can then score a test sequence by:

  43. Markov Chains for Discrimination • Why use • According to Bayes’ rule • If we are not taking into account prior probabilities of twoclasses then we just need tocompare Pr(x|CpG) and Pr(x|null)

  44. Higher Order Markov Chains • The Markov property specifies that the probability of a statedepends only on the probability of the previous state • But we can build more “memory” into our states by using ahigher order Markov model • In an n-th order Markov model

  45. Selecting the Order of aMarkovChain Model • But the number of parameters we need to estimate growsexponentially with the order • for modeling DNA we need parameters for ann-th order model • The higher the order, the less reliable we can expect ourparameter estimates to be • estimating the parameters of a 2nd order Markov chainfrom the complete genome of E. Coli, we’d see eachword > 72,000 times on average • estimating the parameters of an 8-th order chain, we’dsee each word ~ 5 times on average

  46. Hidden Markov Model: A Simple HMM Given observed sequence AGGCT, which state emits every item?

  47. Hidden State • We’ll distinguish between the observed parts of a problemand the hidden parts • In the Markov models we’ve considered previously, it isclear which state accounts for each part of the observedsequence • In the model above, there are multiple states that couldaccount for each part of the observed sequence • this is the hidden part of the problem

  48. An HMM Example

  49. The Parameters of an HMM • Πi is a set of states • Transition Probabilities • Probability of transition from state k to state l • Emission Probabilities • Probability of emitting character b in state k

  50. Three Important Questions • How likely is a given sequence? • The Forward algorithm • What is the most probable “path” for generating a givensequence? • The Viterbi algorithm • How can we learn the HMM parameters given a set ofsequences? • The Forward-Backward (Baum-Welch) algorithm

More Related