1 / 77

Introduction on biological sequence indexing, searching and text mining

Introduction on biological sequence indexing, searching and text mining. 序列索引、搜尋及文字探勘之簡介 王世融. Outlines. Introduction on sequence searching Patten Searching -Suffix tree BLAST Introduction on text mining -Case study. Bioinformatics topics.

arista
Download Presentation

Introduction on biological sequence indexing, searching and text mining

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introduction on biologicalsequence indexing,searching and text mining 序列索引、搜尋及文字探勘之簡介 王世融

  2. Outlines • Introduction on sequence searching • Patten Searching -Suffix tree • BLAST • Introduction on text mining -Case study

  3. Bioinformaticstopics

  4. Taking a glance on bioinformatics--Information Retrieval • Indexing Designing a spelling checker absurb absorb (mistyped) (really wanted) How fast can the system retrieve all potential candidates? absurb, absorb… What if he mistyped this as cbsorb

  5. Indexing • Create a database that lists all words containing b boy, by, absorb, … i I, ill, illustrate, … n next, net no, … d dance, drive, dog, … • Which words contain ”b,i,n,d”? • When someone typed ”bond” Which words contain “b,o,n,d” or at least 3 out of these 4 letters? • dictionary:a simple search engine for lexicons

  6. What is the relation? Treating Genomic/Proteomic dataas aLanguage

  7. Decoding an unknown language • For proteomic data: Amino acid motif protein Alphabet word sentence Protein structure Sentence meaning • Finding the interrelationships of data Data Mining, Knowledge Discovery

  8. Pattern Matching—Sequence indexing&searching • Hash Tables • Repeat Finding • Exact Pattern Matching • Keyword Trees • Suffix Trees • Approximates String Matching

  9. Why Pattern Matching? • Pattern matching has important applications in Bioinformatics: -Finding repeats in genomes -Finding unique sequence in genomes -Searching databases for known and similar nucleotide sequences

  10. Genomic Repeats • Example of repeats: ATGGTCTAGGTCCTAGTGGTC • Motivation to find them: -Genomic rearrangement, duplications, and deletions are often associated with repeats -Trace evolutionary secrets -Many cancers are characterized by an explosion of repeats

  11. k-mer Repeats • Long repeats are difficult to find • Shot repeats are easy to find (e.g., hashing) • Simple approach to finding long repeats: Find exact repeats of short k-mers( is usually 10 to 13) Use k-mer repeats to potentially extend into longer, maximal repeats Repeat-(shorten) -> repeat Uniqueness-(extended)-> Uniqueness GCTTACAGATTCAGTCTTACAGATGGT The 4-mer TTAC starts at locations 3 and 17

  12. Extending k-mer Repeats GCTTACAGATTCAGTCTTACAGATGGT • Extend these 4-mer matches: GCTTACAGATTCAGTCTTACAGATGGT • Maximal repeat: TTACAGAT -To find maximal repeats in this way, we need ALL start locations of all k-mers in the genome - Hashing lets us find repeats quickly in this manner

  13. Hashing: What is it? • What does hashing do? For different data, generate unique integer Store data in an array at the unique integer index generated from the data • Hashing is a very efficient way to Store and Retrieve data

  14. Hashing: Example • Where do the animals eat? • Records: each animal • Key: where each animal eats

  15. Hashing DNA sequences • Each k-mer can be translated into a binary string (A, T, C, G can be represented as 00,01,10,11) • After assigning a unique integer per k-mer it is easy to get all start locations of each k-mer in a genome

  16. Hashing: Maximal Repeats • To find repeats in a genome: For all k-mers in the genome, note the start position and the sequence Generate a hash table index for each unique k-mer sequence In each index of the hash table, store all genome start locations of the k-mer which generates that index • Extend k-mer repeats to maximal repeats

  17. Hashing: Collisions • Dealing with collisions: “Chain” all start location of k-mers (linked list)

  18. Pattern Matching • What if, instead of finding repeats in a genome, we want to find sequences in a database that contain a given pattern? • This leads us to a different problem, the Pattern Matching Problem

  19. Pattern Matching algorithm for: Pattern GCAT Text CGCATC GCAT CGCATC GCAT CGCATC GCAT CGCATC GCAT CGCATC GCAT CGCATC Exact Pattern Matching: An Example

  20. Exact Pattern Matching: Running Time • Pattern Matching runtime: O(nm) • Probability-wise, it is less than O(nm) Better solution: suffix trees -Can build text in O(m) time find pattern length n in O(n) time -Conceptually related to keyword trees

  21. Keyword Trees Properties of keyword trees: • Store a set of keyword in a rooted labeled tree • Each edge labeled with a letter form an alphabet • Every keyword stored can be spelled on a path from root to some leaf

  22. Keyword Trees: Threading • To match patterns in a text using a keyword tree: -Build keyword tree of patterns -“Thread” the text through the keyword tree

  23. Keyword Tree: Threading (cont’d) • Threading is “complete” when we reach a leaf in the keyword tree • When threading is “complete,” we’ve found a pattern in the text

  24. Keyword Tree: Threading (cont’d) • Thread “appeal” appeal

  25. Keyword Tree: Threading (cont’d) • Thread “apple” apple

  26. Suffix Trees • Similar to keyword trees, except edges are condensed - All internal edges have at least two outgoing edges - Leaves labeled with suffix start position - Suffix trees build faster than keyword trees

  27. Use of Suffix Trees • Suffix trees hold all suffixes of a text i.e., ATCGC: ATCGC, TCGC, CGC, GC, C Builds in O(m) time for text of length m • To find any pattern of length n in a text: Build suffix tree for text Thread the pattern through the suffix tree Can find pattern in text in O(n) time! • O(n+m) time for “Pattern Matching Problem” Build suffix tree and lookup pattern

  28. Suffix Trees: Example

  29. Multiple Pattern Matching: Summary • Keyword and suffix trees are used to find patterns in a text (i.e. homology between different genes) • Keyword trees: -Build keyword tree of patterns, and thread text through it • Suffix trees: -Build suffix tree of text, and thread patterns through it

  30. An Excellent Research example: Annotating Large Genomes With Exact Word Matches John Healy Elizabeth E. Thomas Jacob T. Schwartz and Michael Wigler Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA; Courant Institute of Mathematical Science, New York University, New York, New, York 10003, USA Genome Research 2003

  31. An Excellent Research example: Motivation • Finding highly unique probes from the genome • Finding word count for words of arbitrary length in any given genome

  32. An Excellent Research example:Issues and solutions • When word length exceeds 15, directly addressable structures become impractical -Suffix array, suffix tree? • Creating a reversible permutation of a string of text that tends to be more compressible than the original text -based on Burrows-Wheeler transform [‘94]

  33. Approximate vs. Exact Pattern Matching • So far all we’ve seen are exact pattern matching algorithms • Usually, because of mutations, it makes much more biological sense find approximate pattern matches • Biologists often use fast heuristic approaches (rather than local alignment) to find approximate matches

  34. Dot matrices show similarities between two sequences FASTA makes an implicit dot matrix from short exact matches, and tries to find long diagonals (allowing for some mismatches) Dot Matrices

  35. Dot Matrices (cont’d) • Identify diagonals above a threshold length Diagonals in the dot matrix indicate exact substring matching • Extend diagonals and try to link them together, allowing for minimal mismatches/indels Linking diagonals reveals approximate matches over longer substrings Single-nucleotide matching v.s. di-nucleotide matching

  36. BLAST: Basic Local Alignment Search Tool • A Google for biological sequences • Approximate sequence alignment • Not only using BLAST, but also understanding why it works -Not just a computational tool -Misapply it, misinterpret its results

  37. Interpreting New Words with a Dictionary • Finding the sentence with the same (or similar) semantics • Encountering a new word: “rucksack” Meaningless without a dictionary or some point of reference • Encountering a DNA or protein sequence: Need a point of reference No dictionary available but thesaurus exists • Rucksack backpack, bag, purse Does not give exact meaning, but helps with understanding

  38. The statistics of sequence similarity Scores • Raw score S

  39. The statistics of sequence similarity scores • Bit-Scores S’ -Calculated from raw scores S -Normalized with respect to the scoring system -Used to compare scores from different searches • E-value -Expectation value -The number of different alignments with scores >=S in database search -E-value ↓, S is more significant

  40. BLAST: Locally Maximal Segment Pairs • A segment pair is maximal if it has the best score over all segment pairs • A segment pair is locally maximal if its score can’t be improve by extending or shortening • Statistically relevant locally maximal segment pairs are of biological interest • BLAST finds all locally maximal segment pairs with scores above some threshold A significantly high threshold will filter out some “junk”, irrelevant matches

  41. High Scoring Pair (HSPs) • All segment pair whose scores can not be improved by extension or trimming • Need to model a random sequence to analyze how high the score is in relation to chance

  42. Expected number of HSPs • Expected number of HSPs with score ≧S • E-value E for the score S: E=Kmn2 • Given: Two sequence, length n and m The statistics of HSP scores are characterized by two parameters K and λ K: scale for the search space size λ: scale for the scoring system

  43. Dictionary: All words of length K (~10) Alignment initiated between word of alignment score ≧T (typically T=K) Alignment: Ungapped extensions until score below statistical threshold Output: All local alignments with score >statistical threshold Indexing-based local alignment

  44. BLAST algorithm (cont’d)

  45. Gapped extensions Extensions with gaps in a band around anchor Output: GTAAGGTCC-AGT GTTAGGTCCTAGT Indexing-based local alignment-Extensions

  46. Types of BLAST • blastn: Nucleotide-nucleotide • blastp: Protein-protein • blastx: Translated query vs. protein database • tblastn: Protein query vs. translated database • tblastx: Translated query vs. translated database (6 frames each )

  47. Right Tool for the Right Job

  48. Variants of BLAST • NCBI BLAST: search the universe http://www.ncb.nlm.nih.gov/BLAST/ • PSI-BLAST Find members of a protein family or build a custom position-specific score matrix Bootstrapping results to find very related sequence • MEGABLAST: Optimized to align very similar sequence Works best when k=4i≧16 Linear gap penalty • Wu-BLAST: (Wash U BLAST) Very good optimizations Good set of feature & command line arguments • BLAT Faster, less sensitive than BLAST Good for aligning huge numbers of queries • CHAOS Uses inexact k-mers, sensitive • PatternHunter Uses patterns instead of k-mers • BlastZ Uses patterns, good for finding genes

  49. Sample BLAST output • Blast of human beta globin protein against zebra fish

  50. Sample BLAST output (cont’d) • Blast of human beta globin DNA against human DNA

More Related