Presentation – Homework 2

Presentation – Homework 2. Advanced Topics: Current Bioinformatics. Instructor: Dr. Jianhua Ruan. Group Members: Jamiul Jahid, Mohammad Iftekharul Islam, Tanzir Musabbir. NGS Analysis Papers. PatMaN: Rapid Alignment of Short Sequences to Large Databases

Presentation Transcript


  1. Presentation – Homework 2 • Advanced Topics: Current Bioinformatics • Instructor: Dr. Jianhua Ruan • Group Members: Jamiul Jahid, Mohammad Iftekharul Islam, Tanzir Musabbir

  2. NGS Analysis Papers • PatMaN: Rapid Alignment of Short Sequences to Large Databases • Kay Prüfer, Udo Stenzel, Michael Dannemann, Richard Green, Michael Lachmann • ProbeMatch: Rapid Alignment of Oligonucleotides to Genome Allowing Both Gaps and Mismatches • You Kim, Nikhil Teletia, Victor Ruotti, Maher, James Thomson and Jignesh Patel

  3. PatMaN: Rapid Alignment of Short Sequences to Large Databases • PatMaN – Pattern Matching in Nucleotide Databases • A tool for performing exhaustive searches to identify all occurrences of a large number of short sequences within a genome-sized database. • Reads sequences in FastA format and reports all hits within the given edit-distance cutoff. • Advantages: • Allows a predefined number of gaps and mismatches • Ambiguity codes can be searched • Search time is short for perfect matches

  4. ProbeMatch: Rapid Alignment of Oligonucleotides to Genome Allowing Both Gaps and Mismatches • For matching a large set of oligonucleotide sequences against a genome database using gapped alignments • Advantages: • It generates both ungapped and gapped alignments • It allows up to three errors, including insertions, deletions and mismatches • It is able to detect multiple classes of mutations: SNVs and indels.

  5. ProbeMatch: Background • High-throughput DNA sequencing technologies: Illumina, 454 Life Sciences • A large set of short sequences is produced • These must be mapped to a genome, allowing for only a few errors • Traditional sequence alignment tools can do this, but are computationally impractical

  6. ProbeMatch: Background • ELAND (Efficient Local Alignment of Nucleotide Data) • Searches DNA databases for a large number of short sequences • Only ungapped alignments, allowing up to two mismatches • MAQ (Mapping and Assembly with Quality) • Only ungapped alignments, allowing up to three mismatches • Measures the error probability of alignments using sequence quality information • SOAP • SeqMap

  7. ProbeMatch: Background • These programs are often faster than BLAST by an order of magnitude or more • But they usually map only 60-80% of the query sequences to genomes • The remaining sequences need further processing with a computationally expensive but sensitive alignment method • The overall gain is therefore limited • ProbeMatch addresses this challenge

  8. ProbeMatch: Rapid Alignment of Oligonucleotides to Genome Allowing Both Gaps and Mismatches • Allows a richer match model • Finds gapped and ungapped alignments with up to three errors in any combination • Able to detect multiple classes of mutations

  9. ProbeMatch: Methodology • Takes as input a query sequence set and a database of sequences. • The database is divided into small segments. • ProbeMatch loads each segment and builds a q-gram index. • To find potential hits, ProbeMatch searches the query sequences against the q-gram index and extends the hits to find longer alignments.
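As a rough illustration of the index-then-extend idea on this slide, the sketch below builds an ungapped q-gram index over one database segment and uses exact q-gram hits to propose candidate alignment offsets. The function names, the toy segment and the choice q = 4 are assumptions made for illustration; this is not ProbeMatch's actual implementation.

```python
from collections import defaultdict

def build_qgram_index(segment, q):
    """Map every length-q substring of `segment` to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(segment) - q + 1):
        index[segment[i:i + q]].append(i)
    return index

def candidate_positions(query, index, q):
    """Propose segment offsets where `query` might align, based on exact q-gram hits.

    A q-gram found at query offset j and segment position pos suggests that the
    whole query starts at pos - j; each candidate still has to be verified by
    extending the hit into a full alignment.
    """
    candidates = set()
    for j in range(len(query) - q + 1):
        for pos in index.get(query[j:j + q], []):
            candidates.add(pos - j)
    return sorted(candidates)

# Toy segment (hypothetical data, not from the paper).
segment = "ACGTACGTTAGC"
index = build_qgram_index(segment, 4)
print(candidate_positions("CGTT", index, 4))
```

Building the index once per segment lets every query be looked up without rescanning the segment.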

  10. ProbeMatch: Methodology • If two sequences Q and T match within k errors and j non-overlapping fragments are taken from Q, then T contains at least one of the fragments with at most ⌊k/j⌋ errors. • The matched hits are then extended to check whether the entire query sequence and the target sequence can be aligned within k errors. • A gapped q-gram index (“Better Filtering with gapped q-grams”, Burkhardt and Kärkkäinen, 2002) provides more efficient filtering than an ungapped q-gram index.
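The fragment lemma above can be tried out on a small example. The sketch below is a simplified, mismatch-only version (Hamming distance, no gaps) that scans the target directly instead of using a q-gram index; both simplifications, and all function names, are assumptions for illustration. It splits the query into j fragments, keeps target positions where some fragment matches with at most ⌊k/j⌋ mismatches, and verifies each candidate against the full query.

```python
def split_fragments(query, j):
    """Split `query` into j non-overlapping fragments; return (offset, fragment) pairs."""
    base, extra = divmod(len(query), j)
    fragments, start = [], 0
    for i in range(j):
        length = base + (1 if i < extra else 0)
        fragments.append((start, query[start:start + length]))
        start += length
    return fragments

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def filter_then_verify(query, target, k, j):
    """Start positions where `query` matches `target` with at most k mismatches."""
    budget = k // j  # errors allowed inside a single fragment, i.e. floor(k / j)
    hits = set()
    for offset, frag in split_fragments(query, j):
        for pos in range(len(target) - len(frag) + 1):
            if hamming(frag, target[pos:pos + len(frag)]) > budget:
                continue  # the filter rejects this position for this fragment
            start = pos - offset
            if 0 <= start <= len(target) - len(query):
                # Verification step: extend to the whole query.
                if hamming(query, target[start:start + len(query)]) <= k:
                    hits.add(start)
    return sorted(hits)

# Hypothetical toy data: the query occurs at position 2 with one mismatch.
print(filter_then_verify("ACGTACGT", "TTACGAACGTTT", k=1, j=2))
```

With j = k + 1 the per-fragment budget is zero, so at least one fragment must match exactly, which is what makes an index lookup usable as a filter.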

  11. ProbeMatch: Result • 169,095 transcriptome short reads from a prostate cell line (RWPE), generated by the Illumina Genome Analyzer, were matched against the human genome using various alignment programs • Table: Comparison of execution times and sensitivity

  12. PatMaN: Rapid Alignment of Short Sequences to Large Databases • Algorithm • A single keyword tree is constructed from all the query sequences. • When the ambiguity flag is set, an ambiguity code matches if the base is one of the nucleotides the code represents. • When the ambiguity flag is omitted, a base aligned to an ambiguity character is counted as a mismatch. • All bases of a query sequence are added as a path from the root of the tree to a leaf, with each edge labeled by a base and the leaf labeled by the query sequence ID. • Suffix links are also added to the tree.

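  13. PatMaN: Rapid Alignment of Short Sequences to Large Databases • Suppose the query sequences are ‘CCC’, ‘GA’ and ‘GT’. The basic keyword tree is: [Figure: keyword tree with the paths root → C → C → C (leaf CCC), root → G → A (leaf GA) and root → G → T (leaf GT)]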
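The basic keyword tree for 'CCC', 'GA' and 'GT' can be sketched as an ordinary trie. The node layout below (a list of dictionaries with `edges` and `ids` fields) is an assumption chosen for readability, not PatMaN's internal data structure.

```python
def build_keyword_tree(queries):
    """Build a keyword tree (trie) from a mapping of query ID to sequence.

    Each node is a dict with `edges` (base -> child node index) and `ids`
    (IDs of the queries that end at this node). Node 0 is the root.
    """
    nodes = [{"edges": {}, "ids": []}]
    for query_id, sequence in queries.items():
        current = 0
        for base in sequence:
            if base not in nodes[current]["edges"]:
                nodes.append({"edges": {}, "ids": []})
                nodes[current]["edges"][base] = len(nodes) - 1
            current = nodes[current]["edges"][base]
        nodes[current]["ids"].append(query_id)
    return nodes

# The three query sequences from the slide, keyed by themselves for simplicity.
tree = build_keyword_tree({"CCC": "CCC", "GA": "GA", "GT": "GT"})
for number, node in enumerate(tree):
    print(number, node["edges"], node["ids"])
```

The two queries starting with G share their first edge, so the tree has seven nodes: the root, three on the C path, and three on the G paths.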

  14. PatMaN: Rapid Alignment of Short Sequences to Large Databases • After adding the suffix links: [Figure: the keyword tree for CCC, GA and GT with suffix links added]

  15. PatMaN: Rapid Alignment of Short Sequences to Large Databases • Completing the tree: [Figure: the completed keyword tree for CCC, GA and GT, with additional outgoing edges; edge labels in the figure include A, T, N]
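One standard way to add suffix links and then complete a keyword tree is the Aho-Corasick construction, sketched below for exact matching only. Whether PatMaN completes its tree in exactly this way is an assumption; the alphabet `ACGTN` and all names are likewise chosen for illustration.

```python
from collections import deque

ALPHABET = "ACGTN"  # assumed alphabet for this sketch

def build_automaton(queries):
    """Keyword tree with suffix links, completed so every node has an edge per base."""
    goto, suffix, out = [{}], [0], [[]]
    for query in queries:
        current = 0
        for base in query:
            if base not in goto[current]:
                goto.append({})
                suffix.append(0)
                out.append([])
                goto[current][base] = len(goto) - 1
            current = goto[current][base]
        out[current].append(query)

    queue = deque()
    for base in ALPHABET:
        if base in goto[0]:
            queue.append(goto[0][base])  # depth-1 nodes link back to the root
        else:
            goto[0][base] = 0            # missing root edges loop to the root
    while queue:                         # breadth-first: shallow nodes first
        node = queue.popleft()
        out[node] = out[node] + out[suffix[node]]
        for base in ALPHABET:
            if base in goto[node]:
                child = goto[node][base]
                suffix[child] = goto[suffix[node]][base]
                queue.append(child)
            else:
                # Completion: reuse the transition of the suffix-link target.
                goto[node][base] = goto[suffix[node]][base]
    return goto, out

def find_exact(text, goto, out):
    """Report (start, query) for every exact occurrence of a query in `text`."""
    hits, state = [], 0
    for i, base in enumerate(text):
        state = goto[state].get(base, 0)
        for query in out[state]:
            hits.append((i - len(query) + 1, query))
    return hits

goto, out = build_automaton(["CCC", "GA", "GT"])
print(find_exact("ACCCGAGT", goto, out))
```

After completion, each target base causes exactly one transition, so a target sequence is scanned in a single pass.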

  16. PatMaN: Rapid Alignment of Short Sequences to Large Databases • Algorithm • Once the tree is completed, each sequence in the target database is evaluated base by base and compared to a list of partial matches. • Each partial match consists of • A node • The number of mismatches and gaps so far. • The list is initialized with • The root of the tree • An edit count of zero. • In each iteration of the algorithm, all partial matches are advanced along the perfectly matching outgoing edges.
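The partial-match list can be sketched as follows. This is a deliberately reduced version: it handles mismatches only (no gaps), works on the basic keyword tree without suffix links, and starts a fresh partial match at the root for every target position. These simplifications and all names are assumptions for illustration, not PatMaN's implementation.

```python
def build_trie(queries):
    """Basic keyword tree: a list of nodes with `edges` and the queries ending there."""
    nodes = [{"edges": {}, "ids": []}]
    for sequence in queries:
        current = 0
        for base in sequence:
            child = nodes[current]["edges"].get(base)
            if child is None:
                nodes.append({"edges": {}, "ids": []})
                child = len(nodes) - 1
                nodes[current]["edges"][base] = child
            current = child
        nodes[current]["ids"].append(sequence)
    return nodes

def search(target, queries, max_mismatches):
    """Return sorted (start, query, mismatches) hits within the mismatch cutoff."""
    nodes = build_trie(queries)
    hits = set()
    partial = []  # list of (node, edits so far)
    for i, base in enumerate(target):
        partial.append((0, 0))  # root with an edit count of zero
        advanced = []
        for node, edits in partial:
            for edge_base, child in nodes[node]["edges"].items():
                # A matching edge is free; any other edge costs one mismatch.
                cost = edits + (edge_base != base)
                if cost > max_mismatches:
                    continue  # drop partial matches that exceed the cutoff
                advanced.append((child, cost))
                for query in nodes[child]["ids"]:
                    hits.add((i - len(query) + 1, query, cost))
        partial = advanced
    return sorted(hits)

print(search("ACCCGAGT", ["CCC", "GA", "GT"], max_mismatches=0))
```

Raising `max_mismatches` keeps more partial matches alive at every step, which is where the extra search time comes from.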

  17. PatMaN: Rapid Alignment of Short Sequences to Large Databases • Complexity • Without ambiguity codes, O(L) time and space are required, where L is the total length of all query sequences. • When ambiguity is enabled, both the time and space requirements increase exponentially. • Search time depends on the size of the target database and, heavily, on the maximum edit distance and the average length of the query sequences. • For each additional edit operation, an exponentially increasing number of partial matches must be considered.
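To get a feel for this growth, one can count how many distinct sequences lie within a given number of mismatches of a single query. The sketch below covers mismatches only (gaps would add further variants) and uses a query length of 25 purely as an illustrative assumption; it is a counting argument, not a measurement of PatMaN's running time.

```python
from math import comb

def mismatch_neighbourhood(length, max_mismatches, alphabet_size=4):
    """Count the sequences within `max_mismatches` substitutions of one query.

    Choosing i positions to change gives comb(length, i) options, and each
    changed position can take any of the other (alphabet_size - 1) bases.
    """
    return sum(
        comb(length, i) * (alphabet_size - 1) ** i
        for i in range(max_mismatches + 1)
    )

for k in range(4):
    print(k, mismatch_neighbourhood(25, k))
```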

  18. PatMaN: Rapid Alignment of Short Sequences to Large Databases • Result • The time constraints of PatMaN mean it is suitable for short sequences with a limited number of edit operations. • HG-U95 was matched against the chimpanzee genome (panTro2), allowing one mismatch and no gaps. • PatMaN took 2.5 h and found 15.9 million hits.

  19. Q/A?
