Presentation – Homework 2 • Advanced Topics: Current Bioinformatics • Instructor: Dr. Jianhua Ruan • Group Members: Jamiul Jahid, Mohammad Iftekharul Islam, Tanzir Musabbir
NGS Analysis Papers • PatMaN: Rapid Alignment of Short Sequences to Large Databases • Kay Prüfer, Udo Stenzel, Michael Dannemann, Richard Green, Michael Lachmann • ProbeMatch: Rapid Alignment of Oligonucleotides to Genome Allowing Both Gaps and Mismatches • You Kim, Nikhil Teletia, Victor Ruotti, Maher, James Thomson and Jignesh Patel
PatMaN: Rapid Alignment of Short Sequences to Large Databases • PatMaN stands for Pattern Matching in Nucleotide Databases • A tool for performing exhaustive searches to identify all occurrences of a large number of short sequences within a genome-sized database. • Reads sequences in FASTA format and reports all hits within the given edit-distance cutoff. • Advantages: • Allows a predefined number of gaps and mismatches • Ambiguity codes can be searched • Search time is short for perfect matches
ProbeMatch: Rapid Alignment of Oligonucleotides to Genome Allowing Both Gaps and Mismatches • Matches a large set of oligonucleotide sequences against a genome database using gapped alignments • Advantages: • It generates both ungapped and gapped alignments • It allows up to three errors, including insertions, deletions and mismatches • It is able to detect multiple classes of mutations: SNVs and indels.
ProbeMatch: Background • High-throughput DNA sequencing technologies: Illumina, 454 Life Sciences • A large set of short sequences is produced • These must be mapped to a genome, allowing for only a few errors • Traditional sequence alignment tools can do this, but are computationally impractical at this scale
ProbeMatch: Background • ELAND (Efficient Local Alignment of Nucleotide Data) • Searches DNA databases for a large number of short sequences • Only ungapped alignments, allowing up to two mismatches • MAQ (Mapping and Assembly with Quality) • Only ungapped alignments, allowing up to three mismatches • Measures the error probability of alignments using sequence quality information • SOAP • SeqMap
ProbeMatch: Background • These programs are often faster than BLAST by an order of magnitude or more • But they usually map only 60-80% of the query sequences to genomes • The unmapped sequences need further processing with a computationally expensive but sensitive alignment method • So the overall gain is limited • ProbeMatch is designed to address this challenge
ProbeMatch: Rapid Alignment of Oligonucleotides to Genome Allowing Both Gaps and Mismatches • Allows a richer match model • Finds gapped and ungapped alignments with up to three errors in any combination • Able to detect multiple classes of mutations
ProbeMatch: Methodology • Takes as input a query sequence set and a database of sequences. • The database is divided into small segments • ProbeMatch loads each segment and builds a q-gram index • To find potential hits, ProbeMatch searches the query sequences against the q-gram index and extends the hits to find longer alignments.
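The index-then-lookup step above can be illustrated with a minimal sketch. It assumes a plain contiguous q-gram index over one database segment; the function names `build_qgram_index` and `candidate_positions` are hypothetical and are not taken from ProbeMatch, and the extension step that follows a lookup is left out.

```python
from collections import defaultdict


def build_qgram_index(segment, q):
    """Map every length-q substring of `segment` to its start positions."""
    index = defaultdict(list)
    for pos in range(len(segment) - q + 1):
        index[segment[pos:pos + q]].append(pos)
    return index


def candidate_positions(query, index, q):
    """Return the implied start of `query` in the segment for every shared q-gram."""
    candidates = set()
    for offset in range(len(query) - q + 1):
        for pos in index.get(query[offset:offset + q], ()):
            # A q-gram found at `pos` implies the query would start at pos - offset.
            candidates.add(pos - offset)
    return sorted(candidates)


segment = "ACGTACGTTAGC"          # toy database segment (illustrative only)
index = build_qgram_index(segment, q=4)
print(candidate_positions("CGTTAG", index, q=4))  # [5]
```

Each candidate position is only a potential hit; in the workflow described on the slide it would still have to be extended and checked against the error limit.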
ProbeMatch: Methodology • If two sequences Q and T match within k errors and Q is cut into j non-overlapping fragments, then at least one of the fragments occurs in T with at most ⌊k/j⌋ errors • The matched hits are then extended to check whether the entire query sequence and the target sequence can be aligned within k errors • A gapped q-gram index (“Better Filtering with gapped q-grams”, Burkhardt and Kärkkäinen, 2002) provides more efficient filtering than an ungapped q-gram index
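The fragment lemma above can be tried out in a small sketch. It makes two simplifying assumptions that are not from the paper: it uses j = k + 1, so that ⌊k/j⌋ = 0 and at least one fragment must match exactly, and it verifies candidates with mismatches only (Hamming distance) rather than full gapped alignment. All function names are hypothetical.

```python
def split_fragments(query, j):
    """Cut `query` into j non-overlapping fragments; return (offset, fragment) pairs."""
    base, extra = divmod(len(query), j)
    fragments, start = [], 0
    for i in range(j):
        length = base + (1 if i < extra else 0)
        fragments.append((start, query[start:start + length]))
        start += length
    return fragments


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_with_mismatches(query, target, k):
    """Start positions where `query` matches `target` with at most k mismatches."""
    if len(query) <= k:
        raise ValueError("query must be longer than k so every fragment is non-empty")
    hits = set()
    # Filter: with k + 1 fragments, a true match contains one fragment exactly.
    for offset, fragment in split_fragments(query, k + 1):
        pos = target.find(fragment)
        while pos != -1:
            start = pos - offset
            if start >= 0 and start + len(query) <= len(target):
                # Verify: extend the exact fragment hit to the whole query.
                if hamming(query, target[start:start + len(query)]) <= k:
                    hits.add(start)
            pos = target.find(fragment, pos + 1)
    return sorted(hits)


print(find_with_mismatches("ACGTAC", "TTACGAACTT", k=1))  # [2]
```

Only positions that share an exact fragment with the query are verified, which is where the filtering saves work compared with checking every position.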
ProbeMatch: Result • 169095 transcriptome short reads from a prostate cell line (RWPE), generated by the Illumina Genome Analyzer, were matched against the human genome using various alignment programs • Table: Comparison of execution times and sensitivity
PatMaN: Rapid Alignment of Short Sequences to Large Databases • Algorithm • A single keyword tree is constructed from all the query sequences. • When the ambiguity flag is set, an ambiguity code matches if the base is one of the nucleotides that the code represents. • When the ambiguity flag is omitted, a base aligned to an ambiguity character is counted as a mismatch. • The bases of each query sequence are added as a path from the root of the tree to a leaf; each edge is labeled with a base and each leaf with the query sequence ID. • Suffix links are also added to the tree
PatMaN: Rapid Alignment of Short Sequences to Large Databases • Suppose the query sequences are ‘CCC’, ‘GA’ and ‘GT’. • Basic keyword tree: root → C → C → C (leaf CCC); root → G → A (leaf GA); root → G → T (leaf GT)
PatMaN: Rapid Alignment of Short Sequences to Large Databases • After adding the suffix links [Figure: keyword tree for CCC, GA and GT with suffix links added]
PatMaN: Rapid Alignment of Short Sequences to Large Databases • Completing the tree [Figure: keyword tree for CCC, GA and GT with the remaining outgoing edges added; the added edge labels include A, T and N]
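The three tree-building slides above can be sketched in code. This is a textbook Aho-Corasick style construction, which is an assumption about what the slides depict rather than PatMaN's actual source: "completing the tree" is read here as giving every node an outgoing transition for every base. The names `build_automaton`, `suffix` and `goto` are hypothetical, and query characters outside `ALPHABET` are not handled.

```python
from collections import deque

ALPHABET = "ACGTN"  # assumed alphabet for this sketch


def build_automaton(queries):
    # Step 1: basic keyword tree. children[n] maps a base to a child node index.
    children, outputs = [{}], [[]]
    for query_id, seq in enumerate(queries):
        node = 0
        for base in seq:
            if base not in children[node]:
                children.append({})
                outputs.append([])
                children[node][base] = len(children) - 1
            node = children[node][base]
        outputs[node].append(query_id)  # leaf carries the query ID

    # Steps 2 and 3: suffix links and completed transitions, in breadth-first order.
    suffix = [0] * len(children)
    goto = [{} for _ in children]
    for base in ALPHABET:
        goto[0][base] = children[0].get(base, 0)  # missing root edges loop to root
    queue = deque(children[0].values())
    while queue:
        node = queue.popleft()
        for base in ALPHABET:
            child = children[node].get(base)
            if child is None:
                # Completed edge: reuse the transition of the suffix-link target.
                goto[node][base] = goto[suffix[node]][base]
            else:
                suffix[child] = goto[suffix[node]][base]
                outputs[child] = outputs[child] + outputs[suffix[child]]
                goto[node][base] = child
                queue.append(child)
    return children, suffix, goto, outputs


children, suffix, goto, outputs = build_automaton(["CCC", "GA", "GT"])
print(suffix)  # [0, 0, 1, 2, 0, 0, 0]
```

For the example queries the nodes are numbered 0 (root), 1 (C), 2 (CC), 3 (CCC), 4 (G), 5 (GA) and 6 (GT), so the suffix link of CCC points to CC and that of CC points to C.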
PatMaN: Rapid Alignment of Short Sequences to Large Databases • Algorithm • Once the tree is completed, each sequence in the target database is evaluated base by base and compared to a list of partial matches. • Each partial match consists of • A node • The number of mismatches and gaps so far. • The list is initialized with • The root of the tree • An edit count of zero. • In each iteration of the algorithm, all partial matches are advanced along the perfectly matching outgoing edges.
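The partial-match list described above can be sketched as follows. This is a simplified illustration, not PatMaN's implementation: it handles mismatches only (no gaps, no ambiguity codes), and instead of suffix links it simply starts a fresh partial match at the root for every target base. The names `build_trie` and `search` are hypothetical.

```python
def build_trie(queries):
    """Keyword tree as parallel lists: child maps and query IDs ending at each node."""
    children, ends = [{}], [[]]
    for query_id, seq in enumerate(queries):
        node = 0
        for base in seq:
            if base not in children[node]:
                children.append({})
                ends.append([])
                children[node][base] = len(children) - 1
            node = children[node][base]
        ends[node].append(query_id)
    return children, ends


def search(target, queries, max_mismatches):
    """Return (start, query_id, mismatches) for every hit within the cutoff."""
    children, ends = build_trie(queries)
    hits = []
    partial = []  # list of (node, edits so far)
    for pos, base in enumerate(target):
        partial.append((0, 0))  # root with an edit count of zero
        advanced = []
        for node, edits in partial:
            for edge_base, child in children[node].items():
                # A matching edge is free; a mismatching edge costs one edit.
                cost = edits + (edge_base != base)
                if cost <= max_mismatches:
                    advanced.append((child, cost))
        for node, edits in advanced:
            for query_id in ends[node]:
                hits.append((pos - len(queries[query_id]) + 1, query_id, edits))
        partial = advanced
    return hits


queries = ["CCC", "GA", "GT"]
print(search("ACCCGT", queries, max_mismatches=0))  # [(1, 0, 0), (4, 2, 0)]
```

Raising `max_mismatches` to 1 on the same input keeps many more partial matches alive at each step, which gives a feel for why the edit cutoff dominates the running time.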
PatMaN: Rapid Alignment of Short Sequences to Large Databases • Complexity • Without ambiguity codes, O(L) time and space are required, where L is the total length of all query sequences. • When ambiguity codes are enabled, both the time and the space requirements increase exponentially. • The search time depends on the target database, but it depends heavily on the maximum edit distance as well as on the average length of the query sequences. • For each additional edit operation, an exponentially increasing number of partial matches must be considered.
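To give a rough feel for the growth mentioned in the last bullet, the sketch below counts how many distinct DNA sequences lie within k mismatches of a query of length n, which is the sum over i ≤ k of C(n, i)·3^i. This is an illustrative proxy only: it ignores gaps and is not the number of partial matches PatMaN actually tracks. The query length of 25 and the function name are assumptions made for the example.

```python
from math import comb


def mismatch_neighbourhood(n, k, alphabet_size=4):
    """Number of length-n sequences within k mismatches of a fixed sequence."""
    return sum(comb(n, i) * (alphabet_size - 1) ** i for i in range(k + 1))


for k in range(4):
    print(k, mismatch_neighbourhood(25, k))
# 0 1
# 1 76
# 2 2776
# 3 64876
```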
PatMaN: Rapid Alignment of Short Sequences to Large Databases • Result • Its running time means PatMaN is best suited to short sequences with a limited number of edit operations. • HG-U95 was matched against the chimpanzee genome (panTro2), allowing no gaps and one mismatch. • PatMaN took 2.5 h and found 15.9 million hits.