NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes

NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes (Paper Presentation – CS 394C). T.Z. DeSantis, P. Hugenholtz, K. Keller, E.L. Brodie, N. Larsen, Y.M. Piceno, R. Phan & G.L. Andersen. Presentation: Andrei Margea. Coordinator: Prof. Tandy Warnow.

Presentation Transcript


  1. NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes (Paper Presentation – CS 394C) • T.Z. DeSantis, P. Hugenholtz, K. Keller, E.L. Brodie, N. Larsen, Y.M. Piceno, R. Phan & G.L. Andersen • Presentation: Andrei Margea • Coordinator: Prof. Tandy Warnow

  2. About the Paper • Published in Nucleic Acids Research in 2006. • Describes NAST (Nearest Alignment Space Termination), a multiple sequence alignment (MSA) algorithm. • Describes an online tool, based on NAST, that efficiently aligns thousands of 16S rRNA gene sequences against a precomputed alignment of ~10,000 non-chimeric sequences representative of bacterial and archaeal diversity.

  3. Multiple Sequence Alignments • When inferring evolutionary histories from sequence data, one important step is estimating an MSA on the input sequences. • An MSA is represented as a two-dimensional matrix, where the rows correspond to genes and the columns to specific sites. The entries are nucleotides or gaps. A gap in a given position means that the gene in that row lacks a base at the site corresponding to that column. • In essence, an MSA expresses the positional homology of multiple genes (along columns). • Alignments are useful because the gaps mark insertion or deletion events. • However, in projects where the input exceeds ~100,000 16S rRNA gene sequences, the alignment step becomes a bottleneck.
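The matrix view of an MSA can be made concrete with a small sketch. The three toy sequences and the helper names `column` and `gap_fraction` are assumptions made up for illustration; they do not come from the paper.

```python
# Toy MSA: rows are sequences, columns are sites, '-' marks a gap.
alignment = [
    "ACG-TA",
    "AC-GTA",
    "ACGGTA",
]

def column(msa, j):
    """Return the characters aligned at site j, one per row."""
    return [row[j] for row in msa]

def gap_fraction(msa, j):
    """Fraction of rows that have a gap at site j."""
    col = column(msa, j)
    return col.count("-") / len(col)

# Every row must have the same length for the matrix to be well formed.
assert len({len(row) for row in alignment}) == 1

print(column(alignment, 3))        # ['-', 'G', 'G']
print(gap_fraction(alignment, 2))  # one gap among three rows
```

Reading down a column gives the bases that the alignment treats as homologous; a column that is mostly gaps suggests an insertion in only a few sequences.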

  4. Estimating an MSA • After computing an MSA on a set of sequences (a ‘profile’ alignment), new sequences can be added without recomputing the optimal gap placements for the whole MSA again. • When adding a ‘candidate’ sequence to a profile alignment, a number of insertion events might be discovered. There are two approaches to deal with this issue: • Insert gaps in the profile alignment, and therefore allow the size of the profile sequences to grow; • Allow a local misalignment of the candidate sequence, and maintain the size of the profile sequences.
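The two options can be contrasted on a toy profile. This is a simplified illustration, not the published NAST procedure: the function names, the "delete the nearest gap" rule, and the example sequences are all assumptions chosen to keep the sketch short.

```python
def grow_profile(profile, pos):
    """Option 1: open a new gap column before column pos in every profile row."""
    return [row[:pos] + "-" + row[pos:] for row in profile]

def absorb_insertion(row, pos, base):
    """Option 2: place base before column pos and delete the nearest gap,
    so the row keeps its width; the bases in between shift by one column."""
    gaps = [i for i, ch in enumerate(row) if ch == "-"]
    if not gaps:
        raise ValueError("no gap available to absorb the insertion")
    nearest = min(gaps, key=lambda i: abs(i - pos))
    chars = list(row)
    del chars[nearest]
    # Deleting a gap left of pos moves the insertion point one column left.
    insert_at = pos - 1 if nearest < pos else pos
    chars.insert(insert_at, base)
    return "".join(chars)

profile = ["ACGT--A", "ACGTTTA"]
candidate_row = "ACGT--A"  # candidate laid out in profile columns, before its extra base

# The candidate carries an extra 'T' between 'C' and 'G' (before column 2).
grown = grow_profile(profile, 2)                 # profile widens to 8 columns
fixed = absorb_insertion(candidate_row, 2, "T")  # candidate stays at 7 columns

print(grown)  # ['AC-GT--A', 'AC-GTTTA']
print(fixed)  # ACTGT-A
```

In the fixed-width result the candidate's `G` and `T` sit one column to the right of their homologs in the profile, which is the local misalignment the second option accepts in exchange for a constant column count.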

  5. Allowing alignments to grow… • Common practice • However, …

  6. NAST Overview • NAST is an MSA algorithm that keeps the number of alignment columns fixed. • It uses a precomputed alignment of a ‘Core Set’ of ~10,000 representative sequences (selected from a set of >80,000 16S rRNA gene sequences from GenBank). This alignment has 7,682 columns. • In a more recent implementation (PyNAST), the user can supply the template alignment as an input.

  7. NAST – I/O • Input: • A candidate sequence. • A template alignment (only in PyNAST). • Output: • An MSA.

  8. NAST – How It Works • In the case of multiple candidate sequences, each one of them is aligned separately into the template alignment. • First, the sequence most similar to the candidate sequence is determined using BLAST. The candidate is then trimmed of flanking sequence data.
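The template-selection step can be illustrated with a stand-in. NAST uses BLAST for this search; the sketch below deliberately swaps in a much cruder shared k-mer count so that it runs without external tools. The function names, the value of `k`, and the toy templates are assumptions for illustration only.

```python
def kmers(seq, k=4):
    """Set of all overlapping k-mers in an ungapped sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def nearest_template(candidate, templates, k=4):
    """Return the name of the template sharing the most k-mers with the candidate.

    templates maps a name to an aligned (gapped) sequence; gaps are removed
    before comparison. This is a stand-in for the BLAST search, not BLAST itself.
    """
    cand = kmers(candidate, k)

    def shared(name):
        ungapped = templates[name].replace("-", "")
        return len(cand & kmers(ungapped, k))

    return max(templates, key=shared)

templates = {
    "t1": "ACGTACGTGG--CA",
    "t2": "TTTTGGGGCCCCAA",
}

# Each candidate is matched to a template independently of the others.
for candidate in ["ACGTACGTGGCA", "TTTTGGGGCC"]:
    print(candidate, "->", nearest_template(candidate, templates))
```

Because each candidate is handled on its own, the loop can be split across processors without any coordination between candidates.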

  9. NAST – How It Works

  10. NAST – How It Works

  11. NAST – Performance • Since each candidate sequence is aligned separately, the running time is linear in the number of sequences. • The pairwise alignment step (once the template sequence has been identified) takes O(mn) time when using BLAST, where m and n are the lengths of the candidate and template sequences. • The original NAST was able to align ~10 16S rRNA sequences per minute (using Intel Xeon 2.4 GHz processors).
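To see where an O(mn) cost for a pairwise alignment comes from, consider the classic dynamic-programming global alignment score. This is not what BLAST does internally (BLAST is a heuristic); it is a textbook sketch, with scoring values chosen arbitrarily, showing a table with one cell per pair of positions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b (Needleman-Wunsch style).

    Fills an (m+1) x (n+1) table one row at a time, so time is O(mn)
    and memory is O(n). The scoring parameters are illustrative assumptions.
    """
    m, n = len(a), len(b)
    prev = [j * gap for j in range(n + 1)]  # row 0: b aligned against gaps only
    for i in range(1, m + 1):
        curr = [i * gap] + [0] * n          # column 0: a aligned against gaps only
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[n]

print(global_alignment_score("ACGT", "ACGT"))  # 4: four matches
print(global_alignment_score("ACGT", "AGT"))   # 2: three matches, one gap
```

The two nested loops visit each of the m × n cells once, and aligning k candidates separately repeats this k times, which gives the linear dependence on the number of sequences.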

  12. NAST – Performance • A newer (2010) implementation, PyNAST, was written and compared to NAST under the same conditions (both using BLAST 2.2.16 for the database search and the pairwise alignment) on a collection of 30,000 16S rRNA sequences and subsets of this collection. • Results show that PyNAST reduces the run time from 1.55 s/sequence to 1.46 s/sequence.
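Because run time is linear in the number of sequences, the per-sequence figures can be scaled to a 100,000-sequence project with simple arithmetic. The helper name `total_time` is an assumption, and the 6 s/sequence figure for the original NAST is derived from the reported ~10 sequences per minute.

```python
def total_time(n_sequences, seconds_per_sequence):
    """Total run time as (days, hours, minutes), assuming linear scaling."""
    total = round(n_sequences * seconds_per_sequence)
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    return days, hours, rem // 60

n = 100_000
print(total_time(n, 6))     # original NAST at ~10 sequences/minute: (6, 22, 40)
print(total_time(n, 1.55))  # NAST in the 2010 comparison: (1, 19, 3)
print(total_time(n, 1.46))  # PyNAST: (1, 16, 33)
```

These come to roughly 7 days, 1 day 19 hours, and 1 day 16.5 hours respectively.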

  13. NAST – Performance

  14. Thoughts • Significant run-time improvement. • 100,000 sequences in • 7 days (old NAST, 2006 version) • 1 day 19 hrs (new NAST, 2010 version) • 1 day 16 hrs 30 min (PyNAST) • Unclear why allowing sequences to grow is inconvenient. • Inconvenient for sets of short candidate sequences (if template sequences are large, as in the original NAST). • Output alignment might not optimize the relative alignment among the candidate sequences.

  15. References • DeSantis, T.Z.; Hugenholtz, P.; Keller, K.; Brodie, E.L.; Larsen, N.; Piceno, Y.M.; Phan, R.; and Andersen, G.L. 2006. NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic Acids Res. 34(Web Server issue): W394–W399. • Caporaso, J.G.; Bittinger, K.; Bushman, F.D.; DeSantis, T.Z.; Andersen, G.L.; and Knight, R. 2010. PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics 26: 266–267 (January 15, 2010). DOI 10.1093/bioinformatics/btp636.

  16. Thank You!! • Questions?
