NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes

NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes (Paper Presentation – CS 394C) T.Z. DeSantis, P. Hugenholtz, K. Keller, E.L. Brodie, N. Larsen, Y.M. Piceno, R. Phan & G.L. Andersen Presentation: Andrei Margea Coordinator: Prof. Tandy Warnow

About the Paper • Published in Nucleic Acids Research in 2006. • Describes NAST (Nearest Alignment Space Termination), a multiple sequence alignment (MSA) algorithm. • Describes an online tool, based on NAST, that efficiently aligns thousands of 16S rRNA gene sequences using a precomputed alignment of ~10,000 non-chimeric sequences representative of bacterial and archaeal diversity.

Multiple Sequence Alignments • When inferring evolutionary histories from sequence data, one important step is estimating an MSA of the input sequences. • An MSA is represented as a two-dimensional matrix whose rows correspond to genes and whose columns correspond to specific sites. The entries are nucleotides or gaps. A gap means that the gene in that row lacks a base at the position represented by that column. • In short, an MSA expresses the positional homology of multiple genes along its columns. • Alignments are useful because the gaps mark insertion or deletion events. • However, in projects where the input exceeds ~100,000 16S rRNA gene sequences, the alignment step becomes a bottleneck.
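
The matrix view of an MSA can be made concrete with a toy example. This is only an illustrative sketch: the three sequences and the helper names (`column`, `is_rectangular`, `ungapped`) are invented for this slide and are not taken from the paper.

```python
# Toy alignment: each row is a gene, each column a site, '-' marks a gap.
msa = {
    "gene_a": "ACG-TA",
    "gene_b": "ACGGTA",
    "gene_c": "A-GGTA",
}


def column(msa, j):
    """Return the characters in column j, one per row, in row order."""
    return [row[j] for row in msa.values()]


def is_rectangular(msa):
    """Every row of an MSA must span the same number of columns."""
    return len({len(row) for row in msa.values()}) == 1


def ungapped(row):
    """Recover the original (unaligned) sequence by dropping gaps."""
    return row.replace("-", "")
```

Reading down `column(msa, 3)` shows that `gene_a` lacks the base that the other two genes have at that site, which is how an insertion or deletion event shows up in the matrix.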

Estimating an MSA • Once an MSA has been computed on a set of sequences (a ‘profile’ alignment), new sequences can be added without recomputing the optimal gap placements for the whole MSA. • Adding a ‘candidate’ sequence to a profile alignment may reveal insertion events, that is, candidate bases with no matching column in the profile. There are two ways to handle them: (1) insert gaps into the profile alignment, which lets the profile sequences grow; or (2) allow a local misalignment of the candidate sequence, which keeps the size of the profile sequences unchanged.
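
The two options can be contrasted with a small sketch. It assumes a pairwise alignment of the candidate against the ungapped template is already available as two equal-length strings (`tmpl_aln`, `cand_aln`); all function names are hypothetical. The fixed-width variant drops the candidate gap nearest to each insertion, which is a simplified reading of the "nearest alignment space" idea and not the published NAST implementation.

```python
def project(template_row, tmpl_aln, cand_aln):
    """Map a pairwise alignment onto the template's profile columns.

    Returns (profile_column, candidate_char) pairs; profile_column is None
    where the candidate has a base that the template lacks (an insertion).
    """
    assert template_row.replace("-", "") == tmpl_aln.replace("-", "")
    steps, i, k = [], 0, 0
    while i < len(template_row) or k < len(tmpl_aln):
        if k < len(tmpl_aln) and tmpl_aln[k] == "-":
            steps.append((None, cand_aln[k]))  # candidate insertion
            k += 1
        elif template_row[i] == "-":
            steps.append((i, "-"))             # gap column already in the profile
            i += 1
        else:
            steps.append((i, cand_aln[k]))     # template base vs. candidate char
            i += 1
            k += 1
    return steps


def grow_profile(profile, t_idx, tmpl_aln, cand_aln):
    """Option 1: add an all-gap column to every profile row per insertion."""
    steps = project(profile[t_idx], tmpl_aln, cand_aln)
    rows = ["".join("-" if col is None else row[col] for col, _ in steps)
            for row in profile]
    return rows, "".join(ch for _, ch in steps)


def fit_fixed_width(template_row, tmpl_aln, cand_aln):
    """Option 2: keep the width; absorb each insertion by deleting the
    nearest gap in the candidate row (causing a local misalignment)."""
    cells = [[ch, col is None] for col, ch in project(template_row, tmpl_aln, cand_aln)]
    while any(pending for _, pending in cells):
        p = next(idx for idx, (_, pending) in enumerate(cells) if pending)
        gaps = [q for q, (ch, _) in enumerate(cells) if ch == "-"]
        if not gaps:
            raise ValueError("no gap left to absorb the insertion")
        q = min(gaps, key=lambda g: abs(g - p))
        cells[p][1] = False
        del cells[q]
    return "".join(ch for ch, _ in cells)


profile = ["A-CGT-A", "ATCGTCA"]                # row 0 is the template
rows, cand_grown = grow_profile(profile, 0, "ACG-TA", "ACGGTA")
cand_fixed = fit_fixed_width(profile[0], "ACG-TA", "ACGGTA")
```

In this toy case the first option widens the profile from 7 to 8 columns, while the second returns a 7-column candidate row in which the bases between the insertion and the removed gap are shifted by one position.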

Allowing alignments to grow… • Common practice • However, …

NAST Overview • NAST is an MSA algorithm that keeps the number of alignment columns fixed. • It uses a precomputed alignment of a ‘Core Set’ of ~10,000 representative sequences (selected from a set of >80,000 16S rDNA genes from GenBank). This alignment has 7,682 columns. • In a more recent implementation of NAST (PyNAST), the user can supply the template alignment as an input.

NAST – I/O • Input: a candidate sequence and a template alignment (the template is an input only in PyNAST). • Output: an MSA.

NAST – How It Works • When there are multiple candidate sequences, each one is aligned to the template alignment separately. • First, the template sequence most similar to the candidate sequence is identified using BLAST. The candidate is then trimmed of flanking sequence data.
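
The template-selection step can be illustrated without running BLAST. The sketch below swaps in a much cruder stand-in, counting shared 7-mers, purely to show the idea of picking the closest template; it is not how BLAST scores hits, and the function names and toy sequences are assumptions made for this example.

```python
def kmers(seq, k=7):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def nearest_template(candidate, templates, k=7):
    """Return the name of the template sharing the most k-mers with the candidate.

    `templates` maps a name to an ungapped template sequence.
    """
    cand = kmers(candidate, k)
    return max(templates, key=lambda name: len(cand & kmers(templates[name], k)))


templates = {
    "t1": "ACGTACGTAGCTAGCTAGGA",
    "t2": "TTTTGGGGCCCCAAAATTTT",
}
best = nearest_template("ACGTACGTAGCTAGCAAGGA", templates)
```

Because every candidate is matched to a template on its own, this step can be repeated independently for each input sequence.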


NAST – Performance • Since each candidate sequence is aligned separately, the running time is linear in the number of sequences. • Once the template sequence has been identified, the pairwise alignment step takes O(mn) time when using BLAST, where m and n are the lengths of the candidate and template sequences. • The original NAST was able to align ~10 16S rRNA sequences per minute (using Intel Xeon 2.4 GHz processors).
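
To see where an m × n cost for pairwise alignment comes from, here is a minimal global-alignment scoring routine in the Needleman-Wunsch style. It is a textbook dynamic program used only as an illustration: BLAST itself is a heuristic and does not work this way, and the scoring values below are arbitrary assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b.

    The two nested loops fill one cell per (i, j) pair, so the work is
    proportional to len(a) * len(b).
    """
    m, n = len(a), len(b)
    prev = [j * gap for j in range(n + 1)]      # row for the empty prefix of a
    for i in range(1, m + 1):
        cur = [i * gap] + [0] * n
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[n]
```

Doubling the length of both sequences quadruples the number of cells, while doubling the number of candidates only doubles the number of such pairwise problems.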

NAST – Performance • A newer (2010) implementation, PyNAST, was written and compared to NAST under the same conditions (both using BLAST 2.2.16 for the database search and the pairwise alignment) on a collection of 30,000 16S rRNA sequences and subsets of this collection. • Results show that PyNAST reduces the run time from 1.55 s/sequence to 1.46 s/sequence.


Thoughts • Significant improvement in running time. Aligning 100,000 sequences takes about 7 days (old NAST, 2006 version), 1 day 19 hrs (new NAST, 2010 version), or 1 day 16 hrs 30 min (PyNAST). • It is unclear why allowing alignments to grow is inconvenient. • Inconvenient for sets of short candidate sequences (if the template sequences are long, as in the original NAST). • The output alignment might not optimize the relative alignment among the candidate sequences.
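
The 100,000-sequence estimates follow directly from the per-sequence rates quoted on the performance slides, assuming the running time really is linear in the number of sequences. The helper name `total_time` is made up for this check.

```python
from datetime import timedelta


def total_time(n_sequences, seconds_per_sequence):
    """Total wall-clock time if every sequence costs the same amount."""
    return timedelta(seconds=n_sequences * seconds_per_sequence)


n = 100_000
old_nast = total_time(n, 60 / 10)   # ~10 sequences per minute
new_nast = total_time(n, 1.55)      # seconds per sequence, 2010 NAST
pynast = total_time(n, 1.46)        # seconds per sequence, PyNAST

for name, t in [("old NAST", old_nast), ("new NAST", new_nast), ("PyNAST", pynast)]:
    print(f"{name}: {t.days} d {t.seconds // 3600} h {t.seconds % 3600 // 60} min")
```

The results come out at just under 7 days, about 1 day 19 hours, and about 1 day 16.5 hours, in line with the rounded figures on the slide.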

References • DeSantis, T.Z.; Hugenholtz, P.; Keller, K.; Brodie, E.L.; Larsen, N.; Piceno, Y.M.; Phan, R.; and Andersen, G.L. 2006. NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic Acids Res. 34(Web Server issue): W394–W399. • Caporaso, J.G.; Bittinger, K.; Bushman, F.D.; DeSantis, T.Z.; Andersen, G.L.; and Knight, R. 2010. PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics 26: 266–267. DOI 10.1093/bioinformatics/btp636.

Thank You!! • Questions?
