1 / 17

BARCODE Quality Assessment: Frequency Matrix Approach

This study explores the potential errors in BARCODE sequencing, such as taxonomic mislabeling and pseudogenes, and proposes a frequency matrix approach for quality assessment. The analysis is based on a large dataset of avian BARCODEs, highlighting the distribution and impact of rare variants. The results show that avian BARCODE sequence quality is high and improving, and caution is advised when studying rare variants. The frequency matrix approach has potential utility for database quality assessment.

rshurtz
Download Presentation

BARCODE Quality Assessment: Frequency Matrix Approach

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. BARCODE Quality Assessment:Frequency Matrix Approach Mark Y Stoeckle, Rockefeller University Kevin C R Kerr, Royal Ontario Museum PLoS ONE August 2012 e43992

  2. Potential BARCODE errors • Taxonomic mislabeling • Pseudogenes • Sequencing Error

  3. Hypothesis: Rare Variants Sequencing Errors =

  4. Avian BARCODEs: large, representative dataset11K records/2.7K species (27% known)

  5. Most 1st, 2nd positions >99.9% conserved

  6. Most aa positions >99.9% conserved

  7. Working definitions for rare variants: • VERY LOW FREQUENCY VARIANT (VLF): nt, aa in <0.1% seqs at a given position • SINGLETON VLF: in 1 indiv/species • SHARED VLF: in ≥2 indiv/species

  8. Spatial distribution of VLFs • Singleton VLFs: (mostly) seq error • Shared VLFs: (mostly) biological Concentrated at ends of segment Relatively evenly distributed

  9. Sliding window analysis nVLFs

  10. Calculating Error Rate > • Most (94%) 2nd positions >99.9% conserved • (Nearly) all 2nd position seqerrors are VLFs 187 2ndpossi-nVLFs (probable errors) . 216 2ndpos/BC x 10,760 BC = 8 x 10-5 errors/base pair

  11. Limitations, Observations • First seq error assessment for BARCODEs • Some singleton nVLFslikely biological—calc rate is upper limit • ~3% BARCODEs ≥ 1 error (av 1.7/BARCODE) • Seq errors unlikely to affect species ID • Increased apparent intraspecific variation

  12. Applications: 1. Compare database quality Error bars =95% CI

  13. 2. Highlight BARCODEsw probableerrors--annotate? Annotated Homer Simpson Portrait

  14. 3. BONUS: cryptic pseudogenes flagged by multiple SHARED VLFs Alder flycatcher (Empidonaxalnorum)

  15. Cryptic pseudogenes uncommon: 0.1% BARCODEs Fuscous flycatcher (Cnemotriccusfuscatus) Canada goose (Brantacanadensis)

  16. Conclusions • Avian BARCODE sequence quality high and improving • Frequency matrix potential utility for sequence database QA • Caution on studies involving rare variants

  17. Acknowledgments Alan Baker Jan Lifjeld ArildJohnsen Per Ericson Carla Dove Gary Graves Pablo TubaroDario Litjmaer Natural Sciences and Engineering Research Council of Canada Jesse Ausubel

More Related