The Barcode of Life Data Portal (http://bol.uvm.edu). Dr. David E Schindel, Executive Secretary; Michael Trizna, Database Specialist. Consortium for the Barcode of Life (CBOL), Smithsonian Institution, Washington, DC. www.barcodeoflife.org; SchindelD@si.edu and TriznaM@si.edu.
Contents of Presentation • Crowd-sourced open source software • How does Data Portal complement BOLD and GenBank? • Data Portal capabilities • Case Study: Smithsonian frozen bird tissue project
An Experiment in Museum Tissue Mining and Fast Data Release • Tissue sampling in winter/spring • Sequencing completed in September • Sequence quality control in October • Taxonomic checking in early November • Obvious errors removed • Minor discrepancies remain • Data released for Adelaide Conference • Crowd-sourced annotation by community • Will data be misused?
Unique Data Portal Capabilities • Creating customized datasets from public and/or your private data • Online library of standard datasets • Support sharing within project teams using Connect IDs, easy link to Working Groups • Running different identification analyses based on different methodologies: • Standard sequence input using FASTA format • Use standard or customized datasets
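The slide above names FASTA as the standard sequence input. As a reminder of what that format looks like, here is a minimal sketch of reading FASTA text into a header-to-sequence mapping. The function name `parse_fasta` and the sample records are illustrative assumptions; this is not the Data Portal's own code or API.

```python
from typing import Dict, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} dictionary.

    A header line starts with '>'; every following line up to the next
    header is sequence data for that record.
    """
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if header in records:
                raise ValueError(f"duplicate FASTA header: {header}")
            records[header] = ""
        else:
            if header is None:
                raise ValueError("sequence data found before the first header")
            # Join wrapped sequence lines and normalize to upper case.
            records[header] += line.upper()
    return records


# Hypothetical input; the record names and bases are placeholders.
example = """>sample_1 hypothetical record
ACGTAC
GTTAGC
>sample_2
ttgaca
"""
records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

Sequences wrapped over several lines are joined, so each record ends up as a single string regardless of how the file was line-wrapped.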
Barcode Aggregator 727,170 public records
Existing Data Analysis Packages • LIST of packages • BLOG • BRONX • Kernel • CAOS • USEARCH • BLAST • Output of identification routines as probabilities of assignment
Data Analysis Methods Session • New packages presented Friday afternoon: • Damon Little: Automatic Plants Barcode pipeline (from raw traces to trimmed/edited sequences) • Ka Hou Chu: Composite Vector Method (profile trees for faster alignment and tree-based analysis) • Alain Franc: Matching Next Generation results to Sanger-based reference records
The USNM Bird Project • USNM Division of Birds frozen tissue collection: • 21,104 specimens, 2,512 species • Which new ones to sample/barcode? • Public records for birds • All public bird COI records: 10,967 • All BARCODE records in GenBank: 8,419 • BARCODE with taxonomic names: 7,965 • BARCODE, name and 2 traces: 2,388
Moving Data Among BOLD, GenBank, Data Portal • USNM Excel Spreadsheet (KE-Emu Source) • BOLD: Split into projects that consist of 2-4 plates • Local database that holds all fields from the original spreadsheet • Data Portal Aggregator database
Creating a ‘Pick List’ • Spreadsheet of tissue samples compared with: • ITIS taxonomy • Clements species list in BOLD • Counts of GenBank and/or public BOLD records • Geographic information • Screenshot of USNM list side-by-side with BOLD records
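One way to picture the pick-list comparison is as a lookup of each species in the tissue spreadsheet against a table of existing public record counts, keeping the species that are missing or poorly covered. The sketch below is an assumption about how such a comparison could be done, not the project's actual procedure; `build_pick_list`, the `max_existing` threshold, and the placeholder species names and counts are all hypothetical.

```python
from typing import Dict, List, Tuple


def build_pick_list(
    tissue_species: List[str],
    public_counts: Dict[str, int],
    max_existing: int = 0,
) -> List[Tuple[str, int]]:
    """Return (species, existing public record count) pairs worth sampling.

    A species qualifies when its public record count is at most
    `max_existing`. Species absent from `public_counts` count as zero.
    Results are sorted by count, then by name.
    """
    picks = []
    for species in sorted(set(tissue_species)):  # one entry per species
        existing = public_counts.get(species, 0)
        if existing <= max_existing:
            picks.append((species, existing))
    picks.sort(key=lambda pair: (pair[1], pair[0]))
    return picks


# Placeholder data, not taken from the USNM spreadsheet.
tissues = ["Genus a", "Genus b", "Genus a", "Genus c"]
counts = {"Genus a": 5, "Genus b": 0}

print(build_pick_list(tissues, counts))
print(build_pick_list(tissues, counts, max_existing=5))
```

Raising `max_existing` widens the list to species that already have a few public records, which may be useful when the goal is broader geographic coverage rather than first barcodes only.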
USNM Bird Dataset • 3,150 tissues sampled • 168 failed sequences • 94 problematic sequences • 166 clustered badly • 2,761 ‘BARCODE-ready’ samples • 1,147 ‘first-BARCODE’ species • 91% increase over 1,259 barcoded species • (3,892 listed in BOLD includes BINs, others)
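The 91% figure on this slide follows from the two species counts given alongside it. A short check, using only the numbers from the slide (the variable names are illustrative):

```python
# Counts taken from the slide above.
previously_barcoded_species = 1259
first_barcode_species = 1147

# Percent increase contributed by species barcoded for the first time.
percent_increase = 100 * first_barcode_species / previously_barcoded_species
print(f"{percent_increase:.0f}% increase")
```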
Two problematic clades, USNM data • Flycatchers: Family Tyrannidae • Sublegatus arenarum, S. modestus, S. obscurior, S. sp. • Conopias parvus, C. albovittatus • Myiarchus ferox, M. swainsoni, M. sp. • Hummingbirds: Family Trochilidae • Phaethornis longuemareus • Inconsistencies within USNM dataset • Incompatibilities with public, other data
What testing dataset to use? • ID trees and analytical routines could use: • All public bird COI records: 10,967 • All BARCODE records in GenBank: 8,419 • BARCODE with taxonomic names: 7,965 • BARCODE, name and 2 traces: 2,388 • Which ones have reliable taxonomic IDs?
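To compare the candidate testing datasets, it can help to express each one as a share of all public bird COI records. The sketch below uses only the counts listed on the slide and assumes each stricter category is a subset of the first, which the slide implies but does not state.

```python
# Record counts as listed on the slide above.
counts = {
    "All public bird COI records": 10967,
    "All BARCODE records in GenBank": 8419,
    "BARCODE with taxonomic names": 7965,
    "BARCODE, name and 2 traces": 2388,
}

baseline = counts["All public bird COI records"]

# Share of the baseline, in percent, rounded to one decimal place.
shares = {label: round(100 * n / baseline, 1) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {counts[label]:,} ({pct}% of public COI records)")
```

The strictest category is a small fraction of the public records, which is the trade-off behind the slide's question: a larger testing dataset versus one with better-documented records.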
Preparing a Data Release Paper • Summary statistics from Data Portal • Figures from BOLD