BARCODE SEQUENCE DATAFLOW INTO GENBANK
Ilene Mizrachi
November 30, 2011
Fourth International Barcode of Life Conference
Barcode Project - 2003 and beyond
• The Barcode of Life project was initiated in 2003
• INSDC would be the repository for raw and assembled sequence data
• INSDC adopted new source fields to accommodate Barcode metadata requirements
• The Barcode of Life Database (BOLD) was established as a community workbench and sequencing center
What is a Barcode?
• A global reference library of DNA barcode sequences that is integrated with other systems of biodiversity information (e.g., databases of specimens, species, biogeographic information)
• A mechanism to link DNA sequences to vouchered specimens and valid species names
• A reserved BARCODE keyword was adopted for data that met strict barcode standards
Barcode Standard
• Formally described species name, or a provisional label for an unpublished species
• Voucher specimen identifier, preferably in a biorepository, using a structured field
• Country code using the controlled vocabulary used by GenBank
• Sequence from a gene region specified by CBOL
  • COI for animals
  • matK and rbcL for plants
  • ITS for fungi
• Sequence contains at least 75% contiguous, high-quality bases from within the approved region
• Electropherogram trace files for bidirectional sequencing runs
• Sequences of all forward and reverse primers
• Strongly recommended data elements
  • GPS coordinates
  • Name of the identifier
  • Name of the collector
  • Date of collection
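The required and recommended elements above lend themselves to a simple compliance check. The following is a minimal sketch only, not GenBank's actual validator: the record is assumed to be a plain dictionary, and every field name (`species_name`, `voucher_id`, and so on) is a hypothetical label invented for illustration. Only the marker-to-group mapping is taken from the slide.

```python
# Hypothetical field names: the slide lists the data elements but not a schema.
REQUIRED_FIELDS = (
    "species_name", "voucher_id", "country", "gene", "trace_files", "primers",
)
RECOMMENDED_FIELDS = (
    "lat_lon", "identified_by", "collected_by", "collection_date",
)

# Approved marker regions per organism group, as listed on the slide.
APPROVED_GENES = {
    "animal": {"COI"},
    "plant": {"matK", "rbcL"},
    "fungus": {"ITS"},
}


def check_compliance(record, group):
    """Return (errors, warnings) for one submission record (a dict)."""
    errors = [
        f"missing required element: {name}"
        for name in REQUIRED_FIELDS
        if not record.get(name)
    ]
    gene = record.get("gene")
    if gene and gene not in APPROVED_GENES.get(group, set()):
        errors.append(f"gene {gene!r} is not an approved region for {group}")
    warnings = [
        f"missing recommended element: {name}"
        for name in RECOMMENDED_FIELDS
        if not record.get(name)
    ]
    return errors, warnings
```

Missing required elements are reported as errors and missing recommended elements as warnings, mirroring the slide's distinction between mandatory and strongly recommended data. The 75% high-quality-base rule is not covered here because the slide does not define the quality threshold.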
QA checks at GenBank
To ensure that the sequence data are of high quality, the following checks are run:
• Barcode data element compliance
• Consistency checks, such as:
  • reported latitude-longitude falls within the cited country
  • collection date is not in the future
• Sequence quality checks
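The two consistency checks named above can be sketched as follows. This is an illustration under stated assumptions rather than a description of GenBank's implementation: the country test uses a caller-supplied bounding box (a real check would more plausibly use country polygons), and the function names and the `boxes` mapping are hypothetical.

```python
from datetime import date


def lat_lon_in_country(lat, lon, country, boxes):
    """Check a coordinate against a rough bounding box for the cited country.

    `boxes` maps country name -> (min_lat, max_lat, min_lon, max_lon).
    Returns None when no box is known, so the caller can flag the record
    for manual review instead of rejecting it.
    """
    box = boxes.get(country)
    if box is None:
        return None
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def collection_date_ok(collected, today=None):
    """A collection date is plausible only if it is not in the future."""
    if today is None:
        today = date.today()
    return collected <= today
```

Passing `today` explicitly keeps the date check deterministic in tests. Returning `None` for an unknown country separates "cannot be judged" from "inconsistent".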
Checking Sequence Quality
• Trim primer sequences
• Check congruence between forward and reverse reads
• Align sequences to check for gaps
• Translate sequences to check for internal stop codons
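Three of these steps (primer trimming, forward/reverse congruence, and the internal-stop check) can be sketched in plain Python. This is a simplified illustration, not the pipeline GenBank runs: primers are matched exactly at the read ends, the two reads are assumed to cover the same span, and the default stop-codon set contains only TAA and TAG, which is an assumption. The appropriate set depends on the genetic code of the organism and marker, so it is left as a parameter. All function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_primers(seq, fwd_primer, rev_primer):
    """Strip exact-match primers from both ends of an amplicon sequence."""
    seq = seq.upper()
    fwd = fwd_primer.upper()
    # The reverse primer appears reverse-complemented at the 3' end.
    rev_rc = reverse_complement(rev_primer.upper())
    if fwd and seq.startswith(fwd):
        seq = seq[len(fwd):]
    if rev_rc and seq.endswith(rev_rc):
        seq = seq[:-len(rev_rc)]
    return seq


def reads_agree(forward_read, reverse_read):
    """True when the reverse read, reverse-complemented, equals the forward read."""
    return forward_read.upper() == reverse_complement(reverse_read).upper()


def internal_stops(seq, frame=0, stop_codons=frozenset({"TAA", "TAG"})):
    """Return codon indices of stop codons that are not the final codon."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in stop_codons]
```

A real pipeline would tolerate mismatches when locating primers and would compare reads through an alignment rather than strict equality. The exact-match versions here only show the shape of each check.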
Updates Are Critical
• Primary data repository: sequence records are owned by the submitter
• Submitter is responsible for providing additional data and metadata as they become available:
  • Publication
  • Sequence
  • Taxonomy
  • Voucher
• Third-party updates are welcome!
Challenges
• If Reference Barcodes are to be used for species identification, phylogenetics, ecological forensics, conservation, and macro-analysis of biodiversity patterns, then the minimal requirements should be (a) a high-quality sequence, (b) a link to a specimen, and (c) a taxonomic identification
• Need to support rapid data release, including preliminary taxonomic classifications, similar to the “Fort Lauderdale Principles” of the genomics community
• Data are updated asynchronously at BOLD and in GenBank; work on the update channel needs to continue
• Need to work with communities to devise strict QA tests for plant and fungal Barcodes
Acknowledgements
• Taxonomy Group
  • Scott Federhen
  • Conrad Schoch
  • Lu Sun
  • Carol Hotton
  • Detlef Leipe
• GenBank Group
  • Susan Schafer
  • Michael Fetchko
• Software Support
  • Colleen Bollin
  • Kamen Todorov
  • Vasuki Gobu