Integration of New Data into RGD: Quality Control and Data Submission Tools

This presentation describes how large amounts of data are integrated into the Rat Genome Database (RGD), with a focus on quality control and data submission tools. It outlines the challenges posed by a limited curation staff and the informatics methods used to address them.

mmontagna

Presentation Transcript


  1. Integration of New Data into RGD: Quality Control and Data Submission Tools. Rat Genome Database (http://rgd.mcw.edu), Bioinformatics Research Center, Medical College of Wisconsin, Milwaukee, USA

  2. RGD Pipeline Background
     • RGD is a relatively new MOD
     • Needed to integrate large amounts of historic data
     • Curation staff is limited
     • Developed near beginning of RGD Project
     • Efficient methods to evaluate and integrate data
     • Informatic methods were chosen to address the problems
     • Catch up with historic data
     • Achieve good productivity with limited staff
     • Modular design
     • New types of data
     • New QC checks and methods

  3. Data Sources (diagram of sources feeding RGD)
     • Jackson Labs: Markers, Strains, Genes
     • Goteborg (Sweden): Genes, markers, QTLs
     • Otsuka (Japan): SSLPs
     • EBI (UK): Markers, Primers
     • WI/MIT: Markers, Genetic Map
     • ARB: Maps and SSLPs
     • NIAMS: SSLPs
     • NCBI: LocusLink, RefSeq, UniGene, etc.
     • U. Iowa: ESTs, RH Map
     • Baylor (HGSC): Sequences
     • Other labels in the diagram: Literature, Otsuka, MGD, RatMap, MCW, RHdb, MIT, All Objects, RGD 2.0, MCO, NCBI, UI, Baylor

  4. Bulk Data Pipeline in the Curation Process (diagram labels: Databases; Websites; Literature; Informatic data mining; Regular Journal Screening and Curation; Data Sources; Internal Data; External Data; Data Pipeline; RGD Database; Ongoing Data Curation)

  5. RGD Objects
     RGD stores information about 11 fundamental data types (Objects):
     • Genes
     • Strains
     • QTLs
     • Traits
     • Sequences
     • ESTs
     • Maps
     • SSLPs
     • References
     • Homologs
     • Phenotypes

  6. Relationships between RGD Objects
     • Genes -> Genes, ESTs, SSLPs, and QTLs
     • ESTs -> Genes
     • SSLPs -> Genes and Strains
     • QTLs -> Genes, Traits, Strains
     • Traits -> QTLs
     • Maps -> Maps Data
     • Maps Data -> any RGD object
     • References -> any RGD object
     • Homologs -> any RGD object
     • Strains -> any RGD object
     • Sequences -> any RGD object
     • Phenotypes -> any RGD object
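
The relationship list on this slide can be read as a small lookup table. The sketch below is only an illustration in Python (the pipeline itself was written in Perl); the names `RELATIONSHIPS` and `can_link` are hypothetical, and "*" is used here to stand for "any RGD object".

```python
# Hypothetical encoding of the slide's relationship list.
# "*" stands for "any RGD object".
RELATIONSHIPS = {
    "Genes": {"Genes", "ESTs", "SSLPs", "QTLs"},
    "ESTs": {"Genes"},
    "SSLPs": {"Genes", "Strains"},
    "QTLs": {"Genes", "Traits", "Strains"},
    "Traits": {"QTLs"},
    "Maps": {"Maps Data"},
    "Maps Data": {"*"},
    "References": {"*"},
    "Homologs": {"*"},
    "Strains": {"*"},
    "Sequences": {"*"},
    "Phenotypes": {"*"},
}


def can_link(source: str, target: str) -> bool:
    """Return True if the slide lists a relationship from source to target."""
    allowed = RELATIONSHIPS.get(source, set())
    return "*" in allowed or target in allowed
```

For example, `can_link("Genes", "QTLs")` is true, while `can_link("ESTs", "Strains")` is false because the slide lists only ESTs -> Genes.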

  7. RGD object Templates

  8. Internal Data Sources: QC functionality on data entry forms

  9. Curation Annotations Notes Editor

  10. Data Entry Summary Page

  11. Edit Record in Submission Database

  12. RGD Data Flow (diagram labels: Internal Systems; Bulkdata, All objects; Production; Curation; Bulk Data (Production-load); owner_1; Owner_2; Cur_1; dss; Curation data; QC (three check points); Public System; rgd.mcw.edu; Online: Genes, QTLs, Strains)

  13. QC checks in the Data Flow
      Diagram labels: Blasting results; Input raw data; BD Pipeline Database; Check data; RGD Database; Preload data; Load data
      • keep all raw data
      • format the data
      • track all checking flags
      • track all loading status
      Web-based interface to view all processing status
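
One way to picture the four bullet points above is a record that carries its raw text, its formatted fields, its checking flags and its loading status through the pipeline. This is a minimal Python sketch, not the RGD implementation: the class `PipelineRecord`, the helper functions and the status names in `STEPS` are all hypothetical, and tab-delimited input is assumed.

```python
from dataclasses import dataclass, field

# Hypothetical status names modelled on the input, check, preload and load steps.
STEPS = ("INPUT", "CHECKED", "PRELOADED", "LOADED")


@dataclass
class PipelineRecord:
    raw_line: str                                # raw data is kept untouched
    fields: dict = field(default_factory=dict)   # formatted data
    flags: dict = field(default_factory=dict)    # checking flags, keyed by check name
    status: str = STEPS[0]                       # loading status


def format_record(raw_line, columns):
    """Split a tab-delimited line into named fields, keeping the raw text."""
    values = raw_line.rstrip("\n").split("\t")
    return PipelineRecord(raw_line=raw_line, fields=dict(zip(columns, values)))


def advance(record):
    """Move the record to the next step, stopping at the final one."""
    position = STEPS.index(record.status)
    record.status = STEPS[min(position + 1, len(STEPS) - 1)]
    return record.status
```

Keeping the raw line next to the parsed fields means a curator can always compare what was submitted with what the pipeline made of it.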

  14. QC Process Overview
      Incoming dataset
      • Level One: Integrity Checking. Internal checking (blast/symbol); conflict data files for curation review
      • Level Two: Identity Checking. Blast against RGD database; check for identity conflicts: check symbol, check sequence via GB ID, check sequence via BLAST, check alias
      • Level Three: Attribute Checking. Preload: check for any attribute conflicts; conflict data files for curation review; curators to review flags
      • Load: values without conflicts are loaded into the RGD database
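
The staged process on this slide can be sketched as a loop that sends each record through the checks in order and sets it aside for curation review at the first conflict. The Python below is an illustration under that reading of the diagram; `run_qc` and the shape of the check functions are assumptions, not the pipeline's actual interface.

```python
def run_qc(records, checks):
    """Run ordered checks on each record.

    checks is an ordered list of (level_name, function) pairs; each function
    returns a conflict message, or None when the record passes that level.
    Returns (records ready to load, conflicts for curation review).
    """
    to_load, for_review = [], []
    for record in records:
        for level, check in checks:
            conflict = check(record)
            if conflict is not None:
                # Stop at the first conflict and hand the record to curators.
                for_review.append(
                    {"level": level, "conflict": conflict, "record": record}
                )
                break
        else:
            # No level reported a conflict, so the record can be loaded.
            to_load.append(record)
    return to_load, for_review
```

A caller would pass three functions, one each for integrity, identity and attribute checking, and write the second return value to the conflict files that curators review.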

  15. Examples of Checks
      • New symbol matches an RGD symbol
      • New symbol matches an alias in RGD
      • New record has a GBID
      • New GBID matches the RGD record
      • New GBID matches GBID of alias gene
      • New GBID matches any other RGD record
      • New Sequence matches any RGD
      • Every attribute value compared to RGD values
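
The last check in the list, comparing every attribute value with the value already in RGD, could look roughly like the following Python sketch. The function name and the attribute names in the usage note are hypothetical, and treating a missing value on either side as "no conflict" is an assumption made for this example.

```python
def attribute_conflicts(new_values, rgd_values):
    """Compare incoming attribute values with those stored in RGD.

    Returns {attribute: (new_value, rgd_value)} for every attribute where
    both sides have a value and the values differ.
    """
    conflicts = {}
    for name, new_value in new_values.items():
        rgd_value = rgd_values.get(name)
        if new_value is None or rgd_value is None:
            continue  # assumption: a missing value is not a conflict
        if new_value != rgd_value:
            conflicts[name] = (new_value, rgd_value)
    return conflicts
```

Calling it with `{"chromosome": "1", "expected_size": 150}` against an RGD record holding `{"chromosome": "2", "expected_size": 150}` reports only the chromosome.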

  16. Review Pipeline’s QC Checks

  17. Review Conflicts

  18. Excel Summary Report: the Conflict Data Report lists the bin ID for data that requires further curation (BLAST/BLAT analysis)

  19. Conflict Data Discovered by the Bulk Data Pipeline
      • Nomenclature conflicts
        • Symbols were incorrect
      • Sequence conflicts
        • Sequence reads were unacceptable due to poor quality (many N’s)
        • Primers were switched
        • Sequences in the dataset were associated with different objects in RGD
      • Alias conflicts
        • Dataset aliases were RGD objects
        • Dataset symbols were in RGD aliases
      • Attribute conflicts
        • Chromosomes were different in RGD
        • Cytological positions were different in RGD
        • Expected sizes of PCR products were different in RGD
      • Redundant data conflicts
        • Datasets had duplicate entries
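
Two of the conflict types above lend themselves to simple automated screens: reads with many N’s and duplicate entries within a dataset. The Python sketch below illustrates both. The function names are hypothetical, and the 10% cutoff for N’s is an assumed value chosen for the example; the slides do not state the threshold that was used.

```python
from collections import Counter


def n_fraction(sequence):
    """Fraction of bases in the sequence that are the ambiguous base N."""
    sequence = sequence.upper()
    return sequence.count("N") / len(sequence) if sequence else 0.0


def is_poor_quality(sequence, max_n_fraction=0.1):
    """Flag a read whose share of N's exceeds the (assumed) cutoff."""
    return n_fraction(sequence) > max_n_fraction


def duplicate_symbols(symbols):
    """Return the symbols that occur more than once in a dataset, sorted."""
    counts = Counter(symbols)
    return sorted(symbol for symbol, count in counts.items() if count > 1)
```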

  20. Curation of Conflicting Data
      • Checking processes find conflicting data
      • Manual curation to resolve conflicts: nomenclature, sequence, alias symbols, attributes, redundant records
      • Resolvable: curated data, load into RGD (over-write current data)
      • Irresolvable: removed data, store data in file (notify source)

  21. After Load

  22. Acknowledgements
      • Principal Investigators: Howard Jacob, Peter Tonellato, Simon Twigger
      • RGD Bioinformatics: Dean Pasko, Jiali Chen, Lan Zhao, Henry Fan, Wenhua Wu, Jian Lu, Hanping Long
      • RGD Curation: Mary Shimoyama, Susan Bromberg, Rajni Nigam, Chin-fu Chen, Gopal Gopinathrao, Charles Wang, Victoria Petri, Dorothy Reilly, Cindy Foote, Angela Zuniga-Meyer, Nataliya Nenasheva

  23. Model Organism Bulk Data Processing Work Flow

  24. Cases (columns: Case Number, Case Description, Expected Result, Note)
      • Case 1. Description: New record has an RGD_ID, and both symbol and RGD_ID of new record match an active RGD record. Expected result: Set symbol flag to “IN_RGD_1”. Note: The process will continue through the GenBank ID check.
      • Case 2. Description: New record has an RGD_ID, and new symbol matches an active RGD symbol, but new RGD_ID does not match the RGD_ID of matching symbol. Expected result: Set symbol flag to “DIF_RGD_ID”. Note: The process will continue through the GenBank ID check.
      • Case 3. Description: New record has an RGD_ID, but new symbol does not match an active RGD symbol. Expected result: Set symbol flag to “DIF_SYMBOL”. Note: The process will continue through the GenBank ID check.
      • Case 4. Description: New record does not have an RGD_ID, but new symbol matches an active RGD symbol. Expected result: Set symbol flag to “IN_RGD_2”. Note: The process will continue through the GenBank ID check.
      • Case 5. Description: New record does not have an RGD_ID and new symbol does not match an active RGD symbol. Expected result: Set symbol flag to “NEW”. Note: The process will continue through the GenBank ID check.
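
Cases 1 to 5 form a small decision table, which the Python sketch below follows case by case. It is an illustration only: the function name `symbol_flag`, the dictionary of active symbols and exact, case-sensitive symbol matching are assumptions, while the flag names are taken from the slide.

```python
def symbol_flag(new_symbol, new_rgd_id, active_symbols):
    """Return the symbol flag for cases 1 to 5.

    new_rgd_id is None when the new record carries no RGD_ID.
    active_symbols maps each active RGD symbol to its RGD_ID (assumed layout).
    """
    matched_id = active_symbols.get(new_symbol)
    if new_rgd_id is not None:
        if matched_id is None:
            return "DIF_SYMBOL"   # case 3: symbol is not an active RGD symbol
        if matched_id == new_rgd_id:
            return "IN_RGD_1"     # case 1: symbol and RGD_ID both match
        return "DIF_RGD_ID"       # case 2: symbol matches, RGD_ID differs
    if matched_id is not None:
        return "IN_RGD_2"         # case 4: no RGD_ID, symbol matches
    return "NEW"                  # case 5: no RGD_ID, symbol unknown
```

In every one of these cases the record then continues to the GenBank ID check, so the flag records what was found rather than stopping the pipeline.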

  25. Cases, continued (columns: Case Number, Case Description, Expected Result, Note)
      • Case 6. Description: New record either has or does not have an RGD_ID, and new symbol does not match an active RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, and alias is associated with only one gene. Expected result: Change symbol flag to “IN_RGD_UPDATED”. Note: Change current symbol flag after changing symbol (GBID check) and continue through GenBank ID check.
      • Case 7. Description: New record either has or does not have an RGD_ID, and new symbol does not match an active RGD symbol; new GBID matches a record in RGD and there is no alias matching that symbol; new symbol matches a retired or withdrawn symbol. Expected result: Change symbol flag to “DIF_NON_ACTIVE_1”. Note: Change current symbol flag and continue through GenBank ID check.
      • Case 8. Description: New record does not have an RGD_ID and new symbol does not match an active RGD symbol; new record does not have a GBID; new symbol matches a retired or withdrawn symbol. Expected result: Change symbol flag to “DIF_NON_ACTIVE_2”. Note: Change current symbol flag and continue through GenBank ID check.

  26. Cases, continued
      • Case 9. Description: New symbol matches an RGD symbol; there is a GBID in new data but not in RGD; new data GBID or sequence does match another GBID/seq in RGD. Expected result: Set flag to “DIF_9:RGD_ID”. Note: This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review. Data without GBID match will be compared by BLAST after PRELOAD step in the pipeline, but NOT loaded until after curation review. RGD_ID is the value that is associated with the GBID/seq in RGD.
      • Case 10. Description: New symbol matches an RGD symbol; there is a GBID in new data but not in RGD; new data GBID or sequence does not match any GBID/seq in RGD. Expected result: Set flag to “DIF_10”. Note: This case will be compared by BLAST after PRELOAD step in pipeline, but NOT loaded until after curation review.
      • Case 11. Description: New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD but new GBID doesn’t match the GBID of the gene associated with the matching alias. Expected result: Set flag to “DIF_11”. Note: This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review.

  27. New Check Aliases Use Case Diagram C. Fan

  28. New Check Gene Symbol Use Case Diagram

  32. Cases, continued
      • Case 12. Description: New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, and alias is associated with only one gene. Expected result: Set flag to “IN_RGD_2”. Note: The new symbol is changed to the RGD gene symbol of the gene associated with the matching alias and the data is loaded.
      • Case 13. Description: New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, but alias is associated with more than one gene. Expected result: Set flag to “DIF_13”. Note: This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review.
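
The alias cases (11, 12 and 13) differ only in whether the new GBID belongs to a gene behind the alias and in how many genes share that alias. The Python sketch below follows the case descriptions and uses the flag names given there. The function name and the two lookup dictionaries are hypothetical, and the sketch assumes the caller has already established that the new symbol is not an active RGD symbol and that the new GBID exists somewhere in RGD.

```python
def alias_flag(new_symbol, new_gbid, alias_to_genes, gene_gbids):
    """Return (flag, symbol_to_use) for the alias cases 11 to 13.

    alias_to_genes maps an alias to the list of gene symbols it is attached to.
    gene_gbids maps a gene symbol to the set of its GenBank IDs.
    Both layouts are assumptions made for this example.
    """
    genes = alias_to_genes.get(new_symbol, [])
    if not genes:
        return (None, new_symbol)       # symbol is not an alias: other cases apply
    matching = [g for g in genes if new_gbid in gene_gbids.get(g, set())]
    if not matching:
        return ("DIF_11", new_symbol)   # case 11: GBID belongs to a different gene
    if len(genes) > 1:
        return ("DIF_13", new_symbol)   # case 13: alias shared by several genes
    return ("IN_RGD_2", matching[0])    # case 12: switch to the RGD gene symbol
```

Only case 12 changes the symbol and allows the load; the other two outcomes wait for curation review.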

  33. Cases for GenBank ID check

      | Bin # | Symb. match | GBID match specific RGD record | GBID in new file | GBID in RGD | GBID match any RGD | Seq match any RGD (BLAST) | Symb. match alias | GBID match GBID of alias gene | Alias of more than one gene | Flag | Symbol/GBID/Alias |
      |---|---|---|---|---|---|---|---|---|---|---|---|
      | 1 | yes | -- | no | yes | -- | -- | -- | -- | -- | DIF_1 | New: A/-; RGD: A/1 |
      | 2 | yes | yes | yes | yes | -- | -- | -- | -- | -- | IN_RGD_1 | New: A/1; RGD: A/1 |
      | 3 | yes | no | yes | yes | no | no | -- | -- | -- | DIF_3 | New: A/1; RGD: A/2 or -; RGD: B/1 |
      | 4 | yes | no | yes | yes | yes or no | yes | -- | -- | -- | DIF_4:RGD_ID | New: A/1; RGD: A/2; RGD: B/1 |
      | 5 | no | -- | no | -- | -- | -- | -- | -- | -- | DIF_5 | New: A/-; RGD: B/2 or - |
      | 6 | no | -- | yes | -- | no | no | -- | -- | -- | NEW | New: A/1; RGD: B/2 or - |
      | 7 | no | -- | yes | yes | yes or no | yes | no | -- | -- | DIF_7:RGD_ID | New: A/1/C; RGD: B/1 |
      | 8 | yes | -- | no | no | -- | -- | -- | | | DIF_8 | New: A/-; RGD: A/- |
      | 9 | yes | -- | yes | no | yes or no | yes | -- | -- | -- | DIF_9:RGD_ID | New: A/1; RGD: A/- |
      | 10 | yes | -- | yes | no | no | no | -- | -- | -- | DIF_10 | New: A/1; RGD: A/- |
      | 11 | no | -- | yes | -- | yes | -- | yes | no | -- | DIF_11 | New: A/1; RGD: B/2/A |
      | 12 | no | -- | yes | -- | yes | -- | yes | yes | no | DIF_12 | New: A/1; RGD: B/1/A |
      | 13 | no | -- | yes | -- | yes | -- | yes | yes | yes | DIF_13 | New: A/1; RGD: B/1/A; RGD: C/2/A |

  34. Complete Bulkdata Pipeline Process Diagram (C. Fan): Input data, Check data, Preload data, Load data

  35. Database Object Relationships (diagram labels: genes, sslps, strains, maps, qtls, references, phenotypes, diseases, ESTs, sequences, homologs, traits)

  36. RGD Schema Diagram
      • 54 Tables
      • 10 Views

  37. RGD Schema Word Document

  38. RGD Database Technologies
      • Platforms
        • Database server: Oracle 8.1.6
        • Sun Solaris 2.8 Unix operating system
        • Sun Enterprise 450’s
      • Programming Language
        • Perl 5
      • Object-oriented Methodology
        • Database: object based schema
        • Perl modules: object based and globally used across systems
          • DB.pm module
          • PRELOAD.pm module
          • LOAD.pm module
      • Schema Documentation
        • Rational Rose 2000 Enterprise

  39. Bulk Data Database Schema

  40. Review Quality Control Reports

  41. Review

  42. Review

  43. Review

  44. Validation

  45. RGD Data Flow (diagram labels: Templates: Homologs, Strains, Genes, QTLs, SSLPs, ESTs, Map Data; Object Templates, Text - tab delimited; Internal Systems; Development (1st): dorado, Bulk Data (Test-data), dev_1, Modify flags; Production (2nd): alps, Bulkdata, All objects, owner_1, Owner_2; Curation: fuxi, Bulk Data (Production-load), Cur_1, dss, Curation data, Modify flags; Online: Strains, References, Nomenclature, Gene editing, Ontologies (rgdtogo.txt), Notes; Public System: rgd.mcw.edu, Online: Genes, QTLs, Strains)

  46. Blast Result Scenarios

  47. Check for LocusLink, Swiss-Prot, RatMap IDs (table columns)
      • Bin Number
      • New symbol matches RGD symbol
      • LL/SP/RM_ID in new file
      • LL/SP/RM_ID in specific RGD record
      • LL/SP/RM_ID matches specific RGD record
      • LL/SP/RM_ID matches any RGD
