This paper discusses the integration of large amounts of data into the Rat Genome Database (RGD), with a focus on quality control and data submission tools. It outlines the challenges faced by the limited curation staff and the informatic methods used to address them.
Integration of New Data into RGD: Quality Control and Data Submission Tools
Rat Genome Database, http://rgd.mcw.edu
Bioinformatics Research Center, Medical College of Wisconsin, Milwaukee, USA

RGD Pipeline Background
• RGD is a relatively new model organism database (MOD)
• Needed to integrate large amounts of historic data
• Curation staff is limited
• Pipeline was developed near the beginning of the RGD Project
• Efficient methods to evaluate and integrate data
• Informatic methods were chosen to address the problems
• Catch up with historic data
• Achieve good productivity with limited staff
• Modular design
• New types of data
• New QC checks and methods

Data Sources [diagram: external sources feeding RGD]
• Jackson Labs: Markers, Strains, Genes
• Goteborg (Sweden): Genes, markers, QTLs
• Otsuka (Japan): SSLPs
• EBI (UK): Markers, Primers
• WI/MIT: Markers, Genetic Map
• NIAMS: SSLPs
• NCBI: LocusLink, RefSeq, UniGene, etc.
• U. Iowa: ESTs, RH Map
• Baylor (HGSC): Sequences
• Other labels in the diagram: Literature, Otsuka, MGD, RatMap, MCW, RHdb, MIT, All Objects, RGD 2.0, MCO, ARB, Maps and SSLPs, NCBI, UI, Baylor

Bulk Data Pipeline in the Curation Process [diagram labels: Data Sources (Databases, Websites, Literature); Informatic data mining; Regular Journal Screening and Curation; Internal Data; External Data; Data Pipeline; RGD Database; Ongoing Data Curation]
RGD Objects
• RGD stores information about 11 fundamental data types (Objects)
• Genes
• Strains
• QTLs
• Traits
• Sequences
• ESTs
• Maps
• SSLPs
• References
• Homologs
• Phenotypes

Relationships between RGD Objects
• Genes -> Genes, ESTs, SSLPs, and QTLs
• ESTs -> Genes
• SSLPs -> Genes and Strains
• QTLs -> Genes, Traits, Strains
• Traits -> QTLs
• Maps -> Maps Data
• Maps Data -> any RGD object
• References -> any RGD object
• Homologs -> any RGD object
• Strains -> any RGD object
• Sequences -> any RGD object
• Phenotypes -> any RGD object

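The relationship list above can be read as a small lookup table. The following is a minimal sketch, assuming the arrows are directional as written on the slide; the names `RELATIONSHIPS`, `ANY`, and `can_link` are hypothetical and are not taken from the RGD schema.

```python
# Relationship map transcribed from the slide. ANY stands for
# "any RGD object". All names here are illustrative assumptions.
ANY = "*"

RELATIONSHIPS = {
    "Genes": {"Genes", "ESTs", "SSLPs", "QTLs"},
    "ESTs": {"Genes"},
    "SSLPs": {"Genes", "Strains"},
    "QTLs": {"Genes", "Traits", "Strains"},
    "Traits": {"QTLs"},
    "Maps": {"Maps Data"},
    "Maps Data": {ANY},
    "References": {ANY},
    "Homologs": {ANY},
    "Strains": {ANY},
    "Sequences": {ANY},
    "Phenotypes": {ANY},
}


def can_link(source: str, target: str) -> bool:
    """Return True if the slide lists a source -> target relationship."""
    targets = RELATIONSHIPS.get(source, set())
    return ANY in targets or target in targets
```

For example, `can_link("ESTs", "Genes")` is true, while `can_link("ESTs", "QTLs")` is false because the slide lists only Genes as a target for ESTs.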
Internal Data Sources QC functionality on data entry forms
Curation Annotations Notes Editor
Edit Record in Submission Database
RGD Data Flow [diagram labels: Bulkdata (All objects); Production; Curation; Bulk Data (Production-load); owner_1; Owner_2; Cur_1; dss; Curation data; QC; rgd.mcw.edu; Online Genes, QTLs, Strains; Public System; Internal Systems]
QC checks in the Data Flow
[Diagram labels: Input raw data; Check data; Preload data; Load data; Blasting results; BD Pipeline Database; RGD Database]
• keep all raw data
• format the data
• track all checking flags
• track all loading status
Web-based interface to view all processing status

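One way to picture the four bullets above (keep raw data, format it, track flags, track loading status) is a record that carries all of them together. This is a sketch under stated assumptions, not the pipeline's actual data model: the class `TrackedRecord`, the stage names in `STAGES`, and `parse_line` are hypothetical, and the tab-delimited input is assumed from the deck's mention of tab-delimited object templates.

```python
from dataclasses import dataclass, field

# Stage names follow the slide: input, check, preload, load.
STAGES = ("input", "check", "preload", "load")


@dataclass
class TrackedRecord:
    raw: str                                     # raw input line, kept unchanged
    fields: dict = field(default_factory=dict)   # formatted data
    flags: dict = field(default_factory=dict)    # check name -> flag
    status: str = "input"                        # current processing stage

    def advance(self) -> str:
        """Move to the next stage, staying at 'load' once reached."""
        i = STAGES.index(self.status)
        if i + 1 < len(STAGES):
            self.status = STAGES[i + 1]
        return self.status


def parse_line(raw: str, columns: list) -> TrackedRecord:
    """Format one tab-delimited line while preserving the raw text."""
    values = raw.rstrip("\n").split("\t")
    return TrackedRecord(raw=raw, fields=dict(zip(columns, values)))
```

A status viewer such as the web interface mentioned on the slide would then only need to read `flags` and `status` from each record.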
QC Process Overview
Incoming Dataset
Level One: Integrity Checking
Conflict data files for curation review
Internal checking (blast/symbol)
Blast against RGD database
Level Two: Identity Checking
• Check for identity conflicts
• Check symbol
• Check sequence via GB ID
• Check sequence via BLAST
• Check alias
Preload: check for any attribute conflicts
Level Three: Attribute Checking
Conflict data files for curation review
Curators to review flags
Load: values without conflicts
RGD database

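The attribute-checking level (preload checks for conflicts, load only values without conflicts) can be sketched as a comparison of each incoming attribute against the stored value. This is a minimal illustration; the function name `compare_attributes`, the attribute names, and the choice to treat a value missing from RGD as loadable are all assumptions.

```python
def compare_attributes(new: dict, existing: dict):
    """Split incoming attributes into loadable values and conflicts.

    Assumption: an attribute absent from the existing record is not a
    conflict, so it is treated as loadable.
    """
    loadable, conflicts = {}, {}
    for name, value in new.items():
        current = existing.get(name)
        if current is None or current == value:
            loadable[name] = value
        else:
            # Keep both values so a curator can review the difference.
            conflicts[name] = (value, current)
    return loadable, conflicts
```

With `new = {"chromosome": "5", "expected_size": "150"}` and `existing = {"chromosome": "7"}`, the chromosome would go to the conflict file for curation review and the expected size would be loadable.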
Examples of Checks
• New symbol matches an RGD symbol
• New symbol matches an alias in RGD
• New record has a GBID
• New GBID matches the RGD record
• New GBID matches GBID of alias gene
• New GBID matches any other RGD record
• New Sequence matches any RGD
• Every attribute value compared to RGD values

Review Pipeline’s QC Checks
Excel Summary Report
Conflict Data Report lists the bin ID for data that requires further curation (BLAST/BLAT analysis)
Conflict Data Discovered by the Bulk Data Pipeline
• Nomenclature conflicts
  • Symbols were incorrect
• Sequence conflicts
  • Sequence reads were unacceptable due to poor quality (many N’s)
  • Primers were switched
  • Sequences in the dataset were associated with different objects in RGD
• Alias conflicts
  • Dataset aliases were RGD objects
  • Dataset symbols were in RGD aliases
• Attribute conflicts
  • Chromosomes were different in RGD
  • Cytological positions were different in RGD
  • Expected sizes of PCR products were different in RGD
• Redundant data conflicts
  • Datasets had duplicate entries

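Two of the conflict types above lend themselves to simple automated checks: reads with many N's and duplicate entries. The sketch below is illustrative only. The 10% threshold in `is_poor_quality` is an arbitrary assumption (the slides give no cutoff), and the function names and the `symbol` key are hypothetical.

```python
from collections import Counter


def n_fraction(seq: str) -> float:
    """Fraction of bases reported as N (ambiguous) in a sequence read."""
    if not seq:
        return 1.0  # assumption: an empty read is treated as unusable
    return seq.upper().count("N") / len(seq)


def is_poor_quality(seq: str, max_n_fraction: float = 0.1) -> bool:
    """Flag a read whose N fraction exceeds the (assumed) threshold."""
    return n_fraction(seq) > max_n_fraction


def duplicate_symbols(records: list) -> list:
    """Return symbols that occur more than once in a dataset."""
    counts = Counter(r["symbol"] for r in records)
    return sorted(s for s, c in counts.items() if c > 1)
```

Records flagged this way would still go to curation review rather than being dropped automatically, in line with the manual curation step described in the deck.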
Curation of Conflicting Data
• Checking processes find conflicting data
• Manual curation to resolve conflicts: Nomenclature, Sequence, Alias symbols, Attributes, Redundant records
• Resolvable: Curated data, Load into RGD (Over-write current data)
• Irresolvable: Removed data, Store data in file (Notify source)

Acknowledgements
• Principal Investigators: Howard Jacob, Peter Tonellato, Simon Twigger
• RGD Bioinformatics: Dean Pasko, Jiali Chen, Lan Zhao, Henry Fan, Wenhua Wu, Jian Lu, Hanping Long
• RGD Curation: Mary Shimoyama, Susan Bromberg, Rajni Nigam, Chin-fu Chen, Gopal Gopinathrao, Charles Wang, Victoria Petri, Dorothy Reilly, Cindy Foote, Angela Zuniga-Meyer, Nataliya Nenasheva

Model Organism Bulk Data Processing Work Flow
Case 1
  Case description: New record has an RGD_ID, and both symbol and RGD_ID of new record match an active RGD record
  Expected result: Set symbol flag to “IN_RGD_1”
  Note: The process will continue through the GenBank ID check
Case 2
  Case description: New record has an RGD_ID, and new symbol matches an active RGD symbol, but new RGD_ID does not match the RGD_ID of matching symbol
  Expected result: Set symbol flag to “DIF_RGD_ID”
  Note: The process will continue through the GenBank ID check
Case 3
  Case description: New record has an RGD_ID, but new symbol does not match an active RGD symbol
  Expected result: Set symbol flag to “DIF_SYMBOL”
  Note: The process will continue through the GenBank ID check
Case 4
  Case description: New record does not have an RGD_ID, but new symbol matches an active RGD symbol
  Expected result: Set symbol flag to “IN_RGD_2”
  Note: The process will continue through the GenBank ID check
Case 5
  Case description: New record does not have an RGD_ID and new symbol does not match an active RGD symbol
  Expected result: Set symbol flag to “NEW”
  Note: The process will continue through the GenBank ID check

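Cases 1 through 5 reduce to two questions: does the new record carry an RGD_ID, and does its symbol match an active RGD symbol? The following is a minimal sketch of that decision, not the pipeline's Perl code. The function name `symbol_flag`, the `active` lookup (symbol to RGD_ID), and the exact, case-sensitive symbol match are assumptions.

```python
from typing import Optional


def symbol_flag(symbol: str, rgd_id: Optional[int], active: dict) -> str:
    """Return the symbol flag for cases 1-5.

    `active` maps each active RGD symbol to its RGD_ID (hypothetical
    stand-in for a database lookup).
    """
    matched_id = active.get(symbol)  # None if no active symbol matches
    if rgd_id is not None:
        if matched_id is None:
            return "DIF_SYMBOL"      # case 3
        if matched_id == rgd_id:
            return "IN_RGD_1"        # case 1
        return "DIF_RGD_ID"          # case 2
    if matched_id is not None:
        return "IN_RGD_2"            # case 4
    return "NEW"                     # case 5
```

In every branch the record would then continue to the GenBank ID check, as the notes for these cases state.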
Case 6
  Case description: New record either has or does not have an RGD_ID, and new symbol does not match an active RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, and alias is associated with only one gene
  Expected result: Change symbol flag to “IN_RGD_UPDATED”
  Note: Change current symbol flag after changing symbol (GBID check) and continue through GenBank ID check
Case 7
  Case description: New record either has or does not have an RGD_ID, and new symbol does not match an active RGD symbol; new GBID matches a record in RGD and there is no alias matching that symbol; new symbol matches a retired or withdrawn symbol
  Expected result: Change symbol flag to “DIF_NON_ACTIVE_1”
  Note: Change current symbol flag and continue through GenBank ID check
Case 8
  Case description: New record does not have an RGD_ID and new symbol does not match an active RGD symbol; new record does not have a GBID; new symbol matches a retired or withdrawn symbol
  Expected result: Change symbol flag to “DIF_NON_ACTIVE_2”
  Note: Change current symbol flag and continue through GenBank ID check

Case 9
  Case description: New symbol matches an RGD symbol; there is a GBID in new data but not in RGD; new data GBID or sequence does match another GBID/seq in RGD
  Expected result: Set flag to “DIF_9:RGD_ID”
  Note: This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review. Data without GBID match will be compared by BLAST after PRELOAD step in the pipeline, but NOT loaded until after curation review. RGD_ID is the value that is associated with the GBID/seq in RGD
Case 10
  Case description: New symbol matches an RGD symbol; there is a GBID in new data but not in RGD; new data GBID or sequence does not match any GBID/seq in RGD
  Expected result: Set flag to “DIF_10”
  Note: This case will be compared by BLAST after PRELOAD step in pipeline, but NOT loaded until after curation review.
Case 11
  Case description: New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD but new GBID doesn’t match the GBID of the gene associated with the matching alias
  Expected result: Set flag to “DIF_11”
  Note: This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review

Case 12
  Case description: New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, and alias is associated with only one gene
  Expected result: Set flag to “IN_RGD_2”
  Note: The new symbol is changed to the RGD gene symbol of the gene associated with the matching alias and the data is loaded
Case 13
  Case description: New symbol does not match an RGD symbol; GBID matches a GBID in RGD; symbol matches an alias in alias table in RGD and new GBID matches the GBID of the gene associated with the matching alias, but alias is associated with more than one gene
  Expected result: Set flag to “DIF_13”
  Note: This case will complete the PRELOAD step in the pipeline, but NOT be loaded until after curation review

Cases for GenBank ID check
Bin # | Symb. match | GBID match specific RGD record | GBID in new file | GBID in RGD | GBID match any RGD | Seq match any RGD (BLAST) | Symb. match alias | GBID match GBID of alias gene | Alias of more than one gene | Flag | Symbol/GBID/Alias
1 | yes | -- | no | yes | -- | -- | -- | -- | -- | DIF_1 | New: A/-; RGD: A/1
2 | yes | yes | yes | yes | -- | -- | -- | -- | -- | IN_RGD_1 | New: A/1; RGD: A/1
3 | yes | no | yes | yes | no | no | -- | -- | -- | DIF_3 | New: A/1; RGD: A/2 or –; RGD: B/1
4 | yes | no | yes | yes | yes or no | yes | -- | -- | -- | DIF_4:RGD_ID | New: A/1; RGD: A/2; RGD: B/1
5 | no | -- | no | -- | -- | -- | -- | -- | -- | DIF_5 | New: A/-; RGD: B/2 or -
6 | no | -- | yes | -- | no | no | -- | -- | -- | NEW | New: A/1; RGD: B/2 or -
7 | no | -- | yes | yes | yes or no | yes | no | -- | -- | DIF_7:RGD_ID | New: A/1/C; RGD: B/1
8 | yes | -- | no | no | -- | -- | -- | | | DIF_8 | New: A/-; RGD: A/-
9 | yes | -- | yes | no | yes or no | yes | -- | -- | -- | DIF_9:RGD_ID | New: A/1; RGD: A/-
10 | yes | -- | yes | no | no | no | -- | -- | -- | DIF_10 | New: A/1; RGD: A/-
11 | no | -- | yes | -- | yes | -- | yes | no | -- | DIF_11 | New: A/1; RGD: B/2/A
12 | no | -- | yes | -- | yes | -- | yes | yes | no | DIF_12 | New: A/1; RGD: B/1/A
13 | no | -- | yes | -- | yes | -- | yes | yes | yes | DIF_13 | New: A/1; RGD: B/1/A; RGD: C/2/A

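The bin table above behaves like a decision table: each row fixes some yes/no answers and leaves others open. A table-driven sketch follows, assuming that "--", "yes or no", and the two cells missing from bin 8 all mean "not checked". The column names, `BINS`, and `classify` are hypothetical, and combinations that the table does not list return `None` rather than a guessed flag.

```python
# True/False stand for yes/no; None means the cell is not checked.
COLUMNS = (
    "symbol_match", "gbid_match_record", "gbid_in_new", "gbid_in_rgd",
    "gbid_match_any", "seq_match_any", "symbol_match_alias",
    "gbid_match_alias_gene", "alias_multi_gene",
)
Y, N, _ = True, False, None

# One entry per bin: (bin number, required answers, flag).
BINS = [
    (1,  (Y, _, N, Y, _, _, _, _, _), "DIF_1"),
    (2,  (Y, Y, Y, Y, _, _, _, _, _), "IN_RGD_1"),
    (3,  (Y, N, Y, Y, N, N, _, _, _), "DIF_3"),
    (4,  (Y, N, Y, Y, _, Y, _, _, _), "DIF_4:RGD_ID"),
    (5,  (N, _, N, _, _, _, _, _, _), "DIF_5"),
    (6,  (N, _, Y, _, N, N, _, _, _), "NEW"),
    (7,  (N, _, Y, Y, _, Y, N, _, _), "DIF_7:RGD_ID"),
    (8,  (Y, _, N, N, _, _, _, _, _), "DIF_8"),
    (9,  (Y, _, Y, N, _, Y, _, _, _), "DIF_9:RGD_ID"),
    (10, (Y, _, Y, N, N, N, _, _, _), "DIF_10"),
    (11, (N, _, Y, _, Y, _, Y, N, _), "DIF_11"),
    (12, (N, _, Y, _, Y, _, Y, Y, N), "DIF_12"),
    (13, (N, _, Y, _, Y, _, Y, Y, Y), "DIF_13"),
]


def classify(observed: dict):
    """Return (bin number, flag) for the first matching row, else None."""
    for bin_no, row, flag in BINS:
        required = {c: v for c, v in zip(COLUMNS, row) if v is not None}
        if all(observed.get(c) == v for c, v in required.items()):
            return bin_no, flag
    return None
```

Keeping the rows as data means a new bin or QC check can be added by appending a row, which fits the modular design the deck describes.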
Complete Bulkdata Pipeline Process Diagram (C. Fan): Input data, Check data, Preload data, Load data
Database Object Relationships [diagram labels: genes, sslps, strains, maps, qtls, references, phenotypes, diseases, ESTs, sequences, homologs, traits]
RGD Schema Diagram • 54 Tables • 10 Views
RGD Database Technologies
• Platforms
  • Database server: Oracle 8.1.6
  • Sun Solaris 2.8 Unix operating system
  • Sun Enterprise 450’s
• Programming Language
  • Perl 5
• Object-oriented Methodology
  • Database - object based schema
  • Perl modules – object based and globally used across systems
    • DB.pm module
    • PRELOAD.pm module
    • LOAD.pm module
• Schema Documentation
  • Rational Rose 2000 Enterprise

Review Quality Control Reports
RGD Data Flow [diagram labels: Templates (Homologs, Strains, Genes, QTLs, SSLPs, ESTs, Map Data); Object Templates (Text - tab delimited); Development; dorado; Bulk Data (Test-data); dev_1; Bulkdata (All objects); 1st Production; alps; owner_1; Owner_2; Curation; 2nd; fuxi; Bulk Data (Production-load); Cur_1; dss; Curation data; Modify flags; rgd.mcw.edu; Public System; Online Strains, References, Nomenclature, Gene editing, Ontologies (rgdtogo.txt), Notes; Online Genes, QTLs, Strains; Internal Systems]
Check for LocusLink, Swiss-Prot, RatMap IDs
• New symbol matches RGD symbol
• LL/SP/RM_ID in new file
• LL/SP/RM_ID in specific RGD record
• LL/SP/RM_ID matches specific RGD record
• LL/SP/RM_ID matches any RGD
Bin Number