GO Production Report. GOC meeting 7/24/07, Princeton, NJ.
Hardware and sysadmin
• Deployed 4 new Linux machines:
  • 1 for loading
  • 2 for production AmiGO
  • 1 for development AmiGO
• Production AmiGO is now much more fault resistant: it no longer shares web server and database CPU with SGD
• Minor improvements to the pipeline
GO Database
• Brought production code up to date with CVS
• Improved sequence loading speed and accuracy
• Final testing of “bulk” loading; predicted improvement (associations only):
  • golite: 24 hr -> 16 hr
  • gofull: 11 days -> 4 days
godb sequences
• Loading into the database improved by better batching (and is faster: ~4 hr for golite)
• Created monthly reports for gp2protein files (220K gene products have only 97K sequences)
• Can possibly save another 3 hr of loading time for golite, and 24 hr for gofull
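The slide does not say how the batching is implemented. As a rough illustration of the general idea only, the sketch below inserts records in fixed-size batches with one transaction per batch instead of one per row. The table name `seq`, its columns, and the batch size are hypothetical, and SQLite is used only so the sketch runs standalone; it is not the production loader.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_sequences(conn, records, batch_size=1000):
    """Insert (display_id, residues) pairs, committing once per batch."""
    total = 0
    for chunk in batched(records, batch_size):
        with conn:  # opens a transaction and commits it when the block exits
            conn.executemany(
                "INSERT INTO seq (display_id, residues) VALUES (?, ?)", chunk
            )
        total += len(chunk)
    return total


# Hypothetical usage with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seq (display_id TEXT, residues TEXT)")
loaded = load_sequences(conn, [("P%d" % i, "MKV") for i in range(2500)])
```

Committing per batch rather than per row is the usual reason batched loading is faster, since each commit carries a fixed cost.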
gp2protein report
• The monthly email reports 3 numbers:
  • entries in gp2protein with no associations (or IEA only, just FYI)
  • entries in gp2protein for which the sequence could not be retrieved (obsolete IDs?)
  • gene products in the association file without protein sequences
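The three numbers in the report can be thought of as set comparisons between the gp2protein mapping, the association file, and the set of retrievable sequences. The sketch below is one possible reading, not the actual report script: the function name, the input shapes, and the example identifiers are all assumptions, and the third count is taken to include gene products that have no gp2protein entry at all.

```python
def gp2protein_report(gp2protein, associations, sequences):
    """Count the categories described on the slide.

    gp2protein:   dict of gene product id -> protein sequence id
    associations: dict of gene product id -> set of evidence codes
    sequences:    set of protein sequence ids that could be retrieved
    """
    mapped = set(gp2protein)
    annotated = set(associations)

    # Entries in gp2protein with no association at all.
    no_assoc = mapped - annotated
    # Entries whose only evidence code is IEA (the "just FYI" number).
    iea_only = {gp for gp in mapped & annotated if associations[gp] == {"IEA"}}
    # Entries whose mapped sequence could not be retrieved.
    no_sequence = {gp for gp, seq in gp2protein.items() if seq not in sequences}
    # Annotated gene products with no retrievable sequence
    # (either unmapped, or mapped to a sequence that was not retrieved).
    assoc_without_seq = {
        gp for gp in annotated if gp2protein.get(gp) not in sequences
    }

    return {
        "no_assoc": len(no_assoc),
        "iea_only": len(iea_only),
        "no_sequence": len(no_sequence),
        "assoc_without_seq": len(assoc_without_seq),
    }


# Hypothetical example data.
report = gp2protein_report(
    gp2protein={"SGD:S1": "UniProt:P1", "SGD:S2": "UniProt:P2", "SGD:S3": "UniProt:P3"},
    associations={"SGD:S1": {"IDA", "IEA"}, "SGD:S2": {"IEA"}, "SGD:S4": {"IMP"}},
    sequences={"UniProt:P1", "UniProt:P3"},
)
```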
assocdb fasta files
• Currently, we supply a fasta file whose headers contain all GO term IDs, along with many DBXREFs.
• We can reduce golite loading by ~3 hr and gofull loading by ~24 hr by dumping a fasta file with “basic” headers instead.
fasta headers
NEW (Proposed)
>UNIPROT|Q9XHP0 Uniprot:11S2_SESIN …
>TAIR|gene:1009021752 NCBI:NP_001030618 …
OLD
>UNIPROT|Q9XHP0 - symbol:11S2_SESIN "11S globulin seed storage protein 2 precursor" species:4182 "Sesamum indicum" [GO:0042735 "protein body" evidence=NAS] [GO:0045735 "nutrient reservoir activity" evidence=NAS] [GO:0051259 "protein oligomerization" evidence=NAS] UniProt:Q9XHP0 INTERPRO:IPR006045 Pfam:PF00190 EMBL:AF091842 HSSP:P04776 GO:GO:0042735 GO:GO:0045735 GO:GO:0051259 InterPro:IPR014710 InterPro:IPR006044 Gene3D:G3DSA:2.60.120.10 PRINTS:PR00439 PROSITE:PS00305
NCBI sequences were not being loaded correctly in the production code, so I don't have an example.
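To make the difference between the two header styles concrete, the sketch below pulls the identifier and the bracketed GO terms out of an old-style header, and reduces a header to its leading DB|ID token. The function names are hypothetical and the regular expression is inferred from the single example on the slide. The proposed header also carries a cross-reference after the identifier; the slide elides the rest, so `basic_header` keeps only the part that is certain.

```python
import re

# Matches blocks such as: [GO:0042735 "protein body" evidence=NAS]
GO_TERM = re.compile(r'\[(GO:\d{7}) "([^"]*)" evidence=(\w+)\]')


def parse_old_header(header):
    """Return (db, accession, [(go_id, term_name, evidence), ...])."""
    ident = header.lstrip(">").split(None, 1)[0]  # first whitespace-delimited token
    db, _, accession = ident.partition("|")
    terms = [m.groups() for m in GO_TERM.finditer(header)]
    return db, accession, terms


def basic_header(header):
    """Keep only the leading '>DB|ID' token of a header."""
    return ">" + header.lstrip(">").split(None, 1)[0]


# Shortened copy of the OLD example from the slide.
OLD_HEADER = (
    '>UNIPROT|Q9XHP0 - symbol:11S2_SESIN '
    '"11S globulin seed storage protein 2 precursor" '
    'species:4182 "Sesamum indicum" '
    '[GO:0042735 "protein body" evidence=NAS] '
    '[GO:0045735 "nutrient reservoir activity" evidence=NAS] '
    'UniProt:Q9XHP0'
)

db, accession, terms = parse_old_header(OLD_HEADER)
short = basic_header(OLD_HEADER)
```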
association file filtering
• New active filters:
  • Added filter on IEA annotations without WITH field, as of May 1, 2007. All new IEA annotations must have an ID in the WITH field. (Jan 07)
  • Added more text to the output of the -e option to the report (Aug 07)
• Proposed filters:
  • Check for double colons ('::') in the DB_OBJECT_ID, GOID, REFERENCE, WITH and TAXON ID fields.
  • Check for multiple DB_OBJECT_SYMBOLs associated with a DB_OBJECT_ID.
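The checks listed above can be sketched as a single pass over a tab-delimited association file. This is an illustration, not the production filtering script: the function name is hypothetical, the column positions assume the standard 15-column gene association layout, the IEA check ignores the annotation date, and the guard against short lines is an addition that is not on the slide.

```python
from collections import defaultdict

# 0-based column positions, assuming the 15-column gene association layout.
DB, DB_OBJECT_ID, DB_OBJECT_SYMBOL = 0, 1, 2
GOID, REFERENCE, EVIDENCE, WITH, TAXON = 4, 5, 6, 7, 12

DOUBLE_COLON_FIELDS = {
    "DB_OBJECT_ID": DB_OBJECT_ID,
    "GOID": GOID,
    "REFERENCE": REFERENCE,
    "WITH": WITH,
    "TAXON": TAXON,
}


def check_lines(lines):
    """Return a list of (line_number_or_None, message) problems."""
    problems = []
    symbols = defaultdict(set)  # (db, object id) -> symbols seen
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("!") or not line.strip():
            continue  # skip comment and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:  # guard added for the sketch, not a slide filter
            problems.append((lineno, "too few columns"))
            continue
        for name, idx in DOUBLE_COLON_FIELDS.items():
            if "::" in cols[idx]:
                problems.append((lineno, "double colon in " + name))
        if cols[EVIDENCE] == "IEA" and not cols[WITH].strip():
            problems.append((lineno, "IEA without WITH"))
        symbols[(cols[DB], cols[DB_OBJECT_ID])].add(cols[DB_OBJECT_SYMBOL])
    # File-level check: one object id should map to a single symbol.
    for (db, oid), syms in sorted(symbols.items()):
        if len(syms) > 1:
            problems.append(
                (None, "multiple symbols for %s:%s: %s" % (db, oid, sorted(syms)))
            )
    return problems
```

The symbol check is reported once per object after the whole file has been read, since it cannot be decided from any single line.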
GO Production Services is:
• Stuart Miyasato
• Stan Dong
• Ben Hitz
• Gail Binkley
• Mike Cherry
go-admin@genome.stanford.edu