
GO Production Report

GO Production Report. GOC meeting 7/24/07, Princeton, NJ. Hardware and sysadmin: deployed 4 new Linux machines (1 loading, 2 x production AmiGO, 1 x development AmiGO). Production AmiGO is now much more fault resistant, since it no longer shares web server & database CPU with SGD.



  1. GO Production Report. GOC meeting 7/24/07, Princeton, NJ

  2. Hardware and sysadmin
     • Deployed 4 new Linux machines
       • 1 loading
       • 2 x production AmiGO
       • 1 x development AmiGO
     • Production AmiGO now much more fault resistant, not sharing web server & database CPU with SGD
     • Minor improvements to pipeline

  3. GO Database
     • Brought production code up to date with CVS
     • Improved sequence loading speed and accuracy
     • Final testing of “bulk” loading; predicted improvement (assoc only):
       • golite: 24 hr -> 16 hr
       • gofull: 11 days -> 4 days

  4. godb sequences
     • Loading into the db improved by better batching (and is faster: ~4 hr for golite)
     • Created monthly reports for gp2protein files (220K gene products have only 97K sequences)
     • Can possibly save another 3 hr of loading time for golite, 24 hr for gofull

  5. gp2protein report
     • Monthly email reports 3 numbers:
       • entries in gp2protein with no assoc (or IEA only, just FYI)
       • entries in gp2protein for which sequence could not be retrieved (obsolete ids?)
       • gene products in assoc file without protein sequences
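The three counts in the report can be read as set differences over gene product IDs. The following is a minimal sketch under that assumption, not the production report code; the function name and its three inputs (IDs listed in gp2protein, IDs present in the assoc file, and IDs for which a sequence was retrieved) are hypothetical.

```python
def gp2protein_report(gp2protein_ids, assoc_ids, ids_with_sequence):
    """Return the three monthly counts as a dict (hypothetical key names)."""
    gp2protein_ids = set(gp2protein_ids)
    assoc_ids = set(assoc_ids)
    ids_with_sequence = set(ids_with_sequence)
    return {
        # entries in gp2protein with no assoc
        "gp2protein_without_assoc": len(gp2protein_ids - assoc_ids),
        # entries in gp2protein for which no sequence could be retrieved
        "gp2protein_without_sequence": len(gp2protein_ids - ids_with_sequence),
        # gene products in the assoc file without a protein sequence
        "assoc_without_sequence": len(assoc_ids - ids_with_sequence),
    }


if __name__ == "__main__":
    print(gp2protein_report({"A", "B", "C"}, {"B", "C", "D"}, {"B"}))
```

The "IEA only" variant of the first count would need the evidence codes per gene product, which this sketch leaves out.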

  6. assocdb fasta files
     • Currently, we supply a fasta file that contains all GO term ids in the header, along with many DBXREFs.
     • We can reduce golite loading by ~3 hrs and gofull loading by ~24 hrs by just dumping a fasta file with “basic” headers

  7. fasta headers
     NEW (Proposed)
       >UNIPROT|Q9XHP0 Uniprot:11S2_SESIN …
       >TAIR|gene:1009021752 NCBI:NP_001030618 …
     OLD
       >UNIPROT|Q9XHP0 - symbol:11S2_SESIN "11S globulin seed storage protein 2 precursor" species:4182 "Sesamum indicum" [GO:0042735 "protein body" evidence=NAS] [GO:0045735 "nutrient reservoir activity" evidence=NAS] [GO:0051259 "protein oligomerization" evidence=NAS] UniProt:Q9XHP0 INTERPRO:IPR006045 Pfam:PF00190 EMBL:AF091842 HSSP:P04776 GO:GO:0042735 GO:GO:0045735 GO:GO:0051259 InterPro:IPR014710 InterPro:IPR006044 Gene3D:G3DSA:2.60.120.10 PRINTS:PR00439 PROSITE:PS00305
     NCBI sequences were not being loaded correctly in the production code, so I don't have an example.
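To make the difference between the two header styles concrete, here is a rough sketch that pulls the ID, symbol, and GO term ids out of an OLD-style header and emits a reduced header. The function names and the regular expression are assumptions for illustration. Because the proposed header is only shown up to the "…", the reduced header here carries just the leading ID token and does not attempt to reproduce the full proposed format.

```python
import re

# Matches the bracketed annotations of an OLD-style header, e.g.
# [GO:0042735 "protein body" evidence=NAS]; the pattern is an assumption
# based on the single example shown on the slide.
GO_ANNOTATION = re.compile(r'\[(GO:\d{7}) "[^"]*" evidence=\w+\]')


def split_old_header(header):
    """Return (object_id, symbol, go_ids) parsed from an OLD-style header."""
    object_id = header.lstrip(">").split()[0]
    symbol_match = re.search(r"symbol:(\S+)", header)
    symbol = symbol_match.group(1) if symbol_match else None
    go_ids = GO_ANNOTATION.findall(header)
    return object_id, symbol, go_ids


def basic_header(header):
    """Drop everything after the ID token (hypothetical 'basic' header)."""
    object_id, _symbol, _go_ids = split_old_header(header)
    return ">" + object_id
```

Dropping the GO term ids and DBXREFs from the header means a consumer that needs them would have to look them up in the database by ID instead.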

  8. association file filtering
     • New active filters:
       • Added filter on IEA annotations without WITH field, as of May 1, 2007. All new IEA annotations must have an ID in the WITH field. (Jan 07)
       • Added more text to the output of the -e option to the report (Aug 07)
     • Proposed filters:
       • Check for double colons ('::') in DB_OBJECT_ID, GOID, REFERENCE, WITH and TAXON ID fields.
       • Check for multiple DB_OBJECT_SYMBOLs associated with a DB_OBJECT_ID.
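The checks above could look roughly like the following sketch, which is not the production filter script. It assumes each annotation is a list of tab-separated columns in the 15-column association file layout (DB, DB_Object_ID, DB_Object_Symbol, Qualifier, GO ID, Reference, Evidence, With, ...); the function names are hypothetical, and the May 1, 2007 date cutoff for the IEA rule is ignored for brevity.

```python
# Zero-based column positions, assuming the 15-column association file layout.
FIELDS_TO_CHECK = {
    "DB_OBJECT_ID": 1,
    "GOID": 4,
    "REFERENCE": 5,
    "WITH": 7,
    "TAXON": 12,
}


def double_colon_fields(columns):
    """Names of the checked fields that contain a double colon ('::')."""
    return [
        name
        for name, idx in FIELDS_TO_CHECK.items()
        if idx < len(columns) and "::" in columns[idx]
    ]


def iea_missing_with(columns):
    """True for an IEA annotation whose WITH field is empty."""
    return columns[6] == "IEA" and not columns[7].strip()


def ids_with_multiple_symbols(rows):
    """Map each DB_OBJECT_ID to its symbols when more than one is seen."""
    symbols = {}
    for columns in rows:
        symbols.setdefault(columns[1], set()).add(columns[2])
    return {
        obj_id: sorted(syms)
        for obj_id, syms in symbols.items()
        if len(syms) > 1
    }
```

The first two checks work line by line, while the symbol check needs a pass over the whole file before it can report anything.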

  9. GO Production Services Is:
     • Stuart Miyasato
     • Stan Dong
     • Ben Hitz
     • Gail Binkley
     • Mike Cherry
     go-admin@genome.stanford.edu
