Developments at Sanger Anthony Rogers Wellcome Trust Sanger Institute.
Overview
• The build procedure
• Stats for the year
• Team changes
• Model changes
  • “new gene model”
  • Variation
• Future plans
  • InterPro
  • improved mapping of data to genes
  • move off wormsrv2
  • new nematodes
  • new data types
Washington University in St. Louis Wellcome Trust Sanger Institute Cold Spring Harbor Laboratory California Institute of Technology • RNAi • Microarray • Anatomy / Cell • Homology groups (KOGS) • SAGE data • Gene Ontology • Papers / References • Person / Author • Detailed Functional Annotation The WormBase Consortium • Gene prediction annotation • SNPs Gene prediction annotation Genetic Data Alleles Gene name info (incl. unique ids) Strains Data Integration and analysis • PCR_products / Oligos • 3D structures • Yeast 2 Hybrid interactions Website and tools
Build Overview
• Data sources: Sanger, CalTech, WashU, CSHL, EMBL
• Sanger Compute Farm (WormBase mysql): DNA, Blastx, blastp, RepeatMask, PFAM, tmhmm etc.
• WORMPEP
• Align all cDNAs and build transcripts
• Map expt data, e.g. RNAi, oligos, Alleles
• Load homology data
• Export GFF, agp, DNA files
• Build release files
• Dev site
• To FTP site and CSHL
Release cycle
• From WS124 (March 2004) to WS150 (October 2005): 26 releases.
• All but 2 of these were on schedule.
• Those that were late were due to Sanger-wide systems problems associated with moving to the new building.
• After WS134 we changed (with SAB approval) to a three-weekly cycle.
• If releases were on time, why change?
  • Increases in data meant a gradual increase in build time.
  • Lots of releases were “just in time”.
  • Time pressure meant that fixes weren’t being made properly.
  • Reduced staff meant that less development was being done.
Gene stats: more polyA / TSL etc. and fixing BLAT errors
Experimental Data Stats I New data class
Experimental Data Stats II Incorporation of genome wide experiments
Other classes of interest InParanoid
Staff Changes
Keith Bradnam, Dan Lawson, Chao-Kung Chen
Mary Ann Tuli, Michael Han, Gary Williams
• Great improvement in documentation of procedures.
• Gene structure curation
• Allele curation
• Genetic map functions in acedb
• Sequence feature annotation (polyA, TSL)
• Fresh view of methods for doing things.
“Where is the new Gene model Keith!?!”
The problem
• Worm genes first existed as Locus objects
  • e.g. dpy-1
• Then genes existed as Sequence objects
  • e.g. F31D4.3
• Some genes exist as both Locus and Sequence objects
• Gene names change… a lot!
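The Plan
[Diagram: Locus and Sequence objects are brought together under a single Gene, WBGene0000001, which carries a Main CGC name, a Sequence name, CDS names and Other names. Names shown: ptp-1, ptp-3, ptp-3a, ptp-3b, ypp-1, YPP/1, C09D8.1, C09D8.1a, C09D8.1b.]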
Linking to a gene
[Diagram: Antibody, Paper [cgc4265], Allele and RNAi result objects all link to Gene WBGene0000001. Names shown: ptp-1, ptp-3, ptp-3a, ptp-3b, ypp-1, YPP/1, abc-1, C09D8.1, C09D8.1a, C09D8.1b, C09D8.1c.]
Progress!
• The (no longer new) Gene model is in place.
• All Genes now have Gene_ids
• Gene history tracking info stored
  • merges, splits etc.
• Next part of the plan was to have a central database serving ids
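The idea behind the Gene model can be sketched in a few lines of Python. This is a minimal illustration, not the actual WormBase schema: the class names `Gene` and `GeneRegistry`, the free-text `history` log, and the in-memory lookup are all assumptions made for the example. The point it shows is that the Gene ID is the identity, while CGC names, sequence names and other names are attributes that can change.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Gene:
    """A stable gene record: the ID is the identity, names are attributes."""
    gene_id: str                          # e.g. "WBGene0000001", as on the slides
    cgc_name: Optional[str] = None        # main CGC name
    sequence_name: Optional[str] = None   # e.g. "C09D8.1"
    other_names: Set[str] = field(default_factory=set)
    history: List[str] = field(default_factory=list)  # hypothetical free-text log
    live: bool = True

    def all_names(self) -> Set[str]:
        names = set(self.other_names)
        names.update(n for n in (self.cgc_name, self.sequence_name) if n)
        return names


class GeneRegistry:
    """Hypothetical in-memory stand-in for the database."""

    def __init__(self) -> None:
        self.genes: Dict[str, Gene] = {}

    def add(self, gene: Gene) -> None:
        self.genes[gene.gene_id] = gene

    def resolve(self, name: str) -> Optional[Gene]:
        """Find the live gene that currently or formerly carried this name."""
        for gene in self.genes.values():
            if gene.live and name in gene.all_names():
                return gene
        return None

    def rename(self, gene_id: str, new_cgc_name: str) -> None:
        gene = self.genes[gene_id]
        if gene.cgc_name:
            gene.other_names.add(gene.cgc_name)  # keep the old name searchable
        gene.history.append(f"renamed {gene.cgc_name} -> {new_cgc_name}")
        gene.cgc_name = new_cgc_name

    def merge(self, keep_id: str, absorbed_id: str) -> None:
        """Fold one gene into another; the absorbed ID is retired, not reused."""
        keep, absorbed = self.genes[keep_id], self.genes[absorbed_id]
        keep.other_names |= absorbed.all_names()
        keep.history.append(f"merged in {absorbed_id}")
        absorbed.history.append(f"merged into {keep_id}")
        absorbed.live = False
```

With this shape, a gene added as ptp-1 / C09D8.1 and later renamed to ptp-3 (an illustrative rename, not a claim about the real history) still resolves from any of its names to the same ID. Splits are left out to keep the sketch short; they would mint a new ID and log the event on both records.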
Working version
• Sanger “single sign-on”
• User-specific operations
• Operation selection
• Not just WBGene_ids: Variation, RNAi, Person
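A central ID server of this kind can be pictured roughly as follows. This is a sketch under stated assumptions, not the Sanger implementation: the class `IdServer`, the prefixes other than `WBGene`, the seven-digit width (copied from the `WBGene0000001` example on the slides) and the user log are all hypothetical.

```python
import itertools
from typing import Dict, List, Tuple


class IdServer:
    """Hands out sequential, never-reused IDs for several object classes."""

    def __init__(self, prefixes: Dict[str, str], width: int = 7) -> None:
        self._prefixes = dict(prefixes)
        self._width = width  # digits after the prefix; assumed from the slide example
        self._counters = {cls: itertools.count(1) for cls in self._prefixes}
        self.log: List[Tuple[str, str, str]] = []  # (user, class, id)

    def new_id(self, cls: str, user: str) -> str:
        if cls not in self._prefixes:
            raise KeyError(f"unknown class: {cls}")
        number = next(self._counters[cls])
        new = f"{self._prefixes[cls]}{number:0{self._width}d}"
        self.log.append((user, cls, new))  # record who requested which ID
        return new


# Prefixes here are illustrative assumptions, apart from "WBGene".
server = IdServer({"Gene": "WBGene", "Variation": "WBVar", "RNAi": "WBRNAi"})
first_gene = server.new_id("Gene", user="curator1")
```

Keeping one counter per class in a single service is what lets several curators at different sites request IDs without collisions; the log stands in for the per-user operations that single sign-on makes possible.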
Variation Model Locus SNPs Classical Genes Gene Clusters Allele Deletions Transposon_insertions Lots of shared data structures (Tags), e.g. mapping data, names, connections to CDSs Variation Greater code efficiency and manageability for both build and web Easier to search
Imminent arrivals and the Future • InterPro • Refined Mapping • Moving build machine • New nematodes • New data types
InterPro
• Useful data used in many other resources, so a good ‘point of reference’ for non-worm specialists.
• We previously got ours from UniProt or ad hoc from St Louis.
• Many databases are covered by InterPro.
  • Prosite, Prints, Pfam, SMART, PIRSF, etc.
• The usual way of searching for database hits is to use interproscan, but this is incompatible with the Sanger farm.
• We run each database search individually, using the existing architecture from BLAST etc., and store the results.
• We merge hits with the same InterPro ID
Merging hits from databases: results per protein are similar but not identical to iprscan
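One plausible reading of "merge hits with the same InterPro ID" is to collapse overlapping hits from different member databases into a single span per InterPro entry. The sketch below assumes exactly that; the function name `merge_hits`, the tuple layout and the placeholder IDs `IPR_X` / `IPR_Y` are hypothetical, and the real build may merge differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, int, int]  # (interpro_id, start, end), 1-based inclusive


def merge_hits(hits: Iterable[Hit]) -> Dict[str, List[Tuple[int, int]]]:
    """Collapse overlapping hits on one protein that share an InterPro ID."""
    spans_by_id: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for interpro_id, start, end in hits:
        spans_by_id[interpro_id].append((start, end))

    merged: Dict[str, List[Tuple[int, int]]] = {}
    for interpro_id, spans in spans_by_id.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= out[-1][1]:            # overlaps the previous span
                out[-1][1] = max(out[-1][1], end)
            else:                              # separate domain instance
                out.append([start, end])
        merged[interpro_id] = [(s, e) for s, e in out]
    return merged


# Made-up coordinates: two member databases report the same entry at 10-50 and 40-90.
example = merge_hits([("IPR_X", 10, 50), ("IPR_X", 40, 90),
                      ("IPR_X", 200, 250), ("IPR_Y", 5, 20)])
```

Because member databases draw domain boundaries slightly differently, a merge like this will not reproduce iprscan output exactly, which fits the slide's remark that the results are similar but not identical.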
InterPro hits per protein: 15 proteins with >100 domains (max. 186)
Improved Mapping of Variations to Genes
[Diagram: Variations placed along a gene, with SL-2/1 and AAAAA (polyA) features marked]
• We can describe much more accurately how a mutation affects a gene:
  • donor and acceptor splice sites
  • introns / exons
  • motifs like polyAs and TSLs
• ... and for coding changes give the amino acid differences.
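To make the finer-grained mapping concrete, here is a small sketch that classifies a single position against a transcript's exons. It rests on simplifying assumptions that are not from the slides: a plus-strand transcript, non-overlapping exons given as 1-based inclusive coordinates, and splice sites defined as the first and last two bases of each intron. The function name `classify_position` is hypothetical.

```python
from typing import List, Tuple


def classify_position(pos: int, exons: List[Tuple[int, int]]) -> str:
    """Say where a variation falls relative to a plus-strand transcript."""
    exons = sorted(exons)
    if pos < exons[0][0] or pos > exons[-1][1]:
        return "outside"
    for start, end in exons:
        if start <= pos <= end:
            return "exon"
    # Not in an exon but inside the transcript, so it lies in some intron.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if prev_end < pos < next_start:
            if pos <= prev_end + 2:        # first two intron bases
                return "splice_donor"
            if pos >= next_start - 2:      # last two intron bases
                return "splice_acceptor"
            return "intron"
    return "intron"


exons = [(100, 200), (301, 400)]  # made-up coordinates
where = classify_position(201, exons)  # "splice_donor"
```

A minus-strand transcript would swap the donor and acceptor ends, and motifs such as polyA or TSL sites would need their own feature lists; both are left out here.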
Example: sra-9, predicted SNP snp_AH6[1], ttc → tta (F → L). Currently the only connection is to the Gene. In future we will specify that the SNP is in coding sequence and that it causes a specified amino acid change. This is described by Tags in the database, so it is searchable.
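The amino acid consequence of a coding SNP like the one above can be worked out from the standard genetic code. The sketch below is illustrative only: the function name `coding_change` and the three consequence labels are assumptions, and it takes the reference and variant codons as given rather than deriving them from a CDS.

```python
from itertools import product
from typing import Tuple

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def coding_change(ref_codon: str, alt_codon: str) -> Tuple[str, str, str]:
    """Return (reference amino acid, variant amino acid, consequence)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        kind = "silent"
    elif alt_aa == "*":
        kind = "nonsense"
    else:
        kind = "missense"
    return ref_aa, alt_aa, kind


change = coding_change("ttc", "tta")  # ("F", "L", "missense")
```

Storing the result as structured fields rather than free text is what would make queries such as "all SNPs causing a premature stop" possible.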
Implementation
• GFF data: exons, introns, transcripts, SNPs, alleles etc.
• One table per chromosome (I, II, III, IV, V, X), so all can be loaded together
• All chromosomes can be run in parallel (cbi1 = 3 x 2cpu)
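The one-table-per-chromosome layout lends itself to running each chromosome as an independent job. The sketch below shows the pattern with Python threads standing in for cluster jobs; the record layout, the function names and the trivial `count_features` worker are assumptions made for illustration, not the build code.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[str, str, int, int]  # (chromosome, feature type, start, end)
Row = Tuple[str, int, int]


def split_by_chromosome(records: Iterable[Record]) -> Dict[str, List[Row]]:
    """Build one table per chromosome from GFF-like records."""
    tables: Dict[str, List[Row]] = defaultdict(list)
    for chrom, feature, start, end in records:
        tables[chrom].append((feature, start, end))
    return dict(tables)


def count_features(rows: List[Row]) -> Dict[str, int]:
    """Placeholder for the real per-chromosome mapping work."""
    return dict(Counter(feature for feature, _, _ in rows))


def run_all(tables: Dict[str, List[Row]],
            worker: Callable[[List[Row]], Dict[str, int]],
            max_workers: int = 6) -> Dict[str, Dict[str, int]]:
    """Process every chromosome table independently and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {chrom: pool.submit(worker, rows) for chrom, rows in tables.items()}
        return {chrom: future.result() for chrom, future in futures.items()}


records = [("I", "exon", 100, 200), ("I", "SNP", 150, 150), ("X", "intron", 10, 90)]
results = run_all(split_by_chromosome(records), count_features)
```

Six workers matches six chromosomes on a machine described as 3 x 2cpu. The approach works because no job needs data from another chromosome's table.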
Death of wormsrv2
• 5 years ago: Sanger network = bad, so we bought a shiny, fast new computer
• Now: Sanger network = good! wormsrv2 has become too slow and its isolation is a pain
• Move to use the informatics cluster: fast and parallel
• Means modification of the majority of the code base
New nematode genomes
• C. briggsae is a forerunner . . .
  • semi-curated geneset
  • brigpep2
  • protein annotation (PFAM, tmhmm, signalp)
  • ortholog assignment (InParanoid, Erik Sonnhammer)
  • blastp
  • blastx
  • waba (Jim Kent’s genome alignment tool)
• We intend to do all of this for each of the new genomes.
• Mostly done for C. remanei
New Data Types
• Any new data type has an impact on the build:
  • new model development
  • scripts to integrate and check the data
• E.g. mass spec data: we have been in contact with Gennifer Merrihew