
Developments at Sanger

Developments at Sanger. Anthony Rogers Wellcome Trust Sanger Institute. Overview. The build procedure Stats for the year Team changes Model changes. “new gene model” Variation Future plans InterPro improved mapping of data to genes move off wormsrv2 new nematodes


Presentation Transcript


  1. Developments at Sanger Anthony Rogers Wellcome Trust Sanger Institute.

  2. Overview • The build procedure • Stats for the year • Team changes • Model changes. • “new gene model” • Variation • Future plans • InterPro • improved mapping of data to genes • move off wormsrv2 • new nematodes • new data types

  3. The WormBase Consortium: Washington University in St. Louis, Wellcome Trust Sanger Institute, Cold Spring Harbor Laboratory, California Institute of Technology • RNAi • Microarray • Anatomy / Cell • Homology groups (KOGS) • SAGE data • Gene Ontology • Papers / References • Person / Author • Detailed Functional Annotation • Gene prediction annotation • SNPs • Gene prediction annotation • Genetic Data • Alleles • Gene name info (incl. unique ids) • Strains • Data Integration and analysis • PCR_products / Oligos • 3D structures • Yeast 2 Hybrid interactions • Website and tools

  4. Build Overview • Sanger, CalTech, WashU, CSHL • Sanger Compute Farm, WormBase mysql: DNA, Blastx, blastp, RepeatMask, PFAM, tmhmm etc • EMBL • WORMPEP • Align all cDNAs and build transcripts • Map expt data, e.g. RNAi, oligos, Alleles • Load homology data • Export GFF, agp, DNA files • Build release files • Dev site • To FTP site and CSHL

  5. Release cycle • From WS124 (March 2004) to WS150 (October 2005): 26 releases. • All but 2 of these were on schedule. • Those that were late were due to Sanger-wide systems problems associated with moving to a new building. • After WS134, changed (with SAB approval) to a three-weekly cycle. • If releases were on time, why change? • Increases in data meant a gradual increase in build time. • Lots of releases were “just in time”. • Time pressure meant that fixes weren’t being made properly. • Reduced staff meant that less development was being done.

  6. Gene stats • More polyA / TSL etc. and fixing BLAT errors

  7. Experimental Data Stats I New data class

  8. Experimental Data Stats II Incorporation of genome wide experiments

  9. Other classes of interest InParanoid

  10. Staff Changes: Keith Bradnam, Dan Lawson, Chao-Kung Chen; Mary Ann Tuli, Michael Han, Gary Williams • Great improvement in documentation of procedures. • Gene structure curation • Allele curation • Genetic map functions in acedb • Sequence feature annotation (polyA, TSL) • Fresh view of methods for doing things.

  11. “Where is the new Gene model Keith!?!”

  12. The problem • Worm genes first existed as Locus objects • e.g. dpy-1 • Then genes existed as Sequence objects • e.g. F31D4.3 • Some genes exist as both Locus and Sequence objects • Gene names change…a lot!

  13. The Plan [diagram: a single Gene object, WBGene0000001, carrying Main CGC name, Sequence name, CDS and Other names; Locus and Sequence labels shown: ptp-1, ptp-3, ptp-3a, ptp-3b, ypp-1, YPP/1, C09D8.1, C09D8.1a, C09D8.1b]

  14. Linking to a gene [diagram: Antibody, Paper [cgc4265], Allele and RNAi result all link to Gene WBGene0000001; names shown: ptp-1, ptp-3, ptp-3a, ptp-3b, ypp-1, abc-1, YPP/1, C09D8.1, C09D8.1a, C09D8.1b, C09D8.1c]

  15. Progress! • The (no longer new) Gene model is in place. • All Genes now have Gene_ids • Gene history tracking info stored • merges, splits etc • Next part of the plan was to have a central database serving ids
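The idea behind slides 12 to 15 (one stable Gene ID, with every Locus or Sequence name hanging off it, plus a history of changes) can be sketched as below. This is only an illustration, not the acedb model or the WormBase ID server: the class and method names are hypothetical, and which of the names on the slide is the main CGC name is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Gene:
    """One stable Gene object; names are attributes, not identities."""
    gene_id: str                          # stable ID, e.g. "WBGene0000001"
    cgc_name: Optional[str] = None        # main CGC name (assumed for the example)
    sequence_name: Optional[str] = None   # e.g. "C09D8.1"
    other_names: Set[str] = field(default_factory=set)
    history: List[str] = field(default_factory=list)
    live: bool = True


class GeneRegistry:
    """Hypothetical in-memory stand-in for a database serving gene IDs."""

    def __init__(self) -> None:
        self.genes: Dict[str, Gene] = {}
        self.name_index: Dict[str, str] = {}  # any name -> gene_id

    @staticmethod
    def _names(gene: Gene) -> Set[str]:
        names = set(gene.other_names)
        if gene.cgc_name:
            names.add(gene.cgc_name)
        if gene.sequence_name:
            names.add(gene.sequence_name)
        return names

    def add(self, gene: Gene) -> None:
        self.genes[gene.gene_id] = gene
        for name in self._names(gene):
            self.name_index[name] = gene.gene_id

    def lookup(self, name: str) -> Gene:
        """Resolve any current or former name to its Gene object."""
        return self.genes[self.name_index[name]]

    def rename(self, gene_id: str, new_cgc_name: str) -> None:
        """Change the CGC name; the old name stays searchable."""
        gene = self.genes[gene_id]
        if gene.cgc_name:
            gene.other_names.add(gene.cgc_name)
        gene.history.append(f"CGC name {gene.cgc_name} -> {new_cgc_name}")
        gene.cgc_name = new_cgc_name
        self.name_index[new_cgc_name] = gene_id

    def merge(self, keep_id: str, dead_id: str) -> None:
        """Merge one gene into another, recording the event on both."""
        keep, dead = self.genes[keep_id], self.genes[dead_id]
        for name in self._names(dead):
            keep.other_names.add(name)
            self.name_index[name] = keep_id
        dead.live = False
        keep.history.append(f"merged in {dead_id}")
        dead.history.append(f"merged into {keep_id}")


# Names taken from the slide; their roles here are illustrative only.
registry = GeneRegistry()
registry.add(Gene("WBGene0000001", cgc_name="ptp-1", sequence_name="C09D8.1",
                  other_names={"ypp-1", "YPP/1"}))
registry.rename("WBGene0000001", "ptp-3")
```

Because alleles, papers and RNAi results would point at the Gene ID rather than at a name, a rename or merge only touches the name index and the history list.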

  16. Sanger “single sign-on” User specific operations Working version Operation selection Not just WBGene_ids - Variation, RNAi, Person

  17. Variation Model • Locus, SNPs, Classical Genes, Gene Clusters, Allele, Deletions, Transposon_insertions • Variation • Lots of shared data structures (Tags), e.g. mapping data, names, connections to CDSs • Greater code efficiency and manageability for both build and web • Easier to search

  18. Imminent arrivals and the Future • InterPro • Refined Mapping • Moving build machine • New nematodes • New data types

  19. InterPro • Useful data used in many other resources, so a good “point of reference” for non-worm specialists. • We previously got ours from UniProt or ad hoc from St Louis. • Many databases are covered by InterPro. • Prosite, Prints, Pfam, SMART, PIRSF, etc. • The usual way of searching for database hits is to use interproscan, but this is incompatible with the Sanger farm. • We run each database search individually, using the existing architecture from BLAST etc., and store the results. • We merge hits with the same InterPro ID

  20. Merging hits from databases Protein Results similar but not identical to iprscan
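A minimal sketch of what merging hits with the same InterPro ID could look like for one protein. The slides do not state the merge rule, so the rule used here (overlapping spans with the same InterPro ID are unioned, and the contributing member databases are remembered) is an assumption, as are the function name and the placeholder IDs.

```python
from collections import defaultdict


def merge_interpro_hits(hits):
    """Merge member-database hits on one protein by InterPro ID.

    hits: iterable of (interpro_id, member_db, start, end), with
    1-based inclusive protein coordinates.
    Returns {interpro_id: [(start, end, [member_dbs]), ...]}.
    """
    by_id = defaultdict(list)
    for interpro_id, member_db, start, end in hits:
        by_id[interpro_id].append((start, end, member_db))

    merged = {}
    for interpro_id, spans in by_id.items():
        spans.sort()
        out = []
        for start, end, member_db in spans:
            if out and start <= out[-1][1]:
                # Overlaps the previous span: extend it and note the source.
                prev_start, prev_end, dbs = out[-1]
                out[-1] = (prev_start, max(prev_end, end), dbs | {member_db})
            else:
                out.append((start, end, {member_db}))
        merged[interpro_id] = [(s, e, sorted(dbs)) for s, e, dbs in out]
    return merged


# Placeholder IDs and coordinates, for illustration only.
example_hits = [
    ("IPR_A", "Pfam", 10, 120),
    ("IPR_A", "SMART", 15, 118),
    ("IPR_A", "Prosite", 300, 340),
    ("IPR_B", "Prints", 50, 90),
]
example_merged = merge_interpro_hits(example_hits)
```

A different rule (for example, keeping only the longest span) would give slightly different domain boundaries, which is one way results can end up similar but not identical to iprscan.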

  21. InterPro hits per protein 15 Proteins with >100 domains (max. 186)

  22. Improved Mapping of Variations to Genes [diagram: Variations along a transcript with SL-2/1 and polyA (AAAAA....)] We can describe much more accurately how a mutation affects a gene: • donor and acceptor splice sites • introns / exons • motifs like polyAs and TSLs • ... and for coding changes give the amino acid differences.

  23. Predicted snp_AH6[1] in sra-9: ttc to tta, F to L • Currently the only connection is to the Gene. • In future we will specify that the SNP is in coding sequence and that it causes a specified amino acid change. • Described by Tags in the database, so searchable.
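The coding-change step on this slide (ttc to tta giving F to L) can be sketched with the standard genetic code. The function name and the toy CDS below are made up for the example; the real sra-9 sequence is not used, and strand handling and intron removal are assumed to have been done already.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def coding_change(cds, offset, alt_base):
    """Describe a single-base substitution in a spliced coding sequence.

    cds: coding sequence, starting at the first base of the start codon.
    offset: 0-based position of the changed base within cds.
    alt_base: the new base at that position.
    """
    cds = cds.upper()
    alt_base = alt_base.upper()
    pos_in_codon = offset % 3
    codon_start = offset - pos_in_codon
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:pos_in_codon] + alt_base + ref_codon[pos_in_codon + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        effect = "silent"
    elif alt_aa == "*":
        effect = "nonsense"
    else:
        effect = "missense"
    return {
        "ref_codon": ref_codon,
        "alt_codon": alt_codon,
        "ref_aa": ref_aa,
        "alt_aa": alt_aa,
        "aa_position": codon_start // 3 + 1,  # 1-based residue number
        "effect": effect,
    }


# Toy CDS (not sra-9): the second codon is ttc, changed to tta.
change = coding_change("atgttcgga", 5, "a")
```

Storing each of these fields as its own tag, rather than as free text, is what would make a query such as "all missense SNPs" possible.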

  24. Implementation • All chromosomes can be run in parallel (cbi1 = 3 x 2cpu): I, X, II, V, III, IV • GFF data: exons, introns, transcripts, SNPs, alleles etc • One table per chromosome, so all can be loaded together
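The per-chromosome layout lends itself to a simple fan-out, sketched below. The data structures and function names are hypothetical; threads stand in for separate farm jobs, and a naive linear scan stands in for whatever indexed lookup the real tables provide.

```python
from concurrent.futures import ThreadPoolExecutor

CHROMOSOMES = ["I", "II", "III", "IV", "V", "X"]


def map_variations(features, variations):
    """Find every feature that each variation falls inside.

    features: list of (start, end, feature_type, name), 1-based inclusive.
    variations: list of (name, position).
    Returns {variation_name: [(feature_type, feature_name), ...]}.
    """
    result = {}
    for var_name, position in variations:
        result[var_name] = [
            (feature_type, feature_name)
            for start, end, feature_type, feature_name in features
            if start <= position <= end
        ]
    return result


def map_all(features_by_chrom, variations_by_chrom, workers=6):
    """Run the mapping for each chromosome independently, in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            chrom: pool.submit(
                map_variations,
                features_by_chrom.get(chrom, []),
                variations_by_chrom.get(chrom, []),
            )
            for chrom in CHROMOSOMES
        }
        return {chrom: future.result() for chrom, future in futures.items()}


# Made-up coordinates and names, for illustration only.
features_by_chrom = {
    "I": [(100, 200, "exon", "gene1.e1"), (201, 300, "intron", "gene1.i1"),
          (100, 400, "transcript", "gene1")],
    "X": [(50, 80, "exon", "gene2.e1")],
}
variations_by_chrom = {"I": [("snp_a", 150), ("snp_b", 250)], "X": [("snp_c", 90)]}
mapped = map_all(features_by_chrom, variations_by_chrom)
```

Since no chromosome's result depends on another's, the same split works whether the units are threads, processes or cluster jobs.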

  25. Death of wormsrv2 • 5 years ago: Sanger network = bad. Bought shiny fast new computer. • Now: Sanger network = good! wormsrv2 has become too slow and isolation is a pain. • Move to use informatics cluster - fast and parallel • Means modification of majority of code base

  26. New nematode genomes • C. briggsae is a forerunner . . . • semi-curated geneset • brigpep2 • protein annotation (PFAM, tmhmm, signalp) • ortholog assignment (InParanoid - Erik Sonnhammer) • blastp • blastx • waba (Jim Kent’s genome alignment tool) • We intend to do all of this for each of the new genomes. • Mostly done for C. remanei

  27. New Data Types • Any new data types impact on the build: • new model development • scripts to integrate and check the data • E.g. mass spec data: been in contact with Gennifer Merrihew
