
Introduction to DAS / State of the Union

Introduction to DAS / State of the Union. Tim Hubbard, th@sanger.ac.uk. DAS developer workshop, 10th March 2009, Wellcome Trust Genome Campus. Distributed Annotation System, or How I Learnt to Stop Worrying and Love Data Federation. Credit: Andreas Prlić.


Presentation Transcript


  1. Introduction to DAS / State of the Union Tim Hubbard th@sanger.ac.uk DAS developer workshop 10th March 2009 Wellcome Trust Genome Campus

  2. Distributed Annotation System or How I Learnt to Stop Worrying and Love Data Federation Credit: Andreas Prlić

  3. Distributed Annotation System • Origins: • XML client/server specification (http://biodas.org/) • Lincoln Stein, Sean Eddy, Robin Dowell and LaDeana Hillier • AceDB-based prototype server • Java-based prototype client • Dowell, R.D., Jokerst, R.M., Day, A., Eddy, S.R. & Stein, L. (2001) The Distributed Annotation System. BMC Bioinformatics 2:7. • Genome campus adoption • Initially via Ensembl becoming a DAS client (now also a DAS server) • Code: Dazzle and Proserver servers; Bio::Das::Lite and BioJava client libraries • Hosts the DAS registry (http://www.dasregistry.org/)

  4. DAS in a nutshell • Standardized set of web services • Reference servers (the sequence) • Annotation servers (features: chr:start-end) • Alignment servers (chr:start-end matches chr:start-end) • Identifier based servers (ref item X rather than coordinate) • Standardization allows clients to connect to different DAS sources without additional programming
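
To make the "chr:start-end" idea on this slide concrete, here is a minimal sketch of how a client might ask an annotation server for features and read the reply. It assumes the DAS 1.5-style `features` command and DASGFF response layout; the server URL `http://das.example.org/das`, the source name `demo`, and the sample XML are invented for illustration and do not come from the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical response, shaped like a DAS 1.5 DASGFF document.
SAMPLE_DASGFF = """<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/demo/features">
    <SEGMENT id="1" start="1000" stop="2000">
      <FEATURE id="exon1" label="exon 1">
        <TYPE id="exon">exon</TYPE>
        <METHOD id="manual">manual</METHOD>
        <START>1100</START>
        <END>1250</END>
        <ORIENTATION>+</ORIENTATION>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""


def das_features_url(das_prefix, source, segment, start, end):
    """Build a features request for one region (segment:start,end)."""
    return f"{das_prefix.rstrip('/')}/{source}/features?segment={segment}:{start},{end}"


def parse_features(xml_text):
    """Flatten every FEATURE in a DASGFF document into a plain dict."""
    root = ET.fromstring(xml_text)
    rows = []
    for seg in root.iter("SEGMENT"):
        for feat in seg.iter("FEATURE"):
            rows.append({
                "segment": seg.get("id"),
                "id": feat.get("id"),
                "type": feat.find("TYPE").get("id"),
                "start": int(feat.findtext("START")),
                "end": int(feat.findtext("END")),
                "strand": feat.findtext("ORIENTATION", default="0"),
            })
    return rows


if __name__ == "__main__":
    print(das_features_url("http://das.example.org/das", "demo", "1", 1000, 2000))
    for row in parse_features(SAMPLE_DASGFF):
        print(row)
```

Because every source answers the same request in the same format, the same two functions would work unchanged against any annotation server, which is the point the last bullet makes.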

  5. Data integration • Complete genomes provide the framework to pull all biological data together such that each piece says something about biology as a whole • Biology is too complex for any organisation to have a monopoly of ideas or data • The more organisations provide data or analysis separately, the harder it becomes for anyone to make use of the results

  6. Utility of bioinformatics (chart) • Chart labels: Scientific impact; Too little bioinformatics; Too many databases; Too diverse interfaces

  7. Split data and presentation • Databases are responsible for curating data and serving it as primitive datatypes defined by open standards (high cost) • Different front ends, or components of front ends, compete for users (each is cheap to develop), cf. web browsers.

  8. Data Services

  9. Data Services

  10. Data Services

  11. Campus DAS systems (diagram; labels only) • Group labels: Clients, Servers, Sources, Registry • Clients and tools: e! contigview, e! geneview, epigenome, Apollo, 3D structure, otterlace, Pfam • Server software: Dazzle, Proserver, LDAS • Data sources: Ensembl, Pfam, UniProt, PubMed, COSMIC • Coordinate systems: Genome Coordinates, CDS Coordinates, Protein Coordinates, Stable Identifiers, Sequence Alignments

  12. Rise of Federation Technologies • DAS for features • BioMart for data mining • BioMart server is a DAS server • New international genome data projects • routinely using the F word (federation) • frequently the D and B words too (DAS, BioMart) • e.g. International Cancer Genome Consortium

  13. DAS infrastructure status • Lots of progress • Servers: Dazzle, Proserver, MyDas, Bio::Das::Lite • Clients: Ensembl, Vega, Dasty, SPICE, Pfam, Jalview, Pepper, IGB • >500 sources in the DAS registry (http://www.dasregistry.org/) • Broadly adopted by large-scale projects: Ensembl, biosapiens, efamily, ZF-models, eProtein, ENCODE annotation • Extensions in 1.53E: stylesheets, semantic zooming, ontology support, timestamps, interactions • Planned 1.6: incorporating some features of the DAS2 specification • Better adoption of DAS in the US • Opportunities • Searching, writeback • Source ranking, credit, social networking • Inter-client communications protocol • Asynchronous delivery/caching; servers built on servers/workflows • Alternative entry points from servers? Next left/right? Date of addition?

  14. 2008 the year of… • Open access to publications • PMC, ukPMC, Zotero, Papers, MyNCBI, Citeulike, Connotea, 2collab and HubMed • All WT funded publications open in 6 months • All NIH funded publications open in 12 months • DAS for publications? • Text is just a new coordinate system • Links to Social Networks? • Google OpenSocial • Still waiting…

  15. 2009 the year of… • Massive datasets • A single track is likely to be 50 million Solexa transcriptome reads • Need: • Better ways for users to create tracks for large datasets

  16. Problems of large user data (credits to Jim Kent, UCSC) • Easy to generate 1 GB files with next-generation sequencing • 25 million tag mappings at 40 bytes each • Potential to translate into histograms with 1 floating point number every 12 bases • Slow to load into the MySQL database backend of a local DAS server; many users will not want to set up DAS servers • Too large to upload to remote DAS server services (e.g. Ensembl) to create a track • Most users only look at 5-50 sites, less than 1% of the genome
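
The sizes quoted on this slide can be checked with a few lines of arithmetic. The tag count, bytes per tag, and bin width come from the slide; the genome length (about 3.1 billion bases, roughly human) and the 4-byte float size are assumptions added here only to make the histogram estimate concrete.

```python
TAG_MAPPINGS = 25_000_000   # from the slide
BYTES_PER_TAG = 40          # from the slide
BASES_PER_BIN = 12          # from the slide: one float every 12 bases

GENOME_BASES = 3_100_000_000  # assumption: roughly human-sized genome
BYTES_PER_FLOAT = 4           # assumption: single-precision floats


def tag_file_bytes(tags=TAG_MAPPINGS, bytes_per_tag=BYTES_PER_TAG):
    """Raw size of the tag-mapping file."""
    return tags * bytes_per_tag


def histogram_bytes(genome_bases=GENOME_BASES, bases_per_bin=BASES_PER_BIN,
                    bytes_per_float=BYTES_PER_FLOAT):
    """Size of a genome-wide histogram with one float per bin."""
    bins = -(-genome_bases // bases_per_bin)  # ceiling division
    return bins * bytes_per_float


if __name__ == "__main__":
    print(f"tag file:  {tag_file_bytes() / 1e9:.2f} GB")
    print(f"histogram: {histogram_bytes() / 1e9:.2f} GB")
```

Under these assumptions both representations land at around a gigabyte, so neither is something a user can comfortably upload in full to view a handful of sites.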

  17. Jim Kent’s idea • User runs a program to convert their data into a single indexed file (BigWig or BigBed) • User places the file on their own website • UCSC browser fetches parts of the file on demand using HTTP(S) “byte range” queries • Relationship to DAS? • Potential to create a DAS server plugin that serves BigWig/BigBed files as DAS sources
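
The "byte range" mechanism mentioned above is plain HTTP: the client sends a `Range` header and a cooperating server replies with status 206 and only the requested bytes. The sketch below shows that request using Python's standard library. It is not the UCSC implementation and does not read the BigWig/BigBed index; the helper names and any URL passed to them are hypothetical.

```python
import urllib.request


def range_request(url, offset, length):
    """Build a GET request for `length` bytes starting at `offset`.

    HTTP byte ranges are inclusive, so the last byte is offset + length - 1.
    """
    last = offset + length - 1
    return urllib.request.Request(url, headers={"Range": f"bytes={offset}-{last}"})


def fetch_range(url, offset, length, timeout=30):
    """Fetch one slice of a remote file; fail if the server ignores ranges."""
    with urllib.request.urlopen(range_request(url, offset, length), timeout=timeout) as resp:
        if resp.status != 206:  # 206 Partial Content means the range was honoured
            raise RuntimeError(f"server did not honour the range request (status {resp.status})")
        return resp.read()


if __name__ == "__main__":
    # Hypothetical file location; replace with a real, range-capable URL to try it.
    req = range_request("http://www.example.org/tracks/reads.bw", 100, 100)
    print(req.get_header("Range"))
```

A DAS server plugin of the kind the slide proposes would presumably use the file's index to decide which offsets to request, then translate the returned records into DAS features.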

  18. Acknowledgements • Ewan Birney, Tony Cox, Thomas Down, Rob Finn, Stefan Graf, David Jackson, Andreas Kahari, Eugene Kulesha, Henning Hermjakob, Roger Pettett, Matt Pocock, James Smith, Jim Stalker, Janet Thornton, Jonathan Warren, Andy Jenkinson, Andreas Prlić • Ensembl/Sanger Web team • efamily, biosapiens, eProtein • Zebrafish analysis (ZF-models) • Anacode/Acedb (otterlace/Zmap)

  19. 2009 the year of… • Massive datasets • A single track is likely to be 50 million Solexa transcriptome reads • Private datasets • EGA requires registration and logins • Even summary data is currently not public • Need: • Better ways for users to create tracks for large datasets • Federated access controls for patient data

  20. To do: tiling array DAS stylesheet magic (Eugene Kulesha)
