Joe Parker, Georgia Tsagkogeorga, James A. Cotton and Stephen J. Rossiter

Developing a flexible platform for high-throughput phylogenomics: Case study, conclusions and lessons for the future. Joe Parker, Georgia Tsagkogeorga, James A. Cotton and Stephen J. Rossiter, Queen Mary University London.

Presentation Transcript


  1. Developing a flexible platform for high-throughput phylogenomics: Case study, conclusions and lessons for the future. Joe Parker, Georgia Tsagkogeorga, James A. Cotton and Stephen J. Rossiter, Queen Mary University London

  2. In case you hadn’t noticed... “Recent advances in next-generation sequencing (NGS) technologies… now allow us to… in more detail than ever before…” - Every Grant Application Ever

  3. Lab Interests • Ecology and evolution of traits • Echolocation, sociality • NGS data for population genetics and phylogenomics

  4. The task • Phylogeny estimation/comparison • Molecular correlates of evolution: site substitutions, dN/dS, composition • Simulation • Dataset limitations • Pictured (R-L): Joe Parker; Georgia Tsagkogeorga; Kalina Davies; Steve Rossiter; Xiuguang Mao; Seb Bailey

  5. The parameters • De novo genomes: • four taxa • 2,321 protein-coding loci • 801,301 codons • Published: • 18 genomes • ~69,000 simulated datasets • ~3,500 cluster cores

  6. Development cycle • Stages: Design; Wireframe & specify tests; Implement; Review, refine & refactor • Alignment: loadSequences(), getSubstitutions() • Phylogeny: trimTaxa(), getMRCA() • DataSeries: calculateECDF(), randomise() • Regression: getResiduals(), predictInterval()
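
The slide names a DataSeries class with calculateECDF() and randomise() methods but does not show their implementation or language. The following is a minimal Python sketch of what such methods might do, assuming that calculateECDF() evaluates the empirical cumulative distribution function and that randomise() returns a random permutation of the observations. The class body, the snake_case method names and the `seed` parameter are illustrative assumptions, not the authors' code.

```python
import bisect
import random


class DataSeries:
    """Hypothetical stand-in for the DataSeries class named on the slide."""

    def __init__(self, values):
        # Keep the observations sorted so the ECDF can use binary search.
        self.values = sorted(values)

    def calculate_ecdf(self, x):
        """Return the fraction of observations less than or equal to x."""
        if not self.values:
            raise ValueError("cannot compute an ECDF for an empty series")
        return bisect.bisect_right(self.values, x) / len(self.values)

    def randomise(self, seed=None):
        """Return a shuffled copy of the observations (assumed behaviour)."""
        rng = random.Random(seed)
        shuffled = list(self.values)
        rng.shuffle(shuffled)
        return shuffled


series = DataSeries([0.1, 0.4, 0.4, 0.9])
proportion_at_or_below = series.calculate_ecdf(0.4)
permuted = series.randomise(seed=1)
```

Writing the tests for methods like these before implementing them matches the "Wireframe & specify tests" stage of the cycle.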

  7. Serialisation • Process data remotely • Freeze-dry objects, download to desktop • Implement new methods directly on previously-analysed data
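
The slides do not say which serialisation mechanism the platform uses. As a rough illustration of the "freeze-dry" idea, the sketch below uses Python's standard `pickle` module: a result computed remotely is turned into bytes, restored elsewhere, and then passed to a function written after the original analysis. The record layout and the names `freeze`, `thaw` and `total_substitutions` are hypothetical.

```python
import pickle


def freeze(record):
    """Serialise an analysed record to bytes (e.g. on the cluster)."""
    return pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)


def thaw(blob):
    """Restore a record from bytes (e.g. on the desktop).

    Only unpickle data from a trusted source: pickle can execute
    arbitrary code while loading.
    """
    return pickle.loads(blob)


def total_substitutions(record):
    """A 'new method' written after the data were analysed and stored."""
    return sum(record["substitutions"].values())


# Hypothetical per-locus result; the field names are illustrative only.
analysed = {"locus": "locus_0001", "substitutions": {"taxon_A": 3, "taxon_B": 1}}

blob = freeze(analysed)   # these bytes could be written to a file and downloaded
restored = thaw(blob)
summary = total_substitutions(restored)
```

The point of the pattern is that the expensive step runs once, while later questions are answered from the stored objects.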

  8. Conclusions (image credit: Junichuro Ayoyama, http://www.flickr.com/people/24841050@N00)

  9. Distributions • Genome-scale data provides context • Identify outliers: genes / taxa / trees • Compare values across biological systems
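
One simple way to use a genome-wide distribution as context is to flag the genes whose statistic falls in the extreme upper tail of the empirical distribution. The sketch below assumes a per-gene statistic such as dN/dS and a 5% tail. Both the threshold and the example values are invented for illustration and are not taken from the study.

```python
import bisect


def upper_tail_outliers(values_by_gene, tail=0.05):
    """Return genes whose value lies in the top `tail` fraction of all genes."""
    ranked = sorted(values_by_gene.values())
    n = len(ranked)
    outliers = []
    for gene, value in values_by_gene.items():
        # Number of genes with a value strictly greater than this one.
        greater = n - bisect.bisect_right(ranked, value)
        if greater / n < tail:
            outliers.append(gene)
    return sorted(outliers)


# Invented values: 19 unremarkable genes and one extreme gene.
dnds = {f"gene_{i:02d}": 0.10 + 0.01 * i for i in range(19)}
dnds["gene_19"] = 2.5

flagged = upper_tail_outliers(dnds, tail=0.05)
```

The same function could be applied to values per taxon or per tree, since it only needs a mapping from a label to a number.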

  10. Parameter investigation • Multiple configurations • Hyperparameters empirically investigated • Determine sensitivity of results
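
Investigating several configurations can be sketched as a grid sweep: run the same analysis under every combination of settings and look at how much the result moves. In the sketch below, `toy_analysis` and the parameter names `min_length` and `max_gap_fraction` are placeholders invented for illustration. They do not correspond to any settings named in the slides.

```python
from itertools import product


def sweep(analysis, grid):
    """Run `analysis` once for every combination of parameter values."""
    names = sorted(grid)
    results = {}
    for combo in product(*(grid[name] for name in names)):
        config = dict(zip(names, combo))
        results[combo] = analysis(**config)
    return names, results


def toy_analysis(min_length, max_gap_fraction):
    """Placeholder statistic standing in for a real analysis step."""
    return min_length * (1.0 - max_gap_fraction)


grid = {"min_length": [100, 200], "max_gap_fraction": [0.0, 0.5]}
names, results = sweep(toy_analysis, grid)

# A crude sensitivity measure: the spread of the result across configurations.
spread = max(results.values()) - min(results.values())
```

A large spread relative to the effect of interest would suggest that conclusions depend on the configuration chosen.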

  11. Lessons • Well-defined research questions: • ‘Find the best tree’ • ‘Estimate dN/dS’ • Questions arise from data: • ‘How many genes have at least k substitutions in k or more taxa?’ • Data-hypothesis-analysis cycle implies feature creep • Use of available databases, e.g. ontology; orthology; expression
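
A data-driven question such as "how many genes have at least k substitutions in some minimum number of taxa?" is cheap to answer once per-gene substitution counts are available. The sketch below assumes the counts are held as a nested mapping of gene to taxon to count, and uses two separate thresholds (`min_subs` and `min_taxa`). The data layout, the function name and the example counts are all hypothetical.

```python
def count_genes(substitutions, min_subs, min_taxa):
    """Count genes with at least `min_subs` substitutions in at least `min_taxa` taxa."""
    count = 0
    for per_taxon in substitutions.values():
        # Number of taxa in this gene that reach the substitution threshold.
        taxa_meeting = sum(1 for n in per_taxon.values() if n >= min_subs)
        if taxa_meeting >= min_taxa:
            count += 1
    return count


# Invented counts: gene -> taxon -> number of substitutions.
example = {
    "gene_A": {"t1": 3, "t2": 2, "t3": 0},
    "gene_B": {"t1": 1, "t2": 0, "t3": 0},
    "gene_C": {"t1": 2, "t2": 2, "t3": 2},
}

genes_2_in_2 = count_genes(example, min_subs=2, min_taxa=2)
```

Questions of this kind tend to multiply as the data are explored, which is the feature creep the slide warns about.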

  12. Sequence reads = observations • Unlimited flexibility, finite time • Development trade-off • Off-the-shelf • Bespoke • Exploratory work • Real time genomic transects? • Essential fundamental data missing from nearly every system; • Diversity; structure; substitution rates; dN/dS; recombination; dispersal; lateral transfer

  13. Thanks • Steve Rossiter (1), James Cotton (2), Elia Stupka (3) & Georgia Tsagkogeorga (1) • (1) School of Biological and Chemical Sciences, Queen Mary, University of London; (2) Wellcome Trust Sanger Institute; (3) Center for Translational Genomics and Bioinformatics, San Raffaele Institute, Milan • Chris Walker & Dan Traynor, Queen Mary GridPP High-throughput Cluster • Chaz Mein & Anna Terry, Barts and The London Genome Centre • Mahesh Pancholi, School of Biological and Chemical Sciences • BBSRC (UK); Queen Mary, University of London
