Explore a case study on developing a high-throughput phylogenomics platform, including conclusions and lessons for future research. Delve into the challenges, parameters, development cycle, and conclusions drawn from genome-scale data analysis. Learn about the essential factors, lessons learned, and the trade-offs between bespoke and off-the-shelf solutions. Discover the potential for real-time genomic transects and the missing fundamental data in various biological systems.
Developing a flexible platform for high-throughput phylogenomics: Case study, conclusions and lessons for the future Joe Parker, Georgia Tsagkogeorga, James A. Cotton and Stephen J. Rossiter Queen Mary, University of London
In case you hadn’t noticed.. “Recent advances in next-generation sequencing (NGS) technologies… now allow us to… in more detail than ever before…” - Every Grant Application Ever
Lab Interests • Ecology and evolution of traits • Echolocation, sociality • NGS data for population genetics and phylogenomics
The task • Phylogeny estimation/comparison • Molecular correlates of evolution: • site substitutions, dN/dS, composition • Simulation • Dataset limitations
(R-L): Joe Parker; Georgia Tsagkogeorga; Kalina Davies; Steve Rossiter; Xiuguang Mao; Seb Bailey
The parameters • De novo genomes: • four taxa • 2,321 protein-coding loci • 801,301 codons • Published: • 18 genomes • ~69,000 simulated datasets • ~3,500 cluster cores
Development cycle • Design • Wireframe & specify tests • Implement • Review, refine & refactor • Alignment: loadSequences(), getSubstitutions() • Phylogeny: trimTaxa(), getMRCA() • DataSeries: calculateECDF(), randomise() • Regression: getResiduals(), predictInterval()
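As an illustration of the kind of small, testable unit this cycle produces, here is a minimal Python sketch of a DataSeries-like class. It is not the authors' implementation: the class body, the snake_case method names `calculate_ecdf` and `randomise`, and the example values are assumptions loosely modelled on the `calculateECDF()` and `randomise()` names on the slide.

```python
import random
from bisect import bisect_right


class DataSeries:
    """Hypothetical stand-in for the DataSeries box on the slide."""

    def __init__(self, values):
        # Keep the observations sorted so the ECDF is a binary search.
        self.values = sorted(float(v) for v in values)

    def calculate_ecdf(self, x):
        """Return the fraction of observations less than or equal to x."""
        if not self.values:
            raise ValueError("cannot compute an ECDF for an empty series")
        return bisect_right(self.values, x) / len(self.values)

    def randomise(self, seed=None):
        """Return a shuffled copy of the observations; the series itself is unchanged."""
        rng = random.Random(seed)
        shuffled = list(self.values)
        rng.shuffle(shuffled)
        return shuffled


# Made-up values, for illustration only.
series = DataSeries([0.1, 0.4, 0.4, 0.9])
print(series.calculate_ecdf(0.4))  # 0.75
```

Writing the expected value (0.75 here) down before implementing the method is one way to read "Wireframe & specify tests".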
Serialisation • Process data remotely • Freeze-dry objects, download to desktop • Implement new methods directly on previously-analysed data
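The freeze-dry idea can be sketched as follows. The slide does not say which serialisation mechanism was used, so JSON is swapped in here purely for illustration; `LocusResult`, its fields, and `total_substitutions` are hypothetical names, and the counts are made up.

```python
import json


class LocusResult:
    """Hypothetical per-locus result object produced on the cluster."""

    def __init__(self, locus_id, substitutions):
        self.locus_id = locus_id
        self.substitutions = dict(substitutions)  # taxon name -> substitution count

    def to_json(self):
        """'Freeze-dry' the object into a string that can be written to disk."""
        return json.dumps(
            {"locus_id": self.locus_id, "substitutions": self.substitutions}
        )

    @classmethod
    def from_json(cls, text):
        """Rebuild the object on the desktop from its frozen form."""
        payload = json.loads(text)
        return cls(payload["locus_id"], payload["substitutions"])


def total_substitutions(result):
    """A 'new method' written after the remote analysis had already finished."""
    return sum(result.substitutions.values())


frozen = LocusResult("locus_0001", {"taxon_A": 3, "taxon_B": 1}).to_json()
thawed = LocusResult.from_json(frozen)
print(total_substitutions(thawed))  # 4
```

The point of the design is that the expensive step (producing the result) runs once, while cheap follow-up questions run against the stored objects.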
Conclusions (photo: Junichuro Ayoyama, http://www.flickr.com/people/24841050@N00)
Distributions • Genome-scale data provides context • Identify outliers: genes / taxa / trees • Compare values across biological systems
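One simple way to use a genome-wide distribution as context is to flag loci that fall in its empirical upper tail. The sketch below assumes that approach. The function name, the tail threshold, and the dN/dS-like values are all hypothetical, and the slides do not state which outlier criterion was actually used.

```python
def upper_tail_outliers(values_by_gene, tail=0.05):
    """Flag genes whose value lies in the empirical upper tail.

    For each gene, compute the proportion of all genes with a value at
    least as large; genes with a proportion <= tail are returned.
    """
    all_values = list(values_by_gene.values())
    n = len(all_values)
    flagged = {}
    for gene, value in values_by_gene.items():
        proportion = sum(v >= value for v in all_values) / n
        if proportion <= tail:
            flagged[gene] = proportion
    return flagged


# Made-up per-gene values, for illustration only.
dnds = {
    "gene_01": 0.05, "gene_02": 0.08, "gene_03": 0.11, "gene_04": 0.15,
    "gene_05": 0.19, "gene_06": 0.22, "gene_07": 0.27, "gene_08": 0.31,
    "gene_09": 0.45, "gene_10": 1.90,
}
print(upper_tail_outliers(dnds, tail=0.15))  # {'gene_10': 0.1}
```

The same function works unchanged if the keys are taxa or trees rather than genes, since it only compares each value against the whole set.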
Parameter investigation • Multiple configurations • Hyperparameters empirically investigated • Determine sensitivity of results
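A sensitivity check of this kind can be sketched as a sweep over every combination of settings. Everything in the example is hypothetical: the filter parameters `min_length` and `max_gap_fraction`, the toy analysis `count_passing`, and the locus data stand in for whatever configurations the real pipeline exposed.

```python
from itertools import product


def count_passing(loci, min_length, max_gap_fraction):
    """Toy analysis: count loci that survive two filtering thresholds."""
    return sum(
        1
        for length, gap_fraction in loci
        if length >= min_length and gap_fraction <= max_gap_fraction
    )


def sweep(loci, grid):
    """Run the analysis for every combination of parameter values in grid."""
    names = sorted(grid)
    results = {}
    for combo in product(*(grid[name] for name in names)):
        config = dict(zip(names, combo))
        results[combo] = count_passing(loci, **config)
    return names, results


# Made-up loci as (alignment length, gap fraction) pairs.
loci = [(300, 0.05), (150, 0.20), (900, 0.40), (600, 0.10)]
grid = {"min_length": [100, 500], "max_gap_fraction": [0.1, 0.5]}

names, results = sweep(loci, grid)
for combo, n_passing in results.items():
    print(dict(zip(names, combo)), n_passing)

# A crude sensitivity measure: how far the result moves across the grid.
spread = max(results.values()) - min(results.values())
print(spread)  # 3
```

If the spread is large relative to the result itself, the conclusion depends on the configuration and is worth reporting alongside it.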
Lessons • Well-defined research questions: • ‘Find the best tree’ • ‘Estimate dN/dS’ • Questions arise from data: • ‘How many genes have at least k substitutions in k or more taxa?’ • Data-hypothesis-analysis cycle implies feature creep • Use of available databases, e.g. ontology; orthology; expression
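A data-driven question like the one quoted above can be answered with a short query over per-gene, per-taxon counts. The sketch below assumes that data layout and uses made-up gene names and counts. The slide uses k for both thresholds; here they are separate parameters (`min_substitutions`, `min_taxa`) so either can be varied.

```python
def count_genes(substitutions, min_substitutions, min_taxa):
    """Count genes with >= min_substitutions in >= min_taxa taxa.

    substitutions maps gene name -> {taxon name: substitution count}.
    """
    count = 0
    for per_taxon in substitutions.values():
        qualifying_taxa = sum(
            1 for n in per_taxon.values() if n >= min_substitutions
        )
        if qualifying_taxa >= min_taxa:
            count += 1
    return count


# Made-up counts for three genes across four taxa.
substitutions = {
    "gene_a": {"taxon_A": 3, "taxon_B": 2, "taxon_C": 0, "taxon_D": 4},
    "gene_b": {"taxon_A": 1, "taxon_B": 0, "taxon_C": 0, "taxon_D": 2},
    "gene_c": {"taxon_A": 2, "taxon_B": 2, "taxon_C": 2, "taxon_D": 2},
}

k = 2
print(count_genes(substitutions, min_substitutions=k, min_taxa=k))  # 2
```

Each new question of this form adds another small function, which is the feature creep the slide warns about.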
Sequence reads = observations • Unlimited flexibility, finite time • Development trade-off • Off-the-shelf • Bespoke • Exploratory work • Real-time genomic transects? • Essential fundamental data missing from nearly every system: • Diversity; structure; substitution rates; dN/dS; recombination; dispersal; lateral transfer
Thanks • Steve Rossiter (1), James Cotton (2), Elia Stupka (3) & Georgia Tsagkogeorga (1) • (1) School of Biological and Chemical Sciences, Queen Mary, University of London • (2) Wellcome Trust Sanger Institute • (3) Center for Translational Genomics and Bioinformatics, San Raffaele Institute, Milan • Chris Walker & Dan Traynor, Queen Mary GridPP High-throughput Cluster • Chaz Mein & Anna Terry, Barts and The London Genome Centre • Mahesh Pancholi, School of Biological and Chemical Sciences • BBSRC (UK); Queen Mary, University of London