Flexible Platform Development for Phylogenomics: Case Study & Future Insights

Developing a flexible platform for high-throughput phylogenomics: Case study, conclusions and lessons for the future
Joe Parker, Georgia Tsagkogeorga, James A. Cotton and Stephen J. Rossiter
Queen Mary, University of London

In case you hadn’t noticed.. “Recent advances in next-generation sequencing (NGS) technologies… now allow us to… in more detail than ever before…” - Every Grant Application Ever

Lab Interests • Ecology and evolution of traits • Echolocation, sociality • NGS data for population genetics and phylogenomics

The task • Phylogeny estimation/comparison • Molecular correlates of evolution: site substitutions, dN/dS, composition • Simulation • Dataset limitations
(R-L): Joe Parker; Georgia Tsagkogeorga; Kalina Davies; Steve Rossiter; Xiuguang Mao; Seb Bailey

The parameters • De novo genomes: • four taxa • 2,321 protein-coding loci • 801,301 codons • Published: • 18 genomes • ~69,000 simulated datasets • ~3,500 cluster cores

Development cycle • Design • Wireframe & specify tests • Implement • Review, refine & refactor
Classes and methods • Alignment: loadSequences(), getSubstitutions() • Phylogeny: trimTaxa(), getMRCA() • DataSeries: calculateECDF(), randomise() • Regression: getResiduals(), predictInterval()
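The "wireframe & specify tests" step can be illustrated with a small sketch. Everything here is an assumption for illustration: the slides do not say which language the platform uses or what getSubstitutions() actually returns, so this Python stand-in simply treats a substitution as a site where a taxon differs from a chosen reference sequence, ignoring gaps. Method names follow the slide, converted to Python's snake_case.

```python
class Alignment:
    """Hypothetical stand-in for the Alignment class named on the slide."""

    def __init__(self):
        self.sequences = {}  # taxon name -> aligned sequence

    def load_sequences(self, fasta_text):
        """Parse FASTA-formatted text into the taxon -> sequence mapping."""
        name = None
        for raw in fasta_text.splitlines():
            line = raw.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].strip()
                self.sequences[name] = ""
            elif name is not None:
                self.sequences[name] += line.upper()

    def get_substitutions(self, reference):
        """Return, per taxon, the 0-based sites that differ from the reference.

        Assumed behaviour: sites where either sequence has a gap are skipped.
        """
        ref = self.sequences[reference]
        differences = {}
        for taxon, seq in self.sequences.items():
            if taxon == reference:
                continue
            differences[taxon] = [
                site for site, (a, b) in enumerate(zip(ref, seq))
                if a != b and "-" not in (a, b)
            ]
        return differences


def test_get_substitutions():
    """The test is specified first; the method above is written to satisfy it."""
    aln = Alignment()
    aln.load_sequences(">ref\nACGT\n>taxon1\nACGA\n>taxon2\nA-GT\n")
    assert aln.get_substitutions("ref") == {"taxon1": [3], "taxon2": []}
```

Writing the test before the method keeps the specification explicit, which makes the later "review, refine & refactor" step safer.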

Serialisation • Process data remotely • Freeze-dry objects, download to desktop • Implement new methods directly on previously-analysed data
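A minimal sketch of the freeze-dry idea, assuming Python's standard pickle module rather than whatever mechanism the platform actually uses. The result layout (a dict of per-locus dN/dS values) and the function names are hypothetical. The point is that a method written after the cluster run, here median_dnds(), can be applied to the restored results without repeating the analysis.

```python
import pickle
import statistics


def freeze(results):
    """Serialise finished results to bytes, e.g. for writing to a file on the cluster."""
    return pickle.dumps(results, protocol=pickle.HIGHEST_PROTOCOL)


def thaw(blob):
    """Restore results on the desktop. Only unpickle data from a trusted source."""
    return pickle.loads(blob)


def median_dnds(results):
    """A 'new method' added later and applied to previously analysed data."""
    return statistics.median(entry["dnds"] for entry in results.values())


# Hypothetical per-locus results produced remotely.
remote_results = {
    "locusA": {"dnds": 0.2},
    "locusB": {"dnds": 0.6},
    "locusC": {"dnds": 0.1},
}
restored = thaw(freeze(remote_results))
```

The sketch stores only built-in types. Pickling instances of custom classes also works, but the class definition must be importable wherever the data is restored.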

Conclusions
Photo: Junichuro Ayoyama, http://www.flickr.com/people/24841050@N00

Distributions • Genome-scale data provides context • Identify outliers: genes / taxa / trees • Compare values across biological systems
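One way to use a genome-wide distribution as context is to rank each gene against the empirical distribution of all genes and flag those in the tails. The sketch below is an assumption about how that could be done, not the method used in the study; the function names and the 5% two-tailed cut-off are illustrative.

```python
from bisect import bisect_left, bisect_right


def tail_probabilities(values):
    """Build a function giving empirical P(X <= x) and P(X >= x) for the sample."""
    ordered = sorted(values)
    n = len(ordered)

    def tails(x):
        lower = bisect_right(ordered, x) / n        # proportion of values <= x
        upper = (n - bisect_left(ordered, x)) / n   # proportion of values >= x
        return lower, upper

    return tails


def find_outliers(per_gene, alpha=0.05):
    """Return genes whose statistic lies in either empirical tail of size alpha / 2."""
    tails = tail_probabilities(per_gene.values())
    return [
        gene for gene, value in per_gene.items()
        if min(tails(value)) <= alpha / 2
    ]


# Hypothetical per-gene statistic for 100 genes.
per_gene = {f"gene{i}": float(i) for i in range(1, 101)}
flagged = find_outliers(per_gene)
```

This is a ranking rather than a significance test: some genes always fall in the tails, so flagged genes are candidates for closer inspection, not confirmed outliers.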

Parameter investigation • Multiple configurations • Hyperparameters empirically investigated • Determine sensitivity of results
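Investigating multiple configurations can be sketched as a grid search followed by a simple sensitivity summary. The parameter names (min_length, max_gap_fraction) and the toy analysis are hypothetical placeholders; a real run would submit each configuration to the cluster instead of calling a local function.

```python
from itertools import product


def run_grid(analysis, grid):
    """Run the analysis once for every combination of parameter values."""
    names = sorted(grid)
    results = []
    for combo in product(*(grid[name] for name in names)):
        config = dict(zip(names, combo))
        results.append((config, analysis(**config)))
    return results


def sensitivity(results, parameter):
    """Range of the mean result across one parameter's values, averaged over the others."""
    by_value = {}
    for config, value in results:
        by_value.setdefault(config[parameter], []).append(value)
    means = [sum(vals) / len(vals) for vals in by_value.values()]
    return max(means) - min(means)


def toy_analysis(min_length, max_gap_fraction):
    """Placeholder result that depends on min_length only."""
    return min_length * 0.01


grid = {"min_length": [100, 200], "max_gap_fraction": [0.1, 0.5]}
grid_results = run_grid(toy_analysis, grid)
```

A sensitivity near zero for a parameter suggests the result is robust to that setting over the range tested; a large value marks a setting that needs to be justified or reported.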

Lessons • Well-defined research questions: • ‘Find the best tree’ • ‘Estimate dN/dS’ • Questions arise from data: • ‘How many genes have at least k substitutions in k or more taxa?’ • Data-hypothesis-analysis cycle implies feature creep • Use of available databases, e.g. ontology; orthology; expression
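A question such as 'How many genes have at least k substitutions in k or more taxa?' is the kind of ad hoc query that drives feature creep. A minimal sketch, assuming per-gene substitution counts are already available as a nested mapping (a hypothetical layout), could look like this. The slide uses k for both thresholds; the optional min_taxa argument is an added assumption so the two can be varied independently.

```python
def count_genes(substitutions, k, min_taxa=None):
    """Count genes with at least k substitutions in at least min_taxa taxa.

    substitutions maps gene -> {taxon: substitution count}.
    min_taxa defaults to k, matching the wording of the question.
    """
    if min_taxa is None:
        min_taxa = k
    count = 0
    for per_taxon in substitutions.values():
        taxa_meeting = sum(1 for n in per_taxon.values() if n >= k)
        if taxa_meeting >= min_taxa:
            count += 1
    return count


# Hypothetical substitution counts for three genes in three taxa.
example = {
    "geneA": {"taxon1": 3, "taxon2": 2, "taxon3": 0},
    "geneB": {"taxon1": 1, "taxon2": 0, "taxon3": 0},
    "geneC": {"taxon1": 2, "taxon2": 2, "taxon3": 2},
}
```

With k = 2, geneA and geneC qualify, since each has two or more taxa with at least two substitutions.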

Sequence reads = observations • Unlimited flexibility, finite time • Development trade-off: off-the-shelf vs. bespoke • Exploratory work • Real-time genomic transects? • Essential fundamental data missing from nearly every system: diversity; structure; substitution rates; dN/dS; recombination; dispersal; lateral transfer

Thanks
Steve Rossiter (1), James Cotton (2), Elia Stupka (3) & Georgia Tsagkogeorga (1)
(1) School of Biological and Chemical Sciences, Queen Mary, University of London
(2) Wellcome Trust Sanger Institute
(3) Center for Translational Genomics and Bioinformatics, San Raffaele Institute, Milan
Chris Walker & Dan Traynor, Queen Mary GridPP High-throughput Cluster
Chaz Mein & Anna Terry, Barts and The London Genome Centre
Mahesh Pancholi, School of Biological and Chemical Sciences
BBSRC (UK); Queen Mary, University of London
