Phyloinformatics of Neuraminidase at Micro and Macro Levels using Grid-enabled HPC Technologies
B. Schmidt (UNSW), D.T. Singh (Genvea Biosciences), R. Trehan, T. Bretschneider (NTU, Singapore)
Contents
• H5N1 Genetics
• H5N1 Phyloinformatics
• Design Principles of Quascade
• H5N1 Phyloinformatics with Quascade
• Results
• Conclusion and Future Work
H5N1 Genetics
• Belongs to the Influenza A virus type
• Segmented RNA genome
  • 8 genes, 11 proteins
• Classification based on:
  • Hemagglutinin (HA): 15 subtypes
  • Neuraminidase (NA): 9 subtypes
• Genetic variations in HA/NA
  • Genetic drift
    • Point mutations
    • 1918 Spanish flu
  • Genetic shift
    • Reassortment of the segmented genome
    • 1957, 1968, 1997 pandemics
    • 2003 Z strain of H5N1
H5N1 Phyloinformatics
• Essential for monitoring newly emerging strains
  • Molecular evolution at gene and genome level
  • Phylogenetic analysis for determining the origin of new strains
• Phylogenetics
  • How fast do proteins evolve?
  • What is the best method to measure evolution?
  • How can the best phylogenetic tree be obtained?
• Phylogenetic algorithms
  • Character based: Maximum Parsimony (MP), Maximum Likelihood (ML)
  • Distance based: UPGMA, Neighbor Joining (NJ)
  • Bayesian MCMC based: MrBayes, BEAST
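To make the distance-based family concrete, here is a minimal UPGMA sketch in Python. It is an illustration only, not the implementation used in the study: the function name `upgma`, the nested-tuple tree representation, and the toy distances are all assumptions made for this example, and branch lengths are omitted.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa with UPGMA and return the topology as nested tuples.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) -> distance for every pair of labels.
    """
    sizes = {name: 1 for name in names}  # active cluster -> number of leaves
    d = dict(dist)                       # working copy, extended as clusters merge
    while len(sizes) > 1:
        # Pick the closest pair of active clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted mean of the two old distances.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))


# Toy distances (made up for illustration, not taken from the NA data sets).
toy_names = ["A", "B", "C", "D"]
toy_dist = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
example_tree = upgma(toy_names, toy_dist)
print(example_tree)
```

UPGMA assumes a constant evolutionary rate across lineages; Neighbor Joining drops that assumption, which is one reason the two methods can give different trees for the same distance matrix.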
Quascade – User Interface
[Figure: example processing pipeline; label: Communication]
• A data-flow tool in which each black box represents a Java object, and the objects can run on different computers
• Objects are assigned to available computers automatically (or manually if required)
• Communication between objects is handled transparently
• Objects are configured before run time
Object Features
[Figure: three boxes, each labelled "Java Object"]
• Coding in regular Java, C, or C++
• Persistent: activated whenever all data inputs are present
• No explicit messaging protocol required
• No distributed computing concepts need to be understood
• Objects are assigned to computers / CPU cores automatically or manually
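The "activated whenever all data inputs are present" behaviour can be sketched as follows. This is a single-process Python illustration of the general data-flow idea, not Quascade's API: the class `DataflowNode`, its methods, and the port names are hypothetical, and the real tool distributes Java objects across machines.

```python
class DataflowNode:
    """Hypothetical pipeline component that fires once every input port is filled."""

    def __init__(self, name, input_ports, func):
        self.name = name
        self.func = func
        self.values = {port: None for port in input_ports}
        self.filled = {port: False for port in input_ports}
        self.listeners = []  # (downstream node, downstream port) pairs

    def connect(self, downstream, port):
        """Route this node's output to one input port of another node."""
        self.listeners.append((downstream, port))

    def receive(self, port, value):
        """Store an input; run the node only when all inputs are present."""
        self.values[port] = value
        self.filled[port] = True
        if all(self.filled.values()):
            args = dict(self.values)
            # Reset the ports so the node stays alive and can fire again.
            for p in self.filled:
                self.filled[p] = False
            result = self.func(**args)
            for node, downstream_port in self.listeners:
                node.receive(downstream_port, result)


# Usage: a two-input node feeding a sink that records what it receives.
results = []
build_tree = DataflowNode(
    "build_tree",
    ["matrix", "method"],
    lambda matrix, method: f"{method}({matrix})",
)
sink = DataflowNode("sink", ["tree"], lambda tree: results.append(tree))
build_tree.connect(sink, "tree")

build_tree.receive("matrix", "dist-581")  # only one input so far: nothing runs
build_tree.receive("method", "NJ")        # both inputs present: node fires
build_tree.receive("method", "UPGMA")     # ports were reset: waits again
build_tree.receive("matrix", "dist-909")  # fires a second time
print(results)
```

The caller never sends an explicit message to `sink`; wiring the ports is enough, which mirrors the "no explicit messaging protocol" point above.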
Data and Algorithms
• Data sets
  • Core group: 22 H5N1 NA sequences from Swiss-Prot and TrEMBL
  • Medium set: 581 H5N1 NA sequences from UniProt
  • Large set: 909 Influenza A NA sequences from UniProt
• Algorithms
  • ProtDist
  • NJ
  • UPGMA
  • ProtPars
  • ProtML
  • MrBayes
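The first stage of a distance-based workflow turns aligned protein sequences into a pairwise distance matrix. The sketch below uses the plain p-distance (fraction of differing aligned sites) as a deliberately simple stand-in; ProtDist computes model-based distances, so this is not a reimplementation of it. The function names and the short toy sequences are assumptions for illustration, not data from the study.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of differing residues over aligned sites without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differing = 0
    for x, y in zip(seq_a, seq_b):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no gap-free sites to compare")
    return differing / compared


def distance_matrix(aligned):
    """Map each unordered pair of sequence names to its p-distance."""
    return {
        frozenset((a, b)): p_distance(aligned[a], aligned[b])
        for a, b in combinations(aligned, 2)
    }


# Toy alignment (made-up residues, for illustration only).
toy_alignment = {
    "s1": "MNPNQKII",
    "s2": "MNPNQKIT",
    "s3": "MNP-QRIT",
}
toy_matrix = distance_matrix(toy_alignment)
for pair, value in toy_matrix.items():
    print(sorted(pair), round(value, 3))
```

A matrix in this form can be handed directly to a clustering step such as UPGMA or NJ, which is the hand-off a distance-based pipeline automates.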
Runtime and Scalability (NA Bird Flu Protein)
[Figure: two bar charts, "Distance-based workflow" and "MP workflow". Y-axis: processing time [h], 0 to 400. X-axis: 909 sequences, 581 sequences. Legend: 25 processors, 1 processor. Bar labels surviving extraction: 360, 360, 145, 140, 16, 16, 6, 5; their assignment to individual bars is not recoverable.]
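Scalability figures of this kind are usually summarised as speedup and parallel efficiency. The sketch below assumes, and the extracted slide text does not confirm this pairing, that the 360 h and 16 h labels are the 1-processor and 25-processor times for the same 909-sequence workload. The function names are illustrative.

```python
def speedup(t_serial, t_parallel):
    """Ratio of single-processor runtime to parallel runtime."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Speedup divided by the number of processors (1.0 is ideal)."""
    return speedup(t_serial, t_parallel) / processors


# Assumed pairing of chart labels (not stated in the extracted text).
t1_hours, t25_hours, n_proc = 360.0, 16.0, 25
s = speedup(t1_hours, t25_hours)
e = efficiency(t1_hours, t25_hours, n_proc)
print(f"speedup = {s:.1f}x, efficiency = {e:.0%}")
```

Under that assumption the workflow would be running at about 22.5x on 25 processors, i.e. roughly 90% efficiency; a different pairing of the labels would change these numbers.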
Analysis and Observations
• Clustering possibilities
  • Temporal, host-based, geographical
• Algorithms
  • MrBayes and ProtML are the most consistent in their performance
  • Both are too compute-intensive for the larger "macro" sets
• Observed pattern
  • All phylograms yielded geography-based rather than time-based clustering
  • Host ranges vary along the clustered clades
  • The same strain, with identical NA sequences, can infect different hosts
  • NA may not be the sole factor determining the diverse host range
• Glycan site acquisition or loss seems to play a critical role in the molecular evolution of H5N1 NA
• Identifying "bridging isolates" may help in rapid monitoring and in developing a global-scale warning system for H5N1
Conclusion and Future Work
• Quascade
  • New graphical data-flow tool for designing pipelines / workflows that are grid-enabled automatically
  • Supports implicit high-performance parallelization
  • Supports persistent components
  • Can be used with Java, C, or C++ code or with application binaries
• H5N1 Phyloinformatics
  • Can take advantage of the workflow system and HPC
  • Can be easily used and modified by biologists
  • Uses H5N1 NA sequences to better understand the evolution of H5N1
  • Analysis of H5N1 NA data with different algorithms indicates spatial clustering by geographical distribution rather than clustering by time or host
• Future work
  • Studies in conjunction with other proteins such as HA and polymerase, and also at gene and genome level