WP13.2 Technical Feasibility Study
Assessment of supercomputing facilities for genomic data
Sarah Hunter, EMBL-EBI
Rationale behind the TFS
The protein classification problem
[Diagram: Genome projects → Unknown proteins → Hidden Markov Models → Protein classified]
Rationale behind the TFS • Rapid growth of sequence databases: • Original projection: 60 million sequences by mid-2009 • Main contributor in addition to genome sequencing: metagenomics • Protein signature (HMM) databases are also growing • “Peak-demand” calculations
Aims of the feasibility study • Determine the feasibility of running bioinformatics tasks at supercomputing facilities • For example: running the HMMER software to search profile hidden Markov models (HMMs), which represent conserved protein functional domains and families, against protein sequence repositories in order to classify those sequences functionally (e.g. as in InterPro); a sketch of such a search appears below • Feasibility metrics: • performance (compute and network) • ease of set-up • ease of use • cost-effectiveness
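A minimal sketch of the kind of search task assessed in the study, assuming HMMER 2.3.2's hmmpfam binary is on the PATH; the file names are illustrative placeholders, not the actual databases used:

```python
import subprocess

# Hypothetical input files: a profile-HMM library and a batch of
# protein sequences (names are illustrative, not from the study).
HMM_DB = "Pfam.hmm"
SEQ_FILE = "sequences.fasta"

# HMMER 2.3.2: hmmpfam searches each sequence against every model
# in the HMM database; -E caps reported hits by E-value.
result = subprocess.run(
    ["hmmpfam", "-E", "0.01", "--cpu", "1", HMM_DB, SEQ_FILE],
    capture_output=True, text=True, check=True,
)

# hmmpfam writes its report to stdout; persist it for post-processing.
with open("sequences.hmmpfam.out", "w") as out:
    out.write(result.stdout)
```

In practice the study ran many such jobs in parallel across cluster nodes, with the sequence set partitioned into batches (see the batching sketch further below).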
WP13.2 Participants Three partners: • EMBL-EBI: Sarah Hunter, Antony Quinn • BSC: Modesto Orozco, Josep Luís Gelpí, David Torrents, Romina Royo, Enric Tejedor • CSC: Tommi Nyronen, Marko Myllynen, Pekka Savola, Jarno Tuimala, Olli Tourunen, Samuli Saarinen
Hardware and software used Hardware • BSC “MareNostrum” • CSC gateway to Finnish M-Grid • CSC “Murska” cluster • EBI “Anthill” cluster Software • HMMER 2.3.2 • Grid Superscalar, SLURM, LSF, etc.
Data used HMM databases • SMART / Pfam – describe functional domains in proteins • PIRSF – describes conserved protein families • TIGRFAMs / HAMAP – describe conserved bacterial protein families • SUPERFAMILY / Gene3D – describe conserved structural domains Sequence databases • UniParc – archive of all known public protein sequences • UniProtKB – annotated public protein information resource • UniProtKB decoy – database of randomised protein sequences
Computational tasks performed • Benchmarking • Done to ensure results were consistent • Sequence files of different sizes (10, 5,000, and 10,000 sequences) were searched against all HMM databases • CSC produced an optimised version of the code (in collaboration with Åbo Akademi University) • BSC discovered that altering the size of the jobs submitted (using GRID superscalar) had an impact on the wall-clock time of the total task; a batching sketch follows below • Results have been presented at previous meetings • Comparable performance across all resources • Consistent outputs from searches
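A minimal sketch of the batching idea behind BSC's observation that job size affects total wall-clock time: split the sequence set into fixed-size batches, each submitted as a separate job. File names and the batch size are illustrative:

```python
def split_fasta(path, batch_size):
    """Yield lists of FASTA records, batch_size sequences per batch."""
    batch, record = [], []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">") and record:
                # A new header closes the previous record.
                batch.append("".join(record))
                record = []
                if len(batch) == batch_size:
                    yield batch
                    batch = []
            record.append(line)
    # Flush the final record and any partial batch.
    if record:
        batch.append("".join(record))
    if batch:
        yield batch

# Write each batch to its own file, ready to submit as one cluster job.
for i, batch in enumerate(split_fasta("sequences.fasta", 5000)):
    with open(f"batch_{i:04d}.fasta", "w") as out:
        out.writelines(batch)
```

Smaller batches give the scheduler more parallelism but add per-job overhead; larger batches reduce overhead but risk exceeding wall-clock limits, which is the trade-off the benchmarking explored.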
Computational tasks performed • Searching new metagenomic sequences against HMM databases: a large set of metagenomic sequences (471,239) was searched against all HMMs • Building HMMs from sequence alignments: hmmalign and hmmcalibrate were used to build HMMs from sequences in the HAMAP protein families database • Running the UniProtKB decoy database against HMM databases: a randomised protein sequence database was searched using hmmpfam to estimate the rate of false-positive matches • Calculation of MD5 checksums for all UniParc protein sequences: to test the performance of running very short jobs on the infrastructures (a sketch follows this list) • Running against HMM databases on grid resources: the 10,000 sequences used in the other parts of the study were searched against the Pfam 23.0 database using different grid configurations
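A minimal sketch of the MD5-checksum task, assuming sequences arrive in a FASTA file (the file name is a placeholder) and that a UniParc-style checksum is the uppercase hex MD5 digest of the raw amino-acid string:

```python
import hashlib

def sequence_md5(sequence):
    """Return the uppercase hex MD5 digest of a protein sequence."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest().upper()

def fasta_records(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line.upper())
    if header is not None:
        yield header, "".join(seq)

# Each checksum is a trivially short unit of work, which is what made
# this task a useful stress test of per-job scheduling overhead.
for header, seq in fasta_records("uniparc.fasta"):
    print(sequence_md5(seq), header)
```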
Computational tasks performed • Networking surveys between EBI/CSC and EBI/BSC using iperf (a sketch of such a measurement appears below): • CSC <-> EBI: ~300–500 Mbit/s • EBI <-> BSC: ~900 Mbit/s and ~650 Mbit/s
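A minimal sketch of a throughput measurement of the kind used in the surveys, assuming the classic iperf (version 2) client is installed and an iperf server ("iperf -s") is already running at the remote site; the host name is a hypothetical placeholder:

```python
import subprocess

REMOTE_HOST = "iperf.example.org"  # hypothetical remote endpoint

# -c: client mode; -t 30: measure for 30 seconds; -P 4: four parallel
# TCP streams, which often fills a high-latency international path
# better than a single stream.
result = subprocess.run(
    ["iperf", "-c", REMOTE_HOST, "-t", "30", "-P", "4"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # iperf prints per-stream and summary bandwidth
```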
Metrics of success • Performance (compute and network) • Ease of set-up • Ease of use • Cost-effectiveness
Metrics of success – Performance • Benchmarking shows: • Results are identical across sites • Raw compute performance is equal to or slightly better than EBI's • The initial grid feasibility study produced promising results • The network is not currently a bottleneck for these kinds of computational tasks (and could be improved further with tuning)
Metrics of success – Ease of set-up • Difficulties in applying for compute time • Hard to predict sequence/model volumes
Hard to predict sequence volumes
[Chart: sequence archive database size at the time of proposal writing]
[Chart: projected database size at the time of proposal writing]
[Chart: actual eventual database size]
Metrics of success – Ease of set-up • Longer compute periods are preferable to short ones • Bigger buffer for unpredictability • Lower overhead of application and reporting
Metrics of success – Ease of use • It took a while to get used to the different systems • Compute load was hard to predict (even after piloting smaller-scale searches): • This caused disk contention issues • Wall-clock limits were exceeded (jobs failed) • Ironing out performance issues required expert collaboration between the supercomputing centres and the EBI
Metrics of success – Cost-effectiveness • The overall cost of supplying such compute services to users like the EBI is the hardest to measure accurately • Hardware, software, electricity, and network costs are relatively easy to estimate • Expert personnel and administrator costs are harder to estimate, but critical for success • This would not be a static provider set-up but an evolving collaboration between two service providers (which would need to be reflected in any service-level agreements, for example)
Conclusions • Big-data challenges were not covered by this TFS; these have different technical requirements • Expert help is absolutely necessary to set up and optimise compute tasks on any system, but once this is well established the amount of support required is likely to reduce (until the next set of major changes). It is important to define clearly how any collaboration between the EBI and a compute provider will be organised. • Economic costings for compute services are very difficult to measure, and depend greatly not only on the physical infrastructure put in place but also on the expertise required to maintain and support it • On-demand compute is required (i.e. allocation of a larger number of compute hours over a longer period of time)
Conclusions • Further work would be required to incorporate supercomputing centres' resources seamlessly into EBI production systems if such facilities were to become a central component of the EBI production infrastructure • Grid computing should be investigated further as a possible resource; a workshop bringing together experts in the field and end-users is recommended • New hardware paradigms (e.g. Cell-based processors and multi-core GPU and CPU systems), and the applications intended to run on them, should be investigated thoroughly to ensure appropriate implementations • Cloud computing (e.g. Amazon EC2) is not appropriate for production, as the EBI cannot be reliant on a profit-making commercial entity for its core operations
Conclusions • Is it feasible to run bioinformatics tasks at supercomputing facilities? • Yes, with caveats • The TFS gave no indication of technical deal-breakers • A fairly flexible attitude to compute allocation is required to allow the responsiveness that the EBI's demands require • The arrangement should be viewed as an evolving collaboration • Costings are very hard to produce, and therefore economic feasibility is difficult to measure