10 likes | 116 Views
Summary. Introduction. Performance. Funding. Score Combining Step. Output Generation Step. Application of Hadoop to Proteomic Searches. Steven Lewis 1 , John Boyle 1 , Attila Csordas 2 , Sarah Killcoyne 1.
E N D
Summary Introduction Performance Funding Score Combining Step Output Generation Step Application of Hadoop to Proteomic Searches Steven Lewis 1, John Boyle 1, Attila Csordas 2, Sarah Killcoyne 1 1Institute for Systems Biology, Seattle, Washington, USA. 2PRIDE Group Proteomic Services Team, PANDA Group EMBL European Bioinformatics Institute. SQL Database Population – Step 0 Fasta files are converted into tables in a SQL database. Peptides possibly with modifications are stored with the MZ ratio as an integer as a key. Normally databases need be generated infrequently since they can reused for any taxonomy. • Hadoop writes the output of each reducer into a separate file. In the last step all output is combined into a single bioml file (a future option is to write a pep.xml file directly) . This step uses a single reducer and writes the same bioml file output by X!Tandem • Input Key – Spectrum Id • Input Value – Scored Spectrum XML • A separate post processing step copies the output file to the user’s directory. Shotgun Proteomics involves large search problems comparing many spectra with possible peptides. As researchers apply modifications and consider alternate cleavages, the search space grows by a few orders of magnitude. Modern searches strain the resources of a single machine. We have an implementation which uses the Hadoop version of Google's Map-Reduce algorithm to search Proteomics databases. Scoring Step -Mapper • The mapper reads one or more mzml or mzXML files as input. The value presented to the mapper is the tag representing a single spectrum. Spectra are parsed and the xml tag representing a spectrum is sent with a series of keys representing MZ ratios for scoring • Input source – a file or directory holding mzml or mzXML files Map Reduce Algorithm Algorithm as an Interface • Map Reduce is an algorithm developed by Google for processing large data sets on massive clusters. The algorithm has been implemented by apache. • Data processing proceeds in two steps: Map and Reduce. In the map step a series of values are read and sent out as 0 or more key value pairs. After all maps complete processing, the keys are sorted and sent to a reduce step. In many cases including this one multiple Map-Reduce jobs are chained. Most of the infrastructure brings peptides and spectra together in the scoring reducer. The scoring algorithm is a small and interchangeable portion of that code. The architecture allows multiple algorithms – say Seaquest, K_Score and XTandem to be run in this step and combined in the output. Scoring Step -Reducer • Every spectrum is scored by multiple processes. The generated data identifies the spectrum and a list of peptides (with potential modifications) and their scores. In the next job all peptides scored against a single spectrum are sent to a single reducer to determine the best score (and a list of the N other highest scores). The MZ value passed in as a key is used to retrieve peptides to score against. • The output is an xml document (Scored Spectrum XML) holding the spectrum and the top scoring peptides • Input Key – Integer MZ value • Input Value – Spectrum tags from mzml • Output Key – Spectrum Id • Output Value – Scored Spectrum XML Running on a 10 node, 4 CPU cluster, the Hadoop job (15,000 proteins, 16,000 spectra) took 20 minutes compared to the same job running on X!Tandem on 8 hours on a 4 CPU single processor machine. We are in the process of testing against multiprocessor and alternate Hadoop streaming implementations of X!Tandem. The advantages of using the Hadoop framework are: the infrastructure is widely used and well tested, mechanisms for dealing with failures and retries are built into the framework, and resources may be expanded by simply expending the size of the cluster. Two other advantages of the specific algorithms are the ability to handle multiple algorithms in a single run and the use of databases and other caching • Mapper Contract • There is no guarantee of the order in which keys will be received or which map process will handle them.. • Reducer Contract • All items tagged with a specific key will be sent to a single reducer in a single step. • All keys sent to a specific reducer will be received in a known sort order. • Hadoop infrastructure Contract • Tasks will be distributed to processors in a ‘fair’ fashion. • Tasks which fail or run slowly will be restarted on another machine • Multiple failures will fail the entire job. • Hardware failures will be handled • Every spectrum is scored by multiple processes. The generated data identifies the spectrum and a list of peptides (with potential modifications) and their scores. In the next job all peptides scored against a single spectrum are sent to a single reducer to determine the best score (and a list of the N other highest scores). The scores in the passed objects may be combined to generate a ranked list of best scores. • Input Key – Spectrum Id • Input Value – Scored Spectrum XML • Output Key – Spectrum Id • Output Value – Scored Spectrum XML This project is supported by Award Number U24CA143835 from the National Cancer Institute.