Explore the challenges and solutions of distributing protein structure analysis, enhancing speed and efficiency. Learn about client/server architecture, performance analysis, data accuracy, and lessons learned in this comprehensive study.
Distributed Protein Structure Analysis By Jeremy S. Brown and Travis E. Brown
The Problem • An exhaustive search of proteins against a known database • Each string is between 400 and 600 characters long • Comparing 10,000 strings against 1,000,000 random strings would take 44 days with a 1 GHz processor (a rough back-of-envelope check follows below)
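A minimal sketch of that back-of-envelope check, using only the numbers quoted on this slide; the per-comparison cost is derived from those figures rather than measured.

```python
# Rough check of the 44-day estimate using the slide's own numbers.
queries = 10_000                           # query sequences
database = 1_000_000                       # known sequences to compare against
comparisons = queries * database           # 1e10 pairwise comparisons

total_seconds = 44 * 24 * 3600             # the quoted 44 days
per_comparison_us = total_seconds / comparisons * 1e6

print(f"{comparisons:.1e} comparisons, ~{per_comparison_us:.0f} microseconds each on a 1 GHz CPU")
```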
Why an exhaustive search? • Initial intent was to analyze proteins to determine 3-dimensional structure • An exhaustive search is required to ensure that the match that is found is the best match
Solution • Distribute the search among many PCs to obtain an answer faster. • This solution raises new problems of its own, however
How to Distribute • Distributing search strings is not enough • Also distribute the search space • Must find an efficient way to distribute 1 GB of data without duplication (one way to partition it is sketched below)
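One way to partition the search space so that each piece is sent to exactly one client; the chunk size and round-robin assignment are illustrative assumptions, not details from the original system.

```python
# Sketch: split the known-sequence list into chunks and assign each chunk
# to exactly one client, so no part of the ~1 GB search space is duplicated.

def make_chunks(known_sequences, chunk_size=5_000):
    """Yield fixed-size slices of the known-sequence list."""
    for start in range(0, len(known_sequences), chunk_size):
        yield known_sequences[start:start + chunk_size]

def assign_chunks(chunks, client_ids):
    """Round-robin assignment: every chunk goes to one and only one client."""
    assignment = {cid: [] for cid in client_ids}
    for i, chunk in enumerate(chunks):
        assignment[client_ids[i % len(client_ids)]].append(chunk)
    return assignment
```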
Program Details • Client/Server architecture • Uses proprietary protocol over TCP/IP to distribute data • Server uses SQL database to store a list of ‘jobs’ and ‘known’ sequences
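A minimal sketch of the server-side storage described above, using SQLite for illustration; the table and column names are assumptions, since the slides only say the server keeps 'jobs' and 'known' sequences in an SQL database.

```python
import sqlite3

# Illustrative schema for the server's 'jobs' and 'known' sequence tables.
conn = sqlite3.connect("protein_search.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS jobs (
    job_id      INTEGER PRIMARY KEY,
    sequence    TEXT NOT NULL,   -- 400-600 character query string
    assigned_to TEXT,            -- client that received this job, if any
    best_score  REAL,            -- filled in when the client reports back
    best_match  INTEGER          -- id of the best-matching known sequence
);

CREATE TABLE IF NOT EXISTS known_sequences (
    seq_id   INTEGER PRIMARY KEY,
    sequence TEXT NOT NULL,
    sent_to  TEXT                -- client that already holds this data
);
""")
conn.commit()
```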
Program Details • Server issues a single ‘job’ to each client upon request • Client may also request a batch of data for comparison. • Server marks which data has been sent to clients and avoids resending that data to new clients
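A sketch of that dispatch logic: one job per request, plus batches of search data that are marked as sent so they are never pushed to a second client. It reuses the illustrative schema above; none of these queries come from the original proprietary protocol.

```python
def issue_job(conn, client_id):
    """Give the requesting client a single unassigned job."""
    row = conn.execute(
        "SELECT job_id, sequence FROM jobs WHERE assigned_to IS NULL LIMIT 1"
    ).fetchone()
    if row is None:
        return None                              # nothing left to assign
    job_id, sequence = row
    conn.execute("UPDATE jobs SET assigned_to = ? WHERE job_id = ?",
                 (client_id, job_id))
    conn.commit()
    return job_id, sequence

def issue_batch(conn, client_id, batch_size=5_000):
    """Give the client a batch of known sequences no client has received yet."""
    rows = conn.execute(
        "SELECT seq_id, sequence FROM known_sequences WHERE sent_to IS NULL LIMIT ?",
        (batch_size,)
    ).fetchall()
    conn.executemany("UPDATE known_sequences SET sent_to = ? WHERE seq_id = ?",
                     [(client_id, seq_id) for seq_id, _ in rows])
    conn.commit()
    return rows
```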
Problems • The server is slow at updating its database. • This cost is only incurred once per client, however.
Performance Analysis • 1 client = 44 days (best algorithm) • 2 clients = 26 days • 3 clients = 19 days • Adding clients decreases the total run time almost linearly, though distribution is expensive (the implied speedup is worked out below)
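The speedup implied by the timings quoted above, computed purely from the slide's own numbers (the initial distribution cost is excluded here, as the next slide notes for the graphs).

```python
# Derived arithmetic only: speedup and parallel efficiency from the quoted timings.
timings_days = {1: 44, 2: 26, 3: 19}
baseline = timings_days[1]

for clients, days in timings_days.items():
    speedup = baseline / days
    efficiency = speedup / clients
    print(f"{clients} client(s): {days} days, "
          f"speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")
```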
Notes about the graphs • Graphs do not include initial distribution, since this is only done once per client • If search data distribution were to be included, efficiency would start at about 70% and increase to ~90% over time
Verifying Data Accuracy • Add an entry into the search table with a known score • Ensure that the result returned by the client is the known entry in the database
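A sketch of that sanity check: plant an entry whose best match and score are known in advance, then confirm the client reports exactly that entry. The function names and signatures here are illustrative assumptions.

```python
def verify_client(run_search, known_db, sentinel_query, expected_match_id, expected_score):
    """run_search(query, db) -> (match_id, score) is the client's comparison routine."""
    match_id, score = run_search(sentinel_query, known_db)
    assert match_id == expected_match_id, "client returned the wrong database entry"
    assert abs(score - expected_score) < 1e-6, "client returned an unexpected score"
    return True
```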
Lessons Learned • Reading and writing single elements to an SQL database can be very expensive. • Even the best designs aren’t perfect, especially when the problem is not fully understood.
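The slides only say that per-element SQL reads and writes were expensive; the sketch below contrasts one-row-at-a-time commits with a single batched statement as one common mitigation. SQLite and the table layout are illustrative assumptions carried over from the earlier sketches.

```python
import sqlite3

def mark_sent_one_by_one(conn, client_id, seq_ids):
    """Costly pattern: one statement and one commit per row."""
    for seq_id in seq_ids:
        conn.execute("UPDATE known_sequences SET sent_to = ? WHERE seq_id = ?",
                     (client_id, seq_id))
        conn.commit()

def mark_sent_batched(conn, client_id, seq_ids):
    """Cheaper pattern: one batched statement, one commit."""
    conn.executemany("UPDATE known_sequences SET sent_to = ? WHERE seq_id = ?",
                     [(client_id, s) for s in seq_ids])
    conn.commit()
```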
Notes • Biggest problem was distribution of data • Distribution was very costly, so we tried to reuse data that we already distributed • Program is pluggable, so any comparison algorithm can be used
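A sketch of what a "pluggable" comparison interface could look like: the exhaustive search accepts any scoring function with the same signature, so the alignment algorithm can be swapped without touching the distribution code. The identity-count scorer is only a placeholder, not the algorithm used in the original work.

```python
from typing import Callable, Sequence, Tuple

Scorer = Callable[[str, str], float]

def naive_identity_score(a: str, b: str) -> float:
    """Placeholder scorer: fraction of positions that match exactly."""
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / n if n else 0.0

def best_match(query: str, database: Sequence[str], score: Scorer) -> Tuple[int, float]:
    """Exhaustively score the query against every known sequence (any scorer plugs in)."""
    best_idx, best = -1, float("-inf")
    for i, seq in enumerate(database):
        s = score(query, seq)
        if s > best:
            best_idx, best = i, s
    return best_idx, best

# Usage: best_match(query_string, known_sequences, naive_identity_score)
```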