Explore the challenges and solutions of distributing protein structure analysis, enhancing speed and efficiency. Learn about client/server architecture, performance analysis, data accuracy, and lessons learned in this comprehensive study.
Distributed Protein Structure Analysis By Jeremy S. Brown and Travis E. Brown
The Problem • An exhaustive search of proteins against a known database • Each string is between 400 and 600 characters long • Comparing 10,000 strings against 1,000,000 random strings would take 44 days with a 1 GHz processor (a rough back-of-envelope check follows below)
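A minimal sketch of that back-of-envelope check, using only the numbers quoted on this slide; the per-comparison cost is derived from those figures rather than measured.

```python
# Rough check of the 44-day estimate using the slide's own numbers.
queries = 10_000                           # query sequences
database = 1_000_000                       # known sequences to compare against
comparisons = queries * database           # 1e10 pairwise comparisons

total_seconds = 44 * 24 * 3600             # the quoted 44 days
per_comparison_us = total_seconds / comparisons * 1e6

print(f"{comparisons:.1e} comparisons, ~{per_comparison_us:.0f} microseconds each on a 1 GHz CPU")
```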
Why an exhaustive search? • Initial intent was to analyze proteins to determine 3-dimensional structure • An exhaustive search is required to ensure that the match that is found is the best match
Solution • Distribute the search among many PCs to obtain an answer faster. • This solution raises new problems of its own, however
How to Distribute • Distributing search strings is not enough • Also distribute the search space • Must find an efficient way to distribute 1 GB of data without duplication (one way to partition it is sketched below)
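One way to partition the search space so that each piece is sent to exactly one client; the chunk size and round-robin assignment are illustrative assumptions, not details from the original system.

```python
# Sketch: split the known-sequence list into chunks and assign each chunk
# to exactly one client, so no part of the ~1 GB search space is duplicated.

def make_chunks(known_sequences, chunk_size=5_000):
    """Yield fixed-size slices of the known-sequence list."""
    for start in range(0, len(known_sequences), chunk_size):
        yield known_sequences[start:start + chunk_size]

def assign_chunks(chunks, client_ids):
    """Round-robin assignment: every chunk goes to one and only one client."""
    assignment = {cid: [] for cid in client_ids}
    for i, chunk in enumerate(chunks):
        assignment[client_ids[i % len(client_ids)]].append(chunk)
    return assignment
```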
Program Details • Client/Server architecture • Uses proprietary protocol over TCP/IP to distribute data • Server uses SQL database to store a list of ‘jobs’ and ‘known’ sequences
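A minimal sketch of the server-side storage described above, using SQLite for illustration; the table and column names are assumptions, since the slides only say the server keeps 'jobs' and 'known' sequences in an SQL database.

```python
import sqlite3

# Illustrative schema for the server's 'jobs' and 'known' sequence tables.
conn = sqlite3.connect("protein_search.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS jobs (
    job_id      INTEGER PRIMARY KEY,
    sequence    TEXT NOT NULL,   -- 400-600 character query string
    assigned_to TEXT,            -- client that received this job, if any
    best_score  REAL,            -- filled in when the client reports back
    best_match  INTEGER          -- id of the best-matching known sequence
);

CREATE TABLE IF NOT EXISTS known_sequences (
    seq_id   INTEGER PRIMARY KEY,
    sequence TEXT NOT NULL,
    sent_to  TEXT                -- client that already holds this data
);
""")
conn.commit()
```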
Program Details • Server issues a single ‘job’ to each client upon request • Client may also request a batch of data for comparison. • Server marks which data has been sent to clients and avoids resending that data to new clients
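A sketch of that dispatch logic: one job per request, plus batches of search data that are marked as sent so they are never pushed to a second client. It reuses the illustrative schema above; none of these queries come from the original proprietary protocol.

```python
def issue_job(conn, client_id):
    """Give the requesting client a single unassigned job."""
    row = conn.execute(
        "SELECT job_id, sequence FROM jobs WHERE assigned_to IS NULL LIMIT 1"
    ).fetchone()
    if row is None:
        return None                              # nothing left to assign
    job_id, sequence = row
    conn.execute("UPDATE jobs SET assigned_to = ? WHERE job_id = ?",
                 (client_id, job_id))
    conn.commit()
    return job_id, sequence

def issue_batch(conn, client_id, batch_size=5_000):
    """Give the client a batch of known sequences no client has received yet."""
    rows = conn.execute(
        "SELECT seq_id, sequence FROM known_sequences WHERE sent_to IS NULL LIMIT ?",
        (batch_size,)
    ).fetchall()
    conn.executemany("UPDATE known_sequences SET sent_to = ? WHERE seq_id = ?",
                     [(client_id, seq_id) for seq_id, _ in rows])
    conn.commit()
    return rows
```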
Problems • The server is slow at updating its database. • This cost is only incurred once per client, however.
Performance Analysis • 1 client = 44 days (best algorithm) • 2 clients = 26 days • 3 clients = 19 days • Adding clients decreases the total run time almost linearly, though distribution is expensive (the implied speedup is worked out below)
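The speedup implied by the timings quoted above, computed purely from the slide's own numbers (the initial distribution cost is excluded here, as the next slide notes for the graphs).

```python
# Derived arithmetic only: speedup and parallel efficiency from the quoted timings.
timings_days = {1: 44, 2: 26, 3: 19}
baseline = timings_days[1]

for clients, days in timings_days.items():
    speedup = baseline / days
    efficiency = speedup / clients
    print(f"{clients} client(s): {days} days, "
          f"speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")
```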
Notes about the graphs • Graphs do not include initial distribution, since this is only done once per client • If search data distribution were to be included, efficiency would start at about 70% and increase to ~90% over time
Verifying Data Accuracy • Add an entry into the search table with a known score • Ensure that the result returned by the client is the known entry in the database
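A sketch of that sanity check: plant an entry whose best match and score are known in advance, then confirm the client reports exactly that entry. The function names and signatures here are illustrative assumptions.

```python
def verify_client(run_search, known_db, sentinel_query, expected_match_id, expected_score):
    """run_search(query, db) -> (match_id, score) is the client's comparison routine."""
    match_id, score = run_search(sentinel_query, known_db)
    assert match_id == expected_match_id, "client returned the wrong database entry"
    assert abs(score - expected_score) < 1e-6, "client returned an unexpected score"
    return True
```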
Lessons Learned • Reading and writing single elements to an SQL database can be very expensive. • Even the best designs aren’t perfect, especially when the problem is not fully understood.
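The slides only say that per-element SQL reads and writes were expensive; the sketch below contrasts one-row-at-a-time commits with a single batched statement as one common mitigation. SQLite and the table layout are illustrative assumptions carried over from the earlier sketches.

```python
import sqlite3

def mark_sent_one_by_one(conn, client_id, seq_ids):
    """Costly pattern: one statement and one commit per row."""
    for seq_id in seq_ids:
        conn.execute("UPDATE known_sequences SET sent_to = ? WHERE seq_id = ?",
                     (client_id, seq_id))
        conn.commit()

def mark_sent_batched(conn, client_id, seq_ids):
    """Cheaper pattern: one batched statement, one commit."""
    conn.executemany("UPDATE known_sequences SET sent_to = ? WHERE seq_id = ?",
                     [(client_id, s) for s in seq_ids])
    conn.commit()
```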
Notes • Biggest problem was distribution of data • Distribution was very costly, so we tried to reuse data that we already distributed • Program is pluggable, so any comparison algorithm can be used
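A sketch of what a "pluggable" comparison interface could look like: the exhaustive search accepts any scoring function with the same signature, so the alignment algorithm can be swapped without touching the distribution code. The identity-count scorer is only a placeholder, not the algorithm used in the original work.

```python
from typing import Callable, Sequence, Tuple

Scorer = Callable[[str, str], float]

def naive_identity_score(a: str, b: str) -> float:
    """Placeholder scorer: fraction of positions that match exactly."""
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / n if n else 0.0

def best_match(query: str, database: Sequence[str], score: Scorer) -> Tuple[int, float]:
    """Exhaustively score the query against every known sequence (any scorer plugs in)."""
    best_idx, best = -1, float("-inf")
    for i, seq in enumerate(database):
        s = score(query, seq)
        if s > best:
            best_idx, best = i, s
    return best_idx, best

# Usage: best_match(query_string, known_sequences, naive_identity_score)
```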