Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari. Nargess Memarsadeghi, CMSC 838 Presentation
Talk Overview • Overview of talk • Motivation • Background • Techniques • Evaluation • Related work • Observations CMSC 838T – Presentation
Motivation: EST Clustering • Problem: EST Clustering • Cluster fragments of cDNA • Related to ‘fragment assembly’ problem • Detecting overlapping fragments • Overlaps can be computed: • Pairwise alignment algorithm • Dynamic programming • Alternative: • Approximate overlap detection algorithms • Dynamic programming
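To make the "pairwise alignment by dynamic programming" bullet concrete, here is a minimal sketch of a local alignment score in the Smith-Waterman style. The function name and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for illustration; they are not taken from the paper or from any of the tools named in this talk.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gap cost).

    Scoring values are illustrative assumptions, not the paper's parameters.
    Uses two rows of the DP table, so memory is O(len(b)); time is O(len(a) * len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: a local alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("GGACGTGG", "TTACGTTT"))
```

The quadratic time per pair is what makes aligning every pair of ESTs expensive, which motivates restricting alignment to a subset of pairs.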
Motivation • Common tools: • Take too long • Days for 100,000 ESTs • Run out of memory • This paper: • PaCE: • Parallel Clustering of ESTs • Efficient parallel EST clustering • Space-efficient algorithm • Reduce total work • Reduce run-time
Background: EST Clustering Tools • Three traditional software tools: • Originally designed for fragment assembly: • TIGR Assembler • Phrap • CAP3 • One parallel software tool: • UICLUSTER: assumes ESTs from 3’ end
EST Clustering Tools • Basic approach • Find pairs of similar sequences • Align similar pairs • Dynamic programming • Quality of EST clustering • Phrap: Fastest • Avoids dynamic programming • Relies on approximation, lower quality • CAP3: Least # of erroneous clusters
EST Clustering Tools’ Performance • With 50,000 maize ESTs • Using a PC with dual Pentium 450 MHz, 512 MB RAM: • TIGR: ran out of memory • Phrap: 40 min • CAP3: > 24 hours • With 100,000 maize ESTs • All ran out of memory • CAP3 would require 4 days
Goal • Space efficient algorithm • Space requirement linear in the size of the input data set • Reduce total work • Without sacrificing quality of clustering • Reduce run-time and facilitate the clustering of large data sets • Through parallel processing • Scale memory with # of processors
Approach • Expense: • Pairwise alignment (time + memory) • Promising pairs ≈ • Common string: |s| = w • Cost: if the common string has length |s| = l > w, the same pair is generated l - w + 1 times
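The cost noted on this slide can be checked with a small sketch. It indexes every length-w substring (w-mer) of each sequence in a hash table and counts how often each pair of sequences is hit. The function name and the hash-table lookup are assumptions for illustration only; the talk's own approach uses a suffix tree instead.

```python
from collections import defaultdict
from itertools import combinations


def shared_wmer_hits(seqs, w):
    """Count, per pair of sequences, how many w-mer occurrences they share.

    Hypothetical helper for illustration: a plain dictionary index over all
    w-mers, not the generalized suffix tree used by PaCE.
    """
    index = defaultdict(list)
    for sid, s in enumerate(seqs):
        for pos in range(len(s) - w + 1):
            index[s[pos:pos + w]].append((sid, pos))

    hits = defaultdict(int)
    for occurrences in index.values():
        for (i, _), (j, _) in combinations(occurrences, 2):
            if i != j:  # ignore repeats inside one sequence
                hits[(min(i, j), max(i, j))] += 1
    return dict(hits)


if __name__ == "__main__":
    # Both sequences contain "ACGTGC" (l = 6); with w = 4 the pair is hit
    # once for each of the l - w + 1 windows inside the shared string.
    print(shared_wmer_hits(["TTACGTGCTT", "GGACGTGCAA"], w=4))
```

Because a single long overlap reports the same pair many times, a scheme that stores or tests every hit does redundant work.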
Approach (Cont ..) • Approach: • Use trie structure • Identify promising pairs • Merge clusters with strong overlaps • Avoid storing/testing all similar pairs • Parallel EST Clustering Software: • Generalized Suffix Tree (GST) • Multiple processors: • One maintains and updates the EST clusters • Others generate batches of promising pairs and perform alignment
Approach (Cont …)
Tries • Index for each char • N leaves • Height N
Suffix Tries (Cont ..) • TRIM suffix trie
Suffix Tries (Cont ..) • Indices • Storage O(n), constant is high though • Common string • Longest common substring
Suffix Tries (Cont ..) • [Figure: suffix tree with edges labeled a, b, and $, and leaves numbered 1 to 5] • Given a pattern P = ab, we traverse the tree according to the pattern.
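The traversal described on this slide can be sketched with a naive, uncompacted suffix trie built from nested dictionaries. The example string "abab$" is an assumption (the string behind the slide's figure is not recoverable), and this quadratic-space trie is for illustration only; it is not the compacted tree the paper builds.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts.

    The key None marks a leaf and stores the 1-based start position of the
    suffix ending there. `text` should end with a unique terminator such as '$'.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1
    return root


def find_occurrences(trie, pattern):
    """Walk down the trie along `pattern`; every leaf below is an occurrence."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)  # leaf label: suffix start position
            else:
                stack.append(child)
    return sorted(found)


if __name__ == "__main__":
    trie = build_suffix_trie("abab$")  # assumed example string
    print(find_occurrences(trie, "ab"))
```

Matching the pattern costs time proportional to its length, and the leaves under the node reached give all positions where it occurs.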
Parallel Generation of GST • GST: Generalized Suffix Tree • Compacted trie • Longest common prefix found in constant time • Used for on-demand pair generation • Sequential: O(nl) • Parallel: O(nl/p)
Parallel Generation of GST (Cont …) • Previous implementations: • CRCW/CREW PRAM model • Work-optimal • Involves alphabetical ordering of characters • Unrealistic assumptions • synchronous operation of processors • infinite network bandwidth • no memory contention • Not practically efficient
Parallel Generation of GST (Cont …) • Paper’s approach: • ESTs equally distributed among processors • Each processor • Partitions suffixes of ESTs into buckets • Distribute buckets to the processors: • All suffixes in a bucket allocated to the same processor • Total # of suffixes allocated to a processor ≈ O(nl/p)
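A rough sketch of the bucketing step follows. Two things here are assumptions, not statements from the slide: suffixes are bucketed by their first w characters, and buckets are handed out greedily (largest bucket first, to the least loaded processor). The function names are hypothetical, and the processors are simulated by integer ids rather than MPI ranks.

```python
import heapq
from collections import defaultdict


def bucket_suffixes(ests, w):
    """Group suffixes (est_id, start) by their first w characters (assumed rule)."""
    buckets = defaultdict(list)
    for eid, est in enumerate(ests):
        for pos in range(len(est) - w + 1):
            buckets[est[pos:pos + w]].append((eid, pos))
    return buckets


def assign_buckets(buckets, p):
    """Assign whole buckets to p processors, balancing suffix counts greedily.

    Greedy largest-first assignment is an illustrative choice, not the
    paper's stated procedure.
    """
    heap = [(0, proc) for proc in range(p)]  # (current load, processor id)
    heapq.heapify(heap)
    owner = {}
    for key in sorted(buckets, key=lambda k: (-len(buckets[k]), k)):
        load, proc = heapq.heappop(heap)
        owner[key] = proc
        heapq.heappush(heap, (load + len(buckets[key]), proc))
    return owner


if __name__ == "__main__":
    buckets = bucket_suffixes(["ACGTAC", "CGTACG"], w=2)
    owner = assign_buckets(buckets, p=2)
    loads = [0, 0]
    for key, proc in owner.items():
        loads[proc] += len(buckets[key])
    print(owner, loads)
```

Keeping each bucket on a single processor means that processor can build the subtree for that bucket without consulting the others.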
Parallel Generation of GST (Cont …) • Each bucket’s processor: • Compute compacted trie of all its suffixes • Cannot use sequential construction • Suffixes of a string • not in the same bucket • Each bucket: • Subtree in the GST • Nodes: • Depth first search traversal of the trie • Pointer to the right most child
On-demand Pair Generation • A pair should be generated if • Share substring of length ≥ threshold • Maximal • Leaves in a common node • Share a substring of length = depth of node • Parallel algorithm • Each processor works with its trie if • Depth of its root in GST < threshold
On-demand Pair Generation • To process • Sort internal nodes • Decreasing order of depth • Lists of a node • Generated after process • Removed after parent is processed • Limits space O(nl) • Run time ≈ # pairs generated + cost of sorting • Rejected pairs increase run-time by a factor of 2 • Eliminating duplicates reduces run-time
Parallel Clustering • Master-Slave paradigm: • Master processor: • Maintains and updates clusters • Using union-find data structure • Receives messages from slave processors • A batch of next promising pairs generated by slave • Results of the pairwise alignment • Determines which ones to explore • Determines if merging should occur • Slave processors: • Generate pairs on demand • Perform pairwise alignments of pairs dispatched by the master processor
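The master's bookkeeping can be sketched with a standard union-find structure. The class below is a generic textbook version (path halving plus union by size); `select_pairs_to_align` is a hypothetical helper showing one plausible reading of "determines which ones to explore", namely skipping pairs whose ESTs already sit in the same cluster.

```python
class UnionFind:
    """Disjoint-set forest with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the clusters of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def select_pairs_to_align(clusters, batch):
    """Keep only pairs whose ESTs are still in different clusters (assumed rule)."""
    return [(i, j) for i, j in batch if clusters.find(i) != clusters.find(j)]


if __name__ == "__main__":
    clusters = UnionFind(5)
    clusters.union(0, 1)  # alignment of ESTs 0 and 1 was accepted
    clusters.union(1, 2)
    print(select_pairs_to_align(clusters, [(0, 2), (2, 3), (3, 4)]))
```

Filtering before alignment is where the work saving comes from: a pair already known to be clustered together never needs to be aligned.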
Parallel Clustering (Cont…) • Organization of Parallel Clustering Software • Batch of promising pairs generated + results of pairwise alignment • Batchsize or fewer # of pairs + results of pairwise alignment on each pair • [Figure: one master processor (Master P) connected to several slave processors (Slave P)]
Parallel Clustering (Cont..) • To start: • Slave P starts with 3 × batchsize pairs • Sends the 3rd batch to Master P • Starts alignment on 1st batch • Sends results on 1st + a newly generated batch • While waiting to receive results from Master P, aligns 2nd batch • Processor always has the next batch to work on between: • Submitting the results of previous batch • Receiving another set of pairs
Parallel Clustering (Cont..) • Improve and control quality • Parameters: • Match and mismatch scores • Gap penalties • Post processing: • Detection of alternative splicing • Consulting protein databases • Organism specific
Experimental environment • Used C and MPI • Tested • Quality of software: • Arabidopsis thaliana (due to availability of its genome) • Run-time behavior: • 50,000 Maize ESTs with 32-processor IBM SP • # of processors • Data size • (# of Promising pairs) vs data size • Batchsize vs (# processors) • # of Clusters • Master processor’s time
Quality Assessment • To assess quality • A data set and its correct clustering • ESTs from plant Arabidopsis thaliana • Splice program • Align ESTs to the genome • Discard ESTs that • Don’t align • Align in multiple spots
Quality Assessment (Cont …) • False negative: • A pair in correct clustering is not paired in the output • 5% • False positive: • A pair not in correct clustering appears in results • Negligible (< 0.04%) • Due to conservative nature of algorithm
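The pair-based definitions on this slide can be sketched as set operations over co-clustered pairs. The function names and the toy clusterings are hypothetical, and the slide does not say which denominators the reported percentages use, so the sketch returns raw counts only.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """All unordered pairs of items that share a cluster."""
    return {frozenset(pair)
            for cluster in clusters
            for pair in combinations(sorted(cluster), 2)}


def pair_error_counts(correct, predicted):
    """Count false negative and false positive pairs between two clusterings."""
    truth = co_clustered_pairs(correct)
    output = co_clustered_pairs(predicted)
    return {
        "false_negatives": len(truth - output),  # paired in truth, split in output
        "false_positives": len(output - truth),  # paired in output, not in truth
        "true_pairs": len(truth),
        "output_pairs": len(output),
    }


if __name__ == "__main__":
    correct = [{1, 2, 3}, {4, 5}]      # toy data, not from the paper
    predicted = [{1, 2}, {3}, {4, 5}]  # a conservative output that under-merges
    print(pair_error_counts(correct, predicted))
```

A conservative algorithm that merges only on strong evidence tends to split true clusters rather than join unrelated ones, which matches the pattern of a small false negative rate and a negligible false positive rate reported here.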
Quality Assessment • [Figure] Distribution of the number of singleton and non-singleton clusters for the benchmark set of 168,200 Arabidopsis ESTs.
Quality Assessment (Cont..)
Run-time Assessment • Experiment with 50,000 maize ESTs: • 32-processor IBM SP-2 • 16 minutes
Run-time Assessment (Cont …) • [Table] Run-time (in seconds) spent in various components of PaCE for 20,000 ESTs. p, number of processors.
Run-time Assessment (Cont ..) • Run-time as a function of batchsize • Small batchsize • Increase in communication overhead • Large batchsize • Slaves less responsive to the need of generating pairs • Slave does not use latest clustering results • Optimal batchsize • Determined by experiment • Master processor’s time • Fixed batchsize, increase in # of processors • Gradual increase in Master P’s time • With 32 processors, increase < 1% • Using 1 master processor is not a bottleneck
Results • Space linear in size of the input data set • Reduced total work without sacrificing quality • Reduced run-time • Parallel processors • Eliminating pairs • Facilitate clustering • Scale memory with # of processors
Observations • PaCE: Approaches EST clustering problem directly • Better than • CAP3 • Phrap • TIGR Assembler • Compare time/quality • TIGICL (TIGR Indices Clustering Tool) • Support for PVM • MegaBlast • STACK • Large data sets • Lots of Processors • Can improve clustering time? • Clustering algorithm
References • http://www.cs.berkeley.edu/~kubitron/courses/cs258-S02/lectures/eval10-logp.pdf • A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber, and U. Vishkin. Parallel construction of a suffix tree with applications. Algorithmica, 3:347–365, 1988.