
Prof Dan Geiger, CBL, Technion; Prof Assaf Schuster, DSL, Technion

Superlink-Online: Disease Gene Hunting Using Parallel Computing Over Hierarchy of Grids. Prof Dan Geiger, CBL, Technion; Prof Assaf Schuster, DSL, Technion; Genetics Research Institutes in Israel, EU, US; Prof Miron Livny, Condor, Madison.


Presentation Transcript


  1. Superlink-Online: Disease Gene Hunting Using Parallel Computing Over Hierarchy of Grids
     Prof Dan Geiger, CBL, Technion
     Prof Assaf Schuster, DSL, Technion
     Genetics Research Institutes in Israel, EU, US
     Prof Miron Livny, Condor, Madison
     MSR HPC - Superlink Online

  2. Purpose of disease gene hunting
     • Why search?
       • Detection of diseases before birth
       • Risk assessment and corresponding lifestyle changes
       • Finding the mutant proteins and developing medicine
       • Understanding basic biological functions
     • How to search?
       • Find families segregating the disease (linkage analysis) or collect unrelated healthy and affected persons (association analysis or LD mapping)
       • Take a simple blood test from some individuals
       • Analyze the DNA in the lab
       • Compute the most likely location of the disease gene

  3. Usage of our system in Israeli Hospitals
     • Rabin Hospital, by Motti Shochat’s group
       • New locus for mental retardation (2003)
       • Infantile bilateral striatal necrosis (2004)
     • Soroka Hospital, by Ohad Birk’s group
       • Lethal congenital contractural syndrome (2004)
       • Congenital cataract (2005)
     • Rambam Hospital, by Eli Shprecher’s group
       • Congenital recessive ichthyosis (2005)
       • CEDNIK syndrome (2005)
     • Galil Ma’aravi Hospital, by Tzipi Falik’s group
       • Familial Onychodysplasia and dysplasia
       • Familial juvenile hypertrophy (2005)

  4. Steps in Gene Hunting
     • Linkage analysis (10^6 to 10^7 bp)
     • Identify genes (10^4 to 10^5 bp)
     • Resequencing (100 bp)

  5. Recombination During Meiosis (figure labels: male or female; recombinant gametes)

  6. Family Pedigree

  7. Familial Onychodysplasia and dysplasia of distal phalanges (ODP) (pedigree figure; labeled individuals: III-15, IV-10, IV-7)

  8. Marker Information Added (figure: chromosome pair with markers M1 and M2)

  9. Maximum Likelihood Evaluation
     (Figure: loci labeled M1, M2, D1, D2, M3, M4 and the recombination fraction θ; individuals III-15 and III-16; genotype values shown: 151,159; 151,155; 202,209; 202,202; 139,141; 139,146; 1,2; 3,3.)
     • The computational problem: find a value of θ maximizing Pr(data|θ)
     • LOD score (to quantify how confident we are): Z(θ) = log10[Pr(data|θ) / Pr(data|θ=½)]
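
To make the LOD definition concrete, here is a minimal sketch for the simplest textbook case. It assumes phase-known meioses of which some are recombinant, so that Pr(data|θ) = θ^k (1-θ)^(n-k). This toy likelihood, the function names, and the grid search are illustrative assumptions; the likelihood described in these slides is computed over a whole pedigree and is far more expensive.

```python
import math

def likelihood(theta, recombinants, meioses):
    """Toy two-point likelihood Pr(data | theta) for phase-known meioses."""
    return theta ** recombinants * (1.0 - theta) ** (meioses - recombinants)

def lod_score(theta, recombinants, meioses):
    """Z(theta) = log10[ Pr(data | theta) / Pr(data | theta = 1/2) ]."""
    return math.log10(likelihood(theta, recombinants, meioses)
                      / likelihood(0.5, recombinants, meioses))

def best_theta(recombinants, meioses, steps=500):
    """Grid search over theta in (0, 1/2] for the value with the highest LOD."""
    grid = [0.5 * i / steps for i in range(1, steps + 1)]
    return max(grid, key=lambda t: lod_score(t, recombinants, meioses))
```

For example, with 1 recombinant in 10 meioses, `best_theta(1, 10)` lands on θ = 0.1, and `lod_score(0.5, 1, 10)` is 0 by construction, since θ = ½ is the no-linkage baseline.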

  10. Results of Multipoint Analysis

  11. The Bayesian network model
      (Figure: network spanning Locus 1, Locus 2 (Disease), Locus 3 and Locus 4; node labels include Si3f, Si3m, Li1f, Li1m, Li2f, Li2m, Li3f, Li3m, Xi1, Xi2, Xi3, Y1, y2, Y3.)
      This model depicts the qualitative relations between the variables. We also need to specify the joint distribution over these variables.

  12. The Computational Task
      • Computing Pr(data|θ) for a specific value of θ
      • Exponential time and space in:
        • #variables: five per person, #markers, #gene loci
        • #values per variable: #alleles, non-typed persons
        • table dimensionality: cycles in pedigree
      • Finding the best order is equivalent to finding the best order for sum-product operations for high dimensional matrices
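
The dependence of cost on the order of sum-product operations can be illustrated with a small sketch. It assumes each probability table is described only by its scope (the set of variables it mentions) and counts table entries rather than computing probabilities. The function name and the star-shaped example are hypothetical and are not taken from Superlink.

```python
from math import prod

def elimination_cost(factors, domain, order):
    """Return (total table entries, largest table) for an elimination order.

    factors: iterable of variable sets (the scope of each table)
    domain:  dict mapping variable -> number of values
    order:   sequence of variables to sum out
    """
    scopes = [frozenset(f) for f in factors]
    total, largest = 0, 0
    for var in order:
        touched = [s for s in scopes if var in s]
        if not touched:
            continue
        merged = frozenset().union(*touched)      # scope of the product table
        size = prod(domain[v] for v in merged)    # number of entries in it
        total += size
        largest = max(largest, size)
        scopes = [s for s in scopes if var not in s]
        scopes.append(merged - {var})             # table left after summing var out
    return total, largest

# Star-shaped example: center c linked to leaves a, b, d; all variables binary.
factors = [{"c", "a"}, {"c", "b"}, {"c", "d"}]
domain = {"a": 2, "b": 2, "c": 2, "d": 2}
leaves_first = elimination_cost(factors, domain, ["a", "b", "d", "c"])
center_first = elimination_cost(factors, domain, ["c", "a", "b", "d"])
```

Eliminating the leaves first never builds a table over more than two variables, while eliminating the center first builds one over all four. The gap grows exponentially with the number of leaves and with the number of values per variable.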

  13. Automatic Parallelization: Variable Conditioning
      • Parallelization overhead: non-trivial
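
A minimal sketch of the idea behind variable conditioning: fixing one variable to each of its values splits a large sum into independent sub-sums that can run as separate jobs and be added at the end. The two factor tables and all function names are hypothetical, and the subtasks are run serially here for simplicity.

```python
from itertools import product

# Hypothetical factor tables over binary variables: f(x, y) and g(y, z).
F = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6}
G = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}

def joint(x, y, z):
    """Unnormalized joint value for one full assignment."""
    return F[(x, y)] * G[(y, z)]

def full_sum():
    """Serial reference: sum the joint over every assignment."""
    return sum(joint(x, y, z) for x, y, z in product((0, 1), repeat=3))

def conditioned_subtask(y):
    """One independent job: y is fixed, the other variables are summed out."""
    return sum(joint(x, y, z) for x, z in product((0, 1), repeat=2))

def conditioned_sum():
    """Combine the subtasks; each call could run on a different machine."""
    return sum(conditioned_subtask(y) for y in (0, 1))
```

Both routes give the same total. The price of the split is that work shared between subtasks is repeated in each of them, which is one source of parallelization overhead.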

  14. Using grids/resource pools
      • Basic unit of execution: batch job
        • Non-interactive mode: “enqueue – wait – execute – return”
        • Self-contained execution sandbox
      • Weak (lack of) Quality of Service
        • Random failures of execution machines
        • Hardware bugs may lead to incorrect results
        • Potentially unbounded execution/queue waiting time
        • Dynamic/abrupt changes of resource availability
        • High network delays (communication over WAN)

  15. Requirements
      • The system must be geneticist-friendly
        • On-line access through a simple web interface
        • Interactive experience
        • Low response time for short tasks
        • Fast computation of previously infeasible long tasks via parallel execution
        • Prompt user feedback
      • Secure, reliable, stable, overload-resistant
      • Allow submission of tasks by multiple users

  16. Execution platforms
      • Dedicated server: 2-4 CPUs
      • Clusters of Workstations: 1000s of CPUs
      • Computational Grids (EGEE-II, OSG): 100000s of CPUs
      • Community Grids (SETI@HOME, FOLDING@HOME): 1000000s of CPUs
      • Tradeoff:
        • Larger grids provide more resources, but
        • Execution overhead grows with the grid size:
          • more complex resource management
          • higher resource volatility
          • higher network overheads
          • higher security overheads
          • rigid user policies
      • Complicated scheduling software stack (EGEE-II example): Submit ► Local UI ► Resource Broker ► Local cluster ► Execute
      • Scheduling delay on EGEE-II: between minutes and days

  17. Task length distribution
      • Task length unknown upon submission: from seconds to millennia
      • Estimating task length? NP-hard
      (Figure bins: <3 minutes, <2 hours, <2 days, <2 weeks, <3 months, >3 months)

  18. Try 1: Single FIFO queue
      • Problem: delays of short tasks
      (Figure: tasks such as "2x2+1=?" waiting in one queue)

  19. Try 2: Priority queues
      • Problem: task runtime not known
      (Figure: tasks such as "2x2+1=?" spread over priority queues)

  20. Try 3: Multi-Level Feedback Queue
      • Problem: short tasks may be scheduled on high-latency resources
      (Figure: tasks such as "2x2+1=?" moving between queue levels)
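
A multi-level feedback queue can be sketched as follows. The sketch assumes that every task starts at the top level, that a task which exceeds a level's time quantum is demoted and restarted at the next level, and that a task exceeding the last quantum is rejected. The quanta in the example reuse the Tq values listed on slide 33; mapping them to levels this way, and the function name, are assumptions.

```python
from collections import deque

def run_mlfq(tasks, quanta):
    """Simulate a multi-level feedback queue with restart on demotion.

    tasks:  list of (name, required_seconds); the scheduler does not use
            required_seconds to pick a queue, only to detect a timeout.
    quanta: per-level time limits in seconds, fastest level first.
    Returns {name: level index where the task finished, or None if rejected}.
    """
    queues = [deque() for _ in quanta]
    queues[0].extend(tasks)                 # every task starts at the top level
    outcome = {}
    for level, limit in enumerate(quanta):
        while queues[level]:
            name, need = queues[level].popleft()
            if need <= limit:
                outcome[name] = level       # finished within this level's quantum
            elif level + 1 < len(queues):
                queues[level + 1].append((name, need))  # demote to a slower level
            else:
                outcome[name] = None        # exceeded the last quantum: reject
    return outcome

result = run_mlfq(
    [("short", 20), ("medium", 3600), ("long", 100000), ("huge", 10**7)],
    [180, 9600, 172800],
)
```

No runtime estimate is needed up front: short tasks finish at the top level, and long tasks reveal themselves by timing out. The cost is the time a long task wastes at each level before it is demoted.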

  21. The Grid Execution Hierarchy [HPDC’06]
      (Figure: a submission server feeding a hierarchy of Dedicated server, Clusters of Workstations, Computational Grids (EGEE-II), Community Grids (Superlink@HOME))
      • Waterfall principle: if a task seems too hard
        • can take more overhead
        • move to a lower level
      • Need:
        • Task-length estimation
        • Divisible tasks (variable conditioning)
        • Migration to larger pools

  22. Task Length Estimation
      • Task complexity is easy to derive from the variable elimination order
        • Same goes for space requirements
        • Except for parallelization overhead
      • Finding the best elimination order is hard
        • The longer you work on it, the better it gets
        • Can be trivially parallelized
      • Upon entering a new pool
        • Refine bounds by improving the order
        • Until reordering time reaches a certain fraction of the expected runtime, or of the pool time limit (heuristics)
        • If (expected runtime + current load > pool time limit), move to a larger pool
        • Divide tasks. Work.
        • If the pool time limit expired (runtime is longer than expected), move to a larger pool
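
The admission rule on this slide (expected runtime + current load > pool time limit means move to a larger pool) can be sketched as a walk down the hierarchy. The `Pool` class, its field names, and the numbers in the example are hypothetical; only the comparison itself comes from the slide.

```python
from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    time_limit: float    # seconds a task may occupy this pool (assumed unit)
    current_load: float  # seconds of queued work ahead of a new task

def place_task(expected_runtime, pools):
    """Return the first pool that can take the task, or None to reject.

    pools must be ordered from the smallest pool to the largest.
    """
    for pool in pools:
        if expected_runtime + pool.current_load <= pool.time_limit:
            return pool.name
        # Otherwise the task seems too hard here: try the next, larger pool.
    return None

pools = [
    Pool("Dedicated server", time_limit=180, current_load=0),
    Pool("Clusters of Workstations", time_limit=9600, current_load=600),
    Pool("Computational Grids", time_limit=172800, current_load=0),
]
```

With these made-up numbers, a 9500-second estimate skips the cluster pool because of its 600 seconds of queued load, even though the estimate alone would fit under the limit. The sketch covers only the up-front decision, not the second check the slide describes, where a task that outlives the pool time limit is also moved on.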

  23. Single Pool
      (Architecture diagram. Components shown: Web server; Input validation; Submission logic; Reporting logic; DB; FS; Task Queue Handler; Notification Handler; Accept logic; System Health Monitor; Queue 1; Task execution logic (per task); Migration handler; Batch system queue; Results. Network via SSH.)

  24. Four Pools
      (Architecture diagram. Front-end components shown: Web server; Input validation; Submission logic; Reporting logic; DB; FS; System Health Monitor. Pool 1 to Pool 4 each appear with FS, Notification Handler, System Health Monitor, Queue i and Batch system queue. Arrows labeled Input, Output and Task migration.)

  25. Task execution logic (simplified)
      (Flow diagram labels: First complexity estimation; Reject; Simple task; Done; Second complexity estimation; Reject; Retry if failed; jobs J1, J2, ..., Jn; Parallelization; Reject; Retry if failed; jobs J1, J2, ..., Jk; Reject; Done)
      • Condor DAGMan manages the execution flow
      • Automatic restart after failure
      • Migration at any step of the DAG:
        • Checkpoint and stop DAGMan
        • Archive task data together with DAGMan state
        • Move to another pool
        • Restart there

  26. Current deployment
      • Web Server
      • Dedicated server
      • Flock of two Technion Condor pools: ~300 CPUs
      • Flock of three UW Condor pools: ~3500 CPUs
      • EGEE-II: ~15,000 CPUs
      • SUPERLINK@HOME
      (Diagram labels: queues Q1, Q2, Q3, Q4, Q5; two Reject exits)

  27. Superlink-online portal

  28. Task Submission

  29. Superlink-online Experience
      • User submits her data for analysis
        • No specification of running time or parallelization
      • Secure web interface
        • Monitoring partial results/running time expectations
        • Cancellation
      • E-mail notifications on important events
        • Completion/Error/System failure
      • Behind the scenes
        • Task runtime estimation and parallelization
        • System monitoring, failure recovery
        • Scheduling

  30. Statistics 2006
      • ~110 CPU-years for ~5600 tasks
      • Several mutated genes found
      • Israeli and international users
        • Soroka H., Be'er Sheva; Galil Ma'aravi H., Nahariya; Rabin H., Petah Tikva; Rambam H., Haifa; Beney Tzion H., Haifa; Sha'arey Tzedek H., Jerusalem; Hadassa H., Jerusalem; Afula H.
        • NIH; universities and research centers in US, France, Germany, UK, Italy, Austria, Spain, Taiwan, Australia, and others
      • Task complexity
        • 250 days on a single CPU -> 7 hours on ~300-700 CPUs
        • Short tasks: a few seconds even during severe overload

  31. Statistics 2007
      • Since January 1st:
        • > 64 different user institutes
        • > 3000 tasks
        • > 60 CPU-years
      • Many rejects!

  32. Performance on real datasets

  33. Accumulated running time in pools (figure labels: (L3), (L2), (L2), (L1); Tq=180, Tq=9600, Tq=172800)

  34. Tasks handled by each level vs. CPU consumption by each pool

  35. “Pool spilling”

  36. Pool Occupancy (~400 Tasks)

  37. Run times of Superlink (bioinfo.cs.technion.ac.il/superlink-online)

  38. Issues with current system
      • Must leave upper pools empty to allow fast response for short tasks
        • Long tasks cannot use empty upper pools
      • Granularity tradeoff for automatic parallelization
        • Short jobs do not lose much in case of kill/failure (up to 30% in EGEE)
        • Scheduling overhead motivates long jobs

  39. Creating Dedicated Grids (in testing)
      • Send thin clients as “place holders” to remote resources
        • Similar to Condor glide-ins
        • Build a self-owned, dedicated “sub-grid”
        • Perform application-specific scheduling on “captured” resources
        • Scheduling overhead from hours down to seconds
        • Dynamic/adaptive job granularity
        • 500 CPU-years in 2 months on EGEE
      • Use BOINC (Berkeley Open Infrastructure for Network Computing)
        • Single user, flat, master-worker infrastructure
        • Scalable

  40. QUESTIONS??? Visit us at: http://bioinfo.cs.technion.ac.il/superlink-online
