The Center for Computational Genomics and Bioinformatics

Christopher Dwan, Mike Karo, Tim Kunau

Outline
• Perspective
• Processing tasks & requirements
• Computational solutions
• Interesting issues

Funding chart

The “bioinformatics” component
• “Pipeline” data processing and storage
  • 100Kb data
  • <5 sec processing time
  • 10,000+ / month
  • The problem: Interface (batch & dependency management)
• Similarity search
  • Search against one or more ~10 GB databases
  • The problem: Data movement & memory
  • (much easier on dedicated resources)
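The batch and dependency management problem named above can be illustrated with a small sketch. This is not the Center's pipeline: the step names are hypothetical, and the only assumption is that each step declares which other steps it must wait for. The standard-library `graphlib` module (Python 3.9+) then groups the steps into batches whose members could be submitted together.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def batch_order(dependencies):
    """Group steps into batches whose members can run in parallel.

    `dependencies` maps each step name to the set of steps it waits for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose inputs are all done
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock successors
    return batches


# Hypothetical steps for one incoming sequence (names are assumptions).
pipeline = {
    "trim": set(),
    "store_raw": set(),
    "similarity_search": {"trim"},
    "load_results": {"similarity_search", "store_raw"},
}

if __name__ == "__main__":
    for number, batch in enumerate(batch_order(pipeline), start=1):
        print(number, batch)
```

With thousands of short jobs per month, a declared dependency map like this lets a batch system decide the order, rather than each user scripting it by hand.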

The “bioinformatics” component
• “Unigene” assembly
  • Traditional long run, big memory compute problem
  • Comes at the end of the other two types
  • The problem: algorithms
• Clustering / Pattern Discovery
  • Conference driven
  • Causes us to redo the other tasks
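As a rough illustration of the clustering step, the sketch below groups sequences by single-linkage clustering with a union-find structure. This technique is swapped in for illustration only; the slides do not say which algorithm was used, and real assemblers also compute overlaps and consensus sequences. The sequence IDs and the list of similar pairs are hypothetical inputs, standing in for hits from an earlier similarity search.

```python
def cluster(sequence_ids, similar_pairs):
    """Single-linkage clustering: sequences joined by any chain of hits."""
    parent = {s: s for s in sequence_ids}  # each sequence starts alone

    def find(s):
        # Follow parent links to the cluster representative,
        # shortening the path along the way (path halving).
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for s in sequence_ids:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    ids = ["est1", "est2", "est3", "est4"]  # hypothetical sequence IDs
    pairs = [("est1", "est2"), ("est2", "est3")]  # hypothetical hits
    print(cluster(ids, pairs))
```

Because the input is the full set of pairwise hits, any change to the search step changes the clusters, which is one reason this task comes last and triggers reruns of the others.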

The “bioinformatics” component
• “Data warehouses”
  • Mirroring and cross checking other public resources
  • Local Oracle implementation of public databases for local users (GenBank / Swiss-Prot / Medicago …)

The “bioinformatics” component
• Microarray data
  • Image data (~1 MB per image) requires processing and storage
  • Unknown normalization, errors, etc. require that we simply keep all the raw data.
  • Web-based display of results
  • Visualization…
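One way to read the "keep all the raw data" point is that normalization should be a derived view that can be recomputed later with a different method. The sketch below assumes two-channel spot intensities and uses median centering of log ratios purely as an example method; both the data layout and the choice of normalization are assumptions, not something the slides specify.

```python
import math
from statistics import median


def log_ratios(raw_spots):
    """Log2 ratio of the two channel intensities for each spot."""
    return [math.log2(red / green) for red, green in raw_spots]


def median_centered(ratios):
    """Shift the ratios so that their median is zero (one possible method)."""
    center = median(ratios)
    return [r - center for r in ratios]


if __name__ == "__main__":
    # Hypothetical (red, green) intensities; the raw list is never modified.
    raw_spots = [(200.0, 100.0), (400.0, 100.0), (800.0, 100.0)]
    print(median_centered(log_ratios(raw_spots)))
```

Both functions return new lists, so the stored raw values stay available if a better normalization is adopted later.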

Computational resources
• ~100 CPU Opportunistic Condor “Flock”
  • Not dedicated
  • Configuration can change without warning
  • No permanent local data storage
  • Machines sit on desks.
  • “Flocking” with Madison, CS dept, other labs
  • Reciprocity can hurt a LOT.
• Server farms
  • Intel / Alpha
  • Hard to find money to buy dedicated machines, esp. on single organism projects.

Software and user issues
• An intuitive interface to parallel and batch systems gives uninformed users a great deal of power.
• Tools from outside: Poor scalability
• Tools from inside: Poor portability

Heuristic algorithms
• Many bioinformatics tools are heuristic rather than complete searches.
• These searches can return different results on different machines (dynamic thresholds, 32 vs. 64 bit math, …)
• How do we tell “different” from “erroneous”?
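A first step toward separating "different" from "erroneous" is to classify how two runs of the same search disagree. The sketch below assumes each run can be reduced to a mapping from hit ID to score; that representation, the hit IDs, and the tolerance value are all hypothetical. It does not decide which run is correct, only which hits deserve a closer look.

```python
import math


def compare_hits(run_a, run_b, rel_tol=1e-6):
    """Classify each hit ID by how the two runs agree on its score."""
    report = {"same": [], "close": [], "differs": [], "missing": []}
    for hit_id in sorted(set(run_a) | set(run_b)):
        if hit_id not in run_a or hit_id not in run_b:
            report["missing"].append(hit_id)  # reported by only one run
        elif run_a[hit_id] == run_b[hit_id]:
            report["same"].append(hit_id)  # bit-for-bit identical score
        elif math.isclose(run_a[hit_id], run_b[hit_id], rel_tol=rel_tol):
            report["close"].append(hit_id)  # plausibly rounding noise
        else:
            report["differs"].append(hit_id)  # too far apart for rounding
    return report


if __name__ == "__main__":
    # Hypothetical scores from the same search on two machines.
    machine_a = {"hit1": 100.0, "hit2": 50.0, "hit3": 10.0, "hit5": 42.0}
    machine_b = {"hit1": 100.0, "hit2": 50.00000001, "hit4": 10.0, "hit5": 40.0}
    print(compare_hits(machine_a, machine_b))
```

Hits in the "close" group are candidates for 32 vs. 64 bit rounding effects, while "missing" hits near a score cutoff may point to dynamic thresholds; anything in "differs" needs manual review.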

Thank you:
• The Condor team at Madison
• Sanger Centre

Collaborations are the key
• Christopher Dwan cdwan@ahc.umn.edu
• Mike Karo mek@ahc.umn.edu
• Tim Kunau kunau@ahc.umn.edu
