The Shocking Details of Genome.ucsc.edu
History of the Code • Started in 1999 in C after Java proved hopelessly unportable across browsers. • Early modules include a Worm genome browser (Intronerator), and GigAssembler which produced working draft of human genome. • In 2001 a few other grad students started working on the code. • In 2002 hired staff to help with Genome Browser • Currently project employs ~20 full time people.
The Genome Browser Staff • 5 programmers: Mark, Angie, Hiram, Kate, Rachel, Fan, Jim • 4 quality assurance engineers - Heather, Bob, Mike, Galt • 3 post-docs - Terry, Gill, Katie • 9 grad students - Chuck, Daryl, Brian, Robert, Yontao, Krish, Adam, Ryan, Andy • 3 system administrators - Paul, Jorge, Patrick • 1 writer - Donna • David Haussler and CBSE Staff • About 1/3 of staff (including me 3 days a week) telecommutes.
The Goal Make the human genome understandable by humans.
Prognosis Maybe we’ll understand it one of these days
Add Your Own Tracks • Users can extend the browser with their own tracks. • User tracks can be private or public. • No programming required. • GFF, GTF, PSL or BED formats supported
#chrom start end [name strand score …]
chr1 1302347 1302357 SP1 + 800
chr1 1504778 1504787 SP2 - 980
The Underlying Database • Power users and bioinformaticians sometimes want underlying database. • There is a table for each track. • Larger tracks have a table for each chromosome. • Format of a track table generally similar to add-your-own track formats. • Pieces of database available from ‘tables’ browser. • Whole database available as tab-separated files. • Most of database served via DAS.
Parasol and Kilo Cluster • UCSC cluster has 1000 CPUs running Linux • 1,000,000 BLASTZ jobs in 25 hours for mouse/human alignment • We wrote Parasol job scheduler to keep up. • Very fast and free. • Jobs are organized into batches. • Error checking at job and at batch level.
Coding: Discipline Is Required • While software development is immune from almost all physical laws, entropy hits us hard. - The Pragmatic Programmer • To keep the system from devolving into disorder we have to follow code conventions and insist on a lot of testing. • We use CVS (Concurrent Versions System) to help all of us work on the same code at once.
Obtaining the Code from CVS • See http://genome.ucsc.edu/admin/cvs.html • This gets you a ‘sandbox’ - a local copy of the source to compile and edit. • Type ‘make’ in the lib and utilities directory. • You can do a ‘cvs update’ to get our updates to the code base. • To add permanently to code base email me to enable ‘cvs commit’
Lagging Edge Software • C language - compilers still available! • CGI Scripts - portable if not pretty. • SQL database - at least MySQL is free.
Problems with C • Missing booleans and strings. • No real objects. • Must free things
Advantages of C • Very fast at runtime. • Very portable. • Language is simple. • No tangled inheritance hierarchy. • Excellent free tools are available. • Libraries and conventions can compensate for language weaknesses.
Coping with Missing Data Types in C • #define boolean int • Fixing lack of real string type much harder • lineFile/common modules and autoSql code generator make parsing files relatively painless • dyString module not a horrible string ‘class’
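To make the string point concrete, here is a minimal sketch of a growable string in C. It is not the dyString module itself: the names growStr, growStrNew, growStrAppend, growStrIsEmpty and growStrFree are hypothetical, and the doubling strategy is an assumption chosen for illustration.

```c
#include <stdlib.h>
#include <string.h>

#define boolean int

struct growStr
/* Hypothetical growable string, a stand-in for the idea behind dyString. */
    {
    char *string;    /* Current contents, always null terminated. */
    size_t size;     /* Length of contents, not counting the terminator. */
    size_t bufSize;  /* Bytes allocated for string. */
    };

struct growStr *growStrNew(size_t initialBufSize)
/* Allocate an empty growable string. */
{
struct growStr *gs = calloc(1, sizeof(*gs));
if (initialBufSize == 0)
    initialBufSize = 64;
if (gs == NULL || (gs->string = calloc(initialBufSize, 1)) == NULL)
    abort();
gs->bufSize = initialBufSize;
return gs;
}

void growStrAppend(struct growStr *gs, const char *text)
/* Append text, doubling the buffer until it fits. */
{
size_t len = strlen(text);
size_t needed = gs->size + len + 1;
if (needed > gs->bufSize)
    {
    size_t newSize = gs->bufSize;
    char *newBuf;
    while (newSize < needed)
        newSize *= 2;
    newBuf = realloc(gs->string, newSize);
    if (newBuf == NULL)
        abort();
    gs->string = newBuf;
    gs->bufSize = newSize;
    }
memcpy(gs->string + gs->size, text, len + 1);  /* Copies the terminator too. */
gs->size += len;
}

boolean growStrIsEmpty(struct growStr *gs)
/* Return nonzero if nothing has been appended yet. */
{
return gs->size == 0;
}

void growStrFree(struct growStr **pGs)
/* Free string and set pointer to NULL. */
{
struct growStr *gs = *pGs;
if (gs != NULL)
    {
    free(gs->string);
    free(gs);
    *pGs = NULL;
    }
}
```

A caller would create one with growStrNew(0), append pieces as they arrive, read the result from the string field, and finish with growStrFree(&gs).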
Object Oriented Programming in C • Build objects around structures. • Make families of functions with names that start with the structure name, and that take the structure as the first argument. • Implement polymorphism/virtual functions with function pointers in structure. • Inheritance is still difficult. Perhaps this is not such a bad thing.
struct dnaSeq
/* A dna sequence in one-letter-per-base format. */
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name. */
    char *dna;            /* a's c's g's and t's.  Null terminated */
    int size;             /* Number of bases. */
    };

struct dnaSeq *dnaSeqFromString(char *string);
/* Convert string containing sequence and possibly
 * white space and numbers to a dnaSeq. */

void dnaSeqFree(struct dnaSeq **pSeq);
/* Free dnaSeq and set pointer to NULL. */

void dnaSeqFreeList(struct dnaSeq **pList);
/* Free list of dnaSeq's. */
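The slide gives only the declarations. One way the three functions could be implemented is sketched below, using only the standard C library. The bodies are assumptions, not the library's own code: in particular, this version keeps every alphabetic character and drops everything else, which may differ from the real filter.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct dnaSeq
/* Same layout as the declaration above. */
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name. */
    char *dna;            /* Bases, null terminated. */
    int size;             /* Number of bases. */
    };

struct dnaSeq *dnaSeqFromString(char *string)
/* Sketch: copy the letters of string into a new dnaSeq,
 * skipping white space, digits and anything else. */
{
struct dnaSeq *seq = calloc(1, sizeof(*seq));
char *dna = malloc(strlen(string) + 1);
char *s;
int size = 0;
if (seq == NULL || dna == NULL)
    abort();
for (s = string; *s != 0; ++s)
    {
    if (isalpha((unsigned char)*s))
        dna[size++] = *s;
    }
dna[size] = 0;
seq->dna = dna;
seq->size = size;
return seq;
}

void dnaSeqFree(struct dnaSeq **pSeq)
/* Free dnaSeq and set pointer to NULL. */
{
struct dnaSeq *seq = *pSeq;
if (seq != NULL)
    {
    free(seq->name);
    free(seq->dna);
    free(seq);
    *pSeq = NULL;
    }
}

void dnaSeqFreeList(struct dnaSeq **pList)
/* Free every element on the list, following the next pointers. */
{
struct dnaSeq *el, *next;
for (el = *pList; el != NULL; el = next)
    {
    next = el->next;  /* Save before el is freed. */
    dnaSeqFree(&el);
    }
*pList = NULL;
}
```

Keeping next as the first field of every structure lets the same list-walking pattern work for any object type.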
struct screenObj
/* A two dimensional object in a sleazy video game. */
    {
    struct screenObj *next;  /* Next in list. */
    char *name;              /* Object name. */
    int x,y,width,height;    /* Bounds of object. */
    void (*draw)(struct screenObj *obj);  /* Draw object */
    boolean (*in)(struct screenObj *obj, int x, int y);
                             /* Return true if x,y is in object */
    void *custom;            /* Custom data for a particular type */
    void (*freeCustom)(struct screenObj *obj);  /* Free custom data. */
    };

#define screenObjDraw(obj) (obj->draw(obj))
/* Draw object. */

void screenObjFree(struct screenObj **pObj);
/* Free up screen object including custom part. */
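As a sketch of how the function pointers give polymorphism, the code below builds two kinds of screen object on a cut-down copy of the struct above (name and draw are left out to keep it short). The boxNew and circleNew constructors, the circleCustom struct and the screenObjIn macro are hypothetical names invented for this illustration.

```c
#include <stdlib.h>

#define boolean int

struct screenObj
/* Cut-down version of the struct above. */
    {
    struct screenObj *next;  /* Next in list. */
    int x,y,width,height;    /* Bounds of object. */
    boolean (*in)(struct screenObj *obj, int x, int y);  /* Hit test. */
    void *custom;            /* Custom data for a particular type. */
    void (*freeCustom)(struct screenObj *obj);  /* Free custom data. */
    };

#define screenObjIn(obj, px, py) ((obj)->in((obj), (px), (py)))
/* Call whichever hit test this object carries. */

struct circleCustom
/* Extra data only circles need. */
    {
    int radius;
    };

static boolean boxIn(struct screenObj *obj, int x, int y)
/* Return true if x,y is inside the bounding box. */
{
return x >= obj->x && x < obj->x + obj->width
    && y >= obj->y && y < obj->y + obj->height;
}

static boolean circleIn(struct screenObj *obj, int x, int y)
/* Return true if x,y is within radius of the center. */
{
struct circleCustom *c = obj->custom;
int dx = x - (obj->x + c->radius);
int dy = y - (obj->y + c->radius);
return dx*dx + dy*dy <= c->radius * c->radius;
}

static void circleFreeCustom(struct screenObj *obj)
/* Free the circle-specific part. */
{
free(obj->custom);
obj->custom = NULL;
}

struct screenObj *boxNew(int x, int y, int width, int height)
/* Make a plain box. It has no custom data. */
{
struct screenObj *obj = calloc(1, sizeof(*obj));
if (obj == NULL)
    abort();
obj->x = x; obj->y = y; obj->width = width; obj->height = height;
obj->in = boxIn;
return obj;
}

struct screenObj *circleNew(int centerX, int centerY, int radius)
/* Make a circle. The bounds are its enclosing square. */
{
struct screenObj *obj = boxNew(centerX - radius, centerY - radius,
    2*radius, 2*radius);
struct circleCustom *c = calloc(1, sizeof(*c));
if (c == NULL)
    abort();
c->radius = radius;
obj->custom = c;
obj->in = circleIn;
obj->freeCustom = circleFreeCustom;
return obj;
}

void screenObjFree(struct screenObj **pObj)
/* Free up screen object including custom part. */
{
struct screenObj *obj = *pObj;
if (obj != NULL)
    {
    if (obj->freeCustom != NULL)
        obj->freeCustom(obj);
    free(obj);
    *pObj = NULL;
    }
}
```

Code that walks a list of screen objects calls screenObjIn on each one without knowing which kind it is; the circle "overrides" the box behavior simply by storing a different function pointer.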
Naming Conventions • Code is constrained by few natural laws. • There are many ways to do things, so programmers make arbitrary decisions. • Arbitrary decisions are hard to remember. • Conventions make decisions less arbitrary. • varName vs. VarName vs varname vs var_name. We use varName. • variable vs. var vs. vrbl vs. vble vs varible: if you need to abbreviate, keep it short.
Commenting Conventions • Each module has a comment describing its overall purpose. • Each function also has an overall comment. • Each field in a structure has a comment. • Longer functions broken into ‘paragraphs’ that each begin with a comment. • The module, function, and structure comments are replicated in the .h file, which serves as an index to the module.
Error Handling • Code prints out a message and aborts (via the errAbort function) when there is a problem. • This saves loads of error handling code and is generally the right thing to do. • You can ‘catch’ an errAbort if necessary, though it rarely is.
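The abort-by-default, catch-if-needed pattern can be sketched with setjmp and longjmp from the standard library. This is an illustration only, not the errAbort implementation: sketchErrAbort, parsePositive and tryParsePositive are hypothetical names, and a single global catch point is an assumed simplification.

```c
#include <setjmp.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

static jmp_buf *catchPoint = NULL;  /* Non-NULL while a catch is active. */
static char lastError[256];         /* Most recent error message. */

void sketchErrAbort(const char *format, ...)
/* Record the message, then jump to the active catch point if there
 * is one. Otherwise print the message and exit. */
{
va_list args;
va_start(args, format);
vsnprintf(lastError, sizeof(lastError), format, args);
va_end(args);
if (catchPoint != NULL)
    longjmp(*catchPoint, 1);
fprintf(stderr, "%s\n", lastError);
exit(1);
}

int parsePositive(const char *text)
/* Convert text to a positive int. No error return is needed:
 * on bad input this function does not return at all. */
{
int value = atoi(text);
if (value <= 0)
    sketchErrAbort("Expecting positive number, got '%s'", text);
return value;
}

int tryParsePositive(const char *text, int fallback)
/* The rare case: catch the abort and return fallback instead. */
{
jmp_buf here;
jmp_buf *saved = catchPoint;
volatile int result = fallback;
catchPoint = &here;
if (setjmp(here) == 0)
    result = parsePositive(text);
catchPoint = saved;  /* Restore any outer catch point. */
return result;
}
```

Most callers would use parsePositive directly and never check a return code, which is where the savings in error handling code come from.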
Memory • Uninitialized memory leads to difficult bugs. • Compiler set to warn of uninitialized vars • Dynamic memory goes through needMem. It is always zeroed. • Memory usually freed with freez(), which sets pointer to null as well as freeing it. • ‘Careful’ memory handler can be pushed to help track down memory bugs: • Sentinel values to detect writing past end of array • Detects memory freed twice or not freed • Detects heap corruption in general.
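The two habits above, always-zeroed allocation and free-then-null, can be sketched in a few lines of standard C. The names zeroedAlloc and freeAndNull are hypothetical stand-ins for needMem and freez; the real functions may differ in signature and behavior.

```c
#include <stdio.h>
#include <stdlib.h>

void *zeroedAlloc(size_t size)
/* Allocate size bytes of zeroed memory. Exits the program rather
 * than returning NULL, so callers never need to check. */
{
void *pt = calloc(1, size);
if (pt == NULL)
    {
    fprintf(stderr, "Out of memory, needed %zu bytes\n", size);
    exit(1);
    }
return pt;
}

#define freeAndNull(pPt) (free(*(pPt)), *(pPt) = NULL)
/* Pass the address of a pointer. Frees what it points to and sets
 * the pointer to NULL, so a stale pointer cannot be used afterwards
 * and a second call is harmless because free(NULL) does nothing. */
```

For example, after int *counts = zeroedAlloc(8 * sizeof(int)); every element starts at 0, and freeAndNull(&counts); leaves counts equal to NULL.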
Generally Useful Modules • String handling - common dystring wildcmp • Collections - common (singly linked lists), hash, dlist, binRange rbTree • DNA - dnautils dnaseq • Web - htmshell, cheapcgi, htmlPage • I/O - linefile, xap (XML), fa, nib, twoBit, blastParse, blastOut, maf, chain, gff • Graphics - memgfx, gifwrite, psGfx, vGfx
Anatomy of a CGI Script • Gets called by Web Server when user clicks submit or follows a cgi link. • Input is in environment variables and sometimes also stdin. Routines in cheapCgi move this to a hash table. • Output is to stdout. Routines in htmshell help with output formatting. • In the middle often access a database.
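A bare-bones version of that flow, written against the standard library only, might look like the sketch below. It is not cheapCgi or htmshell: queryVar and cgiRespond are hypothetical names, the variable name "position" is an assumption, and a linear scan of the query string stands in for the hash table mentioned above.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int hexVal(char c)
/* Value of one hexadecimal digit. */
{
if (isdigit((unsigned char)c))
    return c - '0';
return tolower((unsigned char)c) - 'a' + 10;
}

int queryVar(const char *query, const char *name, char *out, size_t outSize)
/* Find name=value in a URL-encoded query string and write the
 * decoded value to out. Returns 1 if found, 0 if not. */
{
size_t nameLen = strlen(name);
const char *s = query;
if (outSize == 0)
    return 0;
while (s != NULL && *s != 0)
    {
    if (strncmp(s, name, nameLen) == 0 && s[nameLen] == '=')
        {
        size_t n = 0;
        s += nameLen + 1;
        while (*s != 0 && *s != '&' && n + 1 < outSize)
            {
            if (*s == '+')
                out[n++] = ' ';
            else if (*s == '%' && isxdigit((unsigned char)s[1])
                  && isxdigit((unsigned char)s[2]))
                {
                out[n++] = (char)(hexVal(s[1]) * 16 + hexVal(s[2]));
                s += 2;
                }
            else
                out[n++] = *s;
            ++s;
            }
        out[n] = 0;
        return 1;
        }
    s = strchr(s, '&');  /* Skip to the next name=value pair. */
    if (s != NULL)
        ++s;
    }
return 0;
}

void cgiRespond(FILE *f)
/* Read the request from the environment and write the response,
 * header first, to f. A real script would pass stdout. */
{
const char *query = getenv("QUERY_STRING");
char position[128];
fprintf(f, "Content-Type: text/plain\r\n\r\n");
if (query != NULL && queryVar(query, "position", position, sizeof(position)))
    fprintf(f, "Position requested: %s\n", position);
else
    fprintf(f, "No position given.\n");
}
```

The response is sent as plain text here so that echoing user input needs no HTML escaping; a script producing HTML would have to escape it.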
Challenges of CGI • Each click launches program anew. • User state can be kept in ‘cart’ variables • Run from Web Server, harder to debug • Use cgiSpoof to run from command line • Push an error handler that will close out web page, so can see your error messages. htmShell does this, but webShell may not…. • Ideally should run in less than 2 seconds.
Relational Databases • Relational databases consist of tables, indices, and the Structured Query Language (SQL). • Tables are much like tab-separated files:
#chrom start end name strand score
chr22 14600000 14612345 ldlr + 0.989
chr21 18283999 18298577 vldlr - 0.998
• Fields are simple - no lists or substructures. • Can join tables based on a shared field. This is flexible, but only as fast as the index. • Tables and joins are accessed a row at a time. • The row is represented as an array of strings.
Converting A Row to Object

struct exoFish *exoFishLoad(char **row)
/* Load a exoFish from row fetched with select * from exoFish
 * from database.  Dispose of this with exoFishFree(). */
{
struct exoFish *ret;

AllocVar(ret);
ret->chrom = cloneString(row[0]);
ret->chromStart = sqlUnsigned(row[1]);
ret->chromEnd = sqlUnsigned(row[2]);
ret->name = cloneString(row[3]);
ret->score = sqlUnsigned(row[4]);
return ret;
}
Motivation for AutoSql • Row to object code is tedious at best. • Also have save object, free object code to write. • SQL create statement needs to match C structure. • Lack of lists without doing a join can seriously impact performance and complicate schema.
AutoSql Data Declaration

table exoFish
"An evolutionarily conserved region (ecore) with Tetraodon"
    (
    string chrom;     "Human chromosome or FPC contig"
    uint chromStart;  "Start position in chromosome"
    uint chromEnd;    "End position in chromosome"
    string name;      "Ecore name in Genoscope database"
    uint score;       "Score from 0 to 1000"
    )

See autoSql.doc for more details. See also autoXml
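For a declaration like the one above, the C side might look roughly like the following sketch. It is hand-written for illustration, not actual autoSql output: the next field and the exact layout are assumptions, and copyString and strtoul stand in for the library's cloneString and sqlUnsigned so the example compiles on its own.

```c
#include <stdlib.h>
#include <string.h>

struct exoFish
/* An evolutionarily conserved region (ecore) with Tetraodon */
    {
    struct exoFish *next;  /* Next in singly linked list. */
    char *chrom;           /* Human chromosome or FPC contig */
    unsigned chromStart;   /* Start position in chromosome */
    unsigned chromEnd;     /* End position in chromosome */
    char *name;            /* Ecore name in Genoscope database */
    unsigned score;        /* Score from 0 to 1000 */
    };

static char *copyString(const char *s)
/* Return a heap-allocated copy of s. */
{
size_t len = strlen(s) + 1;
char *copy = malloc(len);
if (copy == NULL)
    abort();
memcpy(copy, s, len);
return copy;
}

struct exoFish *exoFishLoad(char **row)
/* Build an exoFish from one database row, given as an array of
 * strings in field order. Dispose of this with exoFishFree(). */
{
struct exoFish *ret = calloc(1, sizeof(*ret));
if (ret == NULL)
    abort();
ret->chrom = copyString(row[0]);
ret->chromStart = (unsigned)strtoul(row[1], NULL, 10);
ret->chromEnd = (unsigned)strtoul(row[2], NULL, 10);
ret->name = copyString(row[3]);
ret->score = (unsigned)strtoul(row[4], NULL, 10);
return ret;
}

void exoFishFree(struct exoFish **pEl)
/* Free one exoFish, including its strings, and set pointer to NULL. */
{
struct exoFish *el = *pEl;
if (el != NULL)
    {
    free(el->chrom);
    free(el->name);
    free(el);
    *pEl = NULL;
    }
}
```

Because the struct, the loader and the free function all follow mechanically from the field list, generating them from a single declaration keeps the C code and the SQL schema from drifting apart.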
Coding Conclusion • It’s always safer on the lagging edge • Consider redesigning system as COBOL character-based application
UCSC Gene Family Browser Expression and other information on genes in a big sorted, linked table