Sordid Details of the Genome Browser • Totally retro technology • Highly portable across browsers • Fulfills the need for speed • Some assembly required
Retro Design Choices • C isn’t so bad, really • Universally available compilers • Very fast run time • Really nice debuggers • CGI is portable, at least • Works with all web browsers • Works with all web servers • Not too hard on host if scripts are small and fast. • MySQL worth every penny • Technically, it’s free • Fast, simple SQL database
Language Wars, episode 23812 • Problems with C: • Char arrays aren’t nearly as nice as strings. • Have to check return values for error codes. • Uninitialized local and heap variables lead to hard to isolate bugs. • Problems with C++ • 8 stream classes, 4 string classes, and half the time you still have char arrays for strings. • Throw/Catch not working so well in GNU. • Uninitialized local and heap vars still lead to hard to isolate bugs. • Private info ends up in huge headers. • Er, which setX is getting called in this context? • Problems with Java • Microsoft plot to kill client side Java by incompatible extensions worked all too well. • Server side Java not quite mainstream in 1999.
Fixing Problems with C • Good library routines can make life with char arrays better. • setjmp/longjmp, atexit and resource tracking lists can make error handling relatively easy • errAbort(char *message,…); • pushAbortHandler(AbortHandler) • Heap memory at least can be initialized to zero • needMem(int size) • #define AllocA(varName) needMem(sizeof(varName)) • freez(&objectPointer);
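The ideas on this slide can be sketched in a few lines of C. This is only an illustration under assumptions, not the library's actual code: the names `sketchErrAbort`, `sketchNeedMem`, `sketchFreez` and `sketchParsePositive` are hypothetical stand-ins for the routines named above, and a single saved `jmp_buf` pointer stands in for a real stack of abort handlers.

```c
#include <setjmp.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the routines named on the slide. */
static jmp_buf *sketchAbortTarget = NULL;  /* Where to jump on error, if set. */
static char sketchLastError[256];          /* Most recent formatted message. */

/* Format a message, then jump to the recovery point, or exit if none is set. */
void sketchErrAbort(const char *format, ...)
{
va_list args;
va_start(args, format);
vsnprintf(sketchLastError, sizeof(sketchLastError), format, args);
va_end(args);
if (sketchAbortTarget != NULL)
    longjmp(*sketchAbortTarget, 1);
fprintf(stderr, "%s\n", sketchLastError);
exit(EXIT_FAILURE);
}

/* Allocate zero-filled memory; aborts instead of returning NULL. */
void *sketchNeedMem(size_t size)
{
void *pt = calloc(1, size > 0 ? size : 1);
if (pt == NULL)
    sketchErrAbort("Out of memory: needed %lu bytes", (unsigned long)size);
return pt;
}

/* Free the pointer that pPt points at and set it to NULL. */
void sketchFreez(void *pPt)
{
void **ppt = pPt;
free(*ppt);
*ppt = NULL;
}

/* Parse a positive integer. Callers check one status code, while the
 * code inside is free to abort from any depth. Returns 0 or -1. */
int sketchParsePositive(const char *text, int *result)
{
jmp_buf recover;
jmp_buf *saved = sketchAbortTarget;
int status = 0;
sketchAbortTarget = &recover;
if (setjmp(recover) == 0)
    {
    int value = atoi(text);
    if (value <= 0)
        sketchErrAbort("Expected positive number, got '%s'", text);
    *result = value;
    }
else
    status = -1;   /* Arrived here via longjmp from sketchErrAbort. */
sketchAbortTarget = saved;
return status;
}
```

Because memory comes back zeroed and freed pointers are reset to NULL, a stale or uninitialized pointer tends to crash immediately rather than corrupt data quietly.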
Limited Object Orientation • A struct can generally act as an object. • Families of routines starting with the name of the object are like non-virtual methods, but more greppable. • struct dna *dnaNew(int size); • void dnaFree(struct dna **pDna); • void dnaCount(struct dna *dna, char base); • Virtual methods can be implemented by embedded function pointers. • All objects begin with a next pointer field so can be hung on a generic singly linked list. • Inheritance in wrong hands can destroy program locality worse than gotos.
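A minimal sketch of the struct-as-object convention, assuming a hypothetical `struct seq` object and hypothetical generic helpers `listAddHead` and `listCount` (none of these names come from the library). The point is that any struct beginning with a `next` pointer can share the same list code.

```c
#include <stdlib.h>
#include <string.h>

/* Any struct whose first field is a next pointer can be viewed as this. */
struct listNode
    {
    struct listNode *next;
    };

/* Push node onto the front of the list at *pList, whatever its real type. */
void listAddHead(void *pList, void *node)
{
struct listNode **ppt = pList;
struct listNode *n = node;
n->next = *ppt;
*ppt = n;
}

/* Count the elements of any list built from next-first structs. */
int listCount(const void *list)
{
const struct listNode *el;
int count = 0;
for (el = list; el != NULL; el = el->next)
    ++count;
return count;
}

/* A small "object": routines that act on it all start with seq. */
struct seq
    {
    struct seq *next;   /* Must be first so the generic list code works. */
    char *bases;        /* Zero-terminated sequence. */
    };

/* Constructor: returns a new seq holding a private copy of bases. */
struct seq *seqNew(const char *bases)
{
struct seq *seq = calloc(1, sizeof(*seq));
if (seq == NULL)
    abort();
seq->bases = malloc(strlen(bases) + 1);
if (seq->bases == NULL)
    abort();
strcpy(seq->bases, bases);
return seq;
}

/* Destructor: frees *pSeq and sets it to NULL. */
void seqFree(struct seq **pSeq)
{
if (*pSeq != NULL)
    {
    free((*pSeq)->bases);
    free(*pSeq);
    *pSeq = NULL;
    }
}

/* Method: count occurrences of base in the sequence. */
int seqCount(const struct seq *seq, char base)
{
const char *s;
int count = 0;
for (s = seq->bases; *s != 0; ++s)
    if (*s == base)
        ++count;
return count;
}
```

Searching the source for `seq[A-Z]` finds every routine that belongs to the object, which is the "greppable" advantage the slide mentions.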
Basic Module Structure • Library interfaces are in inc/*.h; implementations are in lib/*.c • There are two libraries: • src/lib - older, more generic. 54 modules in all. • src/hg/lib - newer, more human genome project specific. Requires MySQL to compile. 25 modules in all. • Programs are usually one or a few source files linked with libraries. • About 200 programs in all.
Library Utility Modules • common.h - basic stuff included in every program. Strings, files, error handling, singly linked lists. • hash.h - hash tables • linefile.h - line oriented and space/tab delimited file stuff. • bits.h - exciting arrays of bits • dlist.h - doubly linked lists • dystring.h - dynamically sized strings • localmem.h - fast local heap memory • portable.h - wrappers around things that vary between operating systems. • digraph.h - directed graphs.
Web Oriented Modules • cheapcgi.h - stuff to get variables and do other common chores for CGI scripts in C. • htmshell.h - stuff that makes it easier to write .html files, also heavily used by CGI scripts. • memgfx.h - draw on a 256 color bitmap in memory and save it as a GIF • hg/jksql.h - wrapper around MySQL interface with error handling and some shortcuts.
Biological Modules • xenalign.h - cross species aligner (pair HMM). • supStitch.h - fast large scale aligner for mRNA and other things with >95% base identity. • fuzzyFind.h - small scale aligner for mRNA and other things with >90% base identity. • dnautil.h - reverse complement, etc. • dnaseq.h - nucleotide sequence object. • fa.h - read/write Fasta files. • blastParse.h - read blast output. • psl.h - read/write psLayout alignments.
Important Programs • psLayout - Fast bulk alignment program for mRNA and other sequences with >95% sequence identity. • pslSort and pslReps - apply ‘near best in genome’ filter to alignments. • ooGreedy - Uses alignments and other data to assemble draft human genome. • waba - Cross species aligner. • ameme - DNA motif finder. • faNoise - adds various types of noise to an .fa file. • ccCp - copies a file efficiently to all nodes in compute cluster. • autoSql - generates C and SQL code from a data format specification.
The Browser • CGI script generates graphics on the fly as .gif file in temp dir. • Zooming and scrolling handled by link to same CGI script with different parameters. • Separate CGI script called to process most clicks. • Data is stored in MySQL database.
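Zooming by link can be sketched as follows: compute the new window, then format a URL that calls the same script with the new coordinates. The script path and the parameter names `chrom`, `start` and `end` are assumptions made for illustration, as are the function names `zoomWindow` and `formatWindowLink`.

```c
#include <stdio.h>

/* Compute the window after zooming about its center.
 * zoom > 1 narrows the view, zoom < 1 widens it. */
void zoomWindow(int winStart, int winEnd, double zoom,
    int *retStart, int *retEnd)
{
int width = winEnd - winStart;
int center = winStart + width / 2;
int newWidth = (int)(width / zoom);
int newStart;
if (newWidth < 1)
    newWidth = 1;
newStart = center - newWidth / 2;
if (newStart < 0)
    newStart = 0;      /* Do not scroll past the start of the chromosome. */
*retStart = newStart;
*retEnd = newStart + newWidth;
}

/* Format a link back to the same script with a different window.
 * Returns the snprintf result (length the full URL needs). */
int formatWindowLink(char *buf, size_t bufSize, const char *script,
    const char *chrom, int winStart, int winEnd)
{
return snprintf(buf, bufSize, "%s?chrom=%s&start=%d&end=%d",
    script, chrom, winStart, winEnd);
}
```

Since every view is just a URL, the script itself keeps no state between requests.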
The Interactive Challenge • Need to bring up initial page in about 15 seconds, subsequent pages in about 5 seconds. • Precompute stuff on 100 machine cluster. • Database usually the bottleneck. • Scaled out view of chromosome 1 involves over 500,000 items. • Database design must minimize number of seeks needed to display a window. • Must sort data, not just index it. • Graphics also need to be snappy.
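One way to see why sorting matters: when items are stored in order of start position, everything that begins inside a window forms one contiguous run, so it can be fetched with a single seek plus a sequential read. The sketch below shows this on an in-memory array with a binary search. It is a simplification for illustration only (a real window query must also pick up items that start before the window), and the names `lowerBound` and `itemsStartingInWindow` are hypothetical.

```c
#include <stddef.h>

/* Return the index of the first element >= key in a sorted array,
 * or n if every element is smaller. */
size_t lowerBound(const int *sorted, size_t n, int key)
{
size_t lo = 0, hi = n;
while (lo < hi)
    {
    size_t mid = lo + (hi - lo) / 2;
    if (sorted[mid] < key)
        lo = mid + 1;
    else
        hi = mid;
    }
return lo;
}

/* Find the contiguous run of items whose start lies in
 * [winStart, winEnd). Stores the index of the first such item in
 * *retFirst and returns how many there are. */
size_t itemsStartingInWindow(const int *starts, size_t n,
    int winStart, int winEnd, size_t *retFirst)
{
size_t first = lowerBound(starts, n, winStart);
size_t last = lowerBound(starts, n, winEnd);
*retFirst = first;
return last - first;
}
```

With an index alone, the matching rows could be scattered across the file, and each one might cost its own seek.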
Anatomy of CGI • A CGI script essentially just prints a web page to stdout. • If cgi-bin is part of the URL, the web server calls a program to get the page rather than reading a file. • Web ‘forms’ can pass data to CGI scripts. • CGI scripts can generate web forms. • Can embed images. • Image maps tell browser what URL to call when clicking on specific parts of an image. • A challenge - maintaining context between user clicks. (hidden vars)
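A minimal sketch of a CGI response that carries context in hidden form variables. The function name `writeBrowserPage`, the script path and the variable names are assumptions for illustration, and the values are assumed not to need HTML escaping. A real CGI program would pass `stdout` as the output stream.

```c
#include <stdio.h>

/* Write a CGI response: a header, a blank line, then the page.
 * The current window travels in hidden variables, so the next
 * request knows where the user was. */
void writeBrowserPage(FILE *out, const char *script, const char *chrom,
    int winStart, int winEnd)
{
fprintf(out, "Content-Type: text/html\n\n");
fprintf(out, "<html><body>\n");
fprintf(out, "<form action=\"%s\" method=\"get\">\n", script);
fprintf(out, "<input type=\"hidden\" name=\"chrom\" value=\"%s\">\n", chrom);
fprintf(out, "<input type=\"hidden\" name=\"start\" value=\"%d\">\n", winStart);
fprintf(out, "<input type=\"hidden\" name=\"end\" value=\"%d\">\n", winEnd);
fprintf(out, "<input type=\"submit\" value=\"refresh\">\n");
fprintf(out, "</form>\n");
fprintf(out, "</body></html>\n");
}
```

When the form is submitted, the browser sends the hidden values back as ordinary parameters, which is how the script recovers context without keeping any state on the server.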
Tracks - the central metaphor
struct trackGroup
/* Structure that displays a track. */
    {
    struct trackGroup *next;   /* Next on list. */
    char *mapName;             /* Name on ui buttons. */
    enum visibility vis;       /* Dense? Full? */
    char *longLabel;           /* Label for center. */
    char *shortLabel;          /* Label for left side. */
    void *items;               /* Singly linked item list. */
    …
    void (*loadItems)(struct trackGroup *tg);
    /* Load items, called before draw. */
    void (*drawItems)(struct trackGroup *tg, struct memGfx *mg, int x, int y, ... enum visibility vis);
    /* Draw all items. */
    char *(*itemName)(struct trackGroup *tg, void *item);
    /* Return name of an item. */
    int (*totalHeight)(struct trackGroup *tg);
    /* Return height needed for all items. */
    …
    };
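To show how function pointers in a track structure act as virtual methods, here is a reduced sketch. The `struct demoTrack` type and every function name below are hypothetical, and only a height method is modeled. Each track carries its own `totalHeight` routine, and the page layout code calls it without knowing what kind of track it is.

```c
#include <stddef.h>

enum demoVis {dvHide, dvDense, dvFull};

struct demoItem
    {
    struct demoItem *next;   /* Next on list. */
    const char *name;        /* Item name. */
    };

struct demoTrack
    {
    struct demoTrack *next;  /* Next on list. */
    const char *shortLabel;  /* Label for left side. */
    enum demoVis vis;        /* Hidden, dense or full. */
    struct demoItem *items;  /* Singly linked item list. */
    int lineHeight;          /* Pixel height of one line. */
    int (*totalHeight)(struct demoTrack *tg);  /* "Virtual" method. */
    };

/* Height for a track that stacks items: one line per item in full
 * mode, a single shared line in dense mode. */
int demoStackedHeight(struct demoTrack *tg)
{
struct demoItem *item;
int count = 0;
if (tg->vis == dvHide)
    return 0;
if (tg->vis == dvDense)
    return tg->lineHeight;
for (item = tg->items; item != NULL; item = item->next)
    ++count;
return count * tg->lineHeight;
}

/* Height for a track that always uses one line (a ruler, say). */
int demoFixedHeight(struct demoTrack *tg)
{
return (tg->vis == dvHide) ? 0 : tg->lineHeight;
}

/* Sum the heights of all tracks by calling each track's own method. */
int demoPageHeight(struct demoTrack *trackList)
{
struct demoTrack *tg;
int total = 0;
for (tg = trackList; tg != NULL; tg = tg->next)
    total += tg->totalHeight(tg);
return total;
}
```

Adding a new kind of track then means writing a few functions and filling in the pointers. The layout loop stays unchanged.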
Loading data in window
• Open database and build a query:
    conn = sqlConnect("hg3");
    sprintf(query, "select * from ctgPos"
                   " where chrom = '%s'"
                   " and chromStart < %d"
                   " and chromEnd > %d",
            winChrom, winEnd, winStart);
• Query database:
    sr = sqlGetResult(conn, query);
• Get results as array of strings:
    while ((row = sqlNextRow(sr)) != NULL)
• Use autoSql generated routine to convert to object:
    ctg = ctgPosLoad(row);
• Save on item list:
    slAddHead(&itemList, ctg);
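The window query selects every item that overlaps the window: an item overlaps when it starts before the window ends and ends after the window starts. A small sketch of that test and of the query construction follows. Half-open coordinates are assumed, the function names `rangesOverlap` and `buildWindowQuery` are hypothetical, and the table and chromosome names are assumed to come from trusted code rather than from user input.

```c
#include <stdio.h>

/* Return 1 if [start, end) overlaps [winStart, winEnd), else 0. */
int rangesOverlap(int start, int end, int winStart, int winEnd)
{
return start < winEnd && end > winStart;
}

/* Build the SQL for all items in table that overlap the window.
 * Note the argument order: chromStart is compared with winEnd and
 * chromEnd with winStart. Returns the snprintf result. */
int buildWindowQuery(char *buf, size_t bufSize, const char *table,
    const char *chrom, int winStart, int winEnd)
{
return snprintf(buf, bufSize,
    "select * from %s"
    " where chrom = '%s'"
    " and chromStart < %d"
    " and chromEnd > %d",
    table, chrom, winEnd, winStart);
}
```

Using `snprintf` instead of `sprintf` is a choice made for this sketch so that an oversized query cannot overrun the buffer.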
Drawing Data
• Loop through item list:
    for (ctg = items; ctg != NULL; ctg = ctg->next)
• Scale item to window:
    x1 = scaleItem(ctg->chromStart);
    x2 = scaleItem(ctg->chromEnd);
    w = x2-x1;
• Render item:
    mgDrawBox(mg, x1, y, w, height, color);
    mgTextCentered(mg, x1, y, w, height, color, ctg->name);
• Write box to image map (before y changes, so the clickable area matches the drawn box):
    mapBox(x1, y, w, height, "ctgPos", ctg->name);
• Advance to next line if full display:
    if (vis == tvFull) y += height;
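Scaling an item to the window amounts to a linear map from base positions to pixels. The sketch below is one possible version, not the browser's own `scaleItem`. The names `scalePos` and `itemBox` are hypothetical, and clipping to the window and the one-pixel minimum width are assumptions added so that tiny items stay visible.

```c
/* Map a base position inside [winStart, winEnd] to a pixel x
 * coordinate in [0, pixWidth]. Integer math keeps the result exact. */
int scalePos(int pos, int winStart, int winEnd, int pixWidth)
{
long long offset = (long long)pos - winStart;
return (int)(offset * pixWidth / (winEnd - winStart));
}

/* Compute the pixel x and width of an item, clipped to the window. */
void itemBox(int start, int end, int winStart, int winEnd, int pixWidth,
    int *retX, int *retW)
{
int x1, x2, w;
if (start < winStart)
    start = winStart;
if (end > winEnd)
    end = winEnd;
x1 = scalePos(start, winStart, winEnd, pixWidth);
x2 = scalePos(end, winStart, winEnd, pixWidth);
w = x2 - x1;
if (w < 1)
    w = 1;     /* Keep items narrower than a pixel visible. */
*retX = x1;
*retW = w;
}
```

The same x and width can then be used for both the drawn box and the image map entry, so the clickable region lines up with what the user sees.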
Conclusions • Robust, simple, extensible, and fast design that works across web browsers. • Appropriate use of lagging edge technologies. • Write ups in Science and Nature. • >1000 users per day.
Acknowledgements • David Haussler - bold, charming, astute. A good teacher to boot. • Al Zahler - a kind and generous boss and a sharp biologist. • Paul Tatarsky - #1 system admin. • Scott, Nick, Terry, Patrick and Ewan - for all the programming. • Francis, Eric, Bob, and John - over 4 billion bases served.