
Sordid Details of the Genome Browser


  1. Sordid Details of the Genome Browser • Totally retro technology • Highly portable across browsers • Fulfills the need for speed • Some assembly required

  2. Retro Design Choices • C isn’t so bad, really • Universally available compilers • Very fast run time • Really nice debuggers • CGI is portable, at least • Works with all web browsers • Works with all web servers • Not too hard on host if scripts are small and fast. • MySQL worth every penny • Technically, it’s free • Fast, simple SQL database

  3. Language Wars, episode 23812 • Problems with C: • Char arrays aren’t nearly as nice as strings. • Have to check return values for error codes. • Uninitialized local and heap variables lead to hard-to-isolate bugs. • Problems with C++: • 8 stream classes, 4 string classes, and half the time you still have char arrays for strings. • Throw/Catch not working so well in GNU. • Uninitialized local and heap vars still lead to hard-to-isolate bugs. • Private info ends up in huge headers. • Er, which setX is getting called in this context? • Problems with Java: • Microsoft plot to kill client side Java by incompatible extensions worked all too well. • Server side Java not quite mainstream in 1999.

  4. Fixing Problems with C • Good library routines can make life with char arrays better. • setjmp/longjmp, atexit and resource tracking lists can make error handling relatively easy • errAbort(char *message,…); • pushAbortHandler(AbortHandler) • Heap memory at least can be initialized to zero • needMem(int size) • #define AllocA(varName) needMem(sizeof(varName)) • freez(&objectPointer);
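
The ideas on this slide can be sketched roughly as follows. The names errAbort, pushAbortHandler, needMem and freez come from the slide; the function bodies, the popAbortHandler and tryRisky helpers, and the fixed-size handler stack are assumptions made for illustration and are not the library's actual code.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch: bodies are assumptions, not the real library. */
typedef void (*AbortHandler)(void);

#define MAX_HANDLERS 8
static AbortHandler handlerStack[MAX_HANDLERS];
static int handlerCount = 0;

void pushAbortHandler(AbortHandler handler)
/* Install a handler that errAbort calls after printing its message. */
{
    assert(handlerCount < MAX_HANDLERS);
    handlerStack[handlerCount++] = handler;
}

void popAbortHandler(void)
/* Remove the most recently installed handler. */
{
    assert(handlerCount > 0);
    --handlerCount;
}

void errAbort(char *format, ...)
/* Print a printf-style message to stderr, then call the top handler.
 * If no handler is installed, or the handler returns, exit. */
{
    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    fputc('\n', stderr);
    if (handlerCount > 0)
        handlerStack[handlerCount - 1]();
    exit(EXIT_FAILURE);
}

void *needMem(size_t size)
/* Return zeroed memory or abort, so callers never test for NULL. */
{
    void *pt = calloc(1, size);
    if (pt == NULL)
        errAbort("Out of memory (needed %lu bytes)", (unsigned long)size);
    return pt;
}

void freez(void *ppt)
/* Free the block that *ppt points to and set the pointer to NULL. */
{
    void **pp = (void **)ppt;
    free(*pp);
    *pp = NULL;
}

/* A recovery point: the handler jumps back to whoever called setjmp. */
static jmp_buf recover;
static void jumpBack(void) { longjmp(recover, 1); }

int tryRisky(int fail)
/* Return 0 on success, 1 if errAbort fired inside the guarded region. */
{
    int status;
    pushAbortHandler(jumpBack);
    if (setjmp(recover) == 0) {
        if (fail)
            errAbort("something went wrong: code %d", 42);
        status = 0;
    } else {
        status = 1;
    }
    popAbortHandler();
    return status;
}
```

With something like this in place, most code can call needMem and errAbort freely and skip return-value checks; only the few places that want to recover (a CGI main loop, say) install a handler.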

  5. Limited Object Orientation • A struct can generally act as an object. • Families of routines starting with the name of the object are like non-virtual methods, but more greppable. • struct dna *dnaNew(int size); • void dnaFree(struct dna **pDna); • int dnaCount(struct dna *dna, char base); • Virtual methods can be implemented by embedded function pointers. • All objects begin with a next pointer field so they can be hung on a generic singly linked list. • Inheritance in the wrong hands can destroy program locality worse than gotos.
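
A minimal sketch of the pattern, assuming a made-up struct shape. Only the "next pointer comes first" layout, the name-prefixed routines, and the function-pointer idea come from the slide. slAddHead is named on a later slide, but the body here, slCount, and every shape routine are illustrative assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/* Generic list node: any struct whose first field is a next pointer
 * can be treated as one of these. */
struct slList {
    struct slList *next;
};

void slAddHead(void *listPt, void *node)
/* Add node to the front of the list. listPt is the address of the
 * list head pointer (hypothetical body). */
{
    struct slList **ppt = (struct slList **)listPt;
    struct slList *n = (struct slList *)node;
    assert(n != NULL);
    n->next = *ppt;
    *ppt = n;
}

int slCount(void *list)
/* Count the elements on any next-pointer-first list. */
{
    struct slList *pt = (struct slList *)list;
    int len = 0;
    for (; pt != NULL; pt = pt->next)
        len += 1;
    return len;
}

/* Hypothetical object with one "virtual" method. */
struct shape {
    struct shape *next;                   /* Must be the first field. */
    double size;                          /* Side length. */
    double (*area)(struct shape *shape);  /* Virtual method. */
};

double squareArea(struct shape *s)   { return s->size * s->size; }
double triangleArea(struct shape *s) { return s->size * s->size / 2; }

struct shape *shapeNew(double size, double (*area)(struct shape *shape))
/* "Constructor": routines are named after the object they work on. */
{
    struct shape *s = calloc(1, sizeof(*s));
    if (s == NULL)
        abort();
    s->size = size;
    s->area = area;
    return s;
}

double shapeTotalArea(struct shape *list)
/* Dispatch through the function pointer for each object on the list. */
{
    double total = 0;
    struct shape *s;
    for (s = list; s != NULL; s = s->next)
        total += s->area(s);
    return total;
}
```

Because every routine starts with the object name, `grep shape` finds the whole interface, which is the "more greppable" point the slide makes.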

  6. Basic Module Structure • Library interfaces are in inc/*.h. Implementations are in lib/*.c. • There are two libraries: • src/lib - older, more generic. 54 modules in all. • src/hg/lib - newer, more human genome project specific. Requires MySQL to compile. 25 modules in all. • Programs are usually one or a few source files linked with libraries. • About 200 programs in all.

  7. Library Utility Modules • common.h - basic stuff included in every program. Strings, files, error handling, singly linked lists. • hash.h - hash tables • linefile.h - line oriented and space/tab delimited file stuff. • bits.h - exciting arrays of bits • dlist.h - doubly linked lists • dystring.h - dynamically sized strings • localmem.h - fast local heap memory • portable.h - wrappers around things that vary between operating systems. • digraph.h - directed graphs.

  8. Web Oriented Modules • cheapcgi.h - stuff to get variables and do other common chores for CGI scripts in C. • htmshell.h - stuff that makes it easier to write .html files, also heavily used by CGI scripts. • memgfx.h - draw on a 256 color bitmap in memory and save it as a GIF • hg/jksql.h - wrapper around MySQL interface with error handling and some shortcuts.

  9. Biological Modules • xenalign.h - cross species aligner (pair HMM). • supStitch.h - fast large scale aligner for mRNA and other things with >95% base identity. • fuzzyFind.h - small scale aligner for mRNA and other things with >90% base identity. • dnautil.h - reverse complement, etc. • dnaseq.h - nucleotide sequence object. • fa.h - read/write Fasta files. • blastParse.h - read blast output. • psl.h - read/write psLayout alignments.

  10. Important Programs • psLayout - Fast bulk alignment program for mRNA and other sequences with >95% sequence identity. • pslSort and pslReps - apply ‘near best in genome’ filter to alignments. • ooGreedy - Uses alignments and other data to assemble draft human genome. • waba - Cross species aligner. • ameme - DNA motif finder. • faNoise - add various types of noise to an .fa file. • ccCp - copy a file efficiently to all nodes in a compute cluster. • autoSql - generates C and SQL code from a data format specification.

  11. The Browser • CGI script generates graphics on the fly as .gif file in temp dir. • Zooming and scrolling handled by link to same CGI script with different parameters. • Separate CGI script called to process most clicks. • Data is stored in MySQL database.
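
Zooming and scrolling by re-linking to the same script might look something like the sketch below. The struct window type, the parameter names chrom/start/end, and the /cgi-bin/demoTracks path are all hypothetical; the slide only says that the link carries different parameters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical window type and URL layout, not taken from the browser. */
struct window {
    const char *chrom;   /* Chromosome name. */
    int start;           /* First base in view. */
    int end;             /* One past the last base in view. */
};

struct window zoomWindow(struct window w, double factor, int chromSize)
/* Scale the window around its center. factor > 1 zooms out, factor < 1
 * zooms in. The result is clamped to [0, chromSize]. */
{
    int center = w.start + (w.end - w.start) / 2;
    int half = (int)((w.end - w.start) * factor / 2);
    assert(w.end > w.start && factor > 0);
    if (half < 1)
        half = 1;
    w.start = center - half;
    w.end = center + half;
    if (w.start < 0)
        w.start = 0;
    if (w.end > chromSize)
        w.end = chromSize;
    return w;
}

struct window scrollWindow(struct window w, int offset, int chromSize)
/* Shift the window by offset bases, keeping its width where possible. */
{
    int width = w.end - w.start;
    int start = w.start + offset;
    if (start + width > chromSize)
        start = chromSize - width;
    if (start < 0)
        start = 0;
    w.start = start;
    w.end = start + width;
    if (w.end > chromSize)
        w.end = chromSize;
    return w;
}

int windowLink(char *buf, size_t bufSize, struct window w)
/* Write the URL that redisplays window w. Returns 1 if it fit in buf.
 * This sketch does no URL encoding, so it rejects '&' in the name. */
{
    int n;
    assert(strchr(w.chrom, '&') == NULL);
    n = snprintf(buf, bufSize, "/cgi-bin/demoTracks?chrom=%s&start=%d&end=%d",
                 w.chrom, w.start, w.end);
    return n >= 0 && (size_t)n < bufSize;
}
```

Each "zoom in", "zoom out" or "move left" control on the page is then just an ordinary link whose URL comes from windowLink, so the server needs no per-user state for navigation.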

  12. The Interactive Challenge • Need to bring up initial page in about 15 seconds, subsequent pages in about 5 seconds. • Precompute stuff on 100 machine cluster. • Database usually the bottleneck. • Scaled out view of chromosome 1 involves over 500,000 items. • Database design must minimize number of seeks needed to display a window. • Must sort data, not just index it. • Graphics also need to be snappy.

  13. Anatomy of CGI • A CGI script essentially just prints a web page to stdout. • If cgi-bin is part of the URL, the web server knows to call a program to get the page rather than read a file. • Web ‘forms’ can pass data to CGI scripts. • CGI scripts can generate web forms. • Can embed images. • Image maps tell the browser what URL to call when the user clicks on specific parts of an image. • A challenge - maintaining context between user clicks. (hidden vars)
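
As a rough illustration of these points, here is a sketch of the two halves of a CGI program in C: pulling a variable out of the query string, and printing a page whose form carries context forward in a hidden variable. The function names cgiFindVar and writePage and the /cgi-bin/demo path are hypothetical and are not the cheapcgi.h interface. The sketch skips %XX decoding and HTML escaping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

int cgiFindVar(const char *query, const char *name, char *out, size_t outSize)
/* Copy the value of name from a "a=1&b=2" style query string into out,
 * truncating if needed. Returns 1 if the variable was found, else 0. */
{
    size_t nameLen = strlen(name);
    const char *pt = query;
    assert(outSize > 0);
    while (pt != NULL && *pt != '\0') {
        const char *end = strchr(pt, '&');
        size_t pairLen = (end != NULL) ? (size_t)(end - pt) : strlen(pt);
        if (pairLen > nameLen && strncmp(pt, name, nameLen) == 0
                && pt[nameLen] == '=') {
            size_t valLen = pairLen - nameLen - 1;
            if (valLen >= outSize)
                valLen = outSize - 1;
            memcpy(out, pt + nameLen + 1, valLen);
            out[valLen] = '\0';
            return 1;
        }
        pt = (end != NULL) ? end + 1 : NULL;
    }
    return 0;
}

void writePage(FILE *f, const char *chrom)
/* Print a complete CGI response: header, blank line, then HTML.
 * The hidden input hands the current chromosome to the next request. */
{
    fprintf(f, "Content-Type: text/html\r\n\r\n");
    fprintf(f, "<html><body>\n");
    fprintf(f, "<form action=\"/cgi-bin/demo\" method=\"get\">\n");
    fprintf(f, "<input type=\"hidden\" name=\"chrom\" value=\"%s\">\n", chrom);
    fprintf(f, "<input type=\"submit\" value=\"refresh\">\n");
    fprintf(f, "</form>\n</body></html>\n");
}
```

A main function would typically read the QUERY_STRING environment variable with getenv, look up the variables it needs with cgiFindVar, and call writePage(stdout, ...).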

  14. Tracks - the central metaphor
      struct trackGroup
      /* Structure that displays a track. */
      {
          struct trackGroup *next;   /* Next on list. */
          char *mapName;             /* Name on ui buttons. */
          enum visibility vis;       /* Dense? Full? */
          char *longLabel;           /* Label for center. */
          char *shortLabel;          /* Label for left side. */
          void *items;               /* Singly linked item list. */
          …
          void (*loadItems)(struct trackGroup *tg);
          /* Load items, called before draw. */
          void (*drawItems)(struct trackGroup *tg, struct memGfx *mg,
                            int x, int y, ... enum visibility vis);
          /* Draw all items. */
          char *(*itemName)(struct trackGroup *tg, void *item);
          /* Return name of an item. */
          int (*totalHeight)(struct trackGroup *tg);
          /* Return height needed for all items. */
          …
      };
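
To show how such a structure gets used, here is a cut-down sketch in which two tracks install different totalHeight methods and a layout routine sums them without knowing which kind of track it is looking at. struct demoTrack, its itemCount and lineHeight fields, both height routines, and the tvHide and tvDense enum values are assumptions for illustration; only tvFull and the totalHeight method appear in the slides.

```c
#include <assert.h>
#include <stddef.h>

/* tvFull appears in the slides; the other two values are assumed. */
enum visibility { tvHide, tvDense, tvFull };

/* Hypothetical, much reduced stand-in for struct trackGroup. */
struct demoTrack {
    struct demoTrack *next;    /* Next on list. */
    char *shortLabel;          /* Label for left side. */
    enum visibility vis;       /* Hidden, dense or full. */
    int itemCount;             /* Number of loaded items. */
    int lineHeight;            /* Pixel height of one line. */
    int (*totalHeight)(struct demoTrack *tg);  /* Height needed. */
};

int linePerItemHeight(struct demoTrack *tg)
/* Full mode gives every item its own line; dense packs all on one. */
{
    switch (tg->vis) {
    case tvFull:
        return tg->itemCount * tg->lineHeight;
    case tvDense:
        return tg->lineHeight;
    default:
        return 0;
    }
}

int singleLineHeight(struct demoTrack *tg)
/* A track that always draws on one line unless hidden. */
{
    return (tg->vis == tvHide) ? 0 : tg->lineHeight;
}

int imageHeight(struct demoTrack *trackList)
/* Sum the height of every track by calling its own method. */
{
    int total = 0;
    struct demoTrack *tg;
    for (tg = trackList; tg != NULL; tg = tg->next) {
        assert(tg->totalHeight != NULL);
        total += tg->totalHeight(tg);
    }
    return total;
}
```

Adding a new kind of track then means writing its handful of methods and filling in one struct; the layout and drawing loops stay unchanged.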

  15. Loading data in window
      • Open database and build a query:
            conn = sqlConnect("hg3");
            sprintf(query, "select * from ctgPos"
                           " where chrom = '%s'"
                           " and chromStart < %d"
                           " and chromEnd > %d",
                           winChrom, winEnd, winStart);
      • Query database:
            sr = sqlGetResult(conn, query);
      • Get results as array of strings:
            while ((row = sqlNextRow(sr)) != NULL)
      • Use autoSql-generated routine to convert to object:
                ctg = ctgPosLoad(row);
      • Save on item list:
                slAddHead(&itemList, ctg);
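
The window query selects every item that overlaps the window, which means the item must start before the window ends and end after the window starts. The sketch below spells out that test and builds the query with a bounded snprintf. The names rangesOverlap and buildWindowQuery are hypothetical, and half-open [start, end) coordinates are an assumption, since the slides do not state the convention.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

int rangesOverlap(int start1, int end1, int start2, int end2)
/* Half-open ranges [start, end) overlap when each one starts before
 * the other one ends. */
{
    return start1 < end2 && start2 < end1;
}

int buildWindowQuery(char *buf, size_t bufSize, const char *table,
                     const char *chrom, int winStart, int winEnd)
/* Write a query for all rows of table that overlap the window.
 * Returns 1 if the query fit in buf, 0 if it was truncated.
 * This sketch does no SQL escaping, so it rejects quotes in chrom. */
{
    int n;
    assert(strchr(chrom, '\'') == NULL);
    n = snprintf(buf, bufSize,
                 "select * from %s where chrom = '%s'"
                 " and chromStart < %d and chromEnd > %d",
                 table, chrom, winEnd, winStart);
    return n >= 0 && (size_t)n < bufSize;
}
```

Note the argument order: chromStart is compared against the window end and chromEnd against the window start, mirroring rangesOverlap.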

  16. Drawing Data
      • Loop through item list:
            for (ctg = items; ctg != NULL; ctg = ctg->next)
      • Scale item to window:
                x1 = scaleItem(ctg->chromStart);
                x2 = scaleItem(ctg->chromEnd);
                w = x2-x1;
      • Render item:
                mgDrawBox(mg, x1, y, w, height, color);
                mgTextCentered(mg, x1, y, w, height, color, ctg->name);
      • Advance to next line if full display:
                if (vis == tvFull) y += height;
      • Write box to image map:
                mapBox(x1, y, w, height, "ctgPos", ctg->name);
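
The scaling step is a linear map from base coordinates to pixels. One possible version is sketched below. On the slide scaleItem takes a single argument, so the window presumably lives in variables not shown there; passing the window and image width explicitly, clamping to the window edges, and the boxWidth helper are all assumptions of this sketch.

```c
#include <assert.h>

int scaleItem(int basePos, int winStart, int winEnd, int pixelWidth)
/* Map a base position to an x pixel in [0, pixelWidth]. Positions
 * outside the window are clamped to its edges so that items which
 * hang off either side still draw up to the border. */
{
    assert(winEnd > winStart && pixelWidth > 0);
    if (basePos < winStart)
        basePos = winStart;
    if (basePos > winEnd)
        basePos = winEnd;
    /* Integer arithmetic in long long avoids overflow and rounding
     * surprises on chromosome-sized coordinates. */
    return (int)((long long)(basePos - winStart) * pixelWidth
                 / (winEnd - winStart));
}

int boxWidth(int x1, int x2)
/* Width of the box to draw, at least one pixel so that tiny items
 * remain visible when zoomed far out. */
{
    int w = x2 - x1;
    return (w < 1) ? 1 : w;
}
```

For a 600-pixel image showing bases 1000 to 2000, an item spanning 1500 to 1750 would be drawn from x = 300 with a width of 150 pixels.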

  17. Conclusions • Robust, simple, extensible, and fast design that works across web browsers. • Appropriate use of lagging edge technologies. • Write ups in Science and Nature. • >1000 users per day.

  18. Acknowledgements • David Haussler - bold, charming, astute. A good teacher to boot. • Al Zahler - a kind and generous boss and a sharp biologist. • Paul Tatarsky - #1 system admin. • Scott, Nick, Terry, Patrick and Ewan - for all the programming. • Francis, Eric, Bob, and John - over 4 billion bases served.
