autoconf and Biological Annotation Tool (BAT)
Bob Zimmermann
6 September 2006
First, a Bit of a Digression (Look Familiar?)
    checking for a BSD-compatible install... /usr/bin/install -c
    checking whether build environment is sane... yes
    checking for gawk... no
    checking for mawk... no
    checking for nawk... no
    checking for awk... awk
    checking whether make sets $(MAKE)... yes
    checking build system type... powerpc-apple-darwin8.7.0
    checking host system type... powerpc-apple-darwin8.7.0
    checking for style of include used by make... GNU
    checking for gcc... gcc
    checking for C compiler default output file name... a.out
    . . .
So
• How does everyone have nearly identical 50,000-line configure scripts?
  • Serendipity.
  • NO! autoconf
• Why have such a hacky shell script?
  • Assume the worst when building on other OSs: Solaris, *BSD, OS X, VMS (ugh.)
  • Many people are working on it
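The "checking for gawk... no" lines come from configure trying candidate program names in order and keeping the first one it finds. A minimal sketch of that idea in Python, for illustration only: the real configure script is generated shell, and the helper name first_available and its which parameter are made up here.

```python
import shutil


def first_available(candidates, which=shutil.which):
    """Try each program name in order; return (found_name, log_lines).

    `which` is injectable so the probe can be exercised without
    depending on what is installed on the current machine.
    """
    log = []
    for name in candidates:
        if which(name) is None:
            log.append(f"checking for {name}... no")
        else:
            log.append(f"checking for {name}... {name}")
            return name, log
    # Nothing matched: the caller decides whether that is fatal.
    return None, log


if __name__ == "__main__":
    found, log = first_available(["gawk", "mawk", "nawk", "awk"])
    print("\n".join(log))
```

The candidate list is checked most-preferred first, which is why a machine with only plain awk reports three "no" lines before it succeeds.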
How does it work?
• Write configure.in (or run autoscan to get a starting point)
• Run aclocal; autoheader; autoconf (these generate aclocal.m4, config.h.in and configure, respectively)
An Example
    AC_INIT(iscan, 3.5.0, brent@cse.wustl.edu)
    …
    AC_CHECK_LIB([m], [log])
    PKG_CHECK_MODULES([GLIB], [ glib-2.0 >= 2.0.0 ],
                      AC_MSG_RESULT([yes]), AC_MSG_RESULT([no]))
    AC_SUBST(GLIB_CFLAGS)
    AC_SUBST(GLIB_LDFLAGS)
    AC_CHECK_HEADERS([libgen.h fcntl.h float.h limits.h stdlib.h string.h unistd.h])
    AC_DEFINE_UNQUOTED([BUILD],["`date +'%Y.%m.%d.%R'``whoami`"],
                       [Id of the build for versioning purposes])
    AC_CHECK_FUNCS([floor memset pow sqrt sprintf strerror strstr strtol])
An Example
    /* Id of the build for versioning purposes */
    #define BUILD "2006.08.30.04:05rpz"
    /* Define to 1 if you have the `pow' function. */
    #define HAVE_POW 1
    /* Define to `unsigned' if <sys/types.h> does not define. */
    /* #undef size_t */
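A rough sketch of how check results turn into header lines like the ones above. It is written in Python purely as illustration: the real work is done by the shell code autoconf generates, which compiles and links small test programs. The names have_macro and render_config_h are hypothetical, and the check results are passed in by hand rather than probed.

```python
import re


def have_macro(name):
    """Map a function or header name to its HAVE_ macro.

    Upper-case the name and replace anything that is not A-Z or 0-9
    with an underscore, so "pow" becomes HAVE_POW and "libgen.h"
    becomes HAVE_LIBGEN_H.
    """
    return "HAVE_" + re.sub(r"[^A-Z0-9]", "_", name.upper())


def render_config_h(function_checks):
    """Render header lines from a {function_name: found} mapping."""
    lines = []
    for func, found in function_checks.items():
        macro = have_macro(func)
        lines.append(f"/* Define to 1 if you have the `{func}' function. */")
        if found:
            lines.append(f"#define {macro} 1")
        else:
            # Missing features are left as a commented-out #undef.
            lines.append(f"/* #undef {macro} */")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hand-written results; a real configure run would discover these.
    print(render_config_h({"pow": True, "strtol": False}))
```

C sources then test the macro with #ifdef HAVE_POW instead of guessing from the operating system name.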
I Want Makefiles Too!
• OK: automake
  • Input a short Makefile.am and get a Makefile
  • Has targets clean, configure, dist, all
  • Can be built in any directory
  • Will adjust compiler flags based on results of configure
  • Can replace missing system calls
  • Can conditionally use libraries
Freaking Confusing, Bob
• I'll post a crude little guide on nijibabulu.org at some point.
BAT: Why?
• I am doing experiments with large annotation files
• Eval is slow and uses a lot of memory
• Eval has a lot of features we like
• A framework for parsing and analyzing annotations quickly and robustly is good
• Acronyms.
The General Idea
• We keep only one data structure hard coded: the BAT_Annotation
• Parsing and writing are handled in plugins
• Validation, evaluation and analysis are decoupled from parsing and writing
• We keep a low profile for heavy computational tasks
• Yes, there are some awful algorithms that go into annotation analysis
The Model
[Diagram: BAT_Annotation is the shared structure, connected to the following components]
• BAT_Parser: GTF, UCSC, PSL
• BAT_Writer: GTF, UCSC, PSL
• BAT_Validator: starts, gene_id, …, frame
• BAT_Evaluator: gene comp
• BAT_Actor: cluster, …
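One way to picture the model is a single annotation structure plus tables of format plugins. The sketch below is an assumption-laden illustration in Python, not BAT's code: the Feature and Annotation classes, the PARSERS and WRITERS tables, and every function name are invented here. Only the standard nine-column, tab-separated GTF layout is taken as given.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    seqname: str
    feature: str
    start: int
    end: int
    strand: str
    gene_id: str


@dataclass
class Annotation:
    """Hypothetical stand-in for BAT_Annotation: the one shared structure."""
    features: list = field(default_factory=list)


PARSERS = {}  # format name -> parser plugin
WRITERS = {}  # format name -> writer plugin


def register(table, fmt):
    """Decorator that files a plugin under a format name."""
    def decorator(fn):
        table[fmt] = fn
        return fn
    return decorator


@register(PARSERS, "gtf")
def parse_gtf(lines, annotation):
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        attrs = {}
        for item in cols[8].split(";"):
            key, _, value = item.strip().partition(" ")
            if key:
                attrs[key] = value.strip('"')
        annotation.features.append(Feature(
            cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6],
            attrs.get("gene_id", "")))


@register(WRITERS, "gtf")
def write_gtf(annotation):
    # Columns this sketch does not store (source, score, frame) become ".".
    for f in annotation.features:
        yield "\t".join([f.seqname, ".", f.feature, str(f.start), str(f.end),
                         ".", f.strand, ".", f'gene_id "{f.gene_id}";'])


def validate_gene_ids(annotation):
    """Validator that never touches file formats: list features lacking a gene_id."""
    return [f for f in annotation.features if not f.gene_id]
```

Because validators only see the Annotation, supporting another format such as UCSC or PSL would mean registering one more parser and writer without changing them.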
Example (Fake)
    use chr01.fa
    parse chr01.ucsc chr01
    parse chr1.extra.gtf chr01
    validate check_starts --delete
    validate gene_ids --cds-only
    …
    output chr1.eval.gtf
    ---
    act cluster_ests
    act make_estseq --output=chr01.estseq.fa
    etc.
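Since the script syntax above is explicitly fake, the following is only a guess at how such a file could be tokenized before dispatching to plugins. It assumes one command per line, a leading verb, "--name" or "--name=value" options, and that a line made only of dashes is a separator; parse_script is a hypothetical name.

```python
import shlex


def parse_script(text):
    """Split a command script into (verb, positional_args, options) steps."""
    steps = []
    for raw in text.splitlines():
        tokens = shlex.split(raw)
        # Skip blank lines and separator lines such as "---".
        if not tokens or set(tokens[0]) == {"-"}:
            continue
        verb, rest = tokens[0], tokens[1:]
        args = [t for t in rest if not t.startswith("--")]
        options = {}
        for t in rest:
            if t.startswith("--"):
                key, _, value = t[2:].partition("=")
                # A bare flag like --delete is stored as True.
                options[key] = value or True
        steps.append((verb, args, options))
    return steps


if __name__ == "__main__":
    script = "use chr01.fa\nvalidate check_starts --delete\n"
    for step in parse_script(script):
        print(step)
```

Each step could then be routed by its verb, for example "parse" to a parser plugin and "validate" to a validator.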
What's There Yet?
• Implemented pluggable GTF parser, GTF writer
• Implemented gene_id validator
• Benchmarks
  • Parse, validate and write chr2R.eval.gtf: Eval 120 MB, BAT 30 MB (down)
  • chr2R.preds.gtf: Eval 24+ hrs, BAT <2 min
• Compiles on OS X, Solaris, Linux, OpenBSD (autoconf!)
What Else Can It Be Good For?
• Perl and Python bindings
  • Modest (constant) loss of efficiency
• Parameter estimation
• Zoe output of multiple formats
• Target selection
• Name a project involving annotations of any format.