GadFly Building a Genome Annotation Database
What is it? • SQL Database • Perl Objects • Perl API • Client Applications • Analysis Pipeline
History • BFD • Celera Annotation Jamboree • GAME XML (Suzi, Erwin) • Ensembl • BioPerl • GO
Data Stored • Sequence • Analyses (genomic, cDNA, peptide) • Genome Annotations • Gene Ontology
SQL Database Design • Generic Modeling • Simplicity • Abstraction • Extensibility/Evolvability • Heavily Normalised
Location Graphs (II) • Locations are transitive relationships • Locations can be transformed • e.g. gene location in arm or contig coordinates • Linear transforms are a simple function • Nonlinear transforms are more difficult • e.g. if the seq_feature-to-seq relationship involves splicing or translation
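The linear case above can be sketched in a few lines. This is an illustrative Python sketch, not GadFly code (GadFly itself is Perl); the `Placement` class and the function names are assumptions made for the example. It shows a coordinate on a child sequence being mapped to its parent, and transitivity being used to chain contig -> scaffold -> arm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical record: where a child sequence sits on its parent."""
    offset: int  # 0-based parent coordinate of the child's first base
    length: int  # child length in bases
    strand: int  # +1 same orientation as the parent, -1 reversed


def to_parent(pos: int, p: Placement) -> int:
    """Map a 0-based child coordinate into the parent's coordinate system."""
    if not 0 <= pos < p.length:
        raise ValueError("position outside child sequence")
    if p.strand == 1:
        return p.offset + pos
    # Reversed child: its last base lands on the parent's offset.
    return p.offset + (p.length - 1 - pos)


def compose(pos: int, placements) -> int:
    """Apply placements innermost first (location is transitive)."""
    for p in placements:
        pos = to_parent(pos, p)
    return pos


contig_on_scaffold = Placement(offset=100, length=500, strand=1)
scaffold_on_arm = Placement(offset=1000, length=2000, strand=-1)
arm_pos = compose(10, [contig_on_scaffold, scaffold_on_arm])
```

Splicing or translation would not fit this form, since one child coordinate range maps to several disjoint parent ranges.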
Minimal Graphs • Not necessary to store everything • e.g. Exon => Intron • e.g. Gene to translation implied • Arcs implied from spatial relationships • Some redundancy useful • Flexibility essential • Sets vs lists
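The "Exon => Intron" point can be illustrated as follows. This is a minimal Python sketch under assumptions not stated on the slide: exons are half-open `(start, end)` intervals belonging to a single transcript on one strand, and the function name `implied_introns` is hypothetical.

```python
def implied_introns(exons):
    """Derive introns from the exons of one transcript.

    exons: iterable of half-open (start, end) intervals, in any order.
    Returns the gaps between consecutive exons, so introns never need
    to be stored explicitly.
    """
    ordered = sorted(exons)
    introns = []
    for (_, end1), (start2, _) in zip(ordered, ordered[1:]):
        if start2 > end1:  # abutting exons imply no intron
            introns.append((end1, start2))
    return introns


introns = implied_introns([(300, 400), (100, 200), (550, 600)])
```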
Unlimited Possibilities • Evidence networks • TFs + binding sites • Intersection graphs • precompute cytology • insertions + gene features • Yeast 2 hybrid / P-P interactions • Similarity Graphs
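As one example, an intersection graph over features such as insertions and genes could be precomputed roughly like this. The sketch is Python, assumes half-open `(start, end)` intervals on a single sequence, and uses made-up feature names; the pairwise comparison is quadratic and is only meant to show the idea.

```python
from itertools import combinations


def overlaps(a, b):
    """True if two half-open (start, end) intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def intersection_graph(features):
    """Build an undirected graph with an edge between overlapping features.

    features: dict mapping a feature name to its (start, end) interval.
    Returns a set of frozenset({name1, name2}) edges.
    """
    edges = set()
    for (name1, span1), (name2, span2) in combinations(features.items(), 2):
        if overlaps(span1, span2):
            edges.add(frozenset((name1, name2)))
    return edges


# Hypothetical features, for illustration only.
features = {
    "geneA": (100, 500),
    "insertion1": (450, 460),
    "geneB": (600, 900),
}
edges = intersection_graph(features)
```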
GadFly Object Model • Objects: • in-memory representation • Inheritance • Gene is a kind of SeqFeature • Interfaces • bioperl/gbrowse • Methods and attributes • e.g. length(), get_seq(), start(), etc
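The inheritance and method bullets can be pictured with a small sketch. GadFly's objects are Perl; the Python below only mirrors the method names on the slide (`length()`, `get_seq()`, `start()`), and the constructor arguments, coordinate convention (0-based, half-open) and `symbol` attribute are assumptions.

```python
# Complement table for DNA, used when a feature is on the reverse strand.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


class SeqFeature:
    """In-memory representation of a located feature on a sequence."""

    def __init__(self, seq, start, end, strand=1):
        self._seq = seq
        self._start = start  # 0-based, inclusive
        self._end = end      # 0-based, exclusive
        self._strand = strand

    def start(self):
        return self._start

    def end(self):
        return self._end

    def length(self):
        return self._end - self._start

    def get_seq(self):
        sub = self._seq[self._start:self._end]
        if self._strand == -1:
            # Reverse-complement for minus-strand features.
            sub = sub.translate(_COMPLEMENT)[::-1]
        return sub


class Gene(SeqFeature):
    """A Gene is a kind of SeqFeature with an extra attribute."""

    def __init__(self, symbol, seq, start, end, strand=1):
        super().__init__(seq, start, end, strand)
        self.symbol = symbol


gene = Gene("hypothetical_gene", "AAACGTTT", 1, 4, strand=-1)
```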
GadFly Perl API • How do we get/put objects? • Application Programmer Interface • Means of making requests about objects • Fetching Objects • database, file, XML, GFF • Putting Objects • database, file, XML, GFF, text, HTML • Adapters
API Requests • fetch all Genes that are transcription factors on 2L • write an annotated sequence to XML • fetch all the blastp results against human • find all sim4 hits to SD ESTs in the first megabase of 2L
Adapters • Objects are datasource-ignorant • Different In/Out adapters have different properties • No constraint on the number of database adapters • GadFly db: GxAdapters
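The adapter idea can be sketched as below. This is a Python illustration, not the GxAdapter API: the `Adapter` interface, both adapter classes, the tab-delimited layout and the feature name are all assumptions. The point is that `Feature` knows nothing about where it came from, so the same copy routine works between any pair of adapters.

```python
import io
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Feature:
    """Datasource-ignorant object: no reference to any adapter."""
    name: str
    seq_id: str
    start: int
    end: int


class Adapter(ABC):
    """Hypothetical interface shared by all in/out adapters."""

    @abstractmethod
    def fetch_all(self):
        ...

    @abstractmethod
    def store(self, feature):
        ...


class MemoryAdapter(Adapter):
    def __init__(self):
        self._items = []

    def fetch_all(self):
        return list(self._items)

    def store(self, feature):
        self._items.append(feature)


class TabTextAdapter(Adapter):
    """Reads and writes tab-delimited lines on a seekable text handle."""

    def __init__(self, handle):
        self._handle = handle

    def store(self, feature):
        self._handle.seek(0, io.SEEK_END)  # always append
        self._handle.write(
            f"{feature.seq_id}\t{feature.name}\t{feature.start}\t{feature.end}\n"
        )

    def fetch_all(self):
        self._handle.seek(0)
        found = []
        for line in self._handle:
            seq_id, name, start, end = line.rstrip("\n").split("\t")
            found.append(Feature(name, seq_id, int(start), int(end)))
        return found


def copy_features(source, sink):
    """Move objects between any two adapters."""
    for feature in source.fetch_all():
        sink.store(feature)


memory = MemoryAdapter()
memory.store(Feature("geneX", "2L", 100, 500))  # geneX is a made-up name
handle = io.StringIO()
text = TabTextAdapter(handle)
copy_features(memory, text)
```

A database adapter would implement the same two methods against SQL, which is why the number of adapters is unconstrained.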
Client Applications • flyshell • Web/CGI interface • multitude of scripts • pipeline • Apollo (kind of)
Future • intelligent denormalisation • Ontologies • GMOD • pan-flybase database (with Dave) • Data • other species, comparative, expression, proteomic • UI
Discussion • Object Models - the way to go? • language lock-in • insulated from db • complexity • Utilise DBMS more? • postgres: views, procedures • Ontologies + graph based systems?
Acknowledgements • BDGP • FlyBase • Ensembl - Ian, Ewan • WormBase - Lincoln Stein • In advance - • UC Davis • new folks