430 likes | 512 Views
Decoding the genome with perl, XML and SQL. Chris Mungall Howard Hughes Medical Institute UC Berkeley Lawrence Berkeley Lab. Outline. hacking the genome Data::Stag nested tag-value data and XML DBIx::DBStag Relational to XML mapping. CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT
E N D
Decoding the genome with perl, XML and SQL Chris Mungall Howard Hughes Medical Institute UC Berkeley Lawrence Berkeley Lab
Outline • hacking the genome • Data::Stag • nested tag-value data and XML • DBIx::DBStag • Relational to XML mapping
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT TTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGT GCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGA TGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCA TTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTG CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATT TAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATAT TGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAA ACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTG CCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCG AGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATG GTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAA TAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGA GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTA TATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCG GAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGC CAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTT TACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATAT TTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCG CGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAAC CCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTA ATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGC GTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCT CTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGA GAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAG AGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTG TCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGA ACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAA TATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTT TTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGT TGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATAT ATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCT CTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACA MVADERDDNIFKLPLIRNKFIATLNTAQYM IAYARVVPTYCANECLSFSVLYYRKPKKTI HDREREQRRYLDCLLNMIAYARVVPTYCAL YIMTASHSVLFYRKPKSTMHDRGSRTDI..
Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S The generic genome browser: a building block for a model organism system database. Genome Research 2002 12:1599-1610.
package Bio::Gene; use base qw(Bio::Entity); =head2 symbol Usage: $gene->symbol(“HGNC”); $symbol = $gene->symbol; =cut sub symbol { my $self = shift; $self->{_symbol} = shift if @_; return $self->{_symbol}; }
package Bio::Gene; use base qw(Bio::Entity); use Class::MethodMaker [ scalar => [qw/symbol taxon/], array => [qw/phenotypes functions positions/] ]; package Bio::Taxon; use base qw(Bio::Entity); use Class::MethodMaker [ scalar => [qw/id common_name genus species parent_node/], ];
Perl meta-object systems • eg Class::MethodMaker • fun! • Fine for ‘hermetic’ perl • but how do we interoperate with • other languages • databases • XML • No way of “querying” objects
Data::Stag • Objects and XML • perl objects are D(A)G-ish • XML is tree like • Why not merge the two? • YAML • alternate exchange format with perl-ish semantics • Stag • nested tag-value pairs • perl objects with XMLish semantics • XML-centric
<dataset> <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <phenotype>Hemochromatosis</phenotype> <phenotype>Porphyria variegata</phenotype> <mol_function>iron homeostasis</mol_function> <position> <start>10000</start> <end>5000</end> <chromosome>6</chromosome> </position> </gene> …
my $dataset = Data::Stag->parse(“genes.xml”); my $gene_set = $dataset->get_gene_set; foreach my $gene ($gene_set->get_gene) { next unless $gene->sget_tax_id == 9606; printf “%s\n”, $gene->sget_symbol; printf “ %s\n”, foreach $gene->get_phenotype; } <dataset> <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <phenotype>Hemochromatosis</phenotype> <phenotype>Porphyria variegata</phenotype> <mol_function>iron homeostasis</mol_function> <position> <start>10000</start> <end>5000</end> <chromosome>6</chromosome> </position> </gene>
Stag trees • Recursive data structure • node-labeled tree • Implemented as nested lists • $node = [$name,$value] • $value = DATA or list of $nodes • Equivalent to element-only XML
Stag accessors • Tag-value pairs • $fooval = $obj->get_FOO; # wantarray=false • $fooval = $obj->sget_FOO; • @foovals = $obj->get_FOO; # wantarray=true • $obj->set_FOO($fooval) • No distinction between single and multi valued tags • ordered (no hashes)
my $dataset = Data::Stag->parse(“genes.xml”); my @symbols = $dataset->find_symbol; <dataset> <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <phenotype>Hemochromatosis</phenotype> <phenotype>Porphyria variegata</phenotype> <mol_function>iron homeostasis</mol_function> <position> <start>10000</start> <end>5000</end> <chromosome>6</chromosome> </position> </gene>
my $dataset = Data::Stag->parse(“genes.xml”); my @symbols = $dataset->get(‘gene_set/gene/symbol’); <dataset> <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <phenotype>Hemochromatosis</phenotype> <phenotype>Porphyria variegata</phenotype> <mol_function>iron homeostasis</mol_function> <position> <start>10000</start> <end>5000</end> <chromosome>6</chromosome> </position> </gene>
my $dataset = Data::Stag->parse(“genes.xml”); my @genes = $dataset->where(‘gene’, sub { shift->sget(‘position/start’) < 5000}); <dataset> <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <phenotype>Hemochromatosis</phenotype> <phenotype>Porphyria variegata</phenotype> <mol_function>iron homeostasis</mol_function> <position> <start>10000</start> <end>5000</end> <chromosome>6</chromosome> </position> </gene>
Attributes and mixed elements <gene id=“HGNC”> <desc> implicated in <b>Hemochromatosis</b> </desc> </gene> <gene> <@> <id>HGNC</id> </@> <desc> <.>Implicated in</.> <b>Hemochromatosis</b> </desc> </gene> my $id = $gene->get(‘@/id’);
Stag alternate syntaxes dataset: gene_set: gene: symbol: HGNC tax_id: 9606 phenotype: Hemochromatosis phenotype: Porphyria variegata mol_function: iron homeostasis position: start: 10000 end: 5000 chromosome: 6 (dataset (gene_set (gene (symbol "HGNC") (tax_id "9606") (phenotype "Hemochromatosis") (phenotype "Porphyria variegata") (mol_function "iron homeostasis") (position (start "10000") (end "5000") (chromosome "6")))))
Stag event model • Like SAX, but more perlish • Data::Stag::BaseGenerator • Data::Stag::BaseHandler • Good for implementing parsers • compatible with SAX • slow!
Data::Stag features • Stag can be used with or without XML • TODO: “strict” mode • Speed issues • Stag queries vs XPATH?
SQL Databases • Relational databases are indispensable in bioinformatics • scalable with large datasets • expressive modeling • queries • Impedance mismatch problem • relational model and imperative programming languages are not well-matched
Object Relational Mapping • Tangram, Alzabo, Class::DBI, … • Object-centric • DBStag is SQL-centric
Why perl programmers don’t like SQL SELECT * FROM gene NATURAL JOIN gene_phenotype AS gp NATURAL JOIN taxon WHERE …
SELECT * FROM gene NATURAL JOIN gene_phenotype NATURAL JOIN taxon USE NESTING (gene_set(gene(taxon)(gene_phenotype))) <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <taxon> <name>human</name> <genus>Homo</genus> <species>sapiens</genus> </taxon> <gene_phenotype> <phenotype>Hemochromatosis</phenotype> </gene_phenotype> <gene_phenotype> <phenotype>Porphyria variegata</phenotype> </gene_phenotype> </gene>
relational-to-XML mappings • Tables mapped to non-terminal nodes • Columns mapped to terminal node (PCDATA) • Table nodes are nested according to: • order in SQL query • USE NESTING clause • Tables aliases add extra level of nesting
Can produce ‘unusual’ XML <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <taxon> <name>human</name> <genus>Homo</genus> <species>sapiens</genus> </taxon> <gene_phenotype> <phenotype>Hemochromatosis</phenotype> </gene_phenotype> <gene_phenotype> <phenotype>Porphyria variegata</phenotype> </gene_phenotype> </gene>
Desired XML? <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <taxon> <name>human</name> <genus>Homo</genus> <species>sapiens</genus> </taxon> <phenotype>Hemochromatosis</phenotype> <phenotype>Porphyria variegata</phenotype> </gene>
use DBIx::DBStag; my $dbh = DBIx::DBStag->connect(“mygenedb”); my $gene_set = $dbh->selectall_stag(q[ SELECT * FROM gene NATURAL JOIN gene_phenotype NATURAL JOIN taxon USE NESTING (gene_set(gene(taxon)(gene_phenotype))) ]); foreach my $gene (@$gene_set) { printf “Gene: %s [in %s]\n”, $gene->sget_symbol, $gene->sget(‘taxon/name’); foreach my $gp ($gene->get_gene_phenotype) { printf “ %s\n”, $gp->sget_phenotype; } } Gene: HGNC [in human] Hemochromatosis Poryphria varietagata Gene: amontillado [in fruitfly] Abnormal hatching hypoactive
DBStag Templates gene_query.stg: SELECT * FROM gene NATURAL JOIN gene_phenotype NATURAL JOIN taxon NATURAL JOIN gene_function WHERE [gene.symbol => %gene_symbol%] [gene.gene_id IN (SELECT gene_id FROM gene_phenotype WHERE phenotype => %phenotype%)] [gene.gene_id IN (SELECT gene_id FROM gene_function WHERE desc => %function%)] USE NESTING (gene_set(gene(taxon) (gene_phenotype) (gene_function)))
Using templates use DBIx::DBStag; my $gene_set = DBIx::DBStag->selectall_stag( -template=>”gene_query”, -bind=>{function=>”*iron*”, phenotype=>”*hemo*”}); unix> selectall_xml.pl -d mygenedb /gene_query \ function=‘*iron*’ phenotype=‘*hemo*’ > \ genes.xml
Storing data $gene_set = Data::Stag->parse(“genes.xml”); $dbh->storenode($gene_set); unix> stag-storenode.pl -d mygenedb genes.xml additional metadata required for - many-to-many relations - foreign keys to tables with non-matching names
XORT mode <gene_set> <taxon op=“lookup” id=“hsa”> <genus>Homo</genus> <species>sapiens</genus> </taxon> <gene> <symbol op=“update”>HGNC</symbol> <taxon_id>hsa</taxon_id> <gene_phenotype> <phenotype>Hemochromatosis</phenotype> </gene_phenotype> <gene_phenotype> <phenotype>Porphyria variegata</phenotype> </gene_phenotype> </gene>
Automatic schema building from XML dataset • Steps: • pre-process XML • run stag-autoddl.pl on XML • run stag-storenode.pl • see DBIx::DBStag::Cookbook • When to use: • rapid development • ad-hoc analyses
TODO • SOAP bindings • Combinable templates • Address speed issues
Should you use it? • Data::Stag • when you don’t have an existing object model • you prefer to separate behaviour from the data model • if you like element-y XML • DBIx::DBStag • your data naturally ‘nests’ • you need access to complex queries in SQL • little or no inheritance
DBStag and web development • SQL Templates separate query logic from display logic • Rendering HTML: • XSLT • HTML Templating toolkit and Stag objects
DBStag in use • Analysis of gene structure in 12 insects • Building of a data warehouse for the Gene Ontology • Database of human genetic variation and disease • Library of templates for many common bio-schemas
Learning more • http://stag.sourceforget.net