1 / 42

Decoding the genome with perl, XML and SQL

Decoding the genome with perl, XML and SQL. Chris Mungall Howard Hughes Medical Institute UC Berkeley Lawrence Berkeley Lab. Outline. hacking the genome Data::Stag nested tag-value data and XML DBIx::DBStag Relational to XML mapping. CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT

mandel
Download Presentation

Decoding the genome with perl, XML and SQL

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Decoding the genome with perl, XML and SQL Chris Mungall Howard Hughes Medical Institute UC Berkeley Lawrence Berkeley Lab

  2. Outline • hacking the genome • Data::Stag • nested tag-value data and XML • DBIx::DBStag • Relational to XML mapping

  3. CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT TTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGT GCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGA TGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCA TTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTG CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATT TAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATAT TGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAA ACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTG CCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCG AGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATG GTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAA TAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGA GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTA TATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCG GAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGC CAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTT TACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATAT TTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCG CGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAAC CCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTA ATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGC GTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCT CTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGA GAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAG AGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTG TCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGA ACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAA TATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTT TTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGT TGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATAT ATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCT CTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACA MVADERDDNIFKLPLIRNKFIATLNTAQYM IAYARVVPTYCANECLSFSVLYYRKPKKTI HDREREQRRYLDCLLNMIAYARVVPTYCAL YIMTASHSVLFYRKPKSTMHDRGSRTDI..

  4. Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S The generic genome browser: a building block for a model organism system database. Genome Research 2002 12:1599-1610.

  5. Use case: a simple object model for genes

  6. A simple object model for genes

  7. package Bio::Gene; use base qw(Bio::Entity); =head2 symbol Usage: $gene->symbol(“HGNC”); $symbol = $gene->symbol; =cut sub symbol { my $self = shift; $self->{_symbol} = shift if @_; return $self->{_symbol}; }

  8. package Bio::Gene; use base qw(Bio::Entity); use Class::MethodMaker [ scalar => [qw/symbol taxon/], array => [qw/phenotypes functions positions/] ]; package Bio::Taxon; use base qw(Bio::Entity); use Class::MethodMaker [ scalar => [qw/id common_name genus species parent_node/], ];

  9. Perl meta-object systems • eg Class::MethodMaker • fun! • Fine for ‘hermetic’ perl • but how do we interoperate with • other languages • databases • XML • No way of “querying” objects

  10. Data::Stag • Objects and XML • perl objects are D(A)G-ish • XML is tree like • Why not merge the two? • YAML • alternate exchange format with perl-ish semantics • Stag • nested tag-value pairs • perl objects with XMLish semantics • XML-centric

  11. <dataset> <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <phenotype>Hemochromatosis</phenotype> <phenotype>Porphyria variegata</phenotype> <mol_function>iron homeostasis</mol_function> <position> <start>10000</start> <end>5000</end> <chromosome>6</chromosome> </position> </gene> …

  12. my $dataset = Data::Stag->parse(“genes.xml”); my $gene_set = $dataset->get_gene_set; foreach my $gene ($gene_set->get_gene) { next unless $gene->sget_tax_id == 9606; printf “%s\n”, $gene->sget_symbol; printf “ %s\n”, foreach $gene->get_phenotype; } <dataset> <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <phenotype>Hemochromatosis</phenotype> <phenotype>Porphyria variegata</phenotype> <mol_function>iron homeostasis</mol_function> <position> <start>10000</start> <end>5000</end> <chromosome>6</chromosome> </position> </gene>

  13. Stag trees • Recursive data structure • node-labeled tree • Implemented as nested lists • $node = [$name,$value] • $value = DATA or list of $nodes • Equivalent to element-only XML

  14. Stag accessors • Tag-value pairs • $fooval = $obj->get_FOO; # wantarray=false • $fooval = $obj->sget_FOO; • @foovals = $obj->get_FOO; # wantarray=true • $obj->set_FOO($fooval) • No distinction between single and multi valued tags • ordered (no hashes)

  15. my $dataset = Data::Stag->parse(“genes.xml”); my @symbols = $dataset->find_symbol; <dataset> <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <phenotype>Hemochromatosis</phenotype> <phenotype>Porphyria variegata</phenotype> <mol_function>iron homeostasis</mol_function> <position> <start>10000</start> <end>5000</end> <chromosome>6</chromosome> </position> </gene>

  16. my $dataset = Data::Stag->parse(“genes.xml”); my @symbols = $dataset->get(‘gene_set/gene/symbol’); <dataset> <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <phenotype>Hemochromatosis</phenotype> <phenotype>Porphyria variegata</phenotype> <mol_function>iron homeostasis</mol_function> <position> <start>10000</start> <end>5000</end> <chromosome>6</chromosome> </position> </gene>

  17. my $dataset = Data::Stag->parse(“genes.xml”); my @genes = $dataset->where(‘gene’, sub { shift->sget(‘position/start’) < 5000}); <dataset> <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <phenotype>Hemochromatosis</phenotype> <phenotype>Porphyria variegata</phenotype> <mol_function>iron homeostasis</mol_function> <position> <start>10000</start> <end>5000</end> <chromosome>6</chromosome> </position> </gene>

  18. Attributes and mixed elements <gene id=“HGNC”> <desc> implicated in <b>Hemochromatosis</b> </desc> </gene> <gene> <@> <id>HGNC</id> </@> <desc> <.>Implicated in</.> <b>Hemochromatosis</b> </desc> </gene> my $id = $gene->get(‘@/id’);

  19. Stag alternate syntaxes dataset: gene_set: gene: symbol: HGNC tax_id: 9606 phenotype: Hemochromatosis phenotype: Porphyria variegata mol_function: iron homeostasis position: start: 10000 end: 5000 chromosome: 6 (dataset (gene_set (gene (symbol "HGNC") (tax_id "9606") (phenotype "Hemochromatosis") (phenotype "Porphyria variegata") (mol_function "iron homeostasis") (position (start "10000") (end "5000") (chromosome "6")))))

  20. Stag event model • Like SAX, but more perlish • Data::Stag::BaseGenerator • Data::Stag::BaseHandler • Good for implementing parsers • compatible with SAX • slow!

  21. Data::Stag features • Stag can be used with or without XML • TODO: “strict” mode • Speed issues • Stag queries vs XPATH?

  22. SQL Databases • Relational databases are indispensable in bioinformatics • scalable with large datasets • expressive modeling • queries • Impedance mismatch problem • relational model and imperative programming languages are not well-matched

  23. Object Relational Mapping • Tangram, Alzabo, Class::DBI, … • Object-centric • DBStag is SQL-centric

  24. Why perl programmers don’t like SQL SELECT * FROM gene NATURAL JOIN gene_phenotype AS gp NATURAL JOIN taxon WHERE …

  25. SELECT * FROM gene NATURAL JOIN gene_phenotype NATURAL JOIN taxon USE NESTING (gene_set(gene(taxon)(gene_phenotype))) <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <taxon> <name>human</name> <genus>Homo</genus> <species>sapiens</genus> </taxon> <gene_phenotype> <phenotype>Hemochromatosis</phenotype> </gene_phenotype> <gene_phenotype> <phenotype>Porphyria variegata</phenotype> </gene_phenotype> </gene>

  26. relational-to-XML mappings • Tables mapped to non-terminal nodes • Columns mapped to terminal node (PCDATA) • Table nodes are nested according to: • order in SQL query • USE NESTING clause • Tables aliases add extra level of nesting

  27. Can produce ‘unusual’ XML <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <taxon> <name>human</name> <genus>Homo</genus> <species>sapiens</genus> </taxon> <gene_phenotype> <phenotype>Hemochromatosis</phenotype> </gene_phenotype> <gene_phenotype> <phenotype>Porphyria variegata</phenotype> </gene_phenotype> </gene>

  28. Desired XML? <gene_set> <gene> <symbol>HGNC</symbol> <tax_id>9606</tax_id> <taxon> <name>human</name> <genus>Homo</genus> <species>sapiens</genus> </taxon> <phenotype>Hemochromatosis</phenotype> <phenotype>Porphyria variegata</phenotype> </gene>

  29. use DBIx::DBStag; my $dbh = DBIx::DBStag->connect(“mygenedb”); my $gene_set = $dbh->selectall_stag(q[ SELECT * FROM gene NATURAL JOIN gene_phenotype NATURAL JOIN taxon USE NESTING (gene_set(gene(taxon)(gene_phenotype))) ]); foreach my $gene (@$gene_set) { printf “Gene: %s [in %s]\n”, $gene->sget_symbol, $gene->sget(‘taxon/name’); foreach my $gp ($gene->get_gene_phenotype) { printf “ %s\n”, $gp->sget_phenotype; } } Gene: HGNC [in human] Hemochromatosis Poryphria varietagata Gene: amontillado [in fruitfly] Abnormal hatching hypoactive

  30. DBStag Templates gene_query.stg: SELECT * FROM gene NATURAL JOIN gene_phenotype NATURAL JOIN taxon NATURAL JOIN gene_function WHERE [gene.symbol => %gene_symbol%] [gene.gene_id IN (SELECT gene_id FROM gene_phenotype WHERE phenotype => %phenotype%)] [gene.gene_id IN (SELECT gene_id FROM gene_function WHERE desc => %function%)] USE NESTING (gene_set(gene(taxon) (gene_phenotype) (gene_function)))

  31. Using templates use DBIx::DBStag; my $gene_set = DBIx::DBStag->selectall_stag( -template=>”gene_query”, -bind=>{function=>”*iron*”, phenotype=>”*hemo*”}); unix> selectall_xml.pl -d mygenedb /gene_query \ function=‘*iron*’ phenotype=‘*hemo*’ > \ genes.xml

  32. Storing data $gene_set = Data::Stag->parse(“genes.xml”); $dbh->storenode($gene_set); unix> stag-storenode.pl -d mygenedb genes.xml additional metadata required for - many-to-many relations - foreign keys to tables with non-matching names

  33. XORT mode <gene_set> <taxon op=“lookup” id=“hsa”> <genus>Homo</genus> <species>sapiens</genus> </taxon> <gene> <symbol op=“update”>HGNC</symbol> <taxon_id>hsa</taxon_id> <gene_phenotype> <phenotype>Hemochromatosis</phenotype> </gene_phenotype> <gene_phenotype> <phenotype>Porphyria variegata</phenotype> </gene_phenotype> </gene>

  34. Automatic schema building from XML dataset • Steps: • pre-process XML • run stag-autoddl.pl on XML • run stag-storenode.pl • see DBIx::DBStag::Cookbook • When to use: • rapid development • ad-hoc analyses

  35. TODO • SOAP bindings • Combinable templates • Address speed issues

  36. Should you use it? • Data::Stag • when you don’t have an existing object model • you prefer to separate behaviour from the data model • if you like element-y XML • DBIx::DBStag • your data naturally ‘nests’ • you need access to complex queries in SQL • little or no inheritance

  37. DBStag and web development • SQL Templates separate query logic from display logic • Rendering HTML: • XSLT • HTML Templating toolkit and Stag objects

  38. DBStag in use • Analysis of gene structure in 12 insects • Building of a data warehouse for the Gene Ontology • Database of human genetic variation and disease • Library of templates for many common bio-schemas

  39. Learning more • http://stag.sourceforget.net

More Related