
Nexml

Nexml. Rutger Vos and Wayne Maddison, University of British Columbia. The idea: a file format like nexus that fixes (some) problems with nexus, gives access to data at a higher level, is extensible, and exposes data to xml goodies.



  1. Nexml • Rutger Vos and Wayne Maddison, University of British Columbia

  2. Introduction (1/5): The idea • A file format like nexus, but: • Fixes (some) problems with nexus • Gives access to data at a higher level • Extensible • Exposes data to xml goodies

  3. Introduction (2/5): Nexus problems • Hard or impossible to validate • No explicit versions • Nothing ever deprecated • No public extensions • Leads to hacks such as ‘mixed’ data and ‘hot comments’ • Post-’80s phylogenetics lives in private blocks

  4. Introduction (3/5): Higher level data access • Processing nexus data involves lexing + parsing + processing • With XML you can choose a parser library and process the data as a structure that hides tokenization issues
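
A minimal sketch of what "data as a structure" can look like, using the XML parser from the Python standard library. The document fragment and the `taxa`/`taxon`/`label` names are illustrative assumptions, not taken from the nexml schema; the point is only that the caller never touches tokens.

```python
import xml.etree.ElementTree as ET

# Hypothetical nexml-like fragment; element and attribute names are
# assumptions made for this example, not the actual schema.
DOC = """<nexml version="1.0">
  <taxa id="taxa1">
    <taxon id="t1" label="Homo sapiens"/>
    <taxon id="t2" label="Pan troglodytes"/>
  </taxa>
</nexml>"""


def taxon_labels(xml_text):
    """Return the label of every <taxon> element, in document order."""
    # The parser library handles lexing and parsing; we only walk the tree.
    root = ET.fromstring(xml_text)
    return [taxon.get("label") for taxon in root.iter("taxon")]


labels = taxon_labels(DOC)
```

Quoting, whitespace and comment handling are left entirely to the parser, which is the contrast with hand-written nexus tokenizers that the slide draws.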

  5. Introduction (4/5): Extensibility • An ‘extensible’ file format should, more robustly than NEXUS, provide the ability to: • define new data types that implement described ‘interfaces’ • attach typed data structures to core types • attach custom XML

  6. Introduction (5/5): XML goodies • Large stack of off-the-shelf tools: • XML parser libraries • Web services • Native XML databases • Editors/IDEs • Serialization tools

  7. Design (1/4): Design principles • Re-use of prior art • Follow design patterns • Referencing • Verbose and compact representations

  8. Design (2/4): Re-use of prior art • Generic key/value attachments following Apple’s plist semantics: <dict> <key>prior</key> <float>0.78</float> </dict> • Trees and networks following graphml • General file structure following nexus concepts, i.e. blocks that reference each other
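
The key/value attachment above could be read into a native mapping roughly as follows. This is a sketch under assumptions: it supports only `float`, `integer` and `string` value elements, and the `CONVERTERS` table and `parse_dict` helper are hypothetical names introduced here.

```python
import xml.etree.ElementTree as ET

# The fragment shown on the slide.
FRAGMENT = "<dict><key>prior</key><float>0.78</float></dict>"

# Assumed subset of plist-style value elements and their Python converters.
CONVERTERS = {"float": float, "integer": int, "string": str}


def parse_dict(elem):
    """Convert a plist-style <dict> element into a Python dict."""
    children = list(elem)
    if len(children) % 2 != 0:
        raise ValueError("expected alternating <key>/value children")
    result = {}
    # Children alternate: <key>, value, <key>, value, ...
    for key_elem, value_elem in zip(children[0::2], children[1::2]):
        if key_elem.tag != "key":
            raise ValueError(f"expected <key>, found <{key_elem.tag}>")
        convert = CONVERTERS[value_elem.tag]
        result[key_elem.text] = convert(value_elem.text or "")
    return result


annotations = parse_dict(ET.fromstring(FRAGMENT))
```

Because the value element carries its own type, a reader needs no out-of-band knowledge to turn `0.78` into a number.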

  9. Design (3/4): XML design patterns • http://www.xmlpatterns.com • “Declare before use” • “Metadata first” • “Venetian blinds” • Abstract inheritance through extension, concrete inheritance through restriction

  10. Design (4/4): Referencing • Elements sometimes refer to other elements, much like in nexus • In nexml, an element refers to another element’s id through an attribute named after the referenced element: <taxon id="t1"/> <!-- i.e. OTU, referenced later as: --> <node id="n1" taxon="t1"/>
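
A small sketch of how a processor might follow such a reference. The wrapping `<nexml>` root and the `resolve` helper are assumptions for illustration; only the `<taxon>` and `<node>` elements come from the slide.

```python
import xml.etree.ElementTree as ET

# The two elements from the slide, wrapped in an assumed root element.
DOC = """<nexml>
  <taxon id="t1"/>
  <node id="n1" taxon="t1"/>
</nexml>"""


def resolve(root, elem, ref_name):
    """Follow elem's `ref_name` attribute to the element it points at.

    The attribute name doubles as the tag name of the target, so a
    taxon="t1" attribute is looked up among <taxon> elements.
    """
    target_id = elem.get(ref_name)
    if target_id is None:
        return None
    for candidate in root.iter(ref_name):
        if candidate.get("id") == target_id:
            return candidate
    return None


root = ET.fromstring(DOC)
node = root.find("node")
taxon = resolve(root, node, "taxon")
```

Naming the attribute after its target means the lookup needs no separate table that maps attributes to element types.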

  11. Nexml (1/10): Approach • Schema design • Community feedback through wiki, email, telecon, meetings (evoinfo, ppod) etc. • Parallel development of processors (perl + mesquite + python) • Experiments with xml tools (ws, db, serialization)

  12. Nexml (2/10): Root element • version="1.0" • generator="mesquite" • Versioned namespace: xmlns:nex="http://www.nexml.org/1.0"
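
One way a reader could check the version and namespace before processing a file is sketched below. The root element name `nex:nexml` is an assumption (the slide lists only the attributes), and `root_info` is a hypothetical helper.

```python
import io
import xml.etree.ElementTree as ET

# Assumed root element carrying the attributes listed on the slide.
DOC = (
    b'<nex:nexml version="1.0" generator="mesquite" '
    b'xmlns:nex="http://www.nexml.org/1.0"/>'
)


def root_info(xml_bytes):
    """Collect the root's version, generator and namespace declarations."""
    namespaces = {}
    root = None
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns", "start"))
    for event, payload in events:
        if event == "start-ns":
            # For "start-ns" the payload is a (prefix, uri) tuple.
            prefix, uri = payload
            namespaces[prefix] = uri
        elif root is None:
            # The first "start" event is the root element.
            root = payload
    return {
        "version": root.get("version"),
        "generator": root.get("generator"),
        "namespaces": namespaces,
    }


info = root_info(DOC)
```

Since the namespace URI itself carries the version, a processor can refuse documents from a namespace it does not know, which addresses the "no explicit versions" problem noted for nexus.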

  13. Nexml (3/10): Inheritance tree for elements • “Base”, optional base/lang/href attributes • extended by “Annotated”, optional dict elements • extended by “Labelled”, optional label attribute • extended by “IDTagged”, required id attribute • extended by “AbstractElement”, in root schema • restricted by “ConcreteElement”, in instance document

  14. Nexml (4/10): Anatomy of a “block” • Name (e.g. "characters"), id attribute, xsi:type attribute naming the concrete subclass (e.g. "nex:DnaSeqs"), possible reference to another element: <characters id="c1" xsi:type="nex:DnaSeqs" taxa="t1"> </characters> • Metadata attachment: <dict><key>desc</key><string>description…</string></dict> • Contents…
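
The parts of a block named above can be pulled apart as in this sketch. The namespace declarations on the element and the `describe_block` helper are assumptions added so the fragment parses on its own; the xsi namespace URI is the standard XML Schema instance namespace.

```python
import xml.etree.ElementTree as ET

# Standard XML Schema instance namespace, needed to read xsi:type.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# The block from the slide, with assumed namespace declarations added.
DOC = """<characters xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xmlns:nex="http://www.nexml.org/1.0"
            id="c1" xsi:type="nex:DnaSeqs" taxa="t1">
  <dict><key>desc</key><string>example description</string></dict>
</characters>"""


def describe_block(elem):
    """Summarize a block: name, id, concrete subclass and taxa reference."""
    # ElementTree exposes namespaced attributes as "{uri}localname".
    qname = elem.get(f"{{{XSI}}}type", "")
    # The value is a prefixed name such as "nex:DnaSeqs".
    prefix, _, local = qname.rpartition(":")
    return {
        "name": elem.tag,
        "id": elem.get("id"),
        "type_prefix": prefix,
        "type": local,
        "taxa": elem.get("taxa"),
    }


block = describe_block(ET.fromstring(DOC))
```

A processor could dispatch on the `type` value to pick a handler for the concrete subclass while treating `id`, `taxa` and the `dict` attachment uniformly across all blocks.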

  15. Nexml (5/10): Character classes • [Figure: character classes arranged by granularity and data type]

  16. Nexml (6/10): Tree classes • [Figure: tree classes arranged by branch type and topology]

  17. Nexml (7/10): Blocks, current status • Done: • OTUs • characters: dna, rna, nucleotide, protein, categorical, continuous, restriction (compact and verbose) • trees: graphml trees and networks

  18. Nexml (8/10): Blocks, current status • To do: • sets (in progress, using class identifiers) • substitution model descriptions (KS progress) • more restricted vocabulary attachments (Darwin core) • distances • splits • cross-reference with glossary, ontology • follow up on earlier feedback (small fixes)

  19. Nexml (9/10): Experiments • XML parsers: expat, libxml2, jdom • Processed schema using xmlbeans • Included schema in soap wsdl • Indexed files in dbxml • Created large files from tolweb, rbcl • XInclude with tinyseq xml • REST service described using nexml

  20. Nexml (10/10): Resources • GSoC: https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Phylogenetic_XML • Base URL: http://www.nexml.org • SVN: http://nexml07gsoc.googlecode.com/svn/trunk/ • Wiki: https://www.nescent.org/wg_evoinfo/Future_Data_Exchange_Standard • SourceForge repository

  21. Acknowledgements • Contributions: Jason Caravas, Mark Holder, Peter Midford, Jeet Sukumaran • Feedback: wg-evoinfo, pPOD • Additional funding, support: NESCent, GSoC
