Nexml
Rutger Vos and Wayne Maddison
University of British Columbia
Introduction (1/5) The idea
• A file format like NEXUS, but one that:
  • Fixes (some) problems with NEXUS
  • Gives access to data at a higher level
  • Is extensible
  • Exposes data to XML goodies
Introduction (2/5) NEXUS problems
• Hard or impossible to validate
• No explicit versions
• Nothing is ever deprecated
• No public extensions
• Leads to hacks such as ‘mixed’ data and ‘hot comments’
• Post-’80s phylogenetics ends up in private blocks
Introduction (3/5) Higher level data access
• Processing NEXUS data involves lexing, then parsing, then processing
• With XML you choose an off-the-shelf parser library, and the data can be processed as a structure that hides tokenization issues
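A minimal sketch of what "data as a structure" can look like, using Python's standard `xml.etree.ElementTree`. The fragment below is a hypothetical nexml-like document; the `taxa` wrapper, the `label` attribute and the `taxon_labels` helper are assumptions made for illustration, not part of the schema described in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical nexml-like fragment; element and attribute names are illustrative.
DOC = (
    '<nexml version="1.0">'
    '<taxa id="taxa1">'
    '<taxon id="t1" label="Homo sapiens"/>'
    '<taxon id="t2" label="Pan troglodytes"/>'
    '</taxa>'
    '</nexml>'
)


def taxon_labels(xml_text):
    """Return the label of every taxon element, in document order."""
    # The parser library handles tokenization; we only see the element tree.
    root = ET.fromstring(xml_text)
    return [taxon.get("label") for taxon in root.iter("taxon")]


if __name__ == "__main__":
    print(taxon_labels(DOC))
```

No lexer or hand-written grammar is involved: quoting, whitespace and comments are the parser's concern, and the caller works with elements and attributes only.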
Introduction (4/5) Extensibility
• An ‘extensible’ file format should, more robustly than NEXUS, provide the ability to:
  • Define new data types that implement described ‘interfaces’
  • Attach typed data structures to core types
  • Attach custom XML
Introduction (5/5) XML goodies
• Large stack of off-the-shelf tools:
  • XML parser libraries
  • Web services
  • Native XML databases
  • Editors/IDEs
  • Serialization tools
Design (1/4) Design principles
• Re-use of prior art
• Follow design patterns
• Referencing
• Verbose and compact representations
Design (2/4) Re-use of prior art
• Generic key/value attachments following Apple’s plist semantics:
    <dict>
      <key>prior</key>
      <float>0.78</float>
    </dict>
• Trees and networks following GraphML
• General file structure following NEXUS concepts, i.e. blocks that reference each other
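As an illustration of the plist-style attachments, here is a minimal sketch that turns a `dict` element into a Python dictionary. It assumes the plist convention of alternating `key` and value children. `float` and `string` are the value elements that appear in these slides; `integer` is borrowed from Apple's plist format and may not exist in nexml. `parse_dict` and `CONVERTERS` are hypothetical names.

```python
import xml.etree.ElementTree as ET

DICT_XML = (
    "<dict>"
    "<key>prior</key><float>0.78</float>"
    "<key>desc</key><string>example</string>"
    "</dict>"
)

# Value element name -> Python converter. 'integer' is an assumption
# taken from plist; the slides only show 'float' and 'string'.
CONVERTERS = {"float": float, "string": str, "integer": int}


def parse_dict(elem):
    """Convert a plist-style <dict> element into a Python dict."""
    children = list(elem)
    if len(children) % 2 != 0:
        raise ValueError("dict must contain key/value pairs")
    result = {}
    # Children alternate: key, value, key, value, ...
    for key_elem, value_elem in zip(children[0::2], children[1::2]):
        if key_elem.tag != "key":
            raise ValueError("expected <key>, got <%s>" % key_elem.tag)
        convert = CONVERTERS.get(value_elem.tag)
        if convert is None:
            raise ValueError("unsupported value type <%s>" % value_elem.tag)
        result[key_elem.text] = convert(value_elem.text or "")
    return result


if __name__ == "__main__":
    print(parse_dict(ET.fromstring(DICT_XML)))
```

Because the value's element name carries its type, a reader can recover typed values without consulting anything outside the attachment itself.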
Design (3/4) XML design patterns
• http://www.xmlpatterns.com
• “Declare before use”
• “Metadata first”
• “Venetian blinds”
• Abstract inheritance through extension, concrete inheritance through restriction
Design (4/4) Referencing
• Elements sometimes refer to other elements, much like in NEXUS
• In nexml, an element refers to another element through an attribute that is named after the referenced element and holds its id:
    <taxon id="t1"/> <!-- i.e. OTU, referenced later as: -->
    <node id="n1" taxon="t1"/>
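The referencing convention can be resolved with a simple id lookup. The sketch below assumes the `taxon`/`node` example from this slide; the enclosing `nexml` wrapper, the unreferenced node `n3` and the `resolve_taxa` helper are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<nexml>"
    '<taxon id="t1"/>'
    '<taxon id="t2"/>'
    '<node id="n1" taxon="t1"/>'
    '<node id="n2" taxon="t2"/>'
    '<node id="n3"/>'  # e.g. an internal node without a taxon
    "</nexml>"
)


def resolve_taxa(root):
    """Map each node id to the taxon element it references, or None."""
    # Index the referenced elements by their id attribute.
    taxa_by_id = {taxon.get("id"): taxon for taxon in root.iter("taxon")}
    resolved = {}
    for node in root.iter("node"):
        ref = node.get("taxon")  # attribute named after the referenced element
        if ref is not None and ref not in taxa_by_id:
            raise KeyError("node %s refers to unknown taxon %s"
                           % (node.get("id"), ref))
        resolved[node.get("id")] = taxa_by_id.get(ref)
    return resolved


if __name__ == "__main__":
    for node_id, taxon in resolve_taxa(ET.fromstring(DOC)).items():
        print(node_id, "->", None if taxon is None else taxon.get("id"))
```

A dangling reference raises an error here; a schema-level id/idref check, if the schema provides one, would catch the same problem at validation time.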
Nexml (1/10) Approach
• Schema design
• Community feedback through wiki, email, telecon, meetings (evoinfo, pPOD) etc.
• Parallel development of processors (Perl, Mesquite, Python)
• Experiments with XML tools (web services, databases, serialization)
Nexml (2/10) Root element
• version="1.0"
• generator="mesquite"
• Versioned namespace: xmlns:nex="http://www.nexml.org/1.0"
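A small sketch of how a reader might inspect these root attributes before processing a file. The slide does not show the root element itself, so the `nex:nexml` element name is an assumption, as is the `read_root` helper.

```python
import xml.etree.ElementTree as ET

# Assumed root element name; the attributes and namespace URI are from the slide.
DOC = (
    '<nex:nexml xmlns:nex="http://www.nexml.org/1.0" '
    'version="1.0" generator="mesquite"/>'
)


def read_root(xml_text):
    """Return namespace, local name, version and generator of the root."""
    root = ET.fromstring(xml_text)
    # ElementTree expands a prefixed name to "{namespace-uri}localname".
    if root.tag.startswith("{"):
        namespace, _, name = root.tag[1:].partition("}")
    else:
        namespace, name = "", root.tag
    return {
        "namespace": namespace,
        "name": name,
        "version": root.get("version"),
        "generator": root.get("generator"),
    }


if __name__ == "__main__":
    print(read_root(DOC))
```

Carrying the version in both the `version` attribute and the namespace URI lets a processor reject files it does not understand before touching any data.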
Nexml (3/10) Inheritance tree for elements
• “Base”: optional base/lang/href attributes
• “Annotated” extends “Base”: optional dict elements
• “Labelled” extends “Annotated”: optional label attribute
• “IDTagged” extends “Labelled”: required id attribute
• “AbstractElement” extends “IDTagged”: defined in the root schema
• “ConcreteElement” restricts “AbstractElement”: used in the instance document
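The extension chain can be mirrored, loosely, as a Python class hierarchy. This is an analogy only: the schema expresses these relationships with XML Schema extension and restriction, not with classes, and the constructor signatures below are assumptions. The restriction step (“ConcreteElement”) has no direct Python equivalent and is left out.

```python
class Base:
    """Optional base/lang/href attributes."""

    def __init__(self, base=None, lang=None, href=None):
        self.base = base
        self.lang = lang
        self.href = href


class Annotated(Base):
    """Adds optional dict attachments."""

    def __init__(self, dicts=None, **kwargs):
        super().__init__(**kwargs)
        self.dicts = list(dicts or [])


class Labelled(Annotated):
    """Adds an optional label attribute."""

    def __init__(self, label=None, **kwargs):
        super().__init__(**kwargs)
        self.label = label


class IDTagged(Labelled):
    """Adds a required id attribute."""

    def __init__(self, id, **kwargs):
        super().__init__(**kwargs)
        self.id = id


if __name__ == "__main__":
    node = IDTagged("n1", label="root", lang="en")
    print(node.id, node.label, node.lang, node.dicts)
```

Each level adds exactly one concern, so anything that carries an id can also carry a label and dict attachments without redefining them.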
Nexml (4/10) Anatomy of a “block”
• Name (e.g. "characters"), id attribute, xsi:type attribute naming the concrete subclass (e.g. "nex:DnaSeqs"), possible reference to another element:
    <characters id="c1" xsi:type="nex:DnaSeqs" taxa="t1">
    </characters>
• Metadata attachment:
    <dict><key>desc</key><string>description…</string></dict>
• Contents…
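A minimal sketch of reading the parts of a block with `xml.etree.ElementTree`. The `characters` element and its attributes come from the slide; the enclosing `nex:nexml` root with its namespace declarations and the `describe_block` helper are assumptions added so the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# Standard XML Schema instance namespace, conventionally bound to "xsi".
XSI = "http://www.w3.org/2001/XMLSchema-instance"

DOC = (
    '<nex:nexml xmlns:nex="http://www.nexml.org/1.0" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    '<characters id="c1" xsi:type="nex:DnaSeqs" taxa="t1">'
    "<dict><key>desc</key><string>description</string></dict>"
    "</characters>"
    "</nex:nexml>"
)


def describe_block(block):
    """Collect a block's name, id, concrete type and taxa reference."""
    return {
        "name": block.tag,
        "id": block.get("id"),
        # Namespaced attributes are keyed as "{namespace-uri}localname".
        "type": block.get("{%s}type" % XSI),
        "taxa": block.get("taxa"),
    }


if __name__ == "__main__":
    root = ET.fromstring(DOC)
    print(describe_block(root.find("characters")))
```

Note that ElementTree returns the `xsi:type` value as the literal string `nex:DnaSeqs`; mapping the `nex` prefix back to its namespace URI is left to the caller.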
Nexml (5/10) Character classes
[Figure: character classes; labels: Granularity, Data type]
Nexml (6/10) Tree classes
[Figure: tree classes; labels: Branch type, Topology]
Nexml (7/10) Blocks, current status
• Done:
  • OTUs
  • Characters: DNA, RNA, nucleotide, protein, categorical, continuous, restriction (compact and verbose)
  • Trees: GraphML trees and networks
Nexml (8/10) Blocks, current status
• To do:
  • Sets (in progress, using class identifiers)
  • Substitution model descriptions (KS progress)
  • More restricted vocabulary attachments (Darwin Core)
  • Distances
  • Splits
  • Cross-reference with glossary, ontology
  • Follow up on earlier feedback (small fixes)
Nexml (9/10) Experiments
• XML parsers: expat, libxml2, JDOM
• Processed schema using XMLBeans
• Included schema in SOAP WSDL
• Indexed files in dbxml
• Created large files from tolweb, rbcl
• XInclude with tinyseq XML
• REST service described using nexml
Nexml (10/10) Resources
• GSoC: https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Phylogenetic_XML
• Base URL: http://www.nexml.org
• SVN: http://nexml07gsoc.googlecode.com/svn/trunk/
• Wiki: https://www.nescent.org/wg_evoinfo/Future_Data_Exchange_Standard
• SourceForge repository
Acknowledgements
• Contributions: Jason Caravas, Mark Holder, Peter Midford, Jeet Sukumaran
• Feedback: wg-evoinfo, pPOD
• Additional funding and support: NESCent, GSoC