Perl/XML::DOM - reading and writing XML from Perl

Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk http://www.bioinf.org.uk/

Aims and objectives • Refresh the structure of an XML (or XHTML) document • Know problems in reading and writing XML • Understand the requirements of XML parsers and the two main types • Know how to write code using the DOM parser • PRACTICAL: write a script to read XML

An XML refresher!
• Tags: paired opening and closing tags
• May contain data and/or other (nested) tags
• Attributes: optional – contained within the opening tag
• Un-paired tags use special syntax
    <mutants>
      <mutant_group native='1abc01'>
        <structure>
          <method>x-ray</method>
          <resolution>1.8</resolution>
          <rfactor>0.20</rfactor>
        </structure>
        <mutant domid='2bcd01'>
          <structure>
            <method>x-ray</method>
            <resolution>1.8</resolution>
            <rfactor>0.20</rfactor>
          </structure>
          <mutation res='L24' native='ALA' subs='ARG' />
        </mutant>
      </mutant_group>
    </mutants>

Writing XML Writing XML is straightforward • Generate XML from a Perl script using print() statements. • However, you must take care of: • tags correctly nested • quote marks correctly paired • international character sets
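
The "However" points above can be sketched in code. This is a minimal illustration, not part of the original slides: xml_escape() is a hypothetical helper and the <note> element and its text are made-up sample data. It escapes the characters that would otherwise break nesting or quoting when XML is written with print().

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the slides): escape the
# five characters that have special meaning in XML text and attributes.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # must come first, or later escapes get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/'/&apos;/g;
    return $text;
}

# Illustrative data only
my %mutation = (res => 'L24', native => 'ALA', subs => 'ARG');
my $note     = 'resolution < 2.0 & rfactor <= 0.20';

# Opening and closing tags are printed in matching pairs by hand
print "<mutant>\n";
printf "  <mutation res='%s' native='%s' subs='%s' />\n",
    map { xml_escape( $mutation{$_} ) } qw(res native subs);
print "  <note>", xml_escape($note), "</note>\n";
print "</mutant>\n";
```

Nothing here checks that tags are correctly nested; that remains the programmer's job when print() is used directly.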

Reading XML As simple or complex as you wish! • Full control over XML: • simple pattern may suffice • Otherwise, may be dangerous • may rely on un-guaranteed formatting
    <mutants><mutant_group native='1abc01'><structure><method>
    x-ray</method><resolution>1.8</resolution><rfactor>0.20
    </rfactor></structure><mutant domid='2bcd01'><structure>
    <method>x-ray</method><resolution>1.8</resolution>
    <rfactor>0.20</rfactor></structure><mutation res='L24'
    native='ALA' subs='ARG'/></mutant></mutant_group></mutants>
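
To make the danger concrete, here is a small sketch of a pattern that relies on un-guaranteed formatting. The function naive_resolution() and the two sample strings are assumptions made up for illustration. Both strings are equivalent XML, but the line-by-line pattern only copes with one of them.

```perl
use strict;
use warnings;

# Two renderings of the same XML; only the placement of line breaks differs
my $pretty  = "<structure>\n  <resolution>1.8</resolution>\n</structure>";
my $compact = "<structure><resolution>\n1.8</resolution></structure>";

# Naive approach: assumes the tags and their data always share one line
sub naive_resolution {
    my ($xml) = @_;
    foreach my $line (split /\n/, $xml) {
        return $1 if $line =~ m{<resolution>([\d.]+)</resolution>};
    }
    return;    # nothing matched
}

foreach my $xml ($pretty, $compact) {
    my $res = naive_resolution($xml);
    print defined $res ? "found $res\n" : "pattern failed\n";
}
```

The first string yields the value; the second does not, although a parser would treat both as the same document.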

XML Parsers • Clear rules for data boundaries and hierarchy • Predictable; unambiguous • Parser translates XML into • stream of events • complex data object

XML Parsers Good parser will handle: • Different sources of data • files • character strings • remote references • different character encodings • standard Latin • Japanese • checking for well-formedness errors

XML Parsers • Read stream of characters • differentiate markup and data • Optionally replace entity references • (e.g. &lt; with <) • Assemble complete document • disparate (perhaps remote) sources • Report syntax and validation errors • Pass data to client program

XML Parsers • If XML has no syntax errors it is 'well formed' • With a DTD, a validating parser will check that the document matches it: 'valid'

XML Parsers • Writing a good parser is a lot of work! • A lot of testing needed • Fortunately, many parsers available

Getting data to your program • Parser can generate 'events' • Tags are converted into events • Events triggered in your program as the document is read • Parser acts as a pipeline converting XML into processed chunks of data sent to your program: • an 'event stream'
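
A minimal sketch of an event stream, using XML::Parser's Start, End and Char handlers. The sample document and the @events array are assumptions for illustration; the handlers simply record each event in the order the parser reports it.

```perl
use strict;
use warnings;
use XML::Parser;

# Illustrative sample document
my $xml = "<data><species name='Felix domesticus'>"
        . "<common-name>cat</common-name></species></data>";

my @events;    # record of events, in the order the parser triggers them

my $parser = XML::Parser->new(
    Handlers => {
        # Each handler receives the underlying expat object first
        Start => sub { my ($expat, $tag, %attr) = @_; push @events, "start $tag"; },
        End   => sub { my ($expat, $tag)        = @_; push @events, "end $tag"; },
        Char  => sub {
            my ($expat, $text) = @_;
            push @events, "text '$text'" if $text =~ /\S/;   # skip pure whitespace
        },
    },
);
$parser->parse($xml);

print "$_\n" foreach @events;
```

Note that the handlers see each piece of the document only once; anything the program needs later has to be stored by the program itself.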

Getting data to your program OR… • XML converted into a tree structure • Reflects organization of the XML • Whole document read into memory before your program gets access

Pros and cons • In the parser, everything is likely to be event-driven • tree-based parsers create a data structure from the event stream • Data structure • More convenient • Can access data in any order • Code usually simpler • May be impossible to handle very large files • Need more processor time • Need more memory • Event stream • Faster to access limited data • Use less memory • Parser loses data at the next event • More complex code

SAX and DOM de facto standard APIs for XML parsing • SAX (Simple API for XML) • event-stream API • originally for Java, but now for several programming languages (including Perl) • development promoted by Peter Murray Rust, 1997-8 • DOM (Document Object Model) • W3C standard tree-based parser • platform- and language-neutral • allows update of document content and structure as well as reading

Perl XML parsers Many parsers available • Differ in three major ways: • parsing style (event driven or data structure) • 'standards-completeness' • speed (implementation in C or pure Perl)

Perl XML parsers XML::Simple • Very easy to use • Designed for simple applications • Can’t handle 'mixed content' • tags containing both data and other tags <p>This is <b>mixed</b> content</p>
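
A short sketch of XML::Simple on a small sample document (the sample data is an assumption for illustration). XMLin() turns the XML into nested Perl hashes and arrays; the ForceArray and KeyAttr options are set explicitly so the shape of the result is predictable.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Illustrative sample document
my $xml = <<'END_XML';
<data>
  <species name='Felix domesticus'>
    <common-name>cat</common-name>
    <conservation status='not endangered' />
  </species>
</data>
END_XML

# ForceArray keeps <species> as a list even when there is only one;
# KeyAttr => [] stops the list being folded into a hash keyed on 'name'.
my $data = XMLin($xml, ForceArray => ['species'], KeyAttr => []);

foreach my $species (@{ $data->{species} }) {
    print "$species->{'common-name'} ($species->{name}) ",
          "$species->{conservation}{status}\n";
}
```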

Perl XML parsers XML::Parser • Oldest Perl XML parser • Reasonably fast and flexible • Not very standards-compliant • Is a wrapper to 'expat' • probably the first C XML parser, written by James Clark

XML::Parser Simple example - check well-formedness
    use XML::Parser;
    my $xmlfile = shift @ARGV;              # the file to parse
    # initialize parser object and parse the file
    # ErrorContext => 2: remember 2 lines before an error
    my $parser = XML::Parser->new( ErrorContext => 2 );
    # eval used so parser doesn't cause program to exit
    eval { $parser->parsefile( $xmlfile ); };
    # report error or success; $@ stores the error
    if( $@ )
    {
        $@ =~ s/at \/.*?$//s;               # remove module line number
        print STDERR "\nERROR in '$xmlfile':\n$@\n";
    }
    else
    {
        print STDERR "'$xmlfile' is well-formed\n";
    }

Perl XML parsers XML::DOM • Implements W3C DOM Level 1 • Built on top of XML::Parser • Very good: fast, stable and complete • Limited extended functionality XML::SAX • Implements SAX2 • wrapper to Expat • Fast, stable and complete

Perl XML parsers XML::LibXML • Wrapper around GNOME libxml2 • Very fast, complete and stable • Validating/non-validating • DOM and SAX support
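
For comparison, a brief sketch of the same kind of extraction with XML::LibXML, using its XPath support. The sample document and the @report array are assumptions for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Illustrative sample document
my $xml = "<data><species name='Felix domesticus'>"
        . "<common-name>cat</common-name>"
        . "<conservation status='not endangered' /></species></data>";

my $doc = XML::LibXML->load_xml(string => $xml);

my @report;
foreach my $species ($doc->findnodes('//species')) {
    my $name   = $species->getAttribute('name');
    my $cname  = $species->findvalue('common-name');            # text of child element
    my $status = $species->findvalue('conservation/@status');   # attribute via XPath
    push @report, "$cname ($name) $status";
}
print "$_\n" foreach @report;
```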

Perl XML parsers XML::Twig • DOM-like parser, BUT • Allows you to define elements which can be parsed as discrete units • 'twigs' (small branches of a tree)

Perl XML parsers Several others • More specialized, adding… • XPath (to select data from XML document) • re-formatting (XSLT or other methods) • ...

XML::DOM • DOM is a standard API • once learned moving to a different language is straightforward • moving between implementations also easy • Suppose we want to extract some data from an XML file...

XML::DOM
    <data>
      <species name='Felix domesticus'>
        <common-name>cat</common-name>
        <conservation status='not endangered' />
      </species>
      <species name='Drosophila melanogaster'>
        <common-name>fruit fly</common-name>
        <conservation status='not endangered' />
      </species>
    </data>
We want:
    cat (Felix domesticus) not endangered
    fruit fly (Drosophila melanogaster) not endangered

    #!/usr/bin/perl
    use XML::DOM;                              # Import XML::DOM module
    $file = shift @ARGV;                       # obtain filename
    $parser = XML::DOM::Parser->new();         # initialize a parser
    $doc = $parser->parsefile($file);          # parse the file
    # getElementsByTagName returns each element matching 'species';
    # look at each species element in turn.
    # $species contains content and descendant elements, so we can now
    # treat $species in the same way as $doc.
    # Nested loops correspond to nested tags.
    foreach $species ($doc->getElementsByTagName('species'))
    {
        $common_name = $species->getElementsByTagName('common-name');
        $cname = $common_name->item(0)->getFirstChild->getNodeValue;
        $name = $species->getAttribute('name');
        $conservation = $species->getElementsByTagName('conservation');
        $status = $conservation->item(0)->getAttribute('status');
        print "$cname ($name) $status\n";
    }
    $doc->dispose();

Returns each element matching ‘common-name’. $common_name contains content and any descendant elements:
    $common_name = $species->getElementsByTagName('common-name');
Here there is only one <common-name> element per <species> element, but the parser can't know that. Have to specify that we want the first (and only) <common-name> element using ->item(0).
The <common-name> element contains only one child, which is data. Obtain the first (and only) child using ->getFirstChild, then extract the text from it using ->getNodeValue:
    $cname = $common_name->item(0)->getFirstChild->getNodeValue;

->getElementsByTagName returns a node-list object in scalar context and an array in list context.
Here we obtain an object and use the ->item() method to obtain the first item:
    $common_name = $species->getElementsByTagName('common-name');
    $cname = $common_name->item(0)->getFirstChild->getNodeValue;
Alternative is to access an individual array element:
    @common_names = $species->getElementsByTagName('common-name');
    $cname = $common_names[0]->getFirstChild->getNodeValue;
Here we work through the array:
    foreach $species ($doc->getElementsByTagName('species')) { }
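
The two ways of calling ->getElementsByTagName can be compared side by side. This is a sketch on a made-up two-species document; the variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use XML::DOM;

my $parser = XML::DOM::Parser->new();
# Illustrative sample document, parsed from a string rather than a file
my $doc = $parser->parse("<data><species name='a'/><species name='b'/></data>");

my $nodelist = $doc->getElementsByTagName('species');   # scalar context: node-list object
my @nodes    = $doc->getElementsByTagName('species');   # list context: Perl array

my $list_length  = $nodelist->getLength;                 # number of nodes in the object
my $array_length = scalar(@nodes);
my $first_by_item  = $nodelist->item(0)->getAttribute('name');
my $first_by_index = $nodes[0]->getAttribute('name');

$doc->dispose();

print "NodeList length: $list_length, array length: $array_length\n";
print "First via item(0): $first_by_item, via [0]: $first_by_index\n";
```

Which form to use is largely a matter of taste: the array form reads naturally in a foreach loop, while ->item() mirrors the DOM API in other languages.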

    <species name='Felix domesticus'>
      <common-name>cat</common-name>
      <conservation status='not endangered' />
    </species>
There could have been more than one <common-name> element within this <species> element:
    $common_name = $species->getElementsByTagName('common-name');
Obtain the actual text rather than an element object:
    $cname = $common_name->item(0)->getFirstChild->getNodeValue;

Attributes
Attributes much simpler! Can only contain text (no nested elements). Simply specify the attribute:
    <species name='Felix domesticus'>
    $name = $species->getAttribute('name');

    <species name='Felix domesticus'>
      <common-name>cat</common-name>
      <conservation status='not endangered' />
    </species>
Extract the <conservation> elements from this <species> element:
    $conservation = $species->getElementsByTagName('conservation');
DOM parser can't know there is only one <conservation> element per species. Extract the first (and only) one. It contains only a ‘status’ attribute; extract its value with ->getAttribute(). This is an empty element: there are no child elements, so we don't need ->getFirstChild:
    $status = $conservation->item(0)->getAttribute('status');

Print the extracted information:
    print "$cname ($name) $status\n";
Loop back to next <species> element:
    }
Clean up and free memory:
    $doc->dispose();

XML::DOM Note • Not necessary to use variable names that match the tags, but it is a very good idea! • There are many many more functions, but this set covers most needs
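
As one example of going slightly beyond this set, the sketch below guards against a missing element, which would otherwise cause a "can't call method on undefined value" error. child_text() is a hypothetical helper, and the sample document (with one species lacking a <common-name>) is made up for illustration.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical helper (an assumption, not from the slides): return the text
# of the first descendant element called $tag, or $default if it is absent.
sub child_text {
    my ($element, $tag, $default) = @_;
    my @matches = $element->getElementsByTagName($tag);   # list context: array
    return $default unless @matches;
    my $child = $matches[0]->getFirstChild;               # undef for an empty element
    return defined $child ? $child->getNodeValue : $default;
}

# Illustrative sample: the second species has no <common-name>
my $xml = "<data>"
        . "<species name='Felix domesticus'><common-name>cat</common-name></species>"
        . "<species name='Unnamed species'/>"
        . "</data>";

my $parser = XML::DOM::Parser->new();
my $doc = eval { $parser->parse($xml) };   # eval so a parse error is caught
die "Could not parse XML: $@" unless $doc;

my @lines;
foreach my $species ($doc->getElementsByTagName('species')) {
    push @lines, child_text($species, 'common-name', 'unknown')
               . ' (' . $species->getAttribute('name') . ')';
}
$doc->dispose();

print "$_\n" foreach @lines;
```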

Writing XML with XML::DOM

    #!/usr/bin/perl
    use XML::DOM;                                  # Import XML::DOM
    # Initialize data
    $nspecies = 2;
    @names = ('Felix domesticus', 'Drosophila melanogaster');
    @commonNames = ('cat', 'fruit fly');
    @consStatus = ('not endangered', 'not endangered');
    $doc = XML::DOM::Document->new;                # Create an XML document object
    # Utility method to print XML header: <?xml version="1.0" ?>
    $xml_pi = $doc->createXMLDecl ('1.0');
    print $xml_pi->toString;
    $root = $doc->createElement('data');
    for($i=0; $i<$nspecies; $i++)
    {
        $species = $doc->createElement('species');
        $species->setAttribute('name', $names[$i]);
        $root->appendChild($species);
        $cname = $doc->createElement('common-name');
        $text = $doc->createTextNode($commonNames[$i]);
        $cname->appendChild($text);
        $species->appendChild($cname);
        $cons = $doc->createElement('conservation');
        $cons->setAttribute('status', $consStatus[$i]);
        $species->appendChild($cons);
    }
    print $root->toString;

Create a ‘root’ element for our data:
    $root = $doc->createElement('data');
Loop through each of the species:
    for($i=0; $i<$nspecies; $i++)
    {
Create a <species> element:
        $species = $doc->createElement('species');
Set its ‘name’ attribute:
        $species->setAttribute('name', $names[$i]);
Join the <species> element to the parent <data> element:
        $root->appendChild($species);

    <?xml version="1.0" ?>
    <data>
      <species name='Felix domesticus' />
    </data>

Create a <common-name> element:
        $cname = $doc->createElement('common-name');
Create a text node containing the specified text:
        $text = $doc->createTextNode($commonNames[$i]);
Join this text node as a child of the <common-name> element:
        $cname->appendChild($text);
Join the <common-name> element as a child of <species>:
        $species->appendChild($cname);

    <?xml version="1.0" ?>
    <data>
      <species name='Felix domesticus'>
        <common-name>cat</common-name>
      </species>
    </data>

Create a <conservation> element:
        $cons = $doc->createElement('conservation');
Set the ‘status’ attribute:
        $cons->setAttribute('status', $consStatus[$i]);
Join the <conservation> element as a child of <species>:
        $species->appendChild($cons);

    <?xml version="1.0" ?>
    <data>
      <species name='Felix domesticus'>
        <common-name>cat</common-name>
        <conservation status='not endangered' />
      </species>
    </data>

Loop back to handle the next species:
    }
Finally print the resulting data structure:
    print $root->toString;

    <?xml version="1.0" ?>
    <data>
      <species name='Felix domesticus'>
        <common-name>cat</common-name>
        <conservation status='not endangered' />
      </species>
      <species name='Drosophila melanogaster'>
        <common-name>fruit fly</common-name>
        <conservation status='not endangered' />
      </species>
    </data>

Summary - reading XML
Create a parser
    $parser = XML::DOM::Parser->new();
Parse a file
    $doc = $parser->parsefile('filename');
Extract all elements matching tag-name
    $element_set = $doc->getElementsByTagName('tag-name');
Extract first element of a set
    $element = $element_set->item(0);
Extract first child of an element
    $child_node = $element->getFirstChild;
Extract text from a child (text) node
    $text = $child_node->getNodeValue;
Get the value of a tag's attribute
    $text = $element->getAttribute('attribute-name');

Summary - writing XML
Create an XML document structure
    $doc = XML::DOM::Document->new;
Utility to create an XML header
    $header = $doc->createXMLDecl('1.0');
Create a tagged element
    $element = $doc->createElement('tag-name');
Set an attribute for an element
    $element->setAttribute('attrib-name', 'value');
Append a child element to an element
    $parent_element->appendChild($child_element);
Create a text node
    $text_node = $doc->createTextNode('text');
Print a document structure as a string
    print $root_element->toString;
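
The calls summarised above can be combined into a small reusable routine. This is a sketch under stated assumptions: species_element() is a hypothetical helper, and the @records list of hashes is an illustrative alternative to the three parallel arrays used in the slides.

```perl
use strict;
use warnings;
use XML::DOM;

# Illustrative records; one hash per species keeps related values together
my @records = (
    { name => 'Felix domesticus',        common => 'cat',       status => 'not endangered' },
    { name => 'Drosophila melanogaster', common => 'fruit fly', status => 'not endangered' },
);

# Hypothetical helper (an assumption): build one <species> element from a record
sub species_element {
    my ($doc, $rec) = @_;
    my $species = $doc->createElement('species');
    $species->setAttribute('name', $rec->{name});

    my $cname = $doc->createElement('common-name');
    $cname->appendChild($doc->createTextNode($rec->{common}));
    $species->appendChild($cname);

    my $cons = $doc->createElement('conservation');
    $cons->setAttribute('status', $rec->{status});
    $species->appendChild($cons);

    return $species;
}

my $doc  = XML::DOM::Document->new;
my $root = $doc->createElement('data');
$root->appendChild(species_element($doc, $_)) foreach @records;

# Header followed by the element tree, collected as one string
my $xml = $doc->createXMLDecl('1.0')->toString . "\n" . $root->toString . "\n";
print $xml;
```

Keeping each species in one hash avoids the index bookkeeping of parallel arrays; adding a third species is then a one-line change to @records.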

Summary • Two types of parser • Event-driven • Data structure • Writing a good parser is difficult! • Many parsers available • XML::DOM for reading and writing data
