Extensible Markup Language: XML
• HTML: a portable, widely supported markup language for describing how to format data
• XML: a portable, widely supported markup language for describing the data itself
• XML is quickly becoming the standard for data exchange between applications
XML Documents
• XML marks up data using tags, which are names enclosed in angle brackets < >
• Tags appear in pairs: <myTag> .. </myTag>
• Elements: units of data (i.e., a start tag, its corresponding end tag, and everything between them)
• The root element contains all other document elements
• Tag pairs cannot be interleaved: <a><b></a></b> is not allowed; it must be <a><b></b></a>
• Nested elements form trees
• What defines an XML document is not its tag names but the fact that its tags are formatted in this way
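The nesting rules above can be checked mechanically. The following is a minimal sketch, assuming Python's standard `xml.dom.minidom` parser; the helper name `is_well_formed` is hypothetical and only used for illustration.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        parseString(xml_text)
        return True
    except ExpatError:
        # Raised for interleaved tags, a missing end tag, more than one root, etc.
        return False


print(is_well_formed("<a><b></b></a>"))  # properly nested tag pairs
print(is_well_formed("<a><b></a></b>"))  # interleaved tag pairs
print(is_well_formed("<a></a><b></b>"))  # two top-level elements, no single root
```

Only the first string satisfies all the rules: the second interleaves its tag pairs and the third has no single root element.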
The optional XML declaration includes a version information parameter (if present, it MUST be the very first line of the file)
The root element (here: article) contains all other document elements
Because of the nice <tag> .. </tag> structure, the data can be viewed as organized in a tree:
article
    title
    date
    author
        firstName
        lastName
    summary
    content
<?xml version = "1.0"?>
<!-- I-sequence structured with XML. -->
<SEQUENCEDATA>
   <TYPE>dna</TYPE>
   <SEQ>
      <NAME>Aspergillus awamori</NAME>
      <ID>U03518</ID>
      <DATA>aacctgcggaaggatcattaccgagtgcgggtcctttgggccca
acctcccatccgtgtctattgtaccctgttgcttcgg
cgggcccgccgcttgtcggccgccgggggggcgcctctg
ccccccgggcccgtgcccgccggagaccccaacacgaac
actgtctgaaagcgtgcagtctgagttgattgaatgcaat
cagttaaaactttcaacaatggatctcttggttccggc
      </DATA>
   </SEQ>
</SEQUENCEDATA>
An I-sequence might be structured as XML like this. The second line is a comment. The elements form this tree:
SEQUENCEDATA
    TYPE
    SEQ
        NAME
        ID
        DATA
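To illustrate how such a file could be read without a hand-written parser, here is a minimal sketch, assuming Python's standard `xml.dom.minidom` module. The helper names `element_text` and `read_sequences` are hypothetical, and the sequence data is truncated to its first two lines for brevity.

```python
from xml.dom.minidom import parseString

# The I-sequence document from the slide, with the DATA content truncated.
SEQUENCE_XML = """<?xml version="1.0"?>
<!-- I-sequence structured with XML. -->
<SEQUENCEDATA>
  <TYPE>dna</TYPE>
  <SEQ>
    <NAME>Aspergillus awamori</NAME>
    <ID>U03518</ID>
    <DATA>aacctgcggaaggatcattaccgagtgcgggtcctttgggccca
acctcccatccgtgtctattgtaccctgttgcttcgg
    </DATA>
  </SEQ>
</SEQUENCEDATA>
"""


def element_text(parent, tag):
    """Return the text inside the first <tag> element found below parent."""
    node = parent.getElementsByTagName(tag)[0]
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def read_sequences(xml_text):
    """Return one dict per SEQ element, with whitespace removed from the data."""
    root = parseString(xml_text).documentElement
    seq_type = element_text(root, "TYPE")
    records = []
    for seq in root.getElementsByTagName("SEQ"):
        records.append({
            "type": seq_type,
            "name": element_text(seq, "NAME"),
            "id": element_text(seq, "ID"),
            # Line breaks and indentation inside DATA are not part of the sequence.
            "data": "".join(element_text(seq, "DATA").split()),
        })
    return records


for record in read_sequences(SEQUENCE_XML):
    print(record["id"], record["name"], record["type"], len(record["data"]))
```

Looking elements up by tag name keeps the code independent of the order in which NAME, ID and DATA appear inside SEQ.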
Parsing and displaying XML
• XML is just another data format
• Do we need to write yet another parser? No more filters, please!
• No! XML is becoming a standard
• Many different systems can read XML; not many systems can read our I-sequence format
• Thus, parsers already exist
XML document opened in Internet Explorer
• Standard browsers can format XML documents nicely!
• Each parent element/node can be expanded and collapsed
[Figure: XML document displayed in Internet Explorer, with callouts marking a minus sign and a plus sign]
XML document opened in Mozilla
• Again, each parent element/node can be expanded and collapsed (here by pressing the minus sign, not the element)
Attributes
• Data can also be placed in attributes: name/value pairs, with the value in quotes
• Example (letter.xml): the element contact has the attribute type, which has the value "to"
• Empty elements are elements with no character data between the tags
• The tags of an empty element may be written as one, like this: <myTag />
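As an illustration of attributes and empty elements, here is a minimal sketch, assuming Python's standard `xml.dom.minidom` module. The contents of letter.xml are not reproduced in these slides, so the document below is a hypothetical stand-in: only the `contact` element and its `type="to"` attribute come from the slide, while the `name` and `flag` elements and the function `contacts_by_type` are invented for the example.

```python
from xml.dom.minidom import parseString

# Hypothetical stand-in for letter.xml (the real file is not shown here).
LETTER_XML = """<letter>
  <contact type="to">
    <name>Jane Doe</name>
  </contact>
  <contact type="from">
    <name>John Doe</name>
  </contact>
  <flag />
</letter>"""


def contacts_by_type(xml_text):
    """Map each contact's type attribute to the text of its name element."""
    document = parseString(xml_text)
    result = {}
    for contact in document.getElementsByTagName("contact"):
        contact_type = contact.getAttribute("type")  # attribute value, e.g. "to"
        name = contact.getElementsByTagName("name")[0].firstChild.data
        result[contact_type] = name
    return result


print(contacts_by_type(LETTER_XML))

# An empty element written as <flag /> has no child nodes at all.
flag = parseString(LETTER_XML).getElementsByTagName("flag")[0]
print(flag.hasChildNodes())
```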
Parsers and trees
• We've already seen that XML markup can be displayed as a tree
• Some XML parsers exploit this. They
  • parse the file
  • extract the data
  • return it organized in a tree data structure called a Document Object Model
[Figure: the article tree, with root article and children title, date, author (containing firstName and lastName), summary and content]
Document Object Model (DOM)
• A DOM parser retrieves data from an XML document
• It returns a tree structure called a DOM tree
• Each component of an XML document is represented as a tree node
• Parent nodes contain child nodes
• Sibling nodes have the same parent
• A single root (or document) node contains all other document nodes
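The parent/child structure of a DOM tree can be made visible with a short recursive walk. This is a minimal sketch, assuming Python's standard `xml.dom.minidom` module. The function name `outline` is hypothetical, and the XML string is a compact stand-in that uses the element names from the article example.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Compact stand-in document; no whitespace between tags, so no extra text nodes.
ARTICLE_XML = (
    "<article><title>Simple XML</title>"
    "<author><firstName>John</firstName><lastName>Doe</lastName></author>"
    "</article>"
)


def outline(node, depth=0):
    """Return element names below node, indented by their depth in the tree."""
    lines = []
    if node.nodeType == Node.ELEMENT_NODE:
        lines.append("  " * depth + node.nodeName)
        for child in node.childNodes:  # parent nodes contain child nodes
            lines.extend(outline(child, depth + 1))
    return lines


document = parseString(ARTICLE_XML)
print("\n".join(outline(document.documentElement)))
```

Here title and author are siblings because both have article as their parent, and firstName and lastName are siblings below author.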
Python provides a DOM parser!
• All nodes have a name (of the tag) and a value
• Text (including whitespace) is represented in nodes with the tag name #text
DOM tree of the article document:
article
    #text
    title
        #text: Simple XML
    #text
    date
        #text: Dec..2001
    #text
    author
        #text
        firstName
            #text: John
        #text
        lastName
            #text: Doe
        #text
    #text
    summary
        #text: XML..easy.
    #text
    content
        #text: In this..XML.
    #text
NB: Changes since book! (fig16_04revised.py)
• Parse the XML document and load the data into the variable document
• The documentElement attribute refers to the root node
• nodeName refers to the element's tag name
• Various node attributes: firstChild, nextSibling, nodeValue, parentNode
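The listing fig16_04revised.py itself is not reproduced here, so the following is only a sketch of the same steps, assuming Python's standard `xml.dom.minidom` module and parsing from a string instead of a file. The date, summary and content values are hypothetical placeholders.

```python
from xml.dom.minidom import parseString

# Stand-in for the article document; the placeholder values are assumptions.
ARTICLE_XML = """<?xml version="1.0"?>
<article>
  <title>Simple XML</title>
  <date>(placeholder date)</date>
  <author>
    <firstName>John</firstName>
    <lastName>Doe</lastName>
  </author>
  <summary>(placeholder summary)</summary>
  <content>(placeholder content)</content>
</article>
"""

# Parse the XML document and load the data into the variable document.
document = parseString(ARTICLE_XML)

# documentElement refers to the root node; nodeName is the element's tag name.
root = document.documentElement
print("Root element:", root.nodeName)

# Whitespace between the tags shows up as #text nodes among the children.
child_names = [child.nodeName for child in root.childNodes]
print("Child nodes:", " ".join(child_names))

first = root.firstChild        # whitespace #text node after <article>
sibling = first.nextSibling    # the title element
title_text = sibling.firstChild.nodeValue
parent_name = sibling.parentNode.nodeName

print("First child:", first.nodeName)
print("Next sibling:", sibling.nodeName)
print("Text inside title:", title_text)
print("Parent of title:", parent_name)
```

Because every line break and indentation between elements becomes a #text node, firstChild of the root is a text node and the first element is reached through nextSibling.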
Program output
Here is the root element of the document: article
The following are its child elements: #text title #text date #text author #text summary #text content #text
The first child of root element is: #text
whose next sibling is: title
Text inside "title" tag is Simple XML
Parent node of title is: article