Writing Simple Perl Scripts to Create, Convert and Analyze XML Documents Presented for: APIII - Advancing Practice Instruction and Innovation through Informatics Marriott City Center, Pittsburgh, PA Friday, October 10, 2003 Session E2 Perl and Python Programming Workshop
Session Organizers: Jules Berman and Jim Harrison
Jules J. Berman, Ph.D., M.D., Program Director for Pathology Informatics, Cancer Diagnosis Program, National Cancer Institute, National Institutes of Health, Rockville, MD
Virtually everything presented can be reviewed at your leisure at: http://65.222.228.150/jjb/tutor.htm This site contains hundreds of Perl programming tips and scripts.
What is the purpose of XML? XML allows heterogeneous systems to communicate and exchange their data. It achieves this through metadata (data about data). XML can produce an ideal document that completely describes itself, including all data and all metadata.
COMMON XML TASKS
1. Converting an HTML file to an XML file.
2. Converting an XML file to an HTML file (e.g. making an XML file presentable while preserving its information content).
3. Converting an Excel file to an XML file.
4. Converting an XML file to a different data structure (e.g. moving XML into a standard database).
5. Querying an XML file.
6. Querying multiple XML files for related information.
Let's do a simple conversion of an HTML file to an XML file. Here's the HTML file (notice that the top header information has been removed):
<body>
<h1>Simple HTML document</h1>
<br>List to follow:
<ul>
<li>First
<li>Second
<li>Third
</ul>
</body>
</html>
use strict;
use warnings;

open(my $in, "<", "html.htm") or die "Cannot open html.htm: $!";     # substitute your HTML page
open(my $out, ">", "html.xml") or die "Cannot open html.xml: $!";    # substitute your output file
print $out qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n};

# Map HTML tag names to the XML element names that replace them
my %dictionary = (
    "body" => "document",
    "h1"   => "title",
    "ul"   => "list",
    "ol"   => "list"
);
my @keysarray = keys(%dictionary);

while (my $line = <$in>) {
    chomp $line;
    $line =~ s/\<\/html\>//;    # drop the closing html tag
    # A line starting with <br> becomes a <line> element
    if ($line =~ /^\<br\>(.*)/) { print $out "<line>$1</line>\n"; next; }
    # A line starting with <li> becomes a closed <item> element
    if ($line =~ /^\<li\>(.*)/) { print $out "<item>$1</item>\n"; next; }
    # Rename every other opening or closing tag found in the dictionary
    foreach my $key (@keysarray) {
        $line =~ s/(\<[\/]?)$key/$1$dictionary{$key}/g;
    }
    print $out "$line\n";
}
close $in;
close $out;
Most important parts of the HTML->XML script:
my %dictionary = (
    "body" => "document",
    "h1"   => "title",
    "ul"   => "list",
    "ol"   => "list"
);
my @keysarray = keys(%dictionary);
foreach my $key (@keysarray) {
    $line =~ s/(\<[\/]?)$key/$1$dictionary{$key}/g;
}
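The dictionary-driven substitution can be tried without any files. The following is a minimal sketch, assuming the HTML arrives as a list of lines that each start with a tag, as in the sample HTML file; the function name html_lines_to_xml is hypothetical and is not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a list of HTML lines to XML-style lines
# using the same tag dictionary as the script.
sub html_lines_to_xml {
    my @html = @_;
    my %dictionary = (body => "document", h1 => "title", ul => "list", ol => "list");
    my @xml;
    foreach my $line (@html) {
        $line =~ s/<\/html>//;          # drop the closing html tag
        next if $line eq "";            # nothing left on this line
        if ($line =~ /^<br>(.*)/) { push @xml, "<line>$1</line>"; next; }
        if ($line =~ /^<li>(.*)/) { push @xml, "<item>$1</item>"; next; }
        foreach my $key (keys %dictionary) {
            # (<\/?) captures "<" or "</" so opening and closing tags are both renamed
            $line =~ s/(<\/?)$key/$1$dictionary{$key}/g;
        }
        push @xml, $line;
    }
    return @xml;
}

print "$_\n" for html_lines_to_xml(
    "<body>", "<h1>Simple HTML document</h1>", "<br>List to follow:",
    "<ul>", "<li>First", "<li>Second", "<li>Third", "</ul>", "</body>", "</html>",
);
```

Because the pattern matches on the start of a tag name only, a tag such as <ulx> would also be renamed; a stricter pattern would be needed for real-world HTML.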
Converting an XML file to an HTML file (there are many different ways to do this)
Converting an XML file to an HTML file:
use XML::Parser;    # loads an external (non-core) module
open(STDOUT, ">", "output.htm") or die "Cannot open output.htm: $!";    # all print output goes to the HTML file
my $parser = XML::Parser->new(
    Handlers => {
        Init  => \&handle_doc_start,
        Final => \&handle_doc_end,
        Start => \&handle_elem_start,
        End   => \&handle_elem_end,
        Char  => \&handle_char_data,
    });
my $file = "presum.xml";
$parser->parsefile($file);
sub handle_doc_start {
    my $header = <<HEADER;
<html>
<head>
<title>Precancer Classification</title>
</head>
<body>
<center><h1>Precancer Classification</h1></center>
<br>
<br>
HEADER
    print $header;
}
sub handle_doc_end {
    my $footer = <<FOOTER;
<br>
</body>
</html>
FOOTER
    print $footer;
}
my $count = 0;    # running count of <concept> elements seen so far
sub handle_elem_start {
    my ($expat, $name, %atts) = @_;
    if ($name eq "concept") {
        $count++;
        print qq{<br><font color="#0000ff">$name $count</font><ul>\n};
        return;
    }
}
(and so on for the remaining handlers)
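The slide leaves the remaining handlers out. As a sketch only, assuming the End handler closes the <ul> opened for each concept and the Char handler prints text as list items (both are assumptions, not the author's code), they might look like this. The argument lists follow XML::Parser's handler conventions: End receives the parser object and the element name, Char receives the parser object and a text string.

```perl
use strict;
use warnings;

# Hypothetical End handler: close the <ul> opened for each <concept> element.
sub handle_elem_end {
    my ($expat, $name) = @_;
    print "</ul>\n" if $name eq "concept";
}

# Hypothetical Char handler: print non-blank character data as a list item.
sub handle_char_data {
    my ($expat, $text) = @_;
    return unless $text =~ /\S/;    # skip whitespace-only chunks between elements
    $text =~ s/&/&amp;/g;           # escape reserved characters for HTML output
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    print "<li>$text</li>\n";
}
```

The parser may deliver one run of text in several Char calls, so a production handler would usually buffer the text and emit it from the End handler.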
Remember: Perl XML-related modules can be downloaded and installed at no cost through the PPM service from www.activestate.com.
PPM> search xml
Packages available from http://www.ActiveState.com/PPMPackages/5.6:
CGI-Form2XML [1.3] Render CGI form input as XML
CGI-ToXML [0.02] Converts CGI to an XML structure
CGI-XML [0.1] Perl extension for converting CGI.pm variables to/from XML
CGI-XMLForm [0.10] Extension of CGI.pm which reads/generates formated XML.
CGI-XMLPost [1.3] receive XML file as an HTTP POST
DBIx-XML-DataLoader [1.1b]
DBIx-XMLMessage [0.05] XML Message exchange between DBI data sources
DBIx-XML_RDB [0.05] Perl extension for creating XML from existing DBI datasources
Data-DumpXML [1.05] Dump arbitrary data structures as XML
GoXML-XQI [1.1.4] Perl extension for the XML Query Interface at xqi.goxml.com.
HTTP-WebTest-Plugin-XMLReport [1.01] Report plugin for HTTP::WebTest generates output in XML format
Tk-XMLViewer [0.15] Tk widget to display XML
XML-AutoWriter [0.37] DOCTYPE based XML output
XML-Beautify [0.05] Beautifies XML output from XML::Writer (soon to do any XML).
XML-DOM [1.25] A perl module for building DOM Level 1 compliant document structures
XML-DOMHandler [1] Implements a call-back interface to DOM.
XML-DTDParser [1.7] quick and dirty DTD parser
XML-Excel [0.02] Perl extension converting Excel files to XML
XML-Node [0.11] Node-based XML parsing: an simplified interface to XML::Parser
XML-SAX [0.12] Simple API for XML
XML-SAX-Base [1.02] Base class SAX Drivers and Filters
XML-SAX-Builder [0.02] build XML documents using SAX
XML-SAX-Expat [0.37] SAX Driver for Expat
XML-SAX-Machines [0.4] manage collections of SAX processors
XML-SAX-PurePerl [0.80] Pure Perl XML Parser with SAX2 interface
XML-SAX-RTF [0.1] SAX Driver for Microsoft's Rich Text Format (RTF)
XML-SAX-Simple [0.02] SAX version of XML::Simple
XML-SAX-Writer [0.44] SAX2 XML Writer
XML-SAXDriver-CSV [0.07] SAXDriver for converting CSV files to XML
XML-Writer [0.4] Perl extension for writing XML documents.
XML-Writer-String [0.1] Capture output from XML::Writer.
XML-XPath [1.12] a set of modules for parsing and evaluating XPath statements
XML-XPath-Simple [0.05] Very simple interface for XPaths
XML-XPathScript [0.03] Stand alone XPathScript
XML-XQL [0.68] A perl module for querying XML tree structures with XQL
XML-XSLT [0.40] A perl module for processing XSLT
Creating an XML file from an Excel file
• The example is done in Windows. Because it uses a Windows-based application and the Windows API, it won't work in Linux (not Perl's fault).
• There are plenty of other approaches that will work in Linux.
• It also requires Excel to be installed.
• The complete Perl script is opener7.pl, found in the Perl tutorial: http://65.222.228.150/jjb/tutor.htm
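Creates a Windows OLE object for Excel - NON_PERLISH
# "or" binds more loosely than "||", so the die guards the whole CreateObject call
my $app = CreateObject OLE "Excel.Application" or die "Can't open";
$app->Workbooks->Open($xlfile);
Creates the XML tags by collecting a list of the column headers
foreach my $column_place (@column_array) {
    $thing = $app->Range("${column_place}1")->{'Value'};    # header cell in row 1
    if ($thing ne "") {
        $thing =~ s/ /_/g;           # spaces become underscores
        $thing =~ s/[^\w0-9]//g;     # drop characters not allowed in a tag name
        $thing =~ s/2nd/Second/g;    # a tag name cannot start with a digit
        $nextthing = "$column_place||$thing";    # column letter, then tag name
        print "$nextthing\n";
        push(@index, $nextthing);
    }
    else {
        last;    # the first empty header cell ends the list
    }
}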
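The header clean-up can be tested on its own, without Excel. The sketch below wraps the same three substitutions in a function; the name header_to_tag and the sample header are hypothetical. XML element names may not begin with a digit, which is presumably why "2nd" is spelled out.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a spreadsheet column header into a string
# usable as an XML element name.
sub header_to_tag {
    my ($header) = @_;
    $header =~ s/ /_/g;           # spaces become underscores
    $header =~ s/\W//g;           # drop anything that is not a letter, digit or underscore
    $header =~ s/2nd/Second/g;    # avoid a name that starts with a digit
    return $header;
}

print header_to_tag("2nd Opinion (Y/N)"), "\n";    # prints Second_Opinion_YN
```

Only the literal "2nd" is handled; a header beginning with any other digit would still produce an invalid element name and would need its own rule.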
Writes each cell of the current row as an XML element, escaping reserved characters
foreach my $arrayvalue (@index) {
    $arrayvalue =~ /\|\|/;
    my $key = $`;      # column letter (text before ||)
    my $value = $';    # tag name (text after ||)
    $thing = $app->Range($key . $row)->{'Value'};
    #substitute &amp; for & (must come first)
    $thing =~ s/\&/\&amp;/g;
    #substitute &gt; for >
    $thing =~ s/\>/\&gt;/g;
    #substitute &lt; for <
    $thing =~ s/\</\&lt;/g;
    #substitute &apos; for '
    $thing =~ s/\'/\&apos;/g;
    #substitute &quot; for "
    $thing =~ s/\"/\&quot;/g;
    #strip every other character, keeping & and ; so the entities survive
    $thing =~ tr/a-zA-Z0-9 &;//cd;
    print " \<$value\>$thing\<\/$value\>\n";
}
$row++;
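The five substitutions correspond to XML's five predefined entities. A minimal sketch of the same escaping as a reusable function follows; the name xml_escape is hypothetical. The ampersand has to be replaced first, otherwise the ampersands introduced by the later substitutions would be escaped a second time.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the five characters that XML reserves.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # must run first
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/'/&apos;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

print xml_escape(q{He said "5 < 6" & 'ok'}), "\n";
```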
BUILDING THE COOPERATIVE PROSTATE CANCER TISSUE RESOURCE TISSUE MICROARRAY FILE
1. Get the xls file with core information:
TMACPCTR.XLS 98,816 7-24-03 11:17am A
2. Convert the xls file to an xml file using opener7.pl:
OPENER7.PL 3,663 7-24-03 11:39am A
This produces the file block2.xml:
BLOCK2.XML 328,263 7-24-03 11:39am A
3. Add header and trailer information to the xml file.
Header information is basically:
<?xml version="1.0"?>
<histo>
<tma>
<header>
<title>CPCTR Microarray 1</title>
<creator>CPCTR</creator>
<subject>Tissue Microarrays</subject>
<description>CPCTR TMA XML</description>
<rights>public domain</rights>
<filename>tmacpctr.xml</filename>
</header>
Trailer information is basically:
</core>
</block>
</tma>
</histo>
This produces:
TMACPCTR.XML 331,636 7-28-03 10:58am A
4. Check the validity of the tmacpctr.xml file using validtma.pl:
VALIDTMA.PL 9,132 5-21-03 3:06pm A
The TMA validating Perl script can be obtained by going to the TMA specification paper:
Jules J Berman, Mary E Edgerton and Bruce A Friedman. The tissue microarray data exchange specification: A community-based, open source tool for sharing tissue microarray data. BMC Medical Informatics and Decision Making 2003, 3:5. http://www.biomedcentral.com/1472-6947/3/5
The validating protocol produces a screen output that includes:
c:\tmacpctr.xml
Begining to parse c:\tmacpctr.xml now.
Finished.
c:\tmacpctr.xml is a valid Tissue Microarray File.
The one-way hash of your file is e2ad62a75974628b7499bd7d771b82f0
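The value reported in the screen output is 32 hexadecimal digits long, which is the length of an MD5 digest. Assuming that is the algorithm (the slide does not say), a comparable one-way hash can be computed with the core Digest::MD5 module. The function name file_md5_hex is hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: return the MD5 digest of a file as 32 hex digits.
sub file_md5_hex {
    my ($path) = @_;
    open(my $fh, "<", $path) or die "Cannot open $path: $!";
    binmode($fh);    # hash the raw bytes, without newline translation
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Usage: print file_md5_hex("tmacpctr.xml"), "\n";
```

Any change to the file, even a single character, yields a different digest, so the hash can be used to confirm that a validated file has not been altered since validation.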
Querying an XML file
• There are many ways to do this. Most people use XSLT (Extensible Stylesheet Language Transformations).
• When you haven't converted your XML into another data structure (like a database structure) and you're querying the XML document directly, a query is the same as a transformation in which you throw everything away except the content that matches your query.
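The idea of a query as a transformation that discards everything but the matches can be sketched in plain Perl. The example below is an illustration under stated assumptions, not an XSLT replacement: it uses a regular expression rather than a parser, assumes elements of the queried name are not nested inside one another, and the function name query_elements is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the content of every element with the given name.
sub query_elements {
    my ($xml, $tag) = @_;
    my @hits;
    # Match <tag> or <tag attr="..."> up to the nearest closing </tag>
    while ($xml =~ /<\Q$tag\E(?:\s[^>]*)?>(.*?)<\/\Q$tag\E>/gs) {
        push @hits, $1;
    }
    return @hits;
}

my $xml = "<document><title>Simple HTML document</title>"
        . "<list><item>First</item><item>Second</item><item>Third</item></list></document>";
print "$_\n" for query_elements($xml, "item");
```

For documents with nesting, attributes that matter, or CDATA sections, a parser-based approach such as XML::Parser or an XPath module is the safer choice.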
HETEROGENEOUS XML MERGES/QUERIES
• Can be thought of as a special form of XSLT
• Or as a data structure conversion
• Or as a straightforward Perl programming job
This is where namespaces become important.
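When files from different sources are merged, the same element name (for example title) can carry different meanings, and a namespace prefix keeps them apart. The sketch below qualifies every element name in a fragment with a prefix. It assumes the input elements are unprefixed; the function name add_prefix and the prefixes tma and pre are hypothetical. A real merged document would also need an xmlns declaration for each prefix on an enclosing element.

```perl
use strict;
use warnings;

# Hypothetical helper: put a namespace prefix on every element name in a fragment.
sub add_prefix {
    my ($xml, $prefix) = @_;
    # (\/?) keeps the slash of closing tags; the name must start with a letter
    # or underscore, so <?xml ...?> and <!-- ... --> are left untouched.
    $xml =~ s/<(\/?)([A-Za-z_][\w.-]*)/<$1${prefix}:$2/g;
    return $xml;
}

print add_prefix("<title>CPCTR Microarray 1</title>", "tma"), "\n";
print add_prefix("<title>Precancer Classification</title>", "pre"), "\n";
```

After prefixing, a query for tma:title no longer picks up titles that came from the other source.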