XSL, Swish-e and DjVu • Kevin Reiss • Rutgers-Newark School of Law Library • March 10th, 2004 • TAG Meeting
Project Description: New Jersey Digital Legal Library • URL: http://njlegallib.rutgers.edu • Create a searchable and browsable repository of previously unavailable NJ legal information • 3 collections: • New Jersey Administrative Reports, 1979-1991 • New Jersey Executive Orders, 1941 – January 1990 • New Jersey Attorney General Opinions • Collection 1 scanned professionally by Princeton Imaging; Collections 2 and 3 scanned in house on a flatbed Minolta PS7000 • OCR quality is good in Collection 1, poor in Collections 2 and 3 • Available in PDF and DjVu [with embedded OCR text] • DjVu created with LizardTech Document Express 3.1 • PDF created with c42pdf [http://c42pdf.ffii.org/]
Project Requirements • Use only open-source tools [other than for document creation] • Provide full-text searching and searching within specific metadata fields • Documents need to be indexed and retrieved as atomic units, rather than at the page level • Solution: • Store the metadata and full text of each document in the same unit and find an indexing program that can index them both • Ultimate solution: • Extract OCR text from DjVu files using djvutoxml • Use XSL to combine the djvutoxml output and the XML metadata into a single XHTML file • Use Swish-e to index and search the XHTML file
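The "same unit" idea above can be sketched outside XSL as well. The following is a minimal Python illustration, not the project's stylesheet: it writes metadata as HTML meta tags and the OCR text as paragraphs in one XHTML document. The function name `build_document` and the sample field names (`title`, `date`) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_document(metadata, ocr_text):
    """Return one XHTML string holding both the metadata and the full text.

    metadata: dict of field name -> value (hypothetical field names).
    ocr_text: plain text, with blank lines separating paragraphs.
    """
    html = ET.Element("html", {"xmlns": XHTML_NS})
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = metadata.get("title", "")
    # Each metadata field becomes a <meta> tag that an indexer can pick up.
    for name, value in metadata.items():
        ET.SubElement(head, "meta", {"name": name, "content": value})
    body = ET.SubElement(html, "body")
    # The full text lives in the same file, so the document is one unit.
    for paragraph in ocr_text.split("\n\n"):
        if paragraph.strip():
            ET.SubElement(body, "p").text = " ".join(paragraph.split())
    return ET.tostring(html, encoding="unicode")


if __name__ == "__main__":
    print(build_document({"title": "Executive Order 1", "date": "1941"},
                         "First paragraph.\n\nSecond paragraph."))
```

Keeping metadata and text in one file means the indexer sees each document as a single record, which is what the atomic-unit requirement asks for.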
Swish-e Basics • URL: http://www.swish-e.org/ • Simple Web Indexing System for Humans – Enhanced • Full-text indexing program written in C, freely available • Special indexing modes for XML and HTML documents; can index any plain-text format • Uses standard open-source filtering tools to index ps, pdf, word, and ps.gz documents • Can index both local file systems and documents fetched over HTTP • Supports several stemming algorithms • Supports Boolean searching • Supports wildcard and phrase searching • Indexing controlled by a standard configuration file format • Uses libxml to parse XML and HTML documents
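To make the Boolean, wildcard, phrase and metadata-field searches listed above concrete, here is a small Python sketch that assembles query strings in the style of the Swish-e search language as I understand it (`and`/`or` operators, a trailing `*` wildcard, double-quoted phrases, `metaname=(...)` for field searches). The helper `build_query` and the field name `title` are hypothetical; verify the syntax against the Swish-e documentation for your version.

```python
def build_query(words, metaname=None, operator="and", phrase=False):
    """Assemble a Swish-e style query string (sketch, not an official API).

    words:    list of search words; a word may end in '*' for a wildcard.
    metaname: optional metadata field to restrict the search to.
    operator: 'and' or 'or', used to join the words.
    phrase:   if True, search for the words as an exact phrase.
    """
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    if phrase:
        expr = '"' + " ".join(words) + '"'
    else:
        expr = f" {operator} ".join(words)
    if metaname:
        # Restrict the expression to a single metadata field.
        return f"{metaname}=({expr})"
    return expr


if __name__ == "__main__":
    print(build_query(["executive", "order"]))
    print(build_query(["environ*"], metaname="title"))
    print(build_query(["attorney", "general"], phrase=True))
```

A string built this way would typically be passed to the search program as the query argument.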
Why Choose Swish-e • It can index and search HTML meta tags • It is fast, indexing several thousand files in a few seconds • Decent index compression: approximately 700 pages with metadata result in a 13.5 MB index • SWISH::API, a Perl module for embedding Swish-e in applications, is available • This module forms the basis of a fairly functional demo web-based search app that can be used to build your own search interface • Easy to select the meta or XML tags you wish to index and return with search results using the “metatag” and “property” declarations (the MetaNames and PropertyNames directives) in the Swish-e config file • Excellent documentation [http://www.swish-e.org/current/docs/] • Under active development, version 2.4.2 just released yesterday
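As an illustration of the meta and property declarations mentioned above, the Python sketch below generates a small Swish-e configuration file. The directive names (`IndexDir`, `IndexFile`, `IndexOnly`, `MetaNames`, `PropertyNames`) are given as I recall them from the Swish-e configuration documentation; the function name, paths and field names are assumptions for the example, not the project's actual configuration.

```python
def swish_config(index_dir, index_file, meta_names, property_names):
    """Return the text of a minimal Swish-e config file (illustrative only)."""
    lines = [
        f"IndexDir {index_dir}",        # directory holding the XHTML files
        f"IndexFile {index_file}",      # where the index is written
        "IndexOnly .html",              # only index files with this suffix
        # Meta tags that should be searchable as fields:
        "MetaNames " + " ".join(meta_names),
        # Meta tags whose values should be returned with search results:
        "PropertyNames " + " ".join(property_names),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(swish_config("xhtml", "njlegal.index",
                       ["title", "date"], ["title", "date"]))
```

Searchable fields and returned fields are declared separately, so a field can be indexed without being displayed, or the reverse.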
XSL Basics • Extensible Stylesheet Language [http://www.w3.org/Style/XSL/] • Really two W3C XML standards • XSLT: a transformation language for XML documents • XSL-FO: a language for specifying formatting semantics, much more powerful than CSS, generally used for print publications • Written as well-formed XML • Some predict it will take on SQL-like functionality for XML documents • Based on the paradigm of functional programming • XSLT transformations are executed using an XSLT processor • Many Java-based XSLT processors • I use libxml [http://xmlsoft.org/], a C-based library that provides an XML parser and, through its companion libxslt, an XSLT processor • Takes an XML document as input and transforms it into XML, HTML, or plain-text output • The instructions for this transformation are located in XSLT stylesheets • Transforms one tree into another
XSL Syntax Basics • Stylesheets are constructed of a series of “templates” that match nodes or groups of nodes in an XML document • Example: main XSL stylesheet for djvu2xhtml conversion • Groups of nodes are selected by writing XPath expressions • XPath is another W3C standard [http://www.w3.org/TR/xpath] • Purpose: “a language for addressing parts of an XML document” • XSLT has a number of familiar procedural constructs: looping, branching, named variables • Example of variables: parameter stylesheet for djvu2xhtml • Some problems: • Can be slow for large documents [whole document is loaded into memory] • Multiple input and output documents are clunky • String processing is problematic: no regexes, so complicated tasks typically need recursive structures
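The string-processing limitation noted above is easier to see with an example. The Python sketch below mimics how a split has to be written when only the XPath 1.0 functions substring-before() and substring-after() are available: the function calls itself on the remainder, much as a recursive named template would. It is an analogy written in Python, not XSLT, and all function names are made up for the illustration.

```python
import re


def substring_before(s, sep):
    """Like XPath 1.0 substring-before(): '' when sep does not occur."""
    head, found, _ = s.partition(sep)
    return head if found else ""


def substring_after(s, sep):
    """Like XPath 1.0 substring-after(): '' when sep does not occur."""
    _, _, tail = s.partition(sep)
    return tail


def split_recursive(s, sep):
    """Split s on a non-empty sep by recursion, dropping empty tokens."""
    if sep not in s:
        return [s] if s else []
    head = substring_before(s, sep)
    rest = split_recursive(substring_after(s, sep), sep)
    return ([head] if head else []) + rest


def split_regex(s, sep):
    """The same task with a regular expression, for comparison."""
    return [token for token in re.split(re.escape(sep), s) if token]


if __name__ == "__main__":
    print(split_recursive("a,b,,c", ","))
    print(split_regex("a,b,,c", ","))
```

The recursive version needs one call per token, which is part of why long OCR text is awkward to process this way.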
DjVuXML Tools • Part of DjVuLibre 3.5.12 or higher • URL: http://djvu.sourceforge.net/doc/man/djvuxml.html • Performs djvused-like functions (annotations, highlighting) using XML syntax • djvutoxml outputs an XML serialization of a DjVu document • Example output: results in very large files • The output reflects line, page and column information, which can vary quite a bit from document type to document type • Unrecognized OCR characters often result in Unicode errors, so use the provided xml2utf8 or xml2utf16 filters • Provides a set of coordinates for regions in a DjVu document, but these run contrary to what the plug-in understands
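To give a feel for the djvutoxml output described above, here is a Python sketch that reads a small hand-written sample and recovers the text line by line. The element names (OBJECT, HIDDENTEXT, PAGECOLUMN, REGION, PARAGRAPH, LINE, WORD) and the `coords` attribute follow the DjVuXML layout as I understand it; the sample itself is invented and heavily shortened, and the code makes no claim about the order or orientation of the four coordinate values.

```python
import xml.etree.ElementTree as ET

# Invented, shortened sample in the style of djvutoxml output (assumption).
SAMPLE = """<DjVuXML><BODY>
<OBJECT data="file:doc.djvu" type="image/x.djvu" height="3300" width="2550">
 <HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
  <LINE>
   <WORD coords="100,200,300,150">Executive</WORD>
   <WORD coords="320,200,450,150">Order</WORD>
  </LINE>
  <LINE>
   <WORD coords="100,260,180,210">No.</WORD>
   <WORD coords="200,260,230,210">1</WORD>
  </LINE>
 </PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT>
</OBJECT>
</BODY></DjVuXML>"""


def page_lines(xml_text):
    """Return one list of text lines per page (one OBJECT per page)."""
    root = ET.fromstring(xml_text)
    pages = []
    for obj in root.iter("OBJECT"):
        lines = []
        for line in obj.iter("LINE"):
            words = [w.text.strip() for w in line.iter("WORD")
                     if w.text and w.text.strip()]
            if words:
                lines.append(" ".join(words))
        pages.append(lines)
    return pages


def word_boxes(xml_text):
    """Return (word, coords) pairs; coords is a tuple of the raw integers."""
    root = ET.fromstring(xml_text)
    boxes = []
    for word in root.iter("WORD"):
        raw = word.get("coords", "")
        coords = tuple(int(n) for n in raw.split(",") if n.strip())
        boxes.append(((word.text or "").strip(), coords))
    return boxes


if __name__ == "__main__":
    print(page_lines(SAMPLE))
    print(word_boxes(SAMPLE)[0])
```

Because every word carries its own element and coordinates, the serialization grows quickly, which is consistent with the very large files noted above.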
Workflow • Prepare metadata in XML • Available in a format based partly on Dublin Core, partly on in-house tags • This was extracted from static HTML pages • Prepare customized metadata and display information for the documents to be transformed: example • I use Emacs nxml-mode for editing XML documents • Invoke DjVuXML commands • Transform documents to XHTML: example • Prepare Swish-e index • Add meta and property information to the config file • Prepare search interface • Add meta and property information to the CGI interface config • Add display-related meta and property information to the search template file
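The batch steps above could be driven by a small script. The Python sketch below only builds the command lines for the three tools (djvutoxml, an XSLT processor, and swish-e) so they can be inspected before anything runs. It assumes xsltproc as the XSLT processor and the invocations `djvutoxml input output`, `xsltproc -o output stylesheet input` and `swish-e -c config`; the file names, including `djvu2xhtml.xsl`, are hypothetical.

```python
from pathlib import PurePosixPath


def conversion_commands(djvu_file, stylesheet, out_dir):
    """Return the two per-document commands: DjVu -> XML -> XHTML."""
    djvu = PurePosixPath(djvu_file)
    out = PurePosixPath(out_dir)
    xml_file = out / (djvu.stem + ".xml")
    xhtml_file = out / (djvu.stem + ".html")
    return [
        # Step 1: serialize the DjVu document, including OCR text, to XML.
        ["djvutoxml", str(djvu), str(xml_file)],
        # Step 2: apply the stylesheet to produce the XHTML file to index.
        ["xsltproc", "-o", str(xhtml_file), str(stylesheet), str(xml_file)],
    ]


def index_command(config_file):
    """Return the command that rebuilds the Swish-e index once per batch."""
    return ["swish-e", "-c", str(config_file)]


if __name__ == "__main__":
    for cmd in conversion_commands("eo/eo001.djvu", "djvu2xhtml.xsl", "xhtml"):
        print(" ".join(cmd))
    print(" ".join(index_command("swish.conf")))
```

Each list can be handed to `subprocess.run(cmd, check=True)`; the index command runs once after all documents have been converted.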
Problems • Disk space usage could develop into an issue • XSLT transformations using the djvuxml format are too slow to be used in any real-time processing; they must be done in batch • Updating or adding metadata must be done by hand or by program; there is no data entry interface • Swish-e has limited support for indexing XML attributes • Swish-e can only index specific fields in XML documents that are defined as properties • Enabling highlighting in DjVu documents requires solving the coordinate problem • Complicated modifications to the search interface are time consuming and require you to learn one of the Perl HTML templating mechanisms, like Template::Toolkit or HTML::Template
Future Directions • Explore fully XML-aware indexing engines • Amberfish • eXist – example apps, based on XQuery • Xindice • Search interface improvements • Take the user directly to their keyword in the document • Dynamically generate the browsing pages for the collection based on information in the metadata files [currently static HTML] • DjVuXSL stylesheet improvements • Work on string processing capabilities to recognize paragraphs and lists • Rework the use of the document() function to improve processing speed • Try XSLT 2.0 to see if the new string processing capabilities can help • Learn more about the structure of DjVu documents to make the stylesheets more reliable
Useful Links • DjVuXSL • DjVuXSL Stylesheets Homepage • Guide to Dublin Core in HTML • Swish-e • Current Swish-e Documentation • XSL • XSL-List • Jeni Tennison's XSLT Pages • Book: XSLT Programmer's Reference • XSLT 1.0 Tutorial • XSLT 2.0 Introduction • XSLT 2.0 Implementation