XTF in Depth: Powerful Search and Display for Electronic Text • Martin Haye, California Digital Library • January 2009 presentation at University of Sydney
XTF in Depth • Part 1: • What is XTF and how does it compare? • Who is using it? • What needs does it address? • New features in 2.1 • Design and data flow • Adapting Lucene and Saxon • Planned improvements • Part 2: • Interactive demos
XTF in 5 minutes • eXtensible Text Framework • Search and display technology from CDL • Open-source Java framework • Powerful and highly configurable • All about rapid prototyping, fast deployment, and incremental improvement • XML + full-text search • Also indexes PDF, HTML, Word • Excel and PowerPoint coming soon
XTF in 5 minutes • Search: Query power/speed of Lucene, plus: • search results shown in context • keyword search, facets, spelling, lots more • View: Processing power of Saxon, plus: • large file optimizations, hit markup • Configure and customize exclusively in XSLT • Flexible, overlapping collections • Mature, tightly integrated, well documented • In use at CDL and many other places
What XTF is not • It is not a content management system • Creation (conversion, scanning, manual) • Ingest / administration • Editing • Preservation • Not built for remote administration • Not a true XML database • but close • Not Google • Google: one interface to vast grab-bag of data • XTF: crafted interfaces to high-quality data sets
How does XTF compare? • [Chart: Greenstone*, Solr*, XTF 2.0, and XTF 2.1 placed along two axes, "Turn-key / easy" and "Customizable / powerful"] • * Caveat: based on my limited experience with Greenstone and Solr
Needs • Let’s look at four needs that XTF was created to address: • Diverse data • Open software • Rapid deployment • Community involvement
Needs: 1. Diverse data • Our collections: many and diverse • eScholarship (TEI, PDF) • UC Press monographs (a text may be > 10 megs) • 25,000 scholarly articles in PDF • Mark Twain • Hand-crafted critical edition (TEI + MODS) • OAC: finding aids, images, books, manuscripts • Japanese American Relocation Digital Archives • TEI, EAD, MODS • Book scanning projects (Google, Internet Archive) • Thousands of scanned books (PDF + DC) • Millions of Melvyl catalog records (MARC)
Needs: 2. Open software • Digital Publishing Products • “Black box” (no control over fixes & features) • Often not standards-based • Tech companies have short lifespans • Support often spotty • Data can be held hostage, or even lost • $$$$$
Needs: 3. Rapid deployment • New collections arriving • Users don't want to wait a year for access • Many “what if” and “wouldn't it be cool” requests from our staff • Java programmers are expensive • Look & feel goes stale quickly • Barrage of feature requests
Needs: 4. Community involvement • We want to share the load • For XTF 2.1, we asked the XTF community to vote for features they wanted • At CDL we try to align our development to needs of the community • Result: Everybody benefits
New and improved in 2.1 • Faceted browse • Search flexibility • Bookbag • Spelling correction • Similar items • OAI-PMH
Faceted browse • Previously, implementing faceted browse required lots of XSLT programming • Hierarchical facets were even harder • Supporting this required us to deeply refactor the stylesheets, but now it’s simple to add new facets
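To make the idea of hierarchical facets concrete, here is a minimal Python sketch of counting documents at every level of a facet hierarchy. It is not XTF code (XTF does this through Lucene and XSLT stylesheets); the `subject` field, the `::` level separator, and the sample records are all assumptions made up for illustration.

```python
from collections import Counter

def facet_counts(docs, field, sep="::"):
    """Count documents under every node of a hierarchical facet.

    A value such as "Science::Biology" contributes to both
    "Science" and "Science::Biology". The separator is an assumption.
    """
    counts = Counter()
    for doc in docs:
        nodes = set()
        for value in doc.get(field, []):
            parts = value.split(sep)
            for depth in range(1, len(parts) + 1):
                nodes.add(sep.join(parts[:depth]))
        # A set ensures each document is counted once per node.
        counts.update(nodes)
    return counts

# Hypothetical records, not real collection data.
docs = [
    {"title": "A", "subject": ["Science::Biology"]},
    {"title": "B", "subject": ["Science::Physics"]},
    {"title": "C", "subject": ["History"]},
]
counts = facet_counts(docs, "subject")
```

With these sample records, the top-level "Science" node covers two documents while each of its children covers one, which is the roll-up a hierarchical facet display needs.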
Search flexibility • Keyword search: single box (now default). Internally, searches multiple fields. • Advanced search: explicitly fill in constraints for various fields • Freeform search (new): text-based field specifiers, AND, OR, parentheses, etc.
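As a rough illustration of how a single keyword box can search multiple fields internally, the Python sketch below expands each keyword into an OR across fields and joins the keywords with AND. The output uses Lucene-style `field:term` syntax purely for readability; it is not a description of XTF's internal query representation, and the field names are hypothetical.

```python
def expand_keywords(keywords, fields=("title", "creator", "subject", "text")):
    """Expand a keyword string into a multi-field boolean query string.

    Every keyword must match (AND), but it may match in any field (OR).
    The default field names are assumptions for illustration.
    """
    clauses = []
    for word in keywords.split():
        alternatives = " OR ".join(f"{field}:{word}" for field in fields)
        clauses.append(f"({alternatives})")
    return " AND ".join(clauses)

query = expand_keywords("mark twain", fields=("title", "text"))
# query == "(title:mark OR text:mark) AND (title:twain OR text:twain)"
```

A freeform query written by the user, with explicit field specifiers and parentheses, has the same shape as this expanded string; the keyword box simply generates it on the user's behalf.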
OAI-PMH • This fit nicely into XTF’s architecture • Simple but conforming implementation
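For readers unfamiliar with OAI-PMH, a harvester talks to a repository with plain HTTP requests carrying a `verb` parameter. The Python sketch below only builds such a request URL; the base URL is a placeholder, not a real XTF endpoint, and no network call is made. The six verbs and the `oai_dc` metadata prefix come from the OAI-PMH 2.0 protocol itself.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}

def oai_request(base_url, verb, **args):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb, **args}
    return f"{base_url}?{urlencode(params)}"

# Placeholder base URL; a real deployment will differ.
url = oai_request("http://example.org/xtf/oai", "ListRecords",
                  metadataPrefix="oai_dc")
# url == "http://example.org/xtf/oai?verb=ListRecords&metadataPrefix=oai_dc"
```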
Bookbag • Refactored the AJAX to use YUI (Yahoo User Interface widgets) • Still session based • Now supports emailing the bookbag
Spelling correction • Unicode bug fixes • On by default and fully integrated
Similar items • Allows user to see “more like this” • Improved AJAX integration • On by default - no configuration needed
Other changes in XTF 2.1 • Built-in NLM “Blue”, TEI P5, MS Word support (still support TEI P4, EAD, PDF, HTML, text) • Valid XHTML output • RawQuery servlet to provide a query back-end to a (e.g. Ruby) front-end or mash-up. • Bug fixes and minor changes (many reported/requested by users)
Design philosophy • Adaptation through programming • XTF is still about building what you want using a set of powerful tools • But now: • Stylesheets are more modular • Build interfaces faster using honed widgets • Prettier UI to start with
XTF is open, standards-based • Based on free, open-source tools: • Java SDK 1.5+ • Lucene 2.1 full-text search toolkit • Saxon 8.9 XSLT processor • Unicode support throughout • XTF itself is open-source (BSD license) • No native code – pure Java and XSLT 2.0 • Runs on Windows, Solaris, Linux, MacOS • Drops right into Tomcat or Resin • Lots of user-fixable documentation
Modular • Use the crossQuery servlet to search, dynaXML to display and navigate. Deploy one or both. • Stylesheets govern flow of data – no Java programming required • Easy to add features incrementally • 100% configurable “look and feel” • Skin & slice: one system can have several interfaces and multiple “brands” • Collection subsetting driven by metadata
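The "skin & slice" idea can be pictured as each brand carrying a metadata filter that selects its slice of one shared index. The Python sketch below is only a conceptual illustration: in XTF this is configured in stylesheets, and the brand names, the `collection` field, and the records here are hypothetical.

```python
# Hypothetical brands, each defined by the metadata a document must carry.
BRANDS = {
    "escholarship": {"collection": "eschol"},
    "oac": {"collection": "oac"},
}

def subset(docs, brand):
    """Return the documents whose metadata satisfies the brand's filter."""
    wanted = BRANDS[brand]
    return [
        doc for doc in docs
        if all(value in doc.get(key, []) for key, value in wanted.items())
    ]

# A document may belong to several collections, so subsets can overlap.
docs = [
    {"id": 1, "collection": ["eschol", "ucpress"]},
    {"id": 2, "collection": ["oac"]},
    {"id": 3, "collection": ["oac", "eschol"]},
]
eschol_ids = [doc["id"] for doc in subset(docs, "escholarship")]
oac_ids = [doc["id"] for doc in subset(docs, "oac")]
```

Document 3 appears in both subsets, which shows how one indexed item can surface under more than one interface.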