340 likes | 469 Views
Release the power of your XML markup!. X ML A ware I ndexing & R etrieval A rchitecture. http://www.xaira.org. What are digital resources actually for ?. integration of disparate sources texts, commentaries, sources, variations… multimedia, manuscripts, transcriptions, metadata…
E N D
Release the power of your XML markup! XML Aware Indexing & Retrieval Architecture http://www.xaira.org
What are digital resources actually for ? integration of disparate sources texts, commentaries, sources, variations… multimedia, manuscripts, transcriptions, metadata… a new way of preservation media disappear, data remain "multiplication beyond the reach of accident" a huge expansion of accessibility quantitative … and also qualitatitive
.. And are we delivering? integration of disparate sources Different user communities have different -- and sometimes contradictory -- agendas and priorities a new way of preservation The business model is still unclear: is digitization a public good? And the technical problems may be insuperable a huge expansion of accessibility Implies a huge expansion of metadata provision But what about qualitative expansion?
The majority view Libraries collect and distribute Texts What is a text? It’s a book Or an electronic simulation of one Texts are cultural artefacts, which need to be treated like other cultural artefacts
This is good news for … Librarians and other cultural guardians An emergent new class of textual editors Students in search of easy dissertation topics Antiquarians of all sorts (until the books run out)
A minority view… • What is a Text? • The book is not the text! • Page images are not the text! • Markup is not the text! • The text is all IN YOUR HEAD • It’s an avatar for the language • Language is more interesting than texts
Remember the concordance? Defamiliarizes and decontextualizes the components of a text Facilitates analysis of Lexis, syntax, and lexical patterns Co-occurrence, collocation, colligation Informed by metadata categorization and acculated interpretation A way of reading a text in its context as a means of discovering its primings
From SARA to XAIRA… SARA (SGML-Aware Retrieval Application) was BNC-specific XAIRA (XML-Aware Indexing and Retrieval Architecture) is a generic XML Corpus searching toolkit Currently available for MS Windows only Open source version under development Join our beta test at http://www. xaira.org Object oriented design
XAIRA: the key features Supports word search, concordance generation and manipulation, collocation, lexical analysis Uses XML annotation to the max and supports XML-aware complex queries Leverages existing standards TEI/XCES Unicode CSS and XML Web services Uses efficient and compact indexing appropriate to small or huge corpora
First catch your corpus… any collection of well-formed XML documents if a DTD is supplied, the corpus must be valid if no TEI header is present, one will be created the more you put in, the more you get out "texts" are defined independently of file structure, as are the relevant units within them all indexing information is stored in the corpus header
Next, build your index… Can be done simply by adding appropriate declarations to the TEI Header and running the indexer utility But probably easier to do with the supplied Indextools utility which organizes and validates the files you are using updates (or creates) the header with tokenization and indexing rules tag and attribute usage, descriptive codebooks etc. "bibliographic" metadata default behaviour for character encoding, formats used, etc optionally runs and tests the indexer And there is also a wizard….
What goes in the index? • tokenization • implicit, following Unicode rules (locale-sensitive) • explicit, following mark up • supports lexical features (eg collocation) • lemmatization and POS tags • special case of "additional key" mechanism • generalized to provide fast context-specific searches • tag indexes • attribute values and codebooks
lexica TEI Header index WSDL client PC client Web client The architecture corpus Xairo (or χρ) : the Xaira Object model
Hoorah for Unicode All data is held internally as Unicode this allows us to defer most problems (e.g. tokenization, case-folding, line-breaking, character normalization, glyph composition) to someone else! User interface issues For output, use one or more appropriate fonts For input, we provide a keyboard definition utility Unicode rules can also be modified…
Client/protocols The original SARA protocol Corpus Query Language Ad-hoc ASCII strings Now revised completely Sara Object Model can be accessed directly by the client as a web service using saraScript The model defines an XML-corpus Query Language (XQL) methods to manipulate XQL queries and results
XMLQuery Language • XML vocabulary for searching • word, punctuation mark, substring • word + secondary keys (e.g. POS) • XML start- or end-tag, plus attributes • Unicode-compliant regular expressions • Including • The usual Boolean operations • sequence, disjunction, join • Scoping operators • Special lexical features • Not Xpath, not Xquery…yet
Client features • Word and lemma query • User-configurable display • plain, XML, user-defined stylesheets • Texts, Results, Browse windows • Results can be exported in XML • Scripting language • “visual interface” for complex queries
Target queries What is the most frequent noun in this corpus? Find a random sample of 100 instances of "fish" followed by "chips" within 4 words Find sentences beginning with a conjunction. Show all inflected forms of the name "Winston". Show sentences which begin with "well" and end with a question mark. How often and in what contexts is the word "nature" used in different kinds of writing? Which verbs collocate significantly with "bosom" at different periods of history? Do men use colour vocabulary differently from women?
Phrase or simple query search word or phrase can be case sensitive can include punctuation can include anyword character watch out for tokenization problems
Word Query searches the lexicon for a word stem or pattern returns matching word forms with frequencies can restrict by frequency can apply lemmatization rules .. then carries out a lookup to display hits
Collocation query • Compares frequency of a given word or lemma with frequencies of all others appearing within a specified context • Ranks results by a statistic (Mutual Information or Z-score) indicating the degree of collocational strength • Can be very suggestive…
Partitions A partition is a way of grouping the texts making up a corpus, according to some explicit annotation or characterization (e.g. an attribute value) according to whether or not they match a query (a partition of two halves) arbitrary manual classification Each member of a partition is a discrete text Analysis shows the rate of occurrence of hits within members of the partition Partitions can be saved and re-used or defined dynamically indextools generates a default partition using <catRef> element
Collocates of love spoken demographic texts fiction
Building complex queries visual interface scope node defines where to look an XML element by span query nodes define what to look for word, phrase, addkey, pattern, XML link types define sequence in which query node targets should occur next, one-way, two-way
What is XAIRA's niche? Web search engines patchy and unknowable coverage designed to recover content, not word forms hard to cite, harder to process XML display engines expensive, geared to reader not searcher focus on presentation rather than content As a back end for your next generation web application
BNC Baby • Demo CD available here and now • Contains • Four million-word samples from BNC • Nameless Shakespeare • Brooklyn Corpus of Old English • Xaira release 1.08 • Sign up here!