TEI for Interactive Concordances: The New Menota Search System
Øyvind Eide and Vemund Olstad
Unit for Digital Documentation, University of Oslo
The Menota network
• Menota is a network of institutions working with medieval texts
• 18 member institutions so far, in four countries (Sweden, Norway, Denmark and Iceland)
• Governed by a board with one representative from each country
• All institutions meet annually for a council meeting to discuss status and challenges
Text publication
• Anyone can add texts to the archive, but:
• Texts have to comply with Menota encoding standards
• Extended TEI P5
• Detailed encoding manual available
• Menota authorization required for repository access
• Texts added to the repository are available for browsing instantly, but scripting/processing by the publisher is required to add them to the search/corpus
Search and display
• Search form data filtered through Cocoon to CGI
• CGI script queries Corpus Workbench (CWB)
• CWB builds search result as TEI KWIC (P5 compliant) then returns it to Cocoon
• Search result is filtered by Cocoon and various style sheets to a concordance list
• Added concordance functions: context-aware menu, link to text view/dictionary, etc.
Web frontend (architecture diagram)
• Apache Cocoon views static files (handbook, minutes, news list, etc.)
• Apache Cocoon views dynamic files (Menotic texts as HTML/PDF, converted by XSL on the fly)
• Apache Cocoon searches Corpus Workbench (Menotic texts, converted to corpus format)
Search system (Perl)
• Receives the search parameters from CGI
• Reformats them into a CQP query
• Sends the query to the CWB API
• Receives result objects
• Reformats them into TEI KWIC (P5)
• Returns the result as an HTTP reply
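The reformatting step from CGI parameters to a CQP query can be sketched as follows. This is only an illustration in Python, not the Perl code used by Menota: the parameter names `word` and `lemma` are hypothetical form fields, and the sketch simply rejects values containing a double quote rather than attempting CQP string escaping. The output shape follows the query shown in the TEI KWIC header, `[word="Sæm.*"]`.

```python
from typing import Dict, List


def build_cqp_query(params: Dict[str, List[str]]) -> str:
    """Turn parsed CGI parameters into a single-token CQP query.

    "word" and "lemma" are hypothetical parameter names; the real
    Menota form fields are not given in the slides.
    """
    constraints = []
    for attribute in ("word", "lemma"):
        for value in params.get(attribute, []):
            if '"' in value:
                # Simplification: refuse rather than escape quotes.
                raise ValueError("double quotes are not supported in this sketch")
            constraints.append(f'{attribute}="{value}"')
    if not constraints:
        raise ValueError("no search constraints given")
    # Several constraints on the same token are combined with "&".
    return "[" + " & ".join(constraints) + "]"


query = build_cqp_query({"word": ["Sæm.*"]})
combined = build_cqp_query({"word": ["Sæm.*"], "lemma": ["sem"]})
```

Here `query` holds the single-constraint form and `combined` restricts the same token by both word form and lemma.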
Corpus (CWB)
• Uses a UTF-8 aware CWB 3.2
• Moderate size (currently about 1 million tokens)
• Currently used only via CGI
The query format
• Search submitted through standard web form
• Form variables captured by Cocoon
• New CGI search built by Cocoon sitemap
• The CGI format (parameters and arrays of parameters) sent to Perl
• Local query format (semantics not based on any standard)
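As an illustration of "parameters and arrays of parameters", the sketch below encodes a form as a CGI query string and decodes it again, using Python's standard library. The field names `word` and `text` are assumptions made for the example; the actual Menota parameter names are not stated in the slides.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical form content: one single-valued parameter and one
# array-valued parameter (a list of manuscripts to search in).
form = {"word": "Sæm.*", "text": ["AM-63-fol", "HolmPerg-17-4to"]}

# doseq=True emits one key=value pair per list element, which is how
# an array of parameters is usually carried in a CGI query string.
query_string = urlencode(form, doseq=True)

# On the receiving side every parameter comes back as a list of values.
parsed = parse_qs(query_string)
```

Non-ASCII characters such as `æ` are percent-encoded as UTF-8 on the way out and decoded on the way in, which matters for a corpus of medieval Nordic texts.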
TEI KWIC header (needs more work)
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title type="main">Search result from Menota corpus</title>
      <title type="sub">Searched for [word="Sæm.*" ] and had 28 hits.</title>
    </titleStmt>
    <publicationStmt>
      <p>For internal system use in Menota system</p>
    </publicationStmt>
    <sourceDesc>
      <p>Machine generated based on Menota corpus. More information about Menota can be found on the <ref target="http://www.menota.org/">Menota webpage</ref>.</p>
    </sourceDesc>
    <editionStmt>
      <ab type="searchWord">Sæm.*</ab>
      <ab type="numHits">28</ab>
    </editionStmt>
  </fileDesc>
  <encodingDesc>
    <tagsDecl>
      <namespace name="http://www.tei-c.org/ns/1.0">
        <tagUsage gi="w">The attribute n is used for the identification of a word within the Menota file from which it is retrieved.</tagUsage>
      </namespace>
    </tagsDecl>
  </encodingDesc>
</teiHeader>
TEI KWIC body
<body>
  <div type="corpus">
    <p>CGI hits: 10</p>
    <p>Last: 10</p>
    <p>Verkdel: </p>
    <list>
      <item>
        <ref target="AM-63-fol"/>
        <w ...>Gothormr</w>
        <w ...>ſonr</w>
        <w ...>Haralds</w>
        <w ...>flettis</w>
        <w ...>oc</w>
        <w type="keyword" … > Sæmundr</w>
        <w ...>húsfreyia</w>
        <w ...>hann</w>
        <w ...>atti</w>
        <w ...>Jngibiorgu</w>
        <w ...>dóttor</w>
      </item>
      <item><ref target="AM-63-fol"/> ...</item>
      <item><ref target="HolmPerg-17-4to"/> ...</item>
    </list>
  </div>
</body>
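A consumer of this body can split each hit into left context, keyword and right context. The following Python sketch shows one way to do so under some assumptions: the sample is a cut-down copy of the structure above, the `n` values are invented placeholders, and the elements are left without a namespace for brevity (a real reply would presumably be in the TEI namespace).

```python
import xml.etree.ElementTree as ET

# Cut-down sample modelled on the TEI KWIC body; the n values are
# placeholders and the TEI namespace is omitted for brevity.
SAMPLE = """<body><div type="corpus"><list>
  <item>
    <ref target="AM-63-fol"/>
    <w n="w1">flettis</w>
    <w n="w2">oc</w>
    <w type="keyword" n="w3"> Sæmundr</w>
    <w n="w4">húsfreyia</w>
    <w n="w5">hann</w>
  </item>
</list></div></body>"""


def concordance_lines(xml_text):
    """Return (source, left context, keyword, right context) per hit."""
    lines = []
    for item in ET.fromstring(xml_text).iter("item"):
        ref = item.find("ref")
        words = item.findall("w")
        hits = [i for i, w in enumerate(words) if w.get("type") == "keyword"]
        if ref is None or not hits:
            continue  # skip items without a source or a keyword
        k = hits[0]
        text = [(w.text or "").strip() for w in words]
        lines.append((ref.get("target"),
                      " ".join(text[:k]), text[k], " ".join(text[k + 1:])))
    return lines


for source, left, keyword, right in concordance_lines(SAMPLE):
    print(f"{source}: {left} [{keyword}] {right}")
```

The tuple form keeps the link back to the source manuscript next to each concordance line, which is what the `ref` element is there for.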
The w element
<w type="keyword" n="w49407" lemma="sem"
   me:msaX="CU" me:msaI="" me:msaG="" me:msaN="" me:msaC="" me:msaS="" me:msaR=""
   me:msaP="" me:msaT="" me:msaM="" me:msaV="" me:msaF="" me:msaE="" me:msaY="IN"
   context="[TEI][text xml:lang='onw'][body][div org='uniform' part='N' sample='complete' type='chapter'][p]">Sæm</w>
<w n="w49328" lemma="félagskapr"
   me:msaX="NC" me:msaI="" me:msaG="M" me:msaN="S" me:msaC="D" me:msaS="I" me:msaR=""
   me:msaP="" me:msaT="" me:msaM="" me:msaV="" me:msaF="" me:msaE="" me:msaY=""
   context="[TEI][text xml:lang='onw'][body][div org='uniform' part='N' sample='complete' type='chapter'][p]">fælagskap</w>
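Most of the `me:msa*` attributes on a given word are empty, so a client will usually want only the filled ones. The Python sketch below collects them from a shortened copy of the second `w` element. The namespace URI bound to the `me:` prefix is an assumption, since the slides do not show the declaration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the me: prefix (not shown in the slides).
ME_NS = "http://www.menota.org/ns/1.0"

# Shortened copy of the second w element, with the prefix declared.
SAMPLE = (
    f'<w xmlns:me="{ME_NS}" n="w49328" lemma="félagskapr" '
    'me:msaX="NC" me:msaI="" me:msaG="M" me:msaN="S" '
    'me:msaC="D" me:msaS="I" me:msaR="">fælagskap</w>'
)


def filled_msa(w):
    """Return the non-empty me:msa* attributes of a w element."""
    ns_prefix = "{" + ME_NS + "}"
    return {
        name[len(ns_prefix):]: value
        for name, value in w.attrib.items()
        if name.startswith(ns_prefix + "msa") and value
    }


w = ET.fromstring(SAMPLE)
analysis = filled_msa(w)
```

For this sample, `analysis` contains only the five attributes that carry a value; `lemma` and `n` are read separately with `w.get("lemma")` and `w.get("n")`.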
Serving external systems
• Currently, searches must be in CGI format to get a TEI KWIC reply
• This can be used by external systems
• Would like a better format for searches
• Standardised TEI KWIC format as part of P5?
• Wider interoperability: export to Open Annotation format? To other formalisms?
Using external systems
• If other systems replied in TEI KWIC, we could integrate them in our searches
• Must define a merge operation on TEI KWIC
• Include a proposal for a TEI KWIC format in the guidelines?
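The merge operation is left undefined in the slides. Purely as a starting point for discussion, the Python sketch below shows one possible definition: append the hits of a second TEI KWIC result to the first and drop duplicates, where two hits count as the same if they share the `ref` target and the `n` value of the keyword. The duplicate criterion, the function names and the sample documents (including the identifier `w12`) are all assumptions, and namespaces are omitted.

```python
import xml.etree.ElementTree as ET


def hit_key(item):
    """Identify a hit by (source text, keyword id); an assumed criterion."""
    ref = item.find("ref")
    keyword = item.find("w[@type='keyword']")
    return (ref.get("target") if ref is not None else None,
            keyword.get("n") if keyword is not None else None)


def merge_kwic(first_xml, second_xml):
    """Append hits from the second result to the first, skipping duplicates."""
    first = ET.fromstring(first_xml)
    second = ET.fromstring(second_xml)
    target = first.find(".//list")
    if target is None:
        raise ValueError("first document has no <list> of hits")
    seen = {hit_key(item) for item in target.findall("item")}
    for item in second.iter("item"):
        key = hit_key(item)
        if key not in seen:
            seen.add(key)
            target.append(item)
    return first


# Invented sample results; "w12" is a placeholder identifier.
A = ('<body><div><list><item><ref target="AM-63-fol"/>'
     '<w type="keyword" n="w49407">Sæm</w></item></list></div></body>')
B = ('<body><div><list><item><ref target="AM-63-fol"/>'
     '<w type="keyword" n="w49407">Sæm</w></item>'
     '<item><ref target="HolmPerg-17-4to"/>'
     '<w type="keyword" n="w12">Sæmundr</w></item></list></div></body>')

merged = merge_kwic(A, B)
merged_keys = [hit_key(item) for item in merged.iter("item")]
```

A real definition would also have to decide how to combine the headers (search word, hit counts) and how to order hits from different systems; this sketch leaves the first header untouched and keeps arrival order.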
The TEI KWIC document
• Storing and using the concordances over time
• Well documented link back to sources for each word
• TEI KWIC document returned to user or available for download?
• Versioning?
• Publishing from TEI KWIC
• WHY?
Thank you! http://www.menota.org/