Rencontres TEI Council Lyon 2009
Serge Heiden
ICAR Laboratory / Lyon University
slh@ens-lsh.fr
Council, ENS-LSH, Lyon (France), 1 April 2009
Context (1/2)
• Project objective (2007-2009): to develop an open-source software platform for Textometry analysis of textual data
• Partners:
  • Univ. of Lyon (lead) [Weblex]
  • Univ. of Nice [Hyperbase]
  • Univ. of Franche-Comté [Diatag]
  • Univ. of Paris 3 [Lexico]
  • Univ. of Oxford [Xaira]
  • Univ. of Montréal [Sato]
• Web sites:
  • http://textometrie.ens-lsh.fr (project site)
  • http://textometrie.sourceforge.net (dev site)
• And others:
  • Univ. of Chicago [PhiloLogic]
Context (2/2)
• Textometry methodology
  • TEI-encoded and NLP-enriched textual data analysis
  • Qualitative data analysis
    • Deep text search engine, KWIC concordances
    • Hypertextual data rendering and navigation
  • Quantitative data analysis
    • Factorial analysis, classification, specificity
    • N-gram analysis, cooccurrence, collocation, burst
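
The two families of analysis above can be illustrated with a minimal Python sketch: a KWIC (keyword in context) concordance on the qualitative side and an n-gram count on the quantitative side. This is not the platform's implementation. The function names `kwic` and `ngram_counts`, the whitespace tokenization and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter


def kwic(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each occurrence of keyword."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


def ngram_counts(tokens, n=2):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Hypothetical sample text, tokenized naively on whitespace.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

for left, hit, right in kwic(tokens, "sat", width=2):
    print(f"{left:>10} | {hit} | {right}")

print(ngram_counts(tokens, 2).most_common(3))
```

A real corpus would be tokenized from the TEI markup rather than by whitespace, and the concordance would be computed by the search engine rather than by a linear scan.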
TEI Role and Usage
• Open-source contract between data and software
• Textometry point of view for data input from TEI:
  • Textual dimensions (main language, secondary language, cited text, out of text: comments, notes, titles…) <index>
  • Lexical units (words, phrases…) and their properties (pos, lemma…) <w>
  • Contextual units (sentence, verse, chapter, text…) and their properties (language, number, domain, genre…) <s>
  • Contrasts between units
  • Structural units (navigation: physical, i.e. page; logical) <pb/>
  • References (unit coordinates based on their properties)
  • Rendering (device, segmentation, style)
  • Alignment (between two corpora)
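
As a rough illustration of the lexical and contextual units listed above, the following Python sketch reads <w> elements and their properties from inside <s> elements of a small TEI fragment, using the standard-library `xml.etree.ElementTree` module. The sample document, the attribute names `lemma` and `pos`, the use of `n` for sentence numbers and the function name `lexical_units` are assumptions for this example; actual corpora may encode these properties differently.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace, mapped to a prefix for ElementTree path queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample: two sentences separated by a page break.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>
      <s n="1"><w lemma="the" pos="DET">The</w> <w lemma="cat" pos="NOUN">cats</w> <w lemma="sleep" pos="VERB">sleep</w></s>
      <pb n="2"/>
      <s n="2"><w lemma="they" pos="PRON">They</w> <w lemma="dream" pos="VERB">dream</w></s>
    </p>
  </body></text>
</TEI>"""


def lexical_units(xml_text):
    """Collect each <w> with its properties and the number of its enclosing <s>."""
    root = ET.fromstring(xml_text)
    units = []
    for s in root.iterfind(".//tei:s", TEI_NS):
        for w in s.iterfind("tei:w", TEI_NS):
            units.append({
                "form": w.text,
                "lemma": w.get("lemma"),
                "pos": w.get("pos"),
                "s": s.get("n"),
            })
    return units


for unit in lexical_units(SAMPLE):
    print(unit)
```

The sketch only looks at <w> elements that are direct children of <s>; words nested inside other elements (for example inside a name or a correction) would need a descendant query and a policy for the transversal elements.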
Discussion (1/2): Textometry related TEI element types (BFM: A. Lavrentiev)
• Tokenize words (segment + value)
  • >= : expan|note|name|s
  • = : w|abbr|num
  • < : c|ex
• Segment sentences (segment + value)
  • > : TEI|text|front|body|div|head|trailer|p|ab|sp|speaker|list
  • >~ : q|quote|item
• Transversal:
  • ~ : choice|corr|sic|add|del|reg|orig|foreign|hi|title|supplied|subst|damage|pb|lb|milestone|gap
• Meta: note, teiHeader
• Primary linguistic content of a text: index ?
• NLP results: specify stand-off
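
The element classes above could be held in a simple lookup table that a tokenizer or sentence segmenter consults for each element it meets. The Python sketch below copies the element lists from the slide as they stand. The reading of the relation symbols (">=" as "contains one or more words", "=" as "is a word", "<" as "is part of a word", ">" as "contains sentences"), the names of the tables and the `classify` function are assumptions, not something the slide states.

```python
# Element lists copied from the slide; the keys are the slide's relation symbols.
WORD_LEVEL = {
    ">=": "expan|note|name|s",
    "=": "w|abbr|num",
    "<": "c|ex",
}
SENTENCE_LEVEL = {
    ">": "TEI|text|front|body|div|head|trailer|p|ab|sp|speaker|list",
    ">~": "q|quote|item",
}
TRANSVERSAL = (
    "choice|corr|sic|add|del|reg|orig|foreign|hi|title|supplied|"
    "subst|damage|pb|lb|milestone|gap"
)


def build_index(table):
    """Invert {relation: 'a|b|c'} into {element name: relation}."""
    return {name: rel for rel, names in table.items() for name in names.split("|")}


WORD_INDEX = build_index(WORD_LEVEL)
SENTENCE_INDEX = build_index(SENTENCE_LEVEL)
TRANSVERSAL_SET = set(TRANSVERSAL.split("|"))


def classify(name):
    """Report how an element relates to word and sentence segmentation."""
    return {
        "word": WORD_INDEX.get(name),          # None if not listed for tokenization
        "sentence": SENTENCE_INDEX.get(name),  # None if not listed for segmentation
        "transversal": name in TRANSVERSAL_SET,
    }


for name in ("w", "ex", "p", "quote", "pb"):
    print(name, classify(name))
```

Keeping the classes as data rather than as code makes it easy to revise them while the list of element types is still under discussion.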
Discussion (2/2): Software related information
• Bind software parameters to TEI texts
  • meta.xml file of the ODT format
  • corpus_parameters.xml of Xaira software
=> external pointer in teiHeader (like image or audio files) ?