
Oral History: Users and their Scholarly Practices in a Multidisciplinary World

CLARIN-EU, 19-21 September 2018, München. Preprocessing and textometry tools. Florentina Armaselu, Luxembourg Centre for Contemporary and Digital History (C2DH), University of Luxembourg.



Presentation Transcript


  1. CLARIN Oral History: Users and their Scholarly Practices in a Multidisciplinary World CLARIN-EU, 19-21 September 2018, München

  2. CLARIN Preprocessing and textometry tools Florentina Armaselu, Luxembourg Centre for Contemporary and Digital History (C2DH), University of Luxembourg

  3. CLARIN Preprocessing and textometry tools Summary Keywords: pre-process, analyse, interpret, evaluate • Overview of the session • Pre-processing data for textometric analysis • Short introduction (+ demo) to textometric analysis with TXM • A few guidelines on the TXM hands-on experiment

  4. CLARIN Preprocessing and textometry tools Overview of the session Goals • familiarise participants with textometric analysis and the TXM software; • encourage reflection on this type of analysis applied to (oral) history; • collect feedback on the use of language technology in (oral) history research. Session activities (2 hours) • presentation + TXM demo (20 min.); • TXM hands-on experiment (1 hour and 20 min.); • discussion (10 min.); • evaluation (10 min.).

  5. CLARIN Preprocessing and textometry tools Pre-processing data for textometric analysis - Sample corpus • BLACKIMMIGRANTSEN, interview transcriptions from the collection Black immigrants to Britain, 1890-1975, UK Data Archive, Study Number 4936 (Thompson, P.). • 10 interviews, 1973, 1975; interviewees: 2 women, 9 men. • Key topics: arrival in Britain, family, leisure, religion, politics, marriage and children, education, prejudice and race riots, etc. • Transcription format: XML-TEI (metadata + text).

  6. CLARIN Preprocessing and concordance tools Pre-processing data for textometric analysis - Workflow • Metadata: (1) identify speakers and speakers’ roles; (2) clean data (XSLT). • Text: (1) convert to lower case (XSLT); (2) POS tagging + lemmatisation (TreeTagger). • Input: XML-TEI transcriptions; output: transformed XML-TEI transcriptions. • Examples: “… where we used to live in Sprules (?) Road …”; “… she came up here and - to [missing] College …”
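The slide performs the lower-casing step of the text branch with XSLT. As an illustration only, here is a minimal Python stand-in over a toy XML-TEI fragment: the `<u who="…">` markup follows the TEI guidelines, but the sample utterances and speaker IDs are invented, and this is not the XSLT used in the actual workflow.

```python
import xml.etree.ElementTree as ET

# Toy XML-TEI fragment with two utterances (invented sample text).
TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <u who="#INT">Where did you LIVE when you arrived?</u>
    <u who="#SPK1">We used to live near the docks.</u>
  </body></text>
</TEI>"""

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def lowercase_utterances(tei_xml):
    """Return (speaker, lower-cased text) pairs for each <u> element,
    keeping the speaker metadata while normalising the text content."""
    root = ET.fromstring(tei_xml)
    pairs = []
    for u in root.iter(TEI_NS + "u"):
        text = "".join(u.itertext()).lower()
        pairs.append((u.get("who"), text))
    return pairs

for who, text in lowercase_utterances(TEI_SAMPLE):
    print(who, text)
```

In the workflow itself, the lower-cased transcriptions are then passed to TreeTagger for POS tagging and lemmatisation before the TXM import.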

  7. CLARIN Preprocessing and concordance tools Short introduction to textometric analysis with TXM What is textometry? • A methodology for quantitative and qualitative analysis of textual corpora, combining developments in lexicometric and statistical research with corpus technologies (Unicode, XML, TEI, NLP, CQP, R). What is TXM? • An open-source platform (Heiden et al., 2010) for analysing large bodies of texts in various fields of the humanities (history, literature, geography, linguistics, sociology, political science), which makes it possible to: • import from different textual sources, e.g. raw text combined with flat metadata (CSV), raw XML/w + metadata, XML-TEI BFM, and export results as CSV for lists and tables or in graphic formats (SVG, JPEG, etc.) for diagrams; • run NLP tools on the input files during import (e.g. TreeTagger for lemmatisation and POS tagging); • build a sub-corpus or a partition based on metadata (date, author, genre, etc.) or structural units (text, section, etc.) of a corpus; • query for patterns of words and word properties (via the CQP search engine); • build frequency lists, KWIC concordances and co-occurrence scores for words and word properties; • compute specificity scores for words/properties in a sub-corpus or partition, the progression/evolution of patterns, and correspondence factor analysis (CFA).
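To make the KWIC-concordance idea concrete, here is a minimal sketch in plain Python. It is not TXM's implementation (TXM evaluates CQP queries over an indexed corpus); the token list, window size and function name are invented for illustration.

```python
def kwic(tokens, keyword, window=3):
    """Minimal keyword-in-context concordance: for every case-insensitive
    match, collect a (left context, hit, right context) triple."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

# Invented sample sentence, aligned KWIC display.
tokens = "I came to Britain in 1958 and Britain felt very cold".split()
for left, hit, right in kwic(tokens, "britain", window=2):
    print(f"{left:>12} [{hit}] {right}")
```

In TXM the equivalent matches come from CQP patterns such as `[word="Britain"]`, which can also constrain the POS or lemma properties added at import time.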

  8. CLARIN Preprocessing and concordance tools Short introduction to textometric analysis with TXM (screenshots) • Create a sub-corpus and a partition using structural properties • Compute specificity scores and draw diagrams • Build concordances and visualise contexts at the document level • Build queries and look for co-occurrences of words/properties

  9. CLARIN Preprocessing and concordance tools Short introduction to textometric analysis with TXM Specificities - a probabilistic model (Lafon, 1980) based on hypergeometric distribution formulae, which makes it possible to: • study the frequency distribution of words/properties in a (sub-)corpus divided into several parts; • compare the parts in terms of specific (excess/deficit) or basic use of words/properties. Specificity score (see also TXM Manual, 2015: §11.9; Bernard and Bohet, 2017: 68-78) • sign: (+/-) if the observed frequency fi(wk) is greater/smaller than in a “normal” distribution (taking into account the size of part i relative to the whole); • value: order of magnitude, e.g. score = 3 -> probability of the event ~ 1/10^3.
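The score can be sketched numerically with the Python standard library alone. The function below is an illustrative reimplementation of the hypergeometric model described above, not TXM's own code, and the corpus and frequency figures in the example are invented.

```python
import math

def hypergeom_pmf(k, N, F, n):
    """P(X = k): probability of exactly k occurrences of a word in a part
    of n tokens, drawn from a corpus of N tokens containing F occurrences
    of that word in total."""
    return math.comb(F, k) * math.comb(N - F, n - k) / math.comb(N, n)

def specificity(f, N, F, n):
    """Signed specificity score (Lafon 1980): the sign marks excess (+) or
    deficit (-) relative to the expected frequency F*n/N; the absolute
    value is the order of magnitude of the tail probability, so a score
    of 3 corresponds to a probability of about 1/10**3."""
    if f >= F * n / N:  # over-represented: sum the right tail P(X >= f)
        p = sum(hypergeom_pmf(k, N, F, n) for k in range(f, min(F, n) + 1))
        return -math.log10(p)
    # under-represented: sum the left tail P(X <= f), score is negative
    p = sum(hypergeom_pmf(k, N, F, n) for k in range(0, f + 1))
    return math.log10(p)

# Invented figures: a word occurs 12 times in a 1,000-token part of a
# 10,000-token corpus in which it occurs 50 times overall (expected ~5).
print(specificity(12, N=10_000, F=50, n=1_000))  # positive: excess
print(specificity(1, N=10_000, F=50, n=1_000))   # negative: deficit
```

This is what TXM computes for every word of every part of a partition, so that strongly positive or negative scores flag vocabulary characteristic of (or avoided in) a given interview, period or speaker group.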

  10. CLARIN Preprocessing and concordance tools A few guidelines on the TXM experiment Materials: • TXM tutorial; • task descriptions. During the experiment, please pay attention to the following aspects: • the proposed tasks; • hypotheses (and possibly new questions) that may be formulated based on the linguistic phenomena observed in the studied corpus; • the role played by language technology in formulating these hypotheses or new questions, and its potential “added value” (if applicable); • possible limitations, biases, etc. of the approach or data sample; • general reflections on the application of this type of analysis to (oral) history research.

  11. CLARIN Preprocessing and concordance tools References • Bernard, M., Bohet, B. (2017). Littérométrie. Outils numériques pour l’analyse des textes littéraires. Presses Sorbonne Nouvelle. • Heiden, S., Magué, J-P., Pincemin, B. (2010). « TXM : Une plateforme logicielle open-source pour la textométrie - conception et développement ». In Sergio Bolasco, Isabella Chiari, Luca Giuliano (Eds.), Proc. of the 10th International Conference on the Statistical Analysis of Textual Data - JADT 2010 (Vol. 2, pp. 1021-1032). Edizioni Universitarie di Lettere Economia Diritto, Roma, Italy. https://halshs.archives-ouvertes.fr/halshs-00549779/fr/. TXM website: http://textometrie.ens-lyon.fr (accessed May 15, 2018). • Lafon, P. (1980). « Sur la variabilité de la fréquence des formes dans un corpus ». Mots, no. 1, pp. 127-165. http://www.persee.fr/doc/mots_0243-6450_1980_num_1_1_1008. • TEI: Text Encoding Initiative. http://www.tei-c.org/. • TXM User Manual 0.7, June 2015. http://textometrie.ens-lyon.fr/files/documentation/TXM%20Manual%200.7.pdf. • XML: Extensible Markup Language. https://www.w3.org/XML/. • XSLT: Extensible Stylesheet Language Transformations. https://www.w3.org/TR/xslt/all/.

  12. CLARIN Preprocessing and concordance tools Ready for the TXM experiment? ;-)
