
Oral History: Users and their Scholarly Practices in a Multidisciplinary World

CLARIN-EU, 19-21 September 2018, München. Preprocessing and textometry tools. Florentina Armaselu, Luxembourg Centre for Contemporary and Digital History (C2DH), University of Luxembourg.



Presentation Transcript


  1. CLARIN Oral History: Users and their Scholarly Practices in a Multidisciplinary World CLARIN-EU, 19-21 September 2018, München

  2. CLARIN Preprocessing and textometry tools Florentina Armaselu, Luxembourg Centre for Contemporary and Digital History (C2DH), University of Luxembourg

  3. CLARIN Preprocessing and textometry tools Summary Keywords: pre-process, analyse, interpret, evaluate • Overview of the session • Pre-processing data for textometric analysis • Short introduction (+ demo) to textometric analysis with TXM • A few guidelines on the TXM hands-on experiment

  4. CLARIN Preprocessing and textometry tools Overview of the session Goals • familiarise participants with textometric analysis and the TXM software; • encourage reflection on this type of analysis applied to (oral) history; • collect feedback on the use of language technology in (oral) history research. Session activities (2 hours) • presentation + TXM demo (20 min.); • TXM hands-on experiment (1 hour and 20 min.); • discussion (10 min.); • evaluation (10 min.).

  5. CLARIN Preprocessing and textometry tools Pre-processing data for textometric analysis - Sample corpus • BLACKIMMIGRANTSEN, interview transcriptions from the collection Black immigrants to Britain, 1890-1975, UK Data Archive, Study Number 4936 (Thompson, P.). • 10 interviews, 1973, 1975; interviewees: 2 women, 9 men. • Key topics: arrival in Britain, family, leisure, religion, politics, marriage and children, education, prejudice and race riots, etc. • Transcription format: XML-TEI (metadata + text).

  6. CLARIN Preprocessing and concordance tools Pre-processing data for textometric analysis - Workflow • Metadata: (1) identify speakers and speakers’ roles; (2) clean data (XSLT). • Text: (1) convert to lower case (XSLT); (2) POS tagging + lemmatisation (TreeTagger). • Input: XML-TEI transcriptions; output: transformed XML-TEI transcriptions. • Examples: “… where we used to live in Sprules (?) Road …”; “… she came up here and - to [missing] College …”
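The slide performs the lower-casing step of the text branch with XSLT. As an illustration only, here is a minimal Python stand-in over a toy XML-TEI fragment: the `<u who="…">` markup follows the TEI guidelines, but the sample utterances and speaker IDs are invented, and this is not the XSLT used in the actual workflow.

```python
import xml.etree.ElementTree as ET

# Toy XML-TEI fragment with two utterances (invented sample text).
TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <u who="#INT">Where did you LIVE when you arrived?</u>
    <u who="#SPK1">We used to live near the docks.</u>
  </body></text>
</TEI>"""

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def lowercase_utterances(tei_xml):
    """Return (speaker, lower-cased text) pairs for each <u> element,
    keeping the speaker metadata while normalising the text content."""
    root = ET.fromstring(tei_xml)
    pairs = []
    for u in root.iter(TEI_NS + "u"):
        text = "".join(u.itertext()).lower()
        pairs.append((u.get("who"), text))
    return pairs

for who, text in lowercase_utterances(TEI_SAMPLE):
    print(who, text)
```

In the workflow itself, the lower-cased transcriptions are then passed to TreeTagger for POS tagging and lemmatisation before the TXM import.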

  7. CLARIN Preprocessing and concordance tools Short introduction to textometric analysis with TXM What is textometry? • A methodology for quantitative and qualitative analysis of textual corpora, combining developments in lexicometric and statistical research with corpus technologies (Unicode, XML, TEI, NLP, CQP, R). What is TXM? • An open-source platform (Heiden et al., 2010) for analysing large bodies of texts in various fields of the humanities (history, literature, geography, linguistics, sociology, political science), which makes it possible to: • import from different textual sources, e.g. raw text combined with flat metadata (CSV), raw XML/w + metadata, XML-TEI BFM, and export results as CSV for lists and tables or in graphic formats (SVG, JPEG, etc.) for diagrams; • run NLP tools on the input files during import (e.g. TreeTagger for lemmatisation and POS tagging); • build a sub-corpus or a partition based on metadata (date, author, genre, etc.) or structural units (text, section, etc.) of a corpus; • query for patterns of words and word properties (via the CQP search engine); • build frequency lists, KWIC concordances and co-occurrence scores for words and word properties; • compute specificity scores for words/properties in a sub-corpus or partition, the progression/evolution of patterns, and correspondence factor analysis (CFA).
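To make the KWIC-concordance idea concrete, here is a minimal sketch in plain Python. It is not TXM's implementation (TXM evaluates CQP queries over an indexed corpus); the token list, window size and function name are invented for illustration.

```python
def kwic(tokens, keyword, window=3):
    """Minimal keyword-in-context concordance: for every case-insensitive
    match, collect a (left context, hit, right context) triple."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

# Invented sample sentence, aligned KWIC display.
tokens = "I came to Britain in 1958 and Britain felt very cold".split()
for left, hit, right in kwic(tokens, "britain", window=2):
    print(f"{left:>12} [{hit}] {right}")
```

In TXM the equivalent matches come from CQP patterns such as `[word="Britain"]`, which can also constrain the POS or lemma properties added at import time.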

  8. CLARIN Preprocessing and concordance tools Short introduction to textometric analysis with TXM (screenshots) • Create a sub-corpus and a partition using structural properties • Compute specificity scores and draw diagrams • Build concordances and visualise contexts at the document level • Build queries and look for co-occurrences of words/properties

  9. CLARIN Preprocessing and concordance tools Short introduction to textometric analysis with TXM Specificities - a probabilistic model (Lafon, 1980) based on hypergeometric distribution formulae, which makes it possible to: • study the frequency distribution of words/properties in a (sub-)corpus divided into several parts; • compare the parts in terms of specific (excess/deficit) or basic use of words/properties. Specificity score (see also TXM Manual, 2015: §11.9; Bernard and Bohet, 2017: 68-78) • sign: (+/-) if the observed frequency fi(wk) is greater/smaller than in a “normal” distribution (taking into account the size of part i relative to the whole); • value: order of magnitude, e.g. score = 3 -> probability of the event ~ 1/10^3.
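The score can be sketched numerically with the Python standard library alone. The function below is an illustrative reimplementation of the hypergeometric model described above, not TXM's own code, and the corpus and frequency figures in the example are invented.

```python
import math

def hypergeom_pmf(k, N, F, n):
    """P(X = k): probability of exactly k occurrences of a word in a part
    of n tokens, drawn from a corpus of N tokens containing F occurrences
    of that word in total."""
    return math.comb(F, k) * math.comb(N - F, n - k) / math.comb(N, n)

def specificity(f, N, F, n):
    """Signed specificity score (Lafon 1980): the sign marks excess (+) or
    deficit (-) relative to the expected frequency F*n/N; the absolute
    value is the order of magnitude of the tail probability, so a score
    of 3 corresponds to a probability of about 1/10**3."""
    if f >= F * n / N:  # over-represented: sum the right tail P(X >= f)
        p = sum(hypergeom_pmf(k, N, F, n) for k in range(f, min(F, n) + 1))
        return -math.log10(p)
    # under-represented: sum the left tail P(X <= f), score is negative
    p = sum(hypergeom_pmf(k, N, F, n) for k in range(0, f + 1))
    return math.log10(p)

# Invented figures: a word occurs 12 times in a 1,000-token part of a
# 10,000-token corpus in which it occurs 50 times overall (expected ~5).
print(specificity(12, N=10_000, F=50, n=1_000))  # positive: excess
print(specificity(1, N=10_000, F=50, n=1_000))   # negative: deficit
```

This is what TXM computes for every word of every part of a partition, so that strongly positive or negative scores flag vocabulary characteristic of (or avoided in) a given interview, period or speaker group.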

  10. CLARIN Preprocessing and concordance tools A few guidelines on the TXM experiment Materials: • TXM tutorial; • task descriptions. During the experiment, please pay attention to the following aspects: • the proposed tasks; • hypotheses (and possibly new questions) that may be formulated based on the linguistic phenomena observed in the studied corpus; • the role played by language technology in formulating these hypotheses or new questions, and its potential “added value” (if applicable); • possible limitations, biases, etc. of the approach or data sample; • general reflections on the application of this type of analysis to (oral) history research.

  11. CLARIN Preprocessing and concordance tools References • Bernard, M., Bohet, B. (2017). Littérométrie. Outils numériques pour l’analyse des textes littéraires. Presses Sorbonne Nouvelle. • Heiden, S., Magué, J-P., Pincemin, B. (2010). « TXM : Une plateforme logicielle open-source pour la textométrie - conception et développement ». In Sergio Bolasco, Isabella Chiari, Luca Giuliano (Eds.), Proc. of the 10th International Conference on the Statistical Analysis of Textual Data - JADT 2010 (Vol. 2, pp. 1021-1032). Edizioni Universitarie di Lettere Economia Diritto, Roma, Italy. https://halshs.archives-ouvertes.fr/halshs-00549779/fr/. TXM website: http://textometrie.ens-lyon.fr (accessed May 15, 2018). • Lafon, P. (1980). « Sur la variabilité de la fréquence des formes dans un corpus ». Mots, no. 1, pp. 127-165. http://www.persee.fr/doc/mots_0243-6450_1980_num_1_1_1008. • TEI: Text Encoding Initiative. http://www.tei-c.org/. • TXM User Manual 0.7, June 2015. http://textometrie.ens-lyon.fr/files/documentation/TXM%20Manual%200.7.pdf. • XML: Extensible Markup Language. https://www.w3.org/XML/. • XSLT: Extensible Stylesheet Language Transformations. https://www.w3.org/TR/xslt/all/.

  12. CLARIN Preprocessing and concordance tools Ready for the TXM experiment? ;-)
