
Normalization pre-processing procedures for Newsgroup corpora

Learn about pre-processing techniques for Newsgroup corpora, including tokenization and markup strategies. Explore the challenges and solutions in handling informal text, spam, and non-textual elements.



  1. Normalization pre-processing procedures for Newsgroup corpora Simona Colombo

  2. Focus points: • Pre-processing standard operations • Markup specification • Tokenization • Strategy approach • Database resources

  3. Machine-readable definition “a collection of written or spoken material in machine-readable form, assembled for the purpose of studying linguistic structures, frequencies, etc.” “There has always been a linguistics based on the examination of linguistic materials, even very copious ones, but by corpus linguistics we mean today that branch of linguistics which processes data coming from large sets of texts stored on computer-readable media. It is therefore a linguistics of electronic corpora [...]” (Marello 1996, p. 167)

  4. Pre-processing standard operations To define a text, even one already in machine-readable form, as a corpus, it is mandatory to apply tokenization and elementary markup to it. Tagging, on the other hand, is not necessary, as raw corpora show: “in corpus-driven linguistics you do not use pre-tagged text, but you process the raw text directly and then the patterns of this uncontaminated text are able to be observed” (Sinclair 2000, p. 36).

  5. Tokenization Tokenization usually looks for the blank characters to the left and to the right of a token, isolating the atomic units that are useful for automatic processing. These tokens often do not match the typographic word, so the difference between token and word is evident. Tokenization is a set of tools used to isolate each token as a significant part of the text.

  6. “The isolation of word-like units from a text is called tokenization” (Grefenstette - Tapanainen 1994, p. 79). “token means the individual appearance of a word in a certain position in a text. For example, one can consider the wordform dogs as an instance of the word dog. And the wordform dogs that appears in, say, line 13 of page 143 as a specific token” (Grefenstette 1999, p. 117; cf. also Mikheev 2003).
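A minimal sketch of this idea in Python (the NUNC tools themselves are Perl scripts; the function and the single rule here are illustrative): punctuation attached to a typographic word becomes its own token, so the token count and the typographic word count differ.

```python
import re

def tokenize(text):
    """Split a text into tokens: word-like units and punctuation marks.

    A typographic word such as "sleeps." yields two tokens, the word
    and the mark, which is exactly the token/word mismatch noted above.
    """
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

tokens = tokenize("The dogs bark, the dog sleeps.")
```
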

  7. Markup Markup is the upper level of information, given not simply by the sequence of tokens. This information can be a form of emphasis, usually marked with a typographic device such as bold or underline, or an editing property, such as a page number or paragraph notation. Punctuation, meanwhile, is isolated at the tokenization level, not at the markup level.

  8. Weakly embedded markup: author, title, chapters, paragraphs, pages, lines... Strongly embedded markup: embedding of a text, typology of prose, poetry

  9. Newsgroup The linguistic resources available on the net are marked by specific problems linked to the informal style of the text and the arbitrary use of linguistic rules. The newsgroups on which the NUNC (Newsgroup UseNet Corpora, www.corpora.unito.it) are built are an excellent sample of this kind of peculiarity. The linguistic approach to newsgroups highlights some relevant “noise” that has to be considered and managed when working on these kinds of texts to build a corpus.

  10. Newsgroup text peculiarities • Non-standard encoding • Formatting mistakes • Several kinds of spelling • Acronyms and abbreviations • Spamming • Non-textual attachments • Repeated text • HTML inclusions • Off-topic posts • Emoticons • Quoting • Web art

  11. We can see the newsgroup registry as non-standard text. In corpus pre-processing there are many studies and procedures for approaching this kind of text, with the intent of marking the non-standard parts so as to avoid crashing the automatic procedures. The tools used to implement the NUNC, the IMS tools for corpus building and query, offer features to isolate parts of the text that will be ignored by the CQP encoding, avoiding trouble or crashes. Some important research, such as CleanEval, goes in this direction: detecting in web pages (considering the web as a corpus resource) the “dirty” and non-textual parts, such as boilerplate, structural annotation and code, so that the core of the corpus can be built including only what can be considered “text”. In the NUNC, instead, it is fundamental to detect and mark many “non-standard” parts or sequences, not in order to ignore them in corpus processing but precisely to work with and manage them inside the corpus itself.

  12. Academic steps • Line numbering • Markup of both editing specifications and meta-information • Tokenization

  13. NUNC-specific steps Filtering • Text selector to control repeated text • Text cleaning (non-standard encoding, attachments, spam) • Emoticon and web-art detection and marking • Posting markup to make the message part of the thread structure evident Formatting • Text reformatting to avoid editing mistakes

  14. The spamming problem By spam we mean the same message sent many times to different groups, off-topic with respect to the newsgroup subject, usually advertisements or sales offers. Even if the mail server undoubtedly runs an anti-spam algorithm, filtering this kind of message is really difficult and expensive. In corpus building it is relevant to isolate and mark these messages, mainly for three reasons: • Spam text is “dirty” in the context of the thread, so in the linguistic approach to the corpus it is important to filter this “false” textual information • Spam messages are usually full of attachments in the form of images, HTML or executable programs, i.e. messages with particular kinds of content that create big trouble for the encoding software • The sheer amount of these messages in the whole corpus (see for example your personal email account…)

  15. Spam filters • Subject filtering: spam messages present the same subject across different newsgroups • Message-ID filtering: spam messages, if sent all together by an automatic sender, have the same message id across the newsgroups • Post length: spam messages with announcements or advertisements are usually really short, so we have filtered messages with fewer than 15 lines • Cross-posting: if a message is addressed to too many newsgroups (we have adopted a configurable limit of 15), it may be a cross-posted spam message
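The four heuristics above can be sketched as a single predicate. This is a hypothetical Python sketch, not the actual NUNC filter (which is a Perl script); the message dictionary structure and the pre-computed subject/id counters are assumptions.

```python
from collections import Counter

def is_spam(msg, subject_counts, id_counts, min_lines=15, max_groups=15):
    """Flag a post using the four heuristics: repeated subject,
    repeated message id, very short body, excessive cross-posting."""
    if subject_counts[msg["subject"]] > 1:         # same subject in several groups
        return True
    if id_counts[msg["message_id"]] > 1:           # same id, automatic sender
        return True
    if len(msg["body"].splitlines()) < min_lines:  # short announcement/advert
        return True
    if len(msg["newsgroups"]) > max_groups:        # cross-posted too widely
        return True
    return False
```

The counters would be built in a first pass over the whole download, so that "repeated" means repeated anywhere in the collection, not just within one group.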

  16. “Dirty” characters There are many reasons why a message may contain “dirty” characters. It is important to find them and mark them as “extra testo” (extra-text) to avoid crashing the encoding program. These characters have been detected following two different approaches: • Re-encoding of the different ASCII formats, taking the different Unicode translations into account • Analysing various “dirty” messages, we have isolated some typical patterns specific to images, HTML and executable programs, and we have designed Perl scripts to find and mark these pieces of a post.
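A possible sketch of the second approach in Python (the patterns and the tag spelling are illustrative assumptions; the real detection uses Perl scripts tuned on actual dirty messages):

```python
import re

# Illustrative patterns for common non-textual material in posts.
DIRTY_PATTERNS = [
    re.compile(r"^begin \d{3} \S+", re.M),        # uuencoded attachment header
    re.compile(r"<[a-zA-Z/][^>]*>"),              # stray HTML tags
    re.compile(r"[A-Za-z0-9+/]{60,}={0,2}"),      # long base64-like runs
]

def mark_extra_testo(post):
    """Wrap suspected non-textual spans so the encoder can skip them.
    Spans are collected first, then inserted from the end of the string,
    so a freshly inserted marker is never re-matched by a later pattern."""
    spans = set()
    for pat in DIRTY_PATTERNS:
        for m in pat.finditer(post):
            spans.add((m.start(), m.end()))
    out = post
    for start, end in sorted(spans, reverse=True):
        out = out[:start] + "<extra_testo>" + out[start:end] + "</extra_testo>" + out[end:]
    return out
```
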

  17. Editing problems Since newsgroups are used from different platforms, the different editors may change the structure of the original messages. So, with the specific aim of also marking the typographic behaviour of the text (identifying the lines, the empty lines and the quoting), it is important to rebuild the original structure of the post with the correct editing information.

  18. Repeated text The newsgroup text format itself is based on the quoting phenomenon, which creates repetition of text. If all the repeated text appeared in the quoted part of a message, it would be simple to isolate it and exclude it from the frequency counts. But the variety of writers and of approaches to this kind of communication allows the possibility, often used, of referring to parts of messages already written without using the quoting strategy. So it becomes really relevant to mark the repeated text to avoid “false” counting information.

  19. Discovering this text via the usual script algorithms based on regular expressions has two big problems. The first is the slowness of the procedure; the second is the “theoretical definition” of the length of the text to be marked as repeated, on which the regular-expression criteria of the algorithm are based: finding a sequence of n words, a sequence of lines or a sequence of paragraphs are really different tasks. So we have adopted a workaround to be sure that this phenomenon is under control, even if we have “lost” some “good” text.

  20. Workaround Considering that all the messages are recorded in order, we have implemented a script that looks inside a thread for the longest message, using only one message per thread, including the quoted part. In this way we try to preserve the greatest quantity of text, while also preserving the textual information given by the question-and-answer mechanism. In implementing the algorithm, particular attention has been paid to filtering the threads that are recorded on the newsgroup server not from the first message but starting from an answer.
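The core of the workaround can be sketched like this (an assumed, simplified message structure; the actual script also handles the threads recorded starting from an answer):

```python
def select_representatives(messages):
    """Keep, for each thread, only the longest recorded message
    (quoted parts included), so that the most text and the
    question/answer information survive in the corpus."""
    best = {}
    for msg in messages:
        tid = msg["thread_id"]
        if tid not in best or len(msg["body"]) > len(best[tid]["body"]):
            best[tid] = msg
    return list(best.values())
```

Because a late answer usually quotes everything before it, picking the longest post per thread tends to retain most of the thread's text exactly once.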

  21. Markup • Formatting markup • Header markup

  22. Emoticons In newsgroup text there is large use of “emoticons”. An emoticon is a symbol or combination of symbols used to convey emotional content in written or message form. It is really important to recognize these sequences of characters and mark them in the right way, not as sequences of punctuation: it is necessary to implement an algorithm that distinguishes a simple “…” at the end of a text from an emoticon.

  23. Pattern sequence The original intent of the parser was to build an emoticon library, growing along with the corpus, in which to store all the emoticons found in the texts. Implementing this solution, we realized that it is really difficult to isolate a fixed set of emoticons, because there are a lot of graphical symbols and a lot of variations on each of them: for example, the simplest one, written as : - ), can also be : - )) or : - ))) and so on. So we have implemented pattern recognition, using regular expressions, to find them.
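A minimal regular-expression sketch of such a pattern (the character classes are illustrative and cover only a few common variants; a real pattern library would be much broader):

```python
import re

# Eyes [:;=], optional nose "-", one or more mouth characters:
# matches :-) as well as the lengthened variants :-)) :-))) and so on.
EMOTICON = re.compile(r"[:;=]-?[)(DP]+")

def mark_emoticons(text):
    """Wrap emoticons so they are not tokenized as punctuation."""
    return EMOTICON.sub(lambda m: "<emo>" + m.group(0) + "</emo>", text)
```

Note that a trailing ellipsis contains no "eyes" character, so “…” is left alone while the open-ended family :-), :-)), :-))) is captured by the same rule.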

  24. Newsgroup tokenizer The tokenizer implements a common part, equal for every language, that marks, using pattern recognition, the following: • Calendar dates • Phone numbers • Genitives, auxiliaries • Abbreviations • E-mail, URL and news addresses • Numerical expressions • Words separated by - , / • Capital letters with dots

  25. The pattern recognition is iterative. The input is a line to parse; step by step, the line is split, applying the different tokenization rules, starting with those for which it is easy to define how to separate the tokens and ending with the more complex patterns. The output is in one-word-per-line format.
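The rule ordering can be sketched as follows (a Python sketch under assumptions: the rule list is a tiny placeholder for the real pattern inventory, and complex units are "protected" first so the simple punctuation split cannot break them):

```python
import re

# Ordered rules: complex multi-character units first, simple splits last.
PROTECT_RULES = [
    ("email", re.compile(r"\S+@\S+\.\S+")),             # e-mail addresses
    ("date", re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")),   # calendar dates
]
PUNCT = re.compile(r"([.,;:!?])")

def tokenize_line(line):
    """Iteratively split one input line, emitting one token per output line."""
    protected = {}
    for name, pat in PROTECT_RULES:          # shield complex units first
        for i, m in enumerate(pat.finditer(line)):
            key = f"\x00{name}{i}\x00"
            protected[key] = m.group(0)
            line = line.replace(m.group(0), key, 1)
    line = PUNCT.sub(r" \1 ", line)          # then the simple punctuation split
    tokens = [protected.get(t, t) for t in line.split()]
    return "\n".join(tokens)
```
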

  26. Italian tokenizer The apostrophe has a specific pattern. It is important to distinguish it from accents, quotes and English interference, considering also a lot of variants in the written form, for example I’m or I’ m or I ‘m…
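The spacing variants around an apostrophe can be normalized with one pattern. This is an illustrative sketch (the target convention, elided form kept attached to its apostrophe and separated from the following word, is an assumption; the examples use an Italian elision):

```python
import re

# word + optional space + straight or curly apostrophe + optional space + word
APOS = re.compile(r"(\w+)\s*['\u2019]\s*(\w+)")

def split_elision(text):
    """Normalize l'amico / l' amico / l 'amico to the single form "l' amico"."""
    return APOS.sub(r"\1' \2", text)
```
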

  27. Spanish tokenizer The changes from the Italian one are the different use of punctuation (also at the start of a text, with ¡ and ¿), the different list of abbreviations, and accented characters within a word and not only at the end.

  28. English tokenizer The main differences are the date pattern definition and the handling of the genitive and auxiliaries.

  29. Metadata and POS

  30. LULCL example (interlinear gloss):
Word:  la  prima  giornato  non  ero    ancore  capace
POS:   ART ADJ    NOM       ADV  VVFIN  ADV     ADJ
Lemma: l-  primo  giornata  non  essere ancore  capace
Query strategy • Metadata: Corpus, Documento, Testo, Autore • Annotation: PoS and Lemma • Markup

  31. Syntactic query Agreement (adjectives ending in -a + nouns ending in -o, in texts by female learners under 18 years old): [pos="ADJ" & word=".+a"] [pos="NOM" & word=".+o"] :: int(_.autore_eta_min) < 18 & match.autore_specifiche="f" • alla fine del diecinovesima secolo , per gli Stati-Uniti . • la sua mano c ' era un sachetto • è in barca e che è con la sua capo • ma non sa che è x la sua capo .

  32. Using the DB as a query tool • Corpus with a large set of metadata • Each piece of information searchable • Query construction with single or combined filters • The metadata present may be integrated with additional information or links to other databases

  33. Header example:
<HEAD> <doc-id> <idN>1589</idN> <charset>ansi</charset> <lingua>italiano</lingua> <aut_NC> Mariane,Schrei </aut_NC> <fornitore> Elisa,Suriani; </fornitore> <trascr> Francesca,Magistro </trascr> <data> 2005,01,25 </data> <luogo> Colonia,DE </luogo> <ist> scuola </ist> <ist_nome> Universitaet zu Koeln </ist_nome> </doc-id> <set-id> <corpus>valico</corpus> <gruppo_num> 19,gn </gruppo_num> <gruppo_nome>amoreKOELN</gruppo_nome> </set-id> <autore> <specifiche>f</specifiche> <eta>19-25</eta> <status>2</status> <annualita>3</annualita> <lingua1>tedesco</lingua1> <lingue>inglese</lingue> <scolarizzazione>un</scolarizzazione> <permanenza>84,Biella</permanenza> <esposizione>am,fam,med;?</esposizione> </autore> <testo> <tipo_forma>c-lib_desc</tipo_forma> <tipo_produzione>priv</tipo_produzione> <topics>...</topics> <keyw>(____,____,____,____,____)</keyw> <test>0</test> <qualita>orig</qualita> <esecuzione>wp</esecuzione> <cap-min>0</cap-min> </testo> <ref> <stel>alessandrasalvan_FT.txt,0,0</stel> <cons>stazione_C.txt</cons> </ref> </HEAD>
CQP: <id N=1589> <charset value=ansi> <lingua value=italiano> <aut_NC value="Mariane,Schrei"> Text </aut_NC> </lingua> </charset> </idN>
Database: text number 1589
<id num=1589>Luca e Paola si hanno litigato . Si litigavano <CORR>litiganno</CORR> spesso ma per loro era una cosa normale . Siccome Paola voleva restare al parco ma Luca voleva ritornare a casa per guardare la TV…..</id>
This corresponds to a header structured as follows: 1589; Mariane,Schrei; Elisa,Suriani; Francesca,Magistro; 2005,01,25; Colonia,DE; scuola; Universitaet zu Koeln; 19,gn; amoreKOELN; f; 19-25; 2; 3; tedesco; inglese,bambara,swahili; sp; 6,Genova; sc,med; love_C.txt
where the different fields are, respectively: identifier; author; provider; transcriber; date; place; institution; institution name; text number within the group; group name; specifics (i.e. m=male, f=female); age; status; year of study; mother tongue; other known languages; schooling; stay in Italy; exposure; assignment; title
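The flat header just described can be turned into a queryable record with a simple split. A sketch under assumptions: the English field names and the exact field inventory here are illustrative (the corpus's own field list is given above), and the trailing fields are simplified to one "title" field.

```python
# Illustrative English names for the flat-header fields described above.
FIELDS = ["id", "author", "provider", "transcriber", "date", "place",
          "institution", "institution_name", "text_number", "group_name",
          "sex", "age", "status", "year", "mother_tongue", "other_languages",
          "schooling", "stay_in_italy", "exposure", "title"]

def parse_header(line):
    """Map one semicolon-separated header line onto named fields."""
    values = [v.strip() for v in line.split(";")]
    return dict(zip(FIELDS, values))

record = parse_header(
    "1589; Mariane,Schrei; Elisa,Suriani; Francesca,Magistro; 2005,01,25; "
    "Colonia,DE; scuola; Universitaet zu Koeln; 19,gn; amoreKOELN; f; 19-25; "
    "2; 3; tedesco; inglese,bambara,swahili; sp; 6,Genova; sc,med; love_C.txt")
```
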

  34. Setting up the search on the DB instead allows the sequential selection of the fields to be searched, and makes it possible to perform the text search with metadata filters.

  35. Extra Lexicon • abaia VER:pres abbaiare • abandonata VER:pper abbandonare • Abasciate NOM ambasciata • abastanza ADV abbastanza • abbandonnavano VER:impf abbandonare • abbasa VER:pres abbassare • abbastanze ADV abbastanza • abbastaza ADV abbastanza • abbati VER:pper avere • abbe VER:remo avere • abbìamo VER:pres avere • abbimo VER:pres avere • abbiso NOM abisso • abbita VER:pres abitare • abbitato VER:pper abitare • abbitava VER:impf abitare • abbitavamo VER:pres abitare • abbitavano VER:pres abitare • Abbitiamo VER:pres abitare • abbito VER:pres abitare • abbituarmi VER:infi abituare • abbituarsi VER:infi abituare • abbituata VER:pper abituare • abbondande ADJ abbondante • abbondanzi NOM abbondanza • abbraccano VER:pres abbracciare • abbracia VER:pres abbracciare • abbraciano VER:pres abbracciare

  36. RIDIRE The RIDIRE portal, created by SILFI with a consortium of Italian universities (Florence, Naples "Federico II", Roma Tre, Turin), intends to provide Italian L2 learners with a network environment for selective access to Italian linguistic usage • A query interface that allows the extraction of frequency lists, patterns and collocates

  37. Metadata -> text linguistics

  38. Thank You • S.colombo@annoluce.net
