A Text Processing Tool for the Romanian Language
Oana Frunza and Diana Inkpen, School of Information Technology and Engineering, University of Ottawa, {ofrunza,diana}@site.uottawa.ca
David Nadeau, Institute for Information Technology, National Research Council of Canada, David.Nadeau@nrc-cnrc.gc.ca
Outline
• BALIE System
• RO-BALIE
• Capabilities
• Improvements
• Evaluation & Results
• Future Work
BALIE: BaseLine Information Extraction
• Multilingual information extraction system
• Language identification
• Tokenization
• Sentence boundary detection
• Part-of-speech tagging for English, French, German, Spanish [1]
• Trainable, open-source Java system
• Uses WEKA [2], a machine learning toolkit
• Uses QTag [3], a language-independent probabilistic part-of-speech tagger
BALIE: BaseLine Information Extraction (cont.)
• Input example:
1.Introduction
Information Extraction (IE) is the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts.
BALIE: BaseLine Information Extraction (cont.)
• Output:
<?xml version="1.0" ?>
<balie>
  <tokenList>
    <s>
      <token type="2" pos="number" canon="1">1</token>
      <token type="1" pos="period" canon=".">.</token>
      <token type="2" pos="noun" canon="introduction">Introduction</token>
    </s>
    <s>
      <token type="2" pos="noun" canon="information">Information</token>
      …
    </s>
  </tokenList>
</balie>
RO-BALIE
• Improvements
  • Easier manipulation of the input and output texts
  • A new tag set that maps onto the numerical tag set used internally by BALIE
  • More information in the output provided by the system
Available at: http://www.site.uottawa.ca/~ofrunza/RO-Balie/RO-Balie.html
RO-BALIE
• Language Identification
  • Character 2-grams (sequences of two characters)
  • Naïve Bayes classifier
  • Overall accuracy: 99.25%
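As a rough illustration of the idea on this slide, the sketch below trains a multinomial Naïve Bayes classifier on character 2-grams. It is not the BALIE/WEKA implementation: the class name `BigramNaiveBayes`, the add-one smoothing, the lowercasing, and the two tiny training texts are all assumptions made for the example.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return the overlapping 2-character sequences of a lowercased text."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


class BigramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character bigrams."""

    def __init__(self):
        self.counts = {}              # language -> Counter of bigrams
        self.doc_counts = Counter()   # language -> number of training texts
        self.vocab = set()            # all bigrams seen in training

    def train(self, language, text):
        grams = char_bigrams(text)
        self.counts.setdefault(language, Counter()).update(grams)
        self.doc_counts[language] += 1
        self.vocab.update(grams)

    def classify(self, text):
        grams = char_bigrams(text)
        total_docs = sum(self.doc_counts.values())
        best_language, best_score = None, float("-inf")
        for language, counter in self.counts.items():
            # Log prior plus add-one smoothed log likelihood of each bigram.
            score = math.log(self.doc_counts[language] / total_docs)
            denom = sum(counter.values()) + len(self.vocab)
            for gram in grams:
                score += math.log((counter[gram] + 1) / denom)
            if score > best_score:
                best_language, best_score = language, score
        return best_language


# Toy training data (assumed for illustration; a real model needs far more text).
model = BigramNaiveBayes()
model.train("Romanian", "apel tirziu si inutil pentru toate ziarele din tara")
model.train("English", "information extraction is the name given to any process")
print(model.classify("un apel pentru ziare"))
```

With realistic amounts of training text per language, the same structure extends to any number of languages by calling `train` once per sample.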
RO-BALIE (cont.)
• Tokenization
  • Compound words are split on “-” and “/”
  • Examples: iat-o, socio-economic
[Table: tokenization results]
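The splitting rule can be sketched in a few lines. The slide does not say whether the separators are kept, so emitting “-” and “/” as their own tokens is an assumption here, as are the function names `split_compound` and `tokenize`.

```python
import re


def split_compound(word):
    """Split a word on '-' and '/', keeping each separator as its own token.

    Keeping the separators is an assumption; the slide only states that
    compound words are split on these characters.
    """
    return [part for part in re.split(r"([-/])", word) if part]


def tokenize(text):
    """Whitespace-split a text, then split each chunk on '-' and '/'."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_compound(chunk))
    return tokens


print(tokenize("iat-o socio-economic"))
# prints ['iat', '-', 'o', 'socio', '-', 'economic']
```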
RO-BALIE (cont.)
• Sentence Boundary Detection
  • Training: 106 hand-tagged English sentences
  • Decision tree classifier
  • Features
    • Beginning of the sentence (first token)
    • Previous token
    • Current token
    • Next token
RO-BALIE (cont.)
• Sentence Boundary Detection (cont.)
  • Feature values
    • Period, Open Quote, Close Quote, New Line, Capital Word, Digit, Abbreviation, etc.
  • A list of 510 Romanian abbreviations
  • Evaluated on Orwell’s novel 1984
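The four features and their values might be computed roughly as follows. This sketch covers only feature extraction, not the decision tree itself; the names `feature_value` and `boundary_features`, the extra values "Other" and "None", the order of the checks, and the five-entry abbreviation set (standing in for the 510-entry list) are all assumptions.

```python
# Illustrative stand-in for the 510-entry Romanian abbreviation list.
ABBREVIATIONS = {"dl", "dna", "dr", "nr", "str"}


def feature_value(token):
    """Map a token to one coarse feature value (order of checks is assumed)."""
    if token is None:
        return "None"            # no token at this position
    if token == ".":
        return "Period"
    if token == "\n":
        return "New Line"
    if token in {"“", "„"}:
        return "Open Quote"
    if token == "”":
        return "Close Quote"
    if token.lower() in ABBREVIATIONS:
        return "Abbreviation"
    if token.isdigit():
        return "Digit"
    if token[:1].isupper():
        return "Capital Word"
    return "Other"


def boundary_features(tokens, sentence_start, i):
    """Features for the candidate boundary at tokens[i]."""
    def at(j):
        return tokens[j] if 0 <= j < len(tokens) else None

    return {
        "first": feature_value(at(sentence_start)),
        "previous": feature_value(at(i - 1)),
        "current": feature_value(at(i)),
        "next": feature_value(at(i + 1)),
    }


example_tokens = ["Dl", ".", "Popescu", "a", "venit", "."]
print(boundary_features(example_tokens, 0, 1))  # period after an abbreviation
print(boundary_features(example_tokens, 0, 5))  # period at the end of the text
```

Feature dictionaries like these would be the rows handed to a decision tree learner, with a hand-tagged "boundary / not boundary" label for each candidate position.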
RO-BALIE (cont.)
• Part-of-speech tagging with the QTag tagger
  • Trained on a corpus of 40 million words of newspaper articles
  • Articles from Romanian newspapers over a 3-year period
  • The training corpus is 98% accurate
  • Our system has a tagset of 14 tags for POS and 30 tags for punctuation
RO-BALIE (cont.)
• Output for: Apel tirziu si inutil NISTORESCU.
<?xml version="1.0" ?>
<balie>
  <Language ID="Romanian">
    <tokenList>
      <Tokens Count="896">
        <s id="1">
          <token type="2" pos="NN" canon="apel">Apel</token>
          <token type="2" pos="ADV" canon="tirziu">tirziu</token>
          <token type="2" pos="CJ" canon="si">si</token>
          <token type="2" pos="NN" canon="inutil">inutil</token>
          <token type="2" pos="PN" canon="nistorescu">NISTORESCU</token>
          <token type="1" pos="PER" canon=".">.</token>
        </s>
      </Tokens>
    </tokenList>
  </Language>
</balie>
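Output in this shape is straightforward to consume downstream. The sketch below reads the sample above with Python's standard `xml.etree.ElementTree`; the element and attribute names come from the slide, while the variable names and the choice of Python are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# The sample output from the slide, copied as-is.
SAMPLE = """<?xml version="1.0" ?>
<balie>
  <Language ID="Romanian">
    <tokenList>
      <Tokens Count="896">
        <s id="1">
          <token type="2" pos="NN" canon="apel">Apel</token>
          <token type="2" pos="ADV" canon="tirziu">tirziu</token>
          <token type="2" pos="CJ" canon="si">si</token>
          <token type="2" pos="NN" canon="inutil">inutil</token>
          <token type="2" pos="PN" canon="nistorescu">NISTORESCU</token>
          <token type="1" pos="PER" canon=".">.</token>
        </s>
      </Tokens>
    </tokenList>
  </Language>
</balie>"""

root = ET.fromstring(SAMPLE)

# Detected language and the token count reported by the system.
language = root.find("Language").get("ID")
token_count = root.find("Language/tokenList/Tokens").get("Count")

# (surface form, POS tag) pairs, in document order.
tagged = [(token.text, token.get("pos")) for token in root.iter("token")]

print(language, token_count)
for surface, pos in tagged:
    print(surface, pos)
```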
RO-BALIE (cont.)
• Future Work
  • Use machine learning for the tokenization task
  • Add new services: morphological analysis, named entity recognition, etc.
  • Add more specific information for each supported language
RO-BALIE (cont.)
• References
1. http://balie.sourceforge.net/index.html
2. http://www.cs.waikato.ac.nz/~ml/weka/
3. http://www.english.bham.ac.uk/staff/omason/software/qtag.html
RO-BALIE: http://www.site.uottawa.ca/~ofrunza/RO-Balie/RO-Balie.html
THANK YOU!