This paper discusses the importance of measuring monolinguality in language corpora and presents a method for quantifying the amount of noise in a corpus affected by foreign language material. Experimental results demonstrate the effectiveness of the proposed method. The paper also explores potential applications of the measurement technique in working with monolingual corpora.
Measuring Monolinguality Chris Biemann NLP Department, University of Leipzig LREC-06 Workshop on Quality Assurance and Quality Measurement for Language and Speech Resources, Genova 27 May 2006
Why Monolinguality? Alien language noise disturbs the statistics used by corpus-based methods: • Language Models, e.g. n-gram models • Lexical Acquisition • Semantic Indexing • Co-occurrence Statistics
What is Monolinguality? • Foreign language sentences should be removed. • Sentences containing only a few foreign language words or phrases, such as movie titles, terminology etc., should remain.
Korean Example • A:Yes. The traffic cop said I had one too many and made me take the sobriety test, but I passed it. B:Lucky you ! • 무인도 표류 소년 25명 통해 인간의 야만성 그려 영국 소설가 윌리엄 골딩의 83년 노벨문학상 수상작을 영화화한 `파리대왕'(Lord of the flies)은 결코 편안하게 감상할 수 있는 영화는 아니다 .
Recall Zipf's Law • It also holds for random samples of words [Figure: top frequent words]
Measuring Monolinguality Given a corpus of language A with x% noise of language B, the amount of noise is measured as follows: • For each of the top frequency words of B, divide its relative frequency in the corpus by its relative frequency in a clean corpus of B. • The amount of noise is the predominant ratio: many ratios will be close to x%.
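The two steps above can be sketched in Python. This is a minimal illustration, not the author's implementation: the function names are hypothetical, and the slides do not say how the "predominant ratio" is read off. Taking the most populated bin of a log-scale histogram of the ratios is an assumption made here.

```python
import math
from collections import Counter
from statistics import median


def frequency_ratios(corpus_tokens, clean_b_tokens, top_n=1000):
    """Ratio of relative frequencies (corpus / clean B) for the top-n words of B."""
    corpus_counts = Counter(corpus_tokens)
    corpus_total = sum(corpus_counts.values())
    b_counts = Counter(clean_b_tokens)
    b_total = sum(b_counts.values())
    ratios = {}
    for word, b_count in b_counts.most_common(top_n):
        if corpus_counts[word] == 0:
            continue  # word not found in the corpus: no ratio to report
        corpus_rel = corpus_counts[word] / corpus_total
        ratios[word] = corpus_rel / (b_count / b_total)
    return ratios


def predominant_ratio(ratios, bins_per_decade=4):
    """Assumed estimator: median of the fullest log10-scale histogram bin."""
    if not ratios:
        raise ValueError("no top-frequency word of B was found in the corpus")

    def bin_of(ratio):
        return math.floor(math.log10(ratio) * bins_per_decade)

    fullest_bin, _ = Counter(bin_of(r) for r in ratios.values()).most_common(1)[0]
    return median(r for r in ratios.values() if bin_of(r) == fullest_bin)


# Toy data: 2% of the corpus tokens are a copy of the clean B sample;
# "in" also occurs in language A, so its ratio is an outlier.
clean_b = ["der"] * 50 + ["die"] * 30 + ["und"] * 15 + ["in"] * 5
corpus = ["the"] * 4400 + ["in"] * 500 + clean_b
ratios = frequency_ratios(corpus, clean_b)
noise_estimate = predominant_ratio(ratios)
```

On this toy data, three of the four ratios fall into the same bin, so the estimate lands near the injected share, while the shared word "in" ends up in a bin of its own.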
The top frequency words of B w.r.t. A • Words that do not occur in language A. Their frequency ratio will be around x%. • Words that are also amongst the highest frequency words of language A and moreover have the same function. Their frequency ratio will be around 1. This group is only present for languages that are very similar to each other. • Words that occur in language A, but in different frequency bands. They are a random sample of words of L and follow a Zipfian distribution. • Words of B that are often used in named entities and titles (such as capitalized stop words). They appear in the corpus of language A more frequently than the expected x% of noise.
Experiment 1 Artificial noise mixtures: injecting alien language material into monolingual corpora • Experiment 1a: Injecting different amounts of German noise into a chunk of the British National Corpus (~20 million words) • Experiment 1b: Injecting 1% noise of Norwegian, Swedish and Dutch into a Danish corpus (~17 million words) For measuring, we used the top 1000 words.
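An artificial noise mixture of this kind could be built as sketched below. The slides do not state whether the noise percentage is counted in words or in sentences, nor how the material was sampled. The token-based share, the sentence-level sampling and the function name `inject_noise` are all assumptions made for illustration.

```python
import random


def inject_noise(clean_sentences, noise_sentences, noise_share, seed=0):
    """Mix noise sentences into a clean corpus until roughly `noise_share`
    of all tokens are noise. Returns (mixture, actual token share of noise)."""
    if not 0 <= noise_share < 1:
        raise ValueError("noise_share must be in [0, 1)")
    rng = random.Random(seed)
    clean_tokens = sum(len(s.split()) for s in clean_sentences)
    # Number of noise tokens needed so that noise / (clean + noise) = noise_share.
    target = noise_share / (1 - noise_share) * clean_tokens

    pool = list(noise_sentences)
    rng.shuffle(pool)
    picked, injected = [], 0
    for sentence in pool:
        if injected >= target:
            break
        picked.append(sentence)
        injected += len(sentence.split())

    mixture = list(clean_sentences) + picked
    rng.shuffle(mixture)  # scatter the noise through the corpus
    return mixture, injected / (clean_tokens + injected)


# Toy usage: aim for about 10% noise tokens.
clean = ["this is clean text"] * 100
noise = ["das ist fremder Text"] * 50
mixture, actual_share = inject_noise(clean, noise, 0.10)
```

Because whole sentences are added, the actual share overshoots the target slightly, which is why the function returns it alongside the mixture.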
Experiment 2 For a collection of web documents (~700 million words from .de domains), we measure the effect of a corpus cleaning method that strips alien language material.

           Before cleaning                         After cleaning
Language   Top-1000 words found   Approx. ratio    Top-1000 words found   Frequency ratio
German     1000                   0.708            1000                   0.946
English    995                    0.126            987                    0.0010
French     924                    0.0398           906                    0.00002
Dutch      995                    0.000891         775                    0.000006
Turkish    642                    0.0000631        562                    0.000006
Conclusion • The measure captures the amount of noise well • Noise can be measured down to a ratio of 10^-5 • Effective: involves 1000 frequency counts per language
Application: Monolingual Corpora • http://corpora.uni-leipzig.de [Screenshot of the corpora website]
Workflow [Diagram with the columns Resources, Techniques and Results] • Similar objects (words, sentences, documents, URLs) • Classification (semantic properties, subject areas, ...) • Combined objects (NE recognition, terminology, ...): determine patterns, extract multi-words • Decomposition • Morphology • Inflection • Translation pairs • Neologisms • Trend Mining • Topic Tracking Further diagram labels: Texts (Web / Newspapers); Dictionaries (Dornseiff, WordNets, Wikipedia, ...); URLs; Crawling; Language detection, Cleaning; Clustering; Classification; Small Worlds; Tools (Co-occurrences etc., POS Tagging); Language + Time; lang. 1, lang. 2, ..., lang. n; Standard Size Corpora; Web Statistics; Classified Objects; Language Statistics
CorpusBrowser Per word: • Frequency • Example sentences • Co-occurrences: left and right neighbours, sentence-based • Co-occurrence graph
Only a few copies left! DVD: • 15 languages • Corpus Browser • Corpora in plain text and database format
Questions?? THANK YOU!