Mining the Web to Create Minority Language Corpora • Rayid Ghani, Accenture Technology Labs - Research • Rosie Jones, Carnegie Mellon University • Dunja Mladenic, J. Stefan Institute, Slovenia
Who Needs a Language Specific Corpus? • Language Technology Applications • Language Modeling • Speech Recognition • Machine Translation • Linguistic and Socio-Linguistic Studies • Multilingual Retrieval
What Corpora are Available? • Explicit, marked up corpora: Linguistic Data Consortium -- 20 languages [Liebermann and Cieri 1998] • Search Engines -- implicit language-specific corpora, European languages, Chinese and Japanese • Excite - 12 languages • Google - 25 languages • AltaVista - 25 languages • Lycos - 25 languages
BUT what about Slovenian? Or Tagalog? Or Tatar? You’re just out of luck!
The Human Solution • Start from Yahoo->Slovenia… • Crawl www.*.si • Search on the web, look at documents, modify query, analyze documents, modify query,… • Repetitive, time-consuming, requires reasonable familiarity with the language
Task • Given: • 1 Document in Target Language • 1 Other Document (negative example) • Access to a Web Search Engine • Create a Corpus of the Target Language quickly with no human effort
Algorithm [Diagram components: Seed Docs, Query Generator, WWW, Language Filter]
[Diagram components: Initial Docs, Word Statistics, Learning, Build Query, Web, Filter, Relevant, Non-Relevant]
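The loop suggested by the diagram can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation: `build_corpus`, `generate_query`, `search`, and `is_target_language` are hypothetical names, and the search engine and language filter are passed in as plain callables so the example stays self-contained.

```python
from typing import Callable, Iterable, List, Set


def build_corpus(
    seed_relevant: List[str],
    seed_non_relevant: List[str],
    generate_query: Callable[[List[str], List[str]], str],
    search: Callable[[str], Iterable[str]],
    is_target_language: Callable[[str], bool],
    max_queries: int = 10,
) -> List[str]:
    """Grow a target-language corpus from seed documents (hypothetical sketch)."""
    relevant = list(seed_relevant)
    non_relevant = list(seed_non_relevant)
    seen: Set[str] = set(relevant) | set(non_relevant)

    for _ in range(max_queries):
        # Build a query from the documents collected so far.
        query = generate_query(relevant, non_relevant)
        for doc in search(query):
            if doc in seen:
                continue  # skip documents that were already classified
            seen.add(doc)
            # The language filter sorts each new document into one of two pools,
            # which feed the next round of query generation.
            if is_target_language(doc):
                relevant.append(doc)
            else:
                non_relevant.append(doc)
    return relevant
```

Both pools grow on every iteration, so later queries are generated from more evidence than the single seed pair.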
Query Generation • Examine the current relevant and non-relevant documents to generate a query likely to find documents that ARE similar to the relevant ones and NOT similar to the non-relevant ones • A query consists of m inclusion terms and n exclusion terms • e.g., +intelligence +web -military
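A query in this form could be assembled as shown below. The helper name `format_query` is hypothetical, and the `+`/`-` prefix syntax is taken from the example on the slide; an actual search engine may expect a different syntax.

```python
from typing import Sequence


def format_query(inclusion: Sequence[str], exclusion: Sequence[str]) -> str:
    """Join m inclusion terms and n exclusion terms into one query string."""
    # Assumption: '+' marks a required term and '-' marks an excluded term.
    parts = [f"+{term}" for term in inclusion]
    parts += [f"-{term}" for term in exclusion]
    return " ".join(parts)


print(format_query(["intelligence", "web"], ["military"]))
# prints: +intelligence +web -military
```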
Query Term Selection Methods • Uniform (UN) – select k words randomly from the current vocabulary • Term-Frequency (TF) – select top k words ranked according to their frequency • Probabilistic TF (PTF) – select k words with probability proportional to their frequency
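The three frequency-based methods could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, word counts are held in a `collections.Counter`, and PTF is assumed to sample without replacement, which the slide does not specify.

```python
import random
from collections import Counter
from typing import List


def select_uniform(counts: Counter, k: int, rng: random.Random) -> List[str]:
    """UN: pick k distinct words uniformly at random from the vocabulary."""
    vocab = sorted(counts)
    return rng.sample(vocab, min(k, len(vocab)))


def select_tf(counts: Counter, k: int) -> List[str]:
    """TF: pick the k most frequent words."""
    return [word for word, _ in counts.most_common(k)]


def select_ptf(counts: Counter, k: int, rng: random.Random) -> List[str]:
    """PTF: pick k distinct words, each draw weighted by word frequency."""
    remaining = dict(counts)
    chosen: List[str] = []
    while remaining and len(chosen) < k:
        words = list(remaining)
        weights = [remaining[w] for w in words]
        pick = rng.choices(words, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # assumption: no word is selected twice
    return chosen
```

TF is deterministic and keeps returning the same words until the counts change, whereas UN and PTF can produce a different query on every call.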
Query Term Selection Methods • RTFIDF – select top k words according to their RTFIDF scores • Odds-Ratio (OR) – select top k words according to their odds-ratio scores • Probabilistic OR (POR) – select k words with probability proportional to their odds-ratio scores
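The slide does not define the odds-ratio score. The sketch below assumes the commonly used form log[P(w|rel)(1 - P(w|non)) / ((1 - P(w|rel)) P(w|non))], with probabilities estimated from smoothed document frequencies. The function names, the whitespace tokenization, and the smoothing constant are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def doc_frequency(docs: Sequence[str]) -> Dict[str, int]:
    """Count how many documents contain each word."""
    df: Dict[str, int] = {}
    for doc in docs:
        for word in set(doc.lower().split()):
            df[word] = df.get(word, 0) + 1
    return df


def odds_ratio_scores(
    relevant: Sequence[str],
    non_relevant: Sequence[str],
    smoothing: float = 1.0,
) -> Dict[str, float]:
    """Log odds-ratio per word; positive means typical of relevant documents."""
    pos_df = doc_frequency(relevant)
    neg_df = doc_frequency(non_relevant)
    scores: Dict[str, float] = {}
    for word in set(pos_df) | set(neg_df):
        # Smoothing keeps both probabilities strictly between 0 and 1.
        p_pos = (pos_df.get(word, 0) + smoothing) / (len(relevant) + 2 * smoothing)
        p_neg = (neg_df.get(word, 0) + smoothing) / (len(non_relevant) + 2 * smoothing)
        scores[word] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring words, ties broken alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

Under this assumption, inclusion terms would come from the highest scores and exclusion terms from the lowest ones.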
Evaluation • Goal: Collect as many relevant documents as possible while minimizing the cost • Cost • Number of total documents retrieved from the Web • Number of distinct queries issued to the search engine • Evaluation Measures • Percentage of retrieved documents that are relevant • Number of relevant documents retrieved per unique query
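The two evaluation measures reduce to simple ratios. The helper names below are hypothetical, and returning 0.0 when nothing has been retrieved or no query has been issued is an assumption, not something the slides state.

```python
def percent_relevant(relevant_retrieved: int, total_retrieved: int) -> float:
    """Percentage of retrieved documents that are in the target language."""
    if total_retrieved == 0:
        return 0.0  # assumption: report 0 before anything has been retrieved
    return 100.0 * relevant_retrieved / total_retrieved


def relevant_per_query(relevant_retrieved: int, unique_queries: int) -> float:
    """Average number of relevant documents gained per distinct query."""
    if unique_queries == 0:
        return 0.0  # assumption: report 0 before any query has been issued
    return relevant_retrieved / unique_queries
```

The first measure tracks cost in documents downloaded, the second tracks cost in search-engine queries.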
Experimental Setup • Language: Slovenian • Initial documents: 1 web page in Slovenian, 1 in English • Search engine: AltaVista
Results – Precision at 3000 [Chart: percentage of target-language documents after 3000 documents retrieved]
Results - Summary • In terms of documents: • For lengths 1-3, Odds-Ratio works best • In terms of queries: • Odds-Ratio is consistently better than others • Long queries are usually very precise but do not result in a lot of documents (low recall)
Further Experiments • Comparison to AltaVista’s “More Like This” • Better performance than AltaVista’s feature • Keywords • Similar results when initializing with keywords instead of documents • Other Languages • Similar results with Croatian, Czech and Tagalog
Conclusions • Successfully able to build corpora for minority languages (Slovenian, Croatian, Czech, Tagalog) using Web search engines • Not sensitive to initial “seed” documents • System and Corpora are/will be available at www.cs.cmu.edu/~TextLearning/CorpusBuilder
Ideas for Future Work • Explore other Term-Selection methods • From Language specific corpus to Topic Specific corpus as an alternative to focused spidering • Finding documents matching a user profile – Personal Agent