6th International Conference on Language Resources and Evaluation (LREC 2008)
A Suite to Compile and Analyze an LSP Corpus
Rogelio Nazar – Jorge Vivaldi – M. Teresa Cabré
{rogelio.nazar; jorge.vivaldi; teresa.cabre}@upf.edu
Introduction
This system (JAGUAR) is a set of tools for compiling and exploring an LSP corpus from the web: http://jaguar.iula.upf.edu
Usage examples:
• Terminology extraction
• Bilingual lexicon extraction
• Neologism extraction
Architecture: the system is divided into two main modules:
• Compilation of an LSP corpus from the web
• Analysis of the corpus with statistical techniques
Module 1: Compilation of an LSP corpus from the web
• Document retrieval by querying search engines
• Classification of the collection on the basis of two axes:
  • Degree of relevance to the topic
    • Possibility of corpus tuning with user feedback
  • Degree of specialization of the document
    • Structure of the document (abstract, introduction, etc.)
    • System for bibliographical references, etc.
The final classification results from the combination of these factors.
Module 1: Compilation of an LSP corpus from the web
Classification by degree of relevance to the topic: co-occurrence graphs
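The slides do not show how JAGUAR builds or uses its co-occurrence graphs, so the following is only a minimal sketch of the general idea, under stated assumptions: the words that most often share a document with the query term form a profile, and a document is scored by how much of that profile it contains. The function names, the tiny stopword list, and the scoring rule are all hypothetical and not taken from the system.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "with"}

def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def cooccurrence_profile(docs, term, top_n=4):
    """Return the top_n words that share a document with `term` most often."""
    term_tokens = set(tokenize(term))
    counts = Counter()
    for doc in docs:
        tokens = set(tokenize(doc))
        if term_tokens <= tokens:  # the document contains every word of the term
            counts.update(tokens - term_tokens)
    return [word for word, _ in counts.most_common(top_n)]

def relevance(doc, profile):
    """Score a document by the fraction of profile words it contains."""
    if not profile:
        return 0.0
    tokens = set(tokenize(doc))
    return sum(word in tokens for word in profile) / len(profile)

docs = [
    "Spastic diplegia is a form of cerebral palsy affecting muscle tone",
    "Cerebral palsy includes spastic diplegia with increased muscle tone",
    "Spastic diplegia band releases new album",
]
profile = cooccurrence_profile(docs, "spastic diplegia")
ranking = sorted(docs, key=lambda d: relevance(d, profile), reverse=True)
```

In this toy collection the third document mentions the term but shares none of its usual neighbours, so it ends up last in `ranking`.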
Module 1: Compilation of an LSP corpus from the web
Evaluation of the document classification: cumulative precision in the ranking of documents with the term "spastic diplegia".
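Cumulative precision over a ranking can be computed as in the sketch below, assuming each ranked document has a manual relevance judgment. The precision at rank k is the number of relevant documents among the first k, divided by k. The function name and the sample judgments are illustrative and are not data from the evaluation.

```python
def cumulative_precision(ranked_labels):
    """Return precision after each rank: relevant documents so far / rank."""
    hits = 0
    curve = []
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        hits += bool(is_relevant)
        curve.append(hits / rank)
    return curve

# Hypothetical judgments for the top four documents of a ranking.
curve = cumulative_precision([True, True, False, True])
```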
Module 1: Compilation of an LSP corpus from the web
Evaluation of the document classification: precision and recall for the experiments.
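For reference, precision and recall follow their standard definitions: precision is the share of selected documents that are relevant, and recall is the share of relevant documents that were selected. The sketch below assumes documents are identified by hypothetical string IDs and does not reproduce the experimental figures.

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) for a set of selected document IDs."""
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical IDs: four documents selected, three truly relevant.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
```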
Module 1: Compilation of an LSP corpus from the web
Evaluation of the document classification: probability distribution of precision as a random variable (performance of 10,000 random classifiers).
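A random baseline of this kind can be simulated as sketched below. The assumption is that each random classifier selects a fixed number of documents uniformly at random from the collection, and the precision of each selection is recorded. The collection size, number of relevant documents, selection size, and seed are made-up values for illustration only.

```python
import random
from collections import Counter

def random_precision_distribution(n_docs, n_relevant, n_selected,
                                  trials=10_000, seed=0):
    """Estimate the distribution of precision over random classifiers."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    labels = [True] * n_relevant + [False] * (n_docs - n_relevant)
    counts = Counter()
    for _ in range(trials):
        picked = rng.sample(labels, n_selected)  # one random classifier
        counts[sum(picked) / n_selected] += 1
    return {precision: count / trials
            for precision, count in sorted(counts.items())}

# Hypothetical setting: 100 documents, 30 relevant, 20 selected per classifier.
dist = random_precision_distribution(100, 30, 20)
mean_precision = sum(p * prob for p, prob in dist.items())
```

The mean of the simulated distribution should sit close to the share of relevant documents in the collection (0.3 in this made-up setting), which gives a reference point for judging how far a real classifier's precision lies from chance.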
Module 2: Analysis of the corpus with statistical techniques
1. Input: from Module 1 or from a user-compiled corpus
2. Main functions:
  • Measures of vocabulary richness
  • Analysis of sample representativeness
  • Automatic language recognition
  • KWIC search
  • N-gram extraction and sorting
  • Collocation extraction
  • Measures of association
  • Models of term distribution
  • Coefficients for vector comparison
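Several of the functions listed above have simple textbook forms. The sketch below illustrates four of them: type-token ratio as one measure of vocabulary richness, n-gram counting, a KWIC (keyword in context) search, and pointwise mutual information as one measure of association. It does not reflect JAGUAR's actual interface or algorithm choices; every function name here is hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def type_token_ratio(tokens):
    """Vocabulary richness: distinct word forms divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def ngrams(tokens, n):
    """Count all contiguous n-grams in the token list."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def kwic(tokens, keyword, width=2):
    """Return each occurrence of `keyword` with `width` tokens of context."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

def pmi(tokens, w1, w2):
    """Pointwise mutual information (base 2) of the adjacent pair (w1, w2)."""
    unigrams = Counter(tokens)
    joint = ngrams(tokens, 2)[(w1, w2)]
    if joint == 0:
        return float("-inf")  # the pair never occurs together
    p_pair = joint / (len(tokens) - 1)
    p_w1 = unigrams[w1] / len(tokens)
    p_w2 = unigrams[w2] / len(tokens)
    return math.log2(p_pair / (p_w1 * p_w2))

tokens = tokenize("The corpus tool builds a corpus and the corpus tool counts words")
top_bigrams = ngrams(tokens, 2).most_common(3)
concordance = kwic(tokens, "builds")
```

A positive `pmi(tokens, "corpus", "tool")` indicates that the two words appear together more often than their individual frequencies would predict, which is the usual starting point for collocation extraction.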
Conclusions
• We have presented JAGUAR, a set of tools for compiling and exploring an LSP corpus from the web.
• The main characteristics of this suite are the following:
  • It can collect an LSP corpus from the web, ensuring thematic adequacy and the degree of specialization for a given domain.
  • It offers tools to explore such a collection statistically through a friendly interface.
  • It has also been conceived as a library.
  • The original algorithms have been successfully evaluated.
• Its use saves time and effort in the analysis of a corpus and also offers new insights: a perspective on the data that is invisible to the naked eye.
Future Work
• The project is now growing in different directions:
  • Progressive enhancement with new functions and algorithms
  • Conversion into a desktop application