Using the Corpógrafo
Belinda Maia & Luís Sarmento
PoloFLUP LINGUATECA
USP workshop

First steps
• Get a username and password
• You will receive one automatically

Working with the Corpógrafo
• Corpógrafo is a suite of integrated tools for INDIVIDUAL or GROUP research
• All research is done ONLINE
• Each username/password = a separate space on our server
• At present, anyone can work with it, using 10 MB of space for FREE
• BUT - you get an empty space + tools + tutorial!

Help Files
• Introdução à utilização do Corpógrafo - um pequeno tutorial (Introduction to using the Corpógrafo: a short tutorial): a tutorial, to be translated into English, describing the whole process of terminology research using the Corpógrafo. Available in PDF.
• Corpógrafo Roadmap: in English and Portuguese, a panoramic view of the Corpógrafo and how it works. Available in PDF.
• The Corpógrafo in Easy Stages: in English and Portuguese, a user's guide to the Corpógrafo and FAQ. Available in PDF.
• Also note: the entry page has a glossary of terms and instructions (PT > EN)

File Manager
Area where each individual or group can:
• upload texts to space on the server
• convert various text formats to .txt
• 'clean' them of unnecessary material
• check tokenization and sentence divisions
• register full information on source, domain and text type
• group, and re-group, texts into corpora

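The "check tokenization and sentence divisions" step can be pictured with a small sketch. This is not the Corpógrafo's own code: the function names `split_sentences` and `tokenize` and the regular expressions are assumptions chosen for illustration, and real texts need more careful rules (abbreviations, numbers, quotes).

```python
import re

def split_sentences(text):
    """Naive sentence division: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive tokenization: words (keeping internal hyphens) or single punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

if __name__ == "__main__":
    sample = "The corpus is small. Is it clean? Yes!"
    for sentence in split_sentences(sample):
        print(tokenize(sentence))
```

Printing the tokens sentence by sentence makes it easy to spot places where the divisions went wrong, which is the kind of check the File Manager is meant to support.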
File Manager
1. Files
> List Files on Server
> Add Files
> Add Files from URL (Experimental!)
2. Corpora
> List Corpora
> Compile New Corpus

EXTEX
• Tool for converting file formats to .txt, at: http://poloclup.linguateca.pt/ferramentas

General corpus analysis
Corpus analysis area:
• Concordancing tools for regular expressions
• at sentence level
• KWIC concordancing
• Collocations
• N-gram tool
• Case-sensitive
• Alphabetical or frequency ordering

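As a rough illustration of KWIC (keyword in context) concordancing with regular expressions, the sketch below returns the left context, the match and the right context for each hit. The function name `kwic` and its `width` and `case_sensitive` parameters are hypothetical and are not taken from the Corpógrafo interface.

```python
import re

def kwic(text, pattern, width=30, case_sensitive=True):
    """Return (left context, match, right context) for every regex match in text."""
    flags = 0 if case_sensitive else re.IGNORECASE
    hits = []
    for m in re.finditer(pattern, text, flags):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits

if __name__ == "__main__":
    text = "The term bank stores each term with its source."
    for left, key, right in kwic(text, r"\bterm\b", width=12):
        # Right-align the left context so the keywords line up in one column
        print(f"{left:>12} [{key}] {right}")
```

Character-based windows are used here for brevity; a sentence-level concordance would instead return the whole sentence containing each match.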
Corpora + TDB (terminology database)
• Choose corpus
• Choose related TDB
= All terms, examples and definitions extracted from the corpus are (semi-)automatically transferred to the TDB
= All metadata on texts in the corpus can be automatically transferred to the TDB

Term extraction
• N-grams
• Unfiltered
• Filtered with restrictions on term in PT, EN, FR, IT, ES, DE
• Filtered with restrictions on term and context in PT, EN, FR, IT, ES, DE
• Singular + plural terms can be combined
• Terms already in the TDB need not appear in the results

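To make the difference between unfiltered and filtered n-grams concrete, here is a minimal sketch. The slides do not say which restrictions the Corpógrafo applies, so the rule used here (drop n-grams that begin or end with a stopword) and the tiny `STOPWORDS` list are assumptions for illustration only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny English stopword list
STOPWORDS = {"the", "of", "a", "an", "in", "and", "is", "to", "for"}

def ngrams(tokens, n):
    """All contiguous n-token sequences, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_terms(text, n=2, filtered=True):
    """Count n-grams; optionally drop those starting or ending with a stopword."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(ngrams(tokens, n))
    if filtered:
        counts = Counter({g: c for g, c in counts.items()
                          if g[0] not in STOPWORDS and g[-1] not in STOPWORDS})
    # Frequency ordering, with alphabetical order breaking ties
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

if __name__ == "__main__":
    text = "The term database stores the term database entries."
    for gram, freq in candidate_terms(text):
        print(freq, " ".join(gram))
```

Each surviving n-gram is only a candidate: its term status still has to be checked by a person against the underlying concordances.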
Term selection from n-grams
• Consultation of list of n-grams
• Check term status of each n-gram via underlying concordances
• Check sources
• Send to TDB

Search for definition candidates
• Already possible via TDB
• Under development
• Research area for Master's (Mestrado) dissertations and research grant holders (bolseiros)

TDB - Terminology database
Databases are designed to be multilingual
• Terms listed alphabetically + language tag
• General data
• Morphological data
• Source metadata: authors, texts, etc.
• Definitions + search for candidates
• Translation equivalents
• Semantic relations

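One way to picture a record holding the kinds of data listed above is the sketch below. The class name `TermEntry` and all its field names are hypothetical; the slides do not describe the actual TDB schema.

```python
from dataclasses import dataclass, field

@dataclass
class TermEntry:
    """Hypothetical multilingual term record; not the real TDB schema."""
    term: str
    language: str                                      # language tag, e.g. "PT" or "EN"
    morphology: str = ""                               # e.g. part of speech, gender, number
    sources: list = field(default_factory=list)        # authors, texts, etc.
    definitions: list = field(default_factory=list)
    translations: dict = field(default_factory=dict)   # language tag -> equivalent term
    relations: list = field(default_factory=list)      # (relation name, related term)

def sort_entries(entries):
    """List terms alphabetically (case-insensitive), then by language tag."""
    return sorted(entries, key=lambda e: (e.term.casefold(), e.language))

if __name__ == "__main__":
    entries = [
        TermEntry("term", "EN", translations={"PT": "termo"}),
        TermEntry("corpus", "EN", relations=[("has part", "text")]),
        TermEntry("Corpus", "PT"),
    ]
    for e in sort_entries(entries):
        print(e.term, e.language)
```

Plain `casefold` ordering does not treat accented letters the way a language-specific collation would, so a multilingual list would need a locale-aware sort key in practice.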
Future developments - general policy
• General testing and improvement
• Development of new ideas or functions, using isomorphic relationships between researchers' needs and our possibilities
• Coordination of individual corpus projects into bigger projects, when possible or necessary