The Corpógrafo – Theory and Practice
Belinda Maia & Luís Sarmento
PoloFLUP, LINGUATECA
ABRAPT Mini-curso 30.08.04

A bit of history
• PALC ’97 – ‘Do-it-yourself corpora ... with a little bit of help from your friends!’
• CULT 1998 – ‘Making corpora – a learning process’
• Contrastive linguistics
• Corpora linguistics
• Translation teaching
• General > specific language

A bit of history
• 2000 – First Master’s in Terminology and Translation at FLUP
• PALC 2001 – ‘Training Translators in Terminology and Information Retrieval using Comparable and Parallel Corpora’
• Specialized translation and terminology
• Contact with domain experts
• Importance of IT
• Need for technical help for more ambitious students!

A bit of history
• LREC 2002 – ‘Corpora for terminology extraction – the differing perspectives and objectives of researchers, teachers and language services providers’
• 2002 – Second Master’s in Terminology and Translation at FLUP
• Plea for help to Diana Santos
• October 2002 – LINGUATECA - Polo FLUP

LINGUATECA
• See http://www.linguateca.pt
• Leader > Diana Santos (SINTEF – Oslo)
• Objective – to create resources and tools for the computational processing of Portuguese
• Poles at Oslo, Lisbon, Braga and Porto
• Porto – Polo CLUP/FLUP

Polo CLUP/FLUP
• See http://www.linguateca.pt/poloclup/
• On-line suite of corpora tools to work with comparable corpora, with emphasis on bilingual research
• Focus on special domains
• Construction of terminology databases, ontologies and domain models
• Corpógrafo

Polo CLUP/FLUP
• See http://www.linguateca.pt/poloclup/
• General help in constructing resources specific to the needs of FLUP/CLUP
  • For researchers, teachers and students
  • For teaching methodology at FLUP
• BNC & Reuters corpora on intranet
• A small ‘chat’ corpus

More history
• 2003 – Poster of the GC at CL2003
• 2003 – ‘What are comparable corpora?’ CL2003
• 2003 – Experimentation with evaluation of Machine Translation
• 2003 – Experimentation with GC
• 2003 – Third Master’s in Terminology and Translation at FLUP

GC – Integrated Web Environment for Corpora Linguistics

What is GC?
GC is a Web tool being developed at Linguateca/CLUP that aims to provide a comprehensive work environment for Corpora-Based Linguistic Research. GC allows users to:
• access several Corpora tools from a single entry point using a regular web browser
• access and query generic Corpora (BNC, Reuters, COMPARA, CETEMPúblico)
• build personal simple, parallel and comparable Corpora from text files (PDF, PS, Word, HTML, TXT)
• use several (on-line/off-line) tools with their personal Corpora (statistics, POS-taggers, filters, etc.)
• communicate and exchange results with other users

Internet Integration
GC provides seamless integration with the World Wide Web, allowing users to:
• search specific Corpora resources on the Internet
• query the web for concordances
• use available translation engines in parallel

Developer’s Tasks
• Integrate Existing Tools/Resources
• Develop Additional Generic Tools
• Interact with Users/Administrator
• Develop Custom Tools for particular research needs

Administrator’s Tasks
• Users, Groups and Disk Quotas
• Corpora Taxonomy (see box)
• Documentation Organization
• Access Service Statistics

Corpora Taxonomy
• Medium: written, spoken, multimedia
• Domain: engineering, medicine, etc.
• Genre: scientific, technical, informative, etc.

Teacher’s Tasks
• Provide on-line tutorials
• Provide links to:
  • on-line teaching material
  • bibliography and other resources

Inter-User Communication
• Tagging and Aligning Cooperatively
• Messaging Service
• Exchange of Corpora Resources

Motivation
• Lack of comprehensive, wide-scope Corpora tools
• Commercial packages are usually difficult to integrate/customize
• Tools are not prepared to support cooperative work
• Linguistic knowledge is not usually integrated in tools

[Diagram: GC architecture. Generic corpora (BNC, CETEMPúblico, COMPARA, others), each reached through a custom interface; a Tool Pool (concordance engine, taggers, semi-automatic aligner, corpora bot, statistics, custom tools); Internet; Terminology DB; inter-user communication; personal corpora; virtual desktop; terminology extraction tool (auto/semi-auto); roles DEV, ADM and USER; input formats PDF, PS, RTF, TXT, HTML, DOC.]

And then...
• PoloCLUP’s 3rd function: Evaluation of Machine Translation
  • Experimentation with evaluation
  • Teaching + research focus
• Results:
  • TrAva – MT evaluation tool
  • CorTA – Corpus of 1 EN input + 4 MT output sentences

Prescriptive v descriptive terminology
• Paper > digital form
• Static > dynamic resources
• ‘Democratization’ of terminology
• ISO standards > socioterminology
• Knowledge structures increasingly recognized as structured but dynamic - ask Gerhard Budin to explain this to you …

Perspectives of terminology users
• Domain experts and vested interests
• Translators
• Information retrieval
• Knowledge engineering
• Standardized terminology
• Getting the right word
• Finding information
• Perfecting Google
• Structuring knowledge
• Finding it fast

Bridging the Gap
• General linguists
• Translation teachers
• Translation students
• Corpus linguists
• Computational linguists
• Computer engineers
• Computer-phobia
• Computer-worship

The Corpógrafo combines:
• Terminology, translation and language study and research (Belinda)
• Terminology databases (Domain experts)
• Computational linguistics research and production of resources (Diana)
• Information retrieval and artificial intelligence (Luís)
= Discussions on priorities!

Corpora and Terminology
• Corpora as input
• Terminology extraction
• Terminology databases
• Structuring of domain knowledge
• Further corpora

[Diagram: Internet, Corpora, Corpora Analysis and Terminology Database, with "Text details" attached to three of the stages.]

Working with the Corpógrafo
• Corpógrafo is a suite of integrated tools for INDIVIDUAL or GROUP research
• All research done ONLINE
• Each username/password = separate space on our server
• At present, anyone can work with it using 10 MB of space for FREE
• BUT - you get an empty space + tools + tutorial!

Focus of Corpógrafo
• Design priorities are to:
  • See the Big Picture
  • Create the Overall Framework
  • Get feedback from users to see their needs
  • Develop according to real research needs
  • Fill in the details and improve techniques as needed

Corpógrafo and special domains
• Master’s in Terminology and Translation
• Terminology projects with the support of domain specialists in:
  • Engineering – Electronics, Mechanical Engineering
  • Geography – Population Geography, Natural Hazards (Fire, Floods, Earthquakes, Coastal Erosion)
  • Medicine – Kidney support machines, Neurology
  • Science – Genetics
  • Technology – GPS (Global Positioning System)

Corpógrafo and terminology/translation research
• Ongoing dissertations on aspects of:
  • Terminology – databases for different uses, neologisms, definition searches, semantic relations, conceptual analysis
  • Corpora – text analysis, corpora construction
  • Technical writing > Electrical Appliances
  • Localization
  • Terminology in documentaries
  • Translation of Multimedia

Linguateca
• Linguateca’s policy – all resources and tools freely available online
• Primary users – Portuguese and Brazilian

Polo CLUP/FLUP
• Bi- or multi-lingual in interest
• Corpógrafo available for experiments on a small scale to the general public
• Possibilities of future work on projects with users from other universities and other countries

Contacts
If you are interested in finding out more, please contact me:
Belinda Maia
bmaia@mail.telepac.pt
The Corpógrafo can be used (with a username and password) at:
http://www.linguateca.pt and http://poloclup.linguateca.pt/ferramentas/gc

Corpógrafo
1. File Manager – area where each individual or group can:
• convert various text formats to .txt
• upload texts to their space on the server
• ‘clean’ them of unnecessary material
• check tokenization and sentence divisions
• consult wordlists – alphabetical, frequency etc.
• group texts into corpora
• register full information on source, domain and text type

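The tokenization and wordlist steps listed for the File Manager can be illustrated with a minimal sketch. This is not the Corpógrafo's own code: the `tokenize` and `wordlists` functions, the regular expression and the sample sentence are all assumptions made for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens.

    Assumption: a token is a run of letters (accented letters included),
    optionally joined by internal hyphens or apostrophes.
    """
    return re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower())


def wordlists(text):
    """Return (alphabetical wordlist, frequency wordlist) for a text."""
    counts = Counter(tokenize(text))
    alphabetical = sorted(counts)
    # Most frequent first; ties broken alphabetically for a stable order.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return alphabetical, by_frequency


# Invented sample text, standing in for an uploaded .txt file.
sample = "A corrosão do aço é um processo. O aço inoxidável resiste à corrosão."
alphabetical, by_frequency = wordlists(sample)
print(by_frequency[:2])
```

Plain `sorted` orders accented letters after unaccented ones; a real Portuguese wordlist would probably want locale-aware collation.
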
Corpógrafo
2. Corpora analysis area:
• Concordancing tools allowing for
  • KWIC concordancing
  • KWIC concordancing sorted according to the word to the left or right
• N-gram tool
  • N-grams
  • Term candidates
  • With filters for PT

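As a rough illustration of the tools named above, the sketch below shows KWIC concordancing with left/right sorting, n-gram counting, and a simple stop-word filter for term candidates. It makes no claim about how the Corpógrafo implements them: the function names, the tiny `PT_STOPWORDS` list and the filtering rule (discard n-grams that start or end with a stop word) are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; sentence boundaries are ignored in this sketch."""
    return re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower())


def kwic(tokens, keyword, width=3, sort_by=None):
    """Return (left context, keyword, right context) for every hit.

    sort_by=None keeps text order; "left" or "right" sorts the hits by the
    word immediately to the left or right of the keyword.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    if sort_by == "left":
        hits.sort(key=lambda h: h[0][-1] if h[0] else "")
    elif sort_by == "right":
        hits.sort(key=lambda h: h[2][0] if h[2] else "")
    return hits


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical, deliberately tiny Portuguese stop list.
PT_STOPWORDS = {"a", "o", "de", "do", "da", "e", "é", "um", "uma", "à", "em"}


def term_candidates(tokens, n=2, min_count=2, stopwords=PT_STOPWORDS):
    """Keep repeated n-grams that neither start nor end with a stop word."""
    return {
        gram: count
        for gram, count in ngrams(tokens, n).items()
        if count >= min_count
        and gram[0] not in stopwords
        and gram[-1] not in stopwords
    }


# Invented sample text.
tokens = tokenize(
    "O aço inoxidável resiste à corrosão. A corrosão do aço inoxidável é lenta."
)
for left, key, right in kwic(tokens, "corrosão", width=2, sort_by="right"):
    print(" ".join(left), f"[{key}]", " ".join(right))
print(term_candidates(tokens))
```

On this sample the only bigram that survives the filter is "aço inoxidável", which is the kind of output a terminologist would then accept or reject by hand.
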
Corpógrafo
3. Terminology database
• Terms
• Definitions
• Examples
• Morphology
• Multilingual equivalents
• Sources and text details of corpora used
• Semantic relations – further complexity

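The fields listed above could be modelled as a single record type. The sketch below is one possible shape, assuming Python 3.9+; the `TermEntry` class, its field names and the relation label `is_a` are hypothetical and do not describe the Corpógrafo's actual database schema.

```python
from dataclasses import dataclass, field


@dataclass
class TermEntry:
    """One terminology record, with one attribute per field on the slide."""
    term: str
    language: str
    definition: str = ""
    examples: list[str] = field(default_factory=list)
    morphology: str = ""
    # Multilingual equivalents, keyed by language code, e.g. {"en": "corrosion"}.
    equivalents: dict[str, str] = field(default_factory=dict)
    # Sources and text details of the corpora the term was found in.
    sources: list[str] = field(default_factory=list)
    # Semantic relations stored as (relation name, target term) pairs.
    relations: list[tuple[str, str]] = field(default_factory=list)

    def add_relation(self, relation, target):
        """Record a semantic relation from this term to another term."""
        self.relations.append((relation, target))

    def related(self, relation):
        """Return every target linked to this term by the given relation."""
        return [target for name, target in self.relations if name == relation]


# Invented example entry.
entry = TermEntry(term="corrosão", language="pt", equivalents={"en": "corrosion"})
entry.add_relation("is_a", "processo químico")
print(entry.related("is_a"))
```

Storing relations as plain pairs keeps the record simple; the "further complexity" the slide mentions (typed, bidirectional or hierarchical relations) would need a richer structure.
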
Future developments – general policy
• General testing and improvement of the Corpógrafo
• Experimentation with ideas from other projects, e.g. WordNet, FrameNet
• Experimentation with theories of semantic primitives, human universals etc.
• Development of new ideas or functions – using isomorphic relationships between researchers’ needs and our possibilities

Future developments – File Manager
• Creation of overall framework – perhaps UDC based – for:
  • consultation of research available to public
  • information on ongoing research
• Coordination of individual corpus projects into bigger projects, when possible or necessary

File Manager – Theoretical questions
• Domain organization – UDC or ?
• Categorization of text by genre – how many genres?
• Reliability of texts from Internet – how does one guarantee quality?
• Is a translator or linguist able to distinguish a ‘good text’?
• Should the domain specialist choose the texts?

Corpora construction – theoretical questions / problems
• How large is a good domain corpus?
• No domain corpus will produce EVERY term in the area
• Comparable corpora v. Parallel corpora
• Aligning comparable corpora at term level

Future developments – Corpora analysis
• Development of finer-grained concordancing
• Experimentation with finding definitions in context
• Semi-automatic creation of keyword shortlists for further text retrieval

Corpora Analysis – Theoretical questions
• How far can one rely on the computational linguist or computer engineer to produce analyses of corpora?
• If (semi-) automated processes produce 80% of possible results, should the linguist / translator rubbish these processes?
• Can we leave it all to the computer engineer?

Future developments – terminology databases
• Refinement of terminology fields
• Development of further multi-lingual functions
• Development of organized and robust set of semantic relations
• Semi-automatic visualizing of semantic relations

Terminology databases – Theory
• How much information does a database need?
• How much does the user of a database need?
• Is it reasonable to hope that all our databases could one day communicate with each other and help us with translation / information retrieval – or whatever?

How is the Corpógrafo being used at present?
• Master’s in Terminology and Translation
• Terminology projects with the support of domain specialists in:
  • Engineering – Electronics, Mechanical Engineering
  • Geography – Population Geography, Natural Hazards (Fire, Floods, Earthquakes, Coastal Erosion)
  • Medicine – Kidney support machines, Neurology
  • Science – Genetics
• Translation and Localization

How is the Corpógrafo being used at present?
• Dissertations completed on:
  • Definitions for different purposes + pedagogical glossary for Corrosion, Electrical engineering http://www.fe.up.pt/~cdm/QAE/QAE_gloss_b.htm
  • Socioterminology – in the area of Composite Materials
  • Graphical representation of Conceptual systems
  • Terminology and Metaphors
  • Football Metaphors

How is the Corpógrafo being used at present?
• Ongoing dissertations on aspects of:
  • Terminology – databases for different uses, neologisms, conceptual analysis
  • Corpora – text analysis, corpora construction
  • Translation and localization terminology
  • Technical writing > Electrical Appliances
  • Terminology in documentaries

Pedagogical applications of the Corpógrafo
• Undergraduate courses – only possible if both teachers and students are trained to use it
• Postgraduate research
  • Terminology and translation (Belinda + domain experts)
  • Computational linguistics (Diana)
  • Information retrieval (Luís)
• Long live team work!

To what extent is the Corpógrafo available to others?
• Linguateca’s policy is to make all resources and tools available online
• Primary users are expected to be Portuguese and Brazilian, as most of the resources and tools are for Portuguese
• PoloFLUP’s main objective – comparable corpora and terminology tools

To what extent is the Corpógrafo available to others?
• PoloFLUP is, by definition, bi- or multi-lingual in interest
• The Corpógrafo is therefore available for experiments on a small scale to the general public
• In the future – we hope to be able to work on projects with users from other universities and other countries
