Introducing CRL

Introducing CRL, the Computing Research Laboratory at NMSU. Jim Cowie – Director; Steve Helmreich – Deputy Director / 505-646-2141 / shelmrei@nmsu.edu; http://crl.nmsu.edu. Established in 1983 by the New Mexico Legislature as a Center of Technical Excellence.

Presentation Transcript


  1. Introducing CRL Computing Research Laboratory

  2. The Computing Research Laboratory at NMSU • Jim Cowie – Director • Steve Helmreich – Deputy Director / 505-646-2141 / shelmrei@nmsu.edu • http://crl.nmsu.edu • Established in 1983 by the New Mexico Legislature as a Center of Technical Excellence • CRL is a research department in the College of Arts and Sciences at New Mexico State University • From 1983 to 1989, received more than $6.5 million in state funding • Since 1990, entirely self-supporting on research grants and contracts

  3. CRL Capabilities and Expertise • Multi-lingual text processing • Speech processing and generation • Human Computer Interaction • Team of about 40 computer scientists, psychologists, linguists, computational linguists, geographers, biochemists, and mathematicians • Capable of delivering complex, working prototype systems

  4. Language Engineering at CRL • Information retrieval • Language learning and language teaching • Automatic translation • Summarization • Question answering • Dictionary development • Knowledge discovery

  5. Overview of Talk • Projects related to Machine Translation • Pragmatics-based Machine Translation • Jargon analysis project • IL Annotation project • Projects using Machine Translation • Expedition / Boas • MOQA (Question / Answering)

  6. Machine Translation triangle [Figure; labels: Interlingual (IL), Analysis, Generation, Transfer, Source Language, Target Language, Direct Translation]

  7. Machine Translation triangle [Figure; labels: Interlingual (IL), Analysis & Generation, Generation & Analysis, Transfer, Source & Target Language, Target & Source Language, Direct Translation]

  8. Machine Translation triangle [Figure; same labels as slide 7]

  9. CRL Machine Translation Projects • XTRA – Chinese-English IL, 1986-88 • ULTRA – five languages IL, 1988-90 • Pangloss – multi-site Spanish-English IL, 1992-95 • Mikrokosmos – Spanish-English IL, 1995-98 • Corelli – multi-lingual transfer, 1998-2001

  10. Characteristics of IL MT • Analysis to and generation from “meaning” of the input • Disambiguation to an unambiguous language-independent representation (IL) • Use of world knowledge to disambiguate • World knowledge stored and manipulated through an Ontology

  11. Jesus of Montreal • Woman to priest guiltily coming out of her bedroom (in French): “Come on out, we’re not playing a scene from Feydeau.” • English subtitle: “Come on out. This isn’t a bedroom farce.”

  12. Which floor is this? • In a Spanish newspaper article about expensive real estate rentals in Moscow: “Nothing’s available on the ‘segundo piso’ but there’s still some space left on the ‘tercero piso.’” • T1: second floor / third floor • T2: third floor / fourth floor

  13. Earthquakes – who is to blame? • Acumulación de víveres por anuncios sísmicos en Chile • Hoarding Caused by Earthquake Predictions in Chile • STOCKPILING OF PROVISIONS BECAUSE OF PREDICTED EARTHQUAKES IN CHILE

  14. Pragmatics-based MT hypothesis • Translations are made on the basis of interpretations • Interpretations are a set of coherent inferences about the content and the context of the message • These inferences are based on • Beliefs of the translator about the beliefs of the author • Beliefs of the translator about the beliefs of the target audience • Beliefs of the translator about the world

  15. Machine Translation triangle [Figure; labels: Interpretations, Interlingual (IL), Analysis & Generation, Generation & Analysis, Transfer, Source & Target Language, Target & Source Language, Direct Translation]

  16. Terrorist/Freedom Fighter • sindicalistas: Union Members / Labor Leaders • asesino: killer / assassin • asesinados: murdered / assassinated • campesinos: small farmers / peasants • sin tierras: without land / landless • terrateniente: landowner / landholder

  17. Hypothesis • It is possible to identify an author's viewpoint from the vocabulary (jargon) used, particularly in the use of alternate lexical items referring to the same concept or object

  18. Hypothesis • Social groups are organized not just around topics but also around points of view and • develop jargons to express those points of view • Members of those social groups generally hold to those points of view and • Use the jargons to express themselves • THUS identifying an author’s jargon also identifies the groups he/she belongs to and the beliefs he/she is likely to hold

  19. Training Corpus • Issue: Abortion • Text Size: approximately 8000 tokens each • Text Size (types): 2273 pro-choice / 2168 pro-life • Significant unique vocabulary: 79 pro-choice / 68 pro-life • Significant common vocabulary 113 / 37

  20. Approach • Unique vocabulary: 1581 pro-choice/1476 pro-life • Common vocabulary: 692 • Significant unique vocabulary: • 79 pro-choice • 68 pro-life • Significant common vocabulary: 113 (37)
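The split into unique and common vocabulary on slides 19 and 20 can be sketched with set operations over the word types of the two corpora. This is a minimal sketch, not the project's code: the tokenizer, the function name `vocabulary_split`, and the two toy sentences are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (a deliberately simple, assumed tokenizer)."""
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())


def vocabulary_split(text_a, text_b):
    """Return (unique to A, unique to B, common) word types with their counts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    types_a, types_b = set(counts_a), set(counts_b)
    unique_a = {w: counts_a[w] for w in types_a - types_b}
    unique_b = {w: counts_b[w] for w in types_b - types_a}
    common = {w: (counts_a[w], counts_b[w]) for w in types_a & types_b}
    return unique_a, unique_b, common


# Toy inputs, invented for illustration only.
text_a = "The clinic blockade stopped the activists"
text_b = "The unborn child and the clinic"
unique_a, unique_b, common = vocabulary_split(text_a, text_b)
```

The slide figures are consistent with this decomposition: unique plus common types give the totals on slide 19 (1581 + 692 = 2273 pro-choice, 1476 + 692 = 2168 pro-life).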

  21. Unique Vocabulary – Pro-life • abnormalities, aborted, abortifacient, abortifacients, abortion-inducing, abortionist, abortionists, adultery, amniotic, bible, blessed, cancer-causing, chastisement, chastisements, chastises, complication, complications, contrite, creator, depression

  22. Unique Vocabulary – Pro-choice • activism, activists, alley, anti-abortion, anti-choice, anti-democratic, antiabortionists, arson, arsonist, arsons, attorney, attorney’s, blockade, blockaders, blocked, blocking, bomb, bombing, bombings

  23. Significant Common Vocabulary (occurrences: pro-life / pro-choice) • clinic(s): 3 / 46 • fetus: 22 / 7 • parenthood: 2 / 14 • planned: 2 / 15 • unborn: 15 / 1 • week(s): 37 / 8 • woman(’s): 9 / 27
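One way to turn the slide 23 counts into a viewpoint signal is a smoothed log-ratio score. The slides do not say how the counts were used, so the scoring rule below is an assumption, as are the name `viewpoint_score` and the choice to key the table on singular forms (the slide merges forms such as clinic/clinics). Because both corpora are about 8000 tokens, raw counts are compared directly.

```python
import math

# Counts from slide 23: term -> (pro-life count, pro-choice count).
COMMON_COUNTS = {
    "clinic": (3, 46),
    "fetus": (22, 7),
    "parenthood": (2, 14),
    "planned": (2, 15),
    "unborn": (15, 1),
    "week": (37, 8),
    "woman": (9, 27),
}


def viewpoint_score(tokens):
    """Sum of add-one-smoothed log-ratios over known terms.

    Positive values lean pro-life, negative values lean pro-choice;
    tokens outside the table contribute nothing.
    """
    score = 0.0
    for token in tokens:
        if token in COMMON_COUNTS:
            pro_life, pro_choice = COMMON_COUNTS[token]
            score += math.log((pro_life + 1) / (pro_choice + 1))
    return score
```

Under this assumed rule, a text mentioning "clinic" and "planned parenthood" scores negative, while one mentioning "unborn" and "fetus" scores positive.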

  24. One-year project • Using sounder statistical measurements • Baseline corpus • Statistically significant differences • Other methods of measuring differences • Using collocations as well as single words • Looking for “synonymous” terms • WordNet • Ontology • Roget’s
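The slide does not name the "sounder statistical measurements". One common choice for comparing a word's frequency across two corpora is the log-likelihood (G2) statistic, sketched here as an assumption rather than as the project's method. The corpus size of 8000 tokens is the approximate figure from slide 19, and the counts for "clinic(s)" come from slide 23.

```python
import math


def log_likelihood(count_a, count_b, size_a, size_b):
    """Two-corpus log-likelihood (G2) for one word's frequency difference."""
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


CORPUS_SIZE = 8000  # approximate token count per corpus, from slide 19
CRITICAL_VALUE = 3.84  # chi-square, 1 degree of freedom, p = 0.05

# "clinic(s)": 3 pro-life occurrences vs. 46 pro-choice occurrences (slide 23).
g2_clinic = log_likelihood(3, 46, CORPUS_SIZE, CORPUS_SIZE)
is_significant = g2_clinic > CRITICAL_VALUE
```

A word used equally often in both corpora scores zero, so the statistic singles out exactly the kind of lopsided usage the jargon hypothesis relies on.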

  25. Experiments • Differentiate opinions in a binary opposition within texts on the subject of opposition • Differentiate opinions among a plurality of views within texts on the subject • Differentiate opinions in a binary opposition within texts on a different subject • Differentiate opinions among a plurality of views within texts on a different subject • Differentiate multiple viewpoints in any article

  26. Problems with IL approach • Idiosyncratic – no common understanding of what IL should be or look like • Limited automatic acquisition – most of the knowledge base and lexicon is hand-coded

  27. Interlingual Annotation of Multilingual Text Corpora • Computing Research Laboratory – NMSU • Mitre Corporation • UMIACS – U Maryland • Columbia University • Language Technologies Institute – CMU • Information Sciences Institute – USC

  28. Approach • Collection of texts in six languages • Three translations of each into English • Tools to analyze grammatical aspects • Morphological analysis • Name recognition • Chunking

  29. Develop IL Representation • Through study of texts • Through examination of current ILs • Develop formal definition • Rich representation • Compatible with under-specification • Develop coding manuals and guarantee inter-coder reliability

  30. Annotate the Corpus • All sites / all texts • One site in charge of one aspect of IL • Frequent interaction • Regular joint meetings

  31. Evaluate the results • Inter-coder reliability • Growth rate • Grain size • Quality of generation
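Inter-coder reliability is often reported as Cohen's kappa, which corrects raw agreement for agreement expected by chance. The slides do not say which measure the project used, so the sketch below is an assumption, and the annotation labels in the example are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("coders must label the same non-empty set of items")
    n = len(labels_a)
    # Proportion of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each coder assigned labels independently
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two coders annotating the same four items.
coder_a = ["AGENT", "THEME", "AGENT", "GOAL"]
coder_b = ["AGENT", "THEME", "THEME", "GOAL"]
kappa = cohen_kappa(coder_a, coder_b)
```

Here the coders agree on three of four items (0.75), but chance agreement is 5/16, so kappa is noticeably lower than the raw agreement rate.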

  32. Trends in HLT Research Funding • Focus on sub-tasks • Named entity recognition • Coreference resolution • Word sense disambiguation • Bring multi-lingual capabilities to parallel technologies • Multi-lingual IR/IE/summarization • Bring multiple technologies into one project

  33. Three such projects at CRL • Expedition / Boas • MOQA – Meaning-Oriented Question/Answering • Personal Profiler

  34. Expedition: A tool for building Machine Translation systems • The problem: given two people, a linguist who knows a language and a programmer, provide a support system that allows them to build a machine translation system for that language in six months. • The project is complete, and we are now using it to build translation systems for Somali and Urdu. • You can try out the system at http://aiaia.nmsu.edu

  35. Boas: “A Linguist in the Box” Boas is a semi-automatic knowledge elicitation system that guides a language speaker through the process of developing the static knowledge sources for a moderate-quality, broad-coverage MT system from any “low-density” language into English in about six months. Some of the tasks include providing a list of characters and morphological features, paradigms for inflected classes, equivalents of closed-class items, translation of place names and open class items from English into the source language.

  36. Language knowledge acquisition has been a bottleneck for MT development and deployment for over 40 years. At the same time, the dearth of data resources has strongly limited the deployment of any of the recent corpus-based techniques in practical MT environments. • Expedition is a “quick ramp-up” MT environment between “low density” languages and English, which is a step toward alleviating these problems. • Boas, the main knowledge acquisition module inside Expedition, includes resident knowledge about • a set of potential source languages • generalized parametric typological knowledge about languages in general and • methods and configurations for human-computer interaction. • It is designed for use by a team which does not include trained computational linguists.

  37. Boas contains knowledge about human language and means of realization of its phenomena in a number of specific languages and is, thus, a kind of a “linguist in the box” that helps non-professional acquirers with the task whose complexity is well-known.

  38. The ethnologist and linguist Franz Boas was the founder of the American school of descriptive linguistics. In this photo (circa 1900?), he is shown posing for a model which was being made of a Kwakiutl Winter Ceremonial dancer in which the dancer emerges from within a circular hole cut in the dancing screen.

  39. Meaning-Oriented Question-Answering with Ontological Semantics An AQUAINT Project from ILIT

  40. Development Strategy • Meaning oriented question answering • Rapid Prototyping using pre-existing components • Evaluation of end-to-end system performance for specific tasks (collaboration with AWARE project, Bill Ogden, CRL) • Project commenced August 2002 • Current system runs on Linux or Windows 2000

  41. Meaning-Oriented Question-Answering with Ontological Semantics • Initial Domain: travel and meetings • question understanding and interpretation • determining the answer and • presenting the answer • two kinds of data source • Structured Fact Repository containing instances of ontological entities • open text (in English, Arabic and Farsi)

  42. System Overview (V0) [Diagram; labels: human, Document Sources, machine, Document Retrieval, Text Analyzer, Query Interface & Answer Formulation, Human Acquisition, Fact Repository]

  43. System Overview (V1 now) [Diagram; labels: human, Document Sources, real-time, Document Retrieval, batch, Text Analyzer, Query Interface & Answer Formulation, Fact Repository, questions]

  44. System Overview (V2) [Diagram; labels: human, Document Sources, real-time, Document Retrieval, batch, Text Analyzer, Query Interface & Answer Formulation, Fact Repository, questions & texts]

  45. Batch Processing Overview [Diagram; labels in order: Web Spider, Documents, Keizai Indexing, Document Collection, Keizai Retrieval, Document Subset, Text Analysis, Text Meaning Representation, TMR to FR Converter, Fact Repository]

  46. Batch Mode - Fact Repository Population • Spidered contemporary text • Retrieval done using Keizai retrieval system (Unicode based) • Uses a list of interesting people and travel keywords • Selected documents saved and automatically processed using UMBC’s analyzer (which produces text meaning representations) • Instances of concepts from TMR extracted and stored in Fact Repository
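The batch flow on slides 45 and 46 (retrieve, analyze, convert, store) can be sketched as a chain of small functions. Everything here is a hypothetical stand-in: `retrieve` is a plain keyword filter rather than the Keizai system, `analyze` emits toy dictionaries rather than real text meaning representations, and `FactRepository` is an in-memory list. The sample documents are invented.

```python
from dataclasses import dataclass, field


@dataclass
class FactRepository:
    """Hypothetical in-memory stand-in for the Fact Repository."""
    instances: list = field(default_factory=list)

    def add(self, instance):
        self.instances.append(instance)


def retrieve(documents, keywords):
    """Stand-in for retrieval: keep documents mentioning any keyword."""
    wanted = [k.lower() for k in keywords]
    return [doc for doc in documents if any(k in doc.lower() for k in wanted)]


def analyze(document, keywords):
    """Stand-in for text analysis: one toy 'TMR' entry per keyword found."""
    lowered = document.lower()
    return [{"concept": "MENTION", "keyword": k, "source": document}
            for k in keywords if k.lower() in lowered]


def populate(documents, keywords, repository):
    """Run retrieval, analysis, and storage in the order given on slide 45."""
    for document in retrieve(documents, keywords):
        for instance in analyze(document, keywords):
            repository.add(instance)
    return repository


# Invented sample input standing in for spidered text.
spidered = [
    "The minister will travel to Cairo next week.",
    "Local weather remains mild.",
]
repo = populate(spidered, ["travel", "meeting"], FactRepository())
```

Keeping each stage behind its own function mirrors the slide's boxes, so a real retrieval system or analyzer could replace a stand-in without touching the other stages.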

  47. Interactive Processing Overview [Diagram; labels as extracted: Query Interface, NL Query Analyzer, Answer formulation, Information Server, TMR, XML Answer, Instance Finder, Instances, Fact Repository]