Empiricism from TMI-1992 to AMTA-2002 to AMTA-2012 Have IBM Models 1-5 failed to solve all the world’s problems? Kenneth W. Church AT&T Labs-Research church@att.com
The organizers asked me… • What's changed since TMI-92 (if anything)? • TMI-92: great excitement over the use of aligned parallel corpora to help human translators (translation tools) • Also, much controversy over IBM Models 1-5 • So what's happened since 1992? • Empiricism has come of age • Textbooks: Charniak, Jelinek, Manning & Schütze, Jurafsky & Martin • Textbooks → courses in many universities around the world • What used to be considered radical is now accepted practice • Evaluation is practically required for publication • Mercer’s fighting words: More data is better data! • These aren’t as shocking when Brill makes the case a decade later • The new field of Machine Learning has absorbed many good (and formerly controversial) ideas including • IBM Models 1-5 • Yarowsky's Word Sense Disambiguation • Grew out of Machine Translation, • But is now widely cited in Machine Learning as an early example of co-training AMTA
Has the pendulum swung too far? • What happened since TMI-1992 (if anything)? • 1980-2000: Revival of Empirical Methods • Have empirical methods become too popular? • Has too much happened since TMI-1992? • I worry that the pendulum has swung so far that • we are no longer training students for the possibility • that the pendulum might swing the other way • We ought to be preparing students with a broad education including: • Statistics and Machine Learning • as well as Linguistic Theory
Empiricism: Academia → Commercial Practice • Empiricism has not only come of age in academic venues (e.g., conferences, textbooks) • but also in commercial venues • Translation tools (e.g., alignment): • Academia → commercial practice (Trados) • Good Applications for Crummy MT • Even better apps: • CLIR (cross-language information retrieval) • MT in web search engines (Systran & AltaVista)
So, what do I expect to happen over the next decade? • Scale, stupid: • There is a lot of excitement about the web • Not only large and growing and sexy • But also contains a rich structure of hypertext links • I will propose a bait and switch strategy • Bait: public Internet • Switch: the real target is something larger and more valuable • but more elusive • Good Apps for Crummy NLP: • Spend more time on: • what we can do with what we have • and not spend all our resources on the core technology • There is a lot to a killer app: • Great technology helps, but there is a lot more • Similar arguments apply beyond MT to much of NLP and speech
Overview: Historical rational reconstruction emphasizing empiricism & business • Before TMI-1992: How to Cook a Demo • TMI-1992 Debate: Rationalism v. Empiricism • Hybrid/Tools: • Kay’s Workstation • Good Apps for Crummy MT • Trados • What happened to the IBM-Approach to MT? • Support for human translators (Translation Tools: Trados) • Fully automatic apps (CLIR) • Academia (Machine Learning) • Revival of Empiricism: A Personal Perspective • The IBM-approach to MT was always controversial • But there are lots of less controversial spin-offs: • Tools, lexicography apps, word sense, machine learning • Future (AMTA-2012): Bait and Switch Strategy • Bait: Use Public Internet to develop and test and socialize new ways of extracting value • Switch: Apply learnings to larger and more valuable private linguistic repositories • Market sizing: translation business is too small for Fortune 500 companies • Major lasting contribution of IBM-Approach: academic (machine learning) • MT Strategy: work on MT because it is fun, but apply learnings elsewhere to pay the bills
How To Cook A Demo (TMI-1992) • Great fun! • Effective demos • Theater, theater, theater • Production quality matters • Entertainment >> evaluation • Strategic vision >> technical correctness • Maturity: Many fields have come of age since 1950s • Computer Science, Artificial Intelligence, Machine Learning, Natural Language, Machine Translation, Empiricism • Success/Catastrophe • Warning: demos can be too effective • Dangerous to raise unrealistic expectations • Seeds of empiricism • Empirical methods: speech → language
Let’s go to the video tape! (Lesson: manage expectations) • Lots of predictions • Entertaining in retrospect • Nevertheless, many of these people went on to very successful careers: president of MIT, Microsoft exec, etc. • Machine Translation (1950s) video • Classic example of a demo embarrassment in retrospect • Translating telephone (late 1980s) video • Pierre Isabelle pulled a similar demo because it was so effective • The limitations of the technology were hard to explain to public • But well understood by research community • We aren’t asking what happened to translating telephones (if anything) • Apple (~1990) video • Still having trouble setting appropriate expectations • Strategy: Point & Click → Speech recognition • What happened to that (if anything)… Has it moved to Microsoft? • Andy Rooney (~1990): reset expectations video
TMI-1992 Debate: Rationalism v. Empiricism • Self-organizing systems (IBM) • Statistics do it all (no human intuition) • Stone Soup (Wilks) • Statistics don’t do nothing (all human intuition) • Hybrid/Tools (Kay’s Workbench) • Proper Place of Men and Machines in MT • Use people for what they are good at • Easy vocabulary and easy grammar • Use machines for what they are good at • Technical terminology, translation memories, re-use of previously translated texts • Good Application of Crummy MT • Trados • Pragmatism: low hanging fruit • Supply: do what we can do (with or without stats) • Demand: do what is worth doing
Stone Soup Debate (mid-1990s) • IBM-style MT is obnoxious • Agreed • It has all been done before • Agreed • Stone soup: they’ve been adding intuition to their stats • Agreed • It doesn’t work (Systran is better) • Systran is also better than Pangloss • It isn’t about empiricism, evaluation, etc. • Martin Kay’s advice about debating • Natural Ceiling • Chomsky used this argument against Shannon • In the part of speech case, the ceiling was broken with stats • Lack of data (lots of Canadian Hansards, but not much else) • We don’t hear this argument so much any more… • The Future: Hybrid Approaches • Agreed
Bottom line: Hybrid/Tools • Yorick will get the last word • But from his abstract, it looks like he’s going to tell us that I was right all along • And just in case he doesn’t… • Let me say it now: I told you so!
Kay’s Workstation (1980): Proper Place of Men and Machines in MT • The translator’s [workstation] will not run before it can walk. • It will be called on only for that for which its masters have learned to trust it. • It will not require constant infusions of new ad hoc devices that only expensive vendors can supply. • It is a framework that will gracefully accommodate the future contributions • that linguistics and computer science are able to make. • One day it will be built because • its very modesty assures its success. • It is to be hoped that it will be built with taste by people • who understand languages and computers well enough to know how little it is that they know.
CWARC: Canadian Workplace Automation Research Center (1989) • A PC • Network access to the Termium terminology database CD-ROM • WordPerfect • CompareRite (a diff tool) • TextSearch (a concordance tool) • Mercury/Termex (terminology) • Procomm (remote access to data banks via telephone modem) • Seconde Memoire (French verb inflections) • Software Bridge (tool for converting word processing files from one commercial format into another)
Good Apps for Crummy MT • It should set reasonable expectations • It should make sense economically • It should be attractive to the intended users • It should exploit the strengths of the machine • and not compete with the strengths of the human • It should be clear to the users what the system can and cannot do, and • It should encourage the field to move forward toward a sensible long-term goal.
Evaluation: MultiLingual 13:6, A Trade Magazine for Translators
MultiLingual 13:6: Very Positive on Tools • T-Remote Memory (p. 21) • Combination of translation memory, workflow & distributed work centers (work at home) • Moore’s Law: large revenues → better technology • “Hold on to your seats, translators and agencies, our industry is about to change again… Forecast shakeup: extreme.” • Comparing Tools Used in Software Localization (p. 31): A Consumer Reports-like Review • Presupposition: tools are ready for wide-spread use • “Finally, I am told that there are people out there for whom price does matter.” • ATA Conference (~1990): the translator and the MT salesman
MultiLingual 13:6 is mostly positive on technology, but… • Working With Machine Translation (p. 37) • Although the formatting of the source text is extremely simple, the machine translation output requires a lot of painstaking post-editing • Grim reality: mark-up is more valuable than translation • The verdict is unanimous • The translators, who have gained a lot of experience of EU topics, prefer to work without Systran • A Look at Two Web Translation Portals (p. 49) • At first I thought it was a machine translation site..., but soon discovered that it was not one of those “free translation” sites. ≈ $0.25/word
Machine Translation + Post-Editing: Long History of Mixed Results • Positive: • Magnusson-Murray (1985, p. 180): Although you can expect to at least double your translator’s output, the real cost-saving in MT lies in complete electronic transfer of information and the integration into a fully electronic publishing system. • Lawson (1984, p. 6): Substantial rises in translation output, by as much as 75 per cent in one case, are being reported by users of the Logos machine translation (MT) system after only a few months. • Tschira (1985): For one type of text (data description manuals), we observed an increase in throughput of 30 per cent.
Surprisingly, automation (MT + Post-editing) can be more expensive than manual baseline • Negative: • Macklovitch (1991, p. 3): The HT production chain was significantly faster than the MT production chain. • Kay (1980): Proper Place of Men and Machines in MT • ALPAC (1966, p. 19): The postedited translation took slightly longer to do and was more expensive than conventional human translation… Dr. J. C. R. Licklider of IBM and Dr. Paul Garvin of Bunker-Ramo said they would not advise their companies to establish such a service. • Credibility gap: • Why so little consistency? 200%? 75%? 30%? 0%? • Why haven’t these products done better in the marketplace? • The tools argument (terminology and translation memories) works better with translators than post-editing • Translators may be biased • but they have considerable expertise (and influence) • Automation will be easier if they believe in it
What has happened to the IBM-Approach to Machine Translation? • Support for human translators (MultiLingual 13:6) • Terminology: translators don’t need help with the easy vocabulary and the easy grammar • Translation Memory: translators are often asked to translate the same material again and again (e.g., revisions of manuals) • Alignment • Fully automatic • CLIR: cross-language information retrieval • Translating web pages • Academic fields • Machine Learning: most important contribution • Corpus-based Lexicography: spreading into lots of other fields including politics (Nunberg)
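The translation-memory idea on this slide (re-use of material that has been translated before) can be illustrated with a minimal sketch. This is not how Trados or any commercial tool works; the `MEMORY` contents, the `lookup` function, and the 0.8 threshold are hypothetical choices for illustration, and the similarity score is simply the character-level ratio from Python's standard `difflib`.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: previously translated source segments.
MEMORY = {
    "Press the power button.": "Appuyez sur le bouton d'alimentation.",
    "Remove the battery.": "Retirez la batterie.",
}

def lookup(source, memory, threshold=0.8):
    """Return (stored_segment, stored_translation, score) for the closest
    stored segment, or None if nothing is similar enough to be worth reusing."""
    best = None
    for segment, translation in memory.items():
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
        score = SequenceMatcher(None, source, segment).ratio()
        if best is None or score > best[2]:
            best = (segment, translation, score)
    if best is not None and best[2] >= threshold:
        return best
    return None

if __name__ == "__main__":
    # A revised manual sentence that differs only in punctuation is a fuzzy match.
    print(lookup("Press the power button!", MEMORY))
    # An unrelated sentence falls below the threshold and is left to the translator.
    print(lookup("zzz", MEMORY))
```

A fuzzy match hands the translator an earlier translation to revise; anything below the threshold is translated from scratch, which keeps the machine out of the cases it cannot be trusted on.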
Use of Political Labels in Major Newspapers • Geoffrey Nunberg • Commentary broadcast on "Fresh Air," March 19, 2002
Surprisingly, liberals are more likely to be labeled as such than conservatives • In fact, I [Nunberg] did find a big disparity in the way the press labels liberals and conservatives, • but not in the direction that Goldberg claims. • On the contrary: the average liberal legislator has a 30% greater likelihood of being identified with a partisan label than the average conservative does. • The press describes • Barney Frank as a liberal 2.5 times as frequently • as it describes Dick Armey as a conservative. • It gives Barbara Boxer a partisan label • almost twice as often as it gives one to Trent Lott. • And while it isn't surprising that the press applies the label conservative to Jesse Helms more often than to any other Republican in the group, • it describes Paul Wellstone as a liberal • 20% more frequently than that.
1990s Revival of Empiricism • Empiricism was at its peak in the 1950s • Dominating a broad set of fields • Ranging from psychology (behaviorism) • To electrical engineering (information theory) • At the time, it was common practice in linguistics to classify words not only by meaning but also by collocations (word associations) • Firth: “You shall know a word by the company it keeps” • Collocations: Strong tea v. powerful computers • Word Associations: bread and butter, doctor/nurse • Regrettably, interest in empiricism faded • with Chomsky’s criticism of ngrams in Syntactic Structures (1957) • and Minsky and Papert’s criticism of neural networks in Perceptrons (1969). • Availability of massive amounts of data (even before the web) • “More data is better data” • Quantity >> Quality (balance) • Pragmatic focus: • What can we do with all this data? • Better to do something than nothing at all • Empirical methods (and focus on evaluation): Speech → Language
Shannon’s Noisy Channel Model • Language Model: Pr(I) • Channel Model: Pr(O|I) • I → Noisy Channel → O • $I' \approx \arg\max_I \Pr(I \mid O) = \arg\max_I \Pr(I)\,\Pr(O \mid I)$
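The decoding rule on this slide, choose the input I that maximizes Pr(I) Pr(O|I), can be sketched for spelling correction with toy numbers. The probabilities in `LANGUAGE_MODEL` and `CHANNEL_MODEL` below are invented for illustration, and `decode` is a hypothetical name; a real system would estimate both models from data.

```python
import math

# Hypothetical toy language model Pr(I): prior probability of each intended word.
LANGUAGE_MODEL = {"the": 0.6, "then": 0.3, "thee": 0.1}

# Hypothetical toy channel model Pr(O | I), keyed by (observed, intended).
CHANNEL_MODEL = {
    ("teh", "the"): 0.05,
    ("teh", "then"): 0.001,
    ("teh", "thee"): 0.01,
}

def decode(observed):
    """Return argmax over I of Pr(I) * Pr(O | I), scored in log space.

    Returns None when the channel model gives every candidate zero probability.
    """
    best, best_score = None, -math.inf
    for candidate, prior in LANGUAGE_MODEL.items():
        likelihood = CHANNEL_MODEL.get((observed, candidate), 0.0)
        if likelihood == 0.0:
            continue  # this candidate cannot have produced the observation
        score = math.log(prior) + math.log(likelihood)
        if score > best_score:
            best, best_score = candidate, score
    return best

if __name__ == "__main__":
    print(decode("teh"))
```

The same two-factor scoring carries over to the other applications of the model: only the meaning of I and O changes, not the argmax.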
Using (Abusing) Shannon’s Noisy Channel Model: Part of Speech Tagging and Machine Translation • Speech • Words → Noisy Channel → Acoustics • OCR • Words → Noisy Channel → Optics • Spelling Correction • Intended Text → Noisy Channel → Typos • Part of Speech Tagging (POS): • POS → Noisy Channel → Words • Machine Translation: • English → Noisy Channel → French
Statistical MT • E → Noisy Channel → F • $E' = \arg\max_E \Pr(E)\,\Pr(F \mid E)$ • Language Model, Pr(E): • Trigram model (borrowed from speech recognition) • Channel Model, Pr(F|E): • Based on aligned parallel corpora • Models 1-5: alignment • Mercer & Church (Computational Linguistics, 1993) • Statistical MT may fail for reasons advanced by Chomsky • Regardless of its ultimate success or failure, • There is a growing community of researchers in corpus-based linguistics who believe it will produce valuable lexical resources • Bilingual concordances • Translation tools • Training & testing material for word sense disambiguation (Senseval)
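As a rough illustration of how a channel model Pr(F|E) can be estimated from sentence-aligned text, here is a simplified sketch in the spirit of IBM Model 1: word-translation probabilities t(f|e) are re-estimated with EM. It is not the original IBM implementation; the three-sentence `CORPUS`, the function name `train_model1`, and the iteration count are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sentence-aligned toy corpus: (English words, French words).
CORPUS = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"], ["le", "livre"]),
    (["a", "book"], ["un", "livre"]),
]

def train_model1(pairs, iterations=10):
    """Estimate t(f | e) by EM, returned as a dict keyed by (f, e)."""
    french_vocab = {f for _, fs in pairs for f in fs}
    # Start from a uniform distribution over the French vocabulary.
    t = defaultdict(lambda: 1.0 / len(french_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (f, e) links
        total = defaultdict(float)  # expected number of links leaving e
        for es, fs in pairs:
            es = ["NULL"] + es  # NULL lets a French word align to no English word
            for f in fs:
                # E-step: spread one unit of f over the English words in
                # proportion to the current t(f | e).
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

if __name__ == "__main__":
    model = train_model1(CORPUS)
    for f in ("le", "livre", "un"):
        print(f, "given book:", round(model[(f, "book")], 3))
```

Even with no dictionary, "livre" co-occurs with "book" in two sentence pairs while "le" and "un" each have a competing explanation, so the estimates drift toward the intended pairing.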
Word Sense Disambiguation • Knowledge Acquisition Bottleneck • Bar-Hillel (1960) • Expert systems don’t scale • Sense-tagged text: expensive • Parallel text! • Translation = sense-tagged text • Sentence (judicial sense) → peine • Sentence (syntactic sense) → phrase • Yarowsky: bilingual → monolingual • One sense per discourse • Machine Learning: early example of co-training
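The point that a translation acts as a free sense tag can be made concrete with a small sketch. The French word aligned with "sentence" (peine or phrase) is used as the label, and a new English context is tagged by simple word overlap with the labeled contexts. The overlap scorer is a deliberately crude stand-in, not Yarowsky's method, and the `ALIGNED` examples and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical examples read off a word-aligned parallel corpus:
# (English context words around "sentence", French translation of "sentence").
ALIGNED = [
    (["judge", "prison", "years"], "peine"),
    (["court", "judge", "served"], "peine"),
    (["verb", "noun", "grammar"], "phrase"),
    (["parse", "grammar", "word"], "phrase"),
]

def train(examples):
    """Count context words per sense, using the translation as the sense label."""
    context_counts = defaultdict(Counter)
    for context, sense in examples:
        context_counts[sense].update(context)
    return context_counts

def tag(context, context_counts):
    """Pick the sense whose training contexts overlap most with this context."""
    scores = {
        sense: sum(counts[word] for word in context)
        for sense, counts in context_counts.items()
    }
    return max(scores, key=scores.get)

if __name__ == "__main__":
    counts = train(ALIGNED)
    print(tag(["the", "judge", "gave", "a", "long"], counts))
    print(tag(["the", "grammar", "of", "the"], counts))
```

No human annotated any of the training examples; the labels come entirely from the translators' word choices.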
Revival of Empiricism: A Personal Perspective • At MIT, I was solidly opposed to empiricism • But that changed soon after moving to AT&T Bell Labs (1983) • Letter-to-Sound Rules (speech synthesis) • Names: Letter stats → Etymology → Pronunciation video • NetTalk: Neural Nets video • Demo: great theater → unrealistic expectations • Self-organizing systems v. empiricism • Machine Learning v. Corpus-based Linguistics • I did it, I did it, I did it, but… • Part of Speech Tagging (1988) • Word Associations (Hanks) • Mutual info → collocations & word associations • Collocations: Strong tea v. powerful computers • Word Associations: bread and butter, doctor/nurse • Good-Turing Smoothing (Gale) • Aligning Parallel Corpora (inspired by MT) • Word Sense Disambiguation • Bilingual → Monolingual • Even if IBM’s approach fails for MT → lasting benefit (tools, linguistic resources, academic contributions to machine learning)
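The mutual-information measure behind the collocation work can be sketched as pointwise mutual information, log2 of Pr(x, y) / (Pr(x) Pr(y)), computed from corpus counts. The version below is a simplification that only looks at adjacent word pairs; the function name `pmi_table` and the four-phrase `TOY_CORPUS` are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
TOY_CORPUS = [
    ["strong", "tea"],
    ["strong", "tea"],
    ["powerful", "computers"],
    ["powerful", "computers"],
]

def pmi_table(sentences):
    """Return {(x, y): log2(Pr(x, y) / (Pr(x) * Pr(y)))} for adjacent pairs."""
    word_counts = Counter()
    pair_counts = Counter()
    for words in sentences:
        word_counts.update(words)
        pair_counts.update(zip(words, words[1:]))  # adjacent pairs only
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    table = {}
    for (x, y), c in pair_counts.items():
        p_xy = c / n_pairs
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table

if __name__ == "__main__":
    for pair, score in pmi_table(TOY_CORPUS).items():
        print(pair, round(score, 2))
```

Pairs that co-occur more often than chance get a positive score, which is what separates "strong tea" from the unattested "strong computers" in this toy data. On real corpora, low-frequency pairs give unreliable scores, so a count threshold is usually applied first.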
Strategy is Important (www.elsnet.org) • Our field is doing better and better! • It used to be hard to prepare a talk for a CL audience • because there was almost nothing that you could assume everyone knew. • The field will really have arrived when a course in speech and language processing is a normal part of every undergraduate and graduate Computer Science, Electronic Engineering, and Linguistics programme • and we're a long way from that • But things are improving… • Now have several textbooks: Manning & Schütze, Jurafsky & Martin • A quick search of the web: textbooks → courses (around the world) • There were, of course, many other obstacles • that limited the size of the field. • It used to be hard to join in on the fun • because only a few large industrial labs could afford to collect data • Thanks to data collection efforts such as LDC and ELSNET and the web • Data is no longer the problem it used to be • Of course, you can never have too much of a good thing… • Tools also used to be a problem….
Learn from Theoretical Computer Science • There is always, though, more • we could do to promote our field • Learn from Theoretical Computer Science • Theory has paid more attention to teaching • than we have • They have also worked hard on strategy • www.research.att.com/~dsj/nsflist.html • The theory community regularly exchange lists of open problems along with difficulty ratings • Students know before they solve a problem whether it is worth a conference paper or a superstar award
Strategy: not urgent, but important • Many orgs (.edu, .com, .gov) work hard on strategy • Plenty of examples on the web: • www.nsf.gov/pubs/2001/nsf0104/strategy.htm • www.darpa.mil/body/mission.html • medg.lcs.mit.edu/doyle/publications/sdcr96.pdf • www.gridforum.org/L_About/about.htm • Hard to say why strategy is important • But I have noticed, at least within AT&T, that groups that work hard on strategy have grown and prospered over the years • Strategy is never as urgent as the next conference paper deadline, but it is probably more important
Strategy documents have impact (even if it appears that they are being ignored) • Organizations may or may not follow their own recommendations • The discussion that produces the strategy document is extremely valuable, nevertheless, • perhaps more so than anything that happens after the doc is finalized • Strategy panels offer a forum for people to meet • and look at the field from a broader perspective • In addition, the theory community has observed that even after the people involved in the original discussion have long since forgotten the outcome • Recommendations continue to live on • and broaden the best and most aggressive students for years
Strategy Discussions in Our Field • There are a few discussions of strategy within our field: • http://www.elsnet.org/about.html • http://www.ldc.upenn.edu/ldc/about/ldc_intro.html • http://www-nlpir.nist.gov/projects/duc/papers/ • LREC workshops (to order proceedings, see www.lrec-conf.org) • LDC link developed a decade ago • Largely responsible for the success of LDC • If more groups in our field put the same kind of energy into strategy, • There would be more success stories like the LDC • A delightful "near miss" is Martin Kay's reflections on ICCL and COLING • Establishes direction for the format of Coling conferences in a “classic” Martin-style • Proposal: convince Martin to write a doc in the same delightful style • Establish direction for the field rather than atmosphere for Coling
Bait and Switch Strategy (www.elsnet.org) • Bait: public Internet • Large, sexy, available, rich hypertext structure • Switch: as large as the web is • There are larger & more valuable private repositories • Private Intranets & telephone networks • Exclusivity → Value • No one cares about data that everyone can have • Just as Groucho Marx doesn’t want to be in a club that… • Strategy: Use the public Internet to develop, test and socialize new ways to extract value from large linguistic repositories • Value to society: Apply solutions to private repositories
Call Centers: An Intelligence Bonanza • Some companies are collecting information with technology designed to monitor incoming calls for service quality. • Last summer, Continental Airlines Inc. installed software from Witness Systems Inc. to monitor the 5,200 agents in its four reservation centers. • But the Houston airline quickly realized that the system, which records customer phone calls and information on the responding agent's computer screen, also was an intelligence bonanza, says André Harris, reservations training and quality-assurance director.
Bait: Use Web to establish: More data is better data • Shocking at TMI-92 (Mercer), but less so a decade later (Brill) • EMNLP-02 best paper: Using the Web to Overcome Data Sparseness • Larger corpora (Google) >> smaller corpora (British National Corpus) for predicting psycholinguistic judgements. • Suggested in the conclusions that web counts are better than standard smoothing techniques (back-off) for language modelling • Really exciting! Performance on a broad range of computational linguistics tasks will improve as we collect more and more data • The rising tide of data will lift all boats! • Brill (AskMSR Question Answering): • One can do remarkably well in TREC question answering competitions by using a search engine like Google and very little else • Norvig (ACL-02 invited talk): ditto • Google is also very good at finding collocations/associations • http://labs1.google.com/sets • Cat & dog → animals • Cat & more → Unix commands! • We used to try to do similar things a decade ago, but the results were not as good, probably because we were working with relatively tiny corpora in the sub-billion-word range • Unix commands and many other subjects (esp. taboo subjects) are over-represented on the web • Quantity v. quality/corpus size v. balance • Is collecting more data better than smoothing?
How Large is Large? • Web → Renewed Excitement • Large, rich hypertext structure & publicly available • Google = 1000 * BNC • Google: 100 Billion Words • British National Corpus (BNC): 100 Million Words • It is often said that the web is the largest repository but… • Changes to copyright laws could unlock vast quantities of data: www.lexisnexis.com • Private Intranets and telephone networks >> Public Web • FCC (trends.html): 200 million telephones in USA (1 line/person) • Usage: 1 hour/day/line • Assume 1 sec ≈ 1 word → 10 Google collections/day • Currently, Intranets (data) ≈ telephones (voice) • But data is growing faster than voice • Admittedly, much of the data on Intranets cannot be distributed • And much of the speech on the telephone networks cannot be recorded • But attitudes are changing • It used to be considered rude to have a telephone answering machine • Now it is considered rude not to have one • Between answering machines and call centers, perhaps 10% can be recorded
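The back-of-the-envelope estimate on this slide can be worked through in a few lines. All inputs are the slide's own round numbers; the variable names are just labels chosen for the sketch.

```python
# Figures taken from the slide; all are rough, order-of-magnitude estimates.
TELEPHONES = 200_000_000          # telephone lines in the USA
HOURS_PER_DAY_PER_LINE = 1        # usage per line
WORDS_PER_SECOND = 1              # "assume 1 sec is about 1 word"
GOOGLE_WORDS = 100_000_000_000    # 100 billion words
BNC_WORDS = 100_000_000           # 100 million words
RECORDABLE_FRACTION = 0.10        # "perhaps 10% can be recorded"

SECONDS_PER_HOUR = 3600

# Words spoken over the telephone network per day.
words_per_day = (
    TELEPHONES * HOURS_PER_DAY_PER_LINE * SECONDS_PER_HOUR * WORDS_PER_SECOND
)

# The same volume expressed in units of "one Google collection".
google_collections_per_day = words_per_day / GOOGLE_WORDS
recordable_collections_per_day = google_collections_per_day * RECORDABLE_FRACTION

if __name__ == "__main__":
    print("Google / BNC:", GOOGLE_WORDS // BNC_WORDS)
    print("Telephone words per day:", words_per_day)
    print("Google collections per day:", google_collections_per_day)
    print("Recordable collections per day:", recordable_collections_per_day)
```

With these inputs the telephone network carries about 7 Google collections of speech per day, which the slide rounds to 10. Even at a 10% recording rate, that is most of a Google-sized collection every day.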
In the past, recording all this data would have been prohibitively expensive • Thanks to Moore’s Law • Storage costs have been falling faster than transport • And will continue to do so for some time • Even at current prices, transport >> storage • Long-distance telephone calls: $0.05/min • Disk space: $0.005/min • If I am willing to pay for a call • I might as well keep the speech online for a long time • Similar comments hold for data (web pages) • If I am willing to pay to fetch a web page • I might as well cache it for a long time • Why flush a page if there is any chance that it might be requested again? • Web caches → crawlers • Go find the pages that I might ask for and keep them forever • Storage is cheap (compared to transport)
Recommendations: Bait and Switch Strategy • Papers: • Keep up the good work! • There is considerable interest in eval on corpora • There will be more interest in how well methods port to new corpora • More interest in how performance scales with size • Hopefully corpus size helps • but of course, all the data in the world will not solve all the world’s problems • Need to understand when more data will help • And when it is better to do something else • → revival of linguistics
More Bait and Switch Recommendations: Investments in infrastructure • In addition to traditional data collection efforts focused on publicly available linguistic repositories • We ought to think about private repositories, as well. • Potential: Huge impact on size of private repositories • By making it more convenient to capture private data, and • By demonstrating that there is value in doing so. • For example, most of us do not keep voice mail for long • though I have been using Scanmail to copy voice mail to email • and like many people, I keep a lot of email online for a long time • Unfortunately, tools for searching email and other private repositories are not as good as the tools for searching public repositories (Google)
Market Opportunities for Translation • CACM (with Rau): Commercial Opportunities for NLP • Do we count garage outfits funded by grants? • Fortune-500 perspective: min of $100 million • Identified two application areas • Word Processing (Microsoft) • Information Retrieval (Lexis-Nexis/Web) • Did not identify translation • How large is translation market? Huge estimates: • $10+ Billion (Eurolang) • Comparable to AT&T’s revenues for consumer services • Telcos are a major employer (unlike translation) • Estimates of Market Size • English-only (ASCII): • Pairs of English Speakers • Monolingual (ISO/Unicode) • Pairs of speakers who share a language • Multi-lingual (translation) • Pairs of speakers • [Figure labels: Multi-lingual, Monolingual, English]
Surprisingly Little Demand for Multi-lingual Applications: Translation & Interpretation • AT&T Language Line Lesson: • Surprisingly, monolingual market >> multi-lingual market • Lots of demand for telephone service where both parties speak the same language • We thought there would be even more demand for a translating telephone • Because there are more pairs of people who don’t share a common language than do • But people don’t talk (much) to people they don’t know • AT&T Language Line Service: • Speech to speech interpretation over the phone (low tech except conf calling/work-at-home) • Plus a traditional writing to writing translation service • Interpretation market: focused on emergencies: police, hospital (too small for AT&T) • Translation market: focus on technical manuals (also, too small for AT&T) • Surprise: interpretation market ≠ translation market; demand for language pair • Interpretation (speech to speech): depends on number of domestic speakers • Translation (writing to writing): depends on world-wide GNP • Putnam: Bowling Alone Bridging and Bonding • Employment opportunities for translators • Not Good • Markup is more valuable than translation • Desktop publishing is a better business • Business case: Adobe >> Trados • [Figure labels: Multi-lingual, Monolingual, English]
Summary: Historical rational reconstruction emphasizing empiricism & business • Before TMI-1992: How to Cook a Demo • TMI-1992 Debate: Rationalism v. Empiricism • Hybrid/Tools: • Kay’s Workstation • Good Apps for Crummy MT • Trados • What happened to the IBM-Approach to MT? • Support for human translators (Translation Tools: Trados) • Fully automatic apps (CLIR) • Academia (Machine Learning) • Revival of Empiricism: A Personal Perspective • The IBM-approach to MT was always controversial • But there are lots of less controversial spin-offs: • Tools, lexicography apps, word sense, machine learning • Future (AMTA-2012): Bait and Switch Strategy • Bait: Use Public Internet to develop and test and socialize new ways of extracting value • Switch: Apply learnings to larger and more valuable private linguistic repositories • Market sizing: translation business is too small for Fortune 500 companies • Major lasting contribution of IBM-Approach: academic (machine learning) • MT Strategy: work on MT because it is fun, but apply learnings elsewhere to pay the bills