What's Happened Since the First SIGDAT Meeting? Kenneth Ward Church AT&T Labs-Research kwc@research.att.com
The First SIGDAT Meeting • WVLC-1 was held just before ACL-93 • Great turnout! • More like a conference than a workshop • We knew that corpora were “hot,” • but didn't appreciate just how hot they would turn out to be.
Sister meetings have also done very well since 1993 • Information Retrieval • http://www.acm.org/sigir/ • Digital Libraries • http://fox.cs.vt.edu/DL99/ • Machine Learning • http://www.cs.cmu.edu/Web/Groups/NIPS • Data-mining, Databases, Data Warehousing • http://www.acm.org/sigkdd/ • http://www.vldb.org/
Empiricism has a long history • In the 1950s, empiricism dominated a broad set of fields: • from psychology (behaviorism) • to electrical engineering (information theory). • At the time, it was common practice in linguistics to classify words not only on the basis of their meanings • but also on the basis of their co-occurrence with other words. • “You shall know a word by the company it keeps” (Firth, 1957) • Regrettably, interest in empiricism faded in the 1960s, following: • Chomsky's criticism of ngrams in Syntactic Structures (1957) and • Minsky and Papert's criticism of neural nets in Perceptrons (1969).
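The idea of classifying words by their co-occurrence with other words can be sketched in a few lines. This is only an illustration, not a method from the talk: the function name `cooccurrence_counts`, the `window` parameter, and the toy sentence are all assumptions made up for the example.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=2):
    """Count, for each word, the words seen within `window` positions of it.

    Hypothetical helper for illustration; a real study would use a large
    corpus and some association measure on top of these raw counts.
    """
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbor
                counts[word][tokens[j]] += 1
    return counts


# Toy input (invented for this sketch, not data from the talk).
tokens = "strong tea and powerful computers and strong coffee".split()
company = cooccurrence_counts(tokens, window=1)
print(company["strong"].most_common())
```

Words that keep similar "company" end up with similar count vectors, which is the intuition behind the Firth quotation.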
1990s Revival • Empiricism regained a dominant position: • Ngrams and Hidden Markov Models (HMMs) became the method of choice in Speech. • Neural Networks (Perceptrons + Hidden Layers) helped create Machine Learning. • Empiricism → Rationalism → Empiricism • Oscillates about once a career • Mark Twain: Grandparents and Grandchildren have a natural alliance.
Why the Revival? “It was a bad idea then, and it is still a bad idea now” • More powerful computers?? • Availability of massive quantities of data!! • Not long ago, the Brown Corpus was considered large. • But now, text is available like never before! • First came collection efforts (www.ldc.upenn.org), • And now everyone has access to the Web! • Experiments are routinely carried out on gigabytes of text. • Some researchers are even working with terabytes.
Big Changes Since 1993 • The Web, stupid! • Demos • Data • Research: • Shared resources + evaluation • Scale: How large is very large? • Increased breadth: Geography, Topics • Commercial: Wall Street & Main Street
The Web, Stupid! • If you publish a paper about neat stuff, it is expected that you will post it on the web. • I’ll mention just a few examples of neat stuff on the web. • Demos • Data • Tools
Lots of Neat Demos on the Web • Web Searching with Machine Translation • www.altavista.com (uses Systran) • Cross-Language Information Retrieval (CLIR): • www.xrce.xerox.com • Parallel Corpora: www-rali.iro.umontreal.ca • Latent Semantic Indexing (LSI) • superbook.bellcore.com/~remde/lsi • lsa.colorado.edu • Speech Synthesis: www.bell-labs.com/project/tts • Dotplot: www.cs.unm.edu/~jon/dotplot
Lots of Neat Data on the Web • Wordnet: www.cogsci.princeton.edu/~wn • Linguistic Data Consortium (LDC): • www.ldc.upenn.org • SIGLEX: www.clres.com/siglex.html • Discourse Resource Initiative (DRI) • www.georgetown.edu/luperfoy/Discourse-Treebank/dri-home.html • The Federalist Papers: • www.mcs.net/~knautzr/fed
More Neat Data on the Web (in Lots of Languages) • Chinese: • rocling.iis.sinica.edu.tw • www.sinica.edu.tw • Japanese: cl.aist-nara.ac.jp/lab/resource/resource.html • Electronic Dictionary Research (EDR): www.iijnet.or.jp/edr • Advanced Telecommunications Research (ATR): www.atr.co.jp • www.rdt.monash.edu.au/~jwb/japanese.html • Korean: korterm.kaist.ac.kr • European Language Resources Association (ELRA) • www.icp.grenet.fr/ELRA • Parallel Text (Resnik, ACL-99) • Canadian Hansards: WWW.Parl.GC.CA • Turkish: www.nlp.cs.bilkent.edu.tr • Swedish: svenska.gu.se
Lots of Neat Tools on the Web • Penntools (links to all over the world) • www.cis.upenn.edu/~adwait/penntools.html • Part of Speech Taggers (see above) • Juman/Chasen • pine.kuee.kyoto-u.ac.jp/nl-resource/juman.html • cl.aist-nara.ac.jp/lab/nlt/chasen.html • Suffix Arrays • http://cm.bell-labs.com/cm/cs/who/doug/ssort.c
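For readers who have not met suffix arrays, a minimal sketch may help. It is not the `ssort.c` program linked above: it uses a naive sort that is only suitable for small inputs, and the names `build_suffix_array` and `count_occurrences` are hypothetical.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of `text`, sorted lexicographically.

    Naive construction (sorts by suffix slices); fine for a demo,
    far too slow and memory-hungry for a very large corpus.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count occurrences of `pattern` in `text` with two binary searches over `sa`."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


text = "to be or not to be"
sa = build_suffix_array(text)
print(count_occurrences(text, sa, "to be"))
```

Because all suffixes that begin with the same substring sit next to each other in the sorted array, the frequency of any substring, of any length, falls out of the width of one contiguous range.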
Shared Resources + Evaluation • Common tasks: • TREC (trec.nist.gov), Tipster, MUC • Common benchmark corpora: Brown, Penn Treebank, Wall Street Journal, Switchboard • Shared lexical resources: Wordnet (www.cogsci.princeton.edu/~wn/) • Common labeling conventions/standards in all areas of NLP from Speech to Discourse • Evaluation, evaluation, evaluation • Required to get a paper accepted anywhere.
In 1993, it wasn’t like this... • Invited talks at ACL-93 • “Planning Multimodal Discourse” • “Transfers of Meaning” • “Quantificational Domains and Recursive Contexts” • Less sharing of resources • Evaluation not required
Empiricism vs. Rationalism • Pluses: Clear measurable progress • Speech Recognition • Part of Speech Tagging • Parsing • Minuses: Herd mentality, incrementalism, mindless metrics, duplicated effort • Recall: empiricism fell out of favor in the 1960s when methodology became too burdensome.
Main Street: Big change since 1993 • Large corpora are now having an impact on ordinary users: • Web search engines/portals • Managing Gigabytes: not just a popular book, but something that ordinary users are beginning to take for granted.
Huge Commercial Successes (Since 1993) • Information Retrieval & Digital Libraries • Web search engines/portals: highly successful on both Wall Street and Main Street • Invited talks from Lycos (1997) & Infoseek (1998) • Machine Translation & Speech • Available wherever software is sold • Can’t use a phone without talking to a computer
Mirror, mirror on the wall • Who is the largest of them all? • The Web? • Lexis-Nexis? • West? • We have had invited talks from all three • Web: Lycos (1997) & Infoseek (1998) • Lexis-Nexis (1993) • West (1997)
Internationalization • SIGDAT-93: Nearly equal participation • America: 4 papers • Asia: 4 papers • Europe: 3 papers • Great growth in activity around the world, especially Asia • SIGDAT has met in a dozen cities (50% in America) • America: Columbus, Cambridge, Philadelphia, Providence, Montreal, College Park • Asia: Kyoto, Beijing, Hong Kong • Europe: Dublin, Copenhagen, Granada
Some Topics that are Behind the International Expansion • Classic Issues • Machine Translation (MT) / Tools • Input Method Editor (IME): MS-IME98 • Morphology: Juman, Chasen • New Issues • Cross-language Information Retrieval (CLIR) • Browsing the Internet: integrate IME + CLIR + MT • Parallel and comparable corpora • Terminology Extraction & Alignment • Suffix Arrays
Broader (and More Applied) View of Computational Linguistics • Data-mining, Databases, Data Warehousing • Digital Libraries • Information Retrieval, Categorization, Extraction • Lexicography • Machine Learning • Machine Translation • Speech • Text Analysis
Data-Mining Issues (How Large is Very Large?) • Similar technology to corpus-based methods • But much larger datasets • Newswire (AP): 1 million words per week • Telephone calls: 1-10 billion per month • IP packets: expected to be even larger • Tasks: Fraud, Marketing, Operations, Care • Identify knobs that business partners can turn • Increase demand (buy TV ads, reduce price) • Increase supply (buy network capacity, enhance operations) • Target opportunities for improvement (marketing prospects) • Track market response in real time (supply/demand by knob)
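A back-of-the-envelope calculation puts the two figures on this slide on a common yearly footing. The slide's numbers are taken as given; the comparison of one word to one call record is a simplifying assumption, since a call record is not the same kind of unit as a word, and the helper name `yearly_volume` is hypothetical.

```python
WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12


def yearly_volume(per_period, periods_per_year):
    """Scale a per-period count up to a per-year count."""
    return per_period * periods_per_year


# Figures quoted on the slide.
newswire_per_year = yearly_volume(1_000_000, WEEKS_PER_YEAR)         # AP words
calls_per_year_low = yearly_volume(1_000_000_000, MONTHS_PER_YEAR)   # low end
calls_per_year_high = yearly_volume(10_000_000_000, MONTHS_PER_YEAR)  # high end

# Assumption: treat one call record as comparable to one word.
ratio_low = calls_per_year_low / newswire_per_year
ratio_high = calls_per_year_high / newswire_per_year

print(f"Newswire: {newswire_per_year:,} words per year")
print(f"Calls: {calls_per_year_low:,} to {calls_per_year_high:,} per year")
print(f"Ratio: roughly {ratio_low:.0f}x to {ratio_high:.0f}x")
```

Under that assumption, the call data is two to three orders of magnitude larger than a year of newswire, which is the sense in which "very large" keeps moving.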
Best of SIGDAT • Best Invited Talk • Work of Note • Work of Note (in Related Fields)
Best Invited Talk at a SIGDAT Meeting • Henry Kučera and Nelson Francis • Third Workshop on Very Large Corpora (1995) • Massachusetts Institute of Technology (MIT) • Cambridge, MA, USA • Described their work on the Brown Corpus • At a time when empiricism was out of fashion • especially at MIT • Personal & Touching (received standing ovation)
Work of Note • Statistical Machine Translation / Alignment • Brown et al. • Statistical Parsing (In 1993, poor use of lexical info) • Jelinek, Magerman, Charniak, Collins • Statistical PP Attachment • Hindle and Rooth • Word-sense Disambiguation • Yarowsky • Text-tiling (Discourse Parsing) • Hearst
Work of Note (in Related Fields) • Learning • Classification and Regression Trees (CART) • Ripper • Web Tools • Managing Gigabytes, Harvest, SGML → XML • Representation • Suffix Arrays • Latent Semantic Indexing
Summary: Reaching a Wider Audience • Commercial Successes • Main Street & Wall Street • Internationalization • Goal: equal representation from America, Asia & Europe • More topic areas • Information Retrieval, Speech, Machine Translation, Machine Learning, Data-mining
Self-organizing vs. EDA • Self-organizing: Learning, HMM • Statistics do it all • Manual • Wilks’ Stone Soup: Statistics don’t do nothing • Exploratory Data Analysis (EDA) • Hybrid of above
Time for a little controversy: Two types of Empiricism • New Linguistic Insights vs. Methodology • Reviewers do what reviewers do • Safe, conservative, seek precedents, case law • Reviewers go easy on methodology papers • Grim historical reminder: • Recall: empiricism fell out of favor in the 1960s when methodology became too burdensome. • We shouldn’t let the methodology get in the way of what we are here to do.