
What's Happened Since the First SIGDAT Meeting?

Kenneth Ward Church, AT&T Labs-Research, kwc@research.att.com



  1. What's Happened Since the First SIGDAT Meeting? Kenneth Ward Church AT&T Labs-Research kwc@research.att.com

  2. The First SIGDAT Meeting • WVLC-1 was held just before ACL-93 • Great turnout! • More like a conference than a workshop • We knew that corpora were “hot,” • but didn't appreciate just how hot they would turn out to be.

  3. Sister meetings have also done very well since 1993 • Information Retrieval • http://www.acm.org/sigir/ • Digital Libraries • http://fox.cs.vt.edu/DL99/ • Machine Learning • http://www.cs.cmu.edu/Web/Groups/NIPS • Data-mining, Databases, Data Warehousing • http://www.acm.org/sigkdd/ • http://www.vldb.org/

  4. Empiricism has a long history • In the 1950s, empiricism dominated a broad set of fields: • from psychology (behaviorism) • to electrical engineering (information theory). • At the time, it was common practice in linguistics to classify words not only on the basis of their meanings but also on the basis of their co-occurrence with other words. • “You shall know a word by the company it keeps” (Firth, 1957) • Regrettably, interest in empiricism faded in the 1960s: • Chomsky's criticism of ngrams in Syntactic Structures (1957) and • Minsky and Papert's criticism of neural nets in Perceptrons (1969).
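The practice of classifying words by their co-occurrence with other words can be illustrated with a minimal sketch. The toy corpus, the window size, and the function name `cooccurrence_profiles` are all assumptions made for illustration; none of them comes from the talk.

```python
from collections import Counter, defaultdict

def cooccurrence_profiles(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    profiles = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own neighbor
                    profiles[word][tokens[j]] += 1
    return profiles

# Toy corpus, invented for illustration only.
toy_corpus = [
    "strong tea with milk",
    "strong coffee with milk",
    "powerful computer with memory",
]
profiles = cooccurrence_profiles(toy_corpus)

# Words that keep similar company share many neighbors.
shared_tea_coffee = set(profiles["tea"]) & set(profiles["coffee"])
shared_tea_computer = set(profiles["tea"]) & set(profiles["computer"])
```

In this toy data, "tea" and "coffee" share more neighbors than "tea" and "computer" do, which is the intuition behind Firth's remark. Real corpus work would need tokenization, frequency weighting, and far more text.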

  5. 1990s Revival • Empiricism regained a dominant position: • Ngrams and Hidden Markov Models (HMMs) became the method of choice in Speech. • Neural Networks (Perceptrons + Hidden Layers) helped create Machine Learning. • Empiricism → Rationalism → Empiricism • Oscillates about once a career • Mark Twain: Grandparents and Grandchildren have a natural alliance.

  6. Why the Revival? “It was a bad idea then, and it is still a bad idea now” • More powerful computers?? • Availability of massive quantities of data!! • Text is available like never before. • Not long ago, the Brown Corpus was considered large. • But now, text is available like never before! • First came collection efforts (www.ldc.upenn.org), • And now everyone has access to the Web! • Experiments are routinely carried out on gigabytes of text. • Some researchers are even working with terabytes.

  7. Big Changes Since 1993 • The Web, stupid! • Demos • Data • Research: • Shared resources + evaluation • Scale: How large is very large? • Increased breadth: Geography, Topics • Commercial: Wall Street & Main Street

  8. The Web, Stupid! • If you publish a paper about neat stuff, it is expected that you will post it on the web. • I’ll mention just a few examples of neat stuff on the web. • Demos • Data • Tools

  9. Lots of Neat Demos on the Web • Web Searching with Machine Translation • www.altavista.com (uses Systran) • Cross-Language Information Retrieval (CLIR): • www.xrce.xerox.com • Parallel Corpora: www-rali.iro.umontreal.ca • Latent Semantic Indexing (LSI) • superbook.bellcore.com/~remde/lsi • lsa.colorado.edu • Speech Synthesis: www.bell-labs.com/project/tts • Dotplot: www.cs.unm.edu/~jon/dotplot

  10. Lots of Neat Data on the Web • WordNet: www.cogsci.princeton.edu/~wn • Linguistic Data Consortium (LDC): • www.ldc.upenn.org • SIGLEX: www.clres.com/siglex.html • Discourse Resource Initiative (DRI) • www.georgetown.edu/luperfoy/Discourse-Treebank/dri-home.html • The Federalist Papers: • www.mcs.net/~knautzr/fed

  11. More Neat Data on the Web (in Lots of Languages) • Chinese: • rocling.iis.sinica.edu.tw • www.sinica.edu.tw • Japanese: cl.aist-nara.ac.jp/lab/resource/resource.html • Electronic Dictionary Research (EDR): www.iijnet.or.jp/edr • Advanced Telecommunications Research (ATR): www.atr.co.jp • www.rdt.monash.edu.au/~jwb/japanese.html • Korean: korterm.kaist.ac.kr • European Language Resources Association (ELRA) • www.icp.grenet.fr/ELRA • Parallel Text (Resnik, ACL-99) • Canadian Hansards: WWW.Parl.GC.CA • Turkish: www.nlp.cs.bilkent.edu.tr • Swedish: svenska.gu.se

  12. Lots of Neat Tools on the Web • Penntools (links to all over the world) • www.cis.upenn.edu/~adwait/penntools.html • Part of Speech Taggers (see above) • Juman/Chasen • pine.kuee.kyoto-u.ac.jp/nl-resource/juman.html • cl.aist-nara.ac.jp/lab/nlt/chasen.html • Suffix Arrays • http://cm.bell-labs.com/cm/cs/who/doug/ssort.c
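Suffix arrays come up several times in this talk, so a minimal sketch of the data structure may help. It is a naive illustration and not the `ssort.c` program linked above: the function names `build_suffix_array` and `count_occurrences` are hypothetical, and the construction sorts whole suffixes, which is only practical for short strings.

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(text):
    """Return the start positions of all suffixes of `text`, in sorted order.

    Naive construction: sorts by the full suffix, fine for small examples only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, suffix_array, pattern):
    """Count occurrences of `pattern` by binary search over the sorted suffixes."""
    # Truncate each suffix to the pattern length; the truncated list stays sorted,
    # so all suffixes that start with `pattern` form one contiguous run.
    prefixes = [text[i:i + len(pattern)] for i in suffix_array]
    return bisect_right(prefixes, pattern) - bisect_left(prefixes, pattern)

text = "to be or not to be"
suffix_array = build_suffix_array(text)
hits = count_occurrences(text, suffix_array, "to be")
```

Because all suffixes sharing a prefix sit next to each other, the frequency of any substring, of any length, falls out of two binary searches. A production version would avoid materializing the truncated prefixes and would use a faster construction algorithm.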

  13. Big Changes Since 1993 • The Web, stupid! • Demos • Data • Research: • Shared resources + evaluation • Scale: How large is very large? • Increased breadth: Geography, Topics • Commercial: Wall Street & Main Street

  14. Shared Resources + Evaluation • Common tasks: • TREC (trec.nist.gov), Tipster, MUC • Common benchmark corpora: Brown, Penn Treebank, Wall Street Journal, Switchboard • Shared lexical resources: WordNet (www.cogsci.princeton.edu/~wn/) • Common labeling conventions/standards in all areas of NLP from Speech to Discourse • Evaluation, evaluation, evaluation • Required to get a paper accepted anywhere.

  15. In 1993, it wasn’t like this... • Invited talks at ACL-93 • “Planning Multimodal Discourse” • “Transfers of Meaning” • “Quantificational Domains and Recursive Contexts” • Less sharing of resources • Evaluation not required

  16. Empiricism vs. Rationalism • Pluses: Clear measurable progress • Speech Recognition • Part of Speech Tagging • Parsing • Minuses: Herd mentality, incrementalism, mindless metrics, duplicated effort • Recall: empiricism fell out of favor in the 1960s when methodology became too burdensome.

  17. Big Changes Since 1993 • The Web, stupid! • Demos • Data • Research: • Shared resources + evaluation • Scale: How large is very large? • Increased breadth: Geography, Topics • Commercial: Wall Street & Main Street

  18. Main Street: Big change since 1993 • Large corpora are now having an impact on ordinary users: • Web search engines/portals • Managing Gigabytes: not just a popular book, but something that ordinary users are beginning to take for granted.

  19. Huge Commercial Successes (Since 1993) • Information Retrieval & Digital Libraries • Web search engines/portals: highly successful on both Wall Street and Main Street • Invited talks from Lycos (1997) & Infoseek (1998) • Machine Translation & Speech • Available wherever software is sold • Can’t use a phone without talking to a computer

  20. Big Changes Since 1993 • The Web, stupid! • Demos • Data • Research: • Shared resources + evaluation • Scale: How large is very large? • Increased breadth: Geography, Topics • Commercial: Wall Street & Main Street

  21. How Large is Very Large?

  22. Mirror, mirror on the wall • Who is the largest of them all? • The Web? • Lexis-Nexis? • West? • We have had invited talks from all three • Web: Lycos (1997) & Infoseek (1998) • Lexis-Nexis (1993) • West (1997)

  23. Big Changes Since 1993 • The Web, stupid! • Demos • Data • Research: • Shared resources + evaluation • Scale: How large is very large? • Increased breadth: Geography, Topics • Commercial: Wall Street & Main Street

  24. Internationalization • SIGDAT-93: Nearly equal participation • America: 4 papers • Asia: 4 papers • Europe: 3 papers • Great growth in activity around the world, especially Asia • SIGDAT has met in a dozen cities (50% in America) • America: Columbus, Cambridge, Philadelphia, Providence, Montreal, College Park • Asia: Kyoto, Beijing, Hong Kong • Europe: Dublin, Copenhagen, Granada

  25. Some Topics that are Behind the International Expansion • Classic Issues • Machine Translation (MT) / Tools • Input Method Editor (IME): MS-IME98 • Morphology: Juman, Chasen • New Issues • Cross-language Information Retrieval (CLIR) • Browsing the Internet: integrate IME + CLIR + MT • Parallel and comparable corpora • Terminology Extraction & Alignment • Suffix Arrays

  26. Big Changes Since 1993 • The Web, stupid! • Demos • Data • Research: • Shared resources + evaluation • Scale: How large is very large? • Increased breadth: Geography, Topics • Commercial: Wall Street & Main Street

  27. Broader (and More Applied) View of Computational Linguistics • Data-mining, Databases, Data Warehousing • Digital Libraries • Information Retrieval, Categorization, Extraction • Lexicography • Machine Learning • Machine Translation • Speech • Text Analysis

  28. Data-Mining Issues (How Large is Very Large?) • Similar technology to corpus-based methods • But much larger datasets • Newswire (AP): 1 million words per week • Telephone calls: 1-10 billion per month • IP packets: expected to be even larger • Tasks: Fraud, Marketing, Operations, Care • Identify knobs that business partners can turn • Increase demand (buy TV ads, reduce price) • Increase supply (buy network capacity, enhance operations) • Target opportunities for improvement (marketing prospects) • Track market response in real time (supply/demand by knob)
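To make the difference in scale concrete, the figures above can be put on a common monthly basis. This is only a back-of-the-envelope sketch: it compares counts of unlike things (words versus call records), and the conversion of 52/12 weeks per month is an assumption added here.

```python
# Figures from the slide: AP newswire volume and telephone call volume.
WEEKS_PER_MONTH = 52 / 12  # assumed conversion factor, roughly 4.33

newswire_words_per_month = 1_000_000 * WEEKS_PER_MONTH
calls_per_month_low = 1_000_000_000
calls_per_month_high = 10_000_000_000

# How many call records arrive for every newswire word, per month.
ratio_low = calls_per_month_low / newswire_words_per_month
ratio_high = calls_per_month_high / newswire_words_per_month

print(f"Call records per newswire word: {ratio_low:.0f}x to {ratio_high:.0f}x")
```

On these numbers the call data is larger by two to three orders of magnitude, which is why the slide treats data-mining as the same technology applied to much larger datasets.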

  29. Best of SIGDAT • Best Invited Talk • Work of Note • Work of Note (in Related Fields)

  30. Best Invited Talk at a SIGDAT Meeting • Henry Kučera and Nelson Francis • Third Workshop on Very Large Corpora (1995) • Massachusetts Institute of Technology (MIT) • Cambridge, MA, USA • Described their work on the Brown Corpus • At a time when empiricism was out of fashion • especially at MIT • Personal & Touching (received standing ovation)

  31. Work of Note • Statistical Machine Translation / Alignment • Brown et al. • Statistical Parsing (In 1993, poor use of lexical info) • Jelinek, Magerman, Charniak, Collins • Statistical PP Attachment • Hindle and Rooth • Word-sense Disambiguation • Yarowsky • Text-tiling (Discourse Parsing) • Hearst

  32. Work of Note (in Related Fields) • Learning • Classification and Regression Trees (CART) • Ripper • Web Tools • Managing Gigabytes, Harvest, SGML → XML • Representation • Suffix Arrays • Latent Semantic Indexing

  33. Summary: Reaching a Wider Audience • Commercial Successes • Main Street & Wall Street • Internationalization • Goal: equal representation from America, Asia & Europe • More topic areas • Information Retrieval, Speech, Machine Translation, Machine Learning, Data-mining

  34. Self-organizing vs. EDA • Self-organizing: Learning, HMM • Statistics do it all • Manual • Wilks’ Stone Soup: Statistics don’t do nothing • Exploratory Data Analysis (EDA) • Hybrid of above

  35. Time for a little controversy: Two types of Empiricism • New Linguistic Insights vs. Methodology • Reviewers do what reviewers do • Safe, conservative, seek precedents, case law • Reviewers go easy on methodology papers • Grim historical reminder: • Recall: empiricism fell out of favor in the 1960s when methodology became too burdensome. • Shouldn’t let the methodology get in the way of what we are here to do.
