Large-scale Knowledge Resources in Speech and Language Research
Mark Liberman
University of Pennsylvania
myl@cis.upenn.edu
LKR2004, 3/8/2004
Outline
• Glimpse of LKR in the U.S. landscape
• What is the relationship between large-scale knowledge resources and research and development on speech and language?
• What are some needs and opportunities?
• What are the trends?
• Illustrative examples
Glimpses of the U.S. LKR landscape
• DARPA research areas
  • Human Language Technology
  • Cognitive Information Processing
• NSF initiatives
  • Digital Libraries
  • ITR, Human Social Dynamics
  • “terascale linguistics”
• Biomedical research:
  • text, ontologies, databases, experiments
  • collaborations with Japan and Europe
• Language documentation
• Web archives in many disciplines
• ...too many other things to list...
What is the relationship between large-scale knowledge resources and research and development on speech and language?
Speech and language R&D needs LKR
  Modeling text: 10^4-10^6 words in 1975, 10^9-10^12 words today
  Modeling speech: 1-10 hours in 1975, 10^3-10^4 hours today
  + lexicons, parallel text, DBs for entity tracking, etc.
  + a thousand languages and dialects
  + history, social variation, register and genre, ...
Speech and language R&D creates LKR
  see above,
  but also something entirely new...
Some needs and opportunities
• Standards and tools for LKR
  • for creation, improvement, maintenance
  • for publication, distribution, archiving
  • for search, access and use
• An academic culture that rewards production and distribution of LKR
  • most LKR are a side effect of individual and small-group research
  • virtual “meta-resources” from many sources
• Part of the answer: integrate LKR into the system of (scientific and scholarly) publication
Themes and trends
• A New Empiricism
  focus on large-scale resources, because quantity (of data) → quality (of knowledge)
• Language + Life = Meaning
  something new emerges from large collections of symbols, signals, contexts, connections
• People and machines: better together
  • cognitive prosthetics
  • interactive working, playing and learning
• Failure is the basis for success
  if we can measure error, we can learn to improve
Some illustrative examples...
A famous argument
(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.
“. . . It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally ‘remote’ from English. Yet (1), though nonsensical, is grammatical, while (2) is not.”
Noam Chomsky, “Syntactic Structures” (1957)
But is it true?
43 years later
• someone finally checked...
• Pereira, “Formal grammar and information theory” (2000)
• simple “aggregate bigram model” using hidden class variables c
• with C = 16 classes, trained on ~100 million words of newswire data
• the result: “Furiously sleep ideas green colorless” is more than 200,000 times less probable than “Colorless green ideas sleep furiously”
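For reference, the aggregate bigram model named on this slide is usually written as a mixture over the hidden classes. The notation below is an assumption (the slide gives only the name, the class variable c and C = 16); it is a sketch of the standard form, not a quotation from Pereira's paper.

```latex
% Bigram probability as a mixture over hidden classes c = 1..C
P(w_2 \mid w_1) \;=\; \sum_{c=1}^{C} P(c \mid w_1)\, P(w_2 \mid c)

% Sentence probability under the bigram assumption
P(w_1 \dots w_n) \;=\; P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})
```

Because every word pair gets some probability through the shared classes, two sentences made entirely of unseen bigrams need not be "equally remote" from English.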
What changed?
• Partly:
  • new models and estimation methods
  • better computing resources
  • more accessible data
• Mostly:
  • willingness to look for solutions
  • opportunities to apply them
To be fair, this kind of modeling became a real option only about 1980.
Now it can be done as an undergraduate term project ...
Social structure from conversation
• Human social dynamics: model of conversational turn-taking
• U.S. Supreme Court oral arguments
• Modeling is simple and local
  • one session modeled at a time (~250 turns)
  • data is just sequence of (~250) speaker IDs
• Undergraduate term project in intro course (credit to: Chris Osborn)
CHIEF JUSTICE WILLIAM H. REHNQUIST: We'll hear argument next in No. 01-298, Paul Lapides v. the Board of Regents of the University System of Georgia. Spectators are admonished, do not talk until you get outside the courtroom. The court remains in session. Mr. Bederman.
MR. DAVID J. BEDERMAN: Mr. Chief Justice, and may it please the Court: When a State affirmatively invokes the jurisdiction of the Federal court by removing a case, that acts as a waiver of the State's forum immunity to Federal jurisdiction under the Eleventh Amendment. This principle ...
JUSTICE ANTONIN SCALIA: When you say as an actor in any role, does it ever intervene as a defendant?
MR. BEDERMAN: Yes, Justice Scalia. This Court's precedents seem to indicate that wherever the State is cast in the role of plaintiff, defendant, intervenor, or claimant, that the entry into the Federal proceeding submits the State to the jurisdiction of the Federal court.
CHIEF JUSTICE REHNQUIST: How about the Ford Motor Company case?
MR. BEDERMAN: Well, of course, the authorization requirement in Ford Motor -- and that's the particular holding in Ford Motor that I think is of concern to this Court -- need not be reached here because, of course, ...
CHIEF JUSTICE REHNQUIST: So, you think a line can be drawn between the State defendant being drawn in as a respondent or involuntarily as opposed to removing and thereby invoking Federal jurisdiction.
+ ... 254 turns ...
Two-class “aggregate bigram model”, trained on a single one-hour argument (01-298), highest-probability class for each speaker:
class 1 = ( chief justice william h. rehnquist
            justice anthony kennedy
            justice antonin scalia
            justice john paul stevens
            justice ruth bader ginsburg
            justice sandra day o'connor
            justice stephen g. breyer )
class 2 = ( mr. david j. bederman
            mr. irving l. gornstein
            ms. devon orland
            ms. julie c. parsley )
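A model of this kind can be fitted to a bare sequence of speaker IDs with a few dozen lines of EM. The sketch below is an illustration under assumptions, not the term project's actual code: the function name, the random initialization, the iteration count and the toy speaker labels are all hypothetical.

```python
import math
import random
from collections import Counter

def train_aggregate_bigram(tokens, n_classes=2, n_iter=50, seed=0):
    """EM for p(w2 | w1) = sum_c p(c | w1) * p(w2 | c) over a token sequence."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    counts = Counter(zip(tokens, tokens[1:]))  # N(w1, w2)

    def normalized(weights):
        total = sum(weights)
        return [w / total for w in weights]

    # Random positive starting points (an assumption; any positive init works).
    p_c_given_w = {w: normalized([rng.random() + 0.5 for _ in range(n_classes)])
                   for w in vocab}
    p_w_given_c = [dict(zip(vocab, normalized([rng.random() + 0.5 for _ in vocab])))
                   for _ in range(n_classes)]
    history = []  # log-likelihood before each update
    for _ in range(n_iter):
        a_num = {w: [0.0] * n_classes for w in vocab}
        b_num = [dict.fromkeys(vocab, 0.0) for _ in range(n_classes)]
        loglik = 0.0
        # E-step: split each bigram count across classes by posterior.
        for (w1, w2), n in counts.items():
            joint = [p_c_given_w[w1][c] * p_w_given_c[c][w2]
                     for c in range(n_classes)]
            total = sum(joint)
            loglik += n * math.log(total)
            for c in range(n_classes):
                expected = n * joint[c] / total
                a_num[w1][c] += expected
                b_num[c][w2] += expected
        history.append(loglik)
        # M-step: renormalize expected counts (skip rows with no evidence).
        for w in vocab:
            if sum(a_num[w]) > 0:
                p_c_given_w[w] = normalized(a_num[w])
        for c in range(n_classes):
            mass = sum(b_num[c].values())
            if mass > 0:
                p_w_given_c[c] = {w: v / mass for w, v in b_num[c].items()}
    return p_c_given_w, p_w_given_c, history

# Hypothetical toy turn sequence; real input would be the ~250 speaker IDs.
turns = ["justice_a", "lawyer_x", "justice_b", "lawyer_x", "justice_a",
         "lawyer_y", "justice_c", "lawyer_y", "justice_b", "lawyer_x"] * 5
p_c_given_w, p_w_given_c, history = train_aggregate_bigram(turns, n_iter=30)
# Highest-probability class for each speaker, as on the slide.
best_class = {speaker: max(range(2), key=probs.__getitem__)
              for speaker, probs in p_c_given_w.items()}
```

Which class number ends up holding which group depends on the initialization; only the grouping itself is meaningful.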
So human social roles can emerge from a trivial statistical model of speaker sequencing in a formal setting.
And sometimes you don’t need a lot of data.
...though in this case, it was crucial that Jerry Goldman’s Oyez Project is publishing all Supreme Court oral arguments (audio and transcripts).
In most cases the quantity of data is crucial: data quantity → knowledge quality
... and available resources are just starting to pass a threshold
A case where size matters...
• English complex nominals: sequence of nouns and adjectives, e.g. Volume Feeding Management Success Formula Award
• Part-of-speech string offers little help in parsing:
    [ stone [ traffic barrier ]]      N N N
    [[ job growth ] statistics ]      N N N
• Apparently, parsing requires “understanding”
The MEDLINE corpus
• U.S. National Library of Medicine
• ~12 million references and abstracts
• biomedical journal articles
• 1966 to present
• ~10^9 words
Parsing by counting (in MEDLINE)
Parsing by counting (Google hits)
First attempt at this idea: for AT&T TTS in 1987
First real success: ~15 years later
The difference:
  It doesn’t really work with 10^7-10^8 tokens
  It works pretty well with 10^9-10^12 tokens
• “You can observe a lot just by watching.” -Yogi Berra
• here... “You can analyze a lot just by counting.”
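One simple way to "parse by counting" a three-noun compound is to compare how often each adjacent pair occurs on its own. The sketch below assumes that adjacency comparison; the slides do not say which counting rule was used, and the function names and the tiny corpus are hypothetical stand-ins for MEDLINE or web counts.

```python
from collections import Counter

def bigram_counts(corpus):
    """Count adjacent word pairs over a list of sentences."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

def bracket(n1, n2, n3, counts):
    """Bracket 'n1 n2 n3' by comparing counts of (n1 n2) and (n2 n3)."""
    left = counts[(n1, n2)]
    right = counts[(n2, n3)]
    if left > right:
        return f"[[ {n1} {n2} ] {n3} ]"
    if right > left:
        return f"[ {n1} [ {n2} {n3} ]]"
    return None  # tie: the counts give no evidence either way

# Hypothetical toy corpus; with so little text most compounds would tie,
# which is the slide's point about needing 10^9 or more tokens.
toy_corpus = [
    "job growth slowed last quarter",
    "analysts expect job growth to recover",
    "the traffic barrier was replaced",
    "a new traffic barrier blocks the road",
]
counts = bigram_counts(toy_corpus)
job = bracket("job", "growth", "statistics", counts)
stone = bracket("stone", "traffic", "barrier", counts)
```

With realistic compounds, both pair counts are often zero in a small corpus, so the rule only becomes useful once the counts come from a very large collection.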
As the SCOTUS example suggests, “large-scale” is not just the number of words or hours. Structure, context and external relationships can also be crucial; here it was the sequence of speaker identities.
Here’s a simple but compelling example of how symbol-like structure emerges as zebra finches practice a song...
This is research by Ofer Tchernichovski (CCNY), Partha Mitra and others.
Zebra finch song learning
Ofer Tchernichovski (CCNY)
[Figure: song spectrogram; frequency (Hz) against time, 0 to 700 ms]
Song imitation: young birds imitate adults
[Figure panels: Tutor’s song; Pupil’s song]
Song imitation
* Can be very accurate
* Critical period: developmental learning
* Song template: memory traces of a model
* Learning requires auditory feedback
[Figure: timeline of age in days (0 to 100) marking the sensory phase and the sensory-motor phase]
Initially: Social & acoustic isolation
Days 35 / 43 / 60: Start training
The training system
Laboratory of Animal Behavior, CCNY
Real-time calculation of acoustic features
4 simple acoustic features with articulatory correlates:
• Wiener entropy: (-) pure tone ... (+) noise
• Spectral continuity: (-) low ... (+) high
• Pitch: (-) low ... (+) high
• FM: (-) low ... (+) high
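Wiener entropy is commonly defined as the log of the ratio between the geometric and arithmetic means of the power spectrum, which is why it is strongly negative for a pure tone and close to zero for noise. The sketch below assumes that standard definition; whether the lab's real-time system computes it exactly this way is not stated on the slide, and the frame length, the small floor value and the naive DFT are illustrative choices.

```python
import cmath
import math
import random

def power_spectrum(frame):
    """Naive DFT power for the positive-frequency bins 1 .. N/2 - 1."""
    n = len(frame)
    spectrum = []
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def wiener_entropy(power, floor=1e-12):
    """log(geometric mean / arithmetic mean); the floor avoids log(0)."""
    power = [p + floor for p in power]
    log_geometric_mean = sum(math.log(p) for p in power) / len(power)
    arithmetic_mean = sum(power) / len(power)
    return log_geometric_mean - math.log(arithmetic_mean)

n = 128
# A sinusoid that sits exactly on DFT bin 10, and Gaussian white noise.
tone = [math.sin(2 * math.pi * 10 * t / n) for t in range(n)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]

tone_entropy = wiener_entropy(power_spectrum(tone))
noise_entropy = wiener_entropy(power_spectrum(noise))
```

A real-time system would use an FFT and overlapping windows; the quadratic-time DFT here only keeps the example dependency-free.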
Dynamic Vocal Development maps
Dynamic Vocal Development (DVD) map of a single bird
[Figure: development panels for Days 35, 45, 55, 65, 75 and 85, with the onset of training marked]
Language + Life = Meaning
• Text (and speech) structured by:
  • conversational context
    • time, place, sequence, participants, ...
  • content
    • types and identities of referenced entities
    • explicit links (anaphora, references, hyperlinks)
    • implicit links (quotation, imitation, opposition)
  • other contextual data
    • e.g. neurological, gene expression data in birdsong learning
    • gaze, gesture, posture, physiological data in conversation
A small application: real conversational transcription
• Perfect automatic speech-to-text (STT) yields:
  ew very nice yes that’s that’s the ah first car uh well my first ownership of something major that’s cool i had to buy my car my other car burned down so it was my first brand new car uh-huh but i love it so i am very happy
• STT + “metadata” yields “Rich Transcription”:
  Speaker 1: Very nice.
  Speaker 2: Yes. That’s my first ownership of something major.
  Speaker 1: That’s cool. I had to buy my car. My other car burned down. It was my first brand new car.
  Speaker 2: Uh-huh.
  Speaker 1: But I love it. I am very happy.
One aspect of conversational metadata: Diarization
Goal: Label acoustic “sources” and their attributes
• speakers, music, noise, DTMF, background events
Interactive annotation
• Supervised learning: human annotates, machine learns
• Unsupervised learning: machine looks for structure in raw data
• Semi-supervised learning: human annotates a few examples, machine tries to generalize
• “Active learning”: machine selects cases that are interesting or uncertain, asks for human judgments
• Sampling experiments: human checks machine annotation of selected cases, then applies the sample confusion matrix to estimate overall statistics
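The sampling experiment in the last bullet can be sketched as follows: a human checks a sample of machine labels, the resulting confusion matrix gives P(human label | machine label), and those rates are applied to the machine's label counts over the whole collection. The function name, the tag names and all numbers below are hypothetical, and the fallback for unchecked labels is an assumption.

```python
from collections import Counter, defaultdict

def estimate_true_counts(machine_counts, checked_sample):
    """Estimate corrected label totals from a human-checked sample.

    machine_counts: machine label -> count over the whole collection
    checked_sample: list of (machine_label, human_label) pairs
    """
    confusion = defaultdict(Counter)
    for machine_label, human_label in checked_sample:
        confusion[machine_label][human_label] += 1

    estimate = Counter()
    for machine_label, total in machine_counts.items():
        row = confusion[machine_label]
        checked = sum(row.values())
        if checked == 0:
            # Assumption: with no checked cases, trust the machine label.
            estimate[machine_label] += total
            continue
        for human_label, n in row.items():
            # Redistribute the machine's total by P(human | machine).
            estimate[human_label] += total * n / checked
    return dict(estimate)

# Hypothetical numbers: 1000 machine-tagged tokens, 20 of them checked.
machine_counts = {"NN": 800, "VB": 200}
checked_sample = ([("NN", "NN")] * 9 + [("NN", "VB")] * 1 +
                  [("VB", "NN")] * 2 + [("VB", "VB")] * 8)
estimated = estimate_true_counts(machine_counts, checked_sample)
```

Here the estimate moves 40 tokens net from NN to VB (760 NN, 240 VB), without anyone checking more than the 20 sampled cases.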
The cycle of interactive annotation
[Diagram: a cycle linking Hand Annotation, Machine Learning, Automatic annotation, (Selective) Sampling/Labeling, and Hand Correction]
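The "(Selective) Sampling" step of the cycle can be sketched as uncertainty sampling: send the items the current model is least sure about to the human annotator. Entropy as the uncertainty measure, the function names and the example probabilities are all assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability list (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(predictions, k):
    """Return the k item IDs whose predicted distributions are most uncertain.

    predictions: item ID -> list of class probabilities from the current model
    """
    ranked = sorted(predictions,
                    key=lambda item: entropy(predictions[item]),
                    reverse=True)
    return ranked[:k]

# Hypothetical model outputs for four utterances.
predictions = {
    "utt1": [0.98, 0.02],
    "utt2": [0.55, 0.45],
    "utt3": [0.80, 0.20],
    "utt4": [0.50, 0.50],
}
chosen = select_for_annotation(predictions, 2)
```

The corrected labels for the chosen items would then go back into training, closing the loop.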
POS tagger trained on WSJ applied to MEDLINE:
Same tagger, after retraining... (~200 MEDLINE abstracts):
The key to success: learn to measure failure...
Even a badly flawed measure can produce important gains.
One year of quantitative evaluation... Arabic to English
[Chart: percent of human score (axis 50% to 100%) in 2002 and 2003 for the best research system and the best COTS system; plotted values: 89%, 58%, 57%, 51%]
Scoring Method
Percent of Human = (Machine Translation Score / Human Translation Score) × 100
Translation Score = weighted sum of n-gram matches between the translation being scored (human or machine) and three good reference translations
Reference translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
[Figure legend: uni-gram match, bi-gram match, tri-gram match]
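The scoring method can be sketched in a few lines. The slide specifies only a weighted sum of n-gram matches against three references; the equal weights, the clipping of repeated n-grams, the normalization by the number of candidate n-grams, and the example sentences below are assumptions, so this is an illustration of the idea rather than the evaluation's actual metric.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def translation_score(candidate, references, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum over n = 1, 2, 3 of the fraction of matched n-grams."""
    cand = candidate.lower().split()
    refs = [ref.lower().split() for ref in references]
    score = 0.0
    for n, weight in enumerate(weights, start=1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        if total == 0:
            continue  # candidate too short to contain any n-gram of this size
        ref_counts = [ngrams(ref, n) for ref in refs]
        # Clip each n-gram's credit at its highest count in any one reference.
        matched = sum(min(count, max(rc[gram] for rc in ref_counts))
                      for gram, count in cand_counts.items())
        score += weight * matched / total
    return score

def percent_of_human(machine_score, human_score):
    """Machine score as a percentage of the human translator's score."""
    return 100.0 * machine_score / human_score

# Hypothetical references and outputs.
references = [
    "the airport received a threatening e-mail",
    "a threatening e-mail was sent to the airport",
    "the airport got an e-mail containing a threat",
]
human = translation_score("the airport received a threatening message", references)
machine = translation_score("airport the receives one electronic mail", references)
ratio = percent_of_human(machine, human)
```

Scoring a held-out human translation with the same function is what makes the "percent of human" normalization possible.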
Best System Outputs
2002:
insistent Wednesday may recurred her trips to Libya tomorrow for flying
Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment . And said the official " the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air , a situation her receiving replying are so a trip will pull to Libya a morning Wednesday " . Certain are " the lines is air Libyan I will start also in of three trips running weekly to Cairo in the coordination with Egypt for flying " .
2003:
Egyptair Has Tomorrow to Resume Its Flights to Libya
Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya. " The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning ". The Libyan Arab Airways will also in the conduct of the three times a week in Cairo in coordination with egyptair ".
Human v. Machine
Human:
Egypt Air May Resume its Flights to Libya Tomorrow
Cairo, April 6 (AFP) - An Egypt Air official announced, on Tuesday, that Egypt Air will resume its flights to Libya as of tomorrow, Wednesday, after the UN Security Council had announced the suspension of the embargo imposed on Libya. The official said that, "the company sent a letter to the Ministry of Foreign Affairs to inquire about the lifting of the air embargo on Libya, and in the event that it receives a response, then the first flight to Libya, will take off, Wednesday morning." He stressed that "the Libyan Airlines will begin scheduling three weekly flights to Cairo, in coordination with Egypt air."
2003 (machine):
Egyptair Has Tomorrow to Resume Its Flights to Libya
Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya. " The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning ". The Libyan Arab Airways will also in the conduct of the three times a week in Cairo in coordination with egyptair ".
Summary
• Speech and Language Research
  • needs LKR
  • creates LKR
  • can help other disciplines deal with LKR
  • is helped by other disciplines, who provide
    • raw data as well as relevant LKR pieces
    • problems, algorithms, inspiration
• The whole is greater than the sum of the parts
  • Types, sources and amounts of data
  • Collaboration within and across disciplines
  • Cooperation of humans and machines