Introduction to the Language Technologies Institute
Fall 2008
Jaime Carbonell
jgc@cs.cmu.edu
School of Computer Science at Carnegie Mellon University
• Computer Science Department (theory, systems)
• Robotics Institute (space, industry, medical)
• Language Technologies Institute (MT, speech, IR)
• Human-Computer Interaction Institute (ergonomics)
• Institute for Software Research International (SE)
• Machine Learning Department (ML theory)
• Entertainment Technologies (animation, graphics)
Language Technologies Institute
• Founded in 1986 as the Center for Machine Translation (CMT).
• Became the Language Technologies Institute in 1996, unifying CMT and the Computational Linguistics program.
• Current size: 197 FTEs
  • 27 faculty (including joint appointments)
  • 25 staff
  • 125 graduate students (90 PhD, 40 MLT)
  • 10 visiting scholars
LTI Bill of Rights
• Get the right information
• To the right people
• At the right time
• On the right medium
• In the right language
• At the right level of detail
Slogan                  Challenges
…right information      IR, filtering, TC, …
…right people           routing, personalization, …
…right time             anticipatory analysis, …
…right medium           text, speech, video, …
…right language         translation, bio, …
…right detail           summarization, expansion
“…on the Right Medium”
• Speech Recognition
  • SPHINX (Reddy, Rudnicky, Rosenfeld, …)
  • JANUS (Waibel, Schultz, …)
• Speech Synthesis
  • Festival (Black, Lenzo)
• Handwriting & Gesture Recognition
  • ISL (Waibel, J. Yang)
• Multimedia Integration (CSD)
  • Informedia (Wactlar, Hauptmann, …)
“…in the Right Language”
• High-Accuracy Interlingual MT
  • KANT (Nyberg, Mitamura)
• Parallel Corpus-Trainable MT
  • Statistical MT (Lafferty, Vogel)
  • Example-Based MT (Brown, Carbonell)
• AVENUE Instructible MT (Levin, Lavie, Carbonell)
• Multi-Engine MT (Lavie, Frederking)
• Speech-to-speech MT
  • JANUS/DIPLOMAT/AVENUE (Waibel, Frederking, Levin, Schultz, Vogel, Lafferty, Black, …)
We also Engage in:
• Tutoring Systems (Eskenazi, Callan)
• Linguistic Analysis (Levin, Mitamura, …)
• Dialog Systems (Rudnicky, Waibel, …)
• Computational Biology
  • Protein structure/function (Carbonell, Langmead)
  • DNA seq/motifs (Yang, Xing, Rosenfeld)
• Complex System Design (Nyberg, Callan)
• Machine Learning (Carbonell, Lafferty, Yang, Rosenfeld, Xing, Cohen, …)
• Question Answering (Nyberg, Mitamura, …)
How we do it at LTI
Data-driven methods
• Statistical learning
• Corpora-based
• Examples:
  • Statistical MT
  • Example-based MT
  • Text categorization
  • Novelty detection
  • Translingual IR
Knowledge-based
• Symbolic learning
• Linguistic analysis
• Knowledge representation
• Examples:
  • Interlingual MT
  • Parsing & generation
  • Discourse modeling
  • Language tutoring
MMR Ranking vs Standard IR
[Figure: documents arranged around a query, ranked by standard IR and by MMR; λ controls the spiral curl]
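The idea behind MMR (Maximal Marginal Relevance) can be sketched in a few lines. At each step it picks the document that maximizes λ·sim(d, query) − (1 − λ)·max sim(d, already selected), so λ = 1 reduces to plain relevance ranking and smaller λ favors novelty. The sketch below is illustrative only: the function names, the whitespace tokenization, and the use of Jaccard overlap as the similarity measure are assumptions, not anything taken from the slides.

```python
from typing import Callable, List, Optional


def jaccard(a: str, b: str) -> float:
    """Toy similarity (an assumption): word-set overlap between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def mmr_rank(
    query: str,
    docs: List[str],
    lam: float = 0.7,
    k: Optional[int] = None,
    sim: Callable[[str, str], float] = jaccard,
) -> List[int]:
    """Return document indices in MMR order (greedy selection)."""
    k = len(docs) if k is None else min(k, len(docs))
    remaining = list(range(len(docs)))
    selected: List[int] = []
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = sim(docs[i], query)
            # Redundancy is the similarity to the closest already-selected doc.
            redundancy = max((sim(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a near-duplicate pair in the candidate set, lowering `lam` pushes the duplicate down the ranking in favor of a less relevant but novel document.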
Adaptive Filtering over a Document Stream
[Figure: document stream over time, split into training documents (past) and test documents, for Topic 1, Topic 2, Topic 3, …. Question for the current document: on-topic? Legend: unlabeled documents, on-topic documents, off-topic documents, RF.]
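A minimal sketch of the loop the figure suggests: a topic profile is built from past training documents, each incoming document is accepted or rejected against a threshold, and feedback on a judged document updates the profile. Everything here is an assumption made for illustration (the class name `TopicFilter`, bag-of-words term counts, cosine similarity, a fixed threshold, and reading "RF" as relevance feedback); the slides do not specify the actual method.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class TopicFilter:
    """Hypothetical single-topic filter over a document stream."""

    def __init__(self, training_docs, threshold=0.2):
        self.profile = Counter()
        for doc in training_docs:
            self.profile.update(doc.lower().split())
        self.threshold = threshold

    def is_on_topic(self, doc: str) -> bool:
        """Decide for the current document: on-topic or not."""
        return cosine(self.profile, Counter(doc.lower().split())) >= self.threshold

    def feedback(self, doc: str, relevant: bool) -> None:
        """Move the profile toward relevant documents and away from others."""
        terms = Counter(doc.lower().split())
        if relevant:
            self.profile.update(terms)
        else:
            self.profile.subtract(terms)
            self.profile = +self.profile  # drop terms whose weight fell to <= 0
```

A real filter would also adapt the threshold over time; this sketch keeps it fixed to stay short.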
Types of Machine Translation
[Figure: MT pyramid from Source (Arabic) to Target (English). Analysis side: Syntactic Parsing, Semantic Analysis. Apex: Interlingua. Generation side: Sentence Planning, Text Generation. Middle: Transfer Rules. Base: Direct (SMT, EBMT).]
EBMT Example
English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.
English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
English: I would like to meet the tallest man
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentru engu.
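The "new" output above is what naive fragment recombination produces: matched source fragments are replaced by their stored translations and concatenated, with no adjustment for agreement or word order. A toy sketch of that step follows. The fragment table is an assumption inferred from the example (the slides do not give the alignments), and `translate_by_fragments` is a hypothetical helper, not the actual EBMT engine.

```python
# Assumed fragment alignments, inferred from the two example sentence pairs.
FRAGMENTS = {
    "i would like to meet": "Ayükefun trawüael",
    "the tallest man": "Chi doy fütra chi wentru",
}


def translate_by_fragments(sentence: str, table: dict) -> str:
    """Greedy longest-match lookup of source fragments, then concatenation."""
    words = sentence.lower().rstrip(".").split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest span starting at position i first.
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in table:
                out.append(table[chunk])
                i = j
                break
        else:
            # No stored fragment covers this word: mark it as untranslated.
            out.append(f"<{words[i]}>")
            i += 1
    return " ".join(out)
```

Calling `translate_by_fragments("I would like to meet the tallest man", FRAGMENTS)` reproduces the uncorrected "new" string, which is why a real system needs a further step to reach the correct form.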
Ambiguity Makes MT Hard
Word senses for “line” (52 senses in Random House English-Japanese Dictionary)
Power line – densen (電線)
Subway line – chikatetsu (地下鉄)
(Be) on line – onrain (オンライン)
(Be) on the line – denwachuu (電話中)
Line up – narabu (並ぶ)
Line one’s pockets – kanemochi ni naru (金持ちになる)
Line one’s jacket – uwagi o nijuu ni suru (上着を二重にする)
Actor’s line – serifu (セリフ)
Get a line on – joho o eru (情報を得る)
CONTEXT: More is Better
• “The line for the new play extended for 3 blocks.”
• “The line for the new play was changed by the scriptwriter.”
• “The line for the new play got tangled with the other props.”
• “The line for the new play better protected the quarterback.”
(Borrowed from: Judith Klein-Seetharaman)
PROTEINS: Sequence, Structure, Function
Primary Sequence: MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA
[Figure: Primary Sequence → Folding → 3D Structure → Complex function within network of proteins. Panel label: Normal]
PROTEINS: Sequence, Structure, Function
Primary Sequence: MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA
[Figure: Primary Sequence → Folding → 3D Structure → Complex function within network of proteins. Panel label: Disease]
Predicting Protein Structures
• Protein structure is a key determinant of protein function
• Crystallography to resolve protein structures experimentally in vitro is very expensive; NMR can only resolve very small proteins
• The gap between known protein sequences and structures:
  • 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
• Therefore we need to predict structures in silico
Joint Labels Linked Segmentation CRF
• Node: secondary structure elements and/or simple fold
• Edges: local interactions and long-range inter-chain and intra-chain interactions
• L-SCRF: conditional probability of y given x is defined as
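The slide's own equation is not reproduced here. As a hedged sketch only, a conditional random field over a graph with node set V and edge set E generally takes the form below, with one family of feature functions on nodes and one on edges. The symbols f_k, g_l, θ_k, μ_l and Z(x) are generic CRF notation chosen for illustration; the L-SCRF definition may differ in its details (for example, in how segments and chains are indexed).

```latex
P(\mathbf{y} \mid \mathbf{x}) =
  \frac{1}{Z(\mathbf{x})}
  \exp\!\Big(
    \sum_{v \in V} \sum_{k} \theta_k \, f_k(\mathbf{x}, y_v)
    \;+\;
    \sum_{(u,v) \in E} \sum_{l} \mu_l \, g_l(\mathbf{x}, y_u, y_v)
  \Big),
\qquad
Z(\mathbf{x}) =
  \sum_{\mathbf{y}'}
  \exp\!\Big(
    \sum_{v \in V} \sum_{k} \theta_k \, f_k(\mathbf{x}, y'_v)
    \;+\;
    \sum_{(u,v) \in E} \sum_{l} \mu_l \, g_l(\mathbf{x}, y'_u, y'_v)
  \Big)
```

In this reading, the node terms would score individual secondary structure elements and the edge terms would score the local and long-range interactions listed above.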
Discriminative Semi-Markov Model for Parallel Right-handed β-Helix Prediction
• Structures
  • A regular super-secondary structure with an elongated helix whose successive rungs are composed of beta-strands
  • Conserved T2 turn
• Computational importance
  • Long-range interactions
• Biological importance
  • Functions such as the bacterial infection of plants, binding the O-antigen, antifreeze, ...
Some LTI Accomplishments
• First large-scale web spider (LYCOS)
• First speech-to-speech MT (JANUS)
• First high-accuracy text MT (KANT)
• First minority-language MT (DIPLOMAT)
• First high-accuracy translingual IR
• First multidocument summarizer (MMR)