Computational Linguistics at OSU

Presentation Transcript


  1. Computational Linguistics at OSU Chris Brew Linguistics, Cognitive Science and CSE The Ohio State University

  2. Who am I? • Chris Brew, Associate Professor • Full-time in NLP since about 1984. • B.Sc Chemistry (Bristol) • Masters and Ph.D (Sussex) • NLP done in a Psychology department! • Research positions at Sussex, Edinburgh and in industry (Sharp) • Faculty in Linguistics at Ohio State since 2000 • Joint appointment in CSE

  3. What I’ve done • Parsing and Dialogue • Machine Translation (teaching class now) • XML and corpus annotation • Learning word meanings from large datasets • Sound/Meaning relations • Other stuff…

  4. Linguistics • Linguistics is the scientific study of language and communication. • Linguists run experiments, do surveys, build simulations, do proofs. • Linguistics at OSU is: • In the top 10 nationally • Diverse and open-minded

  5. Strengths of Linguistics at OSU • Syntax, Semantics, Pragmatics • Phonetics: the study of how people make and perceive the sounds of language • Psycholinguistics: the study of how people process sounds, words, sentences, intonation • Sociolinguistics: the study of how society and social situations change the way we speak. • Computational Linguistics and NLP

  6. Computational Linguistics at OSU • 3 faculty members and 20 students based in Linguistics (Oxley Hall) • Detmar Meurers (Parsing, Corpus Annotation, Computer-aided Language Learning) • Chris Brew (Statistical NLP) • Michael White (Natural Language Generation) • Close ties with Drs. Byron and Fosler-Lussier in CSE. • We are willing and able to advise or co-advise on research, and have projects that cross the departmental boundaries.

  7. Computational Linguistics • Data Intensive Linguistics: using large datasets to answer questions about language • How do children learn language? • How do technical terms get their meanings? • Why do people have so little difficulty understanding one another? • How are words stored in the brain?

  8. Computational Linguistics • Machine understanding: building machines that read, write and converse using natural language. • Several well-known subtasks • Tokenization • Parsing: building syntax trees • Building meaning representations (MR) • Generating language from MR

  9. Computational Linguistics • NLP: building systems that do useful or interesting things with language • Summarization • Machine Translation • Question Answering • Document Understanding

  10. Relation to CSE • Challenging problems in working with large datasets. • Document classification is large along three dimensions • Large number of available predictive features (10^4 different words in typical collections) • Many instances (1000s or millions of sentences) • Many possible outputs (e.g. classify against the 100s of labels in the DMOZ hierarchy)
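
A minimal sketch of the three dimensions named on slide 10, assuming a plain bag-of-words representation (the slide does not say which feature encoding is used). The toy documents, the labels and the helper names `tokenize` and `to_vector` are hypothetical and only illustrate where the feature, instance and output counts come from.

```python
from collections import Counter

# Toy corpus of (text, label) pairs. Hypothetical data, not from the talk.
docs = [
    ("cheap flights to paris", "travel"),
    ("elegant evening wear", "fashion"),
    ("flights and hotels in paris", "travel"),
]

def tokenize(text):
    """Whitespace tokenizer; real systems need something more careful."""
    return text.lower().split()

# Dimension 1: predictive features, one per distinct word.
vocab = sorted({w for text, _ in docs for w in tokenize(text)})

def to_vector(text):
    """Map a document to a vector of word counts over the vocabulary."""
    counts = Counter(tokenize(text))
    return [counts.get(w, 0) for w in vocab]

# Dimension 2: instances, one row per document.
matrix = [to_vector(text) for text, _ in docs]

# Dimension 3: possible outputs, one per distinct label.
labels = sorted({label for _, label in docs})

print(len(vocab), len(matrix), len(labels))  # prints: 10 3 2
```

In a realistic collection the first number grows to roughly 10^4, the second to thousands or millions, and the third to hundreds, which is what makes the problem interesting from a CSE point of view.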

  11. Relation to CSE • Consumer of CS tools • Tokenization, Parsing • Could use lex and yacc (javacc/antlr), but beware ambiguity • Many special purpose parsers, taggers, chunkers that use machine learning to achieve robustness • Machine understanding • AI-complete • Prolog and other PL innovations caused by NL research

  12. Why the world cares • 1700 biology papers per day. Nobody can keep up UNDERSTAND/SUMMARIZE • Ad placement in search engines. Perhaps you can spot a search for flights to Paris, place a successful sidebar ad for expensive and elegant evening wear. INTENT • Automated essay grading CLASSIFICATION • Too many emails to monitor. Spooks can’t keep up. • Especially in Arabic

  13. There is demand… • Develop language-independent algorithms, techniques, and methodologies to support rapid development of the basic resources … for any arbitrary language with a written form. Corpus-based unsupervised and lightly-supervised methods are acceptable, as are lightweight elicitation methodologies from untrained native speakers or other generally available (in the US) informants. • Research on English and Foreign Language EXploitation (REFLEX), Broad Agency Announcement (BAA), BAA 04-01-FH, 15 March 2004

  14. Current work • NSF Career project • Key idea: dimensionality reduction for linguistic data. • Hypothesis: neighborhood structure is more important and cognitively salient than (for example) preserving detail of long-distance relationships • Compare: min-cut, LLE, SNE, LSI

  15. Paul Davis • Statistical Machine Translation • Is there a simple and flexible architecture for Statistical MT? • Why: current systems are all built on an IBM design. • They all mess up • They all mess up in much the same way • Alternatives are needed. • Graduated 2002: now at Motorola Research

  16. Martin Jansche • Learning String-to-String Transductions (mostly for text-to-speech) • Bucks -> /b u k z/ • Why: People were doing lots of this, but the theory, the evaluation criteria and the quality of the resulting systems left much to be desired. • Graduated 2003: now at Columbia Center for Machine Learning as research faculty

  17. Nathan Vaillette • Formally verified string-to-string transductions. • Rule: aa -> b • Input aaacaa. What is the output? • bbcb ? • bacb ? • abcb ? • Why: rules like these are used a lot, but there is no convincing account of exactly what they mean.

  18. … Nathan Vaillette • Used technology from hardware verification (!) to build and implement a formal model of the string rewriting process. • First ever implementation of this widely used component for which the specification is clear and the correspondence between specification and implementation is provably correct. • Graduated 2003: now teaching AI at Hampshire College
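
The ambiguity raised on slide 17 can be made concrete with a small sketch. The three functions below are illustrative interpretations of the rule aa -> b, not the formal model from the thesis: the function names and the choice of strategies (left-to-right, right-to-left, and simultaneous rewriting of overlapping matches) are assumptions made for this example. Each strategy yields a different one of the three candidate outputs.

```python
def rewrite_left_to_right(s, lhs, rhs):
    """Scan from the left, replacing non-overlapping matches as they are found."""
    out, i = [], 0
    while i < len(s):
        if s.startswith(lhs, i):
            out.append(rhs)
            i += len(lhs)
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

def rewrite_right_to_left(s, lhs, rhs):
    """Scan from the right, replacing non-overlapping matches as they are found."""
    out, i = [], len(s)
    while i > 0:
        if i >= len(lhs) and s[i - len(lhs):i] == lhs:
            out.append(rhs)
            i -= len(lhs)
        else:
            out.append(s[i - 1])
            i -= 1
    return "".join(reversed(out))

def rewrite_simultaneous(s, lhs, rhs):
    """Rewrite every match at once, including matches that overlap each other."""
    starts = {i for i in range(len(s) - len(lhs) + 1) if s.startswith(lhs, i)}
    covered = {j for i in starts for j in range(i, i + len(lhs))}
    out = []
    for i, ch in enumerate(s):
        if i in starts:
            out.append(rhs)      # one output per match start
        elif i not in covered:
            out.append(ch)       # copy characters outside every match
    return "".join(out)

print(rewrite_left_to_right("aaacaa", "aa", "b"))   # bacb
print(rewrite_right_to_left("aaacaa", "aa", "b"))   # abcb
print(rewrite_simultaneous("aaacaa", "aa", "b"))    # bbcb
```

All three are defensible readings of the same rule, which is why a precise specification of the rewriting process matters.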

  19. Sabine Schulte im Walde • Inducing German Verb Classes from Corpus Data. • Why: build better dictionaries automatically • Why: difficult large dataset • Technology: k-means, spectral clustering • Graduated 2003 from University of Stuttgart: Language Technology Manager with Duden dictionaries, then research staff at University of Saarbrücken

  20. Kyuchul Yoon • Grapheme to Phoneme conversion for Korean • Why: words of foreign origin need special treatment, existing machine learning approaches are too knowledge-free • Graduated 2005: now at Pusan University

  21. Anna Feldman • Using Czech language resources to bootstrap resources for Russian • Why: Czech and Russian are supposed to be related, but can we use this fact technologically? • Yes. Works, but not perfectly. • Same thing, for Spanish and Portuguese

  22. Anton Rytting • Computational and experimental studies of spoken language, emphasis on word segmentation strategies that might be useful to infants • Why: infants should be able to learn any language.

  23. Medical Informatics (very new) • Collaboration with John Pestian, Cincinnati Children's Hospital Medical Center • Why: doctors provide discharge summaries (i.e. text), we want information (mundanely: ICD-9 terms as billing codes) • How: neural networks, careful encoding of domain knowledge. Tuning of ICD-9 to include/exclude terms that do/don't occur in radiology summaries

  24. What I’d like to do more of • Very large scale work • Unsupervised and lightly supervised learning • Cute applications of machine learning • Distributed and parallel NLP

  25. What I am looking for? • People who can take an idea about learning from data and turn it into a Master’s thesis. Especially people who have side expertise in an application area, such as medicine, biology, business, lion-taming. • Might have funding for the right person, though Linguistics Ph.D students take precedence.

  26. What I am looking for? • People who can take an idea about learning from data and turn it into a Master’s thesis. Especially people who have side expertise in an application area, such as medicine, biology, business, lion-taming. • People with very good communication and programming skills who could collaborate with a Linguistics student to make something better than either could alone. Cognitive Science summer fellowships. • Interesting new problems that can be learned from data.