Career Opportunities for Linguists in Industry
Vita Markman, Computational Linguistics Engineer, DIMG
Linguists in Industry??!!!
Overview
I. What can a linguist do in industry?
II. Some concrete examples of what linguists work on “in the real world”
III. From school to work: what skills/knowledge one needs
IV. Useful References
V. Conclusion and Q&A
But Why?
Because:
• Having options is good
• You may want to take a year off before going to grad school, yet work on something relevant
• It's really fun
• Research jobs are NOT confined to the academic setting
• Not all industry jobs are like the ones in 'Office Space'
Explosion of Text Data
• Explosion of linguistic data online
• Social media (Twitter, Facebook, the blogosphere)
• Linguistic data is not easily amenable to analysis. It requires much processing and much insight into the nature and structure of language
• This means modifying old NLP techniques and tools as well as inventing new ones: for example, no off-the-shelf parser can handle Twitter conversations.
A linguist in industry?
Examples:
• Information retrieval (esp. search engines that do semantic search)
• Voice recognition/generation – think 'Google Voice'!
• Text classification and text clustering
• Text mining – finding grains of useful info in unstructured text
• E-discovery
• Analyzing the language of social media (topic and sentiment extraction from short, fragmented, and noisy data)
Chat filtering: an example
• Computational linguistics application for on-line virtual worlds
• Ensuring safety of on-line chat environments
• Filtering chat for appropriate content
• Filtering is an example of a classification problem: classify text as ‘appropriate’ or ‘inappropriate’
Chat filtering: an example
• The problem of determining whether lines involve inappropriate content is similar to spam detection
• Simple word/phrase matches are not enough
• Matching alone is both too aggressive and too weak: it blocks innocent lines, yet some nefarious lines still pass through
• Most inappropriate talk is made up of completely innocent words!
Chat filtering
• People use innocent words to make inappropriate phrases
• People also find ways to say things with MixEdCAse or b,r,o.k.en w.o.r,ds that get around the filter
• We want a general way of saying: “if you see MixED CaSE or br.ok.en word,s it is probably a sign of something bad”
• We also want to capture inappropriate combinations of innocent words
• A solution: a model, much like the ones used for spam
• What is the general idea?
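The "general way of saying" above could be expressed as a pair of surface features plus a normalization step. The following is only a minimal sketch: the regular expressions and the function names `obfuscation_features` and `normalize` are assumptions made for illustration, not the filter described in the talk, and the patterns will also fire on some innocent text such as "iPhone" or "e.g.".

```python
import re

# Heuristic patterns (assumed for illustration, not taken from the talk).
MIXED_CASE = re.compile(r"[a-z][A-Z]")             # lower-case letter directly before a capital
BROKEN_WORD = re.compile(r"[A-Za-z][.,][A-Za-z]")  # '.' or ',' wedged between two letters
INNER_PUNCT = re.compile(r"(?<=[A-Za-z])[.,](?=[A-Za-z])")


def obfuscation_features(line):
    """Return surface signals that a chat line may be dodging a filter."""
    return {
        "mixed_case": bool(MIXED_CASE.search(line)),
        "broken_word": bool(BROKEN_WORD.search(line)),
    }


def normalize(line):
    """Remove punctuation wedged inside words and lower-case the line."""
    return INNER_PUNCT.sub("", line).lower()


print(obfuscation_features("MixEdCAse"))  # {'mixed_case': True, 'broken_word': False}
print(normalize("b,r,o.k.en w.o.r,ds"))   # broken words
```

Signals like these would normally be fed to the model as extra features next to the words themselves, so that the model learns how much weight they deserve instead of blocking on them outright.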
The idea behind filtering
• Look at the words and other features that make up appropriate and inappropriate chat, and ask: how likely is a word to appear in inappropriate chat?
• Example: if a pair of words such as “stew pit” never appears in regular, appropriate chat, it is probably an indicator of something inappropriate.
• Having done that, we can ask of a new chat segment: is it likely to be inappropriate?
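One common way to turn this idea into a model is Naïve Bayes over single words and word pairs, as in many spam filters. The sketch below is an assumption about how such a filter could look, not the system described in the talk: the class name `NaiveBayesFilter`, the add-one smoothing, and the four toy training lines are all invented for illustration.

```python
import math
from collections import Counter

LABELS = ("appropriate", "inappropriate")


def features(line):
    """Single words plus adjacent word pairs, so 'stew pit' is a feature of its own."""
    words = line.lower().split()
    pairs = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + pairs


class NaiveBayesFilter:
    """Toy Naive Bayes classifier with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.feature_counts = {label: Counter() for label in LABELS}
        self.line_counts = Counter()

    def train(self, line, label):
        self.line_counts[label] += 1
        self.feature_counts[label].update(features(line))

    def log_score(self, line, label):
        counts = self.feature_counts[label]
        total = sum(counts.values())
        vocab = len(set().union(*self.feature_counts.values()))
        # Log prior: how common is this label among the training lines?
        score = math.log(self.line_counts[label] / sum(self.line_counts.values()))
        # Log likelihood: how likely is each feature under this label?
        for feat in features(line):
            score += math.log((counts[feat] + 1) / (total + vocab))
        return score

    def classify(self, line):
        return max(LABELS, key=lambda label: self.log_score(line, label))


# Hypothetical labeled data, using the talk's own "stew pit" example.
chat_filter = NaiveBayesFilter()
chat_filter.train("i made a stew for dinner", "appropriate")
chat_filter.train("the peach has a pit", "appropriate")
chat_filter.train("you are such a stew pit", "inappropriate")
chat_filter.train("what a stew pit you are", "inappropriate")

print(chat_filter.classify("he is a stew pit"))   # inappropriate
print(chat_filter.classify("i love peach stew"))  # appropriate
```

Note that "stew" and "pit" each occur in appropriate lines; it is the pair feature "stew pit", seen only in inappropriate lines, that tips the decision.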
That said…
• Filtering is an example of text classification, which is used very broadly
• Classifying documents by content/topic requires a labeled set of data and a learning algorithm
• The difficult part is not the algorithm itself, but fine-tuning its parameters and manipulating and preprocessing the data
• This is where creative and analytical thinking becomes truly crucial and the job becomes really fun!
Another example: Twitter
• Clustering super-short Twitter posts by topic
• Clustering = finding groups in unlabeled data
• This data is very noisy and fragmented:
  lets see im on lates on monday so dont start till two and could get down from work .......mail me xxx 7/16/2009 11:50:26 AM
  Well i'm watching mate running at silly o'clock but i'll be free in the late afternoon 7/17/2009 8:18:12 AM
• Resistant to parsing and part-of-speech tagging
• Goal: arrive at K clusters, each representing a topic of the posts!
• Problem: very little to go on!
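To make "arrive at K clusters" concrete, here is a minimal k-means sketch over bag-of-words counts. The talk does not say which clustering algorithm was used, so k-means is an assumption, as are the function names, the four toy posts, and the naive choice of the first K posts as starting centroids.

```python
from collections import Counter


def vectorize(post):
    """Bag of words: map each word in the post to its count."""
    return Counter(post.lower().split())


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse word-count vectors."""
    return sum((a.get(w, 0) - b.get(w, 0)) ** 2 for w in set(a) | set(b))


def mean_vector(vectors):
    """Average a list of sparse vectors into one centroid."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {w: count / len(vectors) for w, count in total.items()}


def kmeans(posts, k, max_rounds=10):
    """Return a cluster index for each post."""
    vectors = [vectorize(p) for p in posts]
    centroids = [dict(v) for v in vectors[:k]]  # naive, deterministic start
    assignment = None
    for _ in range(max_rounds):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(vec, centroids[c]))
            for vec in vectors
        ]
        if new_assignment == assignment:  # nothing moved: converged
            break
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignment


# Hypothetical, already-cleaned posts; real tweets are far noisier.
posts = [
    "late shift at work monday",
    "watching the football match tonight",
    "work shift starts late",
    "football match was great",
]
print(kmeans(posts, 2))  # [0, 1, 0, 1]
```

The toy posts cluster cleanly only because posts on the same topic happen to share words. With real tweets that overlap is rare, which is exactly the "very little to go on" problem.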
Twitter
• Possible solutions: padding – adding more relevant words to each post
• Applying spellcheckers to reduce the noise in the data
• Removing content-less function words, known as stop-words
• Since even posts that share a topic often share few words, look at the mutual contexts in which the words in the posts appear
• This technique is known as Latent Semantic Analysis
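The "mutual contexts" idea can be illustrated without any libraries. Latent Semantic Analysis proper factorizes a term-document matrix with a singular value decomposition; the sketch below swaps in a much simpler second-order co-occurrence comparison, so it is a stand-in for the intuition and not LSA itself. The stop-word list, the five-post corpus, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "is", "at", "on", "in"}  # tiny illustrative list


def content_words(post):
    """Lower-case, split on whitespace, and drop stop-words."""
    return [w for w in post.lower().split() if w not in STOP_WORDS]


def context_vectors(posts):
    """Map each word to counts of the other words it shares a post with."""
    contexts = defaultdict(Counter)
    for post in posts:
        words = set(content_words(post))
        for word in words:
            for other in words - {word}:
                contexts[word][other] += 1
    return contexts


def post_vector(post, contexts):
    """Represent a post by the summed contexts of its words."""
    vec = Counter()
    for word in content_words(post):
        vec.update(contexts.get(word, {}))
    return vec


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical background corpus of short posts.
corpus = [
    "football match on saturday",
    "great goal in the match",
    "the striker scored a goal",
    "late shift at work",
    "work rota is out",
]
contexts = context_vectors(corpus)

a = post_vector("football match", contexts)
b = post_vector("great goal", contexts)  # shares no word with "football match"
c = post_vector("late shift", contexts)

print(cosine(a, b))  # about 0.5: related through shared contexts
print(cosine(a, c))  # 0.0: no contexts in common
```

"football match" and "great goal" have no word in common, yet they come out as related because their words keep the same company in the background corpus. A real LSA pipeline would get a similar effect, in a more principled way, from the reduced-rank SVD.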
From school to workplace
What skills are needed for a future (relevant) career in computational linguistics?
• Statistics: basic understanding of sampling methods and probabilistic reasoning
• Some linear algebra, esp. as it relates to matrix and vector manipulation
• Some calculus
• Machine learning algorithms that are commonly used in NLP, such as Naïve Bayes, HMMs, and Expectation Maximization
• Some programming (Python, Java, C++)
From school to workplace cont'd
How can these skills be learned?
• Formal instruction
• Self-teaching
• Taking a class on-line
Companies that employ linguists
• Google (Bay Area, Los Angeles)
• Yahoo (Bay Area, Los Angeles)
• IBM
• Microsoft (Seattle, WA)
• Smaller search engines and semantic search engines: Ask.com, Autonomy (Bay Area), H5 (San Francisco), Cognition (Los Angeles)
• Companies that do Machine Translation (Systran, Language Weaver)
• Entertainment Companies (Los Angeles)
Learning while working
• Internships: companies hire students to work and learn
• This can be an invaluable experience!
• It can really supplement classroom learning by applying the learned material in practice
• Sometimes academic knowledge and practical application do not go hand-in-hand…
Resources
• Association for Computational Linguistics (conference in Portland, OR in June 2011)
• KDnuggets – a website for data mining and knowledge discovery; contains useful links to tutorials, news, and jobs
• Dice.com – job website for technical jobs only
• NLTK.org – Natural Language Toolkit in Python. Can be used to try 'do it yourself' document classification and clustering.
• NLP Group at the Information Sciences Institute at USC, located in Marina del Rey – great talks to get an overview of what’s going on in the field
Books and articles:
• Jurafsky and Martin 2009 – the bible of computational linguistics
• Chris Manning et al. 2008 – Introduction to Information Retrieval
• Mitchell 1997 – Machine Learning
• WEKA – free data mining software and book (Witten and Frank 2005)
Conclusion
• As a linguist, you can do lots of interesting work in a non-academic setting
• Supplement your knowledge of linguistics with some math and computer science, and you are good to go!
• Bottom line: all you really need is to be open to learning new things, which is exactly why we all go to school in the first place!
Thank you!