Probabilistic Language Processing Chapter 23
Probabilistic Language Models • Goal -- define a probability distribution over a set of strings • Unigram, bigram, n-gram • Count using a corpus, but we need smoothing: • add-one • linear interpolation • Evaluate with the perplexity measure (sketch below) • E.g., segment text written without spaces ("segmentwordswithoutspaces") with the Viterbi algorithm
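A minimal sketch of the bigram idea with add-one (Laplace) smoothing and a perplexity check; the toy corpus, vocabulary, and test sentence below are made-up illustrations, not from the chapter.

```python
# Bigram model with add-one smoothing, evaluated by perplexity (toy data).
import math
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
vocab = set(corpus)
V = len(vocab)

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w_prev, w):
    # Add-one smoothing: every bigram gets one extra pseudo-count.
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

def perplexity(words):
    # Perplexity = exp of the average negative log-probability per bigram.
    log_p = sum(math.log(p_bigram(a, b)) for a, b in zip(words, words[1:]))
    return math.exp(-log_p / (len(words) - 1))

print(perplexity("the cat sat on the mat".split()))
```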
PCFGs • Rewrite rules have probabilities. • The prob of a string is the sum of the probs of its parse trees (sketch below). • Context-freedom means no lexical constraints. • Prefers short sentences.
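A small sketch of how a PCFG scores one parse tree: the product of the probabilities of the rules used. The grammar, rule probabilities, and (label, children) tree representation are hypothetical.

```python
# Probability of a parse tree under a toy PCFG = product of its rule probabilities.
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("John",)): 0.5,
    ("NP", ("Mary",)): 0.5,
    ("VP", ("sleeps",)): 1.0,
}

def tree_prob(tree):
    # A tree is (label, [children]); a leaf is just a string.
    if isinstance(tree, str):
        return 1.0
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]
    for c in children:
        p *= tree_prob(c)
    return p

tree = ("S", [("NP", ["John"]), ("VP", ["sleeps"])])
print(tree_prob(tree))  # 1.0 * 0.5 * 1.0 = 0.5
```

The probability of a whole string would then be the sum of tree_prob over all of its parse trees.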
Learning PCFGs • Parsed corpus -- count trees (sketch below). • Unparsed corpus • Rule structure known -- use EM (inside-outside algorithm) • Rules unknown -- assume Chomsky normal form… still problematic.
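A sketch of the parsed-corpus case: estimate rule probabilities by counting rule uses in a toy treebank and normalizing per left-hand side. The tree representation matches the previous sketch and is an assumption, not the book's notation.

```python
# Maximum-likelihood PCFG estimation from a (toy) parsed corpus.
from collections import Counter, defaultdict

treebank = [
    ("S", [("NP", ["John"]), ("VP", ["sleeps"])]),
    ("S", [("NP", ["Mary"]), ("VP", ["sleeps"])]),
]

rule_counts = Counter()

def count_rules(tree):
    if isinstance(tree, str):
        return
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(label, rhs)] += 1
    for c in children:
        count_rules(c)

for t in treebank:
    count_rules(t)

# Normalize counts per left-hand side to get rule probabilities.
lhs_totals = defaultdict(int)
for (lhs, _), n in rule_counts.items():
    lhs_totals[lhs] += n

rule_prob = {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}
print(rule_prob)  # e.g. P(NP -> John) = 0.5, P(NP -> Mary) = 0.5
```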
Information Retrieval • Goal: Google. Find docs relevant to the user's needs. • An IR system has a document collection, a query in some language, a set of results, and a presentation of the results. • Ideally we would parse docs into a knowledge base… too hard.
IR 2 • Boolean keyword model -- in or out? • Problem -- a single bit of "relevance" • Boolean combinations are a bit mysterious • How to compute P(R=true | D,Q)? • Estimate a language model for each doc, compute the prob of the query given that model (sketch below). • Can rank documents by P(r|D,Q)/P(~r|D,Q)
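One way to realize this ranking idea is query likelihood: build a smoothed unigram model for each document and score the query under it. The documents, query, and add-alpha smoothing choice below are illustrative assumptions.

```python
# Rank documents by P(query | document language model), with smoothing (toy data).
import math
from collections import Counter

docs = {
    "d1": "probabilistic models of language".split(),
    "d2": "retrieval of relevant documents".split(),
}
vocab = {w for words in docs.values() for w in words}

def query_log_prob(query, words, alpha=1.0):
    counts = Counter(words)
    denom = len(words) + alpha * len(vocab)
    # Add-alpha smoothing so unseen query words get nonzero probability.
    return sum(math.log((counts[w] + alpha) / denom) for w in query)

query = "relevant documents".split()
ranking = sorted(docs, key=lambda d: query_log_prob(query, docs[d]), reverse=True)
print(ranking)  # d2 should rank above d1 for this query
```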
IR 3 • For this, we need a model of how queries are related to docs. Bag of words: frequencies of words in the doc, naïve Bayes. • Good example on pp. 842-843.
Evaluating IR • Precision is the proportion of results that are relevant. • Recall is the proportion of relevant docs that appear in the results. • ROC curve (there are several varieties): one convention plots false negatives vs. false positives. • More "practical" measures for the web: reciprocal rank of the first relevant result, or just "time to answer". See the sketch below.
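A toy computation of precision, recall, and the reciprocal rank of the first relevant result; the document ids and relevance judgments are made up.

```python
# Precision/recall and reciprocal rank on a toy result set.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}

precision = len(retrieved & relevant) / len(retrieved)  # 2/4 = 0.5
recall = len(retrieved & relevant) / len(relevant)      # 2/3 ≈ 0.67
print(precision, recall)

ranked = ["d1", "d2", "d3"]  # results in ranked order; first relevant is d2
first_relevant_rank = next(i for i, d in enumerate(ranked, 1) if d in relevant)
print(1 / first_relevant_rank)  # reciprocal rank = 0.5
```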
IR Refinements • Case • Stems • Synonyms • Spelling correction • Metadata -- keywords
IR Presentation • Give a list in order of relevance, deal with duplicates • Cluster results into classes • Agglomerative • K-means (sketch below) • How to describe automatically generated clusters? A word list? The title of the centroid doc?
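A bare-bones k-means sketch over result vectors (e.g., term-frequency vectors); the points, squared-Euclidean distance, and choice of k are illustrative assumptions.

```python
# K-means clustering of result vectors (toy 2-D points stand in for doc vectors).
import random

def kmeans(points, k, iters=20):
    centroids = random.sample(points, k)  # random initialization
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        for j, members in enumerate(clusters):
            if members:  # Recompute each centroid as the mean of its members.
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, clusters

points = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```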
IR Implementation • CSC172! • Lexicon with a "stop list" • "Inverted" index: where words occur • Match with vectors: the vector of word frequencies in the doc dotted with the query terms (sketch below).
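A compact sketch of these pieces: a stop list, an inverted index from word to the documents containing it, and dot-product matching of term-frequency vectors against a query. All names and data are illustrative.

```python
# Stop list + inverted index + dot-product scoring (toy collection).
from collections import Counter, defaultdict

stop_list = {"the", "of", "a"}
docs = {
    "d1": "the structure of language models".split(),
    "d2": "models of information retrieval".split(),
}

index = defaultdict(set)  # word -> set of doc ids containing it
tf = {}                   # doc id -> term-frequency vector
for doc_id, words in docs.items():
    terms = [w for w in words if w not in stop_list]
    tf[doc_id] = Counter(terms)
    for w in terms:
        index[w].add(doc_id)

def score(query):
    q = Counter(w for w in query if w not in stop_list)
    candidates = set().union(*(index[w] for w in q if w in index))
    # Dot product of the query vector with each candidate doc's frequency vector.
    return {d: sum(q[w] * tf[d][w] for w in q) for d in candidates}

print(score("information retrieval models".split()))  # d2 outscores d1
```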
Information Extraction • Goal: create database entries from docs. • Emphasis on massive data, speed, stylized expressions • Regular-expression grammars are OK if the text is stylized enough (sketch below) • Cascaded finite-state transducers -- stages of grouping and structure-finding
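A tiny regular-expression extraction sketch for one stylized field (a price); the pattern, text, and field name are made-up examples.

```python
# Regular-expression extraction of a stylized field from free text.
import re

text = "The XC-1000 printer is on sale for $129.99 until Friday."
price_pattern = re.compile(r"\$\d+(?:\.\d{2})?")  # e.g. $129.99 or $5

match = price_pattern.search(text)
if match:
    # In a real extractor this value would become a database entry.
    print({"price": match.group(0)})  # {'price': '$129.99'}
```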
Machine Translation Goals • Rough translation (e.g., p. 851) • Restricted domain (mergers, weather) • Pre-edited (Caterpillar or Xerox English) • Literary translation -- not yet! • Interlingua -- or a canonical semantic representation like Conceptual Dependency • Basic problem: different languages, different categories
MT in Practice • Transfer -- uses a database of rules for translating small units of language • Memory-based -- memorize sentence pairs • Good diagram on p. 853
Statistical MT • Bilingual corpus • Find the most likely translation given the corpus. • argmax_F P(F|E) = argmax_F P(E|F) P(F) • P(F) is the language model • P(E|F) is the translation model • Lots of interesting problems: fertility (home vs. à la maison). • Horrible, drastic simplifications and hacks work pretty well! (sketch below)
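A sketch of the noisy-channel decision rule argmax_F P(E|F) P(F) on a toy candidate list; the candidate French strings and probability values are placeholders for real language and translation models.

```python
# Pick the French sentence F maximizing P(E|F) * P(F) for English E = "home".
candidates = ["a la maison", "chez nous"]
p_f = {"a la maison": 0.02, "chez nous": 0.01}         # language model P(F), toy values
p_e_given_f = {"a la maison": 0.3, "chez nous": 0.5}   # translation model P(E|F), toy values

best = max(candidates, key=lambda f: p_e_given_f[f] * p_f[f])
print(best)  # "a la maison": 0.3 * 0.02 beats 0.5 * 0.01
```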
Learning and MT • Stat. MT needs: a language model, a fertility model, a word-choice model, an offset model. • Millions of parameters • Count, estimate, EM (sketch below).
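A sketch of the count/estimate/EM loop for the word-choice model in the spirit of IBM Model 1 (fertility and offset models are ignored here); the two-pair bilingual corpus is a toy example.

```python
# EM estimation of a word-choice table t(e|f), IBM Model 1 style (toy corpus).
from collections import defaultdict

corpus = [
    ("the house".split(), "la maison".split()),
    ("the door".split(), "la porte".split()),
]
e_vocab = {e for es, _ in corpus for e in es}
f_vocab = {f for _, fs in corpus for f in fs}

# Start from a uniform translation table t(e|f).
t = {(e, f): 1.0 / len(e_vocab) for e in e_vocab for f in f_vocab}

for _ in range(50):
    counts = defaultdict(float)   # expected co-occurrence counts
    totals = defaultdict(float)   # expected counts per French word
    for es, fs in corpus:
        for e in es:
            norm = sum(t[(e, f)] for f in fs)
            for f in fs:
                # E-step: fractional count of aligning e with f in this pair.
                c = t[(e, f)] / norm
                counts[(e, f)] += c
                totals[f] += c
    # M-step: re-estimate t(e|f) from the expected counts.
    t = {(e, f): counts[(e, f)] / totals[f] for e in e_vocab for f in f_vocab}

print(t[("house", "maison")])  # climbs toward 1.0 as EM pins "the" to "la"
```

Even this stripped-down version shows the pattern the slide names: collect expected counts, re-estimate, repeat.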