Machine Learning in Practice, Lecture 12 Carolyn Penstein Rosé Language Technologies Institute / Human-Computer Interaction Institute
Plan for the Day • Announcements • Assignment 5 handed out – Due next Thursday • Note: Readings for the next two lectures are on Blackboard in the Readings folder • See syllabus for specifics • Feedback on Quiz 4 • Homework 4 Issues • Midterm assigned Thursday, Oct 21!!! • More about Text • Term Weights • Start Linguistic Tools
Assignment 5 * Two examples shown, but there are many more
TA Office Hours • Possibly moving to Wednesdays at 3 • Note that there will be a special TA session before the midterm for you to ask questions
Poll: How are we doing on pace and level of detail?
Feedback on Quiz 4 • Nice job overall!!! • I could tell you read carefully! • Note that part-of-speech refers to grammatical categories like noun, verb, etc. • Named entity extractors locate noun phrases that refer to people, organizations, countries, etc. • Some people skipped the why and how parts of questions • Some people over-estimated the contribution of POS tagging
Error Analysis • If I sort by different features, I can see whether rows of a particular color end up in a specific region • If I want to know which features to do this with, I can start with the most predictive features • Another option would be to use machine learning to predict which cell an instance would end up in within the confusion matrix
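A minimal sketch of this sorting idea in Python; the DataFrame, feature name, and labels below are hypothetical stand-ins for your own data:

```python
import pandas as pd

# Hypothetical instances with one predictive feature and gold/predicted labels.
df = pd.DataFrame({
    "term_freq_price": [0, 3, 1, 5, 0, 2],
    "actual":    ["pos", "neg", "pos", "neg", "pos", "neg"],
    "predicted": ["pos", "neg", "neg", "neg", "pos", "pos"],
})

# Tag each instance with its confusion-matrix cell, e.g. "pos->neg".
df["cell"] = df["actual"] + "->" + df["predicted"]

# Sort by a predictive feature and eyeball whether error cells
# cluster in one region of the sorted order.
print(df.sort_values("term_freq_price")[["term_freq_price", "cell"]])
```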
Other Suggestions • Use RemoveMisclassified • An unsupervised instance filter • Separates correctly classified instances from incorrectly classified instances • Works in a similar way to the RemoveFolds filter • You only need to use it twice, rather than 20 times for 10-fold cross validation • It doesn't give you as much information
Computing Confidence Intervals • A 90% confidence interval corresponds to z = 1.65 • There is a 5% chance that a data point will occur to the right of the rightmost edge of the interval (and 5% to the left of the leftmost edge) • f = percentage of successes • N = number of trials • Interval: f ± z·√(f(1−f)/N) • Example: f = 75%, N = 1000, c = 90% → [0.727, 0.773]
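As a quick check, a few lines of Python reproduce the example above, using the normal-approximation interval that the example numbers imply:

```python
import math

def confidence_interval(f, n, z=1.65):
    # Normal-approximation interval for an observed success rate f over n
    # trials; z = 1.65 corresponds to a two-sided 90% interval.
    spread = z * math.sqrt(f * (1 - f) / n)
    return (f - spread, f + spread)

print(confidence_interval(0.75, 1000))  # -> roughly (0.727, 0.773)
```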
Document Retrieval / Inverted Index • [Diagram: an inverted index mapping each stemmed token to its document frequency and total frequency, with a postings list of (doc #, frequency, word positions)] • Easy to find all documents that have terms in common with your query • Stemming allows you to retrieve morphological variants (run, runs, running, runner) • Word positions allow you to specify that you want two terms to appear within N words of each other
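A toy inverted index in Python, assuming whitespace tokenization and skipping the stemming step a real index would apply first:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each token to a postings dict of {doc id: [word positions]}.
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index

docs = ["the runner runs", "running shoes for the runner"]
index = build_inverted_index(docs)
print(index["runner"])       # {0: [1], 1: [4]}: doc ids and positions
print(len(index["runner"]))  # document frequency of "runner" = 2
```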
Evaluating Document Retrieval • Use standard measures: precision, recall, F-measure • Retrieving all documents that share words with your query will both over- and under-generate • If you have the whole web to select from, then under-generating is much less of a concern than over-generating • Does this apply to your task?
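For a single query, these measures reduce to set arithmetic over retrieved and relevant document ids; a small sketch:

```python
def precision_recall_f1(retrieved, relevant):
    # retrieved and relevant are sets of document ids.
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(precision_recall_f1({1, 2, 3, 4}, {2, 4, 5}))
# (0.5, 0.666..., 0.571...)
```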
Common Vocabulary is Not Enough • You’ll get documents that mention other senses of the term you mean • River bank versus financial institution • Word sense disambiguation is an active area of computational linguistics! • You won’t get documents that discuss other related terms
Common Vocabulary is Not Enough • You'll get documents that mention a term but are not about that term • You can partly get around this by sorting by relevance • Term weights approximate a measure of relevance • Cosine similarity between the query vector and the document vector computes relevance; documents are then sorted by relevance
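A sketch of this ranking step, with made-up term weights standing in for real TF.IDF values:

```python
import math

def cosine_similarity(query_vec, doc_vec):
    # Both vectors are dicts mapping term -> weight (e.g., TF.IDF).
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

docs = {"d1": {"bank": 2.0, "river": 1.5},
        "d2": {"bank": 1.0, "loan": 2.0}}
query = {"bank": 1.0, "loan": 1.0}

# Sort documents by relevance to the query, highest first.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]),
                reverse=True)
print(ranked)  # ['d2', 'd1']
```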
Computing Term Weights • A common vector representation for text is to have one attribute per word in your vocabulary • Notice that Weka gives you other options
Why is it important to think about term weights? • If term frequency or salience matters for your task, you might lose too much information if you just consider whether a term ever occurred or not • On the other hand, if term frequency doesn’t matter for your task, a simpler representation will most likely work better • Term weights are important for information retrieval because in large documents, just knowing a term occurs at least once does not tell you whether that document is “about” that term
Basics of Computing Term Weights • Assume occurrence of each term is independent so that attributes are orthogonal • Obviously this isn’t true! But it’s a useful simplifying assumption • Term weight functions have two basic components • Term frequency: How many times did that term occur in the current document • Document frequency: How many times did that term occur across documents (or how many documents did that term occur in)
Basics of Computing Term Weights • Inverse document frequency: a measure of the rarity of a term • idft = log(N/nt), where t is the term, N is the number of documents, and nt is the number of documents in which that term occurred at least once • Note that inverse document frequency is 0 when a term occurs in all documents • It is largest, log(N), for a term that occurs in only one document
TF.IDF – Term Frequency/Inverse Document Frequency • A scheme that combines term frequency with inverse document frequency • wt,d = tft,d × idft (idf depends only on the term, not the document) • Weka also gives you the option of normalizing for document length • Terms are more likely to occur in longer documents just by chance • You can then compute the cosine similarity between the vector representation of a text and that of a query
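A bare-bones sketch of the whole scheme (whitespace tokenization, no length normalization):

```python
import math

def tf_idf_vectors(docs):
    # w(t,d) = tf(t,d) * idf(t), with idf(t) = log(N / n_t).
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = {}
    for tokens in tokenized:
        for t in set(tokens):
            doc_freq[t] = doc_freq.get(t, 0) + 1
    vectors = []
    for tokens in tokenized:
        tf = {t: tokens.count(t) for t in set(tokens)}
        vectors.append({t: tf[t] * math.log(n / doc_freq[t]) for t in tf})
    return vectors

docs = ["the cat sat", "the dog sat", "the dog barked"]
print(tf_idf_vectors(docs)[0])
# "the" occurs in all 3 documents, so its weight is 0, as noted above.
```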
Computing Term Weights • Notice how to set options for different types of term weights
Trying Different Term Weights • Predicting Class1, 72 instances • Note that the number of levels for each separate feature is identical for Word Count, Term Frequency, and TF.IDF.
Trying Different Term Weights • What is different is the relative weight of the different features. • Whether this matters depends on the learning method
Basic Anatomy: Layers of Linguistic Analysis • Phonology: The sound structure of language • Basic sounds, syllables, rhythm, intonation • Morphology: The building blocks of words • Inflection: tense, number, gender • Derivation: building words from other words, transforming part of speech • Syntax: Structural and functional relationships between spans of text within a sentence • Phrase and clause structure • Semantics: Literal meaning, propositional content • Pragmatics: Non-literal meaning, language use, language as action, social aspects of language (tone, politeness) • Discourse Analysis: Language in practice, relationships between sentences, interaction structures, discourse markers, anaphora and ellipsis
Sentence Segmentation • Breaking a text into sentences is a first step for processing • Why is this not trivial? • In speech there is no punctuation • In text, punctuation may be missing • Punctuation may be ambiguous (e.g., periods in abbreviations) • Alternative approaches • Rule based/regular expressions • Statistical models
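A naive rule-based splitter makes the ambiguity concrete: breaking after ., !, or ? followed by whitespace and a capital letter misfires on abbreviations:

```python
import re

def split_sentences(text):
    # Break after ., !, or ? when followed by whitespace and a capital.
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

print(split_sentences("Dr. Smith arrived. The talk began. Everyone listened."))
# ['Dr.', 'Smith arrived.', 'The talk began.', 'Everyone listened.']
# "Dr." triggers a false boundary, which is why statistical models help.
```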
Tokenization • Segment a data stream into meaningful units • Each unit is called a token • Simple rule: a token is any sequence of characters separated by white space • This leaves punctuation attached to words • But stripping out punctuation would break up large numbers like 5,235,064 • What about words like "school bus"?
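A short illustration of the trade-off, comparing plain whitespace splitting with a hypothetical regex compromise that keeps digit groups intact:

```python
import re

text = "The school bus cost $5,235,064."

# Whitespace tokenization leaves punctuation attached to words.
print(text.split())
# ['The', 'school', 'bus', 'cost', '$5,235,064.']

# Stripping all punctuation would shatter the number, so keep commas
# that sit inside digit groups; "school bus" still splits in two.
print(re.findall(r"\d[\d,]*|\w+", text))
# ['The', 'school', 'bus', 'cost', '5,235,064']
```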
Automatic Segmentation • Run a sliding window 3 symbols wide across the text • Some features from outside the window are also used for prediction • Each window position is classified as a boundary or not • A predicted boundary falls between the 1st and 2nd symbols of the window
Automatic Segmentation • Features used for prediction: • 3 symbols • Punctuation • Whether there have been at least two capitalized words since the last boundary • Whether there have been at least three non-capitalized words since the last boundary • Whether we have seen fewer than half the number of symbols as the average segment length since the last boundary • Whether we have seen fewer than half the average number of symbols between punctuations since the last punctuation mark
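A hedged sketch of how such a classifier could be trained with scikit-learn's decision tree; the rows, feature encoding, and labels below are entirely made up for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# One row per text position: [3 window symbols as character codes,
# punctuation flag, capitalized-words-since-boundary >= 2,
# non-capitalized-words-since-boundary >= 3]; label 1 = boundary.
X = [[104, 46, 84, 1, 1, 1],
     [101, 32, 116, 0, 0, 1],
     [46, 32, 65, 1, 1, 0],
     [115, 32, 100, 0, 0, 0]]
y = [1, 0, 1, 0]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[104, 46, 84, 1, 1, 1]]))  # classify a new position
```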
Automatic Segmentation • Model trained with a decision tree learning algorithm • Accuracy: 96% • Agreement: kappa = .44 • Precision: .59 • Recall: .37 • We assign 66% as many boundaries as the gold standard
Stemmers and Taggers • Stemmers are simple morphological analyzers • They strip a word down to its root • Run, runner, running, runs: all the same root • Next week we will use the Porter stemmer, which just chops endings off • Taggers assign syntactic categories to tokens • Words are assigned potential POS tags in the lexicon • Context also plays a role
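A quick look at both tools via NLTK (tagger model download assumed; output tags are indicative):

```python
import nltk
from nltk.stem import PorterStemmer

# One-time setup, if the tagger model is not already installed:
# nltk.download('averaged_perceptron_tagger')

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["run", "runner", "running", "runs"]])
# ['run', 'runner', 'run', 'run']: Porter just chops endings, so
# "runner" is not reduced to "run".

print(nltk.pos_tag(["the", "runner", "runs"]))
# e.g. [('the', 'DT'), ('runner', 'NN'), ('runs', 'VBZ')]
```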
Wrap-Up • Feature space design affects classification accuracy • We examined two main ways to manipulate the feature space representation of text • One is alternative types of term weights • The other is linguistic tools that identify features of texts beyond just the words that make them up • Part-of-speech taggers can be customized with different tag sets • Next time we'll talk about the tag set you will use in the assignment • We will also talk about parsers • You can use parsers to create features for classification