Discussion Class 3: Stemming Algorithms

Discussion Classes

Format:
    Question
    Ask a member of the class to answer
    Provide opportunity for others to comment

When answering:
    Give your name. Make sure that the TA hears it.
    Stand up
    Speak clearly so that all the class can hear
Question 1: Conflation methods

(a) Define the terms: stem, suffix, prefix, conflation, morpheme

(b) Define the terms in the following diagram:

Conflation methods
    Manual
    Automatic (stemmers)
        Affix removal
            Longest match
            Simple removal
        Successor variety
        Table lookup
        n-gram
Question 2: Table look-up

(a) What are the advantages and disadvantages of table look-up methods?

(b) When would you use table look-up?
Question 3: Successor variety methods

Hafer and Weiss defined their technique as:

Let α be a word of length n; α_i is a length i prefix of α. Let D be the corpus of words. D_{α_i} is defined as the subset of D containing the terms whose first i letters match α_i exactly. The successor variety of α_i, denoted by S_{α_i}, is then defined as the number of distinct letters that occupy the (i+1)st position of words in D_{α_i}. A test word of length n has n successor varieties S_{α_1}, S_{α_2}, ..., S_{α_n}.

Explain this definition, using the word "computation" as an example.
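When working through this question, it may help to compute the successor varieties mechanically. The following is a minimal sketch of the definition above, not Hafer and Weiss's own implementation. The corpus is a small hypothetical word list invented for illustration, the function name is an assumption, and end-of-word is not counted as a successor (some presentations count it as a blank).

```python
def successor_varieties(word, corpus):
    """Return [S_1, ..., S_n] for `word`, where S_i is the number of
    distinct letters that follow the length-i prefix of `word` in `corpus`."""
    varieties = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        # D_i: the corpus words whose first i letters match the prefix exactly
        d_i = [w for w in corpus if w.startswith(prefix)]
        # Distinct letters in position i+1 (index i) of the words in D_i.
        # Assumption: a word that ends at the prefix contributes no successor.
        successors = {w[i] for w in d_i if len(w) > i}
        varieties.append(len(successors))
    return varieties


# Hypothetical toy corpus, chosen only to make the example readable.
corpus = ["compute", "computer", "computing", "computation",
          "compare", "complete", "common", "cat"]

for i, s in enumerate(successor_varieties("computation", corpus), start=1):
    print("computation"[:i], s)
```

With this toy corpus the variety rises after "comp" and after "comput", which is where a human reader would also expect morpheme boundaries. A real corpus would give different counts.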
Question 4: Successor variety methods

With successor variety methods, how do the following methods of segmentation work?

(a) cutoff method

(b) peak and plateau method

(c) complete word method
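The three segmentation methods can be compared side by side on one list of successor varieties. The sketch below is one possible reading of each method, under stated assumptions: the variety values are illustrative numbers assumed for "computation" against a small hypothetical corpus, the threshold of 3 is arbitrary, the peak test is strict (a flat plateau produces no break), and all function names are invented for this example.

```python
def cutoff_breaks(varieties, threshold):
    """Break after every prefix whose successor variety reaches the threshold."""
    return [i for i, s in enumerate(varieties, start=1) if s >= threshold]


def peak_breaks(varieties):
    """Break after a prefix whose variety exceeds both of its neighbours.
    Assumption: strict comparison, so equal neighbours give no break."""
    breaks = []
    for k in range(1, len(varieties) - 1):
        if varieties[k] > varieties[k - 1] and varieties[k] > varieties[k + 1]:
            breaks.append(k + 1)  # convert 0-based index to prefix length
    return breaks


def complete_word_breaks(word, corpus):
    """Break after every proper prefix that is itself a word in the corpus."""
    words = set(corpus)
    return [i for i in range(1, len(word)) if word[:i] in words]


def split_at(word, breaks):
    """Cut `word` into segments at the given prefix lengths."""
    pieces, start = [], 0
    for b in breaks:
        pieces.append(word[start:b])
        start = b
    pieces.append(word[start:])
    return pieces


# Illustrative successor varieties for "computation" (assumed values).
varieties = [2, 1, 2, 3, 1, 3, 1, 1, 1, 1, 0]

print(split_at("computation", cutoff_breaks(varieties, 3)))
print(split_at("computation", peak_breaks(varieties)))
print(split_at("readable", complete_word_breaks("readable", ["read", "reads", "red"])))
```

On these numbers the cutoff and peak methods happen to agree. They diverge as soon as the threshold changes or the corpus produces a run of equal values, which is a useful point to raise in discussion.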
Question 5: n-gram methods

(a) Explain the following notation:

statistics => st ta at ti is st ti ic cs
unique digrams => at cs ic is st ta ti

statistical => st ta at ti is st ti ic ca al
unique digrams => al at ca ic is st ta ti

(b) Calculate the similarity using Dice's coefficient:

S = 2C / (A + B)

A is the number of unique digrams in the first term
B is the number of unique digrams in the second term
C is the number of shared unique digrams

(c) How would you use this approach for stemming?
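To check a hand calculation for part (b), the digram sets and Dice's coefficient can be computed directly. This is a minimal sketch; the function names are invented for the example, and returning 0.0 for two terms with no digrams is an assumption made only to avoid dividing by zero.

```python
def digrams(term):
    """Return the set of unique adjacent letter pairs in `term`."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def dice(first, second):
    """Dice's coefficient S = 2C / (A + B) over unique digrams."""
    a, b = digrams(first), digrams(second)
    if not a and not b:
        return 0.0  # assumption: undefined case treated as no similarity
    shared = a & b  # C: digrams that occur in both terms
    return 2 * len(shared) / (len(a) + len(b))


print(sorted(digrams("statistics")))   # 7 unique digrams
print(sorted(digrams("statistical")))  # 8 unique digrams
print(dice("statistics", "statistical"))
```

Here A = 7, B = 8 and C = 6, so S = 12/15 = 0.8. For part (c), one option is to compute this similarity for every pair of terms and group terms whose similarity exceeds a chosen threshold.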
Question 6: Porter's algorithm

(a) What is an iterative, longest match stemmer?

(b) How is longest match achieved in the Porter algorithm?
Question 7: Porter's algorithm

Conditions   Suffix   Replacement   Examples
(m > 0)      eed      ee            feed -> feed
                                    agreed -> agree
(*v*)        ed       null          plastered -> plaster
                                    bled -> bled
(*v*)        ing      null          motoring -> motor
                                    sing -> sing

(a) Explain this table

(b) How does this table apply to: "exceeding", "ringed"?
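The table can be traced in code. The sketch below covers only the three rules shown, not the full Porter algorithm, which has further steps and clean-up rules. It assumes Porter's usual definitions: m counts vowel-consonant sequences in the stem, *v* means the stem contains a vowel, and "y" counts as a vowel when it follows a consonant. The function and variable names are invented for this example.

```python
def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel
    only when it is preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """m: the number of vowel-to-consonant transitions in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def contains_vowel(stem):
    """The *v* condition: the stem has at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


# (suffix, replacement, condition on the stem), as in the table
RULES = [
    ("eed", "ee", lambda stem: measure(stem) > 0),
    ("ed", "", contains_vowel),
    ("ing", "", contains_vowel),
]


def apply_rules(word):
    """Pick the rule with the longest matching suffix; apply it only if
    its condition holds, otherwise leave the word unchanged."""
    matches = [rule for rule in RULES if word.endswith(rule[0])]
    if not matches:
        return word
    suffix, replacement, condition = max(matches, key=lambda rule: len(rule[0]))
    stem = word[:-len(suffix)]
    return stem + replacement if condition(stem) else word


for w in ["feed", "agreed", "plastered", "bled", "motoring", "sing",
          "exceeding", "ringed"]:
    print(w, "->", apply_rules(w))
```

Note that "feed" matches both "eed" and "ed". The longer suffix is selected, its condition fails because the stem "f" has m = 0, and no shorter rule is tried.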
Question 8: Evaluation

(a) What is the overall effectiveness of stemming?

(b) Give a possible reason why Stemmer A might be better than Stemmer B on Collection X but worse on Collection Y.