
Text Preprocessing


Presentation Transcript


  1. Text Preprocessing

  2. Aims to create a correct text representation, according to the adopted model. Steps: lexical analysis; case folding and numbers; stop-words elimination; stemming; (other preprocessing procedures ...).

  3. [Figure: generating the logical view of the documents, from full text (structure, spaces and signals, nominal groups, manual indexing) down to the index terms.] • Stop-words elimination; • Nominal groups detection; • Stemming; • Index terms generation; • Other preprocessing procedures: synonyms, co-occurrences, latent semantic indexing, ...
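As a rough orientation for the pipeline on this slide, the sketch below (Python; the function name and the toy stemmer are illustrative assumptions, not part of the slides) shows how the steps compose into index terms:

```python
# Minimal sketch of the preprocessing chain: lexical analysis and case
# folding, stop-word elimination, then stemming to obtain index terms.
def generate_index_terms(document, stopwords, stem):
    tokens = document.lower().split()                    # lexical analysis + case folding
    tokens = [t for t in tokens if t not in stopwords]   # stop-words elimination
    return [stem(t) for t in tokens]                     # stemming -> index terms

print(generate_index_terms("The adopted model", {"the"}, stem=lambda w: w))
# ['adopted', 'model']   (identity "stemmer", just for illustration)
```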

  4. Most common procedures: “Tokenization”: identification of the text's words. Words are defined as “strings of contiguous alphanumeric characters, with no spaces, possibly including hyphens and apostrophes, but no end-of-sentence marks”. The most common elements used to separate words are the blank space, the tab and the newline.
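A minimal tokenizer matching that definition might look like the sketch below (Python; the exact regular expression is an assumption, not taken from the slide):

```python
import re

# A word: a run of alphanumeric characters, possibly joined by hyphens or
# apostrophes; blanks, tabs, newlines and other punctuation separate words.
WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def tokenize(text):
    return WORD.findall(text)

print(tokenize("I'll e-mail the database report.\nNew line\ttab"))
# ["I'll", 'e-mail', 'the', 'database', 'report', 'New', 'line', 'tab']
```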

  5. Problems: end-of-sentence vs. abbreviations, e.g. Wash.; apostrophes ( ' ): “magic words” vs. contractions, e.g. I'll; hyphens: single words vs. hyphenated words, e.g. e-mail; blanks: sometimes they do not indicate word separation, e.g. database and data base; New York and San Francisco; numbers: 9365, 1873.
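One common workaround for the end-of-sentence vs. abbreviation ambiguity is an exception list, as in this hedged sketch (the list is illustrative; the other ambiguities, such as multi-word names, need separate treatment):

```python
# A trailing period may end a sentence or belong to an abbreviation;
# a small exception list keeps it attached in the latter case.
ABBREVIATIONS = {"wash.", "mr.", "dr.", "e.g."}

def strip_sentence_period(token):
    if token.lower() in ABBREVIATIONS:
        return token              # abbreviation: keep the period
    return token.rstrip(".")      # otherwise treat '.' as end-of-sentence

print([strip_sentence_period(t) for t in ["Wash.", "ended.", "data", "base"]])
# ['Wash.', 'ended', 'data', 'base']
```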

  6. Case-folding: the, THE, The => THE; http://www.delorie.com/gnu/docs/diffutils/diff_6.html http://curry.edschool.virginia.edu/aace/conf/webnet/html/invwitt.htm http://www.dlib.org/dlib/november96/newzealand/11witten.html
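Case folding itself is a one-line operation over the token stream; a small sketch reproducing the slide's example:

```python
# Map every variant to one case so "the", "THE" and "The" index identically.
tokens = ["the", "THE", "The"]
print([t.upper() for t in tokens])   # ['THE', 'THE', 'THE']
# Lower-casing (t.lower()) is just as common; only consistency matters.
```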

  7. Stop-words removal: an, the, is, are, and, or, so, because, ...; a list is available on the Web (524 words) in the BOW library (CMU). http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words http://searchenginewatch.internet.com/facts/stopwords.html http://pen2.ci.santa-monica.ca.us/city/municode/stopwords.html
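A hedged sketch of stop-word elimination with a small in-line list (a real system would load a full list such as the 524-word one referenced above):

```python
# Remove function words that carry little information content.
STOP_WORDS = {"an", "the", "is", "are", "and", "or", "so", "because"}

tokens = ["the", "compression", "algorithm", "is", "fast"]
print([t for t in tokens if t.lower() not in STOP_WORDS])
# ['compression', 'algorithm', 'fast']
```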

  8. Stemming: compressed, compression => compress; Porter's algorithm: http://maya.cs.depaul.edu/~mobasher/classes/ds599/porter.html http://ils.unc.edu/keyes/java/porter

  9. Stemming algorithm [Porter 86] Steps: 1. Plural removal, including special cases such as “sses” and “ies”; 2. Union of patterns with some suffixes, such as: "ational" -> "ate", "tional" -> "tion", "enci" -> "ence", "anci" -> "ance", "iser" -> "ize", "abli" -> "able", "alli" -> "al", "entli" -> "ent", "eli" -> "e", "ousli" -> "ous", "ization" -> "ize", "isation" -> "ize", "ation" -> "ate", "ator" -> "ate", "alism" -> "al", "iveness" -> "ive", "fulness" -> "ful", "ousness" -> "ous", "aliti" -> "al", "iviti" -> "ive", "biliti" -> "ble";

  10. Stemming algorithm [Porter 86] Steps: 3. Manipulation of special transformations, such as: "icate" -> "ic", "ative" -> "", "alize" -> "al", "alise" -> "al", "iciti" -> "ic", "ical" -> "ic", "ful" -> "", "ness" -> ""; 4. Verification of composite suffixes, including: "al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement", "ment", "ent", "sion", "tion", "ou", "ism", "ate", "iti", "ous", "ive", "ize", "ise"; 5. Verification whether the word ends with a vowel: "kilo", "micro", "milli", "intra", "ultra", "mega", "nano", "pico" and "pseudo".
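These rules are implemented in many libraries; the sketch below uses NLTK's PorterStemmer (an assumption about tooling, not part of the slides), whose output may differ in minor details from the original description:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["compressed", "compression", "caresses", "ponies"]:
    print(word, "->", stemmer.stem(word))
# compressed -> compress    (step 1: -ed removal)
# compression -> compress   (suffix -ion dropped after 's')
# caresses -> caress        (plural: -sses -> -ss)
# ponies -> poni            (plural: -ies -> -i)
```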

  11. N-grams: APPLE => _APP, APPL, PPLE, PLE_ http://www.cs.umbc.edu/ngram http://citeseer.nj.nec.com/miller99hidden.html http://citeseer.nj.nec.com/5655.html
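A short sketch reproducing the APPLE example with padded character 4-grams:

```python
# Character n-grams with boundary padding: APPLE -> _APP, APPL, PPLE, PLE_
def char_ngrams(word, n=4, pad="_"):
    padded = pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("APPLE"))   # ['_APP', 'APPL', 'PPLE', 'PLE_']
```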

  12. Other techniques: Part-of-Speech tagger (Eric Brill, www.cs.jhu.edu/~brill/): separation of a sentence into its syntactic or grammatical components (POS tags); the categories carrying most information content are nouns, verbs and adjectives.

  13. Brill POS tagger output • Input: Mr. Red have a red ball • Output: Mr/NNP ./. Red/NNP have/VBP a/DT red/JJ ball/NN • The tags (NNP, VBP, DT, JJ, NN) are Penn Treebank part-of-speech tags.
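For comparison, a hedged sketch using NLTK's default POS tagger (a perceptron tagger, not Brill's original one); it emits the same Penn Treebank tag format, though tokenization and individual tags may differ slightly from the output above:

```python
import nltk
from nltk import pos_tag, word_tokenize

# First run only: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
tokens = word_tokenize("Mr. Red have a red ball")
print(pos_tag(tokens))
# roughly: [('Mr.', 'NNP'), ('Red', 'NNP'), ('have', 'VBP'),
#           ('a', 'DT'), ('red', 'JJ'), ('ball', 'NN')]
```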

  14. POS - nouns • in general indicate generic entities (dog, tree); • for English, only the plural variation of nouns needs to be considered; • the plural is usually marked by the suffix -s (dogs, trees); • the plural has exceptions: “-es” (speeches) and irregular forms (woman: women); • in addition there is the possessive marker (woman's house), which is a clitic.
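The plural handling described above is what WordNet's morphological analyzer performs; a hedged sketch via NLTK (an assumption about tooling; the wordnet corpus must be downloaded first):

```python
from nltk.stem import WordNetLemmatizer   # requires nltk.download("wordnet")

wnl = WordNetLemmatizer()
for noun in ["dogs", "trees", "speeches", "women"]:
    print(noun, "->", wnl.lemmatize(noun, pos="n"))
# dogs -> dog, trees -> tree, speeches -> speech, women -> woman
```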

  15. WordNet (Princeton University): http://www.cogsci.princeton.edu/wordnet/current/ It is a database of lexemes [Miller 98]; it contains information about composite expressions (phrasal verbs, collocations, idiomatic phrases, etc.); it separates its entries according to their syntactic categories: nouns, verbs, adjectives, ...; within each category, several semantic relations among words are stored.

  16. [Screenshot: the WordNet web search interface, which takes a word and returns its noun, verb, adjective and adverb entries.]

  17. [Diagram: composition of WordNet.]

  18. WordNet contains the relations hyponym, hypernym, meronym and holonym: • a hyponym is a more specific word: cat is a hyponym of animal; • a hypernym is a more generic word: animal is a hypernym of cat; • a part of the whole is a meronym: leaf is a meronym of tree; • the whole which corresponds to a part is called a holonym.
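These relations can be queried programmatically; a hedged sketch through NLTK's WordNet interface (an assumption about tooling; requires nltk.download("wordnet")):

```python
from nltk.corpus import wordnet as wn

cat = wn.synsets("cat", pos=wn.NOUN)[0]   # first noun sense of "cat"
print(cat.hypernyms())                    # more generic synsets (e.g. feline)
print(cat.hyponyms()[:3])                 # more specific kinds of cat

tree = wn.synsets("tree", pos=wn.NOUN)[0]
print(tree.part_meronyms())               # parts of a tree, e.g. trunk, limb
leaf = wn.synsets("leaf", pos=wn.NOUN)[0]
print(leaf.part_holonyms())               # wholes a leaf belongs to (may be empty for this sense)
```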
