Text Preprocessing

The preprocessing step aims to create a correct text representation, according to the adopted model. Steps:
• Lexical analysis;
• Case folding, numbers;
• Stop-words elimination;
• Stemming;
• (other preprocessing procedures ...)

[Figure: Logical view of the documents. Labels: Docs, structure, Spaces and Signals, stopwords, Nominal groups, stemming, Manual indexing, Structure, Full text, Index terms.]

Generating index terms:
• Stop words elimination;
• Nominal groups detection;
• Stemming;
• Index terms generation;
• Other preprocessing procedures:
  • Synonyms, co-occurrences, latent semantic indexing, ...

Most common procedures:
"Tokenization": identification of the words in the text;
Words are defined as "strings of contiguous alphanumeric characters with no spaces, possibly including hyphens and apostrophes, but no end-of-sentence marks";
The elements most often used to separate words are the blank, the tab, or the new-line.

Problems:
• End-of-sentence marks x abbreviations; e.g. Wash.
• Apostrophes ( ' ): "magic words" x contractions; e.g. I'll.
• Hyphens: single words x hyphenated words; e.g. e-mail.
• Blank: sometimes does not indicate word separation; e.g. database and data base; New York and San Francisco.
• Numbers: 9365 1873.
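The word definition above can be sketched with a regular expression. This is only an illustrative sketch: the names `TOKEN_RE` and `tokenize` are hypothetical, and the pattern assumes ASCII letters and digits. It keeps internal hyphens and apostrophes, but it does not solve the abbreviation or multi-word problems listed above ("Wash." loses its period, "New York" is split).

```python
import re

# A word: a run of alphanumerics, optionally continued by an internal
# hyphen or apostrophe followed by more alphanumerics (assumption: ASCII only).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")

def tokenize(text):
    """Return the list of word tokens found in text."""
    return TOKEN_RE.findall(text)

if __name__ == "__main__":
    print(tokenize("I'll e-mail Wash. from New York"))
```

Note that the contraction and the hyphenated word survive as single tokens, while the end-of-sentence mark is dropped whether or not it belonged to an abbreviation.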

Case-Folding: the THE The => THE;
http://www.delorie.com/gnu/docs/diffutils/diff_6.html
http://curry.edschool.virginia.edu/aace/conf/webnet/html/invwitt.htm
http://www.dlib.org/dlib/november96/newzealand/11witten.html

Stop-Words removal: an, the, is, are, and, or, so, because, ...;
list on the Web (524 words) in the BOW library, CMU.
http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
http://searchenginewatch.internet.com/facts/stopwords.html
http://pen2.ci.santa-monica.ca.us/city/municode/stopwords.html
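Case folding and stop-word removal are often applied together. The sketch below is a minimal illustration, not the BOW library's implementation: the stop list holds only the eight words quoted above, the name `remove_stop_words` is hypothetical, and folding to lower case (rather than upper case, as in the case-folding example) is an arbitrary choice.

```python
# Tiny illustrative stop list; real lists (such as the 524-word one) are much longer.
STOP_WORDS = {"an", "the", "is", "are", "and", "or", "so", "because"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Fold tokens to lower case, then drop those found in the stop list."""
    folded = [token.lower() for token in tokens]
    return [token for token in folded if token not in stop_words]

if __name__ == "__main__":
    print(remove_stop_words(["The", "cat", "and", "THE", "dog", "are", "friends"]))
```

Folding before the lookup is what lets "The" and "THE" match the single stop-list entry "the".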

Stemming: compressed, compression => compress;
Porter's algorithm:
http://maya.cs.depaul.edu/~mobasher/classes/ds599/porter.html
http://ils.unc.edu/keyes/java/porter

Stemming algorithm [Porter 86]
Steps:
1. Plural removal, including special cases such as "sses" and "ies";
2. Rewriting of suffix patterns such as: "ational" -> "ate", "tional" -> "tion", "enci" -> "ence", "anci" -> "ance", "iser" -> "ize", "abli" -> "able", "alli" -> "al", "entli" -> "ent", "eli" -> "e", "ousli" -> "ous", "ization" -> "ize", "isation" -> "ize", "ation" -> "ate", "ator" -> "ate", "alism" -> "al", "iveness" -> "ive", "fulness" -> "ful", "ousness" -> "ous", "aliti" -> "al", "iviti" -> "ive", "biliti" -> "ble";

3. Handling of special transformations such as: "icate" -> "ic", "ative" -> "", "alize" -> "al", "alise" -> "al", "iciti" -> "ic", "ical" -> "ic", "ful" -> "", "ness" -> "";
4. Checking of composite words, including the suffixes: "al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement", "ment", "ent", "sion", "tion", "ou", "ism", "ate", "iti", "ous", "ive", "ize", "ise";
5. Checking whether the word ends with a vowel: "kilo", "micro", "milli", "intra", "ultra", "mega", "nano", "pico", and "pseudo".
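As an illustration of step 1 only, the plural-removal rules of Porter's step 1a (SSES -> SS, IES -> I, SS -> SS, S -> nothing) can be sketched as below. The function name `step_1a` is hypothetical, the input is assumed to be already lower-cased, and the later steps, which also depend on conditions about the remaining stem, are deliberately left out.

```python
def step_1a(word):
    """Apply the plural-removal rules of Porter's step 1a to a lower-case word.

    Rules are tried in order and only the first matching one is applied.
    """
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word

if __name__ == "__main__":
    for w in ["caresses", "ponies", "caress", "cats"]:
        print(w, "->", step_1a(w))
```

The order of the tests matters: checking "ss" before "s" is what keeps "caress" from being cut down to "cares".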

N-Grams: APPLE => _APP, APPL, PPLE, PLE_
http://www.cs.umbc.edu/ngram
http://citeseer.nj.nec.com/miller99hidden.html
http://citeseer.nj.nec.com/5655.html
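The APPLE example corresponds to character 4-grams taken over the word padded with one underscore on each side. A minimal sketch, assuming that padding scheme (the function name `char_ngrams` and its parameters are hypothetical):

```python
def char_ngrams(word, n=4, pad="_"):
    """Return the character n-grams of word, padded once on each side."""
    padded = f"{pad}{word}{pad}"
    # Slide a window of width n over the padded string.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

if __name__ == "__main__":
    print(char_ngrams("APPLE"))  # ['_APP', 'APPL', 'PPLE', 'PLE_']
```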

Other techniques:
Part-of-Speech tagger (Eric Brill, www.cs.jhu.edu/~brill/):
Separates a sentence into its syntactic or grammatical components (POS tags);
Main use in terms of information content: nouns, verbs, adjectives.

Brill POS Tagger Output
• Input: Mr. Red have a red ball
• Output: Mr/NNP ./. Red/NNP have/VBP a/DT red/JJ ball/NN
• Part of Speech Tags
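Tagger output in this word/TAG form is easy to post-process. The sketch below does not run the Brill tagger; it only parses an already tagged line and keeps the tokens whose tags mark nouns, verbs, or adjectives. The function names and the choice of tag prefixes ("NN", "VB", "JJ") are assumptions made for illustration.

```python
# Tag prefixes treated as content-bearing (assumption: Penn Treebank style tags).
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ")

def parse_tagged(line):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last slash so that a token such as './.' parses correctly.
        word, tag = item.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs

def content_words(pairs):
    """Keep only words tagged as nouns, verbs, or adjectives."""
    return [word for word, tag in pairs if tag.startswith(CONTENT_TAG_PREFIXES)]

if __name__ == "__main__":
    tagged = "Mr/NNP ./. Red/NNP have/VBP a/DT red/JJ ball/NN"
    print(content_words(parse_tagged(tagged)))
```

On the tagged line above, the determiner "a" and the period are discarded, while both "Red" (proper noun) and "red" (adjective) are kept with their distinct tags.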

POS - nouns
• in general indicate generic entities (dog, tree);
• for English, consider only the plural noun variation;
• the plural is usually characterized by the suffix -s (dogs, trees);
• the plural has exceptions: "es" (speeches) and irregular terms (woman: women);
• in addition there is the possessive case (woman's house), whose marker is called a clitic.

WordNet (Princeton University): http://www.cogsci.princeton.edu/wordnet/current/
Is a database of lexemes [Miller 98];
Contains information about composite expressions (phrasal verbs, collocations, idiomatic phrases, etc.);
Separates its entries according to their syntactic categories: nouns, verbs, adjectives, …;
Within a category, several semantic relations among words are stored.

[Figure: WordNet. Search for a word and return its entries as noun, verb, adjective, and adverb.]

Composition: WordNet

WordNet
WordNet contains the relations hyponym, hypernym, meronym and holonym:
• a hyponym is a more specific word: cat is a hyponym of animal;
• a hypernym is a more generic word: animal is a hypernym of cat;
• a part of the whole is a meronym: leaf is a meronym of tree;
• the whole which corresponds to a part is called a holonym: tree is a holonym of leaf.
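The four relations come in inverse pairs, which a toy data structure makes concrete. The sketch below is not WordNet's data or API: the two dictionaries hold only a handful of hand-written entries, and all names are hypothetical.

```python
# Toy data (assumption): word -> its more generic word (hypernym).
HYPERNYM = {"cat": "animal", "dog": "animal"}
# Toy data (assumption): whole -> list of its parts (meronyms).
MERONYMS = {"tree": ["leaf"]}

def hypernym(word):
    """Return the more generic word, or None if unknown."""
    return HYPERNYM.get(word)

def hyponyms(word):
    """Return the more specific words: the inverse of the hypernym relation."""
    return sorted(w for w, generic in HYPERNYM.items() if generic == word)

def meronyms(whole):
    """Return the parts of a whole."""
    return list(MERONYMS.get(whole, []))

def holonyms(part):
    """Return the wholes that contain a part: the inverse of the meronym relation."""
    return sorted(whole for whole, parts in MERONYMS.items() if part in parts)

if __name__ == "__main__":
    print(hypernym("cat"), hyponyms("animal"), meronyms("tree"), holonyms("leaf"))
```

Storing each pair of relations once and deriving the inverse on demand keeps the two directions consistent.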
