Processing of Large Document Collections 1 Helena Ahonen-Myka University of Helsinki
Organization of the course • Classes: 17.9., 22.10., 23.10., 26.11. • lectures (Helena Ahonen-Myka): 10-12,13-15 • exercise sessions (Lili Aunimo): 15-17 • required presence: 75% • Exercises are given (and returned) each week • required: 75% • Exam: 4.12. at 16-20, Auditorio • Points: Exam 30 pts, exercises 30 pts
Schedule • 17.9. Character sets, preprocessing of text, text categorization • 22.10. Text summarization • 23.10. Text compression • 26.11. … to be announced… • self-study: basic transformations for text data, using linguistic tools, etc.
In this part... • Character sets • preprocessing of text • text categorization
1. Character sets • Abstract character vs. its graphical representation • abstract characters are grouped into alphabets • each alphabet forms the basis of the written form of a certain language or a set of languages
Character sets • For instance • for English: • uppercase letters A-Z • lowercase letters a-z • punctuation marks • digits 0-9 • common symbols: +, = • ideographic symbols of Chinese and Japanese • phonetic letters of Western languages
Character sets • To represent text digitally, we need a mapping between (abstract) characters and values stored digitally (integers) • this mapping is a character set • the domain of the character set is called a character repertoire (= the alphabet for which the mapping is defined)
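A small illustration of such a mapping: Python exposes the character-to-integer mapping of one character set (Unicode, covered later in this part) through the built-ins ord and chr.

```python
# A character set in action: ord() and chr() expose the mapping between
# (abstract) characters and integer code values and its inverse.
for ch in ["A", "a", "ä", "€"]:
    code = ord(ch)            # character -> code value
    assert chr(code) == ch    # code value -> character (inverse mapping)
    print(f"{ch!r} -> {code}")
```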
Character sets • For each character in the character repertoire, the character set defines a code value in the set of code points • in English: • 26 letters in both lower- and uppercase • ten digits + some punctuation marks • in Russian: cyrillic letters • both could use the same set of code points (if not a bilingual document) • in Japanese: could be over 6000 characters
Character sets • The mere existence of a character set supports operations like editing and searching of text • usually character sets have some structure • e.g. integers within a small range • all lower-case (resp. upper-case) letters have code values that are consecutive integers (simplifies sorting etc.)
Character sets: standards • Character sets can be arbitrary, but in practice standardization is needed for interoperability (between computers, programs,...) • early standards were designed for English only, or for a small group of languages at a time
Character sets: standards • ASCII • ISO-8859 (e.g. ISO Latin1) • Unicode • UTF-8, UTF-16
ASCII • American Standard Code for Information Interchange • A seven bit code -> 128 code points • actually 95 printable characters only • code points 0-31 and 127 are assigned to control characters (mostly outdated) • ISO 646 (1972) version of ASCII incorporated several national variants (accented letters and currency symbols)
ASCII • With 7 bits, the set of code points is too small for anything other than American English • solution: • 8 bits give more code points (256) • the ASCII character repertoire is mapped to the values 0-127 • additional symbols are mapped to other values
Extended ASCII • Problem: • different manufacturers each developed their own 8-bit extensions to ASCII • different character repertoires -> translation between them is not always possible • also, 256 code values are not enough to represent all the alphabets -> different variants for different languages
ISO 8859 • Standardization of 8-bit character sets • In the 80's: multipart standard ISO 8859 was produced • defines a collection of 8-bit character sets, each designed for a group of languages • the first part: ISO 8859-1 (ISO Latin1) • covers most Western European languages • 0-127: identical to ASCII, 128-159 (mostly) unused, 96 code values for accented letters and symbols
Unicode • 256 is not enough code points • for ideographically represented languages (Chinese, Japanese…) • for simultaneous use of several languages • solution: more than one byte for each code value • a 16-bit character set has 65,536 code points
Unicode • 16-bit character set, i.e. 65,536 code points • not sufficient for all the characters required for Chinese, Japanese, and Korean scripts in distinct positions • CJK consolidation: characters of these scripts are given the same value if they look the same
Unicode • Code values for all the characters used to write contemporary ’major’ languages • also the classical forms of some languages • Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan • Chinese, Japanese, and Korean ideograms, and the Japanese and Korean phonetic and syllabic scripts
Unicode • punctuation marks • technical and mathematical symbols • arrows • dingbats (pointing hands, stars, …) • both accented letters and separate diacritical marks (accents, tildes…) are included, with a mechanism for building composite characters • can also create problems: two characters that look the same may have different code values • -> normalization may be necessary
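A small illustration of the normalization issue, using Python's standard unicodedata module:

```python
import unicodedata

# 'é' can be one precomposed character (U+00E9) or 'e' followed by a
# combining acute accent (U+0301): same appearance, different code values.
precomposed = "\u00e9"
combining = "e\u0301"
print(precomposed == combining)                       # False
print(unicodedata.normalize("NFC", precomposed) ==
      unicodedata.normalize("NFC", combining))        # True after normalization
```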
Unicode • Code values for nearly 39,000 symbols are provided • some part is reserved for an expansion method (see later) • 6,400 code points are reserved for private use • they will never be assigned to any character by the standard, so they will not conflict with the standard
Unicode: encodings • Encoding is a mapping that transforms a code value into a sequence of bytes for storage and transmission • identity mapping for an 8-bit code? • it may be necessary to encode 8-bit characters as sequences of 7-bit (ASCII) characters • e.g. Quoted-Printable (QP) • code values 128-255 as a sequence of 3 bytes • 1: ASCII code for ’=’, 2 & 3: hexadecimal digits of the value • 233 -> E9 -> =E9
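A minimal sketch of this rule, assuming ISO 8859-1 as the underlying 8-bit code; real Quoted-Printable also escapes '=' itself and limits line lengths:

```python
def quoted_printable(text: str) -> str:
    # Bytes in the ASCII range pass through; bytes 128-255 become '='
    # followed by the two hexadecimal digits of the value.
    out = []
    for byte in text.encode("iso-8859-1"):
        if byte < 128:
            out.append(chr(byte))
        else:
            out.append(f"={byte:02X}")
    return "".join(out)

print(quoted_printable("é"))   # 'é' has code value 233 (0xE9) -> '=E9'
```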
Unicode: encodings • UTF-8 • ASCII code values are likely to be more common in most text than any other values • in UTF-8 encoding, ASCII characters are sent as themselves (high-order bit 0) • other characters are encoded as sequences of two to six bytes (with the high-order bit of each byte set to 1)
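This behaviour is easy to see with Python's built-in UTF-8 codec:

```python
# ASCII characters keep their single-byte form; other characters become
# multi-byte sequences whose bytes all have the high-order bit set to 1.
for ch in ["A", "é", "あ"]:
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in encoded)
    print(f"{ch!r}: {len(encoded)} byte(s): {bits}")
```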
Unicode: encodings • UTF-16: expansion method • two 16-bit values (a surrogate pair) are combined into one code value -> about a million additional characters available
2. Preprocessing of text • Text cannot be directly interpreted by many document processing applications • an indexing procedure is needed • mapping of a text into a compact representation of its content • which are the meaningful units of text? • how should these units be combined? (usually considered less important)
Vector model • A document is usually represented as a vector of term weights • the vector has as many dimensions as there are terms (or features) in the whole collection of documents • the weight represents how much the term contributes to the semantics of the document
Vector model • Different approaches: • different ways to understand what a term is • different ways to compute term weights
Terms • Words • typical choice • set of words, bag of words • phrases • syntactical phrases • statistical phrases • usefulness not yet known?
Terms • Part of the text is not considered as terms • very common words (function words): • articles, prepositions, conjunctions • numerals • these words are pruned using a stopword list • other preprocessing is possible • stemming, reduction to base forms (lemmatization)
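A minimal preprocessing sketch along these lines; the stopword list and the suffix-stripping "stemmer" below are tiny stand-ins for a real stopword list and a proper stemmer (e.g. Porter's):

```python
import re

# A tiny illustrative stopword list and a naive suffix-stripping "stemmer".
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is"}

def naive_stem(word: str) -> str:
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[a-z]+", text.lower())   # drops digits and punctuation
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The indexing of large document collections in 2002"))
# -> ['index', 'large', 'document', 'collection']
```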
Weights of terms • Weights usually range between 0 and 1 • binary weights may be used • 1 denotes presence, 0 absence of the term in the document • often the tfidf function is used • higher weight, if the term occurs often in the document • lower weight, if the term occurs in many documents
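A minimal sketch of one common variant of the tfidf family (term frequency times the log of the inverse document frequency); real systems usually also normalize the resulting vectors to unit length:

```python
import math
from collections import Counter

docs = [
    ["text", "compression", "reduces", "text", "size"],
    ["text", "categorization", "assigns", "categories"],
    ["image", "compression", "reduces", "image", "size"],
]

N = len(docs)
df = Counter(term for doc in docs for term in set(doc))   # document frequencies

def tfidf(doc):
    tf = Counter(doc)                                     # term frequencies
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs[0]))   # 'text' occurs twice here and in 2 of the 3 documents
```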
Structure • Either the full text of the document or selected parts of it are indexed • e.g. in a patent categorization application • title, abstract, the first 20 lines of the summary, and the section containing the claims of novelty of the described invention • some parts may be considered more important • e.g. higher weight for the terms in the title
Dimensionality reduction • Many algorithms cannot handle high dimensionality of the term space (= large number of terms) • usually dimensionality reduction is applied • dimensionality reduction also reduces overfitting • classifier that overfits the training data is good at re-classifying the training data but worse at classifying previously unseen data
Dimensionality reduction • Local dimensionality reduction • for each category, a reduced set of terms is chosen for the classification of that category • hence, different subsets are used when working with different categories • global dimensionality reduction • a reduced set of terms is chosen for the classification under all categories
Dimensionality reduction • Dimensionality reduction by term selection • the terms of the reduced term set are a subset of the original term set • Dimensionality reduction by term extraction • the terms are not of the same type as the terms in the original term set, but are obtained by combinations and transformations of the original ones
Dimensionality reduction by term selection • Goal: select terms that, when used for document indexing, yield the highest effectiveness in the given application • wrapper approach • the reduced set of terms is found iteratively and tested with the application • filtering approach • keep the terms that receive the highest score according to a function that measures the ”importance” of the term for the task
Dimensionality reduction by term selection • Many functions available • document frequency: keep the high frequency terms • stopwords have been already removed • 50% of the words occur only once in the document collection • e.g. remove all terms occurring in at most 3 documents
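A sketch of document-frequency-based term selection, following the example above (drop terms occurring in at most 3 documents); the function name and threshold are illustrative:

```python
from collections import Counter

def select_by_df(docs, min_df=3):
    # Keep only the terms that occur in more than min_df documents;
    # each document is a list of (already preprocessed) terms.
    df = Counter(term for doc in docs for term in set(doc))
    return {term for term, count in df.items() if count > min_df}
```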
Dimensionality reduction by term selection • Information-theoretic term selection functions, e.g. • chi-square • information gain • mutual information • odds ratio • relevancy score
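As an example, a sketch of the chi-square function for a term t and a category c, computed from a 2x2 contingency table of document counts; the variable names are illustrative:

```python
def chi_square(n_tc, n_tc_, n_t_c, n_t_c_):
    # n_tc  : documents in category c that contain term t
    # n_tc_ : documents outside c that contain t
    # n_t_c : documents in c that do not contain t
    # n_t_c_: documents outside c that do not contain t
    n = n_tc + n_tc_ + n_t_c + n_t_c_
    num = n * (n_tc * n_t_c_ - n_tc_ * n_t_c) ** 2
    den = (n_tc + n_t_c) * (n_tc_ + n_t_c_) * (n_tc + n_tc_) * (n_t_c + n_t_c_)
    return num / den if den else 0.0

print(chi_square(40, 10, 10, 40))   # term strongly associated with c
print(chi_square(25, 25, 25, 25))   # term independent of c -> 0.0
```

Terms are then ranked by their score (e.g. the maximum or a weighted average over categories) and the highest-scoring ones are kept.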
Dimensionality reduction by term extraction • Term extraction attempts to generate, from the original term set, a set of ”synthetic” terms that maximize effectiveness • due to polysemy, homonymy, and synonymy, the original terms may not be optimal dimensions for document content representation
Dimensionality reduction by term extraction • Term clustering • tries to group words with a high degree of pairwise semantic relatedness • groups (or their centroids) may be used as dimensions • latent semantic indexing • compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions by looking at their patterns of co-occurrence
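A minimal latent semantic indexing sketch using a plain SVD on a tiny term-document matrix (real collections would use sparse matrices and truncated SVD routines):

```python
import numpy as np

# Rows are terms, columns are documents (tiny illustrative counts).
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

k = 2                                         # target dimensionality
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]      # each column: a document in k dims
print(doc_vectors.round(2))
```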
3. Text categorization • Text classification, topic classification/spotting/detection • problem setting: • assume: a predefined set of categories, a set of documents • label each document with one (or more) categories
Text categorization • Two major approaches: • knowledge engineering -> end of 80’s • manually defined set of rules encoding expert knowledge on how to classify documents under the given categories • machine learning, 90’s -> • an automatic text classifier is built by learning, from a set of preclassified documents, the characteristics of the categories
Text categorization • Let • D: a domain of documents • C = {c1, …, c|C|} : a set of predefined categories • T = true, F = false • The task is to approximate the unknown target function Φ′: D x C -> {T,F} by means of a function Φ: D x C -> {T,F}, such that the two functions ”coincide as much as possible” • function Φ′: how documents should be classified • function Φ: the classifier (hypothesis, model…)
We assume... • Categories are just symbolic labels • no additional knowledge of their meaning is available • No knowledge outside of the documents is available • all decisions have to be made on the basis of the knowledge extracted from the documents • metadata, e.g., publication date, document type, source etc. is not used
-> general methods • Methods do not depend on any application-dependent knowledge • in operational applications all kinds of knowledge can be used • content-based decisions are necessarily subjective • it is often difficult to measure the effectiveness of the classifiers • even human classifiers do not always agree
Single-label vs. multi-label • Single-label text categorization • exactly 1 category must be assigned to each dj ∈ D • Multi-label text categorization • any number of categories may be assigned to the same dj ∈ D • Special case of single-label: binary • each dj must be assigned either to category ci or to its complement ¬ci
Single-label, multi-label • The binary case (and, hence, the single-label case) is more general than the multi-label • an algorithm for binary classification can also be used for multi-label classification • the converse is not true
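A sketch of this decomposition: one (hypothetical) binary classifier per category, with a category assigned whenever its classifier answers T; the keyword-based classifiers below are toy stand-ins:

```python
# `binary_classifiers` is a hypothetical mapping from category name to a
# binary classifier (document -> True/False); here simple keyword tests.
def multilabel_classify(document, binary_classifiers):
    return {c for c, clf in binary_classifiers.items() if clf(document)}

binary_classifiers = {
    "sports": lambda d: "football" in d,
    "politics": lambda d: "election" in d,
    "economy": lambda d: "market" in d,
}
print(multilabel_classify("the election moved the market", binary_classifiers))
# -> {'politics', 'economy'}
```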
Category-pivoted vs. document-pivoted • Two different ways for using a text classifier • given a document, we want to find all the categories, under which it should be filed -> document-pivoted categorization (DPC) • given a category, we want to find all the documents that should be filed under it -> category-pivoted categorization (CPC)
Category-pivoted vs. document-pivoted • The distinction is important, since the sets C and D might not be available in their entirety right from the start • DPC: suitable when documents become available at different moments in time, e.g. filtering e-mail • CPC: suitable when new categories are added after some documents have already been classified (and have to be reclassified)
Category-pivoted vs. document-pivoted • Some algorithms may apply to one style and not the other, but most techniques are capable of working in either mode
Hard-categorization vs. ranking categorization • Hard categorization • the classifier answers T or F • Ranking categorization • given a document, the classifier might rank the categories according to their estimated appropriateness to the document • respectively, given a category, the classifier might rank the documents
Applications of text categorization • Automatic indexing for Boolean information retrieval systems • document organization • text filtering • word sense disambiguation • hierarchical categorization of Web pages