An overview of text operations for information retrieval: linguistic operations, document preprocessing, document clustering, and compression.
Special Topics in Computer Science: The Art of Information Retrieval
Chapter 7: Text Operations
Alexander Gelbukh, www.Gelbukh.com
Previous chapter: Conclusions • Modeling of text helps predict the behavior of systems • Zipf's law, Heaps' law (recalled below) • Formally describing the structure of documents makes it possible to treat a part of their meaning automatically, e.g., for search • Languages to describe document syntax • SGML: too expensive • HTML: too simple • XML: a good combination
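For reference, the two empirical laws recalled above can be stated compactly; the constants are corpus-dependent, and the typical exponent values noted here are additions, not part of the slides:

```latex
% Zipf's law: the frequency f of the word of rank r (the r-th most
% frequent word) decays as a power of the rank; alpha is close to 1.
f(r) \;\propto\; \frac{1}{r^{\alpha}}

% Heaps' law: the vocabulary size V grows sublinearly with the text
% length n; typically 0.4 <= beta <= 0.6.
V(n) \;=\; K\, n^{\beta}
```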
Text operations • Linguistic operations • Document clustering • Compression • Encryption (not discussed here)
Linguistic operations Purpose: convert words to “meanings” • Synonyms or related words • Different words, same meaning; morphology • foot/feet, woman/female • Homonyms • Same word, different meanings; word senses • river bank / financial bank • Stopwords • Words with no meaning of their own; functional words • the
For good or for bad? • More exact matching • Less noise, better recall • Unexpected behavior • Difficult for users to grasp • Harmful if it introduces errors • More expensive • Adds a whole new technology • Maintenance; language dependence • Slows processing down Good if done well, harmful if done badly
Document preprocessing • Lexical analysis (punctuation, case) • Simple, but must be done carefully • Stopwords: reduce index size and processing time • Stemming: connected, connection, connections, ... • Multiword expressions: hot dog, B-52 • Here, all the power of linguistic analysis can be used • Selection of index terms • Often nouns; noun groups: computer science • Construction of a thesaurus • synonymy: a network of related concepts (words or phrases)
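A minimal sketch of this pipeline in Python; the stopword list and suffix rules below are illustrative placeholders, far cruder than what a real system (or the Porter stemmer on the next slide) would use:

```python
import re

# Illustrative stopword list; real systems use lists of 100+ functional words.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are"}

def preprocess(text: str) -> list[str]:
    """Lexical analysis -> stopword removal -> crude suffix stripping."""
    # Lexical analysis: lowercase and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stopword elimination: shrinks the index and speeds up processing.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Toy stemming: strip a few common suffixes (Porter does this properly).
    stemmed = []
    for t in tokens:
        for suffix in ("ions", "ion", "ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("The connections of the network are connected."))
# ['connect', 'network', 'connect']
```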
Stemming • Methods • Linguistic analysis: complex, expensive maintenance • Table lookup: simple, but needs data • Statistical (Avetisyan): no data needed, but imprecise • Suffix removal • Porter algorithm (Martin Porter); ready code on his website • Substitution rules: sses → ss • stresses → stress
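A quick check with NLTK's reimplementation of the Porter algorithm (assuming the nltk package is installed; this is a standard toolkit version, not Porter's own code):

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
# All the inflected forms collapse to a single index term.
for word in ["connected", "connection", "connections", "stresses"]:
    print(word, "->", stemmer.stem(word))
# connected -> connect, connection -> connect,
# connections -> connect, stresses -> stress (the sses -> ss rule)
```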
Better stemming The whole problematics of computational linguistics • POS disambiguation • well: adverb or noun? Oil well. • Statistical methods: Brill tagger • Syntactic analysis: syntactic disambiguation • Word sense disambiguation • bank1 and bank2 should be different stems • Statistical methods • Dictionary-based methods: Lesk algorithm • Semantic analysis
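For illustration, NLTK ships a simplified Lesk implementation; the example sentence is ours, and Lesk's gloss-overlap heuristic is known to misfire on short contexts, so treat the output as a demonstration, not a benchmark:

```python
from nltk.wsd import lesk  # requires nltk and its wordnet corpus

# Simplified Lesk: choose the WordNet sense whose dictionary gloss
# shares the most words with the surrounding context.
context = "I deposited my money at the bank on Monday".split()
sense = lesk(context, "bank", pos="n")
print(sense.name(), "-", sense.definition())
```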
Thesaurus • Terms (controlled vocabulary) and relationships • Terms • used for indexing • represent a concept. One word or a phrase. Usually nouns • sense. Definition or notes to distinguish senses: key (door). • Relationships • Paradigmatic: • Synonymy, hierarchical (is-a, part), non-hierarchical • Syntagmatic: collocations, co-occurrences • WordNet. EuroWordNet • synsets
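WordNet is freely accessible from NLTK; a small probe of its synsets for a homonymous word (output truncated here):

```python
from nltk.corpus import wordnet as wn  # requires the wordnet corpus

# One word form, several synsets: each synset is one concept (sense).
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())
# bank.n.01 - sloping land (especially the slope beside a body of water)
# depository_financial_institution.n.01 - a financial institution ...
```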
Use of a thesaurus • To help the user formulate the query • Navigation in the hierarchy of words • Yahoo! • For the program, to collate related terms • woman → female • fuzzy comparison: woman ≈ 0.8 × female; weights from path length
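The 0.8 weight on the slide is a design choice; one common proxy for such weights is WordNet path length, e.g. (the synset names are our guesses at the intended senses):

```python
from nltk.corpus import wordnet as wn

woman = wn.synset("woman.n.01")    # "an adult female person"
female = wn.synset("female.n.02")  # "a person who belongs to the sex ..."
# path_similarity = 1 / (1 + shortest is-a path length); 1.0 = same node.
print(woman.path_similarity(female))
```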
Yahoo! vs. thesaurus • The book says Yahoo! is based on a thesaurus; I disagree • Thesaurus: words of the language organized in a hierarchy • Document hierarchy: documents attached to a hierarchy • This is word sense disambiguation • I claim that Yahoo! is based on (manual) WSD • It also uses a thesaurus for navigation
Text operations • Linguistic operations • Document clustering • Compression • Encryption (not discussed here)
Document clustering • An operation on the whole collection • Global vs. local • Global: the whole collection • At compile time; a one-time operation • Local • Cluster the results of a specific query • At runtime, with each query • More of a query transformation operation • Already discussed in Chapter 5
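A sketch of global clustering on a toy collection; the particular choice of tf-idf vectors and k-means is ours, for illustration only (the book's discussion of clustering is in Chapter 5):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans  # pip install scikit-learn

docs = [
    "stock market bank finance money",
    "bank loan interest rate finance",
    "river bank water fishing",
    "fishing boat on the river",
]
X = TfidfVectorizer().fit_transform(docs)   # documents -> tf-idf vectors
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: the finance docs vs. the river docs
```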
Text operations • Linguistic operations • Document clustering • Compression • Encryption (not discussed here)
Compression • Gain: storage, transmission, search • Cost: time spent compressing/decompressing • In IR: need for random access • Blocks do not work • Also: pattern matching on compressed text
Compression methods Statistical • Huffman: a fixed code per symbol • More frequent symbols get shorter codes • Allows starting decompression from any symbol • Arithmetic: dynamic coding • Need to decompress from the beginning • Not for IR Dictionary-based • Pointers to previous occurrences: Lempel-Ziv • Again, not for IR
Compression ratio • Size compressed / size decompressed • Huffman, units = words: down to 2 bits per character • Close to the limit = entropy. Only for large texts! • Other methods: similar ratio, but no random access • Shannon: the optimal code length for a symbol with probability p is −log₂ p bits • Entropy: the limit of compression • The average length under optimal coding • A property of the model
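A direct computation of this limit under a zero-order character model (the model choice is ours; the 2 bits/char figure on the slide refers to word-based models on large texts):

```python
import math
from collections import Counter

def entropy(text: str) -> float:
    """Shannon entropy in bits/symbol: -sum of p * log2(p) over symbols."""
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())

# No code can average fewer bits per symbol than this, under this model.
print(round(entropy("abracadabra"), 2))  # 2.04 bits per character
```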
Modeling • Find the probability of the next symbol • Adaptive, static, semi-static • Adaptive: good compression, but need to start from the beginning • Static (for the language): poor compression, random access • Semi-static (for the specific text; two passes): both OK • Word-based vs. character-based • Word-based: better compression and search
Huffman coding • Each symbol is encoded, sequentially • More frequent symbols have shorter codes • No code is a prefix of another one • How to build the tree: see the book • Byte codes are better • Allow for sequential search
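A compact way to build the code table (a sketch: it repeatedly merges the two rarest subtrees, as in the book's tree construction, but keeps only the codes; it is character-level, not the word-based, byte-oriented variant preferred in IR):

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    """Build a Huffman code: frequent symbols get shorter, prefix-free codes."""
    # Heap of (frequency, tiebreak, {symbol: code-so-far}) entries.
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # pop the two rarest subtrees...
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged |= {s: "1" + c for s, c in right.items()}
        heapq.heappush(heap, (f1 + f2, i, merged))  # ...and merge them
        i += 1
    return heap[0][2]

print(huffman_codes("abracadabra"))
# e.g. {'a': '0', 'c': '100', 'd': '101', 'b': '110', 'r': '111'}
# 'a' (5 occurrences) gets 1 bit; the rare symbols get 3 bits each.
```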
Dictionary-based methods • Static (simple, poor compression), dynamic, semi-static • Lempel-Ziv: references to previous occurrences • Adaptive • Disadvantages for IR • Need to decode from the very beginning • Newer statistical methods perform better
Compression of inverted files • Inverted file: words + lists of the docs where they occur • The lists of docs are ordered, so they can be compressed • Seen as lists of gaps • Short gaps occur more frequently • Statistical compression • Our work: reorder the docs for better compression • We code runs of docs • Minimize the number of runs • Distance between docs: # of differing words • Reordering solved as a Traveling Salesman Problem (TSP)
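A sketch of the standard gap encoding with a variable-byte code (one common statistical scheme; the postings list below is made up, and the reordering/TSP idea above goes beyond this baseline):

```python
def vbyte_encode(gaps: list[int]) -> bytes:
    """Variable-byte code: 7 data bits per byte, high bit marks the last byte."""
    out = bytearray()
    for g in gaps:
        chunk = []
        while True:
            chunk.append(g & 0x7F)
            g >>= 7
            if not g:
                break
        chunk[0] |= 0x80                    # flag the final (low-order) byte
        out.extend(reversed(chunk))
    return bytes(out)

postings = [3, 7, 11, 120, 1000]            # sorted document ids
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
print(gaps)                                  # [3, 4, 4, 109, 880]: small numbers
print(len(vbyte_encode(gaps)), "bytes")      # 6 bytes vs. 4 * 5 = 20 raw
```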
Research topics • All computational linguistics • Improved POS tagging • Improved WSD • Uses of thesaurus • for user navigation • for collating similar terms • Better compression methods • Searchable compression • Random access
Conclusions • Text transformation: meaning instead of strings • Lexical analysis • Stopwords • Stemming • POS, WSD, syntax, semantics • Ontologies to collate similar stems • Text compression • Searchable • Random access • Word-based statistical methods (Huffman) • Index compression
Thank you! Until the compensation lecture