
Special Topics in Computer Science The Art of Information Retrieval Chapter 7: Text Operations



Presentation Transcript


  1. Special Topics in Computer Science: The Art of Information Retrieval. Chapter 7: Text Operations. Alexander Gelbukh, www.Gelbukh.com

  2. Previous chapter: Conclusions • Modeling of text helps predict the behavior of systems • Zipf's law, Heaps' law • Describing the structure of documents formally allows part of their meaning to be processed automatically, e.g., for search • Languages to describe document syntax • SGML, too expensive • HTML, too simple • XML, good combination

  3. Text operations • Linguistic operations • Document clustering • Compression • Encryption (not discussed here)

  4. Linguistic operations Purpose: Convert words to “meanings” • Synonyms or related words • Different words, same meaning. Morphology • Foot/feet, woman / female • Homonyms • Same words, different meanings. Word senses • River bank / financial bank • Stopwords • Word, no meaning. Functional words • The

  5. For good or for bad? • More exact matching • Less noise, better recall • Unexpected behavior • Difficult for users to grasp • Harms if it introduces errors • More expensive • Adds a whole new technology • Maintenance; language dependent • Slows the system down • Good if done well, harmful if done badly

  6. Document preprocessing • Lexical analysis (punctuation, case) • Simple, but must be done carefully • Stopwords. Removing them reduces index size and processing time • Stemming: connected, connection, connections, ... • Multiword expressions: hot dog, B-52 • Here, all the power of linguistic analysis can be used • Selection of index terms • Often nouns; noun groups: computer science • Construction of thesaurus • synonymy: network of related concepts (words or phrases)
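
The first two preprocessing steps can be sketched roughly as follows. This is a minimal illustration, not the method of any particular system: the stopword list is a tiny made-up sample, and the function names `tokenize` and `index_terms` are hypothetical. The tokenizer keeps internal hyphens so that a term like B-52 survives as one token, which is one of the places where "simple but must be careful" applies.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def tokenize(text):
    """Lexical analysis: lowercase, drop punctuation, keep internal hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def index_terms(text):
    """Tokenize and then drop stopwords."""
    return [t for t in tokenize(text) if t not in STOPWORDS]

terms = index_terms("The B-52 is in the hangar.")
# terms == ["b-52", "hangar"]
```

Four of the six tokens are stopwords here, which hints at why removing them shrinks the index.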

  7. Stemming • Methods • Linguistic analysis: complex, expensive maintenance • Table lookup: simple, but needs data • Statistical (Avetisyan): no data, but imprecise • Suffix removal • Suffix removal • Porter algorithm. Martin Porter. Ready-to-use code on his website • Substitution rules: sses → ss, s → (empty) • stresses → stress.
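
Suffix removal by substitution rules can be sketched as below. The four rules are the plural-handling rules from the first step of the Porter algorithm (sses → ss, ies → i, ss → ss, s → empty); the full algorithm has several more steps with conditions on the stem, which this sketch leaves out. The function name `strip_plural` is hypothetical.

```python
# Rules are tried in order; the first matching suffix wins,
# so longer suffixes must come before shorter ones.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def strip_plural(word):
    """Apply the first matching substitution rule to the end of the word."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

stems = [strip_plural(w) for w in ["stresses", "connections", "caress", "ponies"]]
# stems == ["stress", "connection", "caress", "poni"]
```

The rule ss → ss looks like a no-op, but it stops the later rule s → empty from turning "caress" into "cares".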

  8. Better stemming • The full range of problems of computational linguistics • POS disambiguation • well: adverb or noun? Oil well. • Statistical methods. Brill tagger • Syntactic analysis. Syntactic disambiguation • Word sense disambiguation • bank1 and bank2 should be different stems • Statistical methods • Dictionary-based methods. Lesk algorithm • Semantic analysis
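
The idea behind dictionary-based disambiguation can be illustrated with a much simplified variant of the Lesk algorithm: pick the sense whose dictionary definition shares the most words with the context. The two glosses below are invented for the illustration, and the labels bank1 and bank2 follow the slide; a real system would take definitions from a dictionary and would at least remove stopwords first.

```python
# Hypothetical glosses, written only for this example.
SENSES = {
    "bank1": "sloping land beside a river or lake",
    "bank2": "financial institution that accepts deposits and lends money",
}

def lesk(context, senses):
    """Return the sense whose gloss overlaps most with the context words."""
    context_words = set(context.lower().split())

    def overlap(sense):
        return len(context_words & set(senses[sense].split()))

    # On a tie, max() keeps the first sense in dictionary order.
    return max(senses, key=overlap)

river_sense = lesk("he sat on the bank of the river", SENSES)
money_sense = lesk("the bank lends money to farmers", SENSES)
# river_sense == "bank1", money_sense == "bank2"
```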

  9. Thesaurus • Terms (controlled vocabulary) and relationships • Terms • used for indexing • represent a concept. One word or a phrase. Usually nouns • sense. Definition or notes to distinguish senses: key (door). • Relationships • Paradigmatic: • Synonymy, hierarchical (is-a, part), non-hierarchical • Syntagmatic: collocations, co-occurrences • WordNet. EuroWordNet • synsets

  10. Use of thesaurus • To help the user to formulate the query • Navigation in the hierarchy of words • Yahoo! • For the program, to collate related terms • woman ≈ female • fuzzy comparison: woman ≈ 0.8 * female. Path length
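
One way to read "fuzzy comparison" by path length is sketched below: terms are nodes in the thesaurus graph, and similarity decays with the number of links between them. The toy graph, the decay factor 0.8 per link, and the names `path_length` and `similarity` are all assumptions made for the illustration; the slide does not say which formula is meant.

```python
from collections import deque

# Hypothetical toy thesaurus: each term lists its directly related terms.
EDGES = {
    "woman": ["female", "person"],
    "female": ["woman"],
    "person": ["woman", "man"],
    "man": ["person"],
}

def path_length(a, b):
    """Breadth-first search for the shortest number of links from a to b."""
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in EDGES.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # not connected

def similarity(a, b, decay=0.8):
    """Similarity shrinks by a constant factor per link (an assumed formula)."""
    dist = path_length(a, b)
    return 0.0 if dist is None else decay ** dist
```

With this graph, `similarity("woman", "female")` is 0.8 (one link), while "female" and "man" are three links apart and score about 0.51.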

  11. Yahoo! vs. thesaurus • The book says Yahoo! is based on a thesaurus. I disagree • Thesaurus: words of language organized in hierarchy • Document hierarchy: documents attached to hierarchy • This is word sense disambiguation • I claim that Yahoo! is based on (manual) WSD • Also uses thesaurus for navigation

  12. Text operations • Linguistic operations • Document clustering • Compression • Encryption (not discussed here)

  13. Document clustering • Operation on the whole collection • Global vs. local • Global: whole collection • At compile time, one-time operation • Local • Cluster the results of a specific query • At runtime, with each query • Is more a query transformation operation • Already discussed in Chapter 5

  14. Text operations • Linguistic operations • Document clustering • Compression • Encryption (not discussed here)

  15. Compression • Gain: storage, transmission, search • Cost: time spent on compressing/decompressing • In IR: need for random access. • Blocks do not work • Also: pattern matching on compressed text

  16. Compression methods • Statistical • Huffman: a fixed code per symbol. • More frequent symbols get shorter codes • Allows starting decompression from any symbol • Arithmetic: dynamic coding • Need to decompress from the beginning • Not for IR • Dictionary • Pointers to previous occurrences. Lempel-Ziv • Again not for IR

  17. Compression ratio • Size compressed / size decompressed • Huffman, units = words: up to 2 bits per char • Close to the limit = entropy. Only for large texts! • Other methods: similar ratio, but no random access • Shannon: optimal code length for a symbol with probability p is -log2(p) bits • Entropy: Limit of compression • Average length with optimal coding • Property of model
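
The entropy bound can be made concrete with a short calculation. The sketch below uses a zero-order model, that is, each symbol's probability is just its relative frequency in the text; this choice of model is an assumption, and a different model gives a different entropy, which is what "property of model" refers to. The function name is hypothetical.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(symbols):
    """Average optimal code length: sum over symbols of p * (-log2 p)."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

h_two = entropy_bits_per_symbol("aabb")   # two equally likely symbols: 1 bit each
h_four = entropy_bits_per_symbol("abcd")  # four equally likely symbols: 2 bits each
```

The same function works on a list of words instead of a string of characters, which corresponds to the word-based setting mentioned on the slide.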

  18. Modeling • Find probability for the next symbol • Adaptive, static, semi-static • Adaptive: good compression, but need to start from the beginning • Static (for language): poor compression, random access • Semi-static (for specific text; two-pass): both OK • Word-based vs. character-based • Word-based: better compression and search

  19. Huffman coding • Each symbol is encoded, sequentially • More frequent symbols have shorter codes • No code is a prefix of another one • How to build the tree: see the book • Byte codes are better • Allow for sequential search
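
For readers without the book at hand, the standard tree construction can be sketched as follows: repeatedly merge the two least frequent subtrees until one tree remains. This is a bit-oriented, word-based sketch written for illustration; it does not produce the byte codes recommended on the slide, and the function name `huffman_codes` is hypothetical.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Return a dict mapping each symbol to its Huffman code (a bit string)."""
    counts = Counter(symbols)
    if len(counts) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(counts)): "0"}
    # Heap entries: (frequency, tie-breaker, {symbol: code so far}).
    # The unique integer tie-breaker keeps the dicts from being compared.
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one more bit to every code in them.
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("to be or not to be to".split())
```

In this example "to" (3 occurrences) gets a 1-bit code, "be" (2 occurrences) a 2-bit code, and the two rare words 3-bit codes, and no code is a prefix of another.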

  20. Dictionary-based methods • Static (simple, poor compression), dynamic, semi-static. • Lempel-Ziv: references to previous occurrence • Adaptive • Disadvantages for IR • Need to decode from the very beginning • New statistical methods perform better
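
The "references to previous occurrence" idea can be illustrated with a naive LZ77-style sketch. Each output triple says: copy `length` characters starting `offset` positions back, then append one literal character. The triple format, the window size, and the function names are choices made for this illustration and do not correspond to any real file format.

```python
def lz77_compress(data, window=255):
    """Encode data as (offset, length, next_char) triples (naive search)."""
    out = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Stop one short of the end so a literal character always remains.
            while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(triples):
    """Rebuild the text; each copy needs everything decoded before it."""
    out = []
    for offset, length, char in triples:
        start = len(out) - offset
        for k in range(length):
            out.append(out[start + k])
        out.append(char)
    return "".join(out)

text = "abracadabra abracadabra"
triples = lz77_compress(text)
restored = lz77_decompress(triples)  # equals text
```

The decoder shows the disadvantage for IR directly: a triple can only be expanded once all the text before it has been reconstructed, so decoding cannot start in the middle.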

  21. Comparison of methods

  22. Compression of inverted files • Inverted file: words + lists of docs where they occur • Lists of docs are ordered. Can be compressed • Seen as lists of gaps. • Short gaps occur more frequently • Statistical compression • Our work: order the docs for better compression • We code runs of docs • Minimize the number of runs • Distance: # of different words • TSP.
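
The gap representation can be sketched as follows. To show why short gaps help, the example encodes the gaps with variable-byte coding, a simple scheme swapped in here for illustration; it is not the statistical coding or the run-based method the slide refers to. All function names are hypothetical.

```python
def to_gaps(doc_ids):
    """Replace each document number by its distance from the previous one."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def from_gaps(gaps):
    """Inverse of to_gaps: running sum of the gaps."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

def vbyte_encode(numbers):
    """7 data bits per byte, low bits first; the high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:  # last byte of this number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

docs = [3, 7, 8, 300, 1000]
gaps = to_gaps(docs)          # [3, 4, 1, 292, 700]
encoded = vbyte_encode(gaps)  # 7 bytes: small gaps take one byte each
```

Every gap below 128 fits in one byte, so a list with many short gaps, such as one with long runs of consecutive documents, compresses well.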

  23. Research topics • All computational linguistics • Improved POS tagging • Improved WSD • Uses of thesaurus • for user navigation • for collating similar terms • Better compression methods • Searchable compression • Random access

  24. Conclusions • Text transformation: meaning instead of strings • Lexical analysis • Stopwords • Stemming • POS, WSD, syntax, semantics • Ontologies to collate similar stems • Text compression • Searchable • Random access • Word-based statistical methods (Huffman) • Index compression

  25. Thank you! Till compensation lecture
