Overview
Focus: Methods and technologies to store and retrieve information in the form of documents that contain text and that may also contain tables, diagrams, and images
• In any information system, the “real world” is represented by a collection of data abstracted from observations of the real world and made available to the system
• Need, reality, data, query
Overview (cont. 1)
Ectosystem: system factors that are not under the control of the designer
Endosystem: system factors that the designer can specify and control (e.g., algorithms)
Performance
• Effectiveness
• Efficiency
• Economy
From Data to Wisdom
• Data: impersonal and equally available to everyone
• Information: a set of data matched to a need; personal and time-dependent
• Knowledge
  • Data, information, and rules
• IR&S process description
Data Compression
• Level of compression; character vs. word
• Data model
  • Statistical: build statistical tables from a sample of the text
  • Adaptive: starts with an a priori statistical distribution for the text symbols but modifies it as each character or word is encoded
  • Semi-static: start with a model for, say, Chapter 1, then modify it to better fit Chapter 2, and so on
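The adaptive model described above can be illustrated with a small sketch. It assumes a character-level model whose a priori distribution is uniform over a known alphabet, and it measures the ideal code length (the sum of -log2 p over the symbols) rather than producing actual bits. The function names and the uniform starting distribution are assumptions made for illustration, not part of the slides.

```python
import math
from collections import Counter

def adaptive_code_length(text, alphabet):
    """Ideal code length in bits under an adaptive character model.

    Assumption: the a priori distribution is uniform, i.e. every symbol
    of the alphabet starts with a count of 1.
    """
    symbols = set(alphabet)
    counts = Counter({s: 1 for s in symbols})
    total = len(symbols)
    bits = 0.0
    for ch in text:
        # Cost of this symbol under the model as it stands right now.
        bits += -math.log2(counts[ch] / total)
        # Update the model after the symbol has been encoded.
        counts[ch] += 1
        total += 1
    return bits

def fixed_table_code_length(text):
    """Ideal code length in bits when one frequency table is built
    from the text up front and never changed."""
    counts = Counter(text)
    n = len(text)
    return sum(-c * math.log2(c / n) for c in counts.values())

if __name__ == "__main__":
    sample = "abracadabra"
    print(adaptive_code_length(sample, "abcdr"))
    print(fixed_table_code_length(sample))
```

The fixed-table figure leaves out the cost of sending the table to the decoder, which the adaptive model avoids because the decoder can rebuild the same counts as it goes.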
Types of Codes for Text Compression
• Huffman: static, binary tree
• Ziv-Lempel: adaptive; identifies each text segment the first time it appears and then points back to it when it occurs again
• Arithmetic: adaptive; the text stream is identified by a number that represents the statistical distribution of the symbols, which is later modified as the text is encoded
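Two of the codes listed above can be sketched briefly. The first function builds a static Huffman code from character frequencies by repeatedly merging the two lightest subtrees of the binary tree. The second pair uses LZ78, one member of the Ziv-Lempel family, chosen here as an assumption because the slides do not name a variant: each new segment is stored once and later referred to by its index. All function names are hypothetical.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Map each symbol in text to a Huffman bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prefixes their codes with 0 and 1.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def lz78_encode(text):
    """Encode text as (index of known segment, next character) pairs."""
    dictionary = {"": 0}
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending a segment seen before
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Leftover segment that is already in the dictionary.
        output.append((dictionary[phrase], ""))
    return output

def lz78_decode(pairs):
    """Rebuild the text from the pairs produced by lz78_encode."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

if __name__ == "__main__":
    print(huffman_codes("aaaabbc"))
    print(lz78_encode("abababab"))
```

In the Huffman sketch, frequent symbols end up closer to the root and so get shorter codes. In the LZ78 sketch, the decoder never needs the dictionary sent to it, since it rebuilds the same segments in the same order.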