WMES3103 : INFORMATION RETRIEVAL TEXT AND MULTIMEDIA LANGUAGES AND PROPERTIES
INTRODUCTION • Text - main form of communicating data and information • Text is also supplemented with multimedia elements - to make the contents of an IRS more attractive and interactive • A website that combines text and multimedia tends to attract more visitors than one that is text-only • In an IRS (information retrieval system), text and multimedia are represented using special languages
Metadata • A newer concept in information handling – metadata • Information about the organization of the data, the data domains, and the relationship between them • In short: data about data • Two types – descriptive and semantic
Descriptive metadata – metadata that describes a document or a unit of information • Commonly used descriptive metadata: • Authors • Date of publication • Source of publication • Length of document • Type of document
Metadata • Semantic metadata – characterizes the subject matter that can be obtained from the contents of the document • Subject headings • Keywords • LC code (Library of Congress classification)
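The two kinds of metadata can be pictured as separate groups of fields attached to one document record. The sketch below is only an illustration: the class names, field names, and the sample record are assumptions made for this example and do not come from any particular IRS.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DescriptiveMetadata:
    """Facts about the document itself, independent of its subject."""
    authors: list[str]
    publication_date: date
    source: str
    length_pages: int
    document_type: str


@dataclass
class SemanticMetadata:
    """Subject information derived from the document's contents."""
    subject_headings: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)
    lc_code: str = ""


@dataclass
class DocumentRecord:
    title: str
    descriptive: DescriptiveMetadata
    semantic: SemanticMetadata


# Hypothetical record with placeholder values, invented for illustration.
record = DocumentRecord(
    title="An Example Report",
    descriptive=DescriptiveMetadata(
        authors=["A. Author"],
        publication_date=date(2000, 1, 1),
        source="Example Journal",
        length_pages=12,
        document_type="article",
    ),
    semantic=SemanticMetadata(
        subject_headings=["Information retrieval"],
        keywords=["metadata", "indexing"],
        lc_code="PLACEHOLDER",
    ),
)


def matches_keyword(rec: DocumentRecord, term: str) -> bool:
    """Return True if the term appears among the record's keywords."""
    return term.lower() in (k.lower() for k in rec.semantic.keywords)
```

A retrieval system could filter on the descriptive fields (for example by date or document type) and match queries against the semantic fields, as `matches_keyword` does in a very simplified way.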
TEXT • With computers, text must be coded into binary digits • Early coding schemes – EBCDIC (8 bits per symbol) and ASCII (7 bits per symbol) • ASCII was later extended to 8 bits (e.g. ISO Latin) to accommodate other languages, accents and diacritical marks • Oriental languages, with thousands of symbols – Unicode, originally a 16-bit code
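The difference between these code widths can be seen by encoding single characters with Python's built-in codecs. This is a small sketch; the three sample characters are arbitrary choices made for the example.

```python
text_ascii = "A"
text_accented = "\u00e9"   # e with acute accent
text_cjk = "\u6f22"        # a CJK ideograph

# 7-bit ASCII: every code value is below 128, so 7 bits are enough.
ascii_bytes = text_ascii.encode("ascii")
ascii_bits = format(ascii_bytes[0], "07b")

# 8-bit ISO Latin-1 extends ASCII with accented letters.
latin1_bytes = text_accented.encode("latin-1")

# The accented letter cannot be expressed in 7-bit ASCII.
try:
    text_accented.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-16 (big-endian, no byte-order mark) stores this ideograph
# in one 16-bit unit, i.e. two bytes.
utf16_bytes = text_cjk.encode("utf-16-be")

print(ascii_bits, len(latin1_bytes), ascii_ok, len(utf16_bytes))
```

Characters outside Unicode's first 65,536 code points take two 16-bit units in UTF-16, so "16 bits per symbol" holds for the sample above but not for every character.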
TEXT Formats • There is no single format for a text document • A good IRS should be able to retrieve information from any format • Early IRSs converted each document to an internal format, but this had many disadvantages • Now, many new formats have been developed for document interchange
TEXT • RTF – Rich Text Format for word processing • PDF – Portable Document Format for displaying and printing documents • PostScript – powerful programming language for drawing • MIME – Multipurpose Internet Mail Extensions, used to encode e-mail • Files are often compressed – compress (Unix), ARJ (PCs), ZIP • Binary files can be converted to ASCII text – uuencode/uudecode, BinHex
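Compression and binary-to-ASCII conversion are easy to try with Python's standard library. The sketch below assumes `zlib` (DEFLATE, the method most commonly used inside ZIP archives) as a stand-in for the compressors named above, Base64 as the binary-to-text encoding typically used with MIME, and `binascii.b2a_uu` for a single uuencoded line. The payload is arbitrary test data.

```python
import base64
import binascii
import zlib

# Arbitrary binary payload with some repetition.
payload = bytes(range(256)) * 4

# Compress and check that decompression restores the original bytes.
compressed = zlib.compress(payload)
round_trip_ok = zlib.decompress(compressed) == payload

# Base64 turns arbitrary bytes into ASCII-only text.
b64_text = base64.b64encode(compressed).decode("ascii")

# uuencode works line by line; b2a_uu accepts at most 45 bytes per line.
chunk = compressed[:45]
uu_line = binascii.b2a_uu(chunk)
restored = binascii.a2b_uu(uu_line)

print(len(payload), len(compressed), len(b64_text))
```

Both text encodings make the data larger (roughly by a third), which is why files are usually compressed before they are encoded for mail transport.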
MARKUP LANGUAGES • Markup = extra textual syntax that can be used to describe formatting actions, structure information, text semantics, attributes, etc. • Formal markup languages are more structured • Marks = tags - an initial and an ending tag surround the marked text • Standard metalanguage = SGML • New metalanguage for the Web = XML (eXtensible Markup Language) = subset of SGML • Most popular markup language used for the Web = HTML (HyperText Markup Language)
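Because tags mark structure explicitly, a program can pull out individual parts of a marked-up document. The sketch below parses a small XML fragment with Python's standard `xml.etree.ElementTree` module; the element names (`document`, `title`, `author`, `body`, `em`) are invented for this example and are not part of any standard vocabulary.

```python
import xml.etree.ElementTree as ET

# A tiny document: each piece of text sits between an initial and an ending tag.
xml_source = """<document type="article">
  <title>Text and Multimedia Languages</title>
  <author>A. Author</author>
  <body>Markup separates <em>structure</em> from content.</body>
</document>"""

root = ET.fromstring(xml_source)

# Structure information: the child elements and an attribute of the root.
tags = [child.tag for child in root]
doc_type = root.get("type")

# Content: the text of one element, and the body text with inline tags removed.
title = root.findtext("title")
body_text = "".join(root.find("body").itertext())

print(tags, doc_type, title, body_text, sep="\n")
```

An IRS can use this kind of structure to index the title, author, and body as separate fields instead of treating the document as one undifferentiated string.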
MULTIMEDIA • Applications that handle different types of digital data originating from distinct types of media • Text, sound, images, video • Each type of digital data differs in volume, format, and processing requirements • Different formats are therefore needed for storing each type of media
MULTIMEDIA • Different formats used commonly on the Web and in digital libraries • Images • Audio • Moving Images • Textual Images • Graphics and Virtual Reality
IMAGES • XBM, BMP, PCX – direct representation of a bit-mapped (or pixel-based) image • GIF (Graphics Interchange Format) – includes compression; well suited to black-and-white images or images with a small number of colours or gray levels (up to 256) • JPEG (Joint Photographic Experts Group) – includes compression • TIFF (Tagged Image File Format) – used to exchange documents between different applications and different computer platforms • TGA (Truevision Targa image file) – associated with video game boards • Various other image formats
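Since an IRS may receive images in any of these formats, a common first step is to identify the format from the first few bytes of the file. The sketch below checks the well-known signatures of GIF, JPEG, BMP, and TIFF only; the function name and the signature table are assumptions for this example, and the other formats listed above are not covered.

```python
# (leading bytes, format name) pairs for a few common image formats.
SIGNATURES = [
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"BM", "BMP"),
    (b"II*\x00", "TIFF"),   # little-endian byte order
    (b"MM\x00*", "TIFF"),   # big-endian byte order
]


def sniff_image_format(header: bytes) -> str:
    """Guess the image format from the leading bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


# Usage with hand-written headers (not real image files).
print(sniff_image_format(b"GIF89a\x01\x00\x01\x00"))
print(sniff_image_format(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
print(sniff_image_format(b"plain text"))
```

In practice the header would come from something like `open(path, "rb").read(16)`; a file extension alone is not a reliable indicator of the format.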
AUDIO • Must be digitized before storage • AU, MIDI (standard format to interchange music between electronic instruments and computers), WAVE – for small pieces of digital audio • Audio libraries – RealAudio or CD formats • Moving images (animation or moving pictures): • MPEG (Moving Picture Experts Group) – related to JPEG • Others – AVI, FLI, QuickTime
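Digitizing audio means sampling the signal at regular intervals and storing each sample as a number. The sketch below generates a short sine tone in software (standing in for a sampled analogue signal) and stores it in WAVE format using Python's standard `wave` module. The sample rate, duration, frequency, and amplitude are arbitrary values assumed for the example.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000   # samples per second (assumed)
DURATION_S = 0.5     # length of the tone in seconds (assumed)
FREQ_HZ = 440.0      # tone frequency (assumed)
AMPLITUDE = 0.5      # fraction of full scale

# Sample the sine wave and quantize each sample to a signed 16-bit integer.
n_samples = int(SAMPLE_RATE * DURATION_S)
samples = [
    int(AMPLITUDE * 32767 * math.sin(2 * math.pi * FREQ_HZ * n / SAMPLE_RATE))
    for n in range(n_samples)
]
pcm = struct.pack(f"<{n_samples}h", *samples)

# Write a mono, 16-bit WAVE file into an in-memory buffer.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)          # bytes per sample
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(pcm)

wav_bytes = buffer.getvalue()
print(len(wav_bytes), "bytes for", DURATION_S, "seconds of audio")
```

Even this low-quality half-second clip takes about 8 KB uncompressed, which illustrates why volume is a concern for audio collections.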
TEXTUAL IMAGES • Images that contain mainly typed or typeset text • Obtained by scanning the documents • For archival purposes • Saved as images but with further compression • Textual and non-textual parts are stored and compressed separately and, when needed, can be combined and displayed together
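Scanned text compresses well because a black-and-white page consists of long runs of identical pixels. The slides do not say which compression method is used, so the sketch below uses plain run-length encoding purely as a simple illustration of the idea; the function names and the sample pixel row are assumptions for this example.

```python
def rle_encode(row: list[int]) -> list[tuple[int, int]]:
    """Collapse a row of pixels into (pixel value, run length) pairs."""
    runs: list[list[int]] = []
    for pixel in row:
        if runs and runs[-1][0] == pixel:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([pixel, 1])   # start a new run
    return [(value, length) for value, length in runs]


def rle_decode(runs: list[tuple[int, int]]) -> list[int]:
    """Expand (pixel value, run length) pairs back into a row of pixels."""
    row: list[int] = []
    for value, length in runs:
        row.extend([value] * length)
    return row


# One scan line: white margin, a short black stroke, more white.
scan_line = [0] * 10 + [1] * 3 + [0] * 7
encoded = rle_encode(scan_line)
print(encoded)
```

Here 20 pixels are reduced to three pairs, and `rle_decode(encoded)` reproduces the original row exactly, so nothing is lost.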
GRAPHICS AND VIRTUAL REALITY • 3-dimensional graphics found on Web • CGM (Computer Graphics Metafile) standard • Metafile = collection of elements • CGM standard specifies which elements are allowed to occur in which positions in a metafile • VRML (Virtual Reality Modeling Language) – file format for describing interactive 3D objects and worlds - universal interchange format for 3D graphics and multimedia - can be used for various applications
MULTIMEDIA DOCUMENTS MARKUP • HyTime = Hypermedia/Time-based Structuring Language – standard defined for multimedia documents markup • An SGML architecture that specifies the generic hypermedia structure of documents