
WMES3103 : INFORMATION RETRIEVAL


mizell

Presentation Transcript


  1. WMES3103 : INFORMATION RETRIEVAL TEXT AND MULTIMEDIA LANGUAGES AND PROPERTIES

  2. INTRODUCTION • Text - the main form of communicating data and information • Text is often supplemented with multimedia elements - to make the contents of an IRS more attractive and interactive • A website that combines text and multimedia tends to attract more visitors than one that is text-based only • IRS - text and multimedia are represented via special languages.

  3. Metadata • A newer concept in information management – metadata • Information about how the data is organized, the data domains, and the relationship between them • In short: data about data • 2 types – descriptive and semantic

  4. Descriptive Metadata – metadata that describes a document or a single unit of information • Commonly used metadata: • Authors • Date of publication • Source of publication • Length of document • Type of document

  5. Metadata • Semantic Metadata – characterizes the subject matter, which can be obtained from the contents of the document • Subject headings • Keywords • LC Code
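
The two kinds of metadata listed above can be sketched as a small record type. This is only an illustration: the class names, field names, the `matches_keyword` helper, and all sample values are assumptions made for this example, not part of the slides.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DescriptiveMetadata:
    """Facts about the document itself (not about its subject)."""
    authors: list[str]
    publication_date: date
    source: str
    length_pages: int
    document_type: str


@dataclass
class SemanticMetadata:
    """Facts about what the document is about."""
    subject_headings: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)
    lc_code: str = ""


@dataclass
class DocumentRecord:
    descriptive: DescriptiveMetadata
    semantic: SemanticMetadata


def matches_keyword(record: DocumentRecord, term: str) -> bool:
    """Case-insensitive lookup of a term among the semantic keywords."""
    return term.lower() in (k.lower() for k in record.semantic.keywords)


# All values below are invented placeholders for illustration only.
record = DocumentRecord(
    descriptive=DescriptiveMetadata(
        authors=["A. Author"],
        publication_date=date(2000, 1, 1),
        source="Example Journal",
        length_pages=12,
        document_type="article",
    ),
    semantic=SemanticMetadata(
        subject_headings=["Information retrieval"],
        keywords=["metadata", "indexing"],
        lc_code="PLACEHOLDER",  # not a real LC classification code
    ),
)
```

An IRS could filter on the descriptive fields (for example, by date) and match queries against the semantic fields, which is what `matches_keyword(record, "Metadata")` does here.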

  6. TEXT • With computers, we need to code text into binary digits • First coding schemes – EBCDIC (8 bits per symbol) and ASCII (7 bits per symbol) • Later, 8-bit extensions of ASCII (e.g. ISO-Latin) were introduced to accommodate other languages, accents and diacritical marks • Languages with very large character sets (e.g. Chinese, Japanese, Korean) – Unicode – originally 16 bits per symbol
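
The coding schemes above can be made concrete with a short sketch. It uses Python's built-in codecs; the helper name `to_bits` and the sample characters are choices made for this example. `latin-1` stands in for an 8-bit extension of ASCII, and `utf-16-be` for a 16-bit Unicode encoding.

```python
def to_bits(data: bytes) -> str:
    """Render each byte as an 8-character string of binary digits."""
    return " ".join(f"{byte:08b}" for byte in data)


# 'A' fits in 7-bit ASCII: code 65, top bit is 0.
ascii_bytes = "A".encode("ascii")

# U+00E9 (e with acute accent) is outside 7-bit ASCII but fits in one
# byte of the 8-bit Latin-1 extension.
latin1_bytes = "\u00e9".encode("latin-1")

# U+4E2D (a Chinese character) needs 16 bits.
utf16_bytes = "\u4e2d".encode("utf-16-be")

print(to_bits(ascii_bytes))
print(to_bits(latin1_bytes))
print(to_bits(utf16_bytes))

# Trying to squeeze the accented letter into 7-bit ASCII fails.
try:
    "\u00e9".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The failed ASCII encoding at the end is the practical reason the wider codes were introduced.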

  7. TEXT Formats • There is no single format for a text document • A good IRS should be able to retrieve information from any format • Initially, an IRS would convert each document to an internal format, but this had many disadvantages • Now, many new formats have been developed for document interchange

  8. TEXT • RTF – Rich Text Format for word processing • PDF – Portable Document Format for displaying and printing documents • PostScript – powerful programming language for drawing • MIME – Multipurpose Internet Mail Extensions to encode e-mail • Files can be compressed – compress (Unix), ARJ (PCs), ZIP • Binary files can be converted to ASCII text – uuencode/uudecode, BinHex
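
The last two bullets (compression, then binary-to-ASCII conversion) can be sketched with the Python standard library. This is an illustration under stated assumptions: `zlib` (DEFLATE, the method commonly used inside ZIP files) stands in for the compressors named on the slide, Base64 is shown because it is the usual MIME transfer encoding, and the sample text is invented.

```python
import base64
import binascii
import zlib

# Repetitive sample text compresses well.
text = ("information retrieval " * 20).encode("ascii")

# Step 1: compress. The result contains arbitrary byte values,
# so it is no longer safe to send through a text-only channel.
compressed = zlib.compress(text)

# Step 2a: uuencode-style conversion of one line (at most 45 bytes per line).
chunk = compressed[:45]
uu_line = binascii.b2a_uu(chunk)

# Step 2b: Base64 conversion of the whole compressed payload.
b64 = base64.b64encode(compressed)

# Reverse both steps to recover the original text.
restored = zlib.decompress(base64.b64decode(b64))

print(len(text), len(compressed), len(b64))
```

Both conversions trade size for safety: the encoded output uses only ASCII characters but is roughly a third larger than the binary input.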

  9. MARKUP LANGUAGES • Markup = extra textual syntax that can be used to describe formatting actions, structure information, text semantics, attributes, etc. • Formal markup languages are more strictly structured • Marks = tags - an initial and an ending tag surround the marked text • Standard metalanguage = SGML • New metalanguage for the Web = XML (eXtensible Markup Language) = subset of SGML • Most popular markup language used for the Web = HTML (HyperText Markup Language)
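
To show what start and end tags look like in practice, here is a minimal sketch that parses a tiny XML document with Python's standard `xml.etree.ElementTree` module. The element names (`article`, `title`, `section`, `p`) and attributes are invented for this example.

```python
import xml.etree.ElementTree as ET

# Each element is delimited by a start tag <name> and an end tag </name>;
# attributes such as lang="en" carry extra information about the element.
doc = """<article lang="en">
  <title>Text and Multimedia</title>
  <section id="s1"><p>Text is the main form of communicating information.</p></section>
</article>"""

root = ET.fromstring(doc)

title = root.find("title").text          # text between <title> and </title>
language = root.attrib["lang"]           # attribute on the root element
tags = [el.tag for el in root.iter()]    # structure, in document order
paragraph = root.find("./section/p").text

print(tags)
print(title, "|", language)
```

Because the markup describes structure rather than appearance, an IRS can index the title and the paragraph text separately.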

  10. MULTIMEDIA • Applications that handle different types of digital data originating from distinct types of media • Text, sound, images, video • The digital data differ in volume, format, and processing requirements • A different format is therefore needed to store each type of media

  11. MULTIMEDIA • Different formats used commonly on the Web and in digital libraries • Images • Audio • Moving Images • Textual Images • Graphics and Virtual Reality

  12. IMAGES • XBM, BMP, PCX – direct representation of a bit-mapped (or pixel-based) image • GIF (Graphics Interchange Format) – includes compression and is good for black-and-white images or images with a small number of colours or gray levels (up to 256) • JPEG (Joint Photographic Experts Group) – includes compression • TIFF (Tagged Image File Format) – used to exchange documents between different applications and different computer platforms • TGA (Truevision Targa image file) – associated with video game boards • Various other image formats
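
An IRS that accepts several of these formats has to tell them apart. One common approach, sketched below, is to inspect the first few bytes of the file (the "magic number"). The function name `sniff_image_format` and the signature table are assumptions for this example; only a handful of the formats above are covered.

```python
# Well-known leading byte sequences for some of the formats listed above.
SIGNATURES = [
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"BM", "BMP"),
    (b"II*\x00", "TIFF"),   # little-endian TIFF
    (b"MM\x00*", "TIFF"),   # big-endian TIFF
]


def sniff_image_format(header: bytes) -> str:
    """Guess the image format from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


# Hypothetical headers, as they would appear at the start of real files.
print(sniff_image_format(b"GIF89a\x01\x00\x01\x00"))
print(sniff_image_format(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
print(sniff_image_format(b"plain text"))
```

In practice one would read the header with something like `open(path, "rb").read(8)`; checking content is more reliable than trusting the file extension.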

  13. AUDIO • Must be digitized before storage • AU, MIDI (standard format to interchange music between electronic instruments and computers), WAVE – for small pieces of digital audio • Audio libraries – RealAudio or CD formats • Moving images (animation or video) • MPEG (Moving Picture Experts Group) – related to JPEG • Others – AVI, FLI, QuickTime
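
"Digitized before storage" means sampling the sound wave at regular intervals and storing each sample as a number. The sketch below generates a tone, quantizes it to 16-bit samples, and stores it in WAVE format using Python's standard `wave` module. The sample rate, duration, and frequency are arbitrary values chosen for this example.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000    # samples per second (assumed for this example)
DURATION_S = 0.5      # length of the tone in seconds
FREQ_HZ = 440.0       # pitch of the tone

n_samples = int(SAMPLE_RATE * DURATION_S)

# Sampling + quantization: evaluate the wave at discrete times and
# round each value to a signed 16-bit integer (half of full scale).
samples = [
    int(32767 * 0.5 * math.sin(2 * math.pi * FREQ_HZ * n / SAMPLE_RATE))
    for n in range(n_samples)
]

# Write the samples to an in-memory WAVE file (mono, 16-bit).
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(struct.pack(f"<{n_samples}h", *samples))

wav_bytes = buffer.getvalue()
print(len(wav_bytes), "bytes for", DURATION_S, "seconds of audio")
```

Even this short, low-quality clip takes about 8 KB uncompressed, which illustrates why volume is a concern for audio and video.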

  14. TEXTUAL IMAGES • Images that contain mainly typed or typeset text • Obtained by scanning the documents • Used for archival purposes • Saved as images but with further compression • Textual and non-textual parts are stored and compressed separately and, when needed, can be combined and displayed together

  15. GRAPHICS AND VIRTUAL REALITY • 3-dimensional graphics found on Web • CGM (Computer Graphics Metafile) standard • Metafile = collection of elements • CGM standard specifies which elements are allowed to occur in which positions in a metafile • VRML (Virtual Reality Modeling Language) – file format for describing interactive 3D objects and worlds - universal interchange format for 3D graphics and multimedia - can be used for various applications

  16. MULTIMEDIA DOCUMENTS MARKUP • HyTime = Hypermedia/Time-based Structuring Language – standard defined for multimedia documents markup • An SGML architecture that specifies the generic hypermedia structure of documents
