Chapter 7 Text and Multimedia Languages and Properties Mohd Shahizan Othman Jabatan Sistem Maklumat Fakulti Sains Komputer & Sistem Maklumat
Anatomy of a document... • We search for documents • What is a document?
What is a document?
• Document: a single unit of information
• A complete logical unit
  • research paper, book, manual
• Part of a larger text
  • paragraph, passage, an entry in a dictionary, …
• A physical unit
  • file, email, Web page
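Characteristics of a document
[Diagram: a document combines text, structure, and other media, and is characterized by its syntax, semantics, and presentation style; the diagram also labels the author and the creator.]
• Presentation style: how a document is displayed or printed
• Syntax: expresses structure, presentation style, or even external actions; it may be implicit, or expressed in a language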
...Anatomy of a document
• Queries are conditions on the semantics or presentation of a document, not on its raw (possibly binary) data
• Thus we need to know the document's syntax
• Example: searching in PS or PDF files
• How can the syntax be described formally?
Metadata
• Information about the organization of data: data about the data
• Descriptive vs. semantic metadata
  • Descriptive: about creation (author, date, ...)
  • Semantic: about meaning (keywords, subject codes, ...)
    • Ontologies
  • Others: who may use the document and how, e.g., adult, confidential, signature
• Standards (many)
  • Dublin Core Metadata Element Set: 15 fields, descriptive
  • Machine-Readable Cataloging (MARC) record: bibliographic
• Web: very important
  • Many projects on Web ontologies; the Semantic Web
Metadata
• Descriptive metadata
  • Metadata that describes a document or another unit of information
  • Metadata that is external to the meaning of the document
  • Commonly used descriptive metadata (e.g., in Dublin Core):
    • Authors
    • Date of publication
    • Source of publication
    • Length of document
    • Type of document
Metadata
• Semantic metadata
  • Metadata that can be found within the document's content
  • Characterizes the subject matter that can be obtained from the contents of the document
  • Examples:
    • Subject headings
    • Keywords
    • LC codes: Library of Congress subject codes
Web Metadata
• Purposes
  • cataloging (e.g., BibTeX)
  • content rating
    • protects children from reading some types of documents
  • intellectual property rights
  • digital signatures (for authentication)
  • privacy levels
  • applications to electronic commerce
  • …
• RDF (Resource Description Framework)
Resource Description Framework • description of nodes and attached attribute/value pairs • nodes: any Web resource • attributes: properties of nodes • values: text strings or other nodes (Web resources or metadata instances)
Resource Description Framework
[Diagram: an RDF statement links a resource (the subject) through a property (the predicate) to a value (the object).]
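The subject/predicate/object model can be sketched in a few lines of Python. This is only an illustration, not an RDF library: the `Statement` class and the `values_of` helper are hypothetical names introduced here, and the single statement is the one from the deck's Dublin Core example (resource http://x.html, property dc:creator, value "Shah").

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One RDF statement: a (subject, predicate, object) triple."""
    subject: str    # the resource being described (any Web resource)
    predicate: str  # the property (attribute) of that resource
    obj: str        # the value: a text string or another resource


# The statement from the deck's example: http://x.html has creator "Shah".
statements = [
    Statement("http://x.html", "dc:creator", "Shah"),
]


def values_of(resource: str, prop: str, triples: list) -> list:
    """Return every value attached to `resource` through property `prop`."""
    return [t.obj for t in triples if t.subject == resource and t.predicate == prop]


print(values_of("http://x.html", "dc:creator", statements))
```

A list of triples is the simplest possible store; a real system would index the statements by subject or predicate.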
DC in RDF
[Diagram: a resource at the center, described by the 15 Dublin Core properties.]
• dc:title, dc:creator, dc:subject, dc:description, dc:publisher, dc:contributor, dc:date, dc:type, dc:format, dc:identifier, dc:source, dc:language, dc:relation, dc:coverage, dc:rights
A DC Example in RDF
[Diagram: resource http://x.html has property dc:creator with value "Shah".]
<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax#"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://x.html">
    <dc:creator>Shah</dc:creator>
  </Description>
</RDF>
Resource Description Framework
[Diagram: resource http://web.utm.my/fsksm/staf/shahizan has property dc:title with value "Web Pensyarah" and property dc:creator with value "Shahizan".]
<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax#"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://web.utm.my/fsksm/staf/shahizan">
    <dc:title>Web Pensyarah</dc:title>
    <dc:creator>Shahizan</dc:creator>
  </Description>
</RDF>
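To show how such a description turns back into statements, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes the resource URL and the lowercase property names shown in the slide's diagram; `extract_triples` is a hypothetical helper, not part of any RDF toolkit, and it handles only this simple `Description` pattern.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/TR/WD-rdf-syntax#"   # namespace used on the slide
DC_NS = "http://purl.org/dc/elements/1.0/"

rdf_xml = f"""
<RDF xmlns="{RDF_NS}" xmlns:dc="{DC_NS}">
  <Description about="http://web.utm.my/fsksm/staf/shahizan">
    <dc:title>Web Pensyarah</dc:title>
    <dc:creator>Shahizan</dc:creator>
  </Description>
</RDF>
"""


def extract_triples(xml_text):
    """Return (subject, property, value) tuples for Dublin Core properties."""
    root = ET.fromstring(xml_text)
    triples = []
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        subject = desc.get("about")
        for prop in desc:
            # ElementTree stores qualified names as "{namespace}localname".
            namespace, _, local = prop.tag[1:].partition("}")
            if namespace == DC_NS:
                triples.append((subject, f"dc:{local}", (prop.text or "").strip()))
    return triples


for triple in extract_triples(rdf_xml):
    print(triple)
```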
Text
• With computers, text must be coded into binary digits
• First coding schemes: EBCDIC (8 bits per symbol) and ASCII (7 bits per symbol)
• ASCII was later extended to 8 bits to accommodate other languages
• Formats: binary vs. ASCII (ASCII is preferable). Examples: DOC, RTF, PDF, PS
• Compression: ZIP, ARJ
• Binary data in ASCII: uuencode
• Oriental languages: Unicode (originally 16 bits per character)
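The effect of the coding scheme on storage can be checked directly. The sketch below counts how many bytes one character needs under several codecs that ship with Python; the three sample characters and the `encoded_sizes` helper are choices made for this illustration only.

```python
# Sample characters: a basic Latin letter, an accented letter, a CJK ideograph.
samples = ["A", "\u00e9", "\u4e2d"]


def encoded_sizes(ch):
    """Return the number of bytes `ch` needs per codec (None if not encodable)."""
    sizes = {}
    for codec in ("ascii", "latin-1", "utf-8", "utf-16-be"):
        try:
            sizes[codec] = len(ch.encode(codec))
        except UnicodeEncodeError:
            sizes[codec] = None
    return sizes


for ch in samples:
    print(repr(ch), encoded_sizes(ch))

# Every ASCII code fits in 7 bits.
print(ord("A") < 2 ** 7)
```

Characters outside the 7-bit range cannot be encoded in ASCII at all, which is why 8-bit extensions and later Unicode were introduced.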
Text
• Formats
  • Basic form
    • ASCII, …
  • Document interchange
    • Rich Text Format (RTF): used by word processors
    • Portable Document Format (PDF) and PostScript: used for displaying or printing documents
    • MIME (Multipurpose Internet Mail Extensions): supports multiple character sets, multiple languages, and multiple media
Text Formats
• There is no single format for a text document
• A good information retrieval (IR) system should be able to retrieve information from any format
• Initially, IR systems converted each document to an internal format, but this had many disadvantages
• Now, many new formats have been developed for document interchange
Text
• RTF: Rich Text Format, for word processing
• PDF: Portable Document Format, for displaying and printing documents
• PostScript: a powerful programming language for drawing
• MIME: Multipurpose Internet Mail Extensions, to encode e-mail
• File compression: Compress (Unix), ARJ (PCs), ZIP
• Converting binary files to ASCII text: uuencode/uudecode, BinHex
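Compression and binary-to-ASCII conversion can both be tried with the Python standard library. The sketch below uses `zlib` (an implementation of DEFLATE, the method ZIP commonly uses) as a stand-in for the tools named above, `binascii.b2a_uu` for one uuencoded line, and Base64 as the encoding MIME typically uses; the sample text is invented for the demonstration.

```python
import base64
import binascii
import zlib

# Invented, highly repetitive sample text, so it compresses well.
text = ("information retrieval " * 50).encode("ascii")

# Compression produces arbitrary binary bytes.
compressed = zlib.compress(text)
print(len(text), "bytes before,", len(compressed), "bytes after compression")

# uuencode turns binary data into printable ASCII, at most 45 bytes per line.
chunk = compressed[:45]
uu_line = binascii.b2a_uu(chunk)
print(uu_line.decode("ascii").rstrip())

# Base64 serves the same purpose for a whole byte string.
b64 = base64.b64encode(compressed)
print(b64.decode("ascii"))
```

Both encodings are reversible: `binascii.a2b_uu` and `base64.b64decode` recover the original bytes.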
[Figure: two plots. Zipf's law: word frequency F against words sorted by frequency. Heaps' law: vocabulary size V against text size.]
• A few hundred words account for about 50% of the text.
• Words that are too frequent (stopwords) can be disregarded.
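Both laws can be observed by counting words. In their usual statement, Zipf's law says the frequency of the i-th most frequent word is roughly proportional to 1/i^θ, and Heaps' law says the vocabulary grows roughly as V = K·n^β for a text of n words. The sketch below computes the raw quantities behind both plots; the sample sentence and all function names are invented for illustration, and a text this small only shows the mechanics, not the laws themselves.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())


def rank_frequency(words):
    """Zipf view: (rank, word, frequency) tuples, most frequent first."""
    counts = Counter(words)
    return [(rank, w, f) for rank, (w, f) in enumerate(counts.most_common(), start=1)]


def vocabulary_growth(words):
    """Heaps view: vocabulary size V after each successive word of the text."""
    seen, growth = set(), []
    for w in words:
        seen.add(w)
        growth.append(len(seen))
    return growth


def coverage(words, top_n):
    """Fraction of the text taken up by the `top_n` most frequent words."""
    counts = Counter(words)
    return sum(f for _, f in counts.most_common(top_n)) / len(words)


sample = "the cat and the dog and the bird saw the cat"
words = tokenize(sample)
print(rank_frequency(words))
print(vocabulary_growth(words))
print(coverage(words, 1))
```

On a large collection, `coverage` with `top_n` in the hundreds is the quantity behind the "50% of the text" observation, and the top of the rank list is a natural stopword candidate list.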
Semantic Web
"The first step is putting data on the Web in a form that machines can naturally understand, or converting it to that form. This creates what I call a Semantic Web - a web of data that can be processed directly or indirectly by machines."
Tim Berners-Lee, Weaving the Web, Harper San Francisco, 1999
Semantic Web • Tim Berners-Lee has a two-part vision for the future of the Web. • To make the Web a more collaborative medium. • To make the Web understandable, and thus processable, by machines.
Markup languages
• Markup: extra textual syntax that can be used to describe formatting actions, structure information, text semantics, attributes, etc.
• Formal markup languages are more structured
  • Marks are called tags
  • Initial and ending tags surround the marked text
• Standard metalanguage: SGML (Standard Generalized Markup Language)
• XML (eXtensible Markup Language): a subset of SGML and the new metalanguage for the Web
• HTML is an instance of SGML
SGML
• Provides rules for defining tags
• A document consists of:
  • Definitions of tags
    • Document Type Definition (DTD)
    • Informal comments or an additional description
  • Text with tags
• Tags: <tag>text</tag>
• Mostly defines semantics, not printing format
  • The output format is defined in other languages
HTML
• 1992; version 4.0: 1997
• An instance of SGML
  • A DTD exists but is usually not used
• Also does not define (much of) the formatting. Thus:
  • Cascading Style Sheets (CSS)
    • define aspects of formatting
    • can be combined (cascaded)
    • not well supported by browsers
• Unlike generic SGML (too expensive), HTML does NOT:
  • allow new tags to be specified
  • support nesting structures
  • support validity checks
Introduction to XML
• What you should already know. Before you continue, you should have a basic understanding of the following:
  • HTML / XHTML
  • JavaScript or VBScript
• XML was designed to describe data and to focus on what data is. HTML was designed to display data and to focus on how data looks.
What is XML? • XML stands for EXtensible Markup Language • XML is a markup language much like HTML • XML was designed to describe data • XML tags are not predefined. You must define your own tags • XML uses a Document Type Definition (DTD) or an XML Schema to describe the data • XML with a DTD or XML Schema is designed to be self-descriptive • XML is a W3C Recommendation
The Main Difference Between XML and HTML
• XML was designed to carry data.
• XML is not a replacement for HTML. XML and HTML were designed with different goals:
  • XML was designed to describe data and to focus on what data is.
  • HTML was designed to display data and to focus on how data looks.
• HTML is about displaying information, while XML is about describing information.
XML Does not DO Anything
• XML was not designed to DO anything.
• It may be a little hard to understand, but XML does not DO anything. XML was created to structure, store, and send information.
XML is Free and Extensible
• XML tags are not predefined. You must "invent" your own tags.
• By contrast, HTML tags such as <p> and <h1> are predefined by the HTML standard.
XML: Just Tags?
<Contact contact_id="">
  <first_name>Mohd Shahizan</first_name>
  <last_name>Othman</last_name>
  <organization>Universiti Teknologi Malaysia</organization>
  <email>shahizan@utm.my</email>
  <phone>07-5532424</phone>
</Contact>
XML supports user-defined languages that add meaning to data.
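Because the tags name the data, a program can read a record like this without knowing anything about its layout. A minimal sketch with Python's standard `xml.etree.ElementTree` follows; `contact_to_dict` is a hypothetical helper, and the `contact_id` attribute is left empty as on the slide.

```python
import xml.etree.ElementTree as ET

contact_xml = """
<Contact contact_id="">
  <first_name>Mohd Shahizan</first_name>
  <last_name>Othman</last_name>
  <organization>Universiti Teknologi Malaysia</organization>
  <email>shahizan@utm.my</email>
  <phone>07-5532424</phone>
</Contact>
"""


def contact_to_dict(xml_text):
    """Turn a <Contact> element into a plain dictionary keyed by tag name."""
    root = ET.fromstring(xml_text)
    record = {child.tag: (child.text or "").strip() for child in root}
    record["contact_id"] = root.get("contact_id", "")
    return record


record = contact_to_dict(contact_xml)
print(record["first_name"], record["last_name"])
print(record["email"])
```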
XML is a Complement to HTML • XML is not a replacement for HTML It is important to understand that XML is not a replacement for HTML. In future Web development it is most likely that XML will be used to describe the data, while HTML will be used to format and display the same data. • XML is a cross-platform, software and hardware independent tool for transmitting information.
XML can Separate Data from HTML • With XML, your data is stored outside your HTML. When HTML is used to display data, the data is stored inside your HTML. With XML, data can be stored in separate XML files. This way you can concentrate on using HTML for data layout and display, and be sure that changes in the underlying data will not require any changes to your HTML. • XML data can also be stored inside HTML pages as "Data Islands". You can still concentrate on using HTML only for formatting and displaying the data.
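The separation can be sketched as follows: the data lives in XML, and a small function produces the HTML. This is only one possible arrangement. The `<people>` wrapper element and the `render_html` function are assumptions made for the example; the `<name>`, `<first>`, and `<last>` elements follow the deck's own HTML-versus-XML comparison.

```python
import xml.etree.ElementTree as ET
from html import escape

# Data only: no layout information appears here.
people_xml = """
<people>
  <name><first>Shahizan</first><last>Othman</last></name>
</people>
"""


def render_html(xml_text):
    """Build an HTML page from the XML data; layout decisions live only here."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for name in root.iter("name"):
        first = name.findtext("first", default="")
        last = name.findtext("last", default="")
        paragraphs.append(f"<p>{escape(first)} {escape(last)}</p>")
    return "<html><body>" + "".join(paragraphs) + "</body></html>"


print(render_html(people_xml))
```

Adding another `<name>` entry to the XML changes the output without touching the rendering code, and changing the page layout does not touch the data.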
The XML Advantage • XML files are human-readable. XML was designed as text so that, in the worst case, someone can always read it to figure out the content. Such is not the case with binary data formats. • Widespread industry support exists for XML. Numerous tools and utilities are being provided with Web browsers, databases, and operating systems, making it easier and less expensive for small and medium-sized organizations to import and export data in XML format. • Major relational databases now have the native capability to read and generate XML data. • A large family of XML support technologies is available for the interpretation and transformation of XML data for Web page display and report generation
Uses of XML
• MathML: Mathematical Markup Language
  • Not only the presentation but also the meaning of expressions
• SMIL: Synchronized Multimedia Integration Language
  • Declarative language to specify positions and timing
• RDF: Resource Description Framework
  • Metadata for XML
• Trend: HTML is evolving to model and describe the structure of data, not presentation details
Differences between HTML and XML
HTML:
<html>
  <head><title>Name</title></head>
  <body>
    <p>Shahizan Othman</p>
  </body>
</html>
XML:
<name>
  <first>Shahizan</first>
  <last>Othman</last>
</name>
Displayed HTML result: Shahizan Othman
Namespace Syntax
<?xml version="1.0"?>
<pers:person xmlns:pers="http://www.wiley.com/pers"
             xmlns:html="http://www.w3.org/1999/xhtml">
  <pers:name>
    <pers:title>En</pers:title>
    <pers:first>Shah</pers:first>
    <pers:middle>Bin</pers:middle>
    <pers:last>Othman</pers:last>
  </pers:name>
  <pers:position>Vice President of Marketing</pers:position>
  <pers:résumé>
    <html:html>
      <html:head><html:title>Resume Shah Othman</html:title></html:head>
      <html:body>
        <html:h1>Shah Othman</html:h1>
        <html:p>Selamat pagi Shah</html:p>
      </html:body>
    </html:html>
  </pers:résumé>
</pers:person>
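The point of the prefixes is that the two `title` elements mean different things: one is a person's title, the other an HTML page title. A minimal sketch with Python's standard `xml.etree.ElementTree` shows how a program tells them apart. The document below is a trimmed copy of the slide's example, and `titles` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used in the queries; the prefixes are local shorthand,
# only the namespace URIs have to match the document.
NS = {
    "pers": "http://www.wiley.com/pers",
    "html": "http://www.w3.org/1999/xhtml",
}

doc = """<?xml version="1.0"?>
<pers:person xmlns:pers="http://www.wiley.com/pers"
             xmlns:html="http://www.w3.org/1999/xhtml">
  <pers:name>
    <pers:title>En</pers:title>
    <pers:first>Shah</pers:first>
    <pers:last>Othman</pers:last>
  </pers:name>
  <pers:résumé>
    <html:html>
      <html:head><html:title>Resume Shah Othman</html:title></html:head>
    </html:html>
  </pers:résumé>
</pers:person>
"""


def titles(xml_text):
    """Return (person title, HTML page title), distinguished by namespace."""
    root = ET.fromstring(xml_text)
    person_title = root.findtext("pers:name/pers:title", namespaces=NS)
    page_title = root.findtext(".//html:title", namespaces=NS)
    return person_title, page_title


print(titles(doc))
```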
XML and the Web XML integrates with standard Web protocols such as HTTP and FTP.
Multimedia
• Text, sound, images, video
• Image formats
  • BMP
  • Compression:
    • GIF: good for images with few colors
    • JPG: lossy compression; parametric (the degree of compression can be controlled)
  • TIFF: used for exchange; can contain metadata
• Moving images
  • MPEG (Moving Picture Experts Group): encodes changes between frames
• Textual images: compression and retrieval
  • Metadata, keywords
  • OCR: produces many typos, so keyword search should be approximate
  • Treat the text as a sequence of images and convert the query similarly
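Approximate keyword search over OCR output can be sketched with edit distance (the Levenshtein distance): a word matches if it can be turned into the keyword with at most a few single-character edits. This is one common technique, chosen here for illustration rather than taken from the slides; the OCR sample, with "0" read for "o" and "1" read for "l", is invented.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def approximate_matches(keyword, ocr_text, max_errors=1):
    """Return the words of `ocr_text` within `max_errors` edits of `keyword`."""
    keyword = keyword.lower()
    return [w for w in ocr_text.lower().split()
            if edit_distance(keyword, w) <= max_errors]


# Invented OCR output containing typical character confusions.
ocr_text = "inf0rmation retrieva1 systems"

print("information" in ocr_text.split())            # exact search misses the word
print(approximate_matches("information", ocr_text))  # approximate search finds it
```

Raising `max_errors` finds more damaged words but also admits more false matches, so the threshold is a tuning choice.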