Special Topics in Computer Science • The Art of Information Retrieval • Chapter 6: Text and Multimedia Languages and Properties • Alexander Gelbukh www.Gelbukh.com
Previous chapter: Conclusions • Query operations: • Relevance feedback • Simple, understandable • Needs user attention • Term re-weighting • Local analysis for query expansion • Co-occurrences in the retrieved docs • Usually gives better results than global analysis • Computationally expensive • Global analysis • Worse results: what is good for the collection is not necessarily good for a query • Linguistic methods, dictionaries, ontologies, stemming, ...
Previous chapter: Trends and research topics • Interactive interfaces • Graphical, 2D or 3D • Refining global analysis techniques • Application of linguistic methods. Stemming. Ontologies • Local analysis for the Web (now too expensive) • Combine the three techniques (feedback, local, global)
Anatomy of a document... • We search for documents • What is a document?
Characteristics of a document • Syntax is a device that “plays” the document, producing semantics (kind of: presentation) • Like a CD drive plays a CD to produce music • Knowledge of Korean + paper with glyphs → meaning
...Anatomy of a document • Queries are conditions on the semantics/presentation, not on the (binary?) data of the document • Thus we need to know the syntax • Example: search in PS or PDF • How to describe it formally?
Metadata • Info about the organization of data • Data about the data • Descriptive vs. semantic metadata • Descriptive: about creation: author, date, ... • Semantic: about meaning: keywords, subject codes, ... • Ontologies • Others: who may use it and how. E.g.: adult, confidential, signature • Standards (many) • Dublin Core Metadata Element Set: 15 fields. Descriptive. • Machine-Readable Cataloging (MARC) record: bibliographic • Web: very important • Many projects on Web ontologies. Semantic Web.
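The 15 Dublin Core fields mentioned above can be written down as a small data structure. This is a minimal sketch: the element names are those of the Dublin Core Metadata Element Set, while the `split_fields` helper, the sample record, and its `rating` field are hypothetical and only illustrate separating standard fields from other ("who and how to use") metadata.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})


def split_fields(record: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Split a metadata record into Dublin Core fields and everything else."""
    known = {k: v for k, v in record.items() if k.lower() in DUBLIN_CORE}
    other = {k: v for k, v in record.items() if k.lower() not in DUBLIN_CORE}
    return known, other


# Hypothetical record; "rating" is not a Dublin Core element.
record = {
    "title": "Text and Multimedia Languages and Properties",
    "creator": "Alexander Gelbukh",
    "subject": "information retrieval",
    "rating": "general",
}

known, other = split_fields(record)
print(sorted(known))  # Dublin Core fields present in the record
print(sorted(other))  # non-standard fields
```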
Text • Encoding. 7-bit and 8-bit ASCII. Unicode: also covers oriental scripts • Format. Binary vs. ASCII (better). DOC, RTF, PDF, PS • Compression. ZIP, ARJ • Binary in ASCII: uuencode • To predict the behavior of tools and systems, we need to model text • Entropy: the limit of compression, degree of chaos • Statistics of the letters and words • Very skewed
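As an illustration of entropy as "the limit of compression", here is a minimal sketch that computes the zero-order (character-level) Shannon entropy of a string. The function name and the two sample strings are assumptions made for this example; real text models usually also take context into account.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Zero-order Shannon entropy of `text`, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform = "abcd" * 100      # four equally likely symbols
skewed = "a" * 397 + "bcd"  # one symbol dominates

print(char_entropy(uniform))  # 2.0 bits per character
print(char_entropy(skewed))   # far below 2.0: skewed text compresses better
```

The more skewed the symbol statistics, the lower the entropy and the fewer bits per character an ideal zero-order compressor needs.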
Zipf's law, etc. • The i-th most frequent word appears $k/i^{\theta}$ times, $\theta$ = 1.5 – 2 • Mandelbrot form: $k/(c+i)^{\theta}$ • 50% of a text is made up of a few hundred words • Most of them are stopwords: the, of, and, a, to, in... • Not indexed → smaller indices • Distribution of words by docs • p, k depend on collection and word
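The skew described by Zipf's law can be observed by ranking words by frequency. The sketch below is only an illustration: the tokenizer (lowercase runs of a-z), the function names, and the tiny sample sentence are assumptions, and a sample this small cannot confirm the law itself.

```python
import re
from collections import Counter


def rank_frequency(text: str) -> list[tuple[int, str, int]]:
    """Return (rank, word, frequency) triples, most frequent word first."""
    words = re.findall(r"[a-z]+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


def coverage(text: str, top_n: int) -> float:
    """Fraction of all word occurrences covered by the top_n most frequent words."""
    table = rank_frequency(text)
    total = sum(freq for _, _, freq in table)
    return sum(freq for _, _, freq in table[:top_n]) / total


sample = "the cat and the dog and the bird"
for rank, word, freq in rank_frequency(sample):
    print(rank, word, freq)
print(coverage(sample, 2))  # share of the text covered by the two top words
```

On a real collection, `coverage` with a few hundred top words is the quantity behind the "50% of text" bullet above.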
Heaps’ law, etc. • # of distinct words (size of vocabulary) = $K n^{\beta}$, where n is the size of the text in words • $\beta$ = 0.4 ... 0.6: roughly the square root of n • Applies to collections • And to the WWW • Average length of a word. English: • 4.8 ... 5.3 letters per word, averaged over the text • 6 ... 7 without stopwords • 8 ... 9 averaged over the vocabulary
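Heaps' law can be checked empirically by tracking how the vocabulary grows as the text is read. This is a minimal sketch: the function names are made up for the example, and the constants K = 10 and beta = 0.5 are assumed values chosen for illustration, not measurements from any collection.

```python
def vocabulary_growth(tokens: list[str]) -> list[int]:
    """Vocabulary size after reading each token of the text."""
    seen: set[str] = set()
    sizes = []
    for token in tokens:
        seen.add(token)
        sizes.append(len(seen))
    return sizes


def heaps_estimate(n: int, K: float, beta: float) -> float:
    """Predicted number of distinct words in a text of n words: K * n**beta."""
    return K * n ** beta


print(vocabulary_growth("a b a c b d".split()))
# Assumed constants: with beta = 0.5 the vocabulary grows as the square root of n.
print(heaps_estimate(1_000_000, K=10, beta=0.5))
```

Fitting K and beta to the measured growth curve of a real collection gives the values quoted on the slide.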
Similarity between strings • A distance is symmetric and satisfies the triangle inequality: $\mathrm{dist}(a,c) \le \mathrm{dist}(a,b) + \mathrm{dist}(b,c)$ • Hamming: # of different positions. Also for sets. • Soundex: phonetic similarity • Levenshtein: min # of insertions, deletions, substitutions • dist (survey, surgery) = 2 • A very good measure • Longest common subsequence: survey, surgery → surey • Various metrics to compare whole docs • E.g., consider strings as symbols, or similarity of strings, etc.
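The three measures above can be sketched with straightforward dynamic programming. The function names are chosen for this example, and these are textbook formulations rather than any particular library's implementation.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def lcs(a: str, b: str) -> str:
    """One longest common subsequence of a and b."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(levenshtein("survey", "surgery"))  # 2
print(lcs("survey", "surgery"))          # surey
print(hamming("karolin", "kathrin"))     # 3
```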
Markup languages “Our documents do not belong to us but to Bill Gates!” • Extra-textual syntax to describe formatting, structure, ... • Marks are called tags. • Initial and ending tags surround the marked text. • Standard metalanguage: SGML (Standard Generalized Markup Language) • XML (eXtensible Markup Language), a subset of SGML: new metalanguage for the Web • HTML is an instance of SGML
SGML • Provides rules for defining tags • A document consists of: • Definitions of tags • Document Type Definition, DTD • Informal comments or an additional description • Text with tags • Tags: <tag>text</tag> • Mostly defines semantics, not printing format • Printing format is defined in other languages
HTML • 1992; 4.0: 1997 • Instance of SGML • A DTD exists but is usually not used • Also does not define (much of) formatting. Thus: • Cascading Style Sheets (CSS) • define aspects of formatting • can be combined (cascaded) • not well supported by browsers • Unlike generic SGML (which is too expensive), does NOT: • allow specifying new tags • support nesting structures • support validity checks
XML (eXtensible Markup Language) • More flexible than HTML, simpler than SGML • Simplified subset of SGML • Much simpler to implement • Allows for human- and machine-readable markup • Good for development of Web docs • Allows doing things that are now done with JavaScript • Using a DTD is optional; the parser can discover tags • Extensible Stylesheet Language, XSL (like CSS for HTML) • Like macros in a word processor • Extensible Linking Language, XLL
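The point that a DTD is optional and the parser can discover the tags can be illustrated with Python's standard `xml.etree.ElementTree` module. This is a sketch only: the `lecture`, `title`, `author` and `topic` tags are made up for this example and are not defined by any standard.

```python
import xml.etree.ElementTree as ET

# A small well-formed document with user-defined tags and no DTD.
doc = """<lecture>
  <title>Text and Multimedia Languages and Properties</title>
  <author>Alexander Gelbukh</author>
  <topic>SGML</topic>
  <topic>HTML</topic>
  <topic>XML</topic>
</lecture>"""

root = ET.fromstring(doc)

# The parser discovers the tag names from the document itself.
tags = sorted({element.tag for element in root.iter()})
print(tags)

# The markup describes structure, so a query can address a part of the document.
topics = [topic.text for topic in root.findall("topic")]
print(topics)
```

Because the tags carry meaning rather than formatting, a search can be restricted to, say, the `topic` elements instead of the whole text.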
Uses of XML • MathML: Mathematical Markup Language • Not only presentation but also meaning of expressions! • SMIL: Synchronized Multimedia Integration Language • Declarative language to specify positions and timing • RDF: Resource Description Framework • Metadata for XML • Trend: HTML evolves toward modeling and describing the structure of data, not presentation details
Multimedia • Text, sound, images, video • Image formats. BMP. Compression: • GIF. Good for few colors • JPG. Lossy compression. Parametric: the degree of compression can be controlled • TIFF is used for exchange; can contain metadata • Moving images: • MPEG: Moving Picture Experts Group. Encodes changes between frames • Textual images. Compression. Retrieval: • Metadata, keywords • OCR. Many typos, so keyword search should be approximate • Treat as a sequence of images, convert the query similarly
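Approximate keyword search over OCR output can be sketched with `difflib.get_close_matches` from the Python standard library. The function name, the sample OCR words (with typical "rn" for "m" and "l" for "i" confusions) and the 0.8 cutoff are assumptions made for this example; a real system would more likely use an index that supports approximate matching.

```python
import difflib


def approximate_find(keyword: str, ocr_words: list[str], cutoff: float = 0.8) -> list[str]:
    """Return the OCR words similar enough to `keyword` to count as a hit."""
    candidates = [word.lower() for word in ocr_words]
    # n must be positive, so fall back to 1 for an empty word list.
    return difflib.get_close_matches(
        keyword.lower(), candidates, n=len(candidates) or 1, cutoff=cutoff
    )


# Hypothetical OCR output with recognition errors.
ocr_words = ["inforrnation", "retrieva1", "system", "informatlon"]
print(approximate_find("information", ocr_words))
```

An exact match for "information" would find nothing in this word list, while the approximate match recovers both misrecognized occurrences.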
Conclusions • Modeling of text helps predict the behavior of systems • Zipf's law, Heaps’ law • Describing the structure of documents formally makes it possible to process a part of their meaning automatically, e.g., for search • Languages to describe document syntax • SGML, too expensive • HTML, too simple • XML, good combination
Thank you! • The class of Oct 30 is cancelled • Till November 6