
Special Topics in Computer Science: The Art of Information Retrieval. Chapter 6: Text and Multimedia Languages and Properties. Alexander Gelbukh, www.Gelbukh.com


Presentation Transcript


  1. Special Topics in Computer Science: The Art of Information Retrieval. Chapter 6: Text and Multimedia Languages and Properties. Alexander Gelbukh, www.Gelbukh.com

  2. Previous chapter: Conclusions Query operations: • Relevance feedback • Simple, understandable • Needs user attention • Term re-weighting • Local analysis for query expansion • Co-occurrences in the retrieved docs • Usually gives better results than global analysis • Computationally expensive • Global analysis • Worse results. What is good for the collection is not necessarily good for a query • Linguistic methods, dictionaries, ontologies, stemming, ...

  3. Previous chapter: Trends and research topics • Interactive interfaces • Graphical, 2D or 3D • Refining global analysis techniques • Application of linguistic methods. Stemming. Ontologies • Local analysis for the Web (now too expensive) • Combine the three techniques (feedback, local, global)

  4. Anatomy of a document... • We search for documents • What is a document?

  5. Characteristics of a document • Syntax is a device that “plays” the document, producing semantics (kind of: presentation) • Like a CD drive plays a CD to produce music • Knowing Korean + paper w/glyphs → meaning

  6. ...Anatomy of a document • Queries are conditions on semantics/presentation, not on (binary?) data of the document • Thus need to know syntax • Example: search in PS or PDF • How to describe formally?

  7. Metadata • Info about the organization of data • Data about the data • Descriptive vs. Semantic metadata • Descriptive: about creation: author, date, ... • Semantic: about meaning: keywords, subject codes, ... • Ontologies • Others: who can use it and how. E.g.: adult, confidential, signature • Standards (many) • Dublin Core Metadata Element Set: 15 fields. Descriptive. • Machine-Readable Cataloging (MARC) record: bibliographic • Web: very important • Many projects on Web ontologies. Semantic Web.

  8. Text • Encoding. ASCII-7, 8. UNICODE: oriental • Format. Binary vs. ASCII (better). DOC, RTF, PDF, PS • Compression. ZIP, ARJ • Binary in ASCII: uuencode • To predict the behavior of tools and systems, we need to model text • Entropy: the limit of compression, degree of chaos • Statistics of the letters and words • Very skewed
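
The "entropy as the limit of compression" bullet can be made concrete with a minimal sketch. It assumes the simplest possible model, in which each character is drawn independently from the observed letter frequencies (zero-order entropy); real compressors that exploit context can do better than this figure. The function name `entropy_bits_per_symbol` is illustrative and does not come from the slides.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(text: str) -> float:
    """Zero-order Shannon entropy of the character distribution, in bits."""
    counts = Counter(text)
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over the observed symbols
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = "to be or not to be that is the question"
    h = entropy_bits_per_symbol(sample)
    print(f"{h:.3f} bits per character; 8-bit encoding uses 8")
```

Under this model, a skewed letter distribution gives a value well below 8 bits per character, which is why plain text compresses well.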

  9. Zipf law

  10. Zipf law, etc. • i-th most frequent word appears k/i^θ times, θ = 1.5 – 2 • Mandelbrot form: k/(c+i)^θ • 50% of text are few hundred words • Most of them are stopwords: the, of, and, a, to, in... • Not indexed → smaller indices • Distribution of words by docs • p, k depend on collection and word
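
A rough sketch of how the rank-frequency statistics on this slide could be checked on a text. It assumes a naive tokenizer (lowercase runs of letters and apostrophes) and writes the exponent as `theta`; the function names are illustrative, not from the slides. Setting `c=0` gives the plain Zipf form, and a nonzero `c` gives the Mandelbrot form.

```python
import re
from collections import Counter


def rank_frequencies(text: str) -> list[tuple[str, int]]:
    """Words sorted from most to least frequent, with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


def zipf_prediction(k: float, i: int, theta: float, c: float = 0.0) -> float:
    """Predicted count of the i-th most frequent word: k / (c + i)^theta."""
    return k / (c + i) ** theta


def coverage_of_top(ranked: list[tuple[str, int]], n: int) -> float:
    """Fraction of all word occurrences covered by the n most frequent words."""
    total = sum(freq for _, freq in ranked)
    return sum(freq for _, freq in ranked[:n]) / total if total else 0.0


if __name__ == "__main__":
    ranked = rank_frequencies("the cat and the dog and the bird")
    print(ranked[:2])
    print(coverage_of_top(ranked, 2))
```

On a large collection, `coverage_of_top(ranked, 300)` is the quantity behind the claim that a few hundred words make up about half of the text.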

  11. Heaps’ law

  12. Heaps’ law, etc. • # of distinct words (size of vocabulary) = K·n^β, • β = 0.4 ... 0.6 (roughly a square root) • Applies to collections • To WWW • Average length of word. English: • 4.8 ... 5.3 letters per word, average by text • 6 ... 7 without stopwords • 8 ... 9 average by vocabulary
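
A small sketch of Heaps' law, assuming the usual form in which the vocabulary size grows as K times n to the power beta, where n is the number of words read so far. The function names and the constants in the usage lines are illustrative assumptions, not values from the slides.

```python
def heaps_vocabulary(n: int, K: float, beta: float) -> float:
    """Predicted number of distinct words after n words of text: K * n^beta."""
    return K * n ** beta


def vocabulary_growth(tokens: list[str]) -> list[int]:
    """Observed vocabulary size after each token, for comparison with the law."""
    seen: set[str] = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


if __name__ == "__main__":
    print(vocabulary_growth("a b a c b d".split()))
    # With beta = 0.5 the vocabulary grows like a square root of the text size.
    print(heaps_vocabulary(10_000, K=10, beta=0.5))
```

Plotting `vocabulary_growth` against the prediction for a real collection is one way to estimate K and beta for that collection.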

  13. Similarity between strings • symmetric; triangle: dist (a,c) ≤ dist (a,b) + dist (b,c) • Hamming: # of different positions. Also for sets. • Soundex: phonetic similarity • Levenshtein: min # insertions, deletions, substitutions • dist (survey, surgery) = 2 • A very good measure • Longest common subsequence: survey, surgery → surey • Various metrics to compare whole docs • E.g., consider strings as symbols, or similarity of strings, etc.
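
The three distances on this slide can be sketched with textbook dynamic programming. This is a minimal illustration, not an optimized implementation; the function names are chosen here for readability and are not from the slides.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def lcs(a: str, b: str) -> str:
    """One longest common subsequence of a and b."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]  # LCS lengths of prefixes
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    out = []
    i, j = m, n
    while i and j:  # walk back through the table to recover the characters
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(levenshtein("survey", "surgery"))
    print(lcs("survey", "surgery"))
```

The usage lines reproduce the slide's own example pair, survey and surgery.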

  14. Markup languages “Our documents do not belong to us but to Bill Gates!” • Extra textual syntax to describe formatting, structure, ... • Marks are called tags. • Initial and ending tags surround the marked text. • Standard metalanguage: SGML (Standard Generalized Markup Language) • XML (eXtensible), its subset: new metalanguage for Web • HTML is an instance of SGML

  15. SGML • Provides rules for defining tags • A document consists of: • Definitions of tags • Document Type Declaration, DTD • Informal comments or an additional description • Text with tags • Tags: <tag>text</tag> • Mostly defines semantics, not printing format • Defined in other languages

  16. HTML • 1992; 4.0: 1997 • Instance of SGML • A DTD exists but is usually not used • Also does not define (much of) formatting. Thus: • Cascading Style Sheets (CSS) • define aspects of formatting • can be combined (cascaded) • not well supported by browsers • Does NOT (unlike generic SGML, which is too expensive) • allow specifying new tags • support nesting structures • support validity checks

  17. XML (eXtensible ...) • More flexible than HTML, simpler than SGML • Simplified subset of SGML • Much simpler in implementation • Allows for human- and machine-readable markup • Good for development of Web docs • Allows doing things that are now done with JavaScript • Using DTD is optional, parser can discover tags • Extensible Stylesheet Language (like CSS in HTML) • Like macros in a word processor • Extensible Linking Language
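
The point that a DTD is optional and the parser can discover tags can be illustrated with Python's standard `xml.etree.ElementTree` module, which parses well-formed XML without any DTD. The sample document and the helper name `discover_tags` are made up for this sketch.

```python
import xml.etree.ElementTree as ET


def discover_tags(xml_text: str) -> list[str]:
    """Parse well-formed XML with no DTD and list the tag names it uses."""
    root = ET.fromstring(xml_text)
    return sorted({element.tag for element in root.iter()})


if __name__ == "__main__":
    # Hypothetical document with author-defined tags, no DTD supplied.
    doc = "<book><title>IR</title><author>A</author><author>B</author></book>"
    print(discover_tags(doc))
```

The parser only checks well-formedness here; validating against a DTD or schema is a separate step that this sketch does not perform.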

  18. Uses of XML • MathML: Mathematical Markup Language • Not only presentation but also meaning of expressions! • SMIL: Synchronized Multimedia Integration Language • Declarative language to specify positions and timing • Resource Description Framework (RDF) • Metadata for XML • Trend: HTML evolves to model and describe the structure of data, not presentation details

  19. Multimedia • Text, sound, images, video • Image formats. BMP. Compression: • GIF. Good for few colors • JPG. Lossy compression. Parametric: can be controlled • TIFF is used for exchange; can contain metadata • Moving images: • MPEG: Moving Picture Experts Group. Encodes changes • Textual images. Compression. Retrieval: • Metadata, keywords • OCR. Many typos; keyword search should be approximate • Treat as a sequence of images, convert query similarly
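
One possible way to make keyword search tolerant of OCR typos, sketched with the standard library's `difflib.get_close_matches`. This is an illustration of approximate matching in general, not the method the slides have in mind; the token list, the helper name `approximate_hits` and the 0.8 cutoff are assumptions.

```python
import difflib


def approximate_hits(query: str, ocr_tokens: list[str], cutoff: float = 0.8) -> list[str]:
    """Return OCR tokens whose similarity ratio to the query is at least cutoff."""
    # n must be positive, so fall back to 1 when the token list is empty
    limit = len(ocr_tokens) or 1
    return difflib.get_close_matches(query, ocr_tokens, n=limit, cutoff=cutoff)


if __name__ == "__main__":
    # Hypothetical OCR output in which the letters "l" and "i" were read as "1".
    tokens = ["retrieva1", "text", "informat1on"]
    print(approximate_hits("retrieval", tokens))
```

A lower cutoff catches more damaged words at the cost of more false matches, so the value would need tuning on real OCR output.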

  20. Taxonomy of Web languages

  21. Conclusions • Modeling of text helps predict behavior of systems • Zipf law, Heaps’ law • Formally describing the structure of documents makes it possible to process part of their meaning automatically, e.g., for search • Languages to describe document syntax • SGML, too expensive • HTML, too simple • XML, good combination

  22. The class of Oct 30 is cancelled Thank you! Till November 6
