3. Text and document databases

• Normal databases: formatted records; document databases: free-form or semi-structured data (e.g. XML).
• Application areas:
  - Office automation, document archives
  - Digital libraries
  - Electronic dictionaries / encyclopedias
  - Electronic newspapers
  - Source program libraries
  - Automated law and patent offices
• What is a 'document'? E.g. a book, chapter, paragraph, article, letter, web page, source program, etc.
• General problem setting: searching documents by contents; often also called associative search.
• Usually based on keywords or terms occurring in documents.
• Search terms may be combined with Boolean connectives (AND, OR, NOT).
MMDB-3 J. Teuhola 2012

Non-indexed methods for string matching
• Sequential full-text scanning; slow but has some advantages:
  - No extra disk space required
  - Updates are fast (no index maintenance)
  - Partial-match retrieval is rather simple (using wildcard characters)
  - Approximate matching is also possible (using a threshold for edit distance)
• Popular efficient algorithm: Boyer-Moore technique
  - Peculiar feature: searching is faster for longer search strings.
  - Based on preprocessing of the search string
  - Performance is sublinear in practice
  - Disk accesses cannot be reduced, except for extremely long search strings.
• Several other string matching algorithms exist (skipped).
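The idea of preprocessing the search string so that the scan can skip ahead can be sketched as follows. This is Horspool's simplification of Boyer-Moore (only a bad-character shift table), not the full Boyer-Moore technique; the function name `horspool_search` is an illustrative assumption.

```python
def horspool_search(text: str, pattern: str) -> list[int]:
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Preprocessing: how far the window may shift when a given character
    # is aligned with the last pattern position. Characters that do not
    # occur in the pattern allow a full shift of m positions.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return hits


positions_to = horspool_search("TO BE OR NOT TO BE", "TO")  # [0, 13]
positions_be = horspool_search("TO BE OR NOT TO BE", "BE")  # [3, 16]
```

The longer the pattern, the larger the typical shift, which is why searching tends to get faster for longer search strings.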

Inverted indexing
• Traditional way of improving search speed.
• What does 'inverted' mean? A document is a list of words, but the index gives for each word the list of documents where the word appears.
Example documents:
D1: "TO BE OR NOT TO BE"
D2: "TO BE IS TO DO"
D3: "DO BE DO BE DO"
Inverted index:
BE → {D1, D2, D3}
DO → {D2, D3}
IS → {D2}
NOT → {D1}
OR → {D1}
TO → {D1, D2}
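A minimal sketch of the inversion for the three example documents, assuming whitespace tokenization and no case folding, stemming, or stopword removal; `build_inverted_index` is an illustrative name.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each word to the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    # Terms in alphabetic order, document lists sorted
    return {term: sorted(ids) for term, ids in sorted(index.items())}


docs = {
    "D1": "TO BE OR NOT TO BE",
    "D2": "TO BE IS TO DO",
    "D3": "DO BE DO BE DO",
}
index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, doc_ids)
```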

Inverted indexing (cont.)
• The set of words is called a lexicon. Some principles for it:
  - Case folding: convert uppercase letters to lowercase.
  - Stemming: remove suffixes; index only the root forms of terms.
  - Do not include stopwords, like "the", "is", "as", "that", etc., which occur very often but do not bear semantic relevance.
• The pointers to term occurrences may appear in different granularities:
  - A coarse-grained index identifies document groups where the term appears.
  - A moderate-grained index identifies the relevant documents.
  - A fine-grained index contains sentence, word, or even byte numbers for term occurrences.

Inverted indexing (cont.)
• Coarse-grained index:
  - Small index size
  - Small maintenance penalty
  - A lot of plain text scanning
  - False drops for multi-term queries (terms do not co-occur).
• Fine-grained index:
  - Large index size
  - High maintenance penalty
  - Supports proximity queries (terms occurring together)

Inverted indexing (cont.)
Ways to save storage space:
• Front compression: index in alphabetic order; the prefix common with the previous term is expressed compactly.
• Tail (suffix) compression: store terms to the point where they can be uniquely distinguished from other terms.
• In a fine-grained index: instead of full pointers, store intervals of successive occurrences of a term.
Compound queries:
• AND: retrieve pointer lists and compute their intersection.
• OR: retrieve pointer lists and form their union.
• NOT: this is usually combined with AND, so that we can apply set difference to the pointer lists of terms.
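The compound queries can be sketched on the example index, assuming pointer lists are kept as sorted lists of document numbers (1 = D1, 2 = D2, 3 = D3). The function names are illustrative.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """AND: merge two sorted pointer lists, keeping the common entries."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a: list[int], b: list[int]) -> list[int]:
    """OR: all documents appearing in either list."""
    return sorted(set(a) | set(b))


def difference(a: list[int], b: list[int]) -> list[int]:
    """AND NOT: documents in a that are not in b."""
    excluded = set(b)
    return [x for x in a if x not in excluded]


index = {"BE": [1, 2, 3], "DO": [2, 3], "IS": [2],
         "NOT": [1], "OR": [1], "TO": [1, 2]}

be_and_do = intersect(index["BE"], index["DO"])       # [2, 3]
to_or_do = union(index["TO"], index["DO"])            # [1, 2, 3]
be_and_not_to = difference(index["BE"], index["TO"])  # [3]
```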

Data structures for inverted indexes
(a) B+-tree, with index terms as keys and pointer lists as leaves.
(b) Hash organization, preferably a dynamic version, e.g.
  - Linear hashing
  - Extendible hashing
(c) Trie structure: each node represents one character, and the term is found by following a path from the root to a leaf. Problem: must be mapped to external storage.
In each case, the variable-length pointer lists can be stored either locally in the structure, or (preferably) detached.

Algorithm for building an inverted index
1. For each document, gather the index terms, combined with pointers to the actual locations. The result is a big sequential file (called S).
2. Sort S by using e.g. external mergesort.
3. Combine the successive entries representing the same index term.
4. Build the index (B+-tree, hash table, ...) on the index terms and let the leaf entries refer to the detached pointer lists, stored as variable-length records.
Assessment of inverted indexes:
• A very effective access method for retrieval.
• A very popular technique in practice for static document sets (in spite of the storage penalty).
• Presumably the main retrieval tool in web search engines.
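Steps 1 to 3 can be sketched in memory as below. The assumptions are loud: an in-memory `sorted` call stands in for external mergesort, pointers are (document number, word position) pairs, and a plain dictionary stands in for the B+-tree or hash table of step 4.

```python
from itertools import groupby


def build_index(docs: dict[int, str]) -> dict[str, list[tuple[int, int]]]:
    # Step 1: gather (term, document, position) entries into the "file" S
    s = [(word, doc_id, pos)
         for doc_id, text in docs.items()
         for pos, word in enumerate(text.split())]
    # Step 2: sort S (a real system would use external mergesort)
    s.sort()
    # Step 3: combine successive entries of the same term into one list
    return {term: [(doc_id, pos) for _, doc_id, pos in group]
            for term, group in groupby(s, key=lambda entry: entry[0])}


docs = {1: "TO BE OR NOT TO BE", 2: "TO BE IS TO DO", 3: "DO BE DO BE DO"}
index = build_index(docs)
print(index["DO"])  # [(2, 4), (3, 0), (3, 2), (3, 4)]
```

Because the pointers carry word positions, this is a fine-grained index and could support proximity queries.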

Bitmap indexing
• Another representation for the inverted index
• Suitable for coarse and moderate-grained indexing
• A bitmap is a matrix with a row for each index term, and a column for each document. Element <i, j> is 1 if term i occurs in document j, otherwise 0.
Example documents:
d1: "TO BE OR NOT TO BE"
d2: "TO BE IS TO DO"
d3: "DO BE DO BE DO"
Bitmap:
       d1  d2  d3
BE      1   1   1
DO      0   1   1
IS      0   1   0
NOT     1   0   0
OR      1   0   0
TO      1   1   0
[Note: Normally these index terms would be stopwords.]

Bitmap indexing (cont.)
• Especially efficient for Boolean queries: AND, OR, NOT can be implemented directly in hardware (e.g. 64 bits in parallel).
• Problem: high storage consumption (#terms × #documents); the matrix is usually sparse.
• Possible combined structure: use an inverted index for the less frequent terms, and a bitmap for the more frequent terms.
• One option: compression. E.g. run-length coding: replace sequences of zeroes by their count (which gets close to the normal inverted index). Also the encoding of integers has to be decided.
Example:
Bitmap row = 001010000010001100000100...
Run-length code = <2, 1, 5, 3, 0, 5, ...>
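A small sketch of both ideas, assuming Python integers as bitmap rows (leftmost bit = d1) and bit strings for the run-length example. The names `doc_ids` and `run_lengths` are illustrative.

```python
# Bitmap rows of the example as 3-bit integers, leftmost bit = d1
bitmap = {"BE": 0b111, "DO": 0b011, "IS": 0b010,
          "NOT": 0b100, "OR": 0b100, "TO": 0b110}


def doc_ids(row: int, n_docs: int = 3) -> list[int]:
    """Translate a bitmap row back into document numbers 1..n_docs."""
    return [d for d in range(1, n_docs + 1) if row >> (n_docs - d) & 1]


be_and_do = doc_ids(bitmap["BE"] & bitmap["DO"])       # [2, 3]
be_not_to = doc_ids(bitmap["BE"] & ~bitmap["TO"])      # [3]


def run_lengths(bits: str) -> list[int]:
    """Replace each run of zeroes preceding a 1-bit by its length."""
    runs, zeros = [], 0
    for bit in bits:
        if bit == "0":
            zeros += 1
        else:
            runs.append(zeros)
            zeros = 0
    return runs  # trailing zeroes are not emitted in this sketch


code = run_lengths("001010000010001100000100")  # [2, 1, 5, 3, 0, 5]
```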

Hierarchical compression of bit strings
• Divide the string into equal-sized blocks.
• Apply disjunction (OR) to the bits within each block, creating a higher-level bit: a block of zeroes generates a 0-bit, others a 1-bit.
• The process is repeated on higher levels, recursively.
• Advantage: single bits are easily accessible, by studying one path in the tree.
• Compression: zero blocks need not be stored at all. Most of the leaf blocks are usually zero blocks (sparse bitmap).
Example (block size 4):
Top level:    1100
Middle level: 0101 1010 0000 0000
Leaf level:   0000 0010 0000 0011 1000 0000 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000
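The construction can be sketched on the 64-bit example string. For readability this sketch keeps every level as a full bit string and only counts the nonzero blocks that a compressed representation would store; the function names are illustrative assumptions.

```python
def build_levels(bits: str, block: int = 4) -> list[str]:
    """Return the levels from top to leaf; each upper bit ORs one block."""
    levels = [bits]
    while len(levels[-1]) > block:
        cur = levels[-1]
        levels.append("".join(
            "1" if "1" in cur[i:i + block] else "0"
            for i in range(0, len(cur), block)))
    return levels[::-1]


def get_bit(levels: list[str], pos: int, block: int = 4) -> int:
    """Read leaf bit `pos` by following one path from the top level."""
    for depth, level in enumerate(levels[:-1]):
        span = block ** (len(levels) - 1 - depth)  # leaf bits under one bit
        if level[pos // span] == "0":
            return 0  # the whole subtree is a zero block
    return int(levels[-1][pos])


def stored_bits(levels: list[str], block: int = 4) -> int:
    """Bits needed when zero blocks are omitted."""
    return sum(block
               for level in levels
               for i in range(0, len(level), block)
               if "1" in level[i:i + block])


leaf = "0000" "0010" "0000" "0011" "1000" "0000" "0100" "0000" + "0000" * 8
levels = build_levels(leaf)
print(levels[0], levels[1])   # 1100 0101101000000000
print(stored_bits(levels))    # 28 (instead of 64 leaf bits)
```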

Signature indexing
• A probabilistic technique
• Can be generalized to any objects characterized by a variable-size set of index terms or descriptors (also called features).
• A signature is a bit array, generated by hashing the index terms to indexes of the array. It is usually at least some hundreds of bits long.
• Signatures are collected into a separate signature file, which is smaller than the whole space required by the documents.
• The signature file acts as a filtering mechanism, to reduce the amount of actual data to be searched.
• The structure enables partial-match retrieval (a subset of terms matches).
• Queries can also be considered a kind of document (a collection of keywords), and be converted into signature form. The query signature Q is compared against document signatures Di. If the 1-bits of Q are included in the 1-bits of Di, then Di is a candidate result.
• Signatures are approximate descriptions of documents. False drops must be eliminated by checking the actual match of all candidates.

Signature indexing (cont.)
Advantages of signature files:
• Low storage consumption, compared to (fine-grained) inverted indexes. A typical value is 10-20% of the primary database size.
• More convenient than multidimensional indexes (to be studied later)
Signature generation methods:
• Word signature method: each index term is hashed into a sparse bit pattern, and the patterns are concatenated to form the document signature. This method usually results in a higher false drop probability, but preserves the sequencing information of terms in documents.
• Superimposed coding: each index term produces a bit pattern of full signature size. The patterns are OR'ed to form the document signature. This is the default method here.
How to minimize false drops?
• The proportions of 0's and 1's should be equal and uniformly distributed. (Weight = number of 1-bits in the signature)

Signature indexing: Example
Document index terms:
D1: database, object, programming, schema
D2: algorithm, computer, programming
D3: algorithm, data structure, programming
Hashing:
hash("algorithm") = 3
hash("computer") = 1
hash("database") = 7
hash("data structure") = 5
hash("object") = 6
hash("programming") = 4
hash("schema") = 1
Signatures: D1 = 10010110, D2 = 10110000, D3 = 00111000
Query: Documents about "algorithm" and "programming"?
Query signature: Q = 00110000, matches with D2 and D3.
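The example can be reproduced with superimposed coding, assuming 8-bit signatures whose bit positions are numbered 1 to 8 from the left and using the hash values of the slide as a lookup table (a real system would compute them).

```python
HASH = {"algorithm": 3, "computer": 1, "database": 7, "data structure": 5,
        "object": 6, "programming": 4, "schema": 1}


def signature(terms: list[str], width: int = 8) -> str:
    """OR together the one-bit patterns of all terms."""
    bits = ["0"] * width
    for term in terms:
        bits[HASH[term] - 1] = "1"
    return "".join(bits)


def is_candidate(query_sig: str, doc_sig: str) -> bool:
    """True if every 1-bit of the query is also set in the document."""
    return all(q == "0" or d == "1" for q, d in zip(query_sig, doc_sig))


docs = {
    "D1": ["database", "object", "programming", "schema"],
    "D2": ["algorithm", "computer", "programming"],
    "D3": ["algorithm", "data structure", "programming"],
}
sigs = {doc: signature(terms) for doc, terms in docs.items()}

q = signature(["algorithm", "programming"])                    # 00110000
candidates = [d for d in docs if is_candidate(q, sigs[d])]     # D2, D3

# False drop: "computer" and "schema" both hash to bit 1, so D1 becomes
# a candidate for "computer" although it does not contain the term.
q2 = signature(["computer"])
false_drops = [d for d in docs
               if is_candidate(q2, sigs[d]) and "computer" not in docs[d]]
```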

Models for information retrieval from documents
• Boolean model
  - Queries consist of search terms, connected by Boolean operators (AND, OR, NOT).
  - All documents containing a match are retrieved; no ranking.
• Vector space model
  - Search terms are given weights in documents.
  - Documents are ranked based on the distance between query and document vectors.
• Probabilistic model
  - Estimation of the probability of relevance between query and document, based on the relevance probabilities of the terms.
  - Popular version of this approach: BM25 (Best Match 25).

Vector-based document retrieval
Principles:
• Retrieval with less precise queries; not always exact-match.
• Ranking the documents according to their 'distance' from the query.
• Semantic correlations of terms should be taken into account.
Concepts:
• Synonymy: different terms mean the same thing.
• Polysemy: a single term has multiple meanings.
• The weight of a term indicates its importance in a document. The weight could be the number of term occurrences in the document.
• Measuring the goodness of retrieval:
  - Precision: proportion of retrieved relevant documents, relative to the total number of retrieved documents.
  - Recall: proportion of retrieved relevant documents, relative to the total number of relevant documents.
• The user decides which documents are relevant and which are not.

Precision and recall
[Figure: within the set of all documents, A = retrieved documents, B = relevant documents, X = hits (the intersection of A and B).]
Precision = X / A
Recall = X / B
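As a small worked instance of the two ratios, the following sketch uses made-up document identifiers (they are not from the slides); A is the retrieved set, B the relevant set, and X their intersection.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall) = (|X| / |A|, |X| / |B|)."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents retrieved, 5 actually relevant
retrieved = {"D1", "D2", "D3", "D4"}
relevant = {"D2", "D4", "D5", "D6", "D7"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.5 0.4
```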

Matrix representation of the term & document sets
• Matrix D: one row per term, one column (Di) per document
• Generalization of bitmap
• Element D_{t,i} denotes the weight of term t in document i. The weight could be e.g. the number of term occurrences (more advanced scheme later).
• Column vectors characterize documents.
Example documents:
D1: "TO BE OR NOT TO BE"
D2: "TO BE IS TO DO"
D3: "DO BE DO BE DO"
Matrix D:
       D1  D2  D3
BE      2   1   2
DO      0   1   3
IS      0   1   0
NOT     1   0   0
OR      1   0   0
TO      2   2   0

Comparing queries and documents
• Queries can be regarded as documents as well, with their own characterizing vector, built from query terms (but usually without weights).
• Task: find k documents whose vectors are closest to the query vector.
• Problem: how to measure the distance (or similarity) between documents (or usually between query and document)?
• First attempt: $\mathrm{Similarity}(Q, D_i) = Q \cdot D_i = \sum_{t \in \mathrm{Terms}} Q_t D_{t,i}$, i.e. the inner product of query and document weight vectors.
• Example: Query = {TO, DO} → Q = (0, 1, 0, 0, 0, 1)
  Q · D1 = (0, 1, 0, 0, 0, 1) · (2, 0, 0, 1, 1, 2) = 2
  Q · D2 = (0, 1, 0, 0, 0, 1) · (1, 1, 1, 0, 0, 2) = 3 (best similarity)
  Q · D3 = (0, 1, 0, 0, 0, 1) · (2, 3, 0, 0, 0, 0) = 3 (best similarity)
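The inner-product scores of the example can be checked with a few lines, assuming the term order BE, DO, IS, NOT, OR, TO for all vectors.

```python
TERMS = ["BE", "DO", "IS", "NOT", "OR", "TO"]
D = {
    "D1": [2, 0, 0, 1, 1, 2],
    "D2": [1, 1, 1, 0, 0, 2],
    "D3": [2, 3, 0, 0, 0, 0],
}
# Unweighted query vector for {TO, DO}
Q = [1 if term in {"TO", "DO"} else 0 for term in TERMS]


def inner_product(q: list[int], d: list[int]) -> int:
    return sum(x * y for x, y in zip(q, d))


scores = {doc: inner_product(Q, vec) for doc, vec in D.items()}
print(scores)  # {'D1': 2, 'D2': 3, 'D3': 3}
```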

Giving weights to terms
Problems:
• If frequencies $f_{t,i}$ (term instances per document) are used as weights, general terms are overweighted.
• Long documents are favored over short ones, because they contain more terms.
Applying Zipf's law:
• The weight of an index term t should be inversely proportional to its frequency among documents, i.e. $w_t = 1/n_t$, where $n_t$ = number of documents containing term t.
• An attenuated (weakened) form for this weight: $w_t = \log(1 + N/n_t)$, where N = number of documents.
• This can be combined with the term frequency, giving $D_{t,i} = f_{t,i} \cdot \log(1 + N/n_t)$, which is called the TF*IDF rule ("term frequency times inverse document frequency").
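The TF*IDF rule can be applied to the example term counts as follows. The base of the logarithm is not stated on the slide; the natural logarithm is assumed here.

```python
import math

TERMS = ["BE", "DO", "IS", "NOT", "OR", "TO"]
COUNTS = {  # f_{t,i}: term occurrences per document
    "D1": [2, 0, 0, 1, 1, 2],
    "D2": [1, 1, 1, 0, 0, 2],
    "D3": [2, 3, 0, 0, 0, 0],
}
N = len(COUNTS)

# n_t: number of documents containing term t
n_t = [sum(1 for vec in COUNTS.values() if vec[k] > 0)
       for k in range(len(TERMS))]

# w_t = log(1 + N / n_t), natural logarithm assumed
idf = [math.log(1 + N / n) for n in n_t]

# D_{t,i} = f_{t,i} * w_t
tfidf = {doc: [f * w for f, w in zip(vec, idf)]
         for doc, vec in COUNTS.items()}

for term, n, w in zip(TERMS, n_t, idf):
    print(f"{term:4} n_t={n} w_t={w:.3f}")
```

BE occurs in every document and therefore gets the lowest weight, while IS, NOT and OR (one document each) get the highest.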

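Measuring distance / similarity
Geometric distances in vector space:
• Euclidean distance (similarity = inverse of distance): $d(Q, D_i) = \sqrt{\sum_{t} (Q_t - D_{t,i})^2}$. This measure discriminates against long documents: $D_{t,i}$ is large, $Q_t$ small.
• Cosine rule: the angle (cosine) between query and document vectors in space is a good measure of their distance (similarity). From vector algebra: $X \cdot Y = |X|\,|Y| \cos\theta$. The obtained similarity measure: $\mathrm{Similarity}(Q, D_i) = \cos\theta = \dfrac{Q \cdot D_i}{|Q|\,|D_i|} = \dfrac{\sum_t Q_t D_{t,i}}{\sqrt{\sum_t Q_t^2}\,\sqrt{\sum_t D_{t,i}^2}}$
• Combined with TF*IDF: the same measure with the weights $D_{t,i} = f_{t,i} \cdot \log(1 + N/n_t)$.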
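A sketch of the cosine measure on the running example, assuming raw term counts as document weights and the unweighted query {TO, DO}. It shows how the length normalization changes the ranking compared with the plain inner product.

```python
import math

D = {  # term order: BE, DO, IS, NOT, OR, TO
    "D1": [2, 0, 0, 1, 1, 2],
    "D2": [1, 1, 1, 0, 0, 2],
    "D3": [2, 3, 0, 0, 0, 0],
}
Q = [0, 1, 0, 0, 0, 1]  # query {TO, DO}


def cosine(q: list[float], d: list[float]) -> float:
    dot = sum(x * y for x, y in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(y * y for y in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


scores = {doc: cosine(Q, vec) for doc, vec in D.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['D2', 'D3', 'D1']
```

With the inner product D2 and D3 tie at 3; the cosine measure ranks D2 first (3/√14 ≈ 0.80) ahead of D3 (3/√26 ≈ 0.59).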

Search engine architecture
• Acquisition of web documents: by crawling using the links, and monitoring RSS feeds.
• The documents are stored, preprocessed to text, and indexed.
• Preprocessing tasks: parsing, stopword removal, stemming, link extraction, classification, etc.
• Indexing: collecting statistics, giving weights to terms, and building the index; inverted indexes are the common index type.
• Query processing: preprocessing, retrieval, ranking, output, relevance feedback.
• Main goals: relevance of query results, and efficient processing.
• Problem of scale: the numbers of documents and queries are huge. Distribution, parallelism, replication, compression, etc. are needed.

Ranking in Web search engines
• Search steps:
  - Normal search by keywords, resulting in 'hits'
  - Ranking of the hit documents
• Main difference between web pages and traditional sets of documents: (hyper)links. Links are the most important factor in ranking documents.
• Other ranking factors:
  - Content relevance measure
  - Number of visits
  - Estimated (formal) quality of the content
  - Page loading time
  - Financial promotion

HITS algorithm for ranking
• Retrieved pages are given two scores:
  - Authority score ($a_i$) represents how respected the page is, knowing the incoming links and the scores of the referring pages.
  - Hub score ($h_i$) measures the goodness of the page, in view of the scores of the pages to which it refers.
• The scores are inter-related; one cannot be decided before the other!
• Solution: start with initial scores $a_i^{(0)}$ and $h_i^{(0)}$, and iterate the two update rules (authority scores from the hub scores of the referring pages, hub scores from the authority scores of the referred pages) for k = 1, 2, ... until the scores stabilize.
• Normalization of score vectors must be done at each iteration.
• The process actually computes the dominant eigenvectors of $M^T M$ and $M M^T$, where M is the adjacency matrix of the page-link graph.
• Problem with the HITS algorithm: scores are computed during query processing, not precomputed.
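The iteration can be sketched as below. The update formulas did not survive in the slide text, so this uses the commonly published form of HITS (authority = sum of hub scores of referring pages, hub = sum of authority scores of referred pages, Euclidean normalization after each step); the link graph is a made-up example.

```python
import math


def normalize(scores: dict[str, float]) -> dict[str, float]:
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links: dict[str, list[str]], iterations: int = 50):
    """links maps each page to the pages it refers to."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking to p
        auth = normalize({p: sum(hub[q] for q in pages
                                 if p in links.get(q, ()))
                          for p in pages})
        # Hub: sum of authority scores of the pages p links to
        hub = normalize({p: sum(auth[q] for q in links.get(p, ()))
                         for p in pages})
    return auth, hub


# Hypothetical graph: A and B both point to C, and C points to D
links = {"A": ["C"], "B": ["C"], "C": ["D"]}
auth, hub = hits(links)
best_authority = max(auth, key=auth.get)  # "C"
```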

PageRank algorithm
• The famous basis of Google's ranking method, still essential!
• The rank (importance) of a web page is determined by the importance of the pages referring to it.
• Network flow idea: if a page has importance r and k outgoing links, importance 'flows' equally to the referred pages, namely r/k for each.
• Iterative computation: $r_{k+1}(P) = \sum_{L} \dfrac{r_k(L)}{f(L)}$, where
  - L goes through the pages linking to P.
  - f(L) denotes the fanout (number of out-pointers) of L.
• In matrix algebra: convergence to the left eigenvector of a transition matrix with entry $1/f(P_i)$ on row i at the columns of the pages that $P_i$ links to.
• Difference from HITS: PageRank values are independent of queries.
• Problem with both: semantic correlations are not well considered.
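The flow iteration can be sketched as follows, in the plain form described above, i.e. without the damping factor that the published PageRank adds to handle pages without out-links and disconnected parts. The three-page graph is a made-up example chosen so that the plain iteration converges.

```python
def pagerank(links: dict[str, list[str]], iterations: int = 100):
    """links maps every page to the (non-empty) list of pages it links to."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for src, outs in links.items():
            share = rank[src] / len(outs)  # r / k flows along each link
            for dst in outs:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
print({p: round(r, 3) for p, r in rank.items()})
# {'A': 0.4, 'B': 0.2, 'C': 0.4}
```

B receives only half of A's importance, while C receives the other half plus all of B's, which gives the fixed point 0.4 / 0.2 / 0.4.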

Latent semantic indexing (LSI)
Problem with the plain term-document approach:
• Terms have synonyms, and documents often resemble each other.
• The term-document matrix is usually large and sparse (mostly 0).
Latent semantic indexing (LSI):
• Reduction of the search space (and the matrix)
• Representation of terms and documents in 'semantic space', by deriving a set of uncorrelated factors ('concepts')
Steps in LSI:
1. Create the weighted term-document matrix D.
2. Compute a singular value decomposition (A, S, B) of D by splitting D into three matrices A, S, and B.
3. Reduce the size of the matrices by eliminating insignificant rows/columns.
4. Store the matrices, using any of the available indexing techniques.

LSI: Singular value decomposition (SVD)
• Given any matrix D of size m×n, it is possible to find matrices A, S, and B such that
  - $D = A \cdot S \cdot B^T$
  - A is an orthogonal m×m matrix, i.e. $A^T A = I$
  - B is an orthogonal n×n matrix, i.e. $B^T B = I$
  - S is a diagonal m×n matrix (the matrix of singular values), where the nonzero elements are on the diagonal from top-left in non-increasing order
Dimensions: D (m×n) = A (m×m) · S (m×n) · $B^T$ (n×n)

LSI: Singular value decomposition (SVD)
Idea:
• Let r be the number of nonzero singular values in S (the rank of D). Reduce r to a smaller value k, such that the least significant (bottom-right) elements of S are discarded, as well as the corresponding columns in A and rows in $B^T$.
• As a result we need to store only the reduced matrices $A_k$ (m×k), $S_k$ (k×k), and $B_k^T$ (k×n).
$D \approx A_k \cdot S_k \cdot B_k^T$

LSI usage
• Query processing: compute the transformed query vector $Q_k = Q^T A_k S_k^{-1}$ and apply the vector similarity search to $Q_k$ using matrix $B_k$.
• Update: complicated; LSI can be recommended mainly for semi-static document collections.
• Strength: LSI is able to identify concepts or patterns among terms, based on their co-occurrence in the documents. Semantic correlations are extracted from 'noise'.
• Note: LSI resembles Principal Component Analysis (PCA), a well-known dimensionality-reduction method based on eigenvectors.
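The query transformation can be motivated by asking how an existing document column is represented in the reduced space. The short derivation below is a sketch under the assumption that $d_i$ denotes column i of D and $b_i$ the corresponding row of $B_k$ (both symbols are introduced here for illustration); a query is then treated as one more document column.

```latex
\begin{align*}
D &\approx A_k S_k B_k^{T} \\
d_i &\approx A_k S_k\, b_i^{T}
  && \text{(column $i$ of $D$; $b_i$ is row $i$ of $B_k$)} \\
A_k^{T} d_i &\approx S_k\, b_i^{T}
  && \text{(the columns of $A_k$ are orthonormal, $A_k^{T} A_k = I$)} \\
b_i &\approx d_i^{T} A_k S_k^{-1}
  && \text{(transpose both sides; $S_k$ is diagonal)}
\end{align*}
```

Replacing $d_i$ by the query vector Q gives $Q_k = Q^T A_k S_k^{-1}$, which can then be compared with the rows of $B_k$ using e.g. the cosine measure.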

Semi-structured documents: XML
• XML = eXtensible Markup Language
• Accepted 1998 by W3C (World Wide Web Consortium)
• Simplified form of SGML (Standard Generalized Markup Language)
• Document = tree structure (hierarchy) of elements
• Element enclosed by start and end tags
• Elements can be references to media objects
• Elements can be further described by attributes (in the start tag)
• Differences from HTML:
  - Logical content separated from physical layout
  - Extensible: new tags can be adopted according to need
  - XML also has other purposes than web publishing

Example XML document
    <?xml version="1.0"?>
    <books>
      <book isbn="123-456-789">
        <title>Database systems</title>
        <authors>
          <author>Elmasri</author>
          <author>Navathe</author>
        </authors>
      </book>
      <book isbn="987-654-321">
        <title>Multimedia databases</title>
        <authors>
          <author>Dunckley</author>
        </authors>
      </book>
    </books>
Structure: books contains book*; each book contains title and authors; authors contains author+.

XML-based markup languages
• Scalable Vector Graphics (SVG): presentation of variable-size vector graphics on screen.
• Office Open XML (OOXML; OpenXML): file format for representing office documents like (rich) text, spreadsheets, slide presentations, etc.
• Web services: applications that can communicate with other applications using standard protocols (HTTP) over the Internet.
• Mathematical Markup Language (MathML): presentation of mathematical formulas.
• Chemical Markup Language (CML): presentation of chemical formulas.
• and many others ...

XML-based markup languages (cont.)
Synchronized Multimedia Integration Language (SMIL):
• Controls layout, interaction, operation and timing of multimedia presentations
• Gathers the media files in the order that they should appear
• Combines them into a single stream
• Viewing by a SMIL-enabled player (e.g. Ambulant 2.0)
• W3C recommendation, see http://www.w3.org/AudioVideo/
• Latest official version SMIL 2.1, latest proposal SMIL 3.0 (Dec. 2008)
• Supported by several media players (e.g. RealPlayer; IE partially)
• Tutorial: http://www.w3schools.com/smil/default.asp

SMIL code example
    <smil>
      <head>
        <layout>
          <root-layout height="250" width="300" background-color="#ffffff" title="Z"/>
          <region id="title" width="200" height="100" top="0" left="0" z-index="1" />
          <region id="img" width="200" height="150" top="50" left="40" z-index="2" />
        </layout>
      </head>
      <body>
        <seq>
          <text src="http://www.xxx/head.txt" region="title" begin="2.00s" end="3.00s" />
          <par>
            <text src="http://www.yyy/text.txt" region="title" />
            <img src="http://www.ttt/fig.gif" region="img" begin="1.00s" end="10.00s" />
            <audio src="http://www.zzz/music.rm" begin="5.00s" end="10.00s" />
          </par>
        </seq>
      </body>
    </smil>

XML and databases
Some storage alternatives:
• Normalized database + transformation of query results into XML. Advantage: uniform presentation of database objects in heterogeneous, distributed and multi-tier database systems.
• Storage of the XML code as an attribute value in the document table (together with document id and other separated search attributes). The XML attribute is logically unnormalized. Advantage: no transformation needed for viewing.
• Storage of XML document elements and parent links as attributes in a relation. This needs careful indexing.
• Native XML database: requires a query language (interface) and indexing support.

XQuery: XML query language
• Developed by W3C (WWW Consortium), see http://www.w3.org/TR/xquery/
• XQuery 1.0, latest version January 2007.
• Important constituent: path expressions (using XPath)
• Control:
  - looping (FOR)
  - variable binding (LET)
  - selection condition (WHERE)
  - creating result (RETURN)
• Arithmetic and logical operators
• Sorting; sequence processing
• XQuery syntax alternatives: SQL- or XML-oriented

XML query examples
XPath: Find titles of articles with type "draft":
    collection('articles')/article[@type="draft"]/title
XQuery: Find authors of articles written in 2005 (join of two document collections):
    for $art in collection('articles')/article[@year="2005"]
    let $author := collection('authors')/author[@id=$art/auth_id]
    return
      <result>
        <title> { $art/title } </title>
        <author> { $author/name } </author>
      </result>
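The two queries can be imitated with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset (including `[@attr='value']` predicates) but not XQuery; the join is therefore written by hand. The two small collections, their element layout, and the author names are made-up assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for collection('articles') and collection('authors')
ARTICLES = """<articles>
  <article type="draft" year="2005"><title>Indexing XML</title><auth_id>a1</auth_id></article>
  <article type="final" year="2005"><title>Bitmap tricks</title><auth_id>a2</auth_id></article>
  <article type="draft" year="2004"><title>Signature files</title><auth_id>a1</auth_id></article>
</articles>"""
AUTHORS = """<authors>
  <author id="a1"><name>Author One</name></author>
  <author id="a2"><name>Author Two</name></author>
</authors>"""

articles = ET.fromstring(ARTICLES)
authors = ET.fromstring(AUTHORS)

# XPath-style query: titles of articles with type "draft"
draft_titles = [t.text for t in articles.findall("article[@type='draft']/title")]

# Join: (title, author name) for the articles written in 2005
name_by_id = {a.get("id"): a.findtext("name") for a in authors.findall("author")}
results = [(art.findtext("title"), name_by_id.get(art.findtext("auth_id")))
           for art in articles.findall("article[@year='2005']")]

print(draft_titles)
print(results)
```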

XML support in database management systems
• Commercial DBMSs extended by XML and XQuery support:
  - IBM DB2 9 'Viper'
  - Oracle 11g XML DB
  - Microsoft SQL Server 2005
• Some 'native' XML databases:
  - dbXML (open-source)
  - eXist (open-source)
  - xDB (commercial)
