3. Text and document databases
• Normal databases: formatted records; document databases: free-form or semi-structured data (e.g. XML).
• Application areas:
  - Office automation, document archives
  - Digital libraries
  - Electronic dictionaries / encyclopedias
  - Electronic newspapers
  - Source program libraries
  - Automated law and patent offices
• What is a ‘document’? E.g. a book, chapter, paragraph, article, letter, web page, source program, etc.
• General problem setting: searching documents by content; often also called associative search.
• Usually based on keywords or terms occurring in the documents.
• Search terms may be combined with Boolean connectives (AND, OR, NOT).
MMDB-3 J. Teuhola 2012
Non-indexed methods for string matching
• Sequential full-text scanning; slow, but with some advantages:
  - No extra disk space required
  - Updates are fast (no index maintenance)
  - Partial-match retrieval is rather simple (using wildcard characters)
  - Approximate matching is also possible (using a threshold for edit distance)
• Popular efficient algorithm: the Boyer-Moore technique
  - Peculiar feature: searching is faster for longer search strings.
  - Based on preprocessing of the search string
  - Performance is sublinear in practice
  - Disk accesses cannot be reduced, except for extremely long search strings.
• Several other string matching algorithms exist (skipped).
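To make the preprocessing idea concrete, here is a minimal sketch of Horspool's simplification of Boyer-Moore (a swapped-in variant, not the full algorithm named on the slide). The function name `horspool_search` is hypothetical. The shift table is built from the search string alone, and longer search strings allow longer jumps, which is why searching tends to get faster as the pattern grows.

```python
def horspool_search(text: str, pattern: str) -> list[int]:
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Preprocessing of the search string: for each character except the last,
    # record how far the window may move when that character is aligned
    # with the last position of the window.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # A character that does not occur in the pattern allows a full jump of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits


print(horspool_search("TO BE OR NOT TO BE", "BE"))
```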
Inverted indexing
• Traditional way of improving search speed.
• What does ‘inverted’ mean? A document is a list of words, but the index gives for each word the list of documents where the word appears.
Example documents:
  D1: "TO BE OR NOT TO BE"
  D2: "TO BE IS TO DO"
  D3: "DO BE DO BE DO"
Inverted index:
  BE   {D1, D2, D3}
  DO   {D2, D3}
  IS   {D2}
  NOT  {D1}
  OR   {D1}
  TO   {D1, D2}
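The example can be reproduced with a short in-memory sketch. The function name `build_inverted_index` and the use of a Python dictionary are illustrative assumptions; a real index would live on disk.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each word to the sorted list of documents in which it appears."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    # Keep the terms in alphabetical order, as in the example.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


docs = {
    "D1": "TO BE OR NOT TO BE",
    "D2": "TO BE IS TO DO",
    "D3": "DO BE DO BE DO",
}
index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, doc_ids)
```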
Inverted indexing (cont.)
• The set of words is called a lexicon. Some principles for it:
  - Case folding: convert uppercase letters to lowercase.
  - Stemming: remove suffixes; index only the root forms of terms.
  - Do not include stopwords, like “the”, “is”, “as”, “that”, etc., which occur very often but do not bear semantic relevance.
• The pointers to term occurrences may appear in different granularities:
  - A coarse-grained index identifies the document groups where the term appears.
  - A moderate-grained index identifies the relevant documents.
  - A fine-grained index contains sentence, word, or even byte numbers for term occurrences.
Inverted indexing (cont.)
• Coarse-grained index:
  - Small index size
  - Small maintenance penalty
  - A lot of plain text scanning
  - False drops for multi-term queries (the terms occur in the same document group but do not co-occur in one document)
• Fine-grained index:
  - Large index size
  - High maintenance penalty
  - Supports proximity queries (terms occurring close together)
Inverted indexing (cont.)
Ways to save storage space:
• Front compression: keep the index in alphabetical order; the prefix shared with the previous term is expressed compactly.
• Tail (suffix) compression: store terms only up to the point where they can be uniquely distinguished from other terms.
• In a fine-grained index: instead of full pointers, store the intervals (gaps) between successive occurrences of a term.
Compound queries:
• AND: retrieve the pointer lists and compute their intersection.
• OR: retrieve the pointer lists and form their union.
• NOT: usually combined with AND, so that set difference can be applied to the pointer lists of the terms.
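The three compound operations can be sketched on sorted pointer lists as follows. The document numbers 1 to 3 stand for D1 to D3 of the earlier example; the function names and the dictionary `postings` are assumptions made for illustration. The intersection uses a linear merge, which is one common way to combine sorted lists.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """AND: merge two sorted pointer lists, keeping the common entries."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def union(a: list[int], b: list[int]) -> list[int]:
    """OR: all entries that appear in either list."""
    return sorted(set(a) | set(b))


def difference(a: list[int], b: list[int]) -> list[int]:
    """AND NOT: entries of a that do not appear in b."""
    excluded = set(b)
    return [x for x in a if x not in excluded]


postings = {"BE": [1, 2, 3], "DO": [2, 3], "IS": [2], "NOT": [1], "OR": [1], "TO": [1, 2]}

print(intersect(postings["TO"], postings["DO"]))   # TO AND DO
print(union(postings["IS"], postings["OR"]))       # IS OR OR
print(difference(postings["BE"], postings["DO"]))  # BE AND NOT DO
```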
Data structures for inverted indexes
(a) B+-tree, with index terms as keys and pointer lists in the leaves.
(b) Hash organization, preferably a dynamic version, e.g.
  - Linear hashing
  - Extendible hashing
(c) Trie structure: each node represents one character, and the term is found by following a path from the root to a leaf. Problem: the trie must be mapped to external storage.
In each case, the variable-length pointer lists can be stored either locally in the structure or (preferably) detached.
Algorithm for building an inverted index
1. For each document, gather the index terms, combined with pointers to their actual locations. The result is a big sequential file (called S).
2. Sort S, using e.g. external mergesort.
3. Combine the successive entries representing the same index term.
4. Build the index (B+-tree, hash table, ...) on the index terms and let the leaf entries refer to the detached pointer lists, stored as variable-length records.
Assessment of inverted indexes:
• A very effective access method for retrieval.
• A very popular technique in practice for static document sets (in spite of the storage penalty).
• Presumably the main retrieval tool in web search engines.
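The four steps can be sketched in memory as follows. Several simplifications are assumed here: an in-memory sort stands in for external mergesort, a Python dictionary stands in for the B+-tree or hash table, and pointers are (document, word position) pairs. The name `build_index` is hypothetical.

```python
from itertools import groupby


def build_index(docs: dict[str, str]):
    # Step 1: gather (term, document, word position) entries into the "file" S.
    s = [(term, doc_id, pos)
         for doc_id, text in docs.items()
         for pos, term in enumerate(text.split())]
    # Step 2: sort S (an in-memory sort stands in for external mergesort).
    s.sort()
    # Steps 3 and 4: combine successive entries of the same term into one
    # detached pointer list, and let the lexicon entry refer to that list.
    pointer_lists = []  # detached variable-length records
    lexicon = {}        # term -> position of its pointer list
    for term, group in groupby(s, key=lambda entry: entry[0]):
        lexicon[term] = len(pointer_lists)
        pointer_lists.append([(doc_id, pos) for _, doc_id, pos in group])
    return lexicon, pointer_lists


docs = {
    "D1": "TO BE OR NOT TO BE",
    "D2": "TO BE IS TO DO",
    "D3": "DO BE DO BE DO",
}
lexicon, pointer_lists = build_index(docs)
print(pointer_lists[lexicon["DO"]])
```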
Bitmap indexing
• Another representation for the inverted index
• Suitable for coarse- and moderate-grained indexing
• A bitmap is a matrix with a row for each index term and a column for each document. Element <i, j> is 1 if term i occurs in document j, otherwise 0.
Example documents:
  d1: "TO BE OR NOT TO BE"
  d2: "TO BE IS TO DO"
  d3: "DO BE DO BE DO"
Bitmap:
        d1  d2  d3
  BE     1   1   1
  DO     0   1   1
  IS     0   1   0
  NOT    1   0   0
  OR     1   0   0
  TO     1   1   0
[Note: Normally these index terms would be stopwords.]
Bitmap indexing (cont.)
• Especially efficient for Boolean queries: AND, OR, NOT can be implemented directly in hardware (e.g. 64 bits in parallel).
• Problem: high storage consumption (#terms × #documents); the matrix is usually sparse.
• Possible combined structure: use an inverted index for the less frequent terms, and a bitmap for the more frequent terms.
• One option: compression. E.g. run-length coding: replace each sequence of zeroes by its length (which gets close to the normal inverted index). The encoding of the integers also has to be decided.
Example:
  Bitmap row      = 001010000010001100000100...
  Run-length code = <2, 1, 5, 3, 0, 5, ...>
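The run-length example can be checked with a small sketch. Each code value is the number of zeroes preceding a 1-bit; the total row length is kept separately so that trailing zeroes can be restored. The function names are hypothetical, and the choice of how to encode the integers themselves is left open, as on the slide.

```python
def rle_encode(bits: str) -> tuple[list[int], int]:
    """Return the zero-run length before each 1-bit, plus the row length."""
    runs, zeros = [], 0
    for bit in bits:
        if bit == "0":
            zeros += 1
        else:
            runs.append(zeros)
            zeros = 0
    return runs, len(bits)


def rle_decode(runs: list[int], length: int) -> str:
    """Rebuild the bitmap row; trailing zeroes are restored from the length."""
    bits = "".join("0" * run + "1" for run in runs)
    return bits.ljust(length, "0")


row = "001010000010001100000100"
runs, length = rle_encode(row)
print(runs)
```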
Hierarchical compression of bit strings
• Divide the string into equal-sized blocks.
• Apply disjunction (OR) to the bits within each block, creating a higher-level bit: a block of zeroes generates a 0-bit, others a 1-bit.
• The process is repeated on higher levels, recursively.
• Advantage: single bits are easily accessible, by studying one path in the tree.
• Compression: zero blocks need not be stored at all. Most of the leaf blocks are usually zero blocks (sparse bitmap).
Example (block size 4):
  Top level:    1100
  Middle level: 0101 1010 0000 0000
  Leaf level:   0000 0010 0000 0011   1000 0000 0100 0000   0000 0000 0000 0000   0000 0000 0000 0000
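A minimal sketch of the scheme, assuming a block size of 4 and a bit string whose length is a power of the block size. The function names `compress` and `get_bit` are hypothetical. Only nonzero blocks are stored on each level, and a single bit is read by following one path from the root, counting preceding 1-bits to locate the stored child block.

```python
def compress(bits: str, block: int = 4) -> list[list[str]]:
    """Return the stored (nonzero) blocks of each level, root level first."""
    levels = []
    current = bits
    while len(current) > block:
        blocks = [current[i:i + block] for i in range(0, len(current), block)]
        levels.append([b for b in blocks if "1" in b])  # zero blocks are dropped
        current = "".join("1" if "1" in b else "0" for b in blocks)
    levels.append([current])  # the root block is always kept
    return levels[::-1]


def get_bit(levels: list[list[str]], pos: int, block: int = 4) -> int:
    """Read one bit of the original string by following a single path."""
    depth = len(levels)
    idx = 0  # index of the current block among the stored blocks of its level
    for lvl, stored in enumerate(levels):
        span = block ** (depth - lvl - 1)  # original bits covered by one bit here
        offset = (pos // span) % block
        blk = stored[idx]
        if blk[offset] == "0":
            return 0  # the whole subtree is zero
        if lvl == depth - 1:
            return 1
        # The child block is the n-th stored block of the next level, where n
        # is the number of 1-bits preceding this bit on the current level.
        idx = sum(b.count("1") for b in stored[:idx]) + blk[:offset].count("1")
    return 0


bits = "0000001000000011" + "1000000001000000" + "0" * 32
levels = compress(bits)
print(levels)
```

In this example 28 bits are stored instead of the original 64.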
Signature indexing
• A probabilistic technique
• Can be generalized to any objects characterized by a variable-size set of index terms or descriptors (also called features).
• A signature is a bit array, generated by hashing the index terms to indexes of the array. It is usually at least some hundreds of bits long.
• Signatures are collected into a separate signature file, which is smaller than the whole space required by the documents.
• The signature file acts as a filtering mechanism, to reduce the amount of actual data to be searched.
• The structure enables partial-match retrieval (a subset of the terms matches).
• Queries can also be considered a kind of document (a collection of keywords) and be converted into signature form. The query signature Q is compared against the document signatures Di. If the 1-bits of Q are included in the 1-bits of Di, then Di is a candidate result.
• Signatures are approximate descriptions of documents. False drops must be eliminated by checking the actual match of all candidates.
Signature indexing (cont.)
Advantages of signature files:
• Low storage consumption, compared to (fine-grained) inverted indexes. A typical value is 10-20% of the primary database size.
• More convenient than multidimensional indexes (to be studied later)
Signature generation methods:
• Word signature method: each index term is hashed into a sparse bit pattern, and the patterns are concatenated to form the document signature. This method usually results in a higher false drop probability, but preserves the sequencing information of terms in documents.
• Superimposed coding: each index term produces a bit pattern of full signature size. The patterns are OR’ed to form the document signature. This is the default method here.
How to minimize false drops?
• The proportions of 0’s and 1’s should be equal and uniformly distributed. (Weight = number of 1-bits in the signature)
Signature indexing: Example
Document index terms:
  D1: database, object, programming, schema
  D2: algorithm, computer, programming
  D3: algorithm, data structure, programming
Hashing:
  hash(“algorithm”) = 3
  hash(“computer”) = 1
  hash(“database”) = 7
  hash(“data structure”) = 5
  hash(“object”) = 6
  hash(“programming”) = 4
  hash(“schema”) = 1
Signatures: D1 = 10010110, D2 = 10110000, D3 = 00111000
Query: Documents about “algorithm” and “programming”?
Query signature: Q = 00110000, matches with D2 and D3.
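The example can be replayed with superimposed coding in a few lines. The hash values are taken from the slide and hard-coded in a table; bit positions are counted 1 to 8 from the left. The names `signature` and `candidates` are hypothetical. The last query shows a false drop: “schema” and “computer” hash to the same bit, so D2 becomes a candidate for “schema” and must be eliminated by checking the actual terms.

```python
SIG_BITS = 8
HASH = {"algorithm": 3, "computer": 1, "database": 7, "data structure": 5,
        "object": 6, "programming": 4, "schema": 1}

DOC_TERMS = {
    "D1": ["database", "object", "programming", "schema"],
    "D2": ["algorithm", "computer", "programming"],
    "D3": ["algorithm", "data structure", "programming"],
}


def signature(terms: list[str]) -> int:
    """OR together one bit per term; position 1 is the leftmost bit."""
    sig = 0
    for term in terms:
        sig |= 1 << (SIG_BITS - HASH[term])
    return sig


DOC_SIGS = {doc: signature(terms) for doc, terms in DOC_TERMS.items()}


def candidates(query_terms: list[str]) -> list[str]:
    """Documents whose signature contains all 1-bits of the query signature."""
    q = signature(query_terms)
    return [doc for doc, sig in DOC_SIGS.items() if sig & q == q]


for doc, sig in DOC_SIGS.items():
    print(doc, format(sig, "08b"))
print(candidates(["algorithm", "programming"]))
# Verification step: keep only candidates that really contain the term.
print([d for d in candidates(["schema"]) if "schema" in DOC_TERMS[d]])
```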
Models for information retrieval from documents
• Boolean model
  - Queries consist of search terms, connected by Boolean operators (AND, OR, NOT).
  - All documents containing a match are retrieved; there is no ranking.
• Vector space model
  - Search terms are given weights in documents.
  - Documents are ranked based on the distance between the query and document vectors.
• Probabilistic model
  - Estimation of the probability of relevance between query and document, based on the relevance probabilities of the terms.
  - Popular version of this approach: BM25 (Best Match 25).
Vector-based document retrieval
Principles:
• Retrieval with less precise queries; not always exact-match.
• Ranking the documents according to their ‘distance’ from the query.
• Semantic correlations of terms should be taken into account.
Concepts:
• Synonymy: different terms mean the same thing.
• Polysemy: a single term has multiple meanings.
• The weight of a term indicates its importance in a document. The weight could be the number of term occurrences in the document.
• Measuring the goodness of retrieval:
  - Precision: proportion of retrieved relevant documents, relative to the total number of retrieved documents.
  - Recall: proportion of retrieved relevant documents, relative to the total number of relevant documents.
• The user decides which documents are relevant and which are not.
Precision and recall
  Precision = |X| / |A|
  Recall = |X| / |B|
where, within the set of all documents, A = retrieved documents, B = relevant documents, and X = hits (the documents that are both retrieved and relevant).
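As a small illustration, both measures can be computed directly from the sets of retrieved and relevant documents. The document identifiers below are made up for the example, and the function name is hypothetical.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall) for one query result."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = {"D1", "D2", "D3", "D4"}  # A: what the system returned
relevant = {"D2", "D4", "D5"}         # B: what the user considers relevant
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found.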
Matrix representation of the term & document sets
• Matrix D: one row per term, one column ($D_i$) per document
• Generalization of the bitmap
• Element $D_{t,i}$ denotes the weight of term t in document i. The weight could be e.g. the number of term occurrences (more advanced scheme later).
• Column vectors characterize documents.
Example documents:
  D1: "TO BE OR NOT TO BE"
  D2: "TO BE IS TO DO"
  D3: "DO BE DO BE DO"
Matrix D:
        D1  D2  D3
  BE     2   1   2
  DO     0   1   3
  IS     0   1   0
  NOT    1   0   0
  OR     1   0   0
  TO     2   2   0
Comparing queries and documents
• Queries can be regarded as documents as well, with their own characterizing vector, built from the query terms (but usually without weights).
• Task: find the k documents whose vectors are closest to the query vector.
• Problem: how to measure the distance (or similarity) between documents (or, usually, between query and document)?
• First attempt: $Similarity(Q, D_i) = Q \cdot D_i = \sum_{t \in Terms} Q_t D_{t,i}$, i.e. the inner product of the query and document weight vectors.
• Example: Query = {TO, DO}, i.e. Q = (0, 1, 0, 0, 0, 1)
  Q · D1 = (0, 1, 0, 0, 0, 1) · (2, 0, 0, 1, 1, 2) = 2
  Q · D2 = (0, 1, 0, 0, 0, 1) · (1, 1, 1, 0, 0, 2) = 3 (best similarity)
  Q · D3 = (0, 1, 0, 0, 0, 1) · (2, 3, 0, 0, 0, 0) = 3 (best similarity)
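The example scores can be recomputed with a few lines. The vectors are the document columns of the term-document matrix, in the term order BE, DO, IS, NOT, OR, TO; the helper names are hypothetical.

```python
TERMS = ["BE", "DO", "IS", "NOT", "OR", "TO"]
DOCS = {
    "D1": [2, 0, 0, 1, 1, 2],
    "D2": [1, 1, 1, 0, 0, 2],
    "D3": [2, 3, 0, 0, 0, 0],
}


def query_vector(query_terms: set[str]) -> list[int]:
    """Unweighted query vector: 1 for each query term, 0 otherwise."""
    return [1 if term in query_terms else 0 for term in TERMS]


def inner_product(q: list[int], d: list[int]) -> int:
    return sum(x * y for x, y in zip(q, d))


q = query_vector({"TO", "DO"})
scores = {doc: inner_product(q, vec) for doc, vec in DOCS.items()}
print(scores)
```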
Giving weights to terms
Problems:
• If the frequencies $f_{t,i}$ (term instances per document) are used as weights, general terms are overweighted.
• Long documents are favored over short ones, because they contain more terms.
Applying Zipf’s law:
• The weight of an index term t should be inversely proportional to its frequency among documents, i.e. $w_t = 1/n_t$, where $n_t$ = number of documents containing term t.
• An attenuated (weakened) form of this weight: $w_t = \log(1 + N/n_t)$, where N = number of documents.
• This can be combined with the term frequency, giving $D_{t,i} = f_{t,i} \cdot \log(1 + N/n_t)$, which is called the TF*IDF rule (“term frequency times inverse document frequency”).
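A minimal sketch of the TF*IDF rule, applied to the three example documents used elsewhere in these slides. The base of the logarithm is not stated on the slide; the natural logarithm is assumed here. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Weight of term t in document i: f_ti * log(1 + N / n_t)."""
    counts = {doc: Counter(text.split()) for doc, text in docs.items()}
    n_docs = len(docs)  # N
    doc_freq = Counter()  # n_t: number of documents containing term t
    for tf in counts.values():
        doc_freq.update(tf.keys())
    return {
        doc: {t: f * math.log(1 + n_docs / doc_freq[t]) for t, f in tf.items()}
        for doc, tf in counts.items()
    }


docs = {
    "D1": "TO BE OR NOT TO BE",
    "D2": "TO BE IS TO DO",
    "D3": "DO BE DO BE DO",
}
weights = tfidf(docs)
for doc, w in weights.items():
    print(doc, {t: round(x, 3) for t, x in w.items()})
```

In D2 the rare term IS now outweighs the common term BE, although both occur once.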
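Measuring distance / similarity
Geometric distances in vector space:
• Euclidean distance (similarity = inverse of distance):
  $dist(Q, D_i) = \sqrt{\sum_{t \in Terms} (Q_t - D_{t,i})^2}$
  This measure discriminates against long documents: $D_{t,i}$ is large, $Q_t$ small.
• Cosine rule: the cosine of the angle between the query and document vectors is a good measure of their similarity. From vector algebra:
  $\cos\theta = \frac{X \cdot Y}{|X|\,|Y|}$
  The obtained similarity measure:
  $Similarity(Q, D_i) = \frac{\sum_{t} Q_t D_{t,i}}{\sqrt{\sum_{t} Q_t^2}\,\sqrt{\sum_{t} D_{t,i}^2}}$
• Combined with TF*IDF: the document weights in this measure are $D_{t,i} = f_{t,i} \cdot \log(1 + N/n_t)$.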
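The effect of the cosine rule can be seen on the example query {TO, DO}, where the plain inner product gave D2 and D3 the same score. This sketch uses raw term counts as weights (not TF*IDF) to keep the numbers easy to follow; the function name is hypothetical.

```python
import math


def cosine(q: list[float], d: list[float]) -> float:
    """Cosine of the angle between two weight vectors."""
    dot = sum(x * y for x, y in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(y * y for y in d))
    return dot / norm if norm else 0.0


# Term order: BE, DO, IS, NOT, OR, TO
DOCS = {
    "D1": [2, 0, 0, 1, 1, 2],
    "D2": [1, 1, 1, 0, 0, 2],
    "D3": [2, 3, 0, 0, 0, 0],
}
Q = [0, 1, 0, 0, 0, 1]  # query {TO, DO}

ranking = sorted(DOCS, key=lambda doc: cosine(Q, DOCS[doc]), reverse=True)
for doc in ranking:
    print(doc, round(cosine(Q, DOCS[doc]), 3))
```

Dividing by the vector lengths breaks the tie: D2 ranks first, because a larger share of its weight lies on the query terms.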
Search engine architecture
• Acquisition of web documents: by crawling along the links, and by monitoring RSS feeds.
• The documents are stored, preprocessed to text, and indexed.
• Preprocessing tasks: parsing, stopword removal, stemming, link extraction, classification, etc.
• Indexing: collecting statistics, giving weights to terms, and building the index; inverted indexes are the common index type.
• Query processing: preprocessing, retrieval, ranking, output, relevance feedback.
• Main goals: relevance of query results, and efficient processing.
• Problem of scale: the amounts of documents and queries are huge. Distribution, parallelism, replication, compression, ... are needed.
Ranking in Web search engines
• Search steps:
  - Normal search by keywords, resulting in ‘hits’
  - Ranking of the hit documents
• Main difference between web pages and traditional sets of documents: (hyper)links. Links are the most important factor in ranking documents.
• Other ranking factors:
  - Content relevance measure
  - Number of visits
  - Estimated (formal) quality of the content
  - Page loading time
  - Financial promotion
HITS algorithm for ranking
• Retrieved pages are given two scores:
  - Authority score ($a_i$) represents how respected the page is, judged by its incoming links and the scores of the referring pages.
  - Hub score ($h_i$) measures the goodness of the page, in view of the scores of the pages to which it refers.
• The scores are interrelated; one cannot be decided before the other!
• Solution: start with initial scores $a_i^{(0)}$ and $h_i^{(0)}$, and iterate
  $a_i^{(k)} = \sum_{j \to i} h_j^{(k-1)}$ and $h_i^{(k)} = \sum_{i \to j} a_j^{(k-1)}$
  for k = 1, 2, ... until the scores stabilize (here $j \to i$ means that page j links to page i).
• The score vectors must be normalized at each iteration.
• The process actually computes the dominant eigenvectors of $M^T M$ and $M M^T$, where M is the adjacency matrix of the page-link graph.
• Problem with the HITS algorithm: scores are computed during query processing, not precomputed.
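A minimal sketch of the iteration, assuming that both scores are updated from the previous round and normalized to unit Euclidean length; the exact update order and normalization are assumptions, as is the tiny four-page graph. The function names are hypothetical.

```python
import math


def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale the score vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links: dict[str, list[str]], iterations: int = 50):
    """links maps each page to the pages it refers to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking to p.
        new_auth = {p: sum(hub[q] for q in pages if p in links.get(q, []))
                    for p in pages}
        # Hub: sum of the authority scores of the pages that p links to.
        new_hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


links = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
auth, hub = hits(links)
print({p: round(v, 3) for p, v in auth.items()})
print({p: round(v, 3) for p, v in hub.items()})
```

Page C, which is referred to by two pages, ends up as the dominant authority, and A and B as the best hubs.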
PageRank algorithm
• The famous basis of Google’s ranking method, still essential!
• The rank (importance) of a web page is determined by the importance of the pages referring to it.
• Network flow idea: if a page has importance r and k outgoing links, importance ‘flows’ equally to the referred pages, namely r/k for each.
• Iterative computation:
  $r^{(k+1)}(P) = \sum_{L \to P} \frac{r^{(k)}(L)}{f(L)}$
  where
  - L goes through the pages linking to P.
  - f(L) denotes the fanout (#out-pointers) of L.
• In matrix algebra: convergence to the left eigenvector of a transition matrix with entry $1/f(P_i)$ on row i at the columns of the pages that $P_i$ links to.
• Difference from HITS: PageRank values are independent of queries.
• Problem with both: semantic correlations are not well considered.
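The flow idea can be sketched as a simple iteration. This sketch follows the plain formulation on the slide and makes two assumptions: every page has at least one outgoing link, and the graph is such that the iteration converges. The damping factor used in the published PageRank method is deliberately left out. The graph and the function name are made up for the example.

```python
def pagerank(links: dict[str, list[str]], iterations: int = 100) -> dict[str, float]:
    """links maps each page to the pages it refers to (no dangling pages)."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for source, targets in links.items():
            share = rank[source] / len(targets)  # r / k flows along each link
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
print({p: round(r, 3) for p, r in rank.items()})
```

Page C receives importance from both A and B, and passes all of it on to A, so A and C end up with equal rank and B with half of it.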
Latent semantic indexing (LSI)
Problem with the plain term-document approach:
• Terms have synonyms, and documents often resemble each other.
• The term-document matrix is usually large and sparse (mostly 0).
Latent semantic indexing (LSI):
• Reduction of the search space (and the matrix)
• Representation of terms and documents in ‘semantic space’, by deriving a set of uncorrelated factors (‘concepts’)
Steps in LSI:
1. Create the weighted term-document matrix D.
2. Compute a singular value decomposition (A, S, B) of D by splitting D into three matrices A, S, and B.
3. Reduce the size of the matrices by eliminating insignificant rows/columns.
4. Store the matrices, using any of the available indexing techniques.
LSI: Singular value decomposition (SVD)
• Given any matrix D of size m×n, it is possible to find matrices A, S, and B such that
  - $D = A S B^T$
  - A is an orthogonal m×m matrix, i.e. $A^T A = I$
  - B is an orthogonal n×n matrix, i.e. $B^T B = I$
  - S is a diagonal m×n matrix (the matrix of singular values), whose nonzero elements lie on the diagonal, starting from the top-left corner, in non-increasing order
• Dimensions: D (m×n) = A (m×m) · S (m×n) · $B^T$ (n×n)
LSI: Singular value decomposition (SVD)
Idea:
• Let r be the number of nonzero singular values in S. Reduce r to a smaller value k, such that the least significant (bottom-right) elements of S are discarded, as well as the corresponding columns of A and rows of $B^T$.
• As a result we need to store only the reduced matrices $A_k$ (m×k), $S_k$ (k×k), and $B_k^T$ (k×n).
• Approximation: $D \approx A_k S_k B_k^T$
LSI usage
• Query processing: compute the transformed query vector $Q_k = Q^T A_k S_k^{-1}$ and apply the vector similarity search to $Q_k$ using matrix $B_k$.
• Update: complicated; LSI can be recommended mainly for semi-static document collections.
• Strength: LSI is able to identify concepts or patterns among terms, based on their co-occurrence in the documents. Semantic correlations are extracted from ‘noise’.
• Note: LSI resembles Principal Component Analysis (PCA), a well-known dimensionality-reduction method based on eigenvectors.
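The query transformation can be motivated by a short derivation, sketched here under the assumption that all k retained singular values are nonzero (so that $S_k$ is invertible) and that a query is treated like an additional document column. The symbols $d_i$ (column i of D) and $b_i$ (row i of $B_k$) are introduced only for this sketch.

```latex
\begin{align*}
D &\approx A_k S_k B_k^T
  && \text{(reduced decomposition)} \\
A_k^T D &\approx S_k B_k^T
  && \text{(since } A_k^T A_k = I\text{)} \\
B_k^T &\approx S_k^{-1} A_k^T D
  && \text{(multiply by } S_k^{-1}\text{)} \\
b_i^T &\approx S_k^{-1} A_k^T d_i
  && \text{(column } i\text{: document } d_i \text{ in concept space)} \\
Q_k^T &= S_k^{-1} A_k^T Q
  \;\Longleftrightarrow\; Q_k = Q^T A_k S_k^{-1}
  && \text{(the same mapping applied to the query)}
\end{align*}
```

The transformed query thus lives in the same k-dimensional space as the rows of $B_k$, which is why the similarity search can be run against $B_k$ directly.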
Semi-structured documents: XML
• XML = eXtensible Markup Language
• Accepted 1998 by W3C (World Wide Web Consortium)
• Simplified form of SGML (Standard Generalized Markup Language)
• Document = tree structure (hierarchy) of elements
• An element is enclosed by start and end tags.
• Elements can be references to media objects.
• Elements can be further described by attributes (in the start tag).
• Differences from HTML:
  - Logical content is separated from physical layout.
  - Extensible: new tags can be adopted according to need.
  - XML also has other purposes than web publishing.
Example XML document
<?xml version="1.0"?>
<books>
  <book isbn="123-456-789">
    <title>Database systems</title>
    <authors>
      <author>Elmasri</author>
      <author>Navathe</author>
    </authors>
  </book>
  <book isbn="987-654-321">
    <title>Multimedia databases</title>
    <authors>
      <author>Dunckley</author>
    </authors>
  </book>
</books>
Document structure: books contains book*; each book contains title and authors; authors contains author+.
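To illustrate the tree view of the document, the example can be parsed and navigated with Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The variable names are chosen for this sketch only.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """<?xml version="1.0"?>
<books>
  <book isbn="123-456-789">
    <title>Database systems</title>
    <authors>
      <author>Elmasri</author>
      <author>Navathe</author>
    </authors>
  </book>
  <book isbn="987-654-321">
    <title>Multimedia databases</title>
    <authors>
      <author>Dunckley</author>
    </authors>
  </book>
</books>
"""

root = ET.fromstring(BOOKS_XML)

# Walk the element hierarchy: books -> book* -> title, authors -> author+
titles = [book.findtext("title") for book in root.findall("book")]
authors_by_isbn = {
    book.get("isbn"): [author.text for author in book.findall("authors/author")]
    for book in root.findall("book")
}

# Path expression with an attribute condition.
title = root.find("book[@isbn='987-654-321']/title").text

print(titles)
print(authors_by_isbn)
print(title)
```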
XML-based markup languages
• Scalable Vector Graphics (SVG): presentation of variable-size vector graphics on screen.
• Office Open XML (OOXML; OpenXML): file format for representing office documents like (rich) text, spreadsheets, slide presentations, etc.
• Web services: applications that can communicate with other applications using standard protocols (http) over the Internet.
• Mathematical Markup Language (MathML): presentation of mathematical formulas.
• Chemical Markup Language (CML): presentation of chemical formulas.
• and many others ...
XML-based markup languages (cont.)
Synchronized Multimedia Integration Language (SMIL):
• Controls layout, interaction, operation and timing of multimedia presentations
• Gathers the media files in the order in which they should appear
• Combines them into a single stream
• Viewing with a SMIL-enabled player (e.g. Ambulant 2.0)
• W3C recommendation, see http://www.w3.org/AudioVideo/
• Latest official version SMIL 2.1, latest proposal SMIL 3.0 (Dec. 2008)
• Supported by several media players (e.g. RealPlayer; IE partially)
• Tutorial: http://www.w3schools.com/smil/default.asp
SMIL code example
<smil>
  <head>
    <layout>
      <root-layout height="250" width="300" background-color="#ffffff" title="Z"/>
      <region id="title" width="200" height="100" top="0" left="0" z-index="1"/>
      <region id="img" width="200" height="150" top="50" left="40" z-index="2"/>
    </layout>
  </head>
  <body>
    <seq>
      <text src="http://www.xxx/head.txt" region="title" begin="2.00s" end="3.00s"/>
      <par>
        <text src="http://www.yyy/text.txt" region="title"/>
        <img src="http://www.ttt/fig.gif" region="img" begin="1.00s" end="10.00s"/>
        <audio src="http://www.zzz/music.rm" begin="5.00s" end="10.00s"/>
      </par>
    </seq>
  </body>
</smil>
XML and databases
Some storage alternatives:
• Normalized database + transformation of query results into XML. Advantage: uniform presentation of database objects in heterogeneous, distributed and multi-tier database systems.
• Storage of the XML code as an attribute value in the document table (together with the document id and other separated search attributes). The XML attribute is logically unnormalized. Advantage: no transformation needed for viewing.
• Storage of XML document elements and parent links as attributes in a relation. This needs careful indexing.
• Native XML database: requires a query language (interface) and indexing support.
XQuery: XML query language
• Developed by W3C (WWW Consortium), see http://www.w3.org/TR/xquery/
• XQuery 1.0, latest version January 2007.
• Important constituent: path expressions (using XPath)
• Control:
  - looping (FOR)
  - variable binding (LET)
  - selection condition (WHERE)
  - creating the result (RETURN)
• Arithmetic and logical operators
• Sorting; sequence processing
• XQuery syntax alternatives: SQL- or XML-oriented
XML query examples
XPath: Find titles of articles with type "draft":
collection('articles')/article[@type="draft"]/title
XQuery: Find authors of articles written in 2005 (join of two document collections):
for $art in collection('articles')/article[@year="2005"]
let $author := collection('authors')/author[@id=$art/auth_id]
return
  <result>
    <title> { $art/title } </title>
    <author> { $author/name } </author>
  </result>
XML support in database management systems
• Commercial DBMSs extended by XML and XQuery support:
  - IBM DB2 9 ‘Viper’
  - Oracle 11g XML DB
  - Microsoft SQL Server 2005
• Some ‘native’ XML databases:
  - dbXML (open-source)
  - eXist (open-source)
  - xDB (commercial)