
n-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure




  1. VLDB 2005
     n-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure
     Aug. 31, 2005
     Min-Soo Kim, Kyu-Young Whang, Jae-Gil Lee, and Min-Jae Lee
     Department of Computer Science, Korea Advanced Institute of Science and Technology (KAIST)

  2. Contents
     • Introduction
     • Motivation and Goals
     • Structure of the n-Gram/2L Index
     • Analysis of the n-Gram/2L Index
     • Performance Evaluation
     • Conclusions
     Dept. of Computer Science, KAIST

  3. Inverted Index
     • A term-oriented index structure for quickly searching documents containing a given term [BR1999]
       • Most actively used for text searching
     • Classification (depending on the kind of terms) [WMB1999]
       • Word-based inverted index
       • n-gram inverted index (simply, the n-gram index): the scope of this talk
     <Figure: a B+-Tree index on terms pointing to posting lists; a posting has the form d, [o1, ..., of], where d is a document identifier, oi is an offset where term t occurs in document d, and f is the frequency of occurrence of term t in document d>

  4. n-Gram Index
     • n-Gram
       • Definition: a string of fixed length n
       • Extraction method
         • Sliding a window of length n by one character in the text
         • Recording the sequence of characters in the window (we call it the 1-sliding technique)
     • Example
     <Figure: a 2-gram inverted index over a collection of six documents (document 0: ABCDDABBCD, document 1: DABCDABCDA, document 2: CDABBCDDAB, document 3: BCDABCDABC, document 4: DDABCDABCD, document 5: BBCDABCDAB); e.g., the posting list of the 2-gram AB is 0, [0, 5]  1, [1, 5]  2, [2, 8]  3, [3, 7]  4, [2, 6]  5, [4, 8]>
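
The 1-sliding extraction and the resulting inverted index can be sketched as follows (a minimal sketch; the Python dict stands in for the B+-Tree of the figure):

```python
def ngrams(text, n):
    """1-sliding technique: slide a window of length n by one character,
    yielding each n-gram together with its offset in the text."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n], i

def build_ngram_index(documents, n):
    """n-gram inverted index: n-gram -> posting list of (d, [o1, ..., of])."""
    index = {}
    for d, text in enumerate(documents):
        per_doc = {}
        for gram, o in ngrams(text, n):
            per_doc.setdefault(gram, []).append(o)
        for gram, offs in per_doc.items():
            index.setdefault(gram, []).append((d, offs))
    return index

# The six documents of the example collection
docs = ["ABCDDABBCD", "DABCDABCDA", "CDABBCDDAB",
        "BCDABCDABC", "DDABCDABCD", "BBCDABCDAB"]
index = build_ngram_index(docs, 2)
print(index["AB"])
# [(0, [0, 5]), (1, [1, 5]), (2, [2, 8]), (3, [3, 7]), (4, [2, 6]), (5, [4, 8])]
```

The output reproduces the posting list of AB shown on the slide.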

  5. Pros and Cons of the n-Gram Index [BR1999, MM2003]
     • Pros
       • Language-neutral
         • Allows us to disregard the characteristics of the language
         • Widely used for Asian languages and for DNA and protein databases
       • Error-tolerant
         • Allows us to retrieve documents with some errors in the query result
         • Widely used for applications that allow errors (e.g., approximate matching)
     • Cons
       • The index size tends to be large, and the query performance tends to be poor

  6. Motivation
     • We note that the large size of the n-gram index is due to redundancy in the position information
       • If a subsequence is repeated multiple times in documents, the relative offsets (within the subsequences) of the n-grams extracted from that subsequence are also indexed multiple times
     <Figure: documents 1 through N share the repeated subsequences A = a1a2a3a4 and B = b1b2b3b4; in the 2-gram index, the posting list of a1a2 is 1, [o1+0]  2, [o3+0]  ...  N, [o5+0], that of a2a3 is 1, [o1+1]  2, [o3+1]  ...  N, [o5+1], and so on, repeating the same relative offsets once per occurrence>

  7. • We find that the two-level construction eliminates this redundancy
       • If the relative offsets of the n-grams extracted from a subsequence are indexed only once, the index size is reduced since the repetition is eliminated
     <Figure: the two-level construction of the 2-gram index over the same collection; the 2-grams a1a2, a2a3, a3a4 now point to subsequence A with relative offsets [0], [1], [2], and A's posting list 1, [o1]  2, [o3]  ...  N, [o5] records where A occurs in the documents (similarly for B)>

  8. Goals
     • We propose the two-level n-gram inverted index (simply, n-gram/2L)
     • We show that the n-gram/2L index significantly reduces the index size and improves the query performance over the conventional n-gram index

  9. Structure of the n-Gram/2L Index
     • Two-level structure
       • Back-end index: storing the offsets of m-subsequences within documents
       • Front-end index: storing the offsets of n-grams within m-subsequences
       (m-subsequence: a subsequence of length m)
     <Figure: the front-end index is a B+-Tree on n-grams whose posting v, [o1, ..., of(v,t)] records the m-subsequence identifier v and the offsets oi where n-gram t occurs in v; the back-end index is a B+-Tree on m-subsequences whose posting d, [o1, ..., of(d,s)] records the document identifier d and the offsets oi where m-subsequence s occurs in d>

  10. Building of the n-Gram/2L Index
     • Algorithm
       • Step 1 (back-end index)
         • Extracting m-subsequences from a set of documents such that consecutive subsequences overlap with each other by n-1 characters
         • Building the back-end index using the m-subsequences
       • Step 2 (front-end index)
         • Extracting n-grams from the set of m-subsequences
         • Building the front-end index using the n-grams

  11. • Theorem 1: If m-subsequences are extracted such that consecutive ones overlap with each other by n-1 characters, no n-gram is missed or duplicated
     • Proof (sketch): <Figure: overlapping by more than n-1 characters duplicates some n-grams, while overlapping by fewer than n-1 (e.g., n-2) misses the n-grams spanning the boundary between consecutive m-subsequences>
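
The extraction rule of Step 1 and the no-miss/no-duplicate property of Theorem 1 can be checked mechanically (a sketch; how the paper treats a short final subsequence is an assumption here: this sketch simply keeps it as-is):

```python
def extract_subsequences(text, m, n):
    """Extract m-subsequences so that consecutive ones overlap by exactly
    n-1 characters, i.e., start offsets advance by m-(n-1). The final
    subsequence may be shorter than m (an assumption of this sketch)."""
    step = m - (n - 1)
    subs, i = [], 0
    while True:
        subs.append((text[i:i + m], i))
        if i + m >= len(text):
            return subs
        i += step

def ngram_occurrences(s, n, base=0):
    """All (n-gram, absolute offset) pairs of string s."""
    return {(s[i:i + n], base + i) for i in range(len(s) - n + 1)}

# Theorem 1 check on document 0 of the example collection: the n-gram
# occurrences collected from the subsequences equal those of the document.
doc, m, n = "ABCDDABBCD", 4, 2
from_subs = set()
for sub, off in extract_subsequences(doc, m, n):
    grams = ngram_occurrences(sub, n, base=off)
    assert from_subs.isdisjoint(grams)        # no n-gram indexed twice
    from_subs |= grams
assert from_subs == ngram_occurrences(doc, n)  # no n-gram missed
```

With m = 4 and n = 2 the start offsets advance by 3, giving the subsequences ABCD, DDAB, BBCD at offsets 0, 3, 6, as in the normalization example later in the talk.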

  12. Query Processing Using the n-Gram/2L Index
     • Algorithm
       • Step 1 (front-end index)
         • Finding the m-subsequences that cover the query string by searching the front-end index
       • Step 2 (back-end index)
         • Finding the documents that have a set of m-subsequences {Si} containing the query string by searching the back-end index

  13. • Definition 1: Cover
       S covers Q if an m-subsequence S and a query string Q satisfy one of the following four conditions:
       1. A suffix of S matches a prefix of Q
       2. The whole string of S matches a substring of Q
       3. A prefix of S matches a suffix of Q
       4. A substring of S matches the whole string of Q
     • Example
     <Figure: examples of the four conditions against a query Q; e.g., S = ACDD covers Q = CDD by condition 4, since the substring CDD of S matches the whole string of Q>
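
The four conditions translate directly into a predicate (a sketch; the min_len parameter is an assumption, since query processing locates covering m-subsequences via the front-end index, which implies an overlap of at least n characters):

```python
def covers(S, Q, min_len=1):
    """Definition 1: S covers Q if one of the four conditions holds.
    min_len is the smallest overlap accepted for conditions 1 and 3
    (an assumption of this sketch, not stated on the slide)."""
    if S in Q:          # condition 2: whole S matches a substring of Q
        return True
    if Q in S:          # condition 4: a substring of S matches the whole Q
        return True
    for k in range(min_len, min(len(S), len(Q))):
        if S[-k:] == Q[:k]:    # condition 1: suffix of S = prefix of Q
            return True
        if S[:k] == Q[-k:]:    # condition 3: prefix of S = suffix of Q
            return True
    return False

print(covers("ACDD", "CDD"))  # True: condition 4, as in the slide's example
```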

  14. • Definition 2 (brief): Expand
       The expand function expands a sequence of overlapping character sequences into one character sequence
     • Definition 3: Contain
       A set of m-subsequences {Si} contains a query string Q if {Si} and Q satisfy the following condition:
       Let SlSl+1...Sm be a sequence of m-subsequences in {Si} overlapping with each other. A substring of expand(SlSl+1...Sm) matches the whole string of Q
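
Definition 2 can be sketched as follows (assuming, as in the index, that each consecutive pair of sequences overlaps by a fixed n-1 characters):

```python
def expand(seqs, overlap):
    """Definition 2 (sketch): merge a chain of character sequences in which
    each consecutive pair overlaps by `overlap` characters; in the
    n-gram/2L index, consecutive m-subsequences overlap by n-1."""
    merged = seqs[0]
    for s in seqs[1:]:
        assert merged[-overlap:] == s[:overlap], "chain must overlap"
        merged += s[overlap:]
    return merged

# The 4-subsequences of document 0 (n = 2, so overlap = 1) expand back
# into the document itself:
print(expand(["ABCD", "DDAB", "BBCD"], 1))  # ABCDDABBCD
```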

  15. • Cases of containment
     <Figure: for Len(Q) ≥ m, a chain of overlapping m-subsequences {Si, Si+1, ..., Sj} contains Q (case 1); for Len(Q) < m, a single m-subsequence {Sk} can contain Q (case 2), or two overlapping m-subsequences {Sp, Sq} can contain Q (case 3)>

  16. • Lemma 1: A document that has a set of m-subsequences {Si} containing the query string Q includes at least one m-subsequence covering Q
       <A necessary condition> A document d has a set of m-subsequences {Si} containing Q ⇒ the document d has at least one m-subsequence covering Q
     • Algorithm (revisited)
       • Step 1 (front-end index)
         • Finding the m-subsequences that cover the query string by searching the front-end index, thereby retrieving candidate results satisfying the necessary condition
       • Step 2 (back-end index)
         • Finding the documents that have a set of m-subsequences {Si} containing the query string by searching the back-end index, thereby refining the candidate results
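
The two steps can be sketched end-to-end (a simplified refinement, not the paper's algorithm: candidates are the m-subsequences sharing an n-gram with Q, and Step 2 merges chains of consecutive candidate occurrences and substring-checks the merged string instead of performing the paper's offset-based join):

```python
def build_two_level(documents, m, n):
    """Back-end: m-subsequence -> [(d, [offsets in d])];
    front-end: n-gram -> [(m-subsequence, [offset in subsequence])]."""
    step = m - (n - 1)
    back = {}
    for d, text in enumerate(documents):
        per_doc, i = {}, 0
        while True:
            per_doc.setdefault(text[i:i + m], []).append(i)
            if i + m >= len(text):
                break
            i += step
        for sub, offs in per_doc.items():
            back.setdefault(sub, []).append((d, offs))
    front = {}
    for sub in back:
        for i in range(len(sub) - n + 1):
            front.setdefault(sub[i:i + n], []).append((sub, [i]))
    return front, back

def search(front, back, Q, m, n):
    step = m - (n - 1)
    # Step 1: candidate m-subsequences share at least one n-gram with Q
    # (a simplification of "covering" sufficient for this sketch).
    qgrams = {Q[i:i + n] for i in range(len(Q) - n + 1)}
    cand = {sub for g in qgrams for sub, _ in front.get(g, [])}
    # Step 2: per document, merge chains of consecutive candidate
    # occurrences (offsets step apart, so the overlap is n-1) and
    # check whether Q appears in the merged string.
    occ = {}
    for sub in cand:
        for d, offs in back[sub]:
            for o in offs:
                occ.setdefault(d, []).append((o, sub))
    hits = set()
    for d, lst in occ.items():
        lst.sort()
        merged, nxt = "", None
        for o, sub in lst:
            if nxt == o:
                merged += sub[n - 1:]
            else:
                if Q in merged:
                    hits.add(d)
                merged = sub
            nxt = o + step
        if Q in merged:
            hits.add(d)
    return hits

docs = ["ABCDDABBCD", "DABCDABCDA", "CDABBCDDAB",
        "BCDABCDABC", "DDABCDABCD", "BBCDABCDAB"]
front, back = build_two_level(docs, 4, 2)
print(sorted(search(front, back, "ABBC", 4, 2)))  # [0, 2]
```

The query ABBC occurs only in documents 0 and 2 of the example collection, and the sketch finds exactly those.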

  17. Formalization of the n-Gram/2L Index
     • We observe that the redundancy in the position information existing in the n-gram index is caused by non-trivial MultiValued Dependencies (MVDs)
     • We show that the n-gram/2L index can be derived by eliminating that redundancy through relational decomposition into Fourth Normal Form (4NF)

  18. MultiValued Dependency (MVD)
     • Definition [Ull1988]: Suppose we are given a relation schema R, and X and Y are subsets of R. X →→ Y holds in R if, whenever r is a relation for R and μ and ν are two tuples in r with μ[X] = ν[X] (that is, μ and ν agree on the attributes of X), then r also contains tuples φ and ψ, where
       • φ[X] = ψ[X] = μ[X] = ν[X]
       • φ[Y] = μ[Y] and φ[R-X-Y] = ν[R-X-Y]
       • ψ[Y] = ν[Y] and ψ[R-X-Y] = μ[R-X-Y]
     • Non-trivial MVD: Y ⊄ X and X ∪ Y ≠ R
     • Example
     <Figure: a relation R over attributes X, Y, and R-X-Y in which X →→ Y holds (e.g., the tuples a1b1c1, a1b2c2, a1b1c2, a1b2c1, a2b3c3, a2b3c4, a2b4c3, a2b4c4) is decomposed into R1(X, Y) and R2(X, R-X-Y), which are in 4NF>
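
The definition can be checked directly on a small relation (a sketch; mvd_holds is a hypothetical helper, not from the paper):

```python
from itertools import product

def mvd_holds(rows, X, Y):
    """Check whether the MVD X ->> Y holds in a relation given as a list of
    attribute->value dicts: for every pair of tuples agreeing on X, the
    tuple taking its Y-values from one and its remaining attributes from
    the other must also be present in the relation."""
    present = {tuple(sorted(t.items())) for t in rows}
    for t1, t2 in product(rows, repeat=2):
        if all(t1[a] == t2[a] for a in X):
            phi = dict(t1)       # phi agrees with t1 on X and on R-X-Y
            for a in Y:
                phi[a] = t2[a]   # ... and with t2 on Y
            if tuple(sorted(phi.items())) not in present:
                return False
    return True

# The example relation of the slide: all Y/Z combinations per X-value
r = [{"X": x, "Y": y, "Z": z}
     for x, ys, zs in [("a1", ["b1", "b2"], ["c1", "c2"]),
                       ("a2", ["b3", "b4"], ["c3", "c4"])]
     for y in ys for z in zs]
print(mvd_holds(r, ["X"], ["Y"]))       # True
print(mvd_holds(r[:-1], ["X"], ["Y"]))  # False: dropping a tuple breaks the MVD
```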

  19. Relational Representation for Theoretical Analysis
     • NDO relation
       • Converting the n-gram index so that it obeys the First Normal Form (1NF)
       • Having three attributes N, D, and O
         • N: n-grams
         • D: document identifiers
         • O: offsets of n-grams within documents
     • SNDO1O2 relation
       • Adding a new attribute S and splitting the attribute O into two attributes O1 and O2
       • Having five attributes S, N, D, O1, and O2
         • S: m-subsequences in which n-grams appear
         • O1: offsets of n-grams within m-subsequences
         • O2: offsets of m-subsequences within documents
     (n-gram index → NDO relation → SNDO1O2 relation)

  20. Example of Relational Representation
     (N: n-grams, D: document identifiers, O: offsets)
     <Figure: the 2-gram index over the six-document collection of slide 4 is normalized into the NDO relation (1NF); e.g., the posting AB 0, [0, 5] becomes the tuples (AB, 0, 0) and (AB, 0, 5)>

  21. (S: m-subsequences, O1: offsets of n-grams within m-subsequences, O2: offsets of m-subsequences within documents)
     <Figure: adding S, O1, and O2 to the NDO relation yields the SNDO1O2 relation; e.g., the NDO tuple (AB, 0, 0) becomes (ABCD, AB, 0, 0, 0), since AB occurs at offset 0 within the 4-subsequence ABCD, which occurs at offset 0 in document 0>
     • We see a Cartesian product of NO1 and DO2 in the SNDO1O2 relation

  22. Normalization of the n-Gram Index
     • Lemma 2: The non-trivial MVDs S →→ NO1 and S →→ DO2 hold in the SNDO1O2 relation
       • Proof (sketch):
         • The set of documents where an m-subsequence occurs and the set of n-grams extracted from that m-subsequence are independent of each other
         • Due to this independence, there exist tuples corresponding to all possible combinations of documents and n-grams for a given m-subsequence
     • Lemma 3: The decomposition (SNO1, SDO2) is in 4NF
       • Proof: see the paper
     • Theorem 2: The 4NF decomposition (SNO1, SDO2) of the SNDO1O2 relation is identical to the front-end and back-end indexes of the n-gram/2L index
       • Proof: see the paper
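
The decomposition and its losslessness can be illustrated on the tuples of a single m-subsequence (a sketch; decompose and natural_join are hypothetical helper names):

```python
def decompose(sndo):
    """4NF decomposition of SNDO1O2 tuples into SNO1 and SDO2 -- the two
    projections that Theorem 2 identifies with the front-end and back-end
    indexes."""
    sno1 = {(s, g, o1) for s, g, d, o1, o2 in sndo}
    sdo2 = {(s, d, o2) for s, g, d, o1, o2 in sndo}
    return sno1, sdo2

def natural_join(sno1, sdo2):
    """Join on S; losslessness means this reproduces the original relation."""
    return {(s, g, d, o1, o2)
            for s, g, o1 in sno1
            for s2, d, o2 in sdo2 if s2 == s}

# SNDO1O2 tuples for the 4-subsequence ABCD: every combination of its
# 2-grams (AB@0, BC@1, CD@2) with its document occurrences -- the
# Cartesian product that Lemma 2's MVDs describe.
sndo = {("ABCD", g, d, o1, o2)
        for g, o1 in [("AB", 0), ("BC", 1), ("CD", 2)]
        for d, o2 in [(0, 0), (3, 3), (4, 6)]}
sno1, sdo2 = decompose(sndo)
assert natural_join(sno1, sdo2) == sndo   # lossless decomposition
print(len(sndo), len(sno1) + len(sdo2))   # 9 6: product becomes sum
```

The tuple counts show the source of the size reduction: per subsequence, the original relation stores a product of n-gram and document occurrences while the decomposition stores their sum.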

  23. Example of Normalization Using Theorem 2
     <Figure: the SNDO1O2 relation of slide 21 decomposes into the SNO1 relation and the SDO2 relation (the number in parentheses after each m-subsequence is its identifier: ABCD (0), BBCD (1), BCDA (2), CDAB (3), DABC (4), DDAB (5)); denormalizing them yields the front-end and back-end indexes. E.g., in the back-end index the posting list of the 4-subsequence ABCD is 0, [0]  3, [3]  4, [6], and in the front-end index the posting list of the 2-gram AB is 0, [0]  3, [2]  4, [1]  5, [2], where 0, 3, 4, 5 are m-subsequence identifiers>

  24. Analysis of the n-Gram/2L Index
     • Optimal length mo
       • The length of the m-subsequence that minimizes the size of the n-gram/2L index
     • Notation
       • avgdoc: the average number of occurrences of an m-subsequence in the documents
       • avgngram: the average number of n-grams extracted from an m-subsequence
     <Figure: an m-subsequence linked to avgdoc document occurrences on one side and avgngram n-grams on the other>

  25. Index Size
     • Space complexities
       • n-gram index: O(avgdoc × avgngram)
       • n-gram/2L index: O(avgdoc + avgngram)
     • Properties
       • mo is obtained by finding the length m that makes avgdoc = avgngram
       • Both avgdoc and avgngram increase as the database size gets larger
     • Analytical results
       • The size of the n-gram/2L index is significantly reduced compared with that of the n-gram index for a large database
       • The reduction of the index size becomes more marked as the database size increases
       • See the paper for the detailed analysis

  26. Formulas for the index size
     • sizengram = Σs∈S (kngram(s) × kdoc(s))                                    (1)
     • sizefront = Σs∈S kngram(s)                                                (2)
     • sizeback = Σs∈S kdoc(s)                                                   (3)
     • sizengram / (sizefront + sizeback)
         = Σs∈S (kngram(s) × kdoc(s)) / Σs∈S (kngram(s) + kdoc(s))
         ≈ |S| × (avgngram(S) × avgdoc(S)) / (|S| × (avgngram(S) + avgdoc(S)))   (4)
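
Formulas (1)-(3) can be evaluated directly on per-subsequence counts (a sketch; the counts below come from the running example, where each of the six 4-subsequences contains three 2-grams and occurs three times in the documents):

```python
def index_sizes(k_ngram, k_doc):
    """Formulas (1)-(3): k_ngram[s] = number of n-gram occurrences in
    m-subsequence s; k_doc[s] = number of occurrences of s in documents."""
    size_ngram = sum(k_ngram[s] * k_doc[s] for s in k_ngram)   # (1)
    size_front = sum(k_ngram.values())                         # (2)
    size_back = sum(k_doc.values())                            # (3)
    return size_ngram, size_front, size_back

subs = ["ABCD", "BBCD", "BCDA", "CDAB", "DABC", "DDAB"]
k_ngram = {s: 3 for s in subs}   # each 4-subsequence holds three 2-grams
k_doc = {s: 3 for s in subs}     # each occurs three times in the collection
sng, sf, sb = index_sizes(k_ngram, k_doc)
print(sng, sf, sb, sng / (sf + sb))  # 54 18 18 1.5
```

The ratio 54 / (18 + 18) = 1.5 agrees with formula (4): |S| = 6, avgngram = avgdoc = 3, so 6 × 9 / (6 × 6) = 1.5.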

  27. Query Performance
     • Time complexities
       • n-gram index: O(avgdoc × avgngram)
       • n-gram/2L index: O(avgdoc + avgngram)
     • Analytical results
       • The n-gram/2L index significantly improves the query performance over the n-gram index for a large database
       • The improvement of the query performance gets better as the database size increases
       • The query processing time increases only very slightly as the query length gets longer
         • It has been pointed out that the query performance of the n-gram index for long queries tends to be bad [Wil2003]
       • See the paper for the detailed analysis

  28. Formulas for the query performance
     • timengram = sizengram × (Len(Q) - n + 1)                                  (5)
     • timefront = sizefront × (Len(Q) - n + 1)                                  (6)
     • timeback = sizeback × ((Len(Q) - m + 1) + c) if Len(Q) ≥ m,
       and sizeback × ((m - Len(Q) + 1) + d) if Len(Q) < m                       (7)
     • timengram / (timefront + timeback) follows by substituting (5)-(7)        (8)
     where c and d are correction terms (sums over i = 0 to m-n-1 and over i = 0 to Len(Q)-n-1, respectively, each multiplied by 2); see the paper for their exact form

  29. Experiments
     • Measures
       • Index size
         • index size ratio = (the number of pages allocated for the n-gram index) / (the number of pages allocated for the n-gram/2L index)
       • Query performance
         • Number of page accesses
         • Wall clock time (ms)
     • Data sets
       • PROTEIN-DATA: a set of protein sequence databases used in bioinformatics
       • TREC-DATA: a set of English text databases used in information retrieval
     • Parameters
       • Data size = 10 MBytes, 100 MBytes, and 1 GByte
       • n = 3 (n-gram length) [Kuk1992, WZ2002]
       • m = 4 ~ 6 (m-subsequence length)
       • Len(Q) = 3, 6, 9, 12, 15, and 18 (query length)

  30. Index Size (PROTEIN-DATA)
     • The size of the n-gram/2L index is significantly reduced compared with that of the n-gram index
       • By up to 2.7 times in PROTEIN-1G
     • The reduction of the index size becomes more marked as the database size increases
       • Approximately 25% for PROTEIN-DATA each time the database size is increased tenfold (10 MBytes → 100 MBytes → 1 GByte)
     <Figure: index sizes for m = 4 ~ 6 with the optimal length mo marked>

  31. Query Performance (PROTEIN-DATA)
     • The n-gram/2L index significantly improves the query performance over the n-gram index
       • By up to 13.1 times in wall clock time (PROTEIN-1G)
     • The improvement gets better as the database size increases
       • 1.37 times in PROTEIN-100M; 6.65 times in PROTEIN-1G
     • The query processing time increases only very slightly as the query length gets longer
       • n-gram/2L index: 53% as Len(Q) grows from 3 to 18 (c.f. n-gram index: 32.9 times)
     <Figures: number of page accesses and query processing time on PROTEIN-1G; query processing time for Len(Q) = 3 ~ 18>

  32. Conclusions
     • We have shown that the redundancy in the position information existing in the n-gram index is due to non-trivial MVDs
     • We have proposed the two-level structure of the n-gram index
     • We have shown that the n-gram/2L index is derived by the relational normalization process that decomposes the n-gram index into 4NF
     • We have provided a formal analysis of the space and time complexities of the n-gram/2L index
     • Finally, through extensive experiments, we have shown that the n-gram/2L index significantly reduces the index size and improves the query performance compared with the n-gram index

  33. References
     [BR1999] Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999.
     [Coh1997] Jonathan D. Cohen, "Recursive Hashing Functions for n-Grams," ACM Trans. on Information Systems, Vol. 15, No. 3, pp. 291-320, July 1997.
     [EN2003] Ramez Elmasri and Shamkant B. Navathe, Fundamentals of Database Systems, Addison Wesley, 4th ed., 2003.
     [Kuk1992] Karen Kukich, "Techniques for Automatically Correcting Words in Text," ACM Computing Surveys, Vol. 24, No. 4, pp. 377-439, Dec. 1992.
     [LA1996] Joon Ho Lee and Jeong Soo Ahn, "Using n-Grams for Korean Text Retrieval," In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Zurich, Switzerland, pp. 216-224, 1996.
     [MM2003] James Mayfield and Paul McNamee, "Single N-gram Stemming," In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Toronto, Canada, pp. 415-416, July/Aug. 2003.
     [MSL+2000] Ethan Miller, Dan Shen, Junli Liu, and Charles Nicholas, "Performance and Scalability of a Large-Scale N-gram Based Information Retrieval System," Journal of Digital Information, Vol. 1, No. 5, pp. 1-25, Jan. 2000.
     [MZ1996] Alistair Moffat and Justin Zobel, "Self-Indexing Inverted Files for Fast Text Retrieval," ACM Trans. on Information Systems, Vol. 14, No. 4, pp. 349-379, Oct. 1996.
     [Nav2001] Gonzalo Navarro, "A Guided Tour to Approximate String Matching," ACM Computing Surveys, Vol. 33, No. 1, pp. 31-88, Mar. 2001.
     [Ram1998] Raghu Ramakrishnan, Database Management Systems, McGraw-Hill, 1998.

  34. [SKS2001] Abraham Silberschatz, Henry F. Korth, and S. Sudarshan, Database System Concepts, McGraw-Hill, 4th ed., 2001.
     [SWY+2002] Falk Scholer, Hugh E. Williams, John Yiannis, and Justin Zobel, "Compression of Inverted Indexes for Fast Query Evaluation," In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Tampere, Finland, pp. 222-229, Aug. 2002.
     [Ull1988] Jeffrey D. Ullman, Principles of Database and Knowledge-Base Systems, Vol. I, Computer Science Press, USA, 1988.
     [Wil2003] Hugh E. Williams, "Genomic Information Retrieval," In Proc. the 14th Australasian Database Conference, 2003.
     [WLL+2005] Kyu-Young Whang, Min-Jae Lee, Jae-Gil Lee, Min-Soo Kim, and Wook-Shin Han, "Odysseus: a High-Performance ORDBMS Tightly-Coupled with IR Features," In Proc. the 21st IEEE Int'l Conf. on Data Engineering (ICDE), Tokyo, Japan, Apr. 2005.
     [WMB1999] I. Witten, A. Moffat, and T. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Morgan Kaufmann, Los Altos, California, 2nd ed., 1999.
     [WVT1990] Kyu-Young Whang, Brad T. Vander-Zanden, and Howard M. Taylor, "A Linear-Time Probabilistic Counting Algorithm for Database Applications," ACM Trans. on Database Systems, Vol. 15, No. 2, pp. 208-229, June 1990.
     [WZ2002] Hugh E. Williams and Justin Zobel, "Indexing and Retrieval for Genomic Databases," IEEE Trans. on Knowledge and Data Engineering, Vol. 14, No. 1, pp. 63-78, Jan./Feb. 2002.
     [YT1998] Ogawa Yasushi and Matsuda Toru, "Optimizing Query Evaluation in n-gram Indexing," In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Melbourne, Australia, pp. 367-368, 1998.
