1 / 35

File organization

File organization. File Organizations (Indexes). Choices for accessing data during query evaluation Scan the entire collection Typical in early (batch) retrieval systems Computational and I/O costs are O (characters in collection) Practical for only “small” text collections

tanika
Download Presentation

File organization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. File organization

  2. File Organizations (Indexes) • Choices for accessing data during query evaluation • Scan the entire collection • Typical in early (batch) retrieval systems • Computational and I/O costs are O(characters in collection) • Practical for only “small” text collections • Large memory systems make scanning feasible • Use indexes for direct access • Evaluation time O(query term occurrences in collection) • Practical for “large” collections • Many opportunities for optimization • Hybrids: Use small index, then scan a subset of the collection

  3. Indexes • What should the index contain? • Database systems index primary and secondarykeys • This is the hybrid approach • Index provides fast access to a subset of database records • Scan subset to find solution set • IR Problem: • Cannot predict keys that people will use in queries • Every word in a document is a potential search term • IR Solution: Index by all keys (words) full text indexes

  4. Indexes • Index is accessed by the atoms of a query language, The atoms are called “features” or “keys” or “terms” • Most common feature types: • – Words in text, punctuation • – Manually assigned terms (controlled and uncontrolled vocabulary) • – Document structure (sentence and paragraph boundaries) • – Inter- or intra-document links (e.g., citations) • Composed features • – Feature sequences (phrases, names, dates, monetary amounts) • – Feature sets (e.g., synonym classes) • Indexing and retrieval models drive choices • – Must be able to construct all components of those models • Indexing choices (there is no “right” answer) • Tokenization,Case folding (e.g., New vs new, Apple vs apple), Stopwords (e.g., the, a, its), Morphology (e.g., computer, computers, computing, computed) • Index granularity has a large impact on speed and effectiveness

  5. Index Contents • The contents depend upon the retrieval model • Feature presence/absence • Boolean • Statistical (tf, df, ctf, doclen, maxtf) • Often about 10% the size of the raw data, compressed • Positional • Feature location within document • Granularities include word, sentence, paragraph, etc • Coarse granularities are less precise, but take less space • Word-level granularity about 20-30% the size of the raw data,compressed

  6. Indexes: Implementation • Common implementations of indexes • Bitmaps • Signature files • Inverted files • Common index components • Dictionary (lexicon) • Postings • document ids • word positions No positional data indexed

  7. Indexes: Bitmaps • Bag-of-words index only • For each term, allocate vector with one bit per document • If feature present in document n, set nth bit to 1, otherwise 0 • Boolean operations very fast • Space efficient for common terms (why?) • Space inefficient for rare terms (why?) • Good compression with run-length encoding (why?) • Not widely used

  8. Indexes: Signature Files • Bag-of-words only • For each term, allocate fixed size s-bit vector (signature) • Define hash function: • Multiple functions: word → 1..s [selects which bits to set] • Each term has an s-bit signature • may not be unique! • OR the term signatures to form document signature • Long documents are a problem (why?) • Usually segment them into smaller pieces / blocks

  9. Signature File Example

  10. Signature File Example

  11. Indexes: Signature Files • At query time: • Lookup signature for query (how?) • If all corresponding 1-bits are “on” in document signature, document probablycontains that term • Vary s to control P (false positive) • Note space tradeoff • Optimal s changes as collection grows (why?) • Widely studied but Not widely used

  12. Indexes: Inverted Lists • Inverted lists are currently the most common indexing technique • Source file: collection, organized by document • Inverted file: collection organized by term • one record per term, listing locations where term occurs • During evaluation, traverse lists for each query term • OR: the union of component lists • AND: an intersection of component lists • Proximity: an intersection of component lists • SUM: the union of component lists; each entry has a score

  13. Inverted Files

  14. Inverted Files

  15. Word-Level Inverted File

  16. How big is the index? • For an n word collection: • Lexicon • Heaps’ Law: V = O(nβ), 0.4 < β< 0.6 • TREC-2: 1 GB text, 5 MB lexicon • Postings • at most, one per occurrence of the word in the text: O(n)

  17. Inverted Search Algorithm • Find query elements (terms) in the lexicon • Retrieve postings for each lexicon entry • Manipulate postings according to the retrieval model

  18. Word-Level Inverted File lexicon posting Query: 1.porridge & pot (BOOL)2.“porridge pot” (BOOL) 3. porridge pot (VSM) Answer

  19. Lexicon Data Structures According to Heaps’ Law, the size of lexicon maybe very large • Hash table • O(1) lookup, with constant h() and collision handling • Supports exact-match lookup • May be complex to expand • B-Tree • On-disk storage with fast retrieval and good caching behavior • Supports exact-match and range-based lookup • O(log n) lookups to find a list • Usually easy to expand

  20. In-memory Inversion Algorithm • Create an empty lexicon • For each document d in the collection, • Read document, parse into terms • For each indexing term t, • fd,t = frequency of t in d • If t is not in lexicon, insert it • Append <d, fd,t> to postings list for t • Output each postings list into inverted file • For each term, start new file entry • Append each <d,fd,t> to the entry • Compress entry • Write entry out to file.

  21. Complexity of In-memory Inv. • Time: O(n) for n-byte text • Space • Lexicon: space for unique words + offsets • Postings, 10 bytes per entry • document number: 4 bytes • frequency count: 2 bytes (allows 65536 max occ) • "next" pointer: 4 bytes • Is this affordable? • For 5GB collection, at 10 bytes/entry, for 400M entries, need 4GB of main memory

  22. Main inverted file 多路归并 chunk Idea 1: Partition the text • Invert a chunk of the text at a time • Then, merge each sub-indexes into one complete index

  23. Idea 2: Sort-based Inversion Invert in two passes: • Output records <t, d, ft> to a temp. file • Sort the records using external merge sort • read a chunk of the temp file • sort it using Quicksort • write it back into the same place • then merge-sort the chunks in place • Read sorted file, and write inverted file

  24. Access Optimizations • Skip lists: • A table of contents to the inverted list • Embedded pointers that jump ahead n documents (why is this useful?) • Separating presence information from location information • Many operators only need presence information • Location information takes substantial space (I/O) • If split, • reduced I/O for presence operators • increased I/O for location operators (or larger index) • Common in CD-ROM implementations

  25. Inverted file compression • Inverted lists are usually compressed • Inverted files with word locations are about the size of the raw data • Distribution of numbers is skewed • Most numbers are small (e.g., word locations, term frequency) • Distribution can be made more skewed easily • Delta encoding: 5, 8, 10, 17 → 5, 3, 2, 7 • Simple compression techniques are often the best choice • Simple algorithms nearly as effective as complex algorithms • Simple algorithms much faster than complex algorithms • Goal: Time saved by reduced I/O > Time required to uncompress Compress  Huffman encoding

  26. Inverted file compression • The longst lists, which take up the most space, have the most frequent (probable) words. • Compressing the longest lists would save the most space. • The longest lists should compress easily because they contain the least information (why?) • Algorithms: • Delta encoding • Variable-length encoding • Unary codes • Gamma codes • Delta codes • Variable-Byte Code • Golomb code

  27. Inverted List Indexes: Compression Delta Encoding ("Storing Gaps") • Reduces range of numbers. • keep d in ascending order • 3, 5, 20, 21, 23, 76, 77, 78 • becomes: 3, 2, 15, 1, 2, 53, 1, 1 • Produces a more skewed distribution. • Increases probability of smaller numbers. • Stemming also increases the probability of smaller numbers. (Why?)

  28. Variable-Byte Code • Binary, but use minimum number of bytes • 7 bits to store value, 1 bit to flag if there is another byte • 0 < x < 128: 1 byte • 128 < x < 16384: 2 bytes • 16384 < x < 2097152: 3 bytes • Integral byte sizes for easy coding • Very effective for medium-sized numbers • A little wasteful for very small numbers • Best trade-off between smallest index size and fastest query time

  29. Index/IR toolkits • The Lemur Toolkit for Language Modeling and Information Retrieval • Java-based indexing and search technology,provide web search application software http://www.lemurproject.org/ http://lucene.apache.org/nutch/

  30. Summary • Common implementations of indexes • Bitmap, signature file, inverted file • Common index components • Lexicon & postings • Inverted Search Algorithm • Inversion algorithm • Inverted file access optimizations • compression

  31. The End

  32. Vector Space Model • 文档d和查询q在向量空间中表示为两个m维向量,每维度的权值用TF∙IDF,其相似度用向量夹角余弦度量,有: (使用原始的tf,idf公式)

  33. Vocabulary Growth (Heap’s Law) • How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus? • Vocabulary has no upper bound due to proper names, typos, etc. • New words occur less frequently as vocabulary grows • If V is the size of the vocabulary and the n is the length of the corpus in words: • V = Knβ (0< β <1) • Typical constants: • K ≈ 10−100 • β ≈ 0.4−0.6 (approx. square-root of n)

  34. Query Answer • 1.porridge & pot (BOOL) • d2 • 2.“porridge pot” (BOOL) • null • 3. porridge pot (VSM) • d2 > d1>d5 • Next page

  35. 3. porridge pot (VSM) N=6,f(d1)表示d1内词频 back

More Related