
Document Indexing

Document indexing is the process of associating or tagging documents with different “search” terms. Content: index construction; scaling index construction; sort-based index construction; BSBI (blocked sort-based indexing).

gaia

Presentation Transcript


  1. Document Indexing
     Document indexing is the process of associating or tagging documents with different “search” terms.
     Content:
     • Index construction
     • Scaling index construction
     • Sort-based index construction
     • BSBI: Blocked sort-based Indexing

  2. Sec. 4.2 Index construction
     Steps:
     • Parse the documents and extract words.
     • Store each extracted word with its document ID.
     Doc 1: THERE are growing signs that Hurricane Andrew, unwelcome as it was for the devastated inhabitants of Florida and Louisiana.
     Doc 2: HURRICANE Andrew, claimed to be the costliest natural disaster in US history, yesterday smashed its way through the state of Louisiana.
     Fig: Sample Indexing
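
The two steps on this slide can be sketched in a few lines of Python. This is only an illustration, not the presenter's code: the function name `extract_pairs` and the tokenization rule (lowercased runs of letters) are assumptions, and the document text is copied as it appears in the transcript.

```python
import re

# The two sample documents from the slide, as given in the transcript.
docs = {
    1: "THERE are growing signs that Hurricane Andrew, unwelcome as it was "
       "for the devastated inhabitants of Florida and Louisiana.",
    2: "HURRICANE Andrew, claimed to be the costliest natural disaster in US "
       "history, yesterday smashed its way through the state of Louisiana.",
}

def extract_pairs(docs):
    """Parse each document and emit (term, doc_id) pairs in reading order."""
    pairs = []
    for doc_id, text in docs.items():
        # Assumption: a term is a lowercased run of ASCII letters.
        for term in re.findall(r"[a-z]+", text.lower()):
            pairs.append((term, doc_id))
    return pairs

pairs = extract_pairs(docs)

# Sorting by term, then by document ID, groups each term's occurrences
# together, which is the starting point for building postings lists.
sorted_pairs = sorted(set(pairs))
```

Terms that occur in both documents, such as "hurricane", "andrew" and "louisiana", end up as adjacent entries in `sorted_pairs`, one per document ID.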

  3. Sec. 4.2 Term Document Indexing

  4. Sec. 4.2 Scaling index construction

  5. Sec. 4.2 Sort-based index construction: some issues

  6. Sec. 4.2 BSBI: Blocked sort-based Indexing (Sorting with fewer disk seeks)

  7. Sec. 4.2

  8. Sec. 4.2 Pseudocode: BSBIndexConstruction
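
The pseudocode for this slide did not survive in the transcript. As a stand-in, here is a minimal Python sketch of blocked sort-based indexing as it is commonly described: collect (termID, docID) pairs into fixed-size blocks, sort each block in memory, write it to disk as a run, then merge all runs. The helper names (`write_run`, `read_run`, `bsbi_index`), the use of `pickle` for run files, and the choice to return the final index as an in-memory dict are assumptions made to keep the example small; a real implementation would write the merged postings back to disk.

```python
import heapq
import os
import pickle
import tempfile
from itertools import groupby

def write_run(sorted_pairs, directory):
    """Write one sorted block (a run) to disk and return its path."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run")
    with os.fdopen(fd, "wb") as f:
        for pair in sorted_pairs:
            pickle.dump(pair, f)
    return path

def read_run(path):
    """Stream (term_id, doc_id) pairs back from a run file, one at a time."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

def bsbi_index(pair_stream, block_size, directory):
    """Build {term_id: [doc_id, ...]} from a stream of (term_id, doc_id) pairs."""
    runs, block = [], []
    for pair in pair_stream:
        block.append(pair)
        if len(block) == block_size:      # block is full: sort it and spill to disk
            runs.append(write_run(sorted(block), directory))
            block = []
    if block:                             # spill the final, partially filled block
        runs.append(write_run(sorted(block), directory))

    # Merge all sorted runs in one pass; only one pair per run is held in memory.
    merged = heapq.merge(*(read_run(p) for p in runs))
    index = {}
    for term_id, group in groupby(merged, key=lambda p: p[0]):
        postings = []
        for _, doc_id in group:
            if not postings or postings[-1] != doc_id:   # drop duplicate doc IDs
                postings.append(doc_id)
        index[term_id] = postings
    return index

# Usage with made-up pairs and a tiny block size so that three runs are created.
with tempfile.TemporaryDirectory() as tmp:
    sample = [(3, 1), (1, 1), (2, 2), (1, 2), (3, 2), (1, 1)]
    index = bsbi_index(sample, block_size=2, directory=tmp)
```

Note that this sketch merges all runs at once with `heapq.merge` rather than pairwise; that choice avoids repeated passes over the data.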

  9. Analysis of BSBI

  10. Sec. 4.2 Applying Merge Sort
     • Can do binary merges, with a merge tree of ⌈log₂ 10⌉ = 4 layers.
     • During each layer, read into memory runs in blocks of 10M, merge, write back.
     Fig: runs 1 to 4 on disk ("Runs being merged") combined into a "Merged run".
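
To see where the four layers come from, the sketch below merges runs pairwise, layer by layer, and counts the layers. It assumes the 10 in the slide is the number of sorted runs; the function name `binary_merge_all` and the in-memory lists standing in for on-disk runs are illustrative only.

```python
import heapq
import math

def binary_merge_all(runs):
    """Merge sorted runs two at a time; return (merged_run, number_of_layers).

    Assumes `runs` contains at least one sorted list.
    """
    layers = 0
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), 2):
            # Merge runs i and i+1; an unpaired last run is carried over as-is.
            next_runs.append(list(heapq.merge(*runs[i:i + 2])))
        runs = next_runs
        layers += 1
    return runs[0], layers

# Ten small sorted runs standing in for ten sorted blocks on disk.
runs = [[i, i + 10, i + 20] for i in range(10)]
merged, layers = binary_merge_all(runs)

# The run count shrinks 10 -> 5 -> 3 -> 2 -> 1, i.e. ceil(log2(10)) layers.
expected_layers = math.ceil(math.log2(10))
```

Each layer reads and rewrites every record once, so the number of layers directly determines how many full passes over the data the merge costs.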

  11. Sec. 4.2 Some Issues with Merge Sort based Indexing

  12. Reference
     • C. D. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
