Document Indexing
Document indexing is the process of associating or tagging documents with different “search” terms.
Content:
• Index construction
• Scaling index construction
• Sort-based index construction
• BSBI: Blocked sort-based Indexing
Sec. 4.2 Index construction
Steps:
• Parse the documents and extract words.
• Store extracted words with document-ID.
Doc 1: THERE are growing signs that Hurricane Andrew, unwelcome as it was for the devastated inhabitants of Florida and Louisiana.
Doc 2: HURRICANE Andrew, claimed to be the costliest natural disaster in US history, yesterday smashed its way through the state of Louisiana.
Fig: Sample Indexing
Sec. 4.2 Term Document Indexing
Sec. 4.2 Scaling index construction
Sec. 4.2 Sort-based index construction: some issues
Sec. 4.2 BSBI: Blocked sort-based Indexing (Sorting with fewer disk seeks)
Sec. 4.2 Pseudo Code: BSBI Index Construction
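The pseudocode from this slide did not survive in the text, so what follows is only a rough sketch of the general blocked sort-based idea: collect (term, document ID) pairs into fixed-size blocks, sort each block in memory, write it to disk as a sorted run, then merge the runs. The names `bsbi_index`, `write_run`, `read_run`, and `block_size` are hypothetical. Two simplifications are assumed: terms are kept as strings rather than mapped to term IDs, and the runs are merged in one multi-way pass with `heapq.merge` instead of a tree of binary merges.

```python
import heapq
import os
import tempfile
from itertools import groupby

def write_run(block, path):
    """Sort one block in memory and write it to disk as a sorted run."""
    block.sort()
    with open(path, "w", encoding="utf-8") as f:
        for term, doc_id in block:
            f.write(f"{term}\t{doc_id}\n")

def read_run(path):
    """Stream (term, doc_id) pairs back from a run, one line at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, doc_id = line.rstrip("\n").split("\t")
            yield term, int(doc_id)

def bsbi_index(pairs, block_size, tmp_dir):
    """Build term -> sorted list of doc IDs using on-disk sorted runs."""
    run_paths, block = [], []
    for pair in pairs:
        block.append(pair)
        if len(block) == block_size:          # block is full: flush it
            path = os.path.join(tmp_dir, f"run{len(run_paths)}.txt")
            write_run(block, path)
            run_paths.append(path)
            block = []
    if block:                                 # flush the last partial block
        path = os.path.join(tmp_dir, f"run{len(run_paths)}.txt")
        write_run(block, path)
        run_paths.append(path)

    # Merge all sorted runs lazily; only one pair per run is held at a time.
    merged = heapq.merge(*(read_run(p) for p in run_paths))
    index = {}
    # Assumption: the final index fits in memory here; a real system
    # would write the merged postings back to disk instead.
    for term, group in groupby(merged, key=lambda p: p[0]):
        index[term] = sorted({doc_id for _, doc_id in group})
    return index

pairs = [("hurricane", 2), ("andrew", 1), ("andrew", 2),
         ("louisiana", 1), ("hurricane", 1), ("louisiana", 2)]
with tempfile.TemporaryDirectory() as tmp:
    index = bsbi_index(pairs, block_size=2, tmp_dir=tmp)
```

Each run is read sequentially during the merge, which is the point of the "fewer disk seeks" remark: random access is replaced by a few long sequential scans.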
Sec. 4.2 Applying Merge Sort
• Can do binary merges, with a merge tree of ⌈log₂ 10⌉ = 4 layers.
• During each layer, read runs into memory in blocks of 10M, merge, and write back.
Fig: Runs 1 to 4 on disk, showing the runs being merged and the merged run.
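The layer count can be checked with a small sketch. Here in-memory sorted lists stand in for the on-disk runs (an assumption made to keep the example short), and `binary_merge_layers` is a hypothetical helper that merges runs two at a time and counts how many passes are needed.

```python
import heapq
import math

def binary_merge_layers(runs):
    """Merge sorted runs pairwise; return (merged_run, number_of_layers)."""
    layers = 0
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), 2):
            # Merge two neighbouring runs; an odd run out is carried over.
            next_runs.append(list(heapq.merge(*runs[i:i + 2])))
        runs = next_runs
        layers += 1
    merged = runs[0] if runs else []
    return merged, layers

# Ten sorted runs, as in the slide's example.
runs = [list(range(i, 100, 10)) for i in range(10)]
merged, layers = binary_merge_layers(runs)

# 10 -> 5 -> 3 -> 2 -> 1 runs, i.e. ceil(log2(10)) = 4 layers.
expected_layers = math.ceil(math.log2(10))
```

Every layer reads and rewrites the whole collection once, so the total I/O grows with the number of layers; this is one reason a single multi-way merge is often preferred over binary merges.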
Sec. 4.2 Some Issues with Merge Sort based Indexing
Reference
• Information Retrieval, Cambridge University Press, 2008.