
Highly Scalable Algorithm for Distributed Real-time Text Indexing



  1. Highly Scalable Algorithm for Distributed Real-time Text Indexing
  Ankur Narang, Vikas Agarwal, Monu Kedia, Vijay Garg
  IBM Research - India
  Email: {annarang, avikas, monkedia}@in.ibm.com, garg@ece.utexas.edu

  2. Agenda
  • Background
  • Challenges in Scalable Indexing
  • In-memory Index Data Structure Design
  • Parallel Indexing Algorithm
  • Parallel Pipelined Indexing
  • Asymptotic Time Complexity Analysis
  • Experimental Results
    • Strong Scalability
    • Weak Scalability
    • Search Performance
  • Conclusions & Future Work

  3. Background
  • Data Intensive Supercomputing is gaining strong research momentum
    • Large-scale computations over massive and changing data sets
    • Multiple domains: telescope imagery, online transaction records, financial markets, medical records, weather prediction
  • Massive-throughput real-time text indexing and search
    • Massive data arriving at high rates (~1-10 GB/s)
    • Index expected to age off at regular intervals
  • Architectural innovations
    • Massively parallel / many-core architectures
    • Storage-class memories with tens of terabytes of storage
  • Requirement for very high indexing rates and stringent search response times
  • Optimizations needed to
    • Maximize indexing throughput
    • Minimize indexing latency (time from indexing to search, per document)
    • Sustain search performance

  4. Background – Index for Text Search (e.g., Lucene)
  • Lucene index overview
    • A Lucene index covers a set of documents
    • A document is a sequence of fields
    • A field is a sequence of terms
    • A term is a text string
  • A Lucene index consists of one or more segments
    • Each segment covers a subset of the documents
    • Each segment is a fully independent index
  • Index updates are serialized
  • Multiple index searches can proceed concurrently
  • Supports simultaneous index update and search
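  As a hedged illustration of this document/field/term/segment hierarchy (plain Python, not Lucene's actual API; the Segment and Index names are invented for the sketch):

```python
from collections import defaultdict

class Segment:
    """A fully independent inverted index over a subset of documents."""
    def __init__(self):
        # term -> {doc_id -> [positions]}
        self.postings = defaultdict(dict)

    def add_document(self, doc_id, fields):
        # fields: {field_name: "text ..."}; a field is a sequence of terms
        for field_name, text in fields.items():
            for pos, term in enumerate(text.split()):
                self.postings[term].setdefault(doc_id, []).append(pos)

class Index:
    """An index is a collection of segments covering disjoint document sets."""
    def __init__(self):
        self.segments = []

    def search(self, term):
        # Searches may scan all segments concurrently; shown serially here.
        hits = {}
        for seg in self.segments:
            hits.update(seg.postings.get(term, {}))
        return hits

seg = Segment()
seg.add_document(1, {"body": "scalable distributed text indexing"})
idx = Index(); idx.segments.append(seg)
print(idx.search("indexing"))   # {1: [3]}
```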

  5. Challenges to Scalable In-memory Distributed Real-time Indexing
  • Scalability issues in typical approaches
    • Merge-sort of the sorted terms in the input segments to generate the list of Terms and TermInfos for the merged segment
    • Merging and re-organization of the Document-List and Position-List of the input segments
    • Load imbalance increases with the number of processors
    • The index-merge process quickly becomes the bottleneck
    • Large indexing latency
  • Index data structure design challenge
    • Inherent trade-offs between index size, indexing throughput, and search throughput
    • Trade-off between indexing latency, search response time, and throughput
  • Performance objective
    • Maximize indexing performance while sustaining search performance (both search response time and throughput)

  6. Scalability Issues With Typical Indexing Approaches
  (Figure: two input segments, Segment(1) and Segment(2), each holding Term(T(i)), TermInfo(T(i)), a Document-List of Doc/Freq pairs, and a Position-List. The merged segment is produced in two steps. Step(1): merge-sort of terms and creation of the new TermInfo. Step(2): merge of the Document-Lists and Position-Lists, e.g. Document-List Doc(1)/F1, Doc(2)/F2, Doc(3)/F3, Doc(4)/F4 and Position-List p11, p12, p21, p22, p23, p31, p32, p41, p42, p43.)
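  The serialized two-step merge sketched in the figure is roughly the following (an illustrative sketch, not the CLucene implementation); because a single process performs the merge-sort and list re-organization, it becomes the bottleneck as the number of producers grows:

```python
import heapq

def merge_segments(seg_a, seg_b):
    """Typical two-step segment merge.

    seg_a / seg_b: {term: (doc_list, pos_list)} with terms kept sorted.
    Step 1: merge-sort the sorted term lists.
    Step 2: merge (concatenate) the document lists and position lists per term.
    """
    merged = {}
    for term in heapq.merge(sorted(seg_a), sorted(seg_b)):
        if term in merged:            # term already emitted from the other segment
            continue
        docs, positions = [], []
        for seg in (seg_a, seg_b):
            if term in seg:
                d, p = seg[term]
                docs.extend(d)
                positions.extend(p)
        merged[term] = (docs, positions)
    return merged

seg1 = {"text": ([1, 2], ["p11", "p12", "p21"])}
seg2 = {"text": ([3, 4], ["p31", "p41", "p42"])}
print(merge_segments(seg1, seg2))
```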

  7. In-memory Indexing Data Structure Design
  • Two-level hierarchical index data structure
    • Top-level hash table: GHT (Global Hash Table)
      • Represents the complete index for a given set of documents
      • Maps: term => second-level hash table (IHT)
    • Second-level hash table: IHT (Interval Hash Table)
      • Represents the index for an interval of documents with contiguous IDs
      • Maps: term => list of documents containing that term
      • Postings data also stored
  • Advantages of the design
    • No re-organization of data is needed while merging an IHT into the GHT
    • Merge-sort is eliminated and replaced by hash operations
    • Efficient encoding of an IHT reduces its memory requirements
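  A minimal sketch of the two-level design under the assumptions above (plain Python dictionaries stand in for the paper's encoded structures); note that merging an IHT into the GHT only appends per-term, per-interval entries, so existing postings are never re-organized:

```python
from collections import defaultdict

class IHT:
    """Interval Hash Table: index over documents with contiguous IDs."""
    def __init__(self, first_doc, last_doc):
        self.interval = (first_doc, last_doc)
        # term -> [(doc_id, freq, [positions]), ...]
        self.postings = defaultdict(list)

    def add(self, term, doc_id, positions):
        self.postings[term].append((doc_id, len(positions), positions))

class GHT:
    """Global Hash Table: term -> per-interval postings, keyed by doc interval."""
    def __init__(self):
        self.table = defaultdict(list)   # term -> [(interval, postings), ...]

    def merge_iht(self, iht):
        # Pure hash operations: append each term's interval postings.
        # No merge-sort, no re-organization of previously merged data.
        for term, plist in iht.postings.items():
            self.table[term].append((iht.interval, plist))

    def lookup(self, term):
        return [p for _interval, plist in self.table[term] for p in plist]

iht = IHT(0, 9)
iht.add("indexing", 3, [5, 17])
ght = GHT()
ght.merge_iht(iht)
print(ght.lookup("indexing"))   # [(3, 2, [5, 17])]
```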

  8. Interval Hash Table (IHT): Concept
  (Figure: a hash table with term collision resolution; HF(Ti) hashes term Ti to a bucket, and the IHT data for Ti spans documents Di, Di+1, ..., Dj, storing per document the DocID, frequency, and positions array.)

  9. Global Hash Table (GHT): Concept
  (Figure: a document-interval indexed hash table with term collision resolution; HF(Ti) hashes term Ti to a bucket, and each entry maps the term to document intervals Dj-Dk, storing per document the DocID, frequency, and positions array.)

  10. Encoded IHT Representation: sub-arrays, what each represents, and its size
  • Number of distinct terms in each hash table entry (size: # hash table entries)
  • Term IDs (size: # distinct terms in IHT)
  • Number of docs in which each term occurred (size: # distinct terms in IHT)
  • Document IDs per term (size: # docs/term * # terms)
  • Term frequency in each document (size: # docs/term * # terms)
  • Offset into position information (size: # docs/term * # terms)
  • Steps to access term positions from (TermID(Ti), DocID(Dj)): Get NumTerms from TermKey(Ti), GetTermID(Ti), GetNumDocs(Ti), GetDocIDs(Ti), GetNumTerms(Dj), Get offset into position data(Ti, Dj)
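  A hedged sketch of how a lookup can walk such a flat encoding; the exact sub-array order, encoding, and accessor names in the paper may differ, so the layout below is an illustrative assumption only:

```python
class EncodedIHT:
    """Illustrative flat encoding of an IHT as parallel arrays.

    The array layout loosely follows the sub-array list above; the real
    encoding in the paper may pack these differently.
    """
    def __init__(self, postings):
        # postings: {term: [(doc_id, [positions]), ...]}
        self.terms = []        # term IDs, one per distinct term
        self.num_docs = []     # number of docs per term
        self.doc_ids = []      # doc IDs, concatenated per term
        self.freqs = []        # term frequency per (term, doc)
        self.offsets = []      # offset into the positions array per (term, doc)
        self.positions = []    # all position data, concatenated
        for term, plist in postings.items():
            self.terms.append(term)
            self.num_docs.append(len(plist))
            for doc_id, pos in plist:
                self.doc_ids.append(doc_id)
                self.freqs.append(len(pos))
                self.offsets.append(len(self.positions))
                self.positions.extend(pos)

    def term_positions(self, term, doc_id):
        t = self.terms.index(term)                           # find the term
        start = sum(self.num_docs[:t])                       # first (term, doc) slot of Ti
        for slot in range(start, start + self.num_docs[t]):  # scan doc IDs of Ti
            if self.doc_ids[slot] == doc_id:
                off, freq = self.offsets[slot], self.freqs[slot]
                return self.positions[off:off + freq]        # offset into position data
        return []

iht = EncodedIHT({"text": [(1, [4, 9]), (2, [7])]})
print(iht.term_positions("text", 2))   # [7]
```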

  11. New Indexing Algorithm
  • Main steps of the indexing algorithm (see the sketch below)
    • A posting table (LHT) is constructed for each document without sorting its terms
    • The posting tables of k documents are merged into an IHT, which is then encoded
    • The encoded IHTs are merged into a single GHT using efficient hash operations
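  A minimal end-to-end sketch of these three steps (plain Python; the LHT, IHT, and GHT here are simple dictionaries rather than the paper's encoded structures, and whitespace tokenization is an assumption):

```python
from collections import defaultdict

def build_lht(doc_id, text):
    """Step 1: per-document posting table (LHT); no sorting of terms."""
    lht = defaultdict(list)
    for pos, term in enumerate(text.split()):
        lht[term].append(pos)
    return doc_id, lht

def merge_into_iht(lhts):
    """Step 2: merge the posting tables of k documents into one IHT."""
    iht = defaultdict(list)          # term -> [(doc_id, freq, positions)]
    for doc_id, lht in lhts:
        for term, positions in lht.items():
            iht[term].append((doc_id, len(positions), positions))
    return iht

def merge_into_ght(ght, iht, interval):
    """Step 3: merge an (encoded) IHT into the GHT with hash operations only."""
    for term, plist in iht.items():
        ght[term].append((interval, plist))

ght = defaultdict(list)
docs = [(0, "real time text indexing"), (1, "distributed text indexing")]
iht = merge_into_iht(build_lht(d, t) for d, t in docs)
merge_into_ght(ght, iht, interval=(0, 1))
print(ght["indexing"])
```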

  12. GHT Construction from an IHT
  (Figure: a new encoded IHT(g) arrives; step S1 extracts its distinct terms Ti, Tj, ...; for each term, HF(Ti) locates the corresponding Global Hash Table entry, and steps S2(a)/S2(b) append IHT(g) to that entry's array of IHTs / encoded IHT array.)

  13. Parallel Group-based Indexing Algorithm
  (Figure: incoming documents are distributed across index groups I0, I1, I2, I3, I4; each indexing group indexes its share of the documents, while a search group directs queries to the index groups.)
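  An illustrative sketch of group-based indexing and query fan-out; the slide does not specify the document-assignment policy, so the round-robin routing below is an assumption:

```python
class IndexGroup:
    """One index group: indexes its share of the documents."""
    def __init__(self, gid):
        self.gid = gid
        self.index = {}                      # term -> [doc_id, ...]

    def index_document(self, doc_id, text):
        for term in set(text.split()):
            self.index.setdefault(term, []).append(doc_id)

    def search(self, term):
        return self.index.get(term, [])

groups = [IndexGroup(g) for g in range(5)]   # I0 .. I4

def route(doc_id):
    # Assumption: round-robin assignment of documents to index groups.
    return groups[doc_id % len(groups)]

for doc_id, text in enumerate(["scalable indexing", "real time search",
                               "distributed indexing pipeline"]):
    route(doc_id).index_document(doc_id, text)

def search_all(term):
    # The search group fans the query out to every index group and merges hits.
    return sorted(d for g in groups for d in g.search(term))

print(search_all("indexing"))   # [0, 2]
```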

  14. Pipeline Diagram (Distributed Indexing Algorithm)
  (Figure: timeline of the pipeline; in each round, Producer(1), Producer(2), and Producer(3) produce IHTs/segments and send them, a barrier synchronization closes the round, and the Consumer merges the received IHTs/segments while the producers work on the next round.)
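  A simplified sketch of the produce/send/merge pipeline using Python threads and a queue (the actual system runs on Blue Gene/L with message passing; the per-round barrier synchronization is omitted here for brevity):

```python
import threading, queue
from collections import defaultdict

NUM_PRODUCERS, NUM_ROUNDS = 3, 2
iht_queue = queue.Queue()
ght = defaultdict(list)

def producer(pid):
    # Each round: build an IHT over this producer's documents and send it.
    for rnd in range(NUM_ROUNDS):
        iht = {f"term{pid}": [(rnd * 10 + pid, 1, [0])]}
        iht_queue.put(((pid, rnd), iht))
    iht_queue.put(None)                       # this producer is done

def consumer():
    done = 0
    while done < NUM_PRODUCERS:
        item = iht_queue.get()
        if item is None:
            done += 1
            continue
        interval, iht = item
        for term, plist in iht.items():       # merge the IHT into the GHT
            ght[term].append((interval, plist))

threads = [threading.Thread(target=producer, args=(p,)) for p in range(NUM_PRODUCERS)]
threads.append(threading.Thread(target=consumer))
for t in threads: t.start()
for t in threads: t.join()
print(sorted(ght))                            # ['term0', 'term1', 'term2']
```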

  15. Asymptotic Time Complexity Analysis
  • Definitions
    • Size of the indexing group: |G| = |P| + 1, where P is the set of Producers and there is a single Consumer
    • "n" Produce-Consume rounds, with |P| Producers and a single Consumer in each round
    • Prod(j, i): total time of the j-th Producer in the i-th round
      • ProdComp(j, i): compute time; ProdComm(j, i): communication time
    • Cons(i): total time of the Consumer in the i-th round
      • ConsComp(i): compute time; ConsComm(i): communication time
  • Distributed indexing time (see the sketch below)
    • T(distributed) = X + Y + Z, where
      • X = max_j ProdComp(j, 1)
      • Y = Σ_{i=2..n} max( max_j Prod(j, i), Cons(i−1) )
      • Z = Cons(n)
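  A small sketch that evaluates T(distributed) from per-round timings; the timing values are invented for illustration, and round-1 communication time is treated as negligible so that X can be computed from the same per-round totals:

```python
def distributed_indexing_time(prod, cons):
    """T(distributed) = X + Y + Z for a pipeline of |P| producers and 1 consumer.

    prod[j][i]: time of producer j in round i (the formula uses only the
                compute part in round 1; totals are used here for simplicity)
    cons[i]:    time of the consumer in round i
    """
    n = len(cons)
    x = max(p[0] for p in prod)                        # X: first produce round
    y = sum(max(max(p[i] for p in prod), cons[i - 1])  # Y: overlapped middle rounds
            for i in range(1, n))
    z = cons[n - 1]                                    # Z: final merge round
    return x + y + z

# Illustrative timings (seconds) for 3 producers over 4 rounds.
prod = [[2.0, 2.1, 1.9, 2.0],
        [1.8, 2.2, 2.0, 1.9],
        [2.1, 2.0, 2.1, 2.0]]
cons = [1.5, 1.6, 1.4, 1.5]
print(distributed_indexing_time(prod, cons))
```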

  16. Asymptotic Time Complexity Analysis
  • Overall indexing time depends on the balance of the pipeline stages
  • Two cases
    • The produce phase dominates the merge phase
    • The merge phase dominates the produce phase
  • Time complexity upper bounds
    • Case (1): production time per round > merge time per round
      • T(Pghtl) = O(R / |P|)
      • T(Porgl) = O((R / |P|) * log(k))
    • Case (2): merge time per round > production time per round
      • T(Pghtl) = O(R / k)
      • T(Porgl) = O((R / k) * log(|P|))

  17. Experimental Setup
  • Original CLucene codebase (v0.9.20)
  • Porgl implementation
    • Distributed in-memory indexing algorithm using RAMDirectory
    • Distributed search implementation
  • Pghtl implementation
    • Implementation of the IHT and GHT data structures
    • Distributed indexing and search algorithm implementation
  • IBM intranet website data
    • Text data extracted from HTML files
    • Loaded equally into the memory of the producer nodes
  • Experiments run on Blue Gene/L
    • Up to 16K processor nodes, 2 PPC 440 cores per node
    • Co-processor mode: 1 compute core, 1 router core
    • High-bandwidth 3D torus interconnect
  • For Porgl, "k" is chosen so that only one segment is created from all the text data fed to a Producer, giving its best indexing throughput

  18. Strong Scalability Comparison: Pghtl vs Porgl

  19. SpeedUp Comparison: Pghtl vs Porgl

  20. Weak Scalability Comparison: Pghtl vs Porgl

  21. Scalability With Data Size: Pghtl vs Porgl

  22. Indexing Latency Variation: Pghtl vs. Porgl

  23. Search Performance Comparison (Single Index Group)

  24. Conclusions & Future Work
  • High-throughput text indexing demonstrated for the first time at such a large scale
  • Architecture-independent design of the new data structures
  • Algorithm for distributed in-memory real-time group-based text indexing
    • Better load balance, low communication cost, and good cache performance
  • Proved analytically that the parallel time complexity of our indexing algorithm is asymptotically better, by at least a log(P) factor, than typical indexing approaches
  • Experimental results
    • 3× - 7× improvement in indexing throughput and around 10× better indexing latency on Blue Gene/L
    • Peak indexing throughput of 312 GB/min on 8K nodes
    • Estimate: 5 TB/min on 128K nodes
  • Future work: distributed search optimizations
