Who Needs All Those Indexes? One is Enough

Bruce Lindsay, IBM Almaden Research Center, bgl@almaden.ibm.com

Presentation Transcript


  1. Who Needs All Those Indexes? One is Enough. Bruce Lindsay, IBM Almaden Research Center, bgl@almaden.ibm.com

  2. Stored Data is Heterogeneous
     • Most stored data is NOT well structured
       • Text & semi-structured
       • Sparse, multi-valued, & multi-occurrence attributes
     • Much value latent in un-structured data
     • Text analytic tools can extract value
       • Beyond the words: names, roles, concepts, …
     • Text analytics: searching for meaning in the content
       • Semantic & knowledge-driven analysis
       • Expensive: big dictionaries, byte-by-byte, big inputs and outputs
       • Stateless → easy scale-out

  3. Text Analytics
     [Diagram labels: Object, analytic1, analytic2, to Index, Data Source, Dictionary, Attributes/Values]
     • Derive {<attribute, value, position>} from inputs
       • Language, words (stems, part-of-speech, …)
       • Context (title, bold, anchor text, …)
       • Concepts (person, organization, role, product, …)
       • Classification (complaint, fraud, spam, xxx, …)
       • Meta-data (to/from, subject, date, title, abstract, reference, …)
     • Domain- and customer-specific analysis offers the most value
     • Attributes produced by the analytics induce the index schema
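
The derivation step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Posting` tuple, the two toy analytics, and the `CONCEPTS` dictionary are hypothetical stand-ins and are not taken from the talk. It shows each analytic emitting <attribute, value, position> tuples, and the set of attributes that appear becoming the index schema.

```python
import re
from typing import Iterator, NamedTuple


class Posting(NamedTuple):
    attribute: str
    value: str
    position: int  # token offset within the object


# Hypothetical dictionary mapping surface forms to concept attributes.
CONCEPTS = {"ibm": "organization", "lindsay": "person"}


def tokens(text: str) -> list:
    return re.findall(r"\w+", text.lower())


def word_analytic(text: str) -> Iterator[Posting]:
    # Emits one <'word', token, position> tuple per token.
    for pos, tok in enumerate(tokens(text)):
        yield Posting("word", tok, pos)


def concept_analytic(text: str) -> Iterator[Posting]:
    # Emits <concept, token, position> for tokens found in the dictionary.
    for pos, tok in enumerate(tokens(text)):
        if tok in CONCEPTS:
            yield Posting(CONCEPTS[tok], tok, pos)


def analyze(text: str, analytics=(word_analytic, concept_analytic)) -> list:
    derived = []
    for analytic in analytics:
        derived.extend(analytic(text))
    return derived


postings = analyze("Lindsay works at IBM")
# The attributes that the chosen analytics happen to produce form the schema.
index_schema = sorted({p.attribute for p in postings})
```

Swapping in a different set of analytics changes `index_schema` without any separate schema definition, which is the sense in which the analytics induce the schema.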

  4. Text Indexing
     • Logical index over <attribute, value, object, position>
       • MANY entries per object
       • Large index, even with aggressive compression
       • Non-transactional
     • Scale-out needed
       • Capacity: single index too big for one (commodity) node
       • Ingest throughput: concurrent insert to index fragments
       • Query response: fan-out / fan-in for query parallelism
     • Query
       • Predicates over <attribute, value> matches
       • Match scoring: magic weighting of predicate importance & position
       • Query planning & optimization probably needed
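
A minimal sketch of a fragmented index with fan-out / fan-in querying follows. Several choices here are assumptions made for illustration and are not stated on the slide: fragments are chosen by hashing the object id (so every posting for an object lands in one fragment), predicates are combined conjunctively, and the score is simply weight times occurrence count. The class and method names are hypothetical, and the fan-out runs sequentially rather than in parallel.

```python
import zlib
from collections import defaultdict


class IndexFragment:
    def __init__(self):
        # (attribute, value) -> {object_id: [positions]}
        self.postings = defaultdict(lambda: defaultdict(list))

    def insert(self, attribute, value, obj, position):
        self.postings[(attribute, value)][obj].append(position)

    def match(self, predicates):
        """Return {object_id: score} for objects matching every predicate."""
        result = None
        for key, weight in predicates.items():
            hits = self.postings.get(key, {})
            scored = {obj: weight * len(pos) for obj, pos in hits.items()}
            if result is None:
                result = scored
            else:
                # Keep only objects that also matched earlier predicates.
                result = {o: result[o] + s for o, s in scored.items() if o in result}
        return result or {}


class FragmentedIndex:
    def __init__(self, n_fragments):
        self.fragments = [IndexFragment() for _ in range(n_fragments)]

    def insert(self, attribute, value, obj, position):
        # Assumption: partition by object id with a stable hash.
        n = zlib.crc32(obj.encode()) % len(self.fragments)
        self.fragments[n].insert(attribute, value, obj, position)

    def query(self, predicates, k=10):
        merged = {}
        for fragment in self.fragments:               # fan-out
            merged.update(fragment.match(predicates))  # fan-in
        return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


index = FragmentedIndex(n_fragments=3)
index.insert("word", "index", "doc1", 0)
index.insert("word", "index", "doc1", 5)
index.insert("person", "lindsay", "doc1", 2)
index.insert("word", "index", "doc2", 1)

single = index.query({("word", "index"): 1.0})
both = index.query({("word", "index"): 1.0, ("person", "lindsay"): 2.0})
```

Partitioning by object lets each fragment evaluate a conjunctive query locally, so the fan-in step only merges and ranks. Partitioning by attribute or value instead would require cross-fragment intersection.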

  5. What about Data Processing? select / project / join / aggregate
     • Add “value” postings to the index for keys and measures: <‘attrVal’, attribute, object, value>
     • Select: {<attr1, val1>} → {obj1}
     • Project: {<‘attrVal’, keyAttr, obj1>} → {val2}
     • Join: {<keyAttr, val2>} → {obj2}
     • Project: {<‘attrVal’, measAttr, obj2>} → {measVal}
     • Aggregation: sum({measVal})
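
The five steps above can be traced with a small in-memory index. This is a sketch under assumptions: the customer/order data, the attribute names (`region`, `custId`, `amount`), and the helper functions are hypothetical, chosen only to make the select → project → join → project → aggregate chain concrete. Both ordinary postings and ‘attrVal’ value postings live in the same dictionary.

```python
from collections import defaultdict

index = defaultdict(list)  # one index holds both kinds of posting


def add(obj, attribute, value, value_posting=False):
    # Ordinary posting: <attribute, value> -> object
    index[(attribute, value)].append(obj)
    if value_posting:
        # Value posting: <'attrVal', attribute, object> -> value
        index[("attrVal", attribute, obj)].append(value)


def select(attribute, value):
    return set(index.get((attribute, value), []))


def project(attribute, objs):
    return [v for o in sorted(objs) for v in index.get(("attrVal", attribute, o), [])]


def join(attribute, values):
    return {o for v in set(values) for o in select(attribute, v)}


# Hypothetical data: customers carry a key, orders carry a key and a measure.
for cust, region, cust_id in [("c1", "west", 17), ("c2", "west", 18), ("c3", "east", 19)]:
    add(cust, "region", region)
    add(cust, "custId", cust_id, value_posting=True)
for order, cust_id, amount in [("o1", 17, 10), ("o2", 17, 5), ("o3", 18, 7), ("o4", 19, 100)]:
    add(order, "custId", cust_id)
    add(order, "amount", amount, value_posting=True)

west = select("region", "west")       # Select:  {<region, west>} -> {obj1}
keys = project("custId", west)        # Project: key values of obj1
matched = join("custId", keys)        # Join:    {<custId, val2>} -> {obj2}
amounts = project("amount", matched)  # Project: measure values of obj2
total = sum(amounts)                  # Aggregation
```

In this sketch the join also returns the customer objects themselves, since they carry the same `custId` postings. They contribute nothing to the sum because they have no `amount` value posting.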

  6. Architecture
     [Architecture diagram. Components: storeMgr, ObjStore (files), Obj Queue, Analytics (several), Indexer (several), IndexFragment (several, …scale-out…), queryPlanner, queryDriver. Flows shown: Obj → storeMgr; Query → queryPlanner, queryDriver → ranked results.]
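
One possible reading of this architecture, as a single-process sketch: the store manager saves each object and enqueues it, a worker runs analytics on queued objects, and an indexer writes the derived tuples to one of several index fragments, which the query driver then fans out over. The wiring between components is an assumption inferred from the component names. The function names, the whitespace tokenizer, and the hash partitioning are all hypothetical.

```python
import queue
import zlib
from collections import defaultdict

N_FRAGMENTS = 2
obj_store = {}             # stands in for the ObjStore files
obj_queue = queue.Queue()  # objects waiting for analytics
fragments = [defaultdict(set) for _ in range(N_FRAGMENTS)]


def store_mgr(obj_id, text):
    # Persist the object, then hand it to the analytics stage.
    obj_store[obj_id] = text
    obj_queue.put(obj_id)


def analytics(text):
    # Toy analytic: one <'word', token, position> tuple per token.
    return [("word", w, pos) for pos, w in enumerate(text.lower().split())]


def indexer(obj_id, derived):
    # Assumption: the fragment is chosen by a stable hash of the object id.
    fragment = fragments[zlib.crc32(obj_id.encode()) % N_FRAGMENTS]
    for attribute, value, position in derived:
        fragment[(attribute, value)].add((obj_id, position))


def drain():
    # Stateless analytics workers could run this loop in parallel.
    while not obj_queue.empty():
        obj_id = obj_queue.get()
        indexer(obj_id, analytics(obj_store[obj_id]))


def query_driver(attribute, value):
    hits = set()
    for fragment in fragments:  # fan-out over fragments, fan-in of results
        hits |= {obj for obj, _ in fragment.get((attribute, value), set())}
    return sorted(hits)


store_mgr("a", "one index")
store_mgr("b", "index fragments")
drain()
results = query_driver("word", "index")
```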

  7. Conclusions
     • Derived value from un-structured objects
       • Much value latent in un-structured data
       • Value extracted via analytic tools
       • Value captured in scalable index
       • Value exploited via query and data processing
     • Architecture
       • Index independent object store schema
       • Application choice of object analytics induces index schema
       • Scaled-out analytics and index
