Who Needs All Those Indexes? One is Enough
Bruce Lindsay
IBM Almaden Research Center
bgl@almaden.ibm.com
Stored Data is Heterogeneous
• Most stored data is NOT well structured
  • Text & semi-structured
  • Sparse, multi-valued, & multi-occurrence attributes
• Much value latent in un-structured data
  • Text analytic tools can extract value
  • Beyond the words: names, roles, concepts, …
• Text analytics: searching for meaning in the content
  • Semantic & knowledge-driven analysis
  • Expensive: big dictionaries, byte-by-byte, big inputs and outputs
  • Stateless: easy scale-out
Text Analytics
[Figure labels: Data Source, Object, analytic1, analytic2, Dictionary, Attributes/Values, to Index]
• Derive {<attribute, value, position>} from inputs
  • Language, words (stems, part-of-speech, …)
  • Context (title, bold, anchor text, …)
  • Concepts (person, organization, role, product, …)
  • Classification (complaint, fraud, spam, xxx, …)
  • Meta-data (to/from, subject, date, title, abstract, reference, …)
• Domain- and customer-specific analysis offers most value
• Analytics-produced attributes induce index schema
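The derivation of {<attribute, value, position>} tuples can be sketched roughly as follows. This is a minimal illustration, not the system described in the talk: the `Posting` type, the two toy analytics, and the tiny `CONCEPTS` dictionary are all hypothetical names introduced here, and real analytics would be far more expensive and domain-specific.

```python
import re
from typing import Iterator, List, NamedTuple


class Posting(NamedTuple):
    attribute: str
    value: str
    position: int  # token offset within the object


# Hypothetical concept dictionary (assumption): surface string -> concept type.
CONCEPTS = {"bruce lindsay": "person", "ibm": "organization"}


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def word_analytic(text: str) -> Iterator[Posting]:
    """Emit one <'word', token, position> tuple per token."""
    for pos, token in enumerate(tokenize(text)):
        yield Posting("word", token, pos)


def concept_analytic(text: str) -> Iterator[Posting]:
    """Emit <concept type, phrase, position> for dictionary matches."""
    tokens = tokenize(text)
    for n in (2, 1):  # try two-token phrases, then single tokens
        for pos in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[pos:pos + n])
            if phrase in CONCEPTS:
                yield Posting(CONCEPTS[phrase], phrase, pos)


def analyze(text: str, analytics=(word_analytic, concept_analytic)) -> List[Posting]:
    """Run every analytic over the object; analytics are stateless."""
    postings: List[Posting] = []
    for analytic in analytics:
        postings.extend(analytic(text))
    return postings


postings = analyze("Bruce Lindsay works at IBM")
```

The set of attribute names the chosen analytics emit (here `word`, `person`, `organization`) is what ends up defining the index schema in this sketch.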
Text Indexing
• Logical index over <attribute, value, object, position>
  • MANY entries per object
  • Large index, even with aggressive compression
  • Non-transactional
• Scale-out needed
  • Capacity: single index too big for one (commodity) node
  • Ingest throughput: concurrent insert to index fragments
  • Query response: fan-out / fan-in for query parallelism
• Query
  • Predicates over <attribute, value> matches
  • Match scoring: magic weighting of predicate importance & position
  • Query planning & optimization probably needed
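A fragmented index with fan-out / fan-in querying might look something like the sketch below. Everything here is an assumption made for illustration: the class names, partitioning by a CRC of the object id, AND-only predicates, and a score that is simply a weighted occurrence count (the slide's "magic weighting" is not specified). Fragments are plain in-process objects rather than separate nodes.

```python
import zlib
from collections import defaultdict


class IndexFragment:
    """One partition of the logical <attribute, value, object, position> index."""

    def __init__(self):
        # (attribute, value) -> {object: [positions]}
        self.postings = defaultdict(lambda: defaultdict(list))

    def insert(self, attribute, value, obj, position):
        self.postings[(attribute, value)][obj].append(position)

    def evaluate(self, predicates, weights):
        """Return {object: score} for objects matching ALL predicates."""
        hits = None
        for pred in predicates:
            objs = set(self.postings.get(pred, {}))
            hits = objs if hits is None else hits & objs
        scores = {}
        for obj in hits or ():
            # Assumed scoring: weighted count of matching positions.
            scores[obj] = sum(
                weights.get(p, 1.0) * len(self.postings[p][obj]) for p in predicates
            )
        return scores


class ScaledIndex:
    """Partitions postings by object so each fragment can answer locally."""

    def __init__(self, n_fragments):
        self.fragments = [IndexFragment() for _ in range(n_fragments)]

    def _fragment_for(self, obj):
        return self.fragments[zlib.crc32(obj.encode("utf-8")) % len(self.fragments)]

    def insert(self, attribute, value, obj, position):
        self._fragment_for(obj).insert(attribute, value, obj, position)

    def query(self, predicates, weights=None):
        weights = weights or {}
        merged = {}
        for fragment in self.fragments:  # fan-out (sequential in this sketch)
            merged.update(fragment.evaluate(predicates, weights))  # fan-in
        return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))


index = ScaledIndex(n_fragments=3)
index.insert("word", "index", "doc1", 0)
index.insert("word", "index", "doc1", 5)
index.insert("word", "scale", "doc1", 2)
index.insert("word", "index", "doc2", 1)
index.insert("word", "scale", "doc3", 0)

both = index.query([("word", "index"), ("word", "scale")])
single = index.query([("word", "index")])
```

Partitioning by object keeps all postings for one object in the same fragment, so conjunctive predicates can be evaluated per fragment and the partial results merged without cross-fragment intersection. Partitioning by attribute or value would be a different trade-off.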
What about Data Processing? select / project / join / aggregate
• Add “value” postings to index for keys and measures: <‘attrVal’, attribute, object, value>
• Select: {<attr1, val1>} → {obj1}
• Project: {<‘attrVal’, keyAttr, obj1>} → {val2}
• Join: {<keyAttr, val2>} → {obj2}
• Project: {<‘attrVal’, measAttr, obj2>} → {measVal}
• Aggregation: sum({measVal})
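The select / project / join / aggregate chain above can be walked through with a toy index. This is only a sketch under stated assumptions: the `ToyIndex` class, the customer/order objects, and the attribute names `region`, `custId`, and `amount` are invented for the example, and the 'attrVal' postings are modeled as a second dictionary keyed by (attribute, object).

```python
from collections import defaultdict


class ToyIndex:
    def __init__(self):
        self.match = defaultdict(set)      # <attr, value> -> {object}
        self.attr_val = defaultdict(list)  # <'attrVal', attr, object> -> [value]

    def add(self, obj, attr, value):
        self.match[(attr, value)].add(obj)
        self.attr_val[(attr, obj)].append(value)

    def select(self, attr, value):
        """{<attr, value>} -> {obj}"""
        return set(self.match.get((attr, value), ()))

    def project(self, attr, objs):
        """{<'attrVal', attr, obj>} -> {value}"""
        return [v for obj in sorted(objs) for v in self.attr_val.get((attr, obj), ())]

    def join(self, key_attr, values):
        """{<keyAttr, value>} -> {obj}: objects carrying any of the key values."""
        return set().union(*(self.select(key_attr, v) for v in set(values)))


idx = ToyIndex()
# Hypothetical objects: two customers and three orders.
idx.add("c1", "region", "west"); idx.add("c1", "custId", 17)
idx.add("c2", "region", "east"); idx.add("c2", "custId", 18)
idx.add("o1", "custId", 17); idx.add("o1", "amount", 10)
idx.add("o2", "custId", 17); idx.add("o2", "amount", 5)
idx.add("o3", "custId", 18); idx.add("o3", "amount", 7)

obj1 = idx.select("region", "west")   # Select
val2 = idx.project("custId", obj1)    # Project key values
obj2 = idx.join("custId", val2)       # Join on the key attribute
meas = idx.project("amount", obj2)    # Project measure values
total = sum(meas)                     # Aggregation
```

Note that the join step also returns the customer object itself, since it carries the same key value; it drops out at the final projection because it has no `amount` posting.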
Architecture
[Figure: architecture diagram. Labels: Obj, storeMgr, ObjStore, file (×3), Obj Queue, Analytics, Indexer, IndexFragment, …scale-out…, Query, queryPlanner, queryDriver, ranked results]
Conclusions
• Derived value from un-structured objects
  • Much value latent in un-structured data
  • Value extracted via analytic tools
  • Value captured in scalable index
  • Value exploited via query and data processing
• Architecture
  • Index independent object store schema
  • Application choice of object analytics induces index schema
  • Scaled-out analytics and index