Indexing method

Indexing method Data Warehousing Lab. M.S. 3 HyunSuk Jung 2003.9.30

Contents • Index in Lore • DataGuides • Index Fabric • ToXin

Index in Lore • Value Index, Vindex • Locates atomic objects with a certain value • Text Index, Tindex • Locates string atomic values containing specific words or groups of words • Link Index, Lindex • Locates the parents of a specific object • Path Index, Pindex • Locates objects reachable via a given labeled path

1. Vindex • Satisfies basic comparisons, e.g., =, < • Query takes a triple (l, op, v): label, operator, value • Returns one object or a set of objects • Example • Suppose: a Vindex exists for label section • Query: values > 15.00 with incoming edge section • Result: {&3, &4}
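
The Vindex query above can be sketched in a few lines of Python. This is only an illustration: the objects and their values are hypothetical (chosen so the slide's result {&3, &4} comes out), and the lookup is a plain scan over a sorted list, not the index structure Lore actually uses.

```python
import operator

# Hypothetical atomic objects: oid -> (incoming edge label, atomic value)
OBJECTS = {
    "&3": ("section", 16.50),
    "&4": ("section", 20.00),
    "&5": ("section", 12.00),
    "&6": ("title", 99.00),
}

# Comparison operators a (l, op, v) query may use
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt,
       "<=": operator.le, ">=": operator.ge}


def build_vindex(objects, label):
    """Collect (value, oid) pairs for atomic objects with incoming edge `label`."""
    return sorted((value, oid) for oid, (l, value) in objects.items() if l == label)


def vindex_lookup(vindex, op, v):
    """Return the set of oids whose value satisfies `value op v`."""
    return {oid for value, oid in vindex if OPS[op](value, v)}


section_vindex = build_vindex(OBJECTS, "section")
result = vindex_lookup(section_vindex, ">", 15.00)
```

Here `result` holds the oids of the section objects whose value exceeds 15.00.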

2. Tindex • For keyword search • Query takes two values (w, l): word and label • Returns oid and posting: <o, n> • Example • Suppose: a Tindex exists for label section • Query: select objects containing the word “index” with incoming edge section • Result: {<&3, 1>, <&4, 2>}
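
A Tindex behaves like an inverted index keyed by (word, label). The sketch below assumes that n in the posting <o, n> is the 1-based position of the word inside the string value; the text values are hypothetical and were picked to reproduce the slide's result.

```python
from collections import defaultdict

# Hypothetical string objects: oid -> (incoming edge label, text value)
TEXT_OBJECTS = {
    "&3": ("section", "index structures"),
    "&4": ("section", "path index"),
    "&5": ("title", "index"),
}


def build_tindex(objects):
    """Map (word, label) -> list of postings (oid, word position)."""
    tindex = defaultdict(list)
    for oid, (label, text) in sorted(objects.items()):
        for n, word in enumerate(text.lower().split(), start=1):
            tindex[(word, label)].append((oid, n))
    return tindex


def tindex_lookup(tindex, w, l):
    """Return all postings <o, n> for word w on objects with incoming edge l."""
    return set(tindex.get((w.lower(), l), []))


tindex = build_tindex(TEXT_OBJECTS)
postings = tindex_lookup(tindex, "index", "section")
```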

3. Lindex • Retrieves the parents of an object via a given label • Query takes a “child” object c and a label l • Returns all parents p such that there is an l-labeled edge from p to c • Example • Suppose: all objects containing “index” have been located via the Tindex • Query: select parent objects via incoming edge section • Result: {&2}
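
The Lindex lookup can be pictured as a reverse edge map. The edge list below is hypothetical; it only mirrors the slide's example in which &3 and &4 are both reached from &2 through a section edge.

```python
from collections import defaultdict

# Hypothetical labeled edges: (parent oid, label, child oid)
EDGES = [
    ("&1", "chapter", "&2"),
    ("&2", "section", "&3"),
    ("&2", "section", "&4"),
    ("&1", "title", "&5"),
]


def build_lindex(edges):
    """Map (child, label) -> set of parents with a `label` edge to `child`."""
    lindex = defaultdict(set)
    for parent, label, child in edges:
        lindex[(child, label)].add(parent)
    return lindex


def lindex_lookup(lindex, c, l):
    """Return all parents p such that there is an l-labeled edge from p to c."""
    return set(lindex.get((c, l), set()))


lindex = build_lindex(EDGES)
# Parents of the objects found by the Tindex step, via incoming edge "section"
parents = set()
for child in ("&3", "&4"):
    parents |= lindex_lookup(lindex, child, "section")
```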

4. Pindex • Searches all objects reachable via path P • DataGuide: a dynamic structural summary of all possible paths • Stores OIDs and statistics • Example • Query: select book.chapter.section • Result: {&3, &4}

DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases Roy Goldman, Jennifer Widom VLDB 1997

Foundations (1/2) • Definition • Label path: a sequence of one or more dot-separated labels, l1.l2…ln, that can be traversed starting from an object. ex) label paths of object 1: Restaurant.name, Bar • Data path: l1.o1.l2.o2…ln.on, a label path with the object reached after each label. ex) data path of object 1: Restaurant.2.name.5 • Target set: t = {o | l1.o1.l2…ln.o}, the set of all objects reached by traversing a given label path. ex) target set of Restaurant.Entrée = {6, 10, 11}
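
The three definitions above can be made concrete with a small edge list. The graph is hypothetical: it is not the figure from the slide, only a minimal database in which the slide's examples (Restaurant.2.name.5 and the target set {6, 10, 11}) hold.

```python
# Hypothetical database as labeled edges: (parent oid, label, child oid)
DB_EDGES = [
    (1, "Restaurant", 2), (1, "Restaurant", 3), (1, "Bar", 4),
    (2, "name", 5), (2, "Entrée", 6),
    (3, "Entrée", 10), (3, "Entrée", 11),
]


def target_set(edges, start, label_path):
    """All objects reached by traversing the dot-separated label path from `start`."""
    current = {start}
    for label in label_path.split("."):
        current = {c for (p, l, c) in edges if p in current and l == label}
    return current


def data_paths(edges, start, label_path):
    """All data path instances l1.o1.l2.o2...ln.on of the label path."""
    partial = [(start, [])]
    for label in label_path.split("."):
        partial = [(c, trail + [label, str(c)])
                   for (node, trail) in partial
                   for (p, l, c) in edges if p == node and l == label]
    return sorted(".".join(trail) for _, trail in partial)


entrees = target_set(DB_EDGES, 1, "Restaurant.Entrée")
name_paths = data_paths(DB_EDGES, 1, "Restaurant.name")
```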

Foundations(2/2) • Database tree vs. schema tree

Concept of a DataGuide • Summary of label paths from the root (= simple paths) • Conciseness: describes every unique simple path exactly once, regardless of the number of times it appears • Accuracy: does not contain label paths that do not appear in the data • Convenience: can be stored and accessed using techniques similar to those available for processing semistructured data

Notice • A DataGuide contains no atomic values. • Since a DataGuide is intended to reflect the structure of a database, atomic values are unnecessary. • Every target set in a DataGuide is a singleton set. • Since any DataGuide label path has just one data path instance, the target set contains only one object.

Existence of Multiple DataGuides • Minimal DataGuides • (c) - smallest possible DataGuide • A minimal DataGuide is not always the best choice. • Incremental maintenance problem • Annotation problem <Figure 3. A source and two DataGuides>

Strong DataGuide <Figure: sources and their strong DataGuides> • If two simple paths reach the same set of nodes, they are represented by a single node. • Linear time and linear space for tree-structured data • Exponential time and exponential space, in the worst case, for graph-structured data
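
The rule that simple paths reaching the same set of nodes share one node resembles the subset construction used to determinize automata. The sketch below follows that idea; the source graph is an assumption loosely modeled on the slide's figure (node 1 with A edges to 2 and 4 and a B edge to 6), and DataGuide nodes are represented directly by their target sets.

```python
from collections import defaultdict


def strong_dataguide(edges, root):
    """Build a strong DataGuide as {target set: {label: target set}}."""
    out = defaultdict(lambda: defaultdict(set))   # parent -> label -> children
    for parent, label, child in edges:
        out[parent][label].add(child)

    guide = {}
    stack = [frozenset([root])]
    while stack:
        targets = stack.pop()
        if targets in guide:          # target set already has a DataGuide node
            continue
        by_label = defaultdict(set)
        for obj in targets:
            for label, children in out.get(obj, {}).items():
                by_label[label] |= children
        guide[targets] = {l: frozenset(cs) for l, cs in by_label.items()}
        stack.extend(guide[targets].values())
    return guide


# Hypothetical source graph: (parent, label, child)
SOURCE = [(1, "A", 2), (1, "A", 4), (1, "B", 6),
          (2, "C", 3), (4, "C", 5), (6, "C", 5)]
guide = strong_dataguide(SOURCE, 1)
```

Because every distinct target set becomes one node, a graph-shaped source can in the worst case produce a number of nodes exponential in its size, while a tree-shaped source yields at most one node per label path.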

Incremental maintenance • How to update • DataGuide before the dashed B edge is added: (b) • DataGuide after the dashed B edge is added: (c) • Since the target set of the B edge is also {2,3}, node 10 of (b) disappears and, as shown in (c), the B edge also points to node 9. <Figure: Insertion of an edge, strong DataGuide>

A Fast Index for Semistructured Data Brian F. Cooper, Neal Sample, Michael J. Franklin, Gísli R. Hjaltason, Moshe Shadmon VLDB 2001

Index Fabric • The Index Fabric indexes both paths and content of tree databases in a balanced hierarchy of Patricia tries. • Trie & Patricia trie <Trie> <Patricia Trie> • A Patricia trie strengthens the conventional trie through string compression. • Lossy compression: risk of false matches. Ex) inbox -> resolved by the annotated Index Fabric
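
The false-match risk of lossy compression can be seen in a toy character-level Patricia trie whose inner nodes record only the position at which the keys below them differ. This is a sketch under assumptions: the key set is invented, and false matches are caught here by comparing against the full key stored at the leaf, which is a stand-in and does not model the annotated Index Fabric mentioned on the slide.

```python
END = "\0"  # terminator so that a key may be a prefix of another key


def build(keys):
    """Return a leaf (the full key) or an inner node (position, {char: subtree})."""
    keys = sorted(set(keys))
    if len(keys) == 1:
        return keys[0]
    padded = [k + END for k in keys]
    shortest = min(len(p) for p in padded)
    # First character position at which the keys disagree
    pos = next(i for i in range(shortest) if len({p[i] for p in padded}) > 1)
    groups = {}
    for key, p in zip(keys, padded):
        groups.setdefault(p[pos], []).append(key)
    return (pos, {ch: build(group) for ch, group in groups.items()})


def lookup_unverified(node, key):
    """Follow only the discriminating characters; skipped characters are never checked."""
    padded = key + END
    while isinstance(node, tuple):
        pos, children = node
        ch = padded[pos] if pos < len(padded) else END
        if ch not in children:
            return None
        node = children[ch]
    return node


def lookup(node, key):
    """Verify the candidate leaf against the full key to rule out false matches."""
    hit = lookup_unverified(node, key)
    return hit if hit == key else None


trie = build(["inbox", "index", "input"])
false_hit = lookup_unverified(trie, "inbred")   # lands on the leaf for "inbox"
checked = lookup(trie, "inbred")                # rejected after verification
```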

Index Fabric • Tree Structured Data • Conceptually similar to strong DataGuide • Layered structure • Uses Patricia tries to index a large number of search keys • The simple path of an element which has a data value is encoded as a special character sequence • Keeps a key which is the combination of the encoded sequence and the data value.

Indexing XML with the Index Fabric • Designator • Ex) <invoice> <buyer><name> ABC Corp </name></buyer> </invoice> is encoded as “IBNABC Corp” • Raw paths • Index the hierarchical structure of the XML by encoding each root-to-leaf path as a string. Ex) <A>alpha<B>beta<C>gamma</C></B></A> • This document has three root-to-leaf paths: <A>alpha, <A><B>beta, <A><B><C>gamma
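
Raw-path encoding can be sketched with Python's standard XML parser. The designator table (invoice to I, buyer to B, name to N) is an assumption inferred from the key “IBNABC Corp” on the slide, and the function name is made up for this illustration; tail text after a child element is ignored to keep the sketch short.

```python
import xml.etree.ElementTree as ET


def raw_path_keys(xml_text, designators):
    """Return one key per element that carries text: designator path + value."""
    keys = []

    def walk(elem, prefix):
        path = prefix + designators[elem.tag]
        text = (elem.text or "").strip()
        if text:
            keys.append(path + text)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return keys


invoice_keys = raw_path_keys(
    "<invoice> <buyer><name> ABC Corp </name></buyer> </invoice>",
    {"invoice": "I", "buyer": "B", "name": "N"},   # assumed designators
)
abc_keys = raw_path_keys(
    "<A>alpha<B>beta<C>gamma</C></B></A>",
    {"A": "A", "B": "B", "C": "C"},                # tags used as their own designators
)
```

Each resulting string would then be inserted into the Patricia-trie layers as an ordinary search key.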

Indexing XML with the Index Fabric • Refined paths • Specialized paths through the XML that optimize frequently occurring access patterns. • Ex) “Find the invoices that company X sold to company Y” 1. Assign a designator such as “Z”. 2. Encode the information to be indexed. If “Acme Inc” sold goods to “ABC Corp”, the following key is generated: “Z ABC Corp Acme Inc” 3. Insert the generated key into the fabric. <Sample XML>
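
The three steps for a refined path can be sketched as follows. The invoice layout (buyer/name and seller/name elements), the document id, and the use of a plain dictionary in place of the fabric are all assumptions made for illustration; only the key format “Z ABC Corp Acme Inc” comes from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice; the sample XML from the slide is not reproduced here
INVOICE = """
<invoice>
  <buyer><name>ABC Corp</name></buyer>
  <seller><name>Acme Inc</name></seller>
</invoice>
"""


def refined_key(invoice_xml, designator="Z"):
    """Steps 1 and 2: prefix the assigned designator and encode buyer and seller."""
    invoice = ET.fromstring(invoice_xml)
    buyer = invoice.findtext("buyer/name").strip()
    seller = invoice.findtext("seller/name").strip()
    return f"{designator} {buyer} {seller}"


# Step 3: insert the key into the index (a dict stands in for the fabric)
fabric = {}
key = refined_key(INVOICE)
fabric.setdefault(key, []).append("invoice-1")   # hypothetical document id
```

A query for invoices sold by Acme Inc to ABC Corp then becomes a single key lookup instead of a join over two raw paths.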

Index Fabric vs. strong DataGuides • (d): the resulting index is too restricted, lacking references to parts of the database. • (c): compresses atomic database content but not structure, and also stores node IDs in inner index nodes.

Features • Data representation • Patricia trie indexing combined label and character strings. • Navigation • Top-down • Since no secondary content index is used, there is only a single combined look-up for structure and content. • Path templates • The pre-evaluated hits are inserted as refined paths into the same Index Fabric as the non-privileged raw paths. • They can be compared to materialized views on the document node reference

Experimental result • Comparison of the RDBMS's native indexing and the Index Fabric on the same RDBMS <Query> <Experimental results>

Experimental result • Comparison of the results for Queries B and D <Query B: find conference papers by author.> <Query D: find publications by co-authors>

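Conclusion • As the experimental results show, the Index Fabric performs better than the RDBMS's native indexes on long, complex strings. • The Index Fabric accommodates a large number of keys well and is not sensitive to key length or complexity.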

Indexing XML Data with ToXin Flavio Rizzolo, Alberto Mendelzon WebDB 2001

ToXin • [Rizzolo, Mendelzon: WebDB 01] • Tree Structured Data • Conceptually similar to strong DataGuide (not minimal DataGuide) • Supports both forward and backward navigation • Path Tree ( = strong DataGuide) • A node of the Path Tree has an Index Table or a Value Table • Index Table (IT): parent-child relationships • Value Table (VT): owner-value relationships

ToXin • Since ToXin keeps parent-child relationships, ToXin supports path expressions with value predicates • ex) /libraryDB/book[author = author1] • <Figure: Path Tree with nodes LibraryDB:IT, book:IT, paper:IT, title:VT, section, title:VT, chapter, author:VT, author:VT> • Index Tables • LibraryDB (parent, child): (null, 1) • LibraryDB.book (parent, child): (1, 2) • LibraryDB.paper (parent, child): (1, 6) • Value Tables • LibraryDB.book.author (parent, value) • author1 …
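
How index tables and value tables cooperate on a value predicate can be sketched as follows. The table contents are hypothetical apart from the (parent, child) rows readable on the slide, and the function names are invented for this illustration. The predicate is answered backward from the value table, then checked against the index table of the book path.

```python
# Index Tables: path -> list of (parent node, child node)
INDEX_TABLES = {
    "LibraryDB": [(None, 1)],
    "LibraryDB.book": [(1, 2), (1, 4)],        # (1, 4) is a hypothetical second book
    "LibraryDB.paper": [(1, 6)],
}

# Value Tables: path -> list of (owner node, value); rows are hypothetical
VALUE_TABLES = {
    "LibraryDB.book.author": [(2, "author1"), (4, "author2")],
}


def children(path, parent):
    """Forward navigation: nodes on `path` whose parent is `parent`."""
    return [c for p, c in INDEX_TABLES[path] if p == parent]


def parent_of(path, child):
    """Backward navigation: parent of a node on `path`, or None if absent."""
    return next((p for p, c in INDEX_TABLES[path] if c == child), None)


def select_with_predicate(path, attr, value):
    """Evaluate path[attr = value], e.g. /libraryDB/book[author = author1]."""
    owners = {owner for owner, v in VALUE_TABLES[f"{path}.{attr}"] if v == value}
    return sorted(c for _, c in INDEX_TABLES[path] if c in owners)


books = select_with_predicate("LibraryDB.book", "author", "author1")
```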
