180 likes | 376 Views
Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach. Charles Yu Database Architect Elance Inc. cyu@elance.com charles.yu@acm.org 2005-08. Background.
E N D
Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach Charles Yu Database Architect Elance Inc. cyu@elance.com charles.yu@acm.org 2005-08
Background • Node-Uniform Hierarchical (NUH for short) data can be visualized as a tree or forest graph where every node has the same set of attributes. • NUH data can be naturally represented in RDBMS by recursive tables where the parent-child relationship is implemented in a way that if record x is a child of record y, then the value of x’s parent_id column is the same as the value of the id column of y’s. • Standard SQL does not support for general query on NUH data in basic recursive tables. • Oracle comes with a native mechanism for general query on NUH data and beyond, known as connect-by. For all its elegancy and usefulness, it is short in two accounts: it is slow in cases due to the fact that the parent-child relationship cannot be directly indexed; and it is Oracle dependent in that SQL queries using connect-by cannot be easily adapted to RDBMS of other vendors.
Basic Recursive Table Design Basic Columns • Xid --system assigned unique id • Parent_xid --xid of parent of this entry • Entry_code --content unique identifier of the entry • Normal_stuff --one or more such columns for content values Some variant: Use a separate table to store the hierarchical relationship consisting essentially of two columns: xid/child_xid and parent_xid; and use FK to link the table to the main data table
Basic Recursive Table Query Mechanisms • Oracle-native Connect by • K-way self outer join (for up to level k depth) • Other??
Basic Idea of Path Code Approach • A node of a tree is fully determined by the path from the root to itself. • Path code as full representation of the path can be very compact in length, in the order of logarithmic of total size of the tree. • Path Code can be maintained dynamically feasibly. • Path code permits direct indexing.
Path-code enhanced recursive table design Basic Columns • xid • parent_xid • path_code --code of the path for the node (detail later) • entry_level --level of the record in the tree the entry belongs to • sibling_no --sequence no of the child entry_code with respect to the parent • is_leaf --1/0 for being a leaf/not a leaf • Entry_code --content unique identifier of the entry • normal_stuff --one or more
Value Setting for H columns (I) • Parent_xid set as usual • Sibling_no can be set according to any ordering, e.g. according to entry_code, starting at 1 for each parent; the sibling_no of root entries are set as if those roots were children of a super root; • Entry_level can be set from top down, having entry_level=0 for all root entries; and X.entry_level=k+1 if X has parent Y and Y.entry_level=k; • Is_leaf =0/1 if there is child of the node/not so
Value Setting for H columns (II) • Path_code • for root entries X: X.path_code = to_char(X.sibling_no, ‘00’) • for non-root entries X with X.parent_xid=Y.xid: X.path_code = Y.path_code||to_char(X.sibling_no,’00’) Path_code of a node N at level k has k+1 sections; level j section is left-zero padded string conversion of sibling_no of N or N’s parent at level j; For convenience, the last (rightmost) section is called the base section, the concatenation of all the non-last sections is called the ancestor section.
Example of H-value setting LOGO EC for entry_code EL for entry_level XID for xid PC for path_code format assumption Node uniform (see next) Section length =2 String expression in format (Same assumption for later code examples) • Explanation • Path_code is in the uniform format • Path_code order is based on entry_code order but not on XID order. It could be otherwise. • Path_code of a child is the path_code of its parent plus its base section code. • Sibling_no is not shown but assumed to be in accordance with entry_code. • Entry_code and xid value settings can be independent of each other. • parent_xid, sibling_no, is_leaf and other fields are not shown.
Variants of path_code pattern (advanced topic) • node uniform: every section of all path codes has equal length (a simplest; and it is used in the previous example) • Level uniform: every section of the same level of all path_codes has equal length • Parent uniform: every child node of any parent node has equal path_code length • Dot (or delimiter) uniform: use the same delimiter character (e.g. dot) to separate all sections of all path_codes • Min uniform: the length of base section of the path_code is always maintained to be minimum • String/Binary/hex/ in expression and interpretation, sorting relevant, etc. • Sparse uniform: path_code sections each allows more values than actually and currently needed, for easing subsequent node insertions.
Query Patterns • Get all children of a parent P select * from T where path_code like P.path_code||’%’ • Get all ancestors if a child C select * from T where C.path_code like path_code||’%’ • Get all siblings of a node N select * from T where parent_xid = N.parent_xid
DML Patterns (insert at end) • (Insert record with path_code and sibling_no as null) insert into T(xid,parent_xid,entry_level, entry_code, normal_stuff) values c.xid,p.xid, p.entry_level + 1, c.entry_code, c.normal_stuff; • (Update sibling_no) update T set sibling_no = (select max(sibling_no)+1 from T where parent_xid = p.xid) where xid = c.xid; • (Update path_code) update T set path_code = p.path_code || to_char(sibling_no, '00') where xid = c.xid; • (reset is_leaf for p, detail omit)
DML Patterns (insert in middle) • (Insert record with path_code and sibling_no as null) insert into T(xid,parent_xid,entry_level, entry_code, normal_stuff) values c.xid,p.xid, p.entry_level + 1, c.entry_code, c.normal_stuff; • (Update sibling_no for those siblings elder than c) update T set sibling_no = sibling_no + 1 where parent_xid = p.xid and entry_code >c.entry_code; • (Update sibling_no for c) update T set sibling_no = (select max(sibling_no)+1 from T where parent_xid = p.xid and entry_code < c.entry_code; • (Update path_code for c) update T set path_code = p.path_code || to_char(sibling_no, '00') where xid = c.xid; • (Update path_code for those siblings elder than c and all decendents of those elder siblings, pcs_length stands for path_code section length) update T set path_code = substr(path_code,1, pcs_length*entry_level) || to_char(sibling_no, '00')||substr(path_code, pcs_length*(entry_level+1)+1) where path_code like p.path_code||’%’ and path_code > (select path_code from T where xid = c.xid)
DML patterns (delete) • (Delete node C and all its decendents) delete from T where path_code like C.path_code||’%’; • (Shifting sibling_no for those siblings elder than C) Update T set sibling_no = sibling_no -1 where parent_xid = P.parent_id and sibling_no >C.sibling_no; • (Shifting path_code for those siblings elder than C and their decendents) Update T set path_code = substr(path_code,1, pcs_length*C.entry_level) || to_char(sibling_no, '00')||substr(path_code, pcs_length*(C.entry_level+1)+1) where path_code like P.path_code||’%’ and path_code > C.path_code; • (reset is_leaf of the parent P of C according to whether P has other children)
Complexity Analysis • Space • length of path_code increases in order logarithmic of the total number of rows in the table (for non-degenerated hierarchical data). • e.g. length of c*20 vs 1M rows • Time • queries execute very much based on index range scan, usually the fastest available. • Inserts/deletes may involve sub-tree processing for delete or update.
A stretched idea on RDBMS design • Make entry_id and parent_entry_id relationship declarative; • Enforce hierarchy constraint to the effect that each node can only have zero or one parent node; • Create and maintain path_code, entry_level, etc by RDBMS like creating and maintaining functional indexes; • Add syntax to SQL similar to Oracle’s connect-by, but with the extra of taking advantage of the hidden indexes.
Additional Questions and References • Whether and how to generalize the design for node non-uniform hierarchical data? • To see latest alternative approaches, e.g. http://www.inconcept.com/JCM/May2005/David.html(Using ANSI SQL as a Conceptual Hierarchical Data Modeling and Processing Language for XML, by Michael M David)