1 / 18

Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach

Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach. Charles Yu Database Architect Elance Inc. cyu@elance.com charles.yu@acm.org 2005-08. Background.

maalik
Download Presentation

Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Table & Query Design for Hierarchical Data without CONNECT-BY -- A Path Code Approach Charles Yu Database Architect Elance Inc. cyu@elance.com charles.yu@acm.org 2005-08

  2. Background • Node-Uniform Hierarchical (NUH for short) data can be visualized as a tree or forest graph where every node has the same set of attributes. • NUH data can be naturally represented in RDBMS by recursive tables where the parent-child relationship is implemented in a way that if record x is a child of record y, then the value of x’s parent_id column is the same as the value of the id column of y’s. • Standard SQL does not support for general query on NUH data in basic recursive tables. • Oracle comes with a native mechanism for general query on NUH data and beyond, known as connect-by. For all its elegancy and usefulness, it is short in two accounts: it is slow in cases due to the fact that the parent-child relationship cannot be directly indexed; and it is Oracle dependent in that SQL queries using connect-by cannot be easily adapted to RDBMS of other vendors.

  3. Basic Recursive Table Design Basic Columns • Xid --system assigned unique id • Parent_xid --xid of parent of this entry • Entry_code --content unique identifier of the entry • Normal_stuff --one or more such columns for content values Some variant: Use a separate table to store the hierarchical relationship consisting essentially of two columns: xid/child_xid and parent_xid; and use FK to link the table to the main data table

  4. Basic Recursive Table Query Mechanisms • Oracle-native Connect by • K-way self outer join (for up to level k depth) • Other??

  5. Basic Idea of Path Code Approach • A node of a tree is fully determined by the path from the root to itself. • Path code as full representation of the path can be very compact in length, in the order of logarithmic of total size of the tree. • Path Code can be maintained dynamically feasibly. • Path code permits direct indexing.

  6. Path-code enhanced recursive table design Basic Columns • xid • parent_xid • path_code --code of the path for the node (detail later) • entry_level --level of the record in the tree the entry belongs to • sibling_no --sequence no of the child entry_code with respect to the parent • is_leaf --1/0 for being a leaf/not a leaf • Entry_code --content unique identifier of the entry • normal_stuff --one or more

  7. Value Setting for H columns (I) • Parent_xid set as usual • Sibling_no can be set according to any ordering, e.g. according to entry_code, starting at 1 for each parent; the sibling_no of root entries are set as if those roots were children of a super root; • Entry_level can be set from top down, having entry_level=0 for all root entries; and X.entry_level=k+1 if X has parent Y and Y.entry_level=k; • Is_leaf =0/1 if there is child of the node/not so

  8. Value Setting for H columns (II) • Path_code • for root entries X: X.path_code = to_char(X.sibling_no, ‘00’) • for non-root entries X with X.parent_xid=Y.xid: X.path_code = Y.path_code||to_char(X.sibling_no,’00’) Path_code of a node N at level k has k+1 sections; level j section is left-zero padded string conversion of sibling_no of N or N’s parent at level j; For convenience, the last (rightmost) section is called the base section, the concatenation of all the non-last sections is called the ancestor section.

  9. Example of H-value setting LOGO EC for entry_code EL for entry_level XID for xid PC for path_code format assumption Node uniform (see next) Section length =2 String expression in format (Same assumption for later code examples) • Explanation • Path_code is in the uniform format • Path_code order is based on entry_code order but not on XID order. It could be otherwise. • Path_code of a child is the path_code of its parent plus its base section code. • Sibling_no is not shown but assumed to be in accordance with entry_code. • Entry_code and xid value settings can be independent of each other. • parent_xid, sibling_no, is_leaf and other fields are not shown.

  10. Variants of path_code pattern (advanced topic) • node uniform: every section of all path codes has equal length (a simplest; and it is used in the previous example) • Level uniform: every section of the same level of all path_codes has equal length • Parent uniform: every child node of any parent node has equal path_code length • Dot (or delimiter) uniform: use the same delimiter character (e.g. dot) to separate all sections of all path_codes • Min uniform: the length of base section of the path_code is always maintained to be minimum • String/Binary/hex/ in expression and interpretation, sorting relevant, etc. • Sparse uniform: path_code sections each allows more values than actually and currently needed, for easing subsequent node insertions.

  11. Query Patterns • Get all children of a parent P select * from T where path_code like P.path_code||’%’ • Get all ancestors if a child C select * from T where C.path_code like path_code||’%’ • Get all siblings of a node N select * from T where parent_xid = N.parent_xid

  12. DML Patterns (insert at end) • (Insert record with path_code and sibling_no as null) insert into T(xid,parent_xid,entry_level, entry_code, normal_stuff) values c.xid,p.xid, p.entry_level + 1, c.entry_code, c.normal_stuff; • (Update sibling_no) update T set sibling_no = (select max(sibling_no)+1 from T where parent_xid = p.xid) where xid = c.xid; • (Update path_code) update T set path_code = p.path_code || to_char(sibling_no, '00') where xid = c.xid; • (reset is_leaf for p, detail omit)

  13. DML Patterns (insert in middle) • (Insert record with path_code and sibling_no as null) insert into T(xid,parent_xid,entry_level, entry_code, normal_stuff) values c.xid,p.xid, p.entry_level + 1, c.entry_code, c.normal_stuff; • (Update sibling_no for those siblings elder than c) update T set sibling_no = sibling_no + 1 where parent_xid = p.xid and entry_code >c.entry_code; • (Update sibling_no for c) update T set sibling_no = (select max(sibling_no)+1 from T where parent_xid = p.xid and entry_code < c.entry_code; • (Update path_code for c) update T set path_code = p.path_code || to_char(sibling_no, '00') where xid = c.xid; • (Update path_code for those siblings elder than c and all decendents of those elder siblings, pcs_length stands for path_code section length) update T set path_code = substr(path_code,1, pcs_length*entry_level) || to_char(sibling_no, '00')||substr(path_code, pcs_length*(entry_level+1)+1) where path_code like p.path_code||’%’ and path_code > (select path_code from T where xid = c.xid)

  14. DML patterns (delete) • (Delete node C and all its decendents) delete from T where path_code like C.path_code||’%’; • (Shifting sibling_no for those siblings elder than C) Update T set sibling_no = sibling_no -1 where parent_xid = P.parent_id and sibling_no >C.sibling_no; • (Shifting path_code for those siblings elder than C and their decendents) Update T set path_code = substr(path_code,1, pcs_length*C.entry_level) || to_char(sibling_no, '00')||substr(path_code, pcs_length*(C.entry_level+1)+1) where path_code like P.path_code||’%’ and path_code > C.path_code; • (reset is_leaf of the parent P of C according to whether P has other children)

  15. Complexity Analysis • Space • length of path_code increases in order logarithmic of the total number of rows in the table (for non-degenerated hierarchical data). • e.g. length of c*20 vs 1M rows • Time • queries execute very much based on index range scan, usually the fastest available. • Inserts/deletes may involve sub-tree processing for delete or update.

  16. Comparison with Connect-By

  17. A stretched idea on RDBMS design • Make entry_id and parent_entry_id relationship declarative; • Enforce hierarchy constraint to the effect that each node can only have zero or one parent node; • Create and maintain path_code, entry_level, etc by RDBMS like creating and maintaining functional indexes; • Add syntax to SQL similar to Oracle’s connect-by, but with the extra of taking advantage of the hidden indexes.

  18. Additional Questions and References • Whether and how to generalize the design for node non-uniform hierarchical data? • To see latest alternative approaches, e.g. http://www.inconcept.com/JCM/May2005/David.html(Using ANSI SQL as a Conceptual Hierarchical Data Modeling and Processing Language for XML, by Michael M David)

More Related