1 / 47

Improving Hash Join Performance By Exploiting Intrinsic Data Skew

Improving Hash Join Performance By Exploiting Intrinsic Data Skew. by Bryce Cutt supervised by Dr. Ramon Lawrence. Introduction. Databases are part of our lives Hash Join is a core database algorithm Very I/O intensive for large databases Queries may take hours

Download Presentation

Improving Hash Join Performance By Exploiting Intrinsic Data Skew

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Improving Hash JoinPerformance By ExploitingIntrinsic Data Skew by Bryce Cutt supervised by Dr. Ramon Lawrence

  2. Introduction • Databases are part of our lives • Hash Join is a core database algorithm • Very I/O intensive for large databases • Queries may take hours • Any performance improvement is significant • Real datasets contain skew • Skew is when some values occur more frequently • Skew can greatly reduce hash join performance • Skew traditionally considered a bad thing for join algorithms • Try to mitigate negative effects of skew • Adapt hash join • No longer just mitigate • Use foreknowledge of skew • Improve performance

  3. Relational Model Definitions

  4. Example Relations Probe Relation Build Relation Part Purchase

  5. DHJ Algorithm Build Phase Hash Function: modulo 5

  6. DHJ Algorithm Build Phase, cont.

  7. DHJ Algorithm Build Phase, cont.

  8. DHJ Algorithm Build Phase, cont.

  9. DHJ Algorithm Build Phase, cont.

  10. DHJ Algorithm Build Phase, cont.

  11. DHJ Algorithm Build Phase, cont.

  12. DHJ Algorithm Build Phase, cont.

  13. DHJ Algorithm Build Phase, cont.

  14. DHJ Algorithm Build Phase, cont.

  15. DHJ Algorithm Build Phase, cont.

  16. DHJ Algorithm Build Phase, cont.

  17. Probe Relation

  18. DHJ Algorithm Probe Phase

  19. DHJ Algorithm Probe Phase, cont.

  20. DHJ Algorithm Probe Phase, cont.

  21. DHJ Algorithm Probe Phase, cont.

  22. DHJ Algorithm Probe Phase, cont.

  23. DHJ Algorithm Probe Phase, cont.

  24. DHJ Algorithm Probe Phase, cont.

  25. DHJ Algorithm Probe Phase, cont.

  26. DHJ Algorithm Cleanup Phase

  27. DHJ Algorithm Cleanup Phase, cont.

  28. DHJ Algorithm Cleanup Phase, cont.

  29. DHJ Algorithm Cleanup Phase, cont.

  30. Skewed Probe Relation

  31. Statistics and Hash Joins • Modern database systems maintain statistics such as histograms for query optimization • What if hash join could use the statistics to choose the best build tuples to keep in memory? • Does not have to generate own statistics

  32. Histojoin Algorithm General Idea • Same basic form as DHJ • Determines best build tuples from histogram • In this case the tuples with partid 2 and 3 • Create partitions for the best build tuples • In addition to regular partitions • Freeze regular partitions first • Perform a highly optimized multi-stage check • To determine the partition tuples belong in

  33. Histojoin Algorithm Build Phase

  34. Histojoin Algorithm Probe Phase

  35. Implementation Details • Avoided in algorithm description • General enough to fit any database system • But ultimately important •  Core of algorithm implementation specific • Implemented in • Stand alone Java app • Optimistic implementation • PostgreSQL • HHJ • Conservative implementation

  36. Inaccurate Statistics • Selections • Multi-join plans • Sampling • SITs • Handling dependent on implementation •  PostgreSQL conservative memory usage

  37. Experimental Results • TPC-H • Database commonly used to test database system performance • Skewed versions • 1GB dataset used in Java tests • 10GB dataset used in PostgreSQL tests

  38. Experimental Results, cont. Java, Lineitem/Part, skewed, 1GB Approx. 20% faster

  39. Experimental Results, cont. Java, Lineitem/Part,high skew, 1GB Approx. 60% faster

  40. Experimental Results, cont. Java, Various Joins, Percent Improvement, 1GB Approx. 20% for skewed and 60% for high skew

  41. Experimental Results, cont. Java, Lineitem/Part, Inaccurate Histogram, 1GB

  42. Experimental Results, cont. Java, Lineitem/Part/Supplier,high skew, 1GB Approx. 75% faster

  43. Experimental Results, cont. PostgreSQL, Lineitem/Part,skewed, 10GB Approx. 10% faster

  44. Experimental Results, cont. PostgreSQL, Lineitem/Part, high skew, 10GB Approx. 60% faster

  45. Experimental Results, cont. PostgreSQL, Various Joins, Percent Improvement, 10GB 5-10% for skewed and 50-60% for high skew

  46. Conclusion • Histojoin • significantly outperforms standard hash joins in the presence of skew • Smart implementation mitigates pitfalls • Two papers have been published from this work • PostgreSQL patch currently in review • Will be used by millions of users

  47. Thank you Thank you Dr. Lawrence

More Related