1 / 37

How to Implement Data Integrity In SQL Server

How to Implement Data Integrity In SQL Server. Louis Davidson (drsql.org) drsql@hotmail.com. Why did I choose data integrity as a topic?. Answer 1 If I obviously lie to you, will you trust me? If your data obviously lies to your customer, will they trust it?

isaiah
Download Presentation

How to Implement Data Integrity In SQL Server

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. How to Implement Data Integrity In SQL Server Louis Davidson (drsql.org) drsql@hotmail.com

  2. Why did I choose data integrity as a topic? • Answer 1 • If I obviously lie to you, will you trust me? • If your data obviously lies to your customer, will they trust it? • For data to become information, it has to be as trustworthy as reasonably possible. • Answer 2 • If I were the judge and was convicting someone of poor data integrity, I would sentence them to write/maintain ETL • I wrote this slide at 12:49am, 8/14/2013 because I had to get up and fix a data integrity issue

  3. First Line of Defense – Testing and Requirements • First, know what your user wants (Requirements) • Build queries to check the data is within tolerances as you build • Define both illegal values and exceptional values • Age for DayCare Student: • Legal: 1-8 Illegal: Everything else • Outside the norm, but perhaps possible: 1, 2, 6, 7, 8 • Save these queries as you go • Test during all phases of the project • Design • Development • Customer testing • Production • No matter if you follow any of my following advice and let your tables go naked, these scripts can be used to verify data is within tolerances

  4. Requirements tell you what to test for, now WHERE/HOW to implement? • DB • Classic client server: “May I save this data, please?” • Very trustworthy data protection • Middle Tier, Rules Engine • Built in tools that most programmers understand • Flexible to change as users need them to • UI • Friendly for the user • Provide immediate feedback for nominal rules, limiting bandwidth utilization

  5. No one place can satisfy well enough (But…) • DB • No interactive protection, limit to 100% true rules • Extremely limited flexibility • Middle Tier, Rules Engine • Suffers under highly concurrent situations • Difficult for Inter-row, Inter-table rules • Difficult to use with tools like SQL, SSIS • UI • Must be recoded for every form/screen • Very limited rule set that can be enforced

  6. Database Layer Responsibilities • 100% Rules • Always true • Usually very simple rules • Failure to meet the prescribed condition would be harmful to the software (and possibly the users of the software) • Other layers repeat some rules and implement everything else

  7. Database tier layered approach • Keep it simple • Enforce integrity via (Our Agenda for the next 1hr) • Structure - providing correct places to store data • Keys - protecting uniqueness • Relationships -foreign keys • Domains - limiting data points to size/values that are legit • Conditions - required situations (Customer may have only 1 primary address; No overlapping ranges; etc)

  8. But We Don’t Want Errors from the Data Tier! • A frequent concern of non-data tier programmer • Even if you put no constraints you are apt to get errors • You are always likely to get deadlocks • And if your indexing isn’t great, you may get them frequently • Best to code error handler that handle any error condition regardless • If the other tiers handle all of the errors, then the database protection should remain silent • Except perhaps during testing/coding

  9. Structure • Match the user's needs precisely to the design with room for growth • Getting design to match the user's needs will get you way down the road to integrity • Normalization will usually get the car fueled up and started • Naming stuff well doesn’t hurt either… • Getting it right can only be done by understanding the users requirements • I promise, no more requirement talk

  10. If your structure is wrong…Users will find a way • Requirement: Store information about books • What is wrong with this table? • Lots of books have > 1 Author. • What are common way users would “solve” the problem? • Any way they think of! • What’s an another common way someone might fix this? BookISBNBookTitleBookPublisher Author =========== ------------- --------------- ----------- 111111111 Normalization Apress Louis222222222 T-SQL Apress Michael333333333 Indexing Microsoft Kim444444444 DB Design Apress Jessica , Louis and Louis & Louis 444444444-1 DB Design Apress Louis

  11. Close, but still quite messy • Add a repeating group? • But now how to represent who was the primary author on the book? BookISBNBookTitleBookPublisher … =========== ------------- --------------- 111111111 Normalization Apress …222222222 T-SQL Apress …333333333 Indexing Microsoft …444444444 Design Apress … Author1 Author2 Author3 ----------- ----------- ----------- LouisMichaelKimJessica Louis

  12. Now, the structure protects the data… • And it gives you easy expansion BookISBNBookTitleBookPublisher =========== ------------- --------------- 111111111 Normalization Apress 222222222 T-SQL Apress 333333333 Indexing Microsoft444444444 Design Apress BookISBN Author =========== ============= 111111111 Louis222222222 Michael333333333 Kim444444444 Jessica ContributionType ---------------- Principal Author Principal Author Principal Author Contributor Principal Author 444444444 Louis

  13. Keys • Defending against duplication of data where it oughtn't be duplicated • Artificial Key (Identity/GUID/Sequence generated value) should NOT be the only key • When employed, Artificial Key is for tuning, Natural Key is for the user • Avoid giving users sequentially created values • Well, I am account 0000001, what about account 0000002

  14. Uniqueness Counts • Requirement: Table of school mascots • For a row to be truly unique, some manner of constraint needs to be on column(s) that have meaning • It is a good idea to unit test your structures by putting in data that looks really wrong and see if it stops you, warns you, or something! ~~~~~~~~~~~ ~~~~~~~~~~~ Color----------- Black/BrownBlack/WhiteSmoky Brown School----------- UTCentral HighLess Central HighSouthwest Middle MascotId Name=========== -----------1 Smokey112 Smokey4567 Smokey 979796 Smokey

  15. Key Constraints • Applied to protect data from duplication • May help performance, but should exist even if never used for a query • Part of the data structure – applied with ALTER TABLE – unlike indexes, which are generally attached for performance • NULLs • Primary Key – No NULLs Allowed • Unique – NULL allows, but treated as a single value • Table Clustering • Usually makes sense for the primary key to be clustered (not a hard and fast rule though) • Key constraints valuable with or without clustering

  16. Demo – Key Constraints (and a wee bit more)

  17. Relationships • Establishes a connection between two tables • Probably the most trouble to implement from outside of the database • Concurrent users means data can change • Caching all data is really costly (particularly to keep up to date with multiple caching servers for inserts, updates, and deletes!) • Using foreign key constraints means these types of queries always return the same value: • SELECT COUNT(*) FROM InvoiceLineItem • SELECT COUNT(*) FROM Invoice JOIN InvoiceLineItem ON Invoice.InvoiceId = InvoiceLineItem.InvoiceId

  18. Foreign Key Constraints • Like CHECK CONSTRAINTs, are part of the table structure • One table can reference another’s PRIMARY KEY key columns, or even the UNIQUE key columns • Indexing the child’s reference key can be helpful in many cases • Usually extremely fast, even on very large tables • As long as key’s underlying indexes maintained • For integer keys, a B-Tree index can search millions of rows in a few reads

  19. Foreign Key Cascading • Can define cascading operations • UPDATE CASCADE– Deleting the parent deletes the children • DELETE SET NULL – Updating the parent key set the child reference key to NULL • DELETE SET DEFAULT – Deleting the parent row sets the child row to the default • UPDATE NO ACTION – Fail if any child rows exist – THE DEFAULT • Or other combinations of DELETE and UPDATE with CASCADE, SET NULL, SET DEFAULT, or NO ACTION • DELETE CASCADE operations should be limited, to avoid surprises • Use UPDATE CASCADE where you have updatable primary keys. Changing a primary key with references is messy. • Multiple or Cyclic cascade paths require INSTEAD OF triggers or procedures to implement

  20. Demo – Foreign Keys

  21. Domains • Defining the domain of an object or column • Table - Customers? All customers or certain types? • Column • Integer? Or Whole number between 0 and 10,000,000 • True Unicode Value accepting 64K Characters? Or simple AlphaNumeric? • Can you accept 2GB of Text (varchar(max))? • Goal 0% chance of defects • No situational intelligence • If there can be ANY variation, then the domain includes the variations • Can't fight users doing dumb stuff

  22. Please don’t do this. Please? CREATE TABLE object (objectIduniqueidentifier, fillMeUpvarchar(max) )

  23. Extreme Bucket Datatypes • numeric(38,2) • Max value: 999,999,999,999,999,999,999,999,999,999,999,999.99 • Bill gates worth: < $99,999,999,999.99 • US National Debt + All personal Debt: < $99,999,999,999,999.99 • For a nutty value: Distance to nearest galaxy in inches, yes, inches ~74488200000000000000000.00

  24. Extreme Bucket Datatypes - Strings • varchar(8000) • abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz • That is just 780 characters! • Note: If you allow N characters, your apps should minimally test for N (successfully), and N + 1 characters (error)

  25. On the other hand, don’t be over restrictive Why did they name me 555 95472? Now I can’t go to school because of the stupid school database! http://peanuts.wikia.com/wiki/File:555.jpg

  26. Single Column Domain • What data is EVER "legal" for a column • Most data integrity issues are due to lack of domain control • Missperllings: TN, TNN, TENN, TENNESEE, TINNESEE • Bad values: -1 for Age, NULL for required value, Random default value chosen • Implementation Includes • Intrinsic data type • Optionality (NULL v NOT NULL) • Default Value • Simple predicates • Check constraint • Domain table • Forcing the Issue: Trigger

  27. Multiple column • Where the domain of one column is affected by the domain/value of another • Examples: • if col1 = 1 then col2 in (1,2,3) else col 2 in (3,4,5) • If col1 = 'bob' then col2 is NOT NULL • Usually implemented with a CHECK constraint

  28. Multiple Column Concerns • Minimize these conditions to only where necessary to avoid illogical/illegal data • RefusedToGiveBirthDateFlag = 1 AND BirthDate is null • Questionable: if DiscountPercent> .5, then ApproverUserId is not null. • Likely Contraindicated - Processing Situations • The user enters Date1 always before Date2 • The ship date must be after the order date • Avoid domains based on data in other tables because data in other tables can shift, leading to messy situations • discountPercent > .5 and savingUser.needsApproverFlag = 1 then ApproverUserId is not null • What happens if you change/delete the user that is referenced in savingUser?

  29. Check Constraints • Applied to complete implementation of 99.9% of simple domains • May help performance because it gives the optimizer knowledge of the data • Part of the data structure – applied with ALTER TABLE • Simple predicate implementation • If any column allows NULL, the expectation is that NULL is an acceptable answer unless specifically coded for • Hence, to fail CHECK condition, the answer must be FALSE (unlike WHERE clauses that succeed only when the result is TRUE) • 1=1 TRUE – Acceptable for WHERE or CHECK • 1=NULL UNKNOWN – Succeeds for NULL Column CHECK CONSTRAINT ONLY • 1=2 FALSE – Fails for both

  30. Demo – Domains

  31. Conditions • Making sure that some condition is met reliably • Examples • Row Modification Details • Overlapping Ranges • Big decisions here • Non-trivial to implement • Feels natural to do it non-data tier code • However non-data tier code: • Can be less reliable • Can be greatly affected by concurrency

  32. Tools • Triggers • Instead of Trigger to Automatically Maintain Values • After to validate complex conditions that must be constantly true • SQL • Optimistic Locking to avoid heavy locking without lost updates

  33. Demo – Protecting against Conditions

  34. Performance Concerns… • For most everything you will commonly need, it can be based on basic declarative integrity constraints • By now, there will be some concern about performance • Performance WILL be impacted, • Done well: almost negligible • Done poorly: can lots of pain • The next demo will do a non-scientific, single user job of showing the performance hit is noticeable, but not tremendous…

  35. Demo –Performance

  36. Summary • Getting the structure correct is a great start towards data integrity • Make sure column values are always within an acceptable tolerance so software doesn’t break • Employ all of the tools SQL Server gives you to help ensure data integrity • Use non-data tier software to ensure errors that return from the data tier are extremely rare • The key word is: teamwork. You can’t do an adequate job of protecting data in the UI, Business/Object or Data tiers alone

  37. Trust but verify • Never stop testing the data, even into production • Be vigilant • Test the structures to make sure constraints not disabled and are trusted • Test data that is not constrained in a 100% manner • Use your slow periods wisely, running tests regularly • Even 1 bad row that a customer notices means they may no longer trust the data…

More Related