710 likes | 826 Views
Auto-Validating Your Data Store: A do-it-yourself approach to data integrity and anomaly detection. Evan Davies Office of Strategic Planning and Analysis The College of William and Mary in Virginia . Overview.
E N D
Auto-Validating Your Data Store: A do-it-yourself approach to data integrity and anomaly detection.Evan DaviesOffice of Strategic Planning and AnalysisThe College of William and Mary in Virginia
Overview • Institutional researchers increasingly rely on data marts, stores, and warehouses for management information. • Given the perpetually ‘developing’ status of these environments, both commercial and institutional, you can spend significant time dealing with data that does not meet even shifting standards for table logic, variable conventions, and values. • Such anomalies can stop production programs or lead to inaccurate information until detected.
Does this ever happen to you? • NOTE: Table WORK.ENROLL created, with 8241 rows and 19 columns. • NOTE: Table WORK.PERSON created, with 8241 rows and 13 columns. • NOTE: Table WORK.MAJOR created, with 8253 rows and 30 columns. • NOTE: Table WORK.MAJOR_PERSON created, with 8264 rows and 30 columns. • Who are these extra people? Why are they in here? •
Agenda • This presentation demonstrates a simple yet sophisticated way of using SAS® Enterprise Guide to check tables, variables, and values automatically to find out if that data meets IR standards and premises before you start analytical work (and to let others know that things may need fixing).
Things You Should Know… • ‘Simple’ is a relative term, as is ‘sophisticated’ • This involves more coding than mouse clicking • You should have at some concepts of SAS® coding, SQL, and relational databases • To make this work back at your campus, you need to know (or find out) how to access your data • If you have significant structured programming and SQL experience, please refrain from laughing out loud. Snickering is acceptable.
The College of William & Mary • The only royally chartered colonial institution, 1693, by King William III and Queen Mary II… • Making it the second oldest college in the United States • Phi Beta Kappa, the first Greek honor society, was founded here in 1776 • Became state-supported in 1906 and coeducational in 1918 • The AlmaMater of George Washington and Thomas Jefferson , as well as Jon Stewart and Secretary of Defense Robert Gates • Named one of Intel's 50 Most Unwired College Campuses for our campus-wide wireless network • The Colonial Campus section of the 1,200 acre campus is restored to its 18th-century appearance
The Wren Building (1700)The Oldest Academic Building in Continuous Use in the U.S.
The College Today… • 5,800 undergraduates and 1,950 graduate students from all 50 states and 30 foreign countries • 22 percent are students of color • 79 percent of freshmen graduated in the top ten percent of their class • Highest SAT middle 50th range of all public institutions in Virginia • 11:1 student-faculty ratio • W&M has more recipients of the Commonwealth's Outstanding Faculty Award than any other institution • 5 undergraduate and graduate schools: Arts & Sciences, Business, Education, Law, and Marine Science • 36 undergraduate programs, 12 masters , doctoral, and professional degrees • is a Highly Selective Public Liberal Arts University
External System Interfaces/Entities College Systems that Interface with Administrative Computing Systems ∙Bank ∙ACH Direct Deposits ∙Check Verification ∙Cancelled Checks ∙Lock-Box Payments Federal Govt ∙IRS/SSA ∙W2s/1099s VA APA VA SCHEV VA Dept of Taxation ∙W2s/1099s ∙CDS Vendors FRS Financial Record System ∙Accounts Payable ∙Purchasing ∙General Ledger ∙Budget ∙CARS ∙Interface/Recon ∙EDI & CDS Vendors ∙Financial Statements ∙Grants Account ∙Fixed Assets (6/99) Old SIS Student Info System ∙Residence Life ∙Student Billing ∙Transcripts Office Supply Vendor Office Supplies Systems Cash Receipts System VA DOA CARS System ∙Payroll Acct ∙EDI to W&M ∙CDS Pymts ∙EDI Vendor Pymts ∙Expense/Cash Acct ∙Agency/CPRS Tape SAT Scores Warehouse Systems Data History We had many homegrown systems, some based on Information Associates® architecture, but extensively modified. FAACS (Old Fixed Assets) The College of William and Mary ACT Scores GRE Scores Work Order System Federal Dept Labor ∙Employment Compliance MAT Scores MCAT Scores VA DPT ∙BES System ∙Benefits Eligibility WORCS Student Web DARS NCAA VA VRS ∙Retire, Life Ins, Opt Life Faculty Salary Tracking HRS Human Resource System ∙Applications ∙Personnel Mgmt ∙Position Control ∙Benefits Mgmt ∙Work Study ∙Payroll/Account ∙A21 Certification ∙CARS Interface New SIS Student Info System ∙Prospects ∙Admissions ∙Student Records ∙Registration ∙Course Schedule ∙PO Box Management Old Financial Aid Prospective Student Search VA VEC ∙Unemployment Benefits Schedule 25 Resource 25 Leave Account FSA Administrator Athlete Tracking Peterson’s Pros. Student Data Old Campus Police TIAA - CREF ∙Annuity Eligibility 1500 Hour Tracking Web Applications (IP) Benefits Vendors National Clearing HouseLoan Verification VA DPT PMIS ∙Personnel Mgmt College Systems that Interface with Administrative Computing Systems VA DPB ∙Budget Admin SSA/FICA Deposits Payroll Office Checks 1-2-3 ∙Pay Checks ∙Direct Deposit Stubs ∙W2s ∙AP checks ∙1099s General Accounting Office New Campus Police Parking System SWEM Patron Info System Mysoft Call Account Sys Telecom Power FAIDS New Financial Aid PELL Grants Social Security Administration ISIR F.A. Input Identification System Food Services US Savings Bonds Admin Payroll Office Student Health System Door Access System Alumni Development System Fed Reserve ∙US Savings Bonds Wiz Kid
Recent Data History • After false starts, in 2003 we bought a large Banner enterprise system, and then added a datamart, and then a data store. We are now five years into our two year installation period. • Or put another way, we are ‘current’ on version(s) of 8.2 and 3.1, with new version(s) around the corner. • When the IR staff start to understand the relationships and foibles of a particular version, it is time to upgrade to a newer version in which some things are fixed… and some other things are brokendifferently enabled. v.6, v.7, v.8… v.1, v.2… v.1, v.2, v.3….
Things To Realize… • SungardHE Banner® products are not a bad system. Nor are any other commercial vendor’s products. • Any enterprise-level system with a data store is a permanently evolving, almost organic entity, • with multiple input opportunities for breaking constraints and premises • and for finding new ways to induce unexpected results through changing business rules, institutional decentralization, flexibility, and collegiality.
Data Integrity in Pre- and Post-Enterprise Systems • Old History: I.T. used to do a system General Edit, with “general” meaning edited for operational purposes, not analytical purposes. If the data got the payroll to run today or allowed you to admit a student, it was ‘valid’. • New History: Data validity is still measured against operational standards.
Data Integrity Post-Enterprise System • And I.T. is now even busier than ever just making the production system ‘run’, without markedly larger numbers of staff to commit to the data warehousing activity. They actually do less general editing because more transactions are interactive rather than process-oriented. • At the end of a day, data is off-loaded into the data store, in different forms and with different premises from how the data is held from the production system.
Data Integrity Pre- and Post-Enterprise System • This means that there is now an even bigger split between operational and analytical data integrity, especially in terms of the differing forms of the data. • The data store is not as rigorously evaluated for data integrity or meaning precisely because it is not production. And since the operational offices of the institution are satisfied with their pieces of the data pie, everything is working fine.
The Institutional Research Role • IR has always had premises for data reporting that go beyond ‘general edit’. • We analyze data relationships that make up the whole picture. • We work in the aggregate, rather than by individual transaction. • We are ideally poised to discover the anomalies that occur between multiple institutional sources -- between one office’s interpretation of a transaction and another office’s idea of the same information.
A Data Store’s Added Task to IR • IR now has to do the same comprehensive validation tasks that we used to do, plus identify and deal with the newer ‘introduced’ problem of tables that violate their own premises or have unexpected values due to complex dynamics among: UPGRADE
The Complex Dynamics • imperfect translation of data between the production transaction system and the data store/warehouse, • vendor maintenance or institutional business rules changes that intentionally induce changes in tables. • generally caused by the inability to predict all of the systemic results of making unspecified or unimagined changes in a system, sometimes known as the Butterfly Effect. • the continuous upgrade cycles applied to the system, • and data that results from imperfectly recorded transactions in uncertain environments with less than adequate collaboration and training.
So What Can We Do? • Write a program to • check the logic premises of frequently used tables • check for missing or out-of-range values in data that affects IR • run the program frequently to uncover problems in time for the current census and prevent future term anomalies. Keep the results for documentation. • communicate findings promptly and efficiently in order to effect change • build in flexibility to test different things at different times in different ways
Limitations in Place • Do it on your own, since it is for IR purposes. Remember, the data already meets everybody else’s needs. • Use existing resources. Keep it simple. • Someone else is going to have to be notified to deal with the data anomaly once it is identified. • Don’t weaponize the process. This is why we choose to have anomalies, not mistakes.
How To Accomplish? • The use of SAS® EG on a PC platform allows for the remote submission of a premise and data-checking program during off-hours, when demand on the system is lessened. • It also allows a convenient environment in which to store the program and results, access validation and history tables, and send automatic e-mails to interested parties such as admissions, registrar, IR and IT staff.
“SAS® Enterprise Guide®, a powerful Microsoft Windows client application that provides a guided mechanism to exploit the power of SAS and publish dynamic results throughout your institution. It’s the preferred interface to SAS for analysts, statisticians and programmers– SAS Enterprise Guide saves time by automatically generating computer code with an easy point-and-click interface.” • Think of it as a graphical office environment for the SAS language. As Microsoft Outlook ties MS Office products together, Enterprise Guide has a similar role for SAS SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Still not clear on SAS/EG® ? • It is an environment in which SAS 9.1 runs; • It can bring together data views and any type of files from any network data servers, including Oracle. • It contains SAS program(s), the larger project, notes, and a graphical description of how the project, processes, and programs relate to each other. • It documents how any processes or programs have been run, and the results. • It can generate code for procedures and datasteps. • It inherits libraries, autoexecs, etc. from 9.1
We Use Enterprise Guide® To… • Schedule and launch the anomaly tracking job • Provide a comprehensive project environment for IR staff to be able to visit independently and add new anomaly checks • Be able to see and modify the associated tables and data in one place • Provide our novice IR staff a more centralized and friendlier view of the process
Can You Just Use SAS® Itself? • YES! • Provided you schedule the job through MS Task Scheduler and have no desire for the previously mentioned features or facilities. • Programmatically, all process features are part of base SAS v9.1 on an XP or Vista platform.
Beginning Steps • Survey yourself and other staff (IR and other offices) to make a preliminary list of tables and values that have issues • Establish the names, emails, and hierarchy of those who will receive automated communications. • Make a calendar of when you want to test certain items due to different functional cycles (admissions, registration, etc)
Let’s Go Coding… • I can’t show you the entire code for my anomaly program today. • It would need to be modified to fit your data structures and needs anyway. • Instead I will concentrate on imparting some: • Key Program Ideas • SQL techniques • to enable you to develop your own program
Set up a SAS program which launches automatically using SAS/EG
Set up a SAS program which launches automatically using SAS/EG
Start by setting up metadata necessary for testing %let term_start_limit=200325; %let term_end_limit =201030; %let highestssn='772'; %let es_validset='EL','MW','WD','WM','WW'; /*all valid statuses encountered by enrolled students AFTER enrollment/dropadd*/ %let es_bad='QW','WB','AW'; /*all bad or ineligible statuses not to be counted by census date*/ %let tooold='1900'; *out-of-range birthday year cutoff;
Start by setting up metadata necessary for testing Macro variable assignment increases flexibility as table names change. %let table_enroll=enrollment; %let table_academic_study=academic_study; %let table_addcurrent=address_current; %let table_prevedslot=previous_education_slot; %let table_schedule=schedule_offering; %let table_course_catalog=course_catalog; %let table_course=student_course;
Start by setting up metadata necessary for testing *format table for anomalies; proc format; value anom 1= 'withdrawn with classes' 2= 'duplicate person recs ' 3-4= 'missing value ' 5= 'ssn out-of-range ' 6= 'Two ids, one ssn ' 7= 'dup recscourse_catalo ' …… ; run;
Create table todayterm as select stvterm_code as studyterm, datepart(stvterm_start_date) as begd , datepart(stvterm_end_date) as endd from (your data source) where today() > datepart(stvterm_start_date) and today() < datepart(stvterm_end_date) ; *set a global study term variable based on value in todayterm; data _null_; set todayterm; call symput('study',studyterm); • Construct some macros to help pass anomaly parameters
Use up to 3 variables to show people what is anomalous about any particular situation. Name them A, B, and C. The variables you will pass to these variables will differ with each problem. • You will need to put both the value and the name of the variable into your anomaly report, plus some identifier(s) for the student or entity, plus some anomaly details. • Construct some macros to help pass anomaly parameters
*Macro to help transfer of anomalies; %macro keep ; (keep= id person_uidavalaval_desc bvalbval_desccvalcval_desc anomstudyterm first_anom_datesuspend_datedata_own) %mend keep; • Construct some macros to help pass anomaly parameters
%macro pass (aval=,bval=,cval=,dsn=,anom=,studyterm=,suspend_date=,data_own=) ; aval_desc = "&aval"; bval_desc = "&bval"; cval_desc = "&cval"; aval = put(&aval,$25.); bval = put(&bval,$25.); cval = put(&cval,$25.); anom=&anom; studyterm=put(&studyterm,$6.); first_anom_date=today(); suspend_date=&suspend_date; data_own=&data_own; %mend pass; • Construct some macros to help pass anomaly parameters PASSES THE NAME OF THE VARIABLE PASSES char VALUE OF THE VARIABLE PASSES OTHER VALUES
proc sql; connect to odbc as mydb (datasrc="&datasrc" user=&user password=&password); create table addressc as select * from connection to mydb ( select m.PERSON_UID, m.id, n.address_type, n.postal_code, n.city, n.county, n.state_province, n.nation from &table_enroll m inner join &table_addcurrent n on m.person_uid = n.entity_uid where m.ACADEMIC_PERIOD in (&study) and ((m.ENROLLED_IND='Y' and m.REGISTERED_IND='Y' ) or (m.ENROLLED_IND='Y' and m.ENROLLMENT_STATUS in (&es_set))) and n.address_type in ('IN', 'P1', 'MA') ) ; • Acquire data for testing from your datastore
…from &table_enroll m inner join &table_addcurrent n on m.person_uid = n.entity_uid where m.ACADEMIC_PERIOD in (&study) and and n.address_type in ('IN', 'P1', 'MA') • Acquire data for testing from your datastore STUDENTS THIS TERM ADDRESSES TO CHECK ALL ADDRESSES
Let your server do the heavy data work • Acquire data for testing from your datastore Result Set Oracle ® SQL call
The most useful technique for detecting a table that violates record premises is by joining the table back to a copy of itself that has been summarized by the number of expected rows, keeping only those records that don’t meet expectations. • A SQL statement that uses a count() function, a ‘group by’ clause, and a ‘having’ clause does this job effectively. • Use Proc SQL and other procedures to test logic/values
select l.* from (select PERSON_UID, ID, ACADEMIC_PERIOD, PROGRAM, PRIMARY_PROGRAM_IND, ADMISSIONS_POPULATION from &table_academic_study where academic_period in (&study) ) l inner join (select person_uid from &table_academic_study where academic_period in (&study) group by person_uid, program having count(person_uid) > 1 ) r on l.person_uid = r.person_uid • Use Proc SQL and other procedures to test logic/values