230 likes | 355 Views
Smarter Sorts. Beyond the Obvious. Jerry Le Breton ( Softscape Solutions) & Doug Lean (DHS). Sorting –The Obvious First. Why Sort ? “Data and information is almost always presented in a sorted or structured way”. Sorting - The Obvious First. proc sort data =claims;
E N D
Smarter Sorts Beyond the Obvious Jerry Le Breton (Softscape Solutions) & Doug Lean (DHS) SAUSAG 69 – 20 Feb 2014
Sorting –The Obvious First • Why Sort ? • “Data and information is almost always presented in a sorted or structured way”
Sorting - The Obvious First procsortdata=claims; by claim client; Sort puts your records in order - BY the values of the variables you list. • Its important to know your data • How many variables • How many distinct data values for each SAUSAG 69 – 20 Feb 2014
Sorting – Do You Need To? procsortdata=claims; by claim; Proc tabulate ...; classclaim; ... • Some PROCS do their own sorting: • TABULATE • MEANS • REPORT • SQL • (which can run out of memory for really big data sets) An unnecessary SORT SAUSAG 69 – 20 Feb 2014
Sorting – Do You Need To? • Only use PROC SORT before REPORT, TABULATE, MEANS if there’s another reason later. • For PROC MEANS substitute BY with CLASS e.g. PROC MEANS NWAY; CLASS x y z; Is similar to PROC SORT; BY x y z; PROC MEANS; BY x y z; And saves significant time by avoiding the SORT SAUSAG 69 – 20 Feb 2014
Sort Only What You Need Sort just the rows you want… procsortdata=claims out=Sorted_claims; where client =: 'A'; by claim; … and just the columns you want… procsortdata=claims(keep = c:) out=Sorted_claims; by claim; Leaving out unwanted rows and columns can produce dramatic performance improvements. SAUSAG 69 – 20 Feb 2014
Sorting – Proc Sort vsProc SQL /* SORT Procedure */ procsortdata=claims; byclient claim; run; /* SQL Procedure */ procsql; create table claims as select * from claims order by client claim; quit; • Both will sort your data. • No significant performance difference. • Choose according to clarity, functional requirement and efficiency. • Make it as clear and simple as possible! SAUSAG 69 – 20 Feb 2014
Sorted Status of a Data Set procsortdata=claims; by claim client; Sort status is saved as part of a SAS data set. Sort Information Sortedby CLAIM CLIENT Validated YES Character Set ANSI So SAS won’t waste time re-sorting if it’s already in the required order. SAUSAG 69 – 20 Feb 2014
Setting Sorted Status of a Data Set dataclient_claims (sortedby = client ); merge clients claims; by client ; If you know a data set is sorted, say so with the SORTEDBY= option!. Sort Information Sortedby CLIENT Validated NO Character Set ANSI So SAS won’t waste time re-sorting later. SAUSAG 69 – 20 Feb 2014
Presorted or Notsorted procsortdata=claims out=sorted presorted; by claim; • PRESORTED option for when data probably sorted! • SAS will check and only sort if necessary. procprintdata=grouped_claims; by claim NOTSORTED; • No need to sort if data is grouped BY the required variable – it doesn’t matter its NOTSORTED • (you just have to say so). SAUSAG 69 – 20 Feb 2014
Sorting and Maintaining Order procsortdata=claims; by claim ; • By default, SAS maintains the original order of records within a BY group. procsortdata=claims noequals; by claim ; • Using the NOEQUALS option means SAS won’t necessarily retain the original ordering. • More efficient but, directly affects the results of using NODUPKEY SAUSAG 69 – 20 Feb 2014
Sorting Duplicates procsortdata=claims out=no_duplicates nodupkey; by claim; NODUPKEY effectively keeps the first record of any duplicates. procsortdata=claims out=no_duplicates dupout=dups nodupkey; by claim; DUPOUT= puts the duplicates to a separate table. SAUSAG 69 – 20 Feb 2014
Separating Unique & Duplicate Rows procsortdata=claimsout=sorted ; by claim; run; dataunique_claims dup_claims; set sorted; by claim; iffirst.claim and last.claimthen outputunique_claims; else outputdup_claims; run; It works, but needs an extra pass of the data. SAUSAG 69 – 20 Feb 2014
Separating Unique & Duplicate Rows- the smarter way NOUNIQUEKEY ensures no records with a unique key are written to the OUT= table. procsortdata=claims out=duplicates uniqueout=uniques nouniquekey; by claim; run; …and the UNIQUEOUT= option directs the unique records to a separate table SAUSAG 69 – 20 Feb 2014
Sorting – Case Insensitive procsortdata=names out=simply_sorted; by name; Upper case letters are before lower case in the ASCII collating sequence. data names2; set names; upcase_name = upcase(name); procsortdata=names2 out=upcase_sorted(keep=name); byupcase_name; Creating an upper (or lower) case copy of the variable is the old solution. SAUSAG 69 – 20 Feb 2014
Sorting – Case Insensitive - Smarter SORTSEQ option specifies the collating sequence (ASCII/EBCDIC/other languages) or, LINGUISTIC option modifies the current collating sequence. procsortdata=names out=linguistic_sorted sortseq=linguistic; by name; The affect is to make the sort case insensitive. SAUSAG 69 – 20 Feb 2014
Sorting – Case Insensitive – with SQL PROC SQL allows the use of functions in the Order By (and other) clauses. procsql; createtablesql_sortedas select * from names orderbyupcase(name); The result is different from Proc SORT using the sorteq=linguistic. SAUSAG 69 – 20 Feb 2014
Sorting Out Spaces procsortdata=names out=simply_sorted; by name; A standard sort is obviously no use. datanames_temp; set names; temp_name = upcase(compress(name)); run; procsortdata=names_temp out=temp_sorted(keep=name); bytemp_name; Creating another variable for sorting, without spaces, is the old solution.
Sorting Out Spaces procsql; createtablesql_sortedas select * from names orderbyupcase(compress(name)); Proc SQL can do it too. procsortdata=namesout=alt_handling_sorted sortseq = linguistic(alternate_handling = shifted); by name; Proc SORT can too! This sub-option of the LINGUISTIC sortseq option, effectively ignores spaces as well as being case-insensitive. SAUSAG 69 – 20 Feb 2014
Sorting by Numbers Sorting text with numeric prefixes e.g. student id and name … procsortdata=students out=simply_sorted; by student; … results in nothing useful! SAUSAG 69 – 20 Feb 2014
Sorting by Numbers datastudents_temp; set students; student_num = input(scan(student,1), 2.); run; procsortdata=students_temp out=temp_sorted(keep=student); bystudent_num; An extra data step can create a numeric variable to sort with (as can SQL of course) procsql; createtablesql_sortedas select * from students orderby input(scan(student,1), 2.); SAUSAG 69 – 20 Feb 2014
Sorting by Numbers procsortdata=students out=num_collation_sorted sortseq = linguistic (numeric_collation=on); by student; The numeric_collation sub-option of the LINGUISTIC sortseq option, sorts by the numeric values that prefix the variable values. SAUSAG 69 – 20 Feb 2014
Questions? • Did you learn something new from this presentation? SAUSAG 69 – 20 Feb 2014