1 / 13

Jim Groeneveld, OCS Consulting, ´s Hertogenbosch, Netherlands. PhUSE 2011

PhUSE 2011. Comparing dataset metadata. Jim Groeneveld, OCS Consulting, ´s Hertogenbosch, Netherlands. PhUSE 2011. Comparing dataset metadata. AGENDA / CONTENTS Comparing dataset data and metadata PROC COMPARE macro %CrossRef Dataset and variable attributes Example results (in dataset)

yama
Download Presentation

Jim Groeneveld, OCS Consulting, ´s Hertogenbosch, Netherlands. PhUSE 2011

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. PhUSE 2011 Comparing dataset metadata Jim Groeneveld, OCS Consulting, ´s Hertogenbosch, Netherlands. PhUSE 2011

  2. Comparing dataset metadata • AGENDA / CONTENTS • Comparing dataset data and metadata • PROC COMPARE • macro %CrossRef • Dataset and variable attributes • Example results (in dataset) • Dataset attributes • Variable attributes • Application of macro %CrossRef • Some technical information • Future features

  3. Comparing dataset metadata • A. Comparing dataset data and metadata • PROC COMPARE • data oriented (attributes: NOVALUES option) • only 2 datasets (or variables in one) at a time • cumbersome output (summary: OUT= dataset) • may be tuned as desired, yet limited to pairs • SAS macro %CrossRef • structure oriented: dataset & variable attributes • any number of specified datasets (from 1) • tabular summarisation (in result dataset only) • columns: dataset names; rows: attributes • user specification of desired attributes

  4. Comparing dataset metadata • B. Dataset and variable attributes • Dataset attributes • MemName, MemLabel and LibName • Creation and Modification date and time • Number of variables and physical observations • Variable attributes • Name (common name in first attribute column) • Label: as value in above Name attribute record if no label then text: "-no label-" if no corresponding variable: empty • optional variable’s Type and Length (combined) • optional variable’s Informat and Format

  5. Comparing dataset metadata • C.Example results (in dataset) 1/2 • Dataset attributes • attributedatasetdatasetdataset • column123

  6. Comparing dataset metadata C.Example results (in dataset) 2/2 Variable attributes attributedatasetdatasetdataset column123 6

  7. Comparing dataset metadata D. Application of macro %CrossRef not with entirely different datasetsbut with a (limited) number of rather similar datasets to view differences master datasets and subsets of them different versions of datasets same datasets with different names similar datasets with different data Goal: to see whether more datasets could be combined into one dataset (or ignored if the data are identical) 7

  8. Comparing dataset metadata • E. Some technical information • all fields are type character of length $256, first, attribute field has $36 • internally SAS name literal variable names are applied • OPTIONS VALIDVARNAME=ANY is set, and reset to the original state at the end of the macro • variable names starting with an asterisk (*) or ending with an exclamation mark (!) and one digit. Avoid such names in your datasets and limit your variable name length to maximally 30 • WORK dataset names start with __

  9. Comparing dataset metadata • F. Future features 1/2 • comparing all datasets in one or more libraries using a wildcard (LibName.*) • optional aggregated data for both numerical and character variables • (non-deleted) logical number of observations • number of non-missing values • number of missing values • frequency distribution of a limited number of distinct (formatted) values (categories) • minimum and maximum (formatted) value(first and last non-missing character value)

  10. Comparing dataset metadata F. Future features 2/2 optional aggregated, univariate data for (mainly) numerical variables mean value median value (also approximate middle, non-missing, sorted, character value) (formatted) mode value (also most occurring non-missing character value) standard deviation various percentiles and more, e.g. distribution information and the statistics that PROC COMPARE can generate 10

  11. Comparing dataset metadata • QUESTIONS • & • ANSWERS • SASquestions@ocs-consulting.com • Jim.Groeneveld@OCS-Consulting.com • http://jim.groeneveld.eu.tf

  12. Q&A: Comparing dataset metadata SAS name literal A name expressed as a string within quotes, followed by the letter N. Applicable to variable names, statement labels and imported variable and table names from DBMS tables (e.g. Excel). Advantage: more compatibility. Example: 'This @#$name'n = 'a SAS name literal'; More information in:SAS Language Reference: Concepts. 12

  13. Q&A: Comparing dataset metadata • Straightforward inventory of metadata • save results of PROC CONTENTS (or of the CONTENTS statement of PROC DATASETS for one or more libraries) to datasets, • if desired keep the most important variables LibName, MemName, Name, Label, Type, Length, Format, FormatL, FormatD, Informat, InformL and InformD; • concatenate all metadata datasets (SET); • if desired sort by variable NAME. This generates all dataset and variable information in subsequent records.

More Related