STAT 6360 –Statistical Software Programming Introduction to Statistical Software • Statistical Software – The Landscape • Why Should You Learn SAS? • Why Should You Learn R? • A Question Not Worth Answering: Which one is better?
STAT 6360 –Statistical Software Programming Statistical Software – The Landscape There are many, many tools for statistics and data analysis. • Proprietary comprehensive “packages”: SAS, SPSS, Stata, Minitab, BMDP, Genstat, Systat, S-PLUS, JMP • Proprietary specialized packages: StatXact, nQuery Advisor, MLwiN, Mplus, LISREL, Egret • Open source/freeware programs. • Most are specialized: OpenBUGS, WinBUGS, JAGS • Few are comprehensive: R, DAP • Tools for mathematics and mathematical programming useful for statistics: Matlab, Mathematica, Maple, Ox
STAT 6360 –Statistical Software Programming Statistical Software – The Landscape • SAS – big, expensive, powerful, much more than statistics, the big kid on the block. Data management, entrenched dominance give it an edge over many competitors. • SPSS – comprehensive, easy-to-use package popular in social sciences, but a big step behind SAS • Stata – comprehensive, popular in economics and other fields. • Minitab – One of the originals that has survived by being friendly, easy-to-use, inexpensive. • JMP – A friendly GUI-based package from the makers of SAS, with more limited capability. • R – a programming language and environment, not a “package” per se. Free, open-source, flexible, extensible, powerful, but “user beware”.
STAT 6360 –Statistical Software Programming Why Should You Learn SAS? • SAS is an industry and government standard. Many companies and agencies like SAS, have a long-standing history of using SAS, and want job candidates to know SAS. • E.g., although it is not true that the FDA requires SAS to be used for clinical trials, it does have regulations that have led to SAS being used almost exclusively (until recently) for FDA clinical trials. In part, this is why SAS dominates in the pharmaceutical industry. • SAS has excellent data management facilities, handles big datasets well, is very powerful, is well documented, has the strength of the SAS name behind it, provides many different tools besides statistical analysis, keeps up with the latest methods, employs very good people to keep its edge, and implements some methods especially well (e.g., the general linear model, mixed-effect models). • Cons: expensive(!), hard to learn, inflexible as a programming environment, some features are awkward/hard to use (matrix language, macros, graphics).
STAT 6360 –Statistical Software Programming Why Should You Learn R? • The popularity of R is exploding and will only continue to grow. Why? • R is free!! But there’s much more to it: • It is a more flexible programming environment – closer to a true programming language than SAS. This allows code to be much more efficiently written and, in some cases, executed. • It is extensible. Through R packages, users contribute new capabilities constantly. New methods become available in R faster than they possibly could in a commercial package. No limits to what it can do. • The open source community is constantly improving R. • R graphics are much better and easier to use than those of many commercial packages. • Much better for simulations, optimization, coding new techniques. • Cons: documentation wildly uneven; commands often non-intuitive; language structure complex, designed to be easy to use, not easy to code; some methods have errors, others haven’t been validated by the scientific/statistical community; no accountability, sometimes you get what you pay for; Wikipedia-like.
STAT 6360 –Statistical Software Programming Which is Better, R or SAS? • This is a question that will generate fierce debate and a fair bit of irrationality on both sides. There’s really no winner in this argument. • SAS is better for some purposes, R for others. Also depends on the user and his/her taste, skills, etc. • Most statisticians know and regularly use a variety of tools. Trying to get by with one piece of software (or even 2 or 3) is a mistake. • As a statistician, for data analysis I use SAS somewhat more than R, for research I use R (and other tools like Matlab) much more than SAS. Learn Both!
STAT 6360 –Statistical Software Programming Introduction to SAS • SAS runs on many operating systems (Unix, mainframes, etc.), but we will learn SAS for Windows. • SAS code (programming statements) is (mostly) independent of the OS. It is our main focus. • But on most OSs, there is also a windowing environment, called the Display Manager (DM), with which one typically (but not necessarily) interfaces with SAS. • “Way back when” SAS code was typically executed in Batch Mode, where a code file (.sas file) was submitted at a command line, and results and information about the “job” (errors, etc.) were dumped into separate files (.lst and .log). • The SAS DM is very helpful. We’ll learn it too.
STAT 6360 –Statistical Software Programming First, An Example: pets.sas • From eLC, download the SAS file called pets.sas. • You will find this file in a folder (module) called “SAS Code”. • Save the file to your USB drive. I recommend setting up a folder structure something like this:
STAT 6360 –Statistical Software Programming Example: pets.sas If you open pets.sas in a simpler editor like Notepad, here’s what it looks like:
STAT 6360 –Statistical Software Programming But let’s look at it in SAS’s display manager, which has a much fancier editor that will help us understand the structure of the program. • Double-click the SAS icon on the desktop: • You will see the SAS DM, which looks like this:
STAT 6360 –Statistical Software Programming Now maximize the editor window and open pets.sas in the SAS editor by clicking File, then Open Program, and then find, select and open pets.sas from your USB drive. You should see this:

    * pets.sas ;
    data pets;
      input name $ species $ breed $ age weight gender $;
      datalines;
    Heidi dog mix 10 51 F
    Lexi dog mix 3 48 F
    Purr cat . 6 15 M
    Princess cat . 6 10 F
    ;
    run;

    proc print data=pets;
    run;

    proc freq data=pets;
      tables species*gender / norow nocol nopercent;
    run;
STAT 6360 –Statistical Software Programming SAS programs are structured into discrete chunks: • Data steps – read data, process data, store as a SAS dataset • PROCs (procedures) – perform statistical analyses (anova, regression) or data management or reporting tasks (sorting, printing) on SAS datasets. [Diagram: Raw Data → Data Step (read in data; process data, e.g., create new variables, and save as SAS dataset) → PROCs (analyze data)]
STAT 6360 –Statistical Software Programming Example: pets.sas pets.sas does three main things: • Reads in data about my pets and stores them in a SAS dataset called pets. • Prints the dataset in the output destination. • The output destination used to be the output window (or, in batch mode, the .lst file), but now html output is the default, unless we override it, which we will do now. • Click Tools → Options → Preferences → Results . Then put a check next to “Create listing”, and uncheck “Create HTML” and “Use ODS Graphics”. Then click OK. • Creates a cross-tabulation (or contingency table) of my pets’ species by gender.
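The same switch of output destination can probably be made in code instead of through the Preferences dialog. The following is a minimal sketch, assuming a SAS version in which HTML is the default destination; the ODS statements are standard SAS, but that they reproduce the dialog settings exactly on your installation is an assumption.

```sas
/* Sketch: route results to the text Output window instead of HTML. */
ods html close;      /* stop sending results to the HTML destination   */
ods listing;         /* open the listing (plain text) destination      */
ods graphics off;    /* turn off ODS Graphics                          */
```

Unlike the Preferences setting, statements like these take effect only from the point where they are submitted, so they would normally sit near the top of a program.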
STAT 6360 –Statistical Software Programming SAS code consists of an ordered collection of statements. • Each statement must end with a semi-colon! • SAS is generally case insensitive. • Statements can start in any column, can continue across multiple lines, and need not be on different lines. • Statements consist of a mix of keywords (data, input, proc, run) and user-supplied variable names, dataset names, etc. (pets, name, species, gender) • Keywords tell what kind of statement it is (data, proc, tables) or specify options to control details (data, norow, nocol, nopercent). • Note that the first instance of data is a data statement; the other instances of data are options to PROCs telling which dataset to operate on. • Statements are (mostly) organized into data steps and PROCs. Each data step and PROC ends with run;
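To make the free-format rules concrete, here is a sketch of pets.sas rewritten with deliberately inconsistent case and layout. It is meant only to show what SAS tolerates, not how code should be written.

```sas
/* Keywords in mixed case: SAS treats DATA, Input and data alike */
DATA pets;
    Input name $ species $ breed $ age weight gender $;
    DATALINES;
Heidi dog mix 10 51 F
Lexi dog mix 3 48 F
Purr cat . 6 15 M
Princess cat . 6 10 F
;
RUN;

/* Two statements on one line */
proc print data=pets; run;

/* Statements spread over several lines, starting in arbitrary columns */
      proc freq
   data=pets;
tables species*gender
         / norow nocol nopercent;
   run;
```

Every statement still ends with a semi-colon; that is the one rule the layout cannot bend.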
STAT 6360 –Statistical Software Programming Example: pets.sas Let’s run our program! To run (or submit) our entire SAS program, click on the running man icon: The program runs, and the DM switches to the Output window, which shows the results we asked for: a print-out of the dataset pets (p.1) and a cross-tab of species by gender (p.2). Before becoming too enamored with our results, we should ALWAYS LOOK AT THE LOG WINDOW FIRST! • The log tells us what happened when we submitted the program: what ran, what went wrong (ERRORs), and other useful information.
STAT 6360 –Statistical Software Programming The Log Window The log contains: • Notes • What version of SAS did I use, what datasets did I use or create and how big were they, how much time did it take. • Warnings • Missing values were created, misspelled words that SAS could figure out, etc. • Errors • In red, so easy to find. • Missing semi-colons(!), bad syntax, etc. • If you have errors, fix your code before looking at the results, which may be gibberish. Keep in mind, bad code sometimes runs without errors. No errors doesn’t mean your results aren’t still gibberish!
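As an illustration of bad code that runs without errors, the sketch below deliberately drops the $ after gender. The dataset name pets_bad is hypothetical. I would expect SAS to report the unreadable values in NOTEs rather than in red ERROR lines, and to store gender as missing for every pet, but check your own log to confirm.

```sas
/* Deliberate mistake: gender has no $, so SAS tries to read F and M as numbers */
data pets_bad;
    input name $ species $ breed $ age weight gender;
    datalines;
Heidi dog mix 10 51 F
Lexi dog mix 3 48 F
Purr cat . 6 15 M
Princess cat . 6 10 F
;
run;

/* Print the result to see what actually ended up in gender */
proc print data=pets_bad;
run;
```

A program like this is exactly why the notes in the log deserve as much attention as the errors.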
STAT 6360 –Statistical Software Programming Example: pets.sas • The data statement creates a new dataset, which we’ve chosen to call pets. • Where does it live? SAS datasets are organized into what SAS calls libraries. By default, datasets are stored in the WORK library, which is temporary and is deleted at the end of our SAS session. We’ll come back to libraries. • A data step can read in data from external files in a wide variety of ways, create variables and data of its own, or, as in pets.sas, read data from data lines included in the data step itself. • Regardless of where the data come from, input tells which variables to read in, and what type of data SAS should expect for each variable. • The $ tells SAS to expect a character variable (e.g., species). By default it expects a numeric variable (e.g., age).
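A data step that creates data of its own, with no input statement at all, might look like the sketch below. The dataset name squares and its variables are hypothetical. The proc print step refers to the dataset by its full name, work.squares, to show where a one-level name ends up.

```sas
/* Sketch: a data step that generates its own data */
data squares;
    do i = 1 to 5;
        square = i**2;   /* ** is the exponentiation operator */
        output;          /* write one observation per pass through the loop */
    end;
run;

/* squares and work.squares refer to the same temporary dataset */
proc print data=work.squares;
run;
```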
STAT 6360 –Statistical Software Programming Variable names: • Between 1 and 32 characters in length. Must start with a letter (A-Z) or underscore, and cannot contain special characters besides underscores. Variable types: • Numeric or character. • Values for numeric variables can include +, -, e, or E (scientific notation). • Variables whose values contain numbers can be treated as character variables, but not vice versa. • Character variables can have values that are up to 32,767 characters long. • On the input statement, a variable name must be followed by $ (immediately or after space(s)) to be treated as a character variable. • Other types of data (e.g., dates and times) are treated as numeric or character too, with special tools (informats and formats) to handle their special nature.
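The sketch below illustrates these typing rules with made-up data; the dataset name types_demo, the variable names and all values are hypothetical. The zip variable holds digits but is read as character, which is one common reason to do so: a leading zero survives.

```sas
/* Sketch: numeric vs. character variables on the input statement */
data types_demo;
    /* id and zip are followed by $, so they are character;
       dose and change have no $, so they are numeric */
    input id $ zip $ dose change;
    datalines;
A01 30602 1.5e3 -2.25
A02 02134 2E-2 +4
;
run;

proc print data=types_demo;
run;
```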
STAT 6360 –Statistical Software Programming Example: pets.sas • The datalines statement tells SAS that the data to be read in follow immediately. • Synonym is cards. • In pets.sas, we use perhaps the simplest type of input, list input. • In list input, the data are separated by spaces. SAS looks for a value for each variable, in order, going to the next line if necessary. Values for all variables must be given, or the variables and their values will become mismatched. • Most simply, if there are x variables and n subjects, you should put x values per line, separated by spaces, for each of n lines. • A semi-colon should appear at the end of all of the data. • Missing values should be indicated by a period (.). For numeric variables, a missing value will be printed as a period. For character variables it will be printed as a blank.
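To see list input go to the next line, consider this sketch in which Heidi's six values are split across two data lines. The dataset name pets_wrapped is hypothetical. With the default settings I would expect two observations, plus a note in the log saying that SAS went to a new line.

```sas
/* Sketch: list input continues onto the next line until every variable has a value */
data pets_wrapped;
    input name $ species $ breed $ age weight gender $;
    datalines;
Heidi dog mix
10 51 F
Lexi dog mix 3 48 F
;
run;

proc print data=pets_wrapped;
run;
```

The same behavior is what causes variables and values to become mismatched when a value is accidentally left out: SAS simply keeps reading until it has enough values.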
STAT 6360 –Statistical Software Programming Example: pets.sas • The run statement is used to end the data step and to tell SAS to execute it. It has a similar role for PROCs. • Actually, a run statement is not strictly necessary for a data step. The data step will also end and execute if SAS reaches a new data statement or a proc statement. • I strongly recommend using run at the end of every data step and proc, however. It makes your code easier to read and understand. • Notice also that I have indented all statements between data and run. • This indentation is just style, not syntax; it is not necessary as far as SAS is concerned. • But it makes a program much more readable, and its structure is clear at a glance. • I strongly recommend you follow this style of indenting subordinate statements.
STAT 6360 –Statistical Software Programming SAS’ Built-in Loop SAS processes a data step in a very structured way: line by line and observation by observation. A diagram of execution flow (from D&S): • This structure can be useful and powerful, but it is also very constraining!
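One way to watch the built-in loop at work is to write a line to the log on every pass through the data step. In this sketch, age_months is a hypothetical new variable, and _n_ is SAS's automatic counter of data step iterations.

```sas
/* Sketch: the statements in the data step execute once per observation */
data pets;
    input name $ species $ breed $ age weight gender $;
    age_months = 12 * age;          /* computed separately for each observation */
    put _n_= name= age_months=;     /* writes one line to the log per pass */
    datalines;
Heidi dog mix 10 51 F
Lexi dog mix 3 48 F
Purr cat . 6 15 M
Princess cat . 6 10 F
;
run;
```

After submitting, the log should contain one line for each of the four pets, in data order, which is the loop made visible.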
STAT 6360 –Statistical Software Programming Example: pets.sas • proc print; and run; tell SAS to run the print procedure, which prints out a dataset. • I have used the data= option. Without it, proc print will just print the most recently created dataset. With it, I tell SAS explicitly which dataset to operate on. • I strongly encourage you to always use the data= option on your proc statements to avoid confusion. • By default, proc print will print all variables and all observations in the dataset, but that behavior can be modified (e.g., to just print certain variables). • Notice SAS datasets have a rectangular structure (shaped as a matrix): columns are variables, rows are observations. • By default, proc print adds an Obs column to the print-out (runs from 1 to n, where n=number of observations).
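Two of these defaults can be changed with standard options, as in the sketch below: noobs drops the Obs column, and the obs= dataset option limits how many observations are read. Neither appears in pets.sas; they are shown here only as examples of modifying the default behavior.

```sas
data pets;
    input name $ species $ breed $ age weight gender $;
    datalines;
Heidi dog mix 10 51 F
Lexi dog mix 3 48 F
Purr cat . 6 15 M
Princess cat . 6 10 F
;
run;

/* noobs suppresses the Obs column */
proc print data=pets noobs;
run;

/* obs=2 is a dataset option: stop after the second observation */
proc print data=pets(obs=2);
run;
```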
STAT 6360 –Statistical Software Programming Example: pets.sas • proc freq; and run; tell SAS to run the freq procedure, which produces cross-tabulations (a.k.a. contingency tables) and does various types of analyses for categorical data. • The tables statement asks for a two-way contingency table giving the frequency of every combination of species and gender in the dataset. • In PROCs, statements often have options, which appear after a forward slash (/). In this case, the norow, nocol and nopercent options tell SAS not to include certain output that would appear in the contingency table by default (relative frequencies of different types). • Finally, the 1st line is a comment. Any statement that starts with an asterisk is a comment that continues until a semi-colon is reached. This comment just identifies the name of the program.
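To see what norow, nocol and nopercent remove, it may help to run proc freq without them. The sketch below requests a one-way table and then the same two-way table as pets.sas with all default output left in.

```sas
data pets;
    input name $ species $ breed $ age weight gender $;
    datalines;
Heidi dog mix 10 51 F
Lexi dog mix 3 48 F
Purr cat . 6 15 M
Princess cat . 6 10 F
;
run;

proc freq data=pets;
    tables species;           /* one-way frequency table */
    tables species*gender;    /* two-way table with the default percentages left in */
run;
```

Comparing the second table with the output of pets.sas shows which cell entries each option suppresses.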
STAT 6360 –Statistical Software Programming The SAS Display Manager (DM) The editor, log and output windows are essential. But other windows and tools in the DM can be awfully useful too. The Windows: • Enhanced editor. • The window that contains your code. It color-codes your syntax and allows you to collapse or expand chunks of code. Very useful. • There is also a (non-enhanced) program editor. No color-coding. Don’t use it! • Log • Output • Results – Lists the sections of your output. Allows quick access instead of paging through the output window.
STAT 6360 –Statistical Software Programming The SAS Display Manager (DM) The Windows: • Explorer – Similar to Windows Explorer. Gives access to your files from within SAS and to your SAS libraries. • Access to libraries is its only important use. • Libraries contain your SAS datasets. • We’ve mentioned the WORK library. We’ll talk about how to make and use other libraries soon. • Results Viewer – if you create HTML output (remember, we turned this feature off), or output in PDF, RTF or some other format, the Results Viewer pops up to display that output. • The output window shows results in text format. The results viewer displays your output in HTML, PDF or whatever other format you asked for.
STAT 6360 –Statistical Software Programming Viewing a Dataset with Viewtable Through the explorer window you can explore your libraries and look at individual SAS datasets you have created. • This is often more convenient than using proc print. • Double-click Libraries in the explorer window. • Double-click your WORK library. This shows the datasets that have been created. Here, pets is the only one. • Double-click it.
STAT 6360 –Statistical Software Programming Viewing a Dataset with Viewtable Now the dataset pops up in a new window called viewtable. • From viewtable you can see if your dataset has been created the way you intended, check for mistakes, etc. • Note that, by default, viewtable shows the variable labels in the column heading, not the variable names. These are not always the same and you may want to switch to column names in the View pull-down menu. • Variable labels and names may differ if you have defined labels with a label statement or, sometimes, when SAS itself has created the dataset and assigned labels to its variables.
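For reference, here is a sketch of how labels get attached with a label statement; the label text is hypothetical. The label option on proc print uses the labels as column headings, and proc contents lists each variable's name alongside its label, which is a code-based way to check what viewtable is showing.

```sas
data pets;
    input name $ species $ breed $ age weight gender $;
    label name   = 'Pet name'       /* hypothetical label text */
          weight = 'Body weight';
    datalines;
Heidi dog mix 10 51 F
Lexi dog mix 3 48 F
Purr cat . 6 15 M
Princess cat . 6 10 F
;
run;

/* Column headings use labels where they exist */
proc print data=pets label;
run;

/* Lists variable names, types, lengths and labels */
proc contents data=pets;
run;
```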
STAT 6360 –Statistical Software Programming More on Viewtable • Having your dataset open in viewtable while continuing to run code and work with that dataset can sometimes result in errors, so always close viewtable before returning to work on your program. • Viewtable can be used to enter or edit data in your dataset, change variable properties (add labels, change the length of character variables, …), sort your data, etc. • I strongly discourage this for most applications. It is usually better to do such operations through your SAS code. That way, changes you make are traceable. Your code is a record of what you did to the data to get your results.
STAT 6360 –Statistical Software Programming Other Features of the DM • Clears the current window. Often best to clear log and output before submitting code. • Print button. Often better not to print output directly from SAS. • Break button. This interrupts your code’s execution. Useful if it is taking too long and/or you suspect an error. • Command bar. Useless as far as I can tell. • Submit a selected portion of your code, or the whole editor window if nothing is selected (highlighted). • Help button. The help facility within SAS is convenient but sometimes sluggish; it can be better to access it through a web browser. • Standard open file, save file buttons.
STAT 6360 –Statistical Software Programming Example: pets2.sas Let’s make some improvements to pets.sas. Open pets2.sas in SAS and take a look. • Use of a header – Program name, author, purpose, etc. Use them! • Two ways to form comments. • DM statement – code to control the DM instead of point and click. • options statement. Several useful system options. • ls (line size), ps (page size in terms of number of lines), date, nodate, number, nonumber, pageno, formdlim (replaces a page break with a line of characters that can be specified with this option). • title ‘My Title’; – creates a title for your output. Use single or double quotes. A plain title; command clears the title. title2, title3, footnote, footnote2, etc. work similarly. • libname statement – defines a libref, which is a value that points to a SAS data library (e.g., WORK, or one you create with the libname statement).
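The contents of pets2.sas are not reproduced on the slide, so the following is only a sketch of what the top of such a program might look like. The program name, the option values and the title text are all hypothetical choices.

```sas
/*------------------------------------------------------------
  Program: header_demo.sas   (hypothetical name)
  Author:  (your name)
  Purpose: Illustrate a header, comments and setup statements
 ------------------------------------------------------------*/

* Comment style 1: starts with an asterisk and ends with a semi-colon;
/* Comment style 2: delimited by slash-asterisk and asterisk-slash */

dm 'log; clear; output; clear;';    /* clear the Log and Output windows */

/* 80-character lines, 60-line pages, no date or page number,
   and a line of dashes in place of each page break */
options ls=80 ps=60 nodate nonumber formdlim='-';

title 'My Title';
title2 'A second title line';
footnote 'A footnote';
```

Titles and footnotes stay in effect until they are changed, so a plain title; or footnote; later in the program is how to clear them.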
STAT 6360 –Statistical Software Programming Librefs and Permanent SAS Datasets • In pets2.sas, I’ve defined a library in which to store SAS datasets. I’ve given that library the name (or libref), SASdata, and identified it with a folder on my hard drive (or USB drive). • Librefs must be names of 8 characters or less. “SASdata” is just an example. • To save a dataset to the new SASdata library instead of the WORK library, use a two-level dataset name of the form libref.dsname • Now, when SASdata.pets is created, it is saved as a file called pets.sas7bdat in the path I have specified in the libref. • Permanent SAS dataset files have the .sas7bdat extension. They are not plain text files and can only be read by SAS and by file translation utilities (e.g., in other programs like R).
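A minimal sketch of the idea follows. 'MyDataPath' is a placeholder, not a real path; it must be replaced with an existing folder on your own drive before the code will run.

```sas
/* Placeholder path: replace 'MyDataPath' with a real folder */
libname SASdata 'MyDataPath';

/* Two-level name: the dataset is saved as pets.sas7bdat in that folder */
data SASdata.pets;
    input name $ species $ breed $ age weight gender $;
    datalines;
Heidi dog mix 10 51 F
Lexi dog mix 3 48 F
Purr cat . 6 15 M
Princess cat . 6 10 F
;
run;

/* PROCs refer to the permanent dataset by the same two-level name */
proc print data=SASdata.pets;
run;
```

In a later SAS session, resubmitting only the libname statement should be enough to make SASdata.pets available again, since the file persists on disk.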
STAT 6360 –Statistical Software Programming Example: pets2.sas More new features: • infile statement – points to a file from which to read the data, instead of using datalines. • The firstobs option specifies the line on which to start reading data. Default is line 1, but pets.dat has variable names on line 1, so skip these. • Still using list input here. • The file name and path in which to find it can be specified in quotes on the infile statement, or on a separate filename statement that associates a fileref with a particular file (e.g., a raw data file) on your hard drive. E.g., this works too, where petsdata is a fileref (or shortcut) to pets.dat:

    filename petsdata 'MyDataPath\pets.dat';
    data SASdata.pets;
      infile petsdata firstobs=2;
      input name $ species $ breed $ age weight gender $;
    run;
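For comparison, here is a sketch of the first alternative, with the path given in quotes directly on the infile statement and no fileref. As before, 'MyDataPath' is a placeholder for a real folder, and the code assumes pets.dat sits in that folder with variable names on its first line.

```sas
/* Placeholder path: replace 'MyDataPath' with a real folder */
libname SASdata 'MyDataPath';

data SASdata.pets;
    /* Quoted path on the infile statement; firstobs=2 skips the header line */
    infile 'MyDataPath\pets.dat' firstobs=2;
    input name $ species $ breed $ age weight gender $;
run;
```

A fileref tends to pay off when the same file is read in more than one place, since the path then needs changing in only one statement.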
STAT 6360 –Statistical Software Programming Example: pets2.sas More new features: • var statement in proc print – controls which variables from the dataset to print and in what order. Run the program in SAS, check the log and the output. • The log looks somewhat different, with info about the assignment of our libref and our use of the infile statement. There should be no errors. • The output differs from that of pets.sas only in the order of the variables used by proc print. • The analysis itself is nothing new; we’ve just used some new features in SAS.
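A sketch of the var statement in use is below. The particular selection and order of variables is a hypothetical choice, not necessarily the one made in pets2.sas, and the example uses a temporary copy of the data so that it runs on its own.

```sas
data pets;
    input name $ species $ breed $ age weight gender $;
    datalines;
Heidi dog mix 10 51 F
Lexi dog mix 3 48 F
Purr cat . 6 15 M
Princess cat . 6 10 F
;
run;

proc print data=pets;
    var name gender species age;   /* only these variables, in this order */
run;
```

Variables left off the var statement (here breed and weight) remain in the dataset; they are just omitted from the print-out.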