How to Begin Using Stata. Lisa Kaltenbach, MS. Biostatistician II Department of Biostatistics Vanderbilt University Lisa.kaltenbach@vanderbilt.edu.
The information in this presentation has been gathered from many sources. None of the ideas in this lecture are my original work, although I may have organized them in an original way.
Outline • Introduction. • Menus vs. commands. • Window layout. • Inputting Data. • Data Management. • Variable labels. • Variable value labels. • Deriving new variables. • Sorting rows by columns. • Deleting observations/variables. • File Management. • Do-files. • Log files. • Stata Help.
Introduction: Why use Stata? According to www.stata.com: • Stata is a complete, integrated statistical package that provides everything you need for data analysis, data management, and graphics. • Fast, accurate, and easy to use: • With a point-and-click interface, an intuitive command syntax, and online help. • All analyses can be reproduced and documented for publication and review. • Broad suite of statistical capabilities: • Hundreds of statistical tools at your fingertips, from advanced techniques, such as survival models with frailty, dynamic panel data (DPD) regressions, generalized estimating equations (GEE), models with sample selection, ARCH, and estimation with complex survey samples; to linear and generalized linear models (GLM), regressions with count or binary outcomes, ANOVA/MANOVA, ARIMA, cluster analysis, standardization of rates, and case–control analysis; to basic tabulations and summary statistics. • Complete data-management facilities: • You can combine and reshape datasets, manage variables, and collect statistics across groups or replicates. • You can work with byte, integer, long, float, double, and string variables. • Stata also has advanced tools for managing specialized data such as survival/duration data, time-series data, panel/longitudinal data, categorical data, and survey data. • Publication-quality graphics: • You can choose between existing graph styles or create your own. • Responsive and extensible: • Stata is so programmable that developers and users add new features every day to respond to the growing demands of today's researchers. • Matrix programming (Mata): Mata is both an interactive environment for manipulating matrices and a full development environment that can produce compiled and optimized code.
Introduction: Why use Stata? [2] • Cross-platform compatible: • Stata is available for Windows, Macintosh, and Unix computers (including Linux). • Stata datasets, programs, and other data can be shared across platforms without translation. • Complete documentation and other publications: • Comes with a base manual. • On-line documentation. • Publishes a journal and news quarterly. • Technical support and learning resources: • Free to registered users of Stata. • Stata provides online training through NetCourses. • You can also participate in short courses sponsored by Stata or third parties in various locations. • Widely Used: • Stata is distributed in more than 150 countries and is used by professionals in many fields. • Affordable: • Stata offers several purchase options to fit your budget.
Menus vs. Commands • Stata has a set of pull-down menus of commands. • Allows user to get results without needing to know syntax. • Alternatively, command syntax allows user to reproduce results easily. • Convenient if your datasets are updated repeatedly.
Window Layout • Stata has 5 windows. • Command: where commands are entered. • All commands and variables are case sensitive. • Results: where results appear. • Review: where past commands are listed. • Clicking a past command in Review window brings it to the command window where it can be modified and re-executed. • Graph: where graphs are displayed (appears only when graphs are requested). • Variable: where variables in current dataset are listed.
Inputting Data • Many Options: • Manually enter data into the Stata Data Editor. • Copy data into the Data Editor from another source (eg, Excel). • Import an ASCII (text) file. • Read in a spreadsheet saved as a tab- or comma-delimited text file. • Open an existing Stata data file. • Common file extension: .dta. • Use a conversion package (eg, StatTransfer or DBMSCopy) to read in data from another package (eg, SAS data file).
Manually Input Data • Open the Data Editor by: • Clicking on Data Editor icon (4th from right on tool menu bar, looks like a data file). • Via command: edit • Can enter numbers or text (appears red). • To define variable names: • Note: variables are automatically named var1, var2, … • Double-click on top of column to view/edit “Variable Properties” and change the name. • Via command: rename oldvarname newvarname • Eg. rename var1 id
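The renaming step can also be tried entirely from the command line. The following is a minimal sketch that uses the input command in place of the Data Editor; the variable names var1 and var2 mimic the Data Editor defaults, and the values are made up for illustration.

```stata
* Start with an empty dataset
clear
* Type in three hypothetical observations; var1 and var2 stand in for the default names
input var1 var2
1 58
2 56
3 70
end
* Give the columns meaningful names
rename var1 id
rename var2 age
* Check the result
describe
list
```

The input block ends at the line containing end; everything between is read as data, one observation per line.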
Copy Spreadsheet Data • To copy data into Data Editor from an MS Excel spreadsheet: • Open Spreadsheet with data. • Highlight and copy cells of interest. • Paste in Data Editor (via Edit menu, right-click, toolbar icon, or keyboard shortcut) in 1st cell (row and column), where you want the data to begin. • To save datafile: • Via drop-menu: File → Save As … • Via command: save pathname/datafilename.dta
Import ASCII File • Via drop-menu: File → Import → Unformatted ASCII data → (add variable names) • Note: After importing data by clicking on icons, the commands for importing the file are in the review window. • Via command: infile id age using "C:\Documents and Settings\kaltenla\My Documents\Desktop\SampleData.dat"
Import a Spreadsheet • Via drop-menu: File → Import → ASCII data created by a spreadsheet. • Browse to find your file and click on the type of delimiter (eg, tab, comma). • Via command: • Comma delimited file: • insheet using "C:\Documents and Settings\kaltenla\Desktop\SampleData.csv", comma • Tab delimited file: • insheet using "C:\Documents and Settings\kaltenla\Desktop\SampleData.txt", tab
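To try the comma-delimited import without an existing file, one option is to write a small file first and read it back. This is a sketch under assumptions: the data values are made up, and SampleData.csv is written to the current working directory rather than the path shown on the slide. Stata 13 and later also provide import delimited for the same task.

```stata
* Build a small hypothetical dataset
clear
input id age
1 58
2 56
3 70
end
* Write it out as a comma-delimited text file in the working directory
outsheet using "SampleData.csv", comma replace
* Read the file back in; clear discards the data currently in memory
insheet using "SampleData.csv", comma clear
list
```

By default insheet reads the variable names from the first row of the file, so id and age come back under the same names.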
Opening an Existing Stata Datafile • Via drop-menu: File → Open → Scroll to find data • Via command: use "C:\Documents and Settings\kaltenla\My Documents\Work\pbc.dta", clear • Eg, the Primary Biliary Cirrhosis data set (available from the Dataset Archive on the StatLib website, http://www.stat.cmu.edu/). • clear is an option of the use command that clears the data in memory before loading the requested datafile. • This is necessary because Stata can have only one dataset in memory at a time!
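The role of the clear option is easiest to see in a save and reload round trip. The sketch below assumes a hypothetical file name, toy_example.dta, in the current working directory, and made-up values.

```stata
* Build a small hypothetical dataset and save it
clear
input id age
1 58
2 56
3 70
end
save "toy_example.dta", replace
* Change the data in memory without saving
generate age2 = age^2
* Reload the saved file; clear discards the unsaved change
use "toy_example.dta", clear
describe
```

Without clear, Stata refuses to load the file here because the unsaved variable age2 would be lost.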
Listing Data • The describe command lists the variables, labels, formats, storage types, number of observations, and the date the file was created. • describe • The list command lists rows and columns of the data file. • list id chol album bili in 1/6 • Displays only the variables id, chol, album, and bili for the first 6 observations. • Suppose we are interested in looking at the histologic stage of disease (stage) and treatment (drug) for males. • list varlist if condition lists the variables specified in varlist, restricting to those observations satisfying condition. • list stage drug if sex==0 • Eg, for males with cholesterol greater than or equal to 370: • list stage drug if sex==0 & chol >= 370 • The operators available in condition are: < less than, <= less than or equal to, > greater than, >= greater than or equal to, == equal to, ~= not equal to, & and, | or.
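The listing commands can be tried on a small stand-in dataset. The values below are made up and only borrow the variable names from the slide. One point worth checking on any real dataset: Stata treats a numeric missing value (.) as larger than any number, so chol >= 370 also selects observations with missing chol unless they are excluded explicitly.

```stata
* Hypothetical toy data; sex is 0 for males and 1 for females
clear
input id sex chol stage drug
1 0 261 4 1
2 1 302 3 1
3 0 455 4 2
4 1 .   2 2
5 0 370 1 1
6 0 .   3 2
end
describe
* First three observations, two variables only
list id chol in 1/3
* Males only
list stage drug if sex==0
* Males with chol >= 370; this also picks up id 6, whose chol is missing
list id chol if sex==0 & chol >= 370
* Same condition, excluding missing cholesterol values
list id chol if sex==0 & chol >= 370 & chol < .
```

The condition chol < . is true only for non-missing values, which is why it works as a filter.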
Data Management: Variable Labels • For example: you want to label the bili column as "Bilirubin mg/dl". • Via drop-menu: Data → Labels → Label variable • Via command: label variable creates a variable label. • label variable bili "Bilirubin mg/dl" • To remove the label, use the command without the label text: • label variable bili
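A short sketch of attaching a variable label, using made-up bilirubin values. Note that Stata expects straight double quotes around the label text.

```stata
* Hypothetical toy data
clear
input id bili
1 14.5
2 1.1
3 1.4
end
* Attach a descriptive label to bili
label variable bili "Bilirubin mg/dl"
* The label appears in the describe output
describe bili
```

Running label variable bili on its own afterwards would remove the label again.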
Data Management: Variable Value Labels • Suppose the variable sex =0 for males, =1 for females. • Whenever we list the variable sex, we see the levels 0 and 1. • We can create labels for these data values so the output will display "male" for 0 and "female" for 1. • Very convenient when working with unfamiliar variables, large data sets, or variables with many levels. • A two-step process: • Create labels. • Assign labels to values.
Data Management: Variable Value Labels [2] • Create and assign variable value labels: • Via command: label define creates labels for data values. • label define sexlab 0 "male" 1 "female" • Via command: label values assigns a label to the values of a variable. • label values sex sexlab • To remove variable value labels: • label values sex (detaches the label from sex), or • label drop sexlab (deletes the label definition itself). • Via drop-menu: Data → Labels → Label values → Define or modify value labels • Via drop-menu: Data → Labels → Label values → Assign value labels to variable
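The two-step process can be sketched on a few made-up observations. The label name sexlab follows the slide; the data values are hypothetical.

```stata
* Hypothetical toy data; sex is 0 for males and 1 for females
clear
input id sex
1 0
2 1
3 0
end
* Step 1: define the label set
label define sexlab 0 "male" 1 "female"
* Step 2: attach it to the variable
label values sex sexlab
* Output now shows male/female instead of 0/1
tabulate sex
* The underlying numeric codes are unchanged
list, nolabel
```

Because the stored values are still 0 and 1, conditions such as if sex==0 keep working after the labels are attached.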
Data Management: Sorting Data • Suppose we want to sort the data by age and serum cholesterol (mg/dl). • The sort command allows you to sort the rows of a data set by one or more variables (columns). • sort age chol • Nice for listing and summarizing data. • Eg, run sort edema, then by edema: summarize (the by prefix requires the data to be sorted by edema first).
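A sketch of sorting followed by group-wise summaries, on made-up values that reuse the slide's variable names. The bysort prefix shown at the end combines the sort and by steps in one command.

```stata
* Hypothetical toy data
clear
input id edema age chol
1 1  58 261
2 0  56 302
3 .5 70 176
4 0  54 244
5 1  38 279
end
* Sort rows by age, breaking ties by chol
sort age chol
list
* Sort by edema, then summarize chol within each edema level
sort edema
by edema: summarize chol
* Equivalent one-step form
bysort edema: summarize chol
```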
Data Management: Derive New Variables • Suppose we want to create a new variable "anyedema" for the presence of any edema. • In our data set the variable edema = 0 for no edema and no diuretic therapy, 0.5 for edema present without diuretics or edema resolved by diuretics, and 1 for edema despite diuretic therapy. • i.e., we want to collapse to: 0 if edema=0 and 1 for edema= {.5, 1}. • Use the generate and replace commands in conjunction: • Give the new variable an initial value: • generate anyedema=0 • Replace the initial value where needed. • replace anyedema=1 if edema==.5 | edema==1 Or • replace anyedema=1 if edema>0 & edema<. (Stata treats missing values as larger than any number, so the edema<. check keeps observations with missing edema from being coded as 1.) • Good idea to check that the new variable was coded correctly by cross-tabulating anyedema by edema. • tabulate generates a frequency distribution table. • tabulate anyedema edema • Via drop-menu: Statistics → Summaries, tables, & tests → Tables → Two-way tables with measures of association.
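The recoding steps can be sketched as follows on made-up data that includes one missing edema value. The extra step that sets anyedema to missing when edema is missing is an assumption about how such cases should be handled, not something stated on the slide.

```stata
* Hypothetical toy data; id 5 has missing edema
clear
input id edema
1 0
2 .5
3 1
4 0
5 .
end
* Start everyone at 0
generate anyedema = 0
* Code 1 for any edema, excluding missing values
replace anyedema = 1 if edema > 0 & edema < .
* Assumption: carry missing edema through as missing rather than 0
replace anyedema = . if missing(edema)
* Cross-tabulate to check the coding; the missing option shows missing values too
tabulate anyedema edema, missing
```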
Data Management: Delete variables/observations • The drop command deletes specified variables. • drop bili chol drug • Can also drop a subset of observations by incorporating a conditional expression. • drop if sex==0
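Both forms of drop can be tried on a small stand-in dataset; the values are made up. The deletions affect only the copy in memory until the data are saved, so the file on disk is unchanged unless it is overwritten.

```stata
* Hypothetical toy data; sex is 0 for males and 1 for females
clear
input id sex bili chol drug
1 0 14.5 261 1
2 1 1.1  302 1
3 0 1.4  176 2
end
* Delete three variables (columns)
drop bili chol drug
* Delete the observations (rows) for males
drop if sex==0
list
```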
File Management: Using a Do-file • A do-file is a text (also called batch) file with a series of commands to be executed in order by Stata. • Also great for composing, revising, and saving Stata commands. • To use a do-file: • Click on Do-File Editor. • Enter commands. • Save file with .do extension. • To execute a do-file: • Via command: do pathoffile/filename.do • Via drop-menu: File → Do …
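As an illustration, the contents of a small do-file might look like the sketch below. The file name example.do and the data values are hypothetical.

```stata
* example.do (hypothetical file name)
* Keep output from pausing at each screenful
set more off
* Build a small hypothetical dataset
clear
input id age
1 58
2 56
3 70
end
* Label and summarize
label variable age "Age in years"
summarize age
```

Saved as example.do in the working directory, it would be run with do example.do, and the same results are reproduced on every run.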
File Management: Log files • Can be used to record (and print): • Executed commands. • Resulting output (except for graphs). • Recommend that the first thing you do in Stata is open a log file. • Two types of Log files: • Unformatted Log files: • Lack formatting, but are simpler to use if you plan to insert and edit the output in a text editor. • Common file extension: .log. • Formatted Log files: • "Stata Markup and Control Language" files. Great for viewing and printing within Stata. • Common file extension: .smcl. • To open a Log file: • Via drop-menu: File → Log → Begin… • Via toolbar: Click on the 4th icon from left on menu bar (looks like a scroll)
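Log files can also be opened and closed by command, which is convenient inside a do-file. This is a minimal sketch; the file name session.log is hypothetical and is written to the current working directory.

```stata
* Close any log left open earlier; capture suppresses the error if none is open
capture log close
* Start an unformatted (plain-text) log; replace overwrites an existing file
log using "session.log", text replace
* Commands and their output are now recorded
display "Hello from the log"
* Stop recording
log close
```

Leaving out the text option and using a .smcl extension produces a formatted log instead.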
Stata On-line • If you are connected to the Internet, you are also connected to the Stata website (www.stata.com) whenever you run Stata. • New Stata programs can be downloaded from their website onto your computer.
Stata Help • Use help followed by a command name to open a window with documentation for that command (eg, help reg). • Via drop-menu: • Help → Contents for a list of Stata commands in table-of-contents format. • Help → Search… for a keyword search. • Help → Stata Command… to search for specific Stata commands. • If you are connected to the Internet and running Stata, when conducting a search you are searching both the Stata software and the Stata website.
Stata Resources • Statistical Modeling for Biomedical Researchers by W. Dupont. • An Introduction to Stata for Health Researchers by Svend Juul. • The Stata News is a quarterly publication containing announcements of new releases and updates, NetCourse schedules, new books, Users Group meetings, new products, and other announcements of interest to Stata users. • Stata Press also publishes books about using Stata and about statistics topics for professional researchers of all disciplines. • The Stata Journal is a quarterly publication containing articles about statistics, data analysis, teaching methods, and effective use of Stata’s language. • http://www.stata-journal.com
Thanks to: • Dr. Patrick Arbogast for his Biostatistics I lecture notes. • Terri Scott for her constructive criticism. • All of you for attending. • GCRC for the opportunity to present this seminar and breakfast.