240 likes | 361 Views
优化输出和创建新变量. Dr. Baokun Li 经济实验教学中心 商务数据挖掘中心. 优化 SAS 输出. TITLE 语句 – 为 Proc 程序块的输出作标记 LABEL 语句 – 为变量名称作标记 FORMAT 语句 – 为变量的数值作标记. 标准输出 The FREQ Procedure Cumulative Cumulative
E N D
优化输出和创建新变量 Dr. Baokun Li 经济实验教学中心 商务数据挖掘中心
优化 SAS 输出 • TITLE 语句 – 为Proc程序块的输出作标记 • LABEL 语句 – 为变量名称作标记 • FORMAT 语句 – 为变量的数值作标记
标准输出 The FREQ Procedure Cumulative Cumulative clinic Frequency Percent Frequency Percent ----------------------------------------------------------- A 18 18.00 18 18.00 B 29 29.00 47 47.00 C 36 36.00 83 83.00 D 17 17.00 100 100.00 优化输出 Number of Patients by Clinic The FREQ Procedure Clinical Center Cumulative Cumulative clinic Frequency Percent Frequency Percent ---------------------------------------------------------------- Birmingham 18 18.00 18 18.00 Chicago 29 29.00 47 47.00 Minneapolis 36 36.00 83 83.00 Pittsburgh 17 17.00 100 100.00
标准输出 The FREQ Procedure Cumulative Cumulative sebl_6 Frequency Percent Frequency Percent ----------------------------------------------------------- 1 70 70.00 70 70.00 2 23 23.00 93 93.00 3 6 6.00 99 99.00 4 1 1.00 100 100.00 优化输出 The FREQ Procedure Patient Report Headaches Cumulative Cumulative sebl_6 Frequency Percent Frequency Percent ------------------------------------------------------------- None 70 70.00 70 70.00 Mild 23 23.00 93 93.00 Moderate 6 6.00 99 99.00 Severe 1 1.00 100 100.00
TITLE 语句 • PROCFREQDATA=tdata; • TABLES clinic group sex educ sebl_1 sebl_6; • TITLE'Distribution of Selected Variables'; • TITLE2'on the TOMHS Dataset' ; • RUN; • TITLE 语句可以放在程序的任何地方。最好是放在过程 • 步里面 • title可随时改变 • 使用格式为:TITLE(n)‘文字'
Label 语句 LABEL clinic = 'Clinical Center'; LABEL group = 'Drug Treatment Group'; LABEL educ = 'Highest Education Attained'; LABEL sebl_1 = 'Patient Report Drowsiness'; LABEL sebl_6 = 'Patient Report Headaches'; Label 语句可以放在数据步或过程步内的任何地方(但不可在中间)。
Format 语句 • FORMAT brthdate mmddyy10. ; • FORMAT clinic $clinic. ; • FORMAT group group. ; • FORMAT fever headache se. ; • 告诉SAS 根据format(格式)显示变量的数值 • Format 语句可以放在数据步或过程步内的任何地方 • 既有内置format (例如日期),也有用户定义的format. • 一个format 可用于多个变量 • Format以句号结束 (.) • 字符型format以$开始
如何定义FORMAT PROCFORMAT; VALUE group 1 = 'Beta Blocker' 2 = 'Calcium Channel Blocker' 3 = 'Diuretic' 4 = 'Alpha Blocker' 5 = 'ACE Inhibitor' 6 = 'Placebo'; VALUE gender 1 = 'Men'2='Women' ; VALUE se 1 = 'None'2 = 'Mild’ 3 = 'Moderate'4 = 'Severe'; Format名称并不必是数据集合内一个变量的名称 Format名
PROCFORMAT; VALUE $clinic 'A' = 'Birmingham' 'B' = 'Chicago' 'C' = 'Minneapolis' 'D' = 'Pittsburgh' ; 请不要混淆format和将要被格式化的变量 Sas单从PROC FORMAT不能知道你想要格式化哪个变量。你需要用format语句对变量进行格式化。
* 定义的Format没有使用; PROCFREQ; TABLES clinic sebl_6; RUN; ========================================== * 使用了format; PROCFREQ; TABLES clinic sebl_6; FORMAT clinic $clinic. sebl_6 se. ; RUN;
Program 6 PROC FORMAT; ... DATA tdata ; INFILE 'd:\data\tomhs3.dat'; INPUT ptid $ clinic $ group sex educ eversmk alcbl sebl_1 sebl_6 ; run; LABEL clinic = 'Clinical Center'; LABEL group = 'Drug Treatment Group'; LABEL educ = 'Highest Education Attained'; LABEL sebl_1 = 'Patient Report Drowsiness'; LABEL sebl_6 = 'Patient Report Headaches'; LABEL alcbl = 'Alcoholic Drinks Per Week'; LABEL eversmk = 'Ever Smoke Cigarettes'; PROCFREQ DATA=tdata; TABLES clinic sebl_6; FORMAT clinic $clinic. sebl_6 se. ;
PROCMEANSDATA=tdata NMEANSTD; VAR alcbl; CLASS eversmk; ; FORMAT eversmk smoke. ; TITLE'PROC MEANS With Variable and Value Labels'; RUN; The MEANS Procedure Analysis Variable : alcbl Alcoholic Drinks Per Week Ever Smoke N Cigarettes Obs N Mean Std Dev ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Smoker 48 47 5.3829787 6.4892995 Non-smoker 52 52 3.5384615 4.6292401 ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ
需掌握的要点 • Format在使用之前需先定义 (PROC FORMAT). • FORMAT 语句可应用定义的format. • 数据步里的Label 和 format 语句可用于其后的 PROCs • 过程步里面的Label 和 format 语句只能用于该过程步 PROC本身
为网络或Word/Excel文件创建HTML 表 ODSLISTINGCLOSE; ODSHTMLFILE='educ.html'; PROCFREQDATA=tdata ; TABLES educ; FORMAT educ educ.; TITLE'HTML Output From PROC FREQ'; RUN; ODSHTMLCLOSE; The output will go to the html file rather than the .lst file. This can then be viewed on the web or inserted into word. Can also create RTF files. Note: Run statement is needed here, and ODS statement must be after run.
教育级别变量educ的解释 创建一个新变量 grad1= 1 如果大学毕业 = 2 否则
Program 7 DATA tdata; INFILE'/home/ph5420/data/tomhs.data' ; INPUT @ 1 ptid $10. @ 49 educ 1. @123 sbp12 3. ; * 这样的话,缺失值被变换为2; if educ < 7then grad1 = 2 ; else if educ >=7then grad1 = 1 ; * 下面的两种方法结果一样且是正确的; if educ < 7 and educ ne .then grad2 = 2; else if educ >=7then grad2 = 1; * IN 是SAS的一个特殊函数; if educ IN(1,2,3,4,5,6) then grad3 = 2; else if educ IN(7,8,9) then grad3 = 1;
PROCFORMAT; VALUE grad 1-6 = '< College Graduate' 7-9 = 'College Graduate'; VALUE sbpcat LOW - 119 = '< 120' 120-139 = '120-139' 140-HIGH = '140+' ; 注意: LOW 不包括缺失值
PROCFREQDATA=tdata; TABLES grad1 grad2 grad3 educ ; FORMAT educ grad.; RUN; grad1 Frequency Percent ---------------------------------- 1 43 43.00 2 57 57.00 grad2 Frequency Percent ---------------------------------- 1 43 43.43 2 56 56.57 Frequency Missing = 1 grad3 Frequency Percent ---------------------------------- 1 43 43.43 2 56 56.57 Frequency Missing = 1 educ Frequency Percent ----------------------------------------------- < College Graduate 56 56.57 College Graduate 43 43.43 最初的变量值为 1-9 缺失值被变成2
* 把变量sbp12 变换成 3 个级别; if sbp12 = .then sbp12c = . ; else if sbp12 < 120then sbp12c = 1 ; else if sbp12 < 140then sbp12c = 2 ; else if sbp12 >=140then sbp12c = 3 ; With if-then-else definitions SAS stops reading after the first true statement
PROCFREQDATA=tdata; TABLES sbp12c sbp12; FORMAT sbp12 sbpcat.; RUN; OUTPUT The FREQ Procedure Cumulative Cumulative sbp12c Frequency Percent Frequency Percent ------------------------------------------------------------ 1 36 39.13 36 39.13 2 43 46.74 79 85.87 3 13 14.13 92 100.00 Frequency Missing = 8 Cumulative Cumulative sbp12 Frequency Percent Frequency Percent ------------------------------------------------------------ < 120 36 39.13 36 39.13 120-139 43 46.74 79 85.87 140+ 13 14.13 92 100.00 Frequency Missing = 8
How come no values of 2 and why so many missing? * 代价惨重的简单错误; if sbp12 = .then sbp12c = . ; else if sbp12 < 120then sbp12c = 1 ; else if sbp12 < 140then sbp12 = 2 ; else if sbp12 >=140then sbp12c = 3 ; PROCFREQDATA=tdata; TABLES sbp12c; RUN; The FREQ Procedure Cumulative Cumulative sbp12cat Frequency Percent Frequency Percent ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ 1 36 73.47 36 73.47 3 13 26.53 49 100.00 Frequency Missing = 51
创建新变量时需知道: • 新变量的初始值是缺失值; • 2. 缺失值 < 任何值; • 3. 数值变量缺失值的指代值为.; • 4. 类别变量缺失值的指代值为‘ ’; • if sbp = . then ... • if clinic = ' ' then ...
如何检查新建立的变量是对的? • 显示新变量和原变量 • PROCPRINTDATA=temp (OBS=20); • VAR educ college ; • 在新变量和原变量上运行PROC MEANS,确保两个变量的缺失值个数相同. • PROCMEANSDATA=temp; • VAR educ college; • 在新变量和原变量上运行PROC FREQ. • PROCFREQDATA=temp; • TABLES educ college;