SAS Annotated Output GLM

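When using SAS, proc glm can report sums of squares computed in four ways: SS1, SS2, SS3, and SS4.

The article below describes the specific calculation steps for SS3, with an example.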

This page shows an example of analysis of variance run through a general linear model (glm), with footnotes explaining the output. The data were collected on 200 high school students, with measurements on various tests, including science, math, reading and social studies. The response variable is the writing test score (write), and we explore its relationship with gender (female) and academic program (prog). The model examined has the main effects of female and program type, as well as their interaction. The dataset used in this page can be downloaded from http://www.ats.ucla.edu/stat/sas/webbooks/reg/default.htm.

The syntax for the page is provided below. The class statement defines which variables are to be treated as categorical variables in the model statement. The model statement has the main effects of female and prog, as well as their interaction; the interaction is specified as the product of the two main effect terms (female*prog). The option ss3 tells SAS we want type 3 sums of squares; an explanation of type 3 sums of squares is provided below.

proc glm data = "c:\temp\hsb2";
 class female prog;
 model write = female prog female*prog / ss3;
run; quit;

The GLM Procedure

   Class Level Information

Class         Levels    Values
female             2    0 1
prog               3    1 2 3

Number of Observations Read         200
Number of Observations Used         200

Dependent Variable: write 

                                        Sum of
Source                      DF         Squares     Mean Square    F Value    Pr > F
Model                        5      4630.36091       926.07218      13.56    <.0001
Error                      194     13248.51409        68.29131
Corrected Total            199     17878.87500

R-Square     Coeff Var      Root MSE    write Mean
0.258985      15.65866      8.263856      52.77500

Source                      DF     Type III SS     Mean Square    F Value    Pr > F
female                       1     1261.853291     1261.853291      18.48    <.0001
prog                         2     3274.350821     1637.175410      23.97    <.0001
female*prog                  2      325.958189      162.979094       2.39    0.0946
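Since proc glm can compute four types of sums of squares, it can be useful to print them side by side. The following is a minimal sketch that reuses the data set path from the call above (adjust it to your own setup); only the options after the slash differ.

```sas
/* Sketch: request Type I through Type IV sums of squares in one run. */
/* The data set path is the one used on this page; it is an assumption */
/* about where hsb2 is stored on your machine.                         */
proc glm data = "c:\temp\hsb2";
 class female prog;
 model write = female prog female*prog / ss1 ss2 ss3 ss4;
run; quit;
```

If none of these options is given, proc glm prints the Type I and Type III tables by default.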

Class Level Information

   Class Level Information

Class^a       Levels^b  Values^c
female             2    0 1
prog               3    1 2 3

Number of Observations Read^d       200
Number of Observations Used^d       200

a. Class - Underneath are the categorical (factor) variables, which were defined as such in the class statement. Had the categorical variables not been defined in the class statement and just entered in the model statement, the respective variables would be treated as continuous variables, which would be inappropriate.

b. Levels - Underneath are the respective number of levels (categories) of the factor variables defined in the class statement.

c. Values - Underneath are the respective values of the levels for the factor variables defined in the class statement.

d. Number of Observations Read and Number of Observations Used - These are the number of observations read and the number of observations used in the analysis. The Number of Observations Used may be less than the Number of Observations Read if there are missing values for any variables in the equation. By default, SAS does a listwise deletion of incomplete cases.
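The points above can be checked directly. The sketch below first fits prog without a class statement, to show what "treated as continuous" means in practice, and then counts missing values in the analysis variables. It assumes the same data set path as the original call and that all three variables are numeric.

```sas
/* Sketch: prog entered WITHOUT a class statement is fit as one slope (1 DF), */
/* as if the program codes 1, 2, 3 were equally spaced numeric values.        */
proc glm data = "c:\temp\hsb2";
 model write = prog / ss3;
run; quit;

/* Count nonmissing (N) and missing (NMISS) values per analysis variable.     */
/* A case with a missing value in any of them is dropped from the analysis.   */
proc means data = "c:\temp\hsb2" n nmiss;
 var write female prog;
run;
```

In the first run, the DF column for prog should show 1 rather than 2, which is the visible sign that the variable was not treated as a factor.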


Model Information

Dependent Variable^e: write

                                        Sum of
Source^f                    DF^g       Squares^h   Mean Square^i  F Value^j  Pr > F^j
Model                        5      4630.36091       926.07218      13.56    <.0001
Error                      194     13248.51409        68.29131
Corrected Total            199     17878.87500

R-Square^k   Coeff Var^l    Root MSE^m  write Mean^n
0.258985      15.65866      8.263856      52.77500

Source^o                    DF^p   Type III SS^q   Mean Square^r  F Value^s  Pr > F^s
female                       1     1261.853291     1261.853291      18.48    <.0001
prog                         2     3274.350821     1637.175410      23.97    <.0001
female*prog                  2      325.958189      162.979094       2.39    0.0946

e. Dependent Variable - This is the dependent variable in our glm model.

f. Source - Underneath are the sources of variation of the dependent variable. There are three parts: Model, Error, and Corrected Total. With glm, you must think in terms of the variation of the response variable (sums of squares) and how that variation is partitioned. The variation in the response variable, denoted by Corrected Total, can be partitioned into two unique parts. The first, Model, is the variation in the response accounted for by our model (female prog female*prog). The second, Error, is the variation not explained by the Model. These two sources, the explained (Model) and the unexplained (Error), add up to the Corrected Total: SSCorrected Total = SSModel + SSError.

The term "Corrected Total" is called such, as compared to "Total", or more correctly, "Uncorrected Total," because the "Corrected Total" adjusts the sums of squares to incorporate information on the intercept. Specifically, the Corrected Total is the sum of the squared difference between the response variable and the mean of the response variable, whereas the Uncorrected Total is the sum of the squared values of just the response variable.
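Both totals can be obtained from proc means, which offers CSS (corrected sum of squares) and USS (uncorrected sum of squares) as statistic keywords. This is a small sketch assuming the same data set path as the proc glm call on this page.

```sas
/* Sketch: corrected vs. uncorrected total sum of squares for write.   */
/* CSS = sum of (write - mean)**2, USS = sum of write**2.              */
proc means data = "c:\temp\hsb2" n mean css uss;
 var write;
run;
```

The CSS value should agree with the Corrected Total sum of squares in the table (17878.875), and the two are linked by CSS = USS - N*mean**2.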

g. DF - These are the degrees of freedom associated with the respective sources of variance. Like the sums of squares, the degrees of freedom are additive: DFCorrected Total = DFModel + DFError. The Corrected Total has N-1 degrees of freedom, where N is the total sample size. See DF, superscript p, for the calculation of the DF for each individual predictor variable, which add up to DFModel. Hence, DFError = DFCorrected Total - DFModel. DFModel and DFError define the parameters of the F-distribution used to test F Value, superscript j.

h. Sum of Squares - These are the sums of squares that correspond to the three sources of variation.
SSModel - The Model sum of squares is the squared difference between the predicted value and the grand mean, summed over all observations. If our model did not explain a significant proportion of variance, the predicted values would be near the grand mean, which would result in a small SSModel, and SSError would be nearly equal to SSCorrected Total.
SSError - The Error sum of squares is the squared difference between the observed value and the predicted value, summed over all observations.
SSCorrected Total - The Corrected Total sum of squares is the squared difference between the observed value and the grand mean, summed over all observations.
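The three definitions above can be reproduced from the fitted values. The sketch below saves predicted values and residuals with the output statement of proc glm and then summarizes them with proc means. The names pred, yhat and resid are hypothetical, chosen only for this illustration, and the data set path is the one assumed throughout this page.

```sas
/* Sketch: save predicted values (yhat) and residuals (resid) to data set pred. */
proc glm data = "c:\temp\hsb2" noprint;
 class female prog;
 model write = female prog female*prog;
 output out = pred p = yhat r = resid;
run; quit;

/* CSS of write  = SSCorrected Total (deviations of observed from grand mean)   */
/* CSS of yhat   = SSModel (the mean of yhat equals the grand mean because      */
/*                 the model contains an intercept)                             */
/* USS of resid  = SSError (sum of squared residuals)                           */
proc means data = pred css uss;
 var write yhat resid;
run;
```

Read this way, the three numbers should match the Sum of Squares column, and the CSS of yhat plus the USS of resid should add up to the CSS of write.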

i. Mean Square - These are the Mean Squares (MS) that correspond to the partitions of the total variance. The MS is defined as SS/DF.

j. F Value and Pr > F - These are the F Value and p-value, respectively, testing the null hypothesis that the Model does not explain the variance of our response variable. F Value is computed as MSModel / MSError, and under the null hypothesis, F Value follows a central F-distribution with numerator DF = DFModel and denominator DF = DFError. The probability of observing an F Value as large as, or larger than, 13.56 under the null hypothesis is < 0.0001. If we set our alpha level (our willingness to accept a Type I error) at 0.05, we would reject the null hypothesis and conclude that our model explains a statistically significant proportion of the variance.
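The mean squares, F Value and p-value for the overall model can be recomputed by hand from the table. This is a minimal sketch in a data step; the variable names are made up for the illustration, and the inputs are copied from the output above.

```sas
/* Sketch: recompute MS, F and p for the overall model from the printed SS and DF. */
data _null_;
 ss_model = 4630.36091;   df_model = 5;
 ss_error = 13248.51409;  df_error = 194;
 ms_model = ss_model / df_model;          /* MS = SS / DF                 */
 ms_error = ss_error / df_error;
 f_value  = ms_model / ms_error;          /* F = MSModel / MSError        */
 p_value  = 1 - probf(f_value, df_model, df_error);  /* upper tail of F   */
 put ms_model= ms_error= f_value= p_value=;          /* written to the log */
run;
```

The F value written to the log should round to the 13.56 shown in the table, with a p-value well below 0.0001.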

k. R-Square - This is the R-Square value for the model. R-Square defines the proportion of the total variance explained by the Model and is calculated as R-Square = SSModel/SSCorrected Total = 4630.36/17878.88=0.259.

l. Coeff Var - This is the Coefficient of Variation (CV). The coefficient of variation is defined as 100 times the root MSE divided by the mean of the response variable; CV = 100*8.264/52.775 = 15.659. The CV is a dimensionless quantity and allows the comparison of the variation of populations.

m. Root MSE - This is the root mean square error. It is the square root of the MSError and defines the standard deviation of an observation about the predicted value.
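R-Square, Root MSE and Coeff Var all follow from numbers already in the output, so they can be checked with a short data step. The variable names below are hypothetical; the inputs are copied from the tables above.

```sas
/* Sketch: recompute the fit statistics from the printed SS, MSError and mean. */
data _null_;
 r_square  = 4630.36091 / 17878.87500;   /* SSModel / SSCorrected Total     */
 root_mse  = sqrt(68.29131);             /* square root of MSError          */
 coeff_var = 100 * root_mse / 52.77500;  /* 100 * Root MSE / mean of write  */
 put r_square= root_mse= coeff_var=;     /* written to the log              */
run;
```

The three values should agree with 0.258985, 8.263856 and 15.65866 up to rounding.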

n. write Mean - This is the grand mean of the response variable.

o. Source - Underneath are the variables in the model. Our model has female, prog, and the interaction of female and prog. Including the interaction allows the effect of, say, prog to differ across the levels of female, rather than forcing the effects to be additive. Also, our model follows the hierarchical principle, i.e., if an interaction term is in the model (female*prog), the lower order terms (female and prog) must be included. Further, when there is a significant interaction in the model, the main effects (the lower order terms) are difficult to interpret. If the interaction term is not statistically significant, some would advise dropping the term and rerunning the model with just the main effects, so that the main effects would have an unambiguous meaning. The traditional anova approach would leave the nonsignificant interaction in the model and interpret the main effects in the normal manner. If the interaction term is found to be statistically significant, one would leave the model as is and evaluate the simple main effects.
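The two follow-up strategies described above could look as follows in proc glm. This is a sketch, not a recommendation of one approach over the other; it assumes the same data set path as the original call and uses the slice option of the lsmeans statement to obtain simple main effects.

```sas
/* Option 1: drop the interaction and refit with main effects only. */
proc glm data = "c:\temp\hsb2";
 class female prog;
 model write = female prog / ss3;
run; quit;

/* Option 2: keep the interaction and evaluate simple main effects: */
/* the effect of prog tested separately within each level of female. */
proc glm data = "c:\temp\hsb2";
 class female prog;
 model write = female prog female*prog / ss3;
 lsmeans female*prog / slice = female;
run; quit;
```

Using slice = prog instead would test the effect of female within each program type.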

p. DF - These are the degrees of freedom for the individual predictor variables in the model. From the class level information section, the DF of a lower order term is given by its number of levels minus one. For example, female has two levels, therefore DFfemale = 2-1 = 1. Also, prog has three levels and DFprog = 3-1 = 2. For the interaction term, DFfemale*prog = DFfemale * DFprog = 1*2 = 2. The DF of the predictor variables, along with the DFError, define the parameters of the F-distribution used to test the significance of F Value, superscript s.
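The degrees-of-freedom bookkeeping can be written out as a short data step. The variable names are hypothetical; the inputs (200 observations, 2 levels of female, 3 levels of prog) come from the Class Level Information section.

```sas
/* Sketch: degrees of freedom for each term, the model, and the error. */
data _null_;
 n = 200;  levels_female = 2;  levels_prog = 3;
 df_female = levels_female - 1;              /* 1   */
 df_prog   = levels_prog - 1;                /* 2   */
 df_inter  = df_female * df_prog;            /* 2   */
 df_model  = df_female + df_prog + df_inter; /* 5   */
 df_total  = n - 1;                          /* 199 */
 df_error  = df_total - df_model;            /* 194 */
 put df_female= df_prog= df_inter= df_model= df_total= df_error=;
run;
```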

q. Type III SS - These are the type III sums of squares, which are also referred to as partial sums of squares. For a particular variable, say female, SSfemale is calculated with respect to the other variables in the model, prog and female*prog. Also, we showed earlier that SSCorrected Total = SSModel + SSError, and we might expect that SSModel = SSfemale + SSprog + SSfemale*prog; however, this is generally not the case (it holds only for a balanced design).
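The lack of additivity can be seen with the numbers in this output, and a cross-tabulation of the two factors shows whether the design is balanced. The sketch below uses hypothetical variable names and assumes the same data set path as the original call.

```sas
/* Sketch: the Type III SS of the three terms do not add up to SSModel here. */
data _null_;
 ss_sum   = 1261.853291 + 3274.350821 + 325.958189;  /* female + prog + female*prog */
 ss_model = 4630.36091;
 diff     = ss_sum - ss_model;
 put ss_sum= ss_model= diff=;                        /* written to the log */
run;

/* Cell counts for female by prog: unequal counts indicate an unbalanced design. */
proc freq data = "c:\temp\hsb2";
 tables female*prog / norow nocol nopercent;
run;
```

The three Type III sums of squares total about 4862.16, which exceeds the SSModel of 4630.36 by about 231.80.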

r. Mean Square - These are the mean squares for the individual predictor variables in the model. They are calculated as SS/DF, and along with MSError, they are used to calculate F Value, superscript s.

s. F Value and Pr > F - These are the F Value and p-value, respectively, testing the null hypothesis that an individual predictor in the model does not explain a significant proportion of the variance, given the other variables are in the model. F Value is computed as MSSource Var / MSError. Under the null hypothesis, F Value follows a central F-distribution with numerator DF = DFSource Var, where Source Var is the predictor variable of interest, and denominator DF = DFError. Following the point made in Source, superscript o, we focus only on the interaction term.
female*prog - This is the F Value and p-value testing the interaction of female and prog on the response variable, given the other variables are in the model. The probability of observing an F Value as large as, or larger than, 2.39 under the null hypothesis that there is no interaction of female and prog, given the other variables are in the model, is 0.0946. If we set our alpha level (the probability of a Type I error) at 0.05, we would fail to reject the null hypothesis that female and prog do not interact. Based on this finding, some would advise rerunning the model without the interaction term, including only the main effects in the model (and the intercept). This would in turn permit a valid interpretation of the main effects of female and prog.
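The test of the interaction term can be recomputed from the printed mean squares. This is a minimal data step sketch with made-up variable names; the inputs are copied from the Type III table and the Error row above.

```sas
/* Sketch: F and p-value for female*prog from its mean square and MSError. */
data _null_;
 ms_inter = 162.979094;  df_inter = 2;
 ms_error = 68.29131;    df_error = 194;
 f_inter  = ms_inter / ms_error;                     /* MSSource Var / MSError */
 p_inter  = 1 - probf(f_inter, df_inter, df_error);  /* upper tail of F        */
 put f_inter= p_inter=;                              /* written to the log     */
run;
```

The F value should round to the 2.39 shown in the table, with a p-value close to 0.0946. Swapping in the mean square and DF of female or prog gives the other two rows in the same way.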

Time: 2024-12-22 16:36:51
