Steps in Regression Analysis

The 13 Steps for Statistical Modeling in any Regression or ANOVA

No matter what statistical model you’re running, you need to go through the same 13 steps.  The order and the specifics of how you do each step will differ depending on the data and the type of model you use.

These 13 steps fall into 3 major parts.  Most people think of only Part 3 as modeling.  However, if you do all 3 parts, and treat them all as part of the analysis, the modeling process will be faster, easier, and make more sense.

Part 1: Define and Design

In the first 4 steps, the object is clarity. You want to make everything as clear as possible to yourself. The clearer things are at this point, the smoother everything will be.

1. Write out research questions in theoretical and operational terms

A lot of times, when researchers are confused about the right statistical method to use, the real problem is that they haven’t defined their research questions.  They have a general idea of the relationship they want to test, but it’s a bit vague.  You need to be very specific.

For each research question, write it down in both theoretical and operational terms.  For example, “exercise reduces stress” is theoretical; “mean perceived-stress score decreases as weekly hours of exercise increase” states the same question operationally.

2. Design the study or define the design

Depending on whether you are collecting your own data or doing secondary data analysis, you need a clear idea of the design.  Design issues are about randomization and sampling:

•    Nested and crossed factors
•    Potential confounders and control variables
•    Longitudinal or repeated measurements on a study unit
•    Sampling: simple random sample or stratification or clustering

3. Choose the variables for answering research questions and determine their level of measurement

Every model has to take into account both the design and the level of measurement of the variables.

Level of measurement, remember, is whether a variable is nominal, ordinal, or interval.  Within interval, you also need to know if variables are discrete counts or continuous.

It’s absolutely vital that you know the level of measurement of each response and predictor variable, because they determine both the type of information you can get from your model and the family of models that is appropriate.

4. Write an analysis plan

Write your best guess for the statistical method that will answer the research question, taking into account the design and the type of data.

It does not have to be final at this point—it just needs to be a reasonable approximation.

5. Calculate sample size estimations

This is the point at which you should calculate your sample sizes: after you have an analysis plan and before you collect data.  You need to know which statistical tests you will use as the basis for the estimates.

And there really is no point in running post-hoc power analyses; they don’t tell you anything.
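As a sketch of the kind of calculation involved, here is a normal-approximation sample-size formula for comparing two group means, using only Python’s standard library. The effect size, alpha, and power values are illustrative defaults; an exact t-based calculation gives slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

def two_sample_n(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided two-sample comparison
    of means, using the normal approximation to the t-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(two_sample_n(0.5))  # medium effect size -> 63 per group
```

Note how quickly the required n grows as the effect size shrinks; that is why the estimate has to come before data collection, not after.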

Part 2: Prepare and explore

6. Collect, code, enter, and clean data 

The parts that are most directly applicable to modeling are entering data and creating new variables.

For data entry, the analysis plan you wrote will determine how to enter variables.  For example, if you will be doing a linear mixed model, you will want the data in long format.
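As a sketch of the wide-versus-long distinction, assuming made-up subject IDs and scores and plain Python dictionaries rather than any particular statistics package:

```python
# Wide format: one row per subject, repeated measures as columns.
wide = [
    {"id": 1, "score_t1": 10, "score_t2": 12, "score_t3": 15},
    {"id": 2, "score_t1": 8,  "score_t2": 11, "score_t3": 9},
]

# Long format: one row per subject-by-time observation,
# which is what mixed-model routines typically expect.
long = [
    {"id": row["id"], "time": t, "score": row[f"score_t{t}"]}
    for row in wide
    for t in (1, 2, 3)
]

print(long[0])  # {'id': 1, 'time': 1, 'score': 10}
```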

7. Create new variables

This step may take longer than you think.  It’s pretty rare for every variable you’ll need for analysis to be collected in exactly the right form.  Create indices, categorize, reverse code: whatever you need to do to get variables into their final form, including running principal components or factor analysis.
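For instance, reverse coding, one of the transformations mentioned above, can be sketched in plain Python. The item names and the 1–5 scale range are invented for illustration:

```python
def reverse_code(value, scale_min=1, scale_max=5):
    """Reverse-code a Likert item: on a 1-5 scale, 5 -> 1 and 1 -> 5."""
    return scale_min + scale_max - value

# Build an index as the mean of three items,
# reverse-coding the negatively worded item first.
responses = {"item1": 4, "item2": 2, "item3": 5}
responses["item2"] = reverse_code(responses["item2"])
index = sum(responses.values()) / len(responses)
print(round(index, 2))  # mean of 4, 4, 5
```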

8. Run Univariate and Bivariate Statistics

You need to know what you’re working with.  Check the distributions of the variables you intend to use, as well as bivariate relationships among all variables that might go into the model.

You may find something here that leads you back to step 7 or even step 4.   You might have to do some data manipulation or deal with missing data.

More commonly, it will alert you to issues that will become clear in later steps.  The earlier you are aware of issues, the better you can deal with them.  But even if you don’t discover the issue until later, it won’t throw you for a loop if you have a good understanding of your variables.
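The univariate and bivariate checks in Step 8 can be sketched with Python’s standard library alone; the hours/score data below are invented for illustration:

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation from the definition:
    covariance over the product of standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]

# Univariate: location and spread of each variable.
print(round(mean(score), 1), round(stdev(score), 2))

# Bivariate: strength of the linear relationship.
print(round(pearson_r(hours, score), 3))
```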

9. Run an initial model

Once you know what you’re working with, run the model listed in your analysis plan.  In all likelihood, this will not be the final model.

But it should be in the right family of models for the types of variables, the design, and to answer the research questions.  You need to have this model to have something to explore and refine.
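For a single continuous predictor, an initial model can be as simple as a least-squares line. A minimal sketch with made-up data, using only the standard library:

```python
from statistics import mean

def ols_fit(x, y):
    """Fit the simple least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

x = [1, 2, 3, 4, 5]
y = [52, 55, 61, 64, 70]
a, b = ols_fit(x, y)
print(round(a, 1), round(b, 1))  # intercept 46.9, slope 4.5
```

In practice you would hand this job to a statistics package, but the fitted coefficients it returns come from exactly this criterion.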

Part 3: Refine the model

10. Refine predictors and check model fit

If you are doing a truly exploratory analysis, or if the point of the model is pure prediction, you can use some sort of stepwise approach to determine the best predictors.

If the analysis is to test hypotheses or answer theoretical research questions, this part will be more about refinement.  You can:
• Test, and possibly drop, interaction and quadratic terms, or explore other types of non-linearity
• Drop nonsignificant control variables
• Do hierarchical modeling to see the effects of predictors added alone or in blocks
• Check for overdispersion
• Test the best specification of random effects
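One of these checks, overdispersion in count data, has a very simple first-pass version: under a Poisson model the variance should roughly equal the mean. A sketch with invented counts:

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance over sample mean; a ratio well above 1
    suggests overdispersion relative to a Poisson model."""
    return variance(counts) / mean(counts)

counts = [0, 1, 0, 2, 9, 1, 0, 12, 0, 3]
ratio = dispersion_ratio(counts)
print(round(ratio, 2))
if ratio > 2:
    print("consider a negative binomial model instead of Poisson")
```

Formal overdispersion tests condition on the fitted model’s predicted means, but this raw ratio is a useful early warning.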

11. Test assumptions

Because you already chose the right family of models in Part 1, thoroughly investigated your variables in Step 8, and correctly specified your model in Step 10, you should not have big surprises here.  Rather, this step will be about confirming, checking, and refining.  But what you learn here can send you back to any of those steps for further refinement.

12. Check for and resolve data issues

Steps 11 and 12 are often done together, or perhaps back and forth.  This is where you check for issues that can affect the model but are not exactly assumptions.  Data issues are about the data, not the model, but they occur within the context of the model.  They include:

• Multicollinearity
• Outliers and influential points
• Missing data
• Truncation and censoring

Once again, data issues don’t appear until you have chosen variables and put them in the model.
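As one crude first-pass screen for outliers (not a substitute for model-based influence diagnostics such as Cook’s distance), a z-score rule can be sketched in plain Python with invented data:

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=2.0):
    """Crude z-score screen.  Note: a single extreme point inflates
    the standard deviation, which can mask other outliers."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

data = [21, 23, 22, 20, 24, 22, 95]
print(flag_outliers(data))  # [95]
```

Whether a flagged point is actually influential depends on the model, which is why this check belongs here rather than in data cleaning.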

13. Interpret Results

Now, finally, interpret the results.

You may not notice data issues or misspecified predictors until you interpret the coefficients.  Then you find something like a very high standard error, or a coefficient whose sign is the opposite of what you expected, sending you back to previous steps.

Source: http://www.theanalysisfactor.com/13-steps-regression-anova/

General Steps of Regression Analysis

1. Identify the explanatory variables and the response variable for the regression equation.

2. Choose the regression model

Decide, by examining a scatterplot, which mathematical model should describe the regression line.  If the relationship between the response variable and the explanatory variables is linear, carry out a linear regression analysis and build a linear regression model; if the relationship is nonlinear, carry out a nonlinear regression analysis and build a nonlinear regression model.

3. Build the regression equation

Using the sample data and the model chosen in the previous step, estimate the model’s parameters under an appropriate statistical fitting criterion (such as least squares) to obtain a specific regression equation.

4. Test the regression equation

Because the regression equation is estimated from sample data, you must test whether it truly reflects the statistical relationship in the population, and whether it can be used for prediction.

5. Use the regression equation for prediction.
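The five steps can be sketched end to end in plain Python. The spend/sales numbers are invented, least squares stands in for the fitting criterion, and R² stands in for the fuller battery of checks in step 4:

```python
from statistics import mean

x = [2, 4, 6, 8, 10]      # e.g. promotion spend (invented)
y = [30, 48, 62, 85, 99]  # e.g. sales (invented)

# Step 3: estimate the line y = a + b*x by least squares.
mx, my = mean(x), mean(y)
b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
     / sum((xi - mx) ** 2 for xi in x))
a = my - b * mx

# Step 4 (partial check): R^2 = 1 - SSE/SST.
pred = [a + b * xi for xi in x]
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
sst = sum((yi - my) ** 2 for yi in y)
r2 = 1 - sse / sst
print(round(b, 2), round(r2, 3))

# Step 5: predict the response at a new x value.
print(round(a + b * 12, 1))
```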

Source: http://blog.sina.com.cn/s/blog_4bfe1d9501008fyv.html

Posted: 2024-10-27 12:34:52
