Regression: the relationship between alcohol and tobacco in the UK

Official page for the UK alcohol and tobacco data

http://lib.stat.cmu.edu/DASL/Stories/AlcoholandTobacco.html

Story Name: Alcohol and Tobacco

Image: Scatterplot of Alcohol vs. Tobacco, with Northern Ireland marked with a blue X.

Story Topics: Consumer, Health

Datafile Name: Alcohol and Tobacco

Methods: Correlation, Dummy variable, Outlier, Regression, Scatterplot

Abstract: Data from a British government survey of household spending may be used to examine the relationship between household spending on tobacco products and alcoholic beverages. A scatterplot of spending on alcohol vs. spending on tobacco in the 11 regions of Great Britain shows an overall positive linear relationship with Northern Ireland as an outlier. Northern Ireland's influence is illustrated by the fact that the correlation between alcohol and tobacco spending jumps from .224 to .784 when Northern Ireland is eliminated from the dataset.
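The jump in correlation can be checked with a short sketch. It uses the eleven alcohol and tobacco values from the script in this post, with Northern Ireland as the last entry. The helper name `pearson_r` is introduced here for illustration only.

```python
import math

# Household spending per region; the last entry is Northern Ireland.
alcohol = [6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02]
tobacco = [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r_all = pearson_r(tobacco, alcohol)
r_without_ni = pearson_r(tobacco[:-1], alcohol[:-1])
print("all 11 regions:           r = %.3f" % r_all)
print("without Northern Ireland: r = %.3f" % r_without_ni)
```

Dropping a single region is enough to move the coefficient from weak to strong, which is why the abstract treats Northern Ireland as an influential observation.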

This dataset may be used to illustrate the effect of a single influential observation on regression results. In a simple regression of alcohol spending on tobacco spending, tobacco spending does not appear to be a significant predictor of alcohol spending. However, including a dummy variable that takes the value 1 for Northern Ireland and 0 for all other regions results in significant coefficients for both tobacco spending and the dummy variable, and a high R-squared.
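A minimal sketch of the dummy-variable regression, assuming NumPy is available and that Northern Ireland is the last entry of each list (as in the script in this post). The helper `ols_r2` is a hypothetical name used only for this illustration.

```python
import numpy as np

alcohol = np.array([6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02])
tobacco = np.array([4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56])

# Dummy variable: 1 for Northern Ireland (last entry), 0 for every other region.
dummy = np.zeros(len(alcohol))
dummy[-1] = 1.0


def ols_r2(columns, y):
    """Least-squares fit with an intercept; returns (coefficients, R-squared)."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    r2 = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
    return coef, r2


coef_simple, r2_simple = ols_r2([tobacco], alcohol)
coef_dummy, r2_dummy = ols_r2([tobacco, dummy], alcohol)

print("simple regression:  slope = %.3f, R^2 = %.3f" % (coef_simple[1], r2_simple))
print("with dummy:         slope = %.3f, dummy = %.3f, R^2 = %.3f"
      % (coef_dummy[1], coef_dummy[2], r2_dummy))
```

The sketch reports only coefficients and R-squared. Standard errors and p-values would come from a full regression summary, such as the one statsmodels prints.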

The two modules (statsmodels and scikit-learn, both imported below) compute the same R-squared value.
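One way to check this claim is sketched here, assuming pandas, statsmodels and scikit-learn are installed. It fits the same ten regions (Northern Ireland excluded) with both libraries and compares the R-squared values.

```python
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

# Ten regions, Northern Ireland excluded.
data = pd.DataFrame({
    'Alcohol': [6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08],
    'Tobacco': [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51],
})

# statsmodels: formula interface, R-squared is an attribute of the fit result.
r2_statsmodels = smf.ols('Alcohol ~ Tobacco', data).fit().rsquared

# scikit-learn: score() of a fitted LinearRegression returns R-squared.
X = data[['Tobacco']]
y = data['Alcohol']
r2_sklearn = LinearRegression().fit(X, y).score(X, y)

print("statsmodels  R^2 = %.4f" % r2_statsmodels)
print("scikit-learn R^2 = %.4f" % r2_sklearn)
```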

# -*- coding: utf-8 -*-
"""
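Python 3
Alcohol and Tobacco: the relationship between alcohol and tobacco
http://lib.stat.cmu.edu/DASL/Stories/AlcoholandTobacco.html
Data is often read and written in memory rather than in files.
StringIO, as the name suggests, reads and writes str in memory.
To write a str to a StringIO, first create a StringIO object, then write to it as you would to a file.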
"""

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
from sklearn.linear_model import LinearRegression
from scipy import stats

list_alcohol=[6.47,6.13,6.19,4.89,5.63,4.52,5.89,4.79,5.27,6.08,4.02]
list_tobacco=[4.03,3.76,3.77,3.34,3.47,2.92,3.20,2.71,3.53,4.51,4.56]
plt.plot(list_tobacco,list_alcohol,'ro')
plt.xlabel('Tobacco')
plt.ylabel('Alcohol')
plt.title('Sales in Several UK Regions')
plt.show()

data=pd.DataFrame({'Alcohol':list_alcohol,'Tobacco':list_tobacco})

# Fit on every row except the last one (Northern Ireland, the outlier)
result = sm.ols('Alcohol ~ Tobacco', data[:-1]).fit()
print(result.summary())

python2.7

# -*- coding: utf-8 -*-
# Spearman's rank correlation (Spearman's correlation coefficient for ranked data)
import numpy as np
import scipy.stats as stats
from scipy.stats import f
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import lillifors
import normality_check

y=[6.47,6.13,6.19,4.89,5.63,4.52,5.89,4.79,5.27,6.08]
x=[4.03,3.76,3.77,3.34,3.47,2.92,3.20,2.71,3.53,4.51]
list_group=[x,y]
sample=len(x)

# Visualize the data
plt.plot(x,y,'ro')
# Spearman rank correlation, a nonparametric test
def Spearmanr(x,y):
    print"use spearmanr,Nonparametric tests"
    # Warn when the two samples differ in length
    if len(x)!=len(y):
        print "warning,the samples are not equal!"
    r,p=stats.spearmanr(x,y)
    print"spearman r**2:",r**2
    print"spearman p:",p
    if sample<500 and p>0.05:
        print"when sample < 500,p is not meaningful(>0.05)"
        print"when sample > 500,p is meaningful"

# Pearson correlation, a parametric test
def Pearsonr(x,y):
    print"use Pearson,parametric tests"
    r,p=stats.pearsonr(x,y)
    print"pearson r**2:",r**2
    print"pearson p:",p
    if sample<30:
        print"when sample <30,pearson is not meaningful"

# Kendall's tau, a nonparametric test
def Kendalltau(x,y):
    print"use kendalltau,Nonparametric tests"
    r,p=stats.kendalltau(x,y)
    print"kendalltau r**2:",r**2
    print"kendalltau p:",p

# Choose the model
def mode(x,y):
    # Normality test
    Normal_result=normality_check.NormalTest(list_group)
    print "normality result:",Normal_result
    if len(list_group)>2:
        Kendalltau(x,y)
    if Normal_result==False:
        Spearmanr(x,y)
        Kendalltau(x,y)
    if Normal_result==True:
        Pearsonr(x,y)

mode(x,y)
'''
x=[50,60,70,80,90,95]
y=[500,510,530,580,560,1000]
use shapiro:
data are normal distributed
use shapiro:
data are not normal distributed
normality result: False
use spearmanr,Nonparametric tests
spearman r: 0.942857142857
spearman p: 0.00480466472303
use kendalltau,Nonparametric tests
kendalltau r: 0.866666666667
kendalltau p: 0.0145950349193

# Kendall coefficient test
x=[3,5,2,4,1]
y=[3,5,2,4,1]
z=[3,4,1,5,2]
h=[3,5,1,4,2]
k=[3,5,2,4,1]
'''
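For reference, the two nonparametric coefficients in the sample output above can be reproduced without SciPy. The following Python 3 sketch assumes there are no tied values. The helper names `ranks`, `spearman_rho` and `kendall_tau` are introduced here for illustration and are not part of the original script.

```python
def ranks(values):
    """Rank values from 1 to n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rho via the rank-difference formula (no ties)."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))


def kendall_tau(xs, ys):
    """Kendall's tau as (concordant - discordant) / number of pairs (no ties)."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += 1 if product > 0 else -1
    return score / (n * (n - 1) / 2)


x = [50, 60, 70, 80, 90, 95]
y = [500, 510, 530, 580, 560, 1000]
print("spearman rho:", spearman_rho(x, y))
print("kendall tau: ", kendall_tau(x, y))
```

Only one pair of observations (the fourth and fifth) is out of order, so both coefficients stay close to 1 even though the last y value is far from the rest.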

python2.7

# -*- coding: utf-8 -*-
'''
Author:Toby
QQ:231469242,all rights reserved,no commercial use
normality_check.py
Normality test script

'''

import scipy
from scipy.stats import f
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
# additional packages
from statsmodels.stats.diagnostic import lillifors

# Normality test
def check_normality(testData):
    # For 20 < sample size < 50, use normaltest to check normality
    if 20<len(testData) <50:
       p_value= stats.normaltest(testData)[1]
       if p_value<0.05:
           print"use normaltest"
           print "data are not normal distributed"
           return  False
       else:
           print"use normaltest"
           print "data are normal distributed"
           return True

    # For the remaining samples below 50 (in effect, 20 or fewer), use the Shapiro-Wilk test
    if len(testData) <50:
       p_value= stats.shapiro(testData)[1]
       if p_value<0.05:
           print "use shapiro:"
           print "data are not normal distributed"
           return  False
       else:
           print "use shapiro:"
           print "data are normal distributed"
           return True

    if 300>=len(testData) >=50:
       p_value= lillifors(testData)[1]
       if p_value<0.05:
           print "use lillifors:"
           print "data are not normal distributed"
           return  False
       else:
           print "use lillifors:"
           print "data are normal distributed"
           return True

    if len(testData) >300:
       p_value= stats.kstest(testData,'norm')[1]
       if p_value<0.05:
           print "use kstest:"
           print "data are not normal distributed"
           return  False
       else:
           print "use kstest:"
           print "data are normal distributed"
           return True

# Run the normality test on every sample group
def NormalTest(list_groups):
    for group in list_groups:
        # Normality test
        status=check_normality(group)
        if status==False :
            return False
    return True

'''
group1=[2,3,7,2,6]
group2=[10,8,7,5,10]
group3=[10,13,14,13,15]
list_groups=[group1,group2,group3]
list_total=group1+group2+group3
# Run the normality test on every sample group
NormalTest(list_groups)
'''
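The sample-size rules in `check_normality` can be summarized in a small Python 3 sketch. The function `choose_normality_test` is a hypothetical helper written for this illustration. It returns only a label for the test the script would pick and does not run the test itself.

```python
def choose_normality_test(n):
    """Label of the test check_normality applies to a sample of size n."""
    if 20 < n < 50:
        return "normaltest"
    if n < 50:  # reached only when n <= 20
        return "shapiro"
    if n <= 300:
        return "lilliefors"
    return "kstest"


for size in (5, 20, 21, 49, 50, 300, 301):
    print(size, choose_normality_test(size))
```

Laying the thresholds out this way makes the boundary cases explicit: a sample of exactly 20 goes to Shapiro-Wilk, and samples of 50 and 300 both go to the Lilliefors test.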
Time: 2024-10-12 21:11:47
