A Classic Data Analysis Case Study: Iris Dataset Analysis

Iris Dataset Analysis

2018.12.23 14:06

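The Iris dataset is a classic dataset that is frequently used as an example in both statistical learning and machine learning. It contains 150 records in 3 classes, 50 per class, and each record has 4 features: sepal length, sepal width, petal length, and petal width. These 4 features can be used to predict which species an iris flower belongs to (Iris-setosa, Iris-versicolor, or Iris-virginica).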

Reportedly, in practice these three species are actually told apart mainly by their seeds (because the petals wilt very easily).

0 Preparing the data

The following is an exploratory analysis of iris. First, import the required packages and the dataset:

# Import the required packages
import numpy as np
import pandas as pd
from pandas import plotting

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn')  # this style is named 'seaborn-v0_8' in matplotlib 3.6 and later

import seaborn as sns
sns.set_style("whitegrid")

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# Load the dataset (raw string, so the backslashes in the Windows path are taken literally)
iris = pd.read_csv(r'F:\pydata\dataset\kaggle\iris.csv', usecols=[1, 2, 3, 4, 5])

Inspect the dataset information:

iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
SepalLengthCm    150 non-null float64
SepalWidthCm     150 non-null float64
PetalLengthCm    150 non-null float64
PetalWidthCm     150 non-null float64
Species          150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB

View the first 5 records of the dataset:

iris.head()

1 Exploratory analysis

Start with the summary statistics of each feature column:

iris.describe()
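
As a rough sketch of what this summary contains, the snippet below recomputes the same statistics for a single column in plain Python. The helper name `summarize` and the sample values are assumptions made for illustration. pandas reports the sample standard deviation and linearly interpolated quartiles, which `statistics.stdev` and `statistics.quantiles(..., method="inclusive")` are assumed to mirror here.

```python
import statistics

def summarize(values):
    """Return the statistics that describe() lists for one numeric column."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (divides by n - 1)
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }

# Illustrative values only, not taken from the dataset
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = summarize(sample)
for name, value in summary.items():
    print(f"{name:>5}: {value}")
```

Comparing `mean` with `50%` for each feature is a quick way to spot skewed columns before plotting them.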

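Use violin plots and point plots to examine the relationship between each feature and the species, looking at the data distribution and at the slope between species, respectively: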

# Set the colour theme
antV = ['#1890FF', '#2FC25B', '#FACC14', '#223273', '#8543E0', '#13C2C2', '#3436c7', '#F04864']

# Draw the violin plots
f, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True)
sns.despine(left=True)

sns.violinplot(x='Species', y='SepalLengthCm', data=iris, palette=antV, ax=axes[0, 0])
sns.violinplot(x='Species', y='SepalWidthCm', data=iris, palette=antV, ax=axes[0, 1])
sns.violinplot(x='Species', y='PetalLengthCm', data=iris, palette=antV, ax=axes[1, 0])
sns.violinplot(x='Species', y='PetalWidthCm', data=iris, palette=antV, ax=axes[1, 1])

plt.show()

# Draw the point plots
f, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True)
sns.despine(left=True)

sns.pointplot(x='Species', y='SepalLengthCm', data=iris, color=antV[0], ax=axes[0, 0])
sns.pointplot(x='Species', y='SepalWidthCm', data=iris, color=antV[0], ax=axes[0, 1])
sns.pointplot(x='Species', y='PetalLengthCm', data=iris, color=antV[0], ax=axes[1, 0])
sns.pointplot(x='Species', y='PetalWidthCm', data=iris, color=antV[0], ax=axes[1, 1])

plt.show()
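
Each point in a point plot is a per-category estimate, and seaborn's default estimator is the mean, so the slope of the connecting line reflects how much the average changes from one species to the next. A minimal sketch of that per-species mean in plain Python follows. The helper `mean_by_group` and the toy rows are hypothetical, and the values are illustrative rather than the full dataset.

```python
from collections import defaultdict

def mean_by_group(rows, group_key, value_key):
    """Average value_key within each distinct value of group_key."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[group_key]].append(row[value_key])
    return {group: sum(vals) / len(vals) for group, vals in buckets.items()}

# Toy rows with illustrative values (not the full dataset)
rows = [
    {"Species": "Iris-setosa", "PetalLengthCm": 1.4},
    {"Species": "Iris-setosa", "PetalLengthCm": 1.6},
    {"Species": "Iris-versicolor", "PetalLengthCm": 4.5},
    {"Species": "Iris-versicolor", "PetalLengthCm": 4.7},
    {"Species": "Iris-virginica", "PetalLengthCm": 5.1},
    {"Species": "Iris-virginica", "PetalLengthCm": 5.9},
]

means = mean_by_group(rows, "Species", "PetalLengthCm")
for species, value in means.items():
    print(species, round(value, 2))
```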

Generate a matrix plot of the pairwise relationships between the features:

g = sns.pairplot(data=iris, palette=antV, hue='Species')

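Andrews Curves convert each multivariate observation into a curve, with the observation's values serving as the coefficients of a Fourier series; this is useful for detecting outliers in time series data.

Andrews Curves are a way of visualizing multidimensional data by mapping each observation to a function.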

plt.subplots(figsize = (10,8))
plotting.andrews_curves(iris, 'Species', colormap='cool')

plt.show()

Next, visualize linear regressions based on the sepals and on the petals, respectively:

g = sns.lmplot(data=iris, x='SepalWidthCm', y='SepalLengthCm', palette=antV, hue='Species')

g = sns.lmplot(data=iris, x='PetalWidthCm', y='PetalLengthCm', palette=antV, hue='Species')
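
With `hue` set, one regression line is drawn per species. As a sketch of what a single such line represents, the snippet below fits y = slope * x + intercept by ordinary least squares in plain Python. The helper `fit_line` and the sample points are hypothetical and chosen so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative points that lie exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```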

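Finally, use a heatmap to find the correlations between the features in the dataset; a large positive or negative value indicates that two features are strongly correlated: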

fig=plt.gcf()
fig.set_size_inches(12, 8)
ax = sns.heatmap(iris.drop(columns='Species').corr(), annot=True, cmap='GnBu', linewidths=1, linecolor='k', square=True, mask=False, vmin=-1, vmax=1, cbar_kws={"orientation": "vertical"}, cbar=True)  # drop the text column so corr() only sees numeric features

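The heatmap shows that sepal width and sepal length are only weakly correlated, whereas petal width and petal length are highly correlated.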
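
Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. A minimal sketch of that coefficient in plain Python, assuming a hypothetical helper `pearson_r` and illustrative inputs:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative inputs: a perfectly increasing and a perfectly decreasing relationship
r_up = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r_down = pearson_r([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
print(r_up, r_down)
```

The coefficient always lies between -1 and 1, which is why the heatmap call fixes `vmin=-1` and `vmax=1`.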

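2 Machine learning

Next, use machine learning to predict the species from the sepal and petal measurements.

Before training, the dataset has to be split into a training set and a test set. First, use label encoding to convert the 3 iris species names into categorical values (0, 1, 2).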

# Select the features and the labels
X = iris[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
y = iris['Species']

# Encode the labels
encoder = LabelEncoder()
y = encoder.fit_transform(y)
print(y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
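
As a sketch of what the encoder does, the snippet below builds the same kind of mapping in plain Python: the distinct labels are sorted and each one is assigned its position in that order. The helper `fit_label_encoder` and the short label list are hypothetical.

```python
def fit_label_encoder(labels):
    """Sort the distinct labels and map each one to its index."""
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return classes, mapping

# Illustrative labels in arbitrary order
labels = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]
classes, mapping = fit_label_encoder(labels)

encoded = [mapping[label] for label in labels]
decoded = [classes[code] for code in encoded]  # inverse transform
print(encoded)
print(decoded)
```

Keeping `classes` around matters because it is the only way to turn a predicted 0, 1, or 2 back into a species name.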

Next, split the dataset into training data and test data at a 7:3 ratio:

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.3, random_state = 101)
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

(105, 4) (105,) (45, 4) (45,)
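
A plain random split does not guarantee that each species is equally represented on both sides. One hedged variation is to pass `stratify` to `train_test_split`, sketched below. It assumes scikit-learn's bundled copy of the data (`load_iris`) as a stand-in for the CSV file, so the variables here are separate from the ones used in the rest of this article.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled dataset replaces the CSV used above
X, y = load_iris(return_X_y=True)

train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)

# Number of samples per class on each side of the split
train_counts = np.bincount(train_y)
test_counts = np.bincount(test_y)
print(train_X.shape, test_X.shape)
print(train_counts, test_counts)
```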

Check the accuracy of the different models:

# Support Vector Machine
model = svm.SVC()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the SVM is: {0}'.format(metrics.accuracy_score(test_y, prediction)))

The accuracy of the SVM is: 1.0

# Logistic Regression
model = LogisticRegression()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the Logistic Regression is: {0}'.format(metrics.accuracy_score(test_y, prediction)))

The accuracy of the Logistic Regression is: 0.9555555555555556

# Decision Tree
model=DecisionTreeClassifier()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the Decision Tree is: {0}'.format(metrics.accuracy_score(test_y, prediction)))

The accuracy of the Decision Tree is: 0.9555555555555556

# K-Nearest Neighbours
model=KNeighborsClassifier(n_neighbors=3)
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the KNN is: {0}'.format(metrics.accuracy_score(test_y, prediction)))

The accuracy of the KNN is: 1.0
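
Accuracy is simply the fraction of test samples whose predicted label matches the true one. With only 45 test samples, each mistake costs 1/45, so a score of 0.9555555555555556 corresponds to 43 correct predictions out of 45. The sketch below illustrates this with hypothetical labels; the helper `accuracy` is an assumption and not scikit-learn's implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical test labels: 45 samples, 15 per class
y_true = [0] * 15 + [1] * 15 + [2] * 15

# Hypothetical predictions with exactly two mistakes
y_pred = list(y_true)
y_pred[20] = 2  # a class-1 sample predicted as class 2
y_pred[40] = 1  # a class-2 sample predicted as class 1

score = accuracy(y_true, y_pred)
print(score)
```

Because one sample moves the score by more than two percentage points, small differences between models on this split should be read with caution.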

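The models above used all of the dataset's features; below, the petal measurements and the sepal measurements are used separately: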

petal = iris[['PetalLengthCm', 'PetalWidthCm', 'Species']]
train_p, test_p = train_test_split(petal, test_size=0.3, random_state=0)
train_x_p = train_p[['PetalWidthCm', 'PetalLengthCm']]
train_y_p = train_p.Species
test_x_p = test_p[['PetalWidthCm', 'PetalLengthCm']]
test_y_p = test_p.Species

sepal = iris[['SepalLengthCm', 'SepalWidthCm', 'Species']]
train_s, test_s = train_test_split(sepal, test_size=0.3, random_state=0)
train_x_s = train_s[['SepalWidthCm', 'SepalLengthCm']]
train_y_s = train_s.Species
test_x_s = test_s[['SepalWidthCm', 'SepalLengthCm']]
test_y_s = test_s.Species

model=svm.SVC()

model.fit(train_x_p,train_y_p)
prediction=model.predict(test_x_p)
print('The accuracy of the SVM using Petals is: {0}'.format(metrics.accuracy_score(test_y_p, prediction)))

model.fit(train_x_s,train_y_s)
prediction=model.predict(test_x_s)
print('The accuracy of the SVM using Sepal is: {0}'.format(metrics.accuracy_score(test_y_s, prediction)))

The accuracy of the SVM using Petals is: 0.9777777777777777
The accuracy of the SVM using Sepal is: 0.8

model = LogisticRegression()

model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the Logistic Regression using Petals is: {0}'.format(metrics.accuracy_score(test_y_p, prediction)))

model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the Logistic Regression using Sepals is: {0}'.format(metrics.accuracy_score(test_y_s, prediction)))

The accuracy of the Logistic Regression using Petals is: 0.6888888888888889
The accuracy of the Logistic Regression using Sepals is: 0.6444444444444445

model=DecisionTreeClassifier()

model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the Decision Tree using Petals is: {0}'.format(metrics.accuracy_score(test_y_p, prediction)))

model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the Decision Tree using Sepals is: {0}'.format(metrics.accuracy_score(test_y_s, prediction)))

The accuracy of the Decision Tree using Petals is: 0.9555555555555556
The accuracy of the Decision Tree using Sepals is: 0.6666666666666666

model=KNeighborsClassifier(n_neighbors=3) 

model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the KNN using Petals is: {0}'.format(metrics.accuracy_score(test_y_p, prediction)))

model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the KNN using Sepals is: {0}'.format(metrics.accuracy_score(test_y_s, prediction)))

The accuracy of the KNN using Petals is: 0.9777777777777777
The accuracy of the KNN using Sepals is: 0.7333333333333333

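It is easy to see from these results that models trained on the petal measurements are more accurate than those trained on the sepal measurements. As the heatmap in the exploratory analysis showed, the correlation between sepal width and sepal length is very low, while the correlation between petal width and petal length is very high.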
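
Since a single 45-sample test split is small, one hedged way to double-check the petal versus sepal comparison is 5-fold cross-validation, sketched below. It assumes scikit-learn's bundled copy of the data (`load_iris`, whose columns are sepal length, sepal width, petal length, petal width) rather than the CSV file, and it reuses only the KNN model as an example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled dataset replaces the CSV used above
X, y = load_iris(return_X_y=True)
sepal_X = X[:, :2]  # sepal length, sepal width
petal_X = X[:, 2:]  # petal length, petal width

model = KNeighborsClassifier(n_neighbors=3)

# One accuracy score per fold; every sample is used for testing exactly once
sepal_scores = cross_val_score(model, sepal_X, y, cv=5)
petal_scores = cross_val_score(model, petal_X, y, cv=5)

print('Sepal features, mean accuracy:', sepal_scores.mean())
print('Petal features, mean accuracy:', petal_scores.mean())
```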

Original article: https://www.cnblogs.com/CSUT-Ryan/p/11165238.html

Time: 2024-08-27 13:54:18
