Starting with the Iris Dataset: An Introduction to Machine Learning

Most of the code comes from "Introduction to Machine Learning with Python". This series is mainly my own reading notes, along with a few small thoughts and summaries.

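# Preface
Before training any model, it is essential to understand the data you have prepared: what are its features, and how do they relate to the target? This may well be the most important part of the machine learning process.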

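Before applying machine learning in practice, it helps to answer the following questions:

What problem are we solving? Can the data collected so far actually solve it?
Can the problem be framed as a machine learning problem? If so, which kind: supervised or unsupervised?
Which features should be extracted from the data? Are they enough to support prediction?
Once the model is trained, how do we make sure it can be trusted? The only way to know is to put it to the test.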

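The machine learning algorithm is only one small part of solving the problem!
When working on a problem, keep the big picture in view and look at the whole pipeline rather than fixating on one small piece. A change in one step can affect every other step.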

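Getting started with Iris classification

One thing is clear: this is a classification problem.

Importing packages

import pandas as pd  # data analysis and processing
import numpy as np  # scientific computing
import matplotlib.pyplot as plt  # plotting
# Render plots inside the Notebook (IPython/Jupyter magic; keep it on its own line)
%matplotlib inline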

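Loading the dataset and inspecting the data

from sklearn.datasets import load_iris
iris_dataset = load_iris()  # sklearn bundles the Iris dataset; load_iris() loads it directly, ready to use

Let's print it:
print(iris_dataset)  # the dataset is organized as one big dictionary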

output:

{'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), 'DESCR': 'Iris Plants Database\n====================\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 150 (50 in each of three classes)\n    :Number of Attributes: 4 numeric, predictive attributes and the class\n ... (DESCR text abridged here) ...', 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'), 'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
      ...
       [6.3, 2.7, 4.9, 1.8],
       [6.7, 3.3, 5.7, 2.1],
       [7.2, 3. , 5.8, 1.6],
       [7.4, 2.8, 6.1, 1.9],
       [7.9, 3.8, 6.4, 2. ],
       [6.4, 2.8, 5.6, 2.2],
       [5.9, 3. , 5.1, 1.8]])}

Look at the dictionary keys
print("Keys of iris_dataset:\n{}".format(iris_dataset.keys()))  # 5 keys here; we will go through them one by one
output:

Keys of iris_dataset:
dict_keys(['feature_names', 'target', 'DESCR', 'target_names', 'data'])
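
The object returned by `load_iris()` is a scikit-learn `Bunch`, which behaves like a dictionary but also exposes its keys as attributes. The following is a minimal sketch, assuming a reasonably recent scikit-learn; newer releases may list more keys than the five shown above, so the check only tests that these five are present.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Key-style and attribute-style access return the same underlying array.
data_by_key = iris_dataset['data']
data_by_attr = iris_dataset.data
print(data_by_key is data_by_attr)

# The five keys discussed in this post; newer releases may add others.
expected_keys = {'data', 'target', 'target_names', 'feature_names', 'DESCR'}
print(expected_keys.issubset(iris_dataset.keys()))
```

Attribute access is why later code can write `iris_dataset.feature_names` instead of `iris_dataset['feature_names']`.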

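Going through them one by one.
First, DESCR:

print('DESCR of iris_dataset:\n{}'.format(iris_dataset['DESCR']))  # description of the dataset
# It tells us there are 150 records (50 per class, 3 classes in total).
# Attributes:
# 4 numeric, predictive attributes: sepal length and width, petal length and width
# 1 class label with three values: Setosa, Versicolour, Virginica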

output:

DESCR of iris_dataset:
Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

Next, feature_names:
print('Feature names of iris_dataset:\n{}'.format(iris_dataset['feature_names']))  # 4 features
output:
Feature names of iris_dataset:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Next, data:

print('data of iris_dataset:\n{}'.format(iris_dataset['data'][:5]))  # first 5 records
print('shape of iris_dataset:\n{}'.format(iris_dataset['data'].shape))  # shape of data: 150 x 4, i.e. 150 records as expected

output:

data of iris_dataset:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
shape of iris_dataset:
(150, 4)
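
To connect the raw array with the "Summary Statistics" table in DESCR, the same statistics can be recomputed from `data`. This is a small sketch using pandas; the names `iris_df` and `summary` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_dataset = load_iris()
iris_df = pd.DataFrame(iris_dataset['data'], columns=iris_dataset['feature_names'])

# min, max, mean and standard deviation for each of the four features
summary = iris_df.agg(['min', 'max', 'mean', 'std']).T
print(summary.round(2))
```

Checking a description against the data it describes is a cheap way to catch loading mistakes early.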

Next, target_names:

print('target_names of iris_dataset:\n{}'.format(iris_dataset['target_names']))  # 3 classes

output:
target_names of iris_dataset:
['setosa' 'versicolor' 'virginica']
Finally, target:

print('target of iris_dataset:\n{}'.format(iris_dataset['target'][:5]))  # all 0s: the records are sorted by class (all 0s, then all 1s, then all 2s)
print('target shape of iris_dataset:\n{}'.format(iris_dataset['target'].shape))  # 150 labels in a one-dimensional array

output:

target of iris_dataset:
[0 0 0 0 0]
target shape of iris_dataset:
(150,)
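
Looking at only the first five labels does not show the class balance or the ordering. A short sketch that checks both, using `np.bincount` and `np.diff`:

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
target = iris_dataset['target']

# Number of samples per class label (0, 1, 2)
counts = np.bincount(target)
for name, count in zip(iris_dataset['target_names'], counts):
    print('{}: {}'.format(name, count))

# True if the labels never decrease, i.e. the records are grouped by class
is_sorted = bool(np.all(np.diff(target) >= 0))
print('sorted by class:', is_sorted)
```

Because the records are grouped by class, taking the last 25% as a test set without shuffling would leave it with a single class. `train_test_split` shuffles by default, which avoids this.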

Splitting the data for evaluation

# Split the dataset so that the trained model can be evaluated: can it be trusted?
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], iris_dataset['target'],
                              test_size=0.25, random_state=0)
# 1st argument: the data; 2nd: the labels; 3rd: the fraction used for the test set;
# 4th: random_state=0 makes the split identical no matter how many times this line is run,
# which removes one source of variation between runs.

# Inspect the data after the split:
print('shape of X_train:{}'.format(X_train.shape))
print('shape of y_train:{}'.format(y_train.shape))
print('=' * 64)
print('shape of X_test:{}'.format(X_test.shape))
print('shape of y_test:{}'.format(y_test.shape))

Output:

shape of X_train:(112, 4)
shape of y_train:(112,)
================================================================
shape of X_test:(38, 4)
shape of y_test:(38,)
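
The effect of `random_state` can be verified rather than taken on faith. The sketch below splits twice with the same seed and compares the results. It also shows the optional `stratify` parameter, which is not used in this post but keeps class proportions similar in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Same random_state -> identical split on every call
split_a = train_test_split(X, y, test_size=0.25, random_state=0)
split_b = train_test_split(X, y, test_size=0.25, random_state=0)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print('identical splits:', same_split)

# Optional: stratify=y keeps the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
print('test set class counts:', np.bincount(y_test))
```

With only 38 test records, an unstratified split can leave one class somewhat under-represented, so stratifying is a reasonable habit on small datasets.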

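Plotting the data

# Plot the data to see how hard the problem is.
# A scatter plot is the usual choice, but it has one drawback: it shows only two features at a time,
# so it cannot display the relationships among several features in a single plot.
# A pair plot shows the relationship between every pair of features (with histograms on the diagonal).
# Conveniently, the pandas scatter_matrix function draws pair plots.
# So we first convert the training set to a DataFrame to make plotting easier.
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)

# pd.scatter_matrix was removed from newer pandas versions; pd.plotting.scatter_matrix is the current location.
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                 hist_kwds={'bins': 20}, s=60, alpha=.8)  # different colors mark different classes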
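
As a numeric complement to the pair plot, the per-class mean of each feature gives a rough sense of how far apart the classes sit. This is a sketch only; `train_df` and `class_means` are names introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], test_size=0.25, random_state=0)

train_df = pd.DataFrame(X_train, columns=iris_dataset['feature_names'])
# Map integer labels to species names for readability
train_df['species'] = iris_dataset['target_names'][y_train]

# Mean of each feature within each class
class_means = train_df.groupby('species').mean()
print(class_means.round(2))
```

Means alone hide the overlap between classes, so this table supplements the plot rather than replacing it.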

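The plots show that the three classes are fairly well separated by the current features, so classification should be feasible. Next we train a model. We choose the simplest one: the special case of k-nearest neighbors (kNN) with k = 1, known as the nearest-neighbor classifier, which assigns to a point the label of the training point closest to it.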
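
The nearest-neighbor rule is simple enough to write out by hand. The sketch below is an illustrative NumPy version using Euclidean distance. It is not how scikit-learn implements it, and `predict_nearest_neighbor` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def predict_nearest_neighbor(X_train, y_train, X_query):
    """Label each query point with the label of its closest training point."""
    # Pairwise differences via broadcasting: shape (n_query, n_train, n_features)
    diffs = X_query[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    # Euclidean distance from every query point to every training point
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Index of the closest training point for each query point
    nearest = distances.argmin(axis=1)
    return y_train[nearest]

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], test_size=0.25, random_state=0)

manual_pred = predict_nearest_neighbor(X_train, y_train, X_test)
manual_accuracy = np.mean(manual_pred == y_test)
print('accuracy: {:.2f}'.format(manual_accuracy))
```

There is no real training step here: the "model" is just the stored training set, and all the work happens at prediction time.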

Building the model

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)  # use a single nearest neighbor

Training the model

knn.fit(X_train, y_train)
output:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

Training is done. Now we use the test set to evaluate the result: can the model be trusted?

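Evaluating the model

Method 1: computing it by hand

y_pred = knn.predict(X_test)  # predict
print('Test Set Score:{:.2f}'.format(np.mean(y_test == y_pred)))  # compute the score ourselves: the accuracy

output:

Test Set Score:0.97
A score of 0.97 means that 97% of the test records were classified correctly, which is 37 of the 38 test records. That is a good result for such a simple model, though 38 records is a small sample to judge by.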
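
An accuracy figure hides which records were wrong. The sketch below rebuilds the same split and model, then counts the correct predictions and prints any misclassified records. The variable names `n_correct` and `names` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], test_size=0.25, random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = knn.predict(X_test)

n_correct = int(np.sum(y_pred == y_test))
print('{} of {} test records classified correctly'.format(n_correct, len(y_test)))

# Show the records the model got wrong, with true and predicted species
names = iris_dataset['target_names']
for i in np.flatnonzero(y_pred != y_test):
    print(X_test[i], 'true:', names[y_test[i]], 'predicted:', names[y_pred[i]])
```

Looking at the individual errors is often more informative than the single number, especially when the test set is this small.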

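Method 2: the score function

print('Test Set Score:{:.2f}'.format(knn.score(X_test, y_test)))
# Score the classifier on the test set to decide whether it can be trusted.
# A score of 0.97 means 97% of the test records were classified correctly: the same result as Method 1.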
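
A single split of 38 test records gives a fairly noisy estimate. One common way to get a steadier number, which the original post does not use, is k-fold cross-validation. This sketch uses `cross_val_score` with 5 folds, which is an assumption chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
knn = KNeighborsClassifier(n_neighbors=1)

# 5-fold cross-validation: each fold serves once as the held-out test set
scores = cross_val_score(knn, iris_dataset['data'], iris_dataset['target'], cv=5)
print('fold accuracies:', scores.round(2))
print('mean accuracy: {:.2f}'.format(scores.mean()))
```

If the fold accuracies vary widely, a single-split score should be read with more caution.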

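Applying the model

The whole point of training a model is to apply it to new records and predict their class.

# Now we can apply the trained model to unseen data.
# Why is the new data two-dimensional? Because sklearn always expects a two-dimensional NumPy array.
X_new = np.array([[5, 2.9, 1, 0.2]])
result = knn.predict(X_new)
print('Prediction:{}'.format(result))
print('Predicted target name:{}'.format(iris_dataset['target_names'][result]))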

output:

Prediction:[0]
Predicted target name:['setosa']
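
When a single measurement arrives as a flat list or 1-D array, it has to be reshaped into one row before calling `predict`. The sketch below fits on the full dataset purely to stay short, which differs from the train/test setup used in this post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
# Assumption for brevity: fit on all 150 records instead of the training split
knn = KNeighborsClassifier(n_neighbors=1).fit(iris_dataset['data'], iris_dataset['target'])

single_flower = np.array([5, 2.9, 1, 0.2])  # 1-D: shape (4,)
X_new = single_flower.reshape(1, -1)         # 2-D: shape (1, 4), one row per sample
print(single_flower.shape, X_new.shape)

prediction = knn.predict(X_new)
print(iris_dataset['target_names'][prediction])
```

`reshape(1, -1)` reads as "one row, as many columns as needed", which matches sklearn's convention of samples in rows and features in columns.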

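Summary

The core code:

X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)  # test_size defaults to 0.25

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))

result = knn.predict(X_new)  # X_new as defined in the previous section

This snippet shows the general workflow for using an algorithm in sklearn:

Instantiate an Estimator: a classifier, a regressor, etc.
Train the model on the training set with the fit method; almost every sklearn algorithm exposes this interface.
score(X_test, y_test): evaluate the trained model to see how good the result is.
predict: make predictions on data; this is the ultimate goal.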
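
Because these steps use a shared interface, the same code can drive different estimators. The sketch below runs the instantiate, fit, score sequence for two classifiers. `DecisionTreeClassifier` is chosen only as a second example and is not part of the original post, and `fit_and_score` is a hypothetical helper.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

def fit_and_score(estimator):
    """Run the fit -> score steps for any sklearn classifier."""
    estimator.fit(X_train, y_train)
    return estimator.score(X_test, y_test)

results = {}
for estimator in (KNeighborsClassifier(n_neighbors=1),
                  DecisionTreeClassifier(random_state=0)):
    results[type(estimator).__name__] = fit_and_score(estimator)

for name, score in results.items():
    print('{}: {:.2f}'.format(name, score))
```

Swapping models this way makes it cheap to compare a few candidates before investing effort in tuning any one of them.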

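Also, as the Iris classification example shows, most of our effort went into understanding and analyzing the data; very little time was actually spent on training the algorithm.

Understand your data! Understand your data! Understand your data!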

Original article: https://www.cnblogs.com/ysugyl/p/8678686.html

Time: 2024-07-30 17:00:10
