sklearn Study Notes 1

Image recognition with Support Vector Machines

# our dataset is provided within scikit-learn
# let's start by importing and printing its description
import sklearn as sk
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
faces = fetch_olivetti_faces()
print(faces.DESCR)

Modified Olivetti faces dataset. The original database was available from (now defunct)

http://www.uk.research.att.com/facedatabase.html

The version retrieved here comes in MATLAB format from the personal web page of Sam Roweis:

http://www.cs.nyu.edu/~roweis/

There are ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement). The original dataset consisted of 92 x 112, while the Roweis version consists of 64x64 images.

print(faces.keys())           # the Bunch keys: 'data', 'images', 'target', 'DESCR'
print(faces.images.shape)     # (400, 64, 64): 400 images of 64x64 pixels
print(faces.data.shape)       # (400, 4096): the same images flattened
print(faces.target.shape)     # (400,): one integer label (0-39) per image
print(np.max(faces.data))     # pixel values come already normalized to [0, 1]
print(np.min(faces.data))
print(np.mean(faces.data))
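
The description says there are ten images of each of 40 subjects. A quick check of the class balance (a small sanity check, assuming faces is loaded as above):

print(np.unique(faces.target).size)  # 40 distinct subjects
print(np.bincount(faces.target))     # 10 images per subject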

Before learning, let’s plot some faces.

def print_faces(images, target, top_n):
    # set up the figure size in inches
    fig = plt.figure(figsize=(12, 12))
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
    for i in range(top_n):
        # plot the images in a matrix of 20x20
        p = fig.add_subplot(20, 20, i + 1, xticks=[], yticks=[])
        p.imshow(images[i], cmap=plt.cm.bone)
        # label the image with its target value (person id) and its index
        p.text(0, 14, str(target[i]))
        p.text(0, 60, str(i))

If we print the first 20 images, we can see faces from two people. (The images did not display at first because the original code called print instead of print_faces; when running as a plain script, plt.show() is also needed to open the figure window.)

print_faces(faces.images, faces.target, 20)
plt.show()  # display the figure when running outside an interactive environment

Training a Support Vector Machine

Import the SVC class from the sklearn.svm module:

from sklearn.svm import SVC

To start, we will use the simplest kernel, the linear one:

svc_1 = SVC(kernel='linear')

Before continuing, we will split our dataset into training and testing datasets.

# train_test_split moved from the old sklearn.cross_validation module to sklearn.model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(faces.data, faces.target, test_size=0.25, random_state=0)

And we will define a function that evaluates a classifier with K-fold cross-validation.

# cross_val_score and KFold also live in sklearn.model_selection now
from sklearn.model_selection import cross_val_score, KFold
from scipy.stats import sem

def evaluate_cross_validation(clf, X, y, K):
    # create a k-fold cross-validation iterator
    cv = KFold(n_splits=K, shuffle=True, random_state=0)
    # by default the score used is the one returned by the score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    print("Mean score: {0: .3f} (+/-{1: .3f})".format(np.mean(scores), sem(scores)))
evaluate_cross_validation(svc_1, X_train, y_train, 5)
[ 0.93333333  0.86666667  0.91666667  0.93333333  0.91666667]
Mean score:  0.913 (+/- 0.012)

We will also define a function to perform training on the training set and evaluate the performance on the testing set.

from sklearn import metrics
def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    clf.fit(X_train, y_train)
    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))

    y_pred = clf.predict(X_test)

    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))
train_and_evaluate(svc_1, X_train, X_test, y_train, y_test)

Accuracy on training set:
1.0
Accuracy on testing set:
0.99
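
An accuracy of 0.99 on a 40-class problem is far above chance. To put it in context, here is a minimal baseline sketch (DummyClassifier comes from sklearn.dummy; the exact baseline score depends on the split):

from sklearn.dummy import DummyClassifier

# trivial baseline: always predict the most frequent class seen in training
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(dummy.score(X_test, y_test))  # roughly 1/40 for 40 near-balanced classes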

Classifying the faces as people with and without glasses

The first thing to do is to define the range of the images that show faces wearing glasses.
The following list shows the indexes of these images:

# the index ranges of images of people with glasses
glasses = [
(10, 19), (30, 32), (37, 38), (50, 59), (63, 64),
(69, 69), (120, 121), (124, 129), (130, 139), (160, 161),
(164, 169), (180, 182), (185, 185), (189, 189), (190, 192),
(194, 194), (196, 199), (260, 269), (270, 279), (300, 309),
(330, 339), (358, 359), (360, 369)
]

Then we'll define a function that, from those segments, returns a new target array marking faces with glasses as 1 and faces without glasses as 0 (our new target classes):

def create_target(segments):
    # create a new y array of target size initialized with zeros
    y = np.zeros(faces.target.shape[0])
    # put 1 in the specified segments
    for (start, end) in segments:
        y[start:end + 1] = 1
    return y
target_glasses = create_target(glasses)
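
A quick sanity check (a small sketch, using the glasses ranges above): the number of positive labels should equal the total length of the listed index ranges.

print(int(target_glasses.sum()))                         # number of faces labeled as wearing glasses
print(sum(end - start + 1 for (start, end) in glasses))  # should print the same number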

Since the target vector has changed, we must perform the training/testing split again.

X_train, X_test, y_train, y_test = train_test_split(faces.data, target_glasses, test_size=0.25, random_state=0)

Now let's create a new SVC classifier and train it with the new target vector using the following command:

svc_2 = SVC(kernel='linear')

We check the performance with cross-validation using the following code:

evaluate_cross_validation(svc_2, X_train, y_train, 5)
[ 1.          0.95        0.98333333  0.98333333  0.93333333]
Mean score:  0.970 (+/- 0.012)

We obtain a mean accuracy of 0.970 with cross-validation. Now let's evaluate on our testing set:

train_and_evaluate(svc_2, X_train, X_test, y_train, y_test)
Accuracy on training set:
1.0
Accuracy on testing set:
0.99
Classification Report:
             precision    recall  f1-score   support

        0.0       1.00      0.99      0.99        67
        1.0       0.97      1.00      0.99        33

avg / total       0.99      0.99      0.99       100

Confusion Matrix:
[[66  1]
 [ 0 33]]

The confusion matrix shows a single error: one face without glasses was predicted as wearing glasses. But could it be that our classifier has learned to recognize which particular faces go with glasses, rather than the glasses themselves? How can we be sure this is not happening, and that we would get similar results on new, unseen faces? Let's set apart all ten images of one person who appears sometimes with glasses and sometimes without (the images with indexes from 30 to 39), train using the remaining instances, and evaluate on this new 10-instance set. With this experiment we try to rule out the possibility that the classifier is remembering faces rather than glasses-related features.

X_test = faces.data[30:40]
y_test = target_glasses[30:40]
print(y_test.shape[0])
# build a mask that selects every instance except indexes 30-39
select = np.ones(target_glasses.shape[0])
select[30:40] = 0
X_train = faces.data[select == 1]
y_train = target_glasses[select == 1]
print(y_train.shape[0])
svc_3 = SVC(kernel='linear')
train_and_evaluate(svc_3, X_train, X_test, y_train, y_test)
10
390
Accuracy on training set:
1.0
Accuracy on testing set:
0.9
Classification Report:
             precision    recall  f1-score   support

        0.0       0.83      1.00      0.91         5
        1.0       1.00      0.80      0.89         5

avg / total       0.92      0.90      0.90        10

Confusion Matrix:
[[5 0]
 [1 4]]

Out of the 10 images there was only one error: still pretty good results. Let's check which one was incorrectly classified. First, we have to reshape the data from flat arrays back to 64 x 64 matrices:

y_pred = svc_3.predict(X_test)
eval_faces = [np.reshape(a, (64, 64)) for a in X_test]

Then plot with our print_faces function:

print_faces(eval_faces, y_pred, 10)

Image number 8 in the preceding figure has glasses but was classified as no glasses. If we look at that instance, we can see that it differs from the rest of the images with glasses (the border of the glasses cannot be seen clearly and the person is shown with closed eyes), which could be the reason it was misclassified.
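
Instead of spotting the error by eye, we can also locate it programmatically (a small sketch, using the y_pred computed above):

# positions within the 10-image test set where prediction and truth disagree
errors = np.where(y_pred != y_test)[0]
print(errors)                          # index of the misclassified face
print(y_test[errors], y_pred[errors])  # true labels vs. predicted labels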
