sklearn Study Notes 3

Explaining Titanic hypothesis with decision trees

Decision trees are simple yet powerful supervised learning methods that build a tree-shaped decision model, which is then used to make predictions.

The main advantage of this model is that a human being can easily understand and reproduce the sequence of decisions (especially if the number of attributes is small) taken to predict the target class of a new instance. This is very important for tasks such as medical diagnosis or credit approval, where we want to show a reason for the decision, rather than just saying this is what the training data suggests (which is, by definition, what every supervised learning method does).

In this section, we will show you through a working example what decision trees look like, how they are built, and how they are used for prediction.

The problem we would like to solve is to determine whether a Titanic passenger would have survived, given her age, passenger class, and sex. We will use the Titanic dataset, which can be downloaded from http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt. Each instance in the dataset has the following form:

"1","1st",1,"Allen, Miss Elisabeth Walton",29.0000,"Southampton","StLouis, MO","B-5","24160 L221","2","female"

The list of attributes is: Ordinal, Class, Survived (0=no, 1=yes), Name, Age, Port of Embarkation, Home/Destination, Room, Ticket, Boat, and Sex. We will start by loading the dataset into a numpy array.

import csv
import numpy as np

with open('C:/Users/Administrator/Desktop/data/titanic.csv', 'rt') as csvfile:
    titanic_reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    # Header contains feature names
    row = next(titanic_reader)
    feature_names = np.array(row)

    # Load dataset, and target classes
    titanic_X, titanic_y = [], []
    for row in titanic_reader:
        titanic_X.append(row)
        titanic_y.append(row[2])  # The target value is "survived"
    titanic_X = np.array(titanic_X)
    titanic_y = np.array(titanic_y)

The code shown uses the Python csv module to load the data.

print(feature_names)
print(titanic_X[0], titanic_y[0])
['row.names' 'pclass' 'survived' 'name' 'age' 'embarked' 'home.dest' 'room'
 'ticket' 'boat' 'sex']
['1' '1st' '1' 'Allen, Miss Elisabeth Walton' '29.0000' 'Southampton'
 'St Louis, MO' 'B-5' '24160 L221' '2' 'female'] 1

Preprocessing the data

The first step we must take is to select the attributes we will use for learning:

# we keep class, age and sex
titanic_X = titanic_X[:, [1, 4, 10]]
feature_names = feature_names[[1, 4, 10]]

We have selected features 1, 4, and 10 (class, age, and sex), based on the assumption that the remaining attributes have no effect on the passenger's survival.

Sometimes feature selection is done by hand, based on our knowledge of the problem domain and of the machine learning method we plan to use; sometimes it is done with automated tools.

Very specific attributes (such as Name in our case) could result in overfitting (consider a tree that just asks whether the name is X to decide if she survived); attributes where there is a small number of instances with each value present a similar problem (they might not be useful for generalization). We will use class, age, and sex because, a priori, we expect them to have influenced the passenger's survival.

Now, our learning data looks like:

print(feature_names)
print(titanic_X[12], titanic_y[12])
['pclass' 'age' 'sex']
['1st' 'NA' 'female'] 1

We print instance number 12 here because it poses a problem we have to solve: one of its features (the age) is missing. Missing values are a very common problem in datasets. In this case, we decided to replace the missing values with the mean age over the training data. We could have taken a different approach, for instance, using the mode or the median of the training values. When we substitute missing values, we must be aware that we are modifying the original problem, so we have to be very careful about what we are doing. This is a general rule in machine learning: whenever we change the data, we should have a clear idea of what we are changing, to avoid skewing the final results.

# We have missing values for age
# Assign the mean value
ages = titanic_X[:, 1]
mean_age = np.mean(titanic_X[ages != 'NA', 1].astype(float))
titanic_X[titanic_X[:, 1] == 'NA', 1] = mean_age
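For completeness, the median-based variant mentioned above would be just as short. This is our own sketch, meant to replace the two mean-based lines above rather than run after them:

# Alternative imputation (replaces the mean-based lines above; not used below)
ages = titanic_X[:, 1]
median_age = np.median(titanic_X[ages != 'NA', 1].astype(float))
titanic_X[titanic_X[:, 1] == 'NA', 1] = median_age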

The implementation of decision trees in scikit-learn expects as input a list of real-valued features, and the decision rules of the model would be of the form:

Feature <= value

For example, age <= 20.0. Our attributes (except for age) are categorical; that is, they correspond to a value taken from a discrete set such as male and female. So, we have to convert categorical data into real values. Let's start with the sex feature. The preprocessing module of scikit-learn includes a LabelEncoder class, whose fit method allows conversion of a categorical set into a 0..K-1 integer, where K is the number of different classes in the set (in the case of sex, just 0 or 1):

# Encode sex
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
label_encoder = enc.fit(titanic_X[:, 2])
print("Categorical classes:", label_encoder.classes_)
Categorical classes: ['female' 'male']
integer_classes = label_encoder.transform(label_encoder.classes_)
print("Integer classes:", integer_classes)
t = label_encoder.transform(titanic_X[:, 2])
titanic_X[:, 2] = t
Integer classes: [0 1]

The last two statements transform the values of the sex attribute into 0/1 values and modify the training set.

print(feature_names)
print(titanic_X[12], titanic_y[12])
['pclass' 'age' 'sex']
['1st' '31.19418104265403' '0'] 1

We still have a categorical attribute: class. We could use the same approach and convert its three values into 0, 1, and 2. This transformation implicitly introduces an ordering between classes, something that is not an issue in our problem. However, we will try a more general approach that does not assume an ordering, and that is widely used to convert categorical classes into real-valued attributes. We will introduce an additional encoder and convert the class attribute into three new binary features, each of them indicating whether the instance belongs to that value (1) or not (0). This is called one-hot encoding, and it is a very common way of managing categorical attributes for methods that expect real-valued features:

from sklearn.preprocessing import OneHotEncoder

# First, convert the three class labels to 0-(N-1) integers using a new LabelEncoder
label_encoder = LabelEncoder().fit(titanic_X[:, 0])
integer_classes = label_encoder.transform(label_encoder.classes_).reshape(3, 1)
one_hot_encoder = OneHotEncoder().fit(integer_classes)
num_of_rows = titanic_X.shape[0]
t = label_encoder.transform(titanic_X[:, 0]).reshape(num_of_rows, 1)
# Second, create a sparse matrix with three columns, each one indicating if the instance belongs to the class
new_features = one_hot_encoder.transform(t)
# Add the new features to titanic_X
titanic_X = np.concatenate([titanic_X, new_features.toarray()], axis=1)
# Eliminate the converted column
titanic_X = np.delete(titanic_X, [0], 1)
# Update feature names
feature_names = ['age', 'sex', 'first_class', 'second_class', 'third_class']
# Convert to numerical values
titanic_X = titanic_X.astype(float)
titanic_y = titanic_y.astype(float)
titanic_y = titanic_y.astype(float)

The preceding code first converts the classes into integers and then uses the OneHotEncoder class to create the three new attributes that are added to the array of features. It finally eliminates the original class feature from the training data.

print(feature_names)
print(titanic_X[0], titanic_y[0])
['age', 'sex', 'first_class', 'second_class', 'third_class']
[ 29.   0.   1.   0.   0.] 1.0
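As an aside, recent scikit-learn releases (0.20 and later) let OneHotEncoder consume string categories directly, so the intermediate LabelEncoder step could be skipped. The following is only a sketch on a toy column, since our original pclass column has already been deleted at this point:

from sklearn.preprocessing import OneHotEncoder

# Sketch only: one-hot encode a string column directly (scikit-learn >= 0.20)
pclass_strings = np.array(['1st', '2nd', '3rd', '1st']).reshape(-1, 1)
encoder = OneHotEncoder()                       # returns a sparse matrix by default
dummies = encoder.fit_transform(pclass_strings).toarray()
print(encoder.categories_)                      # the categories found in the column
print(dummies)                                  # one binary column per category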

We now have a suitable learning set for scikit-learn to learn a decision tree from. Also, standardization is not an issue for decision trees, because the relative magnitude of the features does not affect the classifier's performance.

The preprocessing step is usually underestimated in machine learning, but as we can see even in this very simple example, it can take some time to make the data look the way our methods expect. It is also very important in the overall machine learning process; if we fail at this step (for example, by incorrectly encoding attributes or selecting the wrong features), the following steps will fail, no matter how good the method we use for learning.

Training a decision tree classifier

Now for the interesting part: let's build a decision tree from our training data. As usual, we will first separate training and testing data.

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(titanic_X,titanic_y, test_size=0.25, random_state=33)

Now, we can create a new DecisionTreeClassifier and use the fit method of the classifier to do the learning job.

from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
clf = clf.fit(X_train, y_train)

DecisionTreeClassifier accepts (as most learning methods do) several hyperparameters that control its behavior. In this case, we used the Information Gain (IG) criterion for splitting the learning data, told the method to build a tree of at most three levels, and to accept a node as a leaf if it includes at least five training instances. To explain this and show how decision trees work, let's visualize the model we built. The following code assumes you are using IPython and that your Python installation includes the pydotplus module. It generates Graphviz code from the tree and assumes that Graphviz itself is installed. For more information about Graphviz, please refer to http://www.graphviz.org/.

import pydotplus
from io import StringIO
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data, feature_names=['age', 'sex', '1st_class', '2nd_class', '3rd_class'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('titanic.png')
from IPython.core.display import Image
Image(filename='titanic.png')
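If Graphviz or pydotplus is not available, a roughly equivalent picture can be drawn with matplotlib alone, assuming a reasonably recent scikit-learn (0.21 or later) that ships the plot_tree function:

import matplotlib.pyplot as plt

# Alternative visualization without Graphviz (requires scikit-learn >= 0.21)
plt.figure(figsize=(12, 6))
tree.plot_tree(clf,
               feature_names=['age', 'sex', '1st_class', '2nd_class', '3rd_class'],
               class_names=['died', 'survived'],   # our own labels for target classes 0 and 1
               filled=True)
plt.show()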

You might be asking how our method decides which question should be asked at each step. The answer is Information Gain (IG) (or the Gini index, a similar measure of disorder also used by scikit-learn). IG measures how much entropy we lose if we answer the question, or alternatively, how much more certain we are after answering it. Entropy is a measure of disorder in a set: it is zero when all values are the same (in our case, when all instances belong to the same target class), and it reaches its maximum when there is an equal number of instances of each class (in our case, when half of the instances correspond to survivors and the other half to non-survivors). At each node, we have a certain number of instances (starting from the whole dataset), and we measure their entropy. Our method selects the question that yields the most homogeneous partitions (those with the lowest entropy) when we consider separately the instances for which the answer is yes and those for which it is no; that is, the question after which the entropy decreases the most.

Interpreting the decision tree

As you can see in the tree, at the beginning of the decision tree growing process, you have the 984 instances in the training set, 662 of them corresponding to class 0 (fatalities), and 322 of them to class 1 (survivors). The measured entropy for this initial group is about 0.9121. From the possible list of questions we can ask, the one that produces the greatest information gain is: Was she a woman? (remember that the female category was encoded as 0). If the answer is yes, entropy is almost the same, but if the answer is no, it is greatly reduced (the proportion of men who died was much greater than the general proportion of casualties). In this sense, the woman question seems to be the best to ask. After that, the process continues, working in each node only with the instances that have feature values that correspond to the questions in the path to the node.
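As a quick sanity check, we can reproduce that 0.9121 figure directly from the class counts with a few lines of NumPy (this small helper is our own, not part of scikit-learn):

def class_entropy(counts):
    # Shannon entropy (base 2) of a class distribution given as counts
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return -np.sum(p * np.log2(p))

# Root node: 662 fatalities and 322 survivors out of 984 training instances
print(class_entropy([662, 322]))   # approximately 0.912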

If you look at the tree, each node shows the question, the initial Shannon entropy, the number of instances we are considering, and their distribution with respect to the target class. At each step, the number of instances is reduced to those that answer yes (the left branch) or no (the right branch) to the question posed by that node. The process continues until a certain stopping criterion is met (in our case, until we have a fourth-level node, or the number of considered samples is fewer than five).

At prediction time, we take an instance and start traversing the tree, answering the questions based on the instance's features, until we reach a leaf. At this point, we look at how many instances of each class we had in the training set, and select the class to which most of them belonged.

For example, consider the question of whether a 10-year-old girl from first class would have survived. The answer to the first question (was she female?) is yes, so we take the left branch of the tree. For the two following questions the answers are no (was she from third class?) and yes (was she from first class?), so we take the left and then the right branch, respectively. At this point we have reached a leaf. In the training set, we had 102 people with these attributes, 97 of them survivors. So, our answer would be survived.
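We can double-check this path with the fitted classifier itself. The feature vector below follows our encoding (age, sex, first_class, second_class, third_class), with sex = 0 for female; given the leaf counts above, the prediction should be class 1 (survived):

# A 10-year-old girl travelling in first class, using our feature encoding
girl = [[10.0, 0.0, 1.0, 0.0, 0.0]]   # age, sex (0 = female), 1st, 2nd, 3rd class
print(clf.predict(girl))              # should print [ 1.], that is, survived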

In general, we found reasonable results: the group with the most casualties (449 of 496) corresponded to adult men from second or third class, as you can check in the tree. Most girls from first class, on the other hand, survived. Let's measure the accuracy of our method on the training set (we will first define a helper function to measure the performance of a classifier):

from sklearn import metrics

def measure_performance(X, y, clf, show_accuracy=True,
                        show_classification_report=True, show_confusion_matrix=True):
    y_pred = clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y, y_pred), "\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y, y_pred), "\n")

measure_performance(X_train, y_train, clf, show_classification_report=False, show_confusion_matrix=False)
Accuracy:0.838 

Our tree has an accuracy of 0.838 on the training set. But remember that this is not a good indicator; it is especially misleading for decision trees, as this method is highly susceptible to overfitting. Since we did not set aside an evaluation set, we should apply cross-validation. For this example, we will use an extreme form of cross-validation, named leave-one-out cross-validation. For each instance in the training sample, we train on the rest of the sample and evaluate the model on the single instance left out. After performing as many classifications as there are training instances, we calculate the accuracy simply as the proportion of times our method correctly predicted the class of the left-out instance, and we find that it is a little lower (as we expected) than the resubstitution accuracy on the training set.

from sklearn.cross_validation import cross_val_score, LeaveOneOut
from scipy.stats import sem

def loo_cv(X_train, y_train, clf):
    # Perform Leave-One-Out cross-validation:
    # we train and evaluate one classifier per training instance
    loo = LeaveOneOut(X_train[:].shape[0])
    scores = np.zeros(X_train[:].shape[0])
    for train_index, test_index in loo:
        X_train_cv, X_test_cv = X_train[train_index], X_train[test_index]
        y_train_cv, y_test_cv = y_train[train_index], y_train[test_index]
        clf = clf.fit(X_train_cv, y_train_cv)
        y_pred = clf.predict(X_test_cv)
        scores[test_index] = metrics.accuracy_score(y_test_cv.astype(int), y_pred.astype(int))
    print(("Mean score: {0:.3f} (+/-{1:.3f})").format(np.mean(scores), sem(scores)))

loo_cv(X_train, y_train, clf)
Mean score: 0.837 (+/-0.012)

The main advantage of leave-one-out cross-validation is that it allows almost as much data for training as we have available, so it is particularly well suited to cases where data is scarce. Its main problem is that training a different classifier for each instance can be very costly in terms of computation time.
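When leave-one-out is too expensive, k-fold cross-validation (for example, 10 folds) usually gives a similar estimate at a fraction of the cost; a minimal sketch using the cross_val_score helper we already imported:

# Cheaper alternative: 10-fold cross-validation on the training set
cv_scores = cross_val_score(clf, X_train, y_train, cv=10)
print("Mean score: {0:.3f} (+/-{1:.3f})".format(np.mean(cv_scores), sem(cv_scores)))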

A big question remains here: how did we select the hyperparameters when instantiating our method? This problem is a general one; it is called model selection.
