1. Evaluation metrics in sklearn.metrics


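* Functions whose names end in _score return a model score; in general, higher is better.
* Functions whose names end in _error or _loss return a measure of model error; in general, lower is better.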


2. Evaluating regression models

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from sklearn import datasets
%matplotlib inline
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')
* explained_variance_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average'): explained variance score (the proportion of the variance of the target that the predictions account for)
* mean_absolute_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average'): mean absolute error
* mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average'): mean squared error
* median_absolute_error(y_true, y_pred): median absolute error
* r2_score(y_true, y_pred, sample_weight=None, multioutput='uniform_average'): R^2, the coefficient of determination
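
To make the definitions of these metrics concrete, the following is a minimal from-scratch sketch that uses only the Python standard library, not the scikit-learn implementations. The four-sample arrays are illustrative values chosen for this example.

```python
from statistics import mean, median, pvariance

# Illustrative targets and predictions (assumed values, not from a real model)
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Residuals y - y_hat
residuals = [t - p for t, p in zip(y_true, y_pred)]

mae = mean(abs(r) for r in residuals)      # mean absolute error
mse = mean(r * r for r in residuals)       # mean squared error
medae = median(abs(r) for r in residuals)  # median absolute error
# Explained variance: 1 - Var(y - y_hat) / Var(y), using population variances
evs = 1 - pvariance(residuals) / pvariance(y_true)

print(mae, mse, medae, round(evs, 4))
```

The errors (MAE, MSE, median absolute error) shrink toward 0 as predictions improve, while the explained variance score grows toward 1, in line with the naming convention above.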

2.1 r2_score

R^2 = 1-{\frac {\frac{1}{n}\sum_{i=1}^n(y^{(i)}-\hat{y}^{(i)})^2} {\frac{1}{n}\sum_{i=1}^n(y^{(i)}-\mu_{y})^2} }
R^2 = 1-{\frac{MSE}{Var(y)}}
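
The formula R^2 = 1 - MSE / Var(y) can be checked by hand. The sketch below is a plain-Python illustration of that formula, not the scikit-learn implementation; the function name r2 and the sample values are assumptions made for this example.

```python
def r2(y_true, y_pred):
    """Coefficient of determination computed as 1 - MSE / Var(y)."""
    n = len(y_true)
    mu = sum(y_true) / n                                          # mean of the targets
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n   # numerator
    var = sum((t - mu) ** 2 for t in y_true) / n                  # denominator
    return 1 - mse / var


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(round(r2(y_true, y_pred), 4))
```

A perfect prediction gives R^2 = 1, and always predicting the mean of y gives R^2 = 0, which is why larger values indicate a better fit.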

# Coefficient of determination (R^2) example
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

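# Note: load_boston was removed in scikit-learn 1.2; on newer releases,
# substitute another regression data set such as datasets.load_diabetes()
boston = datasets.load_boston()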
X = boston.data
y = boston.target

lr = LinearRegression()
lr.fit(X, y)
lr_predict = lr.predict(X)

lr_r2 = r2_score(y, lr_predict)

2.2 explained_variance_score

# Explained variance example
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score

boston = datasets.load_boston()
X = boston.data
y = boston.target

lr = LinearRegression()
lr.fit(X, y)
lr_predict = lr.predict(X)

ex_var = explained_variance_score(y, lr_predict)

3. Evaluating classification models


* accuracy_score(y_true, y_pred): accuracy
* auc(x, y, reorder=False): area under a curve given its points (x, y), typically the ROC curve; a larger AUC indicates better performance
* average_precision_score(y_true, y_score, average='macro', sample_weight=None): average precision (AP) computed from prediction scores
* brier_score_loss(y_true, y_prob, sample_weight=None, pos_label=None): Brier score; the smaller the Brier score, the better the model
* confusion_matrix(y_true, y_pred, labels=None, sample_weight=None): evaluates classification accuracy by computing the confusion matrix, which it returns
* f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None): F1 score
* log_loss(y_true, y_pred, eps=1e-15, normalize=True, sample_weight=None, labels=None): log loss, also called logistic loss or cross-entropy loss
* precision_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None): precision; precision = TP/(TP+FP)
* recall_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None): recall; recall = TP/(TP+FN)
* roc_auc_score(y_true, y_score, average='macro', sample_weight=None): area under the ROC curve (AUC); the larger the better
* roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True): computes the coordinates of the ROC curve, i.e. the FPR and TPR at each threshold

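In binary classification, each sample falls into one of four cases according to the combination of its true class and the class predicted by the model: true positive, false positive, true negative, or false negative. Let TP, FP, TN, and FN denote the number of samples in each case, so that \(\text{total number of samples} = TP+FP+TN+FN\).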

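  • TP: number of positive samples predicted as positive
  • FP: number of negative samples predicted as positive
  • TN: number of negative samples predicted as negative
  • FN: number of positive samples predicted as negative

Confusion matrix:

                  Actual 1               Actual 0
Predicted 1       True Positive (TP)     False Positive (FP)
Predicted 0       False Negative (FN)    True Negative (TN)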
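
As a concrete illustration of the four cases, the sketch below counts TP, FP, TN, and FN directly from two label lists. It is a plain-Python sketch rather than sklearn's confusion_matrix; the helper name confusion_counts and the labels are assumptions made for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1   # positive predicted as positive
            else:
                fp += 1   # negative predicted as positive
        else:
            if t == positive:
                fn += 1   # positive predicted as negative
            else:
                tn += 1   # negative predicted as negative
    return tp, fp, tn, fn


# Illustrative labels (assumed values)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)
```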

3.1 Accuracy

Accuracy = {\frac{TP+TN}{TP+FP+TN+FN}} = \frac{\text{number of correctly predicted samples}}{\text{total number of samples}}

# Accuracy example
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

iris_data = datasets.load_iris()
X = iris_data.data
y = iris_data.target

lr = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
lr = lr.fit(X, y)

y_pred = lr.predict(X)
print('Accuracy: {:.2f}'.format(accuracy_score(y, y_pred)))

3.2 Precision

P = {\frac{TP}{TP+FP}} = \frac{\text{number of samples correctly predicted as positive}}{\text{total number of samples predicted as positive}}

# Precision example
from sklearn import datasets
from sklearn.metrics import precision_score
from sklearn.linear_model import LogisticRegression

iris_data = datasets.load_iris()
X = iris_data.data
y = iris_data.target

lr = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
lr = lr.fit(X, y)

y_pred = lr.predict(X)
print('Precision: {:.2f}'.format(precision_score(y, y_pred, average='weighted')))

3.3 Recall

R = {\frac{TP}{TP+FN}} = \frac{\text{number of samples correctly predicted as positive}}{\text{total number of positive samples}}
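
Accuracy, precision, and recall all follow directly from the four confusion-matrix counts. The following is a small worked example with assumed counts (100 samples in total), intended only to show the arithmetic.

```python
# Assumed confusion-matrix counts for illustration
tp, fp, tn, fn = 40, 10, 45, 5

accuracy = (tp + tn) / (tp + fp + tn + fn)  # correctly predicted / all samples
precision = tp / (tp + fp)                  # correct positives / predicted positives
recall = tp / (tp + fn)                     # correct positives / actual positives

print(accuracy, precision, round(recall, 4))
```

Here precision is lower than recall because the model produces more false positives (10) than false negatives (5).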

# Recall example
from sklearn.metrics import recall_score
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris_data = datasets.load_iris()
X = iris_data.data
y = iris_data.target

lr = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
lr = lr.fit(X, y)

y_pred = lr.predict(X)
print('Recall: {:.2f}'.format(recall_score(y, y_pred, average='weighted')))

3.4 F1 score



F_1 = {\frac{2*P*R}{P+R}} = {\frac{2*TP}{2TP+FP+FN}} = {\frac{2*TP}{\text{total number of samples}+TP-TN}}

F_\beta = {\frac{(1+\beta^2)*P*R}{\beta^2*P+R}}


  1. When \(\beta<1\), precision \(P\) carries more weight, i.e. precision matters more
  2. When \(\beta=1\), \(F_\beta = F_1\)
  3. When \(\beta>1\), recall \(R\) carries more weight, i.e. recall matters more
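
The effect of \(\beta\) is easiest to see numerically. The sketch below evaluates the F_\beta formula for an assumed precision of 0.5 and recall of 1.0; the function name f_beta is hypothetical and is not the scikit-learn API.

```python
def f_beta(precision, recall, beta):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


p, r = 0.5, 1.0                 # assumed precision and recall
f_half = f_beta(p, r, 0.5)      # beta < 1: pulled toward precision
f_one = f_beta(p, r, 1.0)       # beta = 1: the ordinary F1 score
f_two = f_beta(p, r, 2.0)       # beta > 1: pulled toward recall

print(round(f_half, 4), round(f_one, 4), round(f_two, 4))
```

With these values the score moves from about 0.56 (close to the precision of 0.5) to about 0.83 (close to the recall of 1.0) as \(\beta\) increases.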
# F1 score example
from sklearn import datasets
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression

iris_data = datasets.load_iris()
X = iris_data.data
y = iris_data.target

lr = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
lr = lr.fit(X, y)

y_pred = lr.predict(X)
print('F1 score: {:.2f}'.format(f1_score(y, y_pred, average='weighted')))

3.5 ROC curve

The receiver operating characteristic (ROC) curve is another way to measure model performance. Its horizontal axis is the false positive rate (FPR) and its vertical axis is the true positive rate (TPR), defined as:
FPR = {\frac{FP}{FP+TN}} \quad \text{(false positive rate)}
TPR = {\frac{TP}{TP+FN}} \quad \text{(true positive rate)}
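
An ROC curve is traced by sweeping a decision threshold over the model's scores and recording (FPR, TPR) at each threshold. The following is a minimal plain-Python sketch of that idea, not sklearn's roc_curve; the function name roc_points and the scores are assumptions made for this example.

```python
def roc_points(y_true, scores):
    """Return the (FPR, TPR) points obtained by lowering the threshold step by step."""
    pos = sum(y_true)            # number of actual positives
    neg = len(y_true) - pos      # number of actual negatives
    points = [(0.0, 0.0)]        # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score reaches the threshold
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points


# Illustrative labels and scores (assumed values)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
print(points)
```

The curve always starts at (0, 0) and ends at (1, 1); a better model rises toward the top-left corner sooner.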

# ROC example
from sklearn import datasets
from sklearn.metrics import roc_curve
from sklearn.linear_model import LogisticRegression

iris_data = datasets.load_iris()
X = iris_data.data[0:100, :]
y = iris_data.target[0:100]

lr = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
lr = lr.fit(X, y)

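# Use the predicted probability of class 1 as the score; hard labels from
# predict() would give the curve only a single intermediate point
y_score = lr.predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, y_score)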
plt.xlabel('FPR', fontsize=15)
plt.ylabel('TPR', fontsize=15)
plt.title('FPR-TPR', fontsize=20)
plt.plot(fpr, tpr)

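3.6 AUC

Because an ROC curve alone does not always allow a precise comparison between models (for example, when two curves cross), the area enclosed by the ROC curve and the horizontal axis, known as AUC (area under the ROC curve), is used as a single-number measure: the larger the AUC, the better the model.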
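
AUC also has a ranking interpretation: it equals the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below computes AUC that way in plain Python; the function name auc_by_ranking and the data are assumptions made for this example, and this is not how sklearn's roc_auc_score is implemented.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Illustrative labels and scores (assumed values)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = auc_by_ranking(y_true, scores)
print(auc_value)
```

Three of the four positive/negative pairs are ordered correctly here, so the AUC is 0.75; a model that ranks every positive above every negative reaches 1.0.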

# AUC example
from sklearn import datasets
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

iris_data = datasets.load_iris()
X = iris_data.data[0:100, :]
y = iris_data.target[0:100]

lr = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
lr = lr.fit(X, y)

# Score each sample with its predicted probability of class 1
y_score = lr.predict_proba(X)[:, 1]
# Compute the AUC
print('AUC: {:.2f}'.format(roc_auc_score(y, y_score, average='weighted')))

4. Underfitting and overfitting


# Overfitting illustration
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')
%matplotlib inline

# Build a small custom data set and reshape it into column vectors
data_frame = {'x': [2, 1.5, 3, 3.2, 4.22, 5.2, 6, 6.7],
              'y': [0.5, 3.5, 5.5, 5.2, 5.5, 5.7, 5.5, 6.25]}
df = pd.DataFrame(data_frame)
X, y = df.iloc[:, 0].values.reshape(-1, 1), df.iloc[:, 1].values.reshape(-1, 1)

# Linear regression
lr = LinearRegression()
lr.fit(X, y)

def poly_lr(degree):
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)
    lr_poly = LinearRegression()
    lr_poly.fit(X_poly, y)
    y_pred_poly = lr_poly.predict(X_poly)

    return y_pred_poly

def plot_lr():
    plt.scatter(X, y, c='k', edgecolors='white', s=50)
    plt.plot(X, lr.predict(X), color='r', label='lr')
    # Noisy point (outlier)
    plt.scatter(2, 0.5, c='r')
    plt.text(2, 0.5, s='$(2,0.5)$')

    plt.xlim(0, 7)
    plt.ylim(0, 8)

def plot_poly(degree, color):
    plt.scatter(X, y, c='k', edgecolors='white', s=50)
    plt.plot(X, poly_lr(degree), color=color, label='m={}'.format(degree))
    # Noisy point (outlier)
    plt.scatter(2, 0.5, c='r')
    plt.text(2, 0.5, s='$(2,0.5)$')

    plt.xlim(0, 7)
    plt.ylim(0, 8)

def run():
    # Assumed layout: one panel per figure in a 2x3 grid
    plt.figure(figsize=(15, 8))
    plt.subplot(231)
    plt.title('Figure 1 (linear regression)', color='r', fontsize=12)
    plot_lr()
    plt.subplot(232)
    plt.title('Figure 2 (degree-1 polynomial regression)', color='r', fontsize=12)
    plot_poly(1, 'orange')
    plt.subplot(233)
    plt.title('Figure 3 (degree-3 polynomial regression)', color='r', fontsize=12)
    plot_poly(3, 'gold')
    plt.subplot(234)
    plt.title('Figure 4 (degree-5 polynomial regression)', color='r', fontsize=12)
    plot_poly(5, 'green')
    plt.subplot(235)
    plt.title('Figure 5 (degree-7 polynomial regression)', color='r', fontsize=12)
    plot_poly(7, 'blue')
    plt.subplot(236)
    plt.title('Figure 6 (degree-10 polynomial regression)', color='r', fontsize=12)
    plot_poly(10, 'violet')
    plt.show()


run()


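  • Figure 1: linear regression fitted to the sample points; the points lie far from the fitted line, which is generally called underfitting
  • Figure 2: degree-1 polynomial regression, which is equivalent to linear regression
  • Figure 3: degree-3 polynomial regression, which fits reasonably well
  • Figure 4: degree-5 polynomial regression, which clearly overfits
  • Figure 5: degree-7 polynomial regression, which passes through every sample point and is unquestionably overfitting
  • Figure 6: degree-10 polynomial regression; the fitted curve is no different from the degree-7 curve, and polynomials of degree higher than 10 can be expected to produce similar curves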


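4.1 Cross-validation



4.1.1 Simple cross-validation


  1. Initialize \(c=1\)
  2. Train the model
  3. Test the model and set \(c=c+1\)
  4. If \(c<11\), change the model parameters and go back to step 2; otherwise, stop training
  5. Select the best-performing model from the model set \(\{c_1,c_2,\cdots,c_{10}\}\)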
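
The loop described in the steps above can be sketched in plain Python. This is only an illustration under several assumptions: the data are synthetic 1-D points, the model is a tiny k-nearest-neighbours classifier whose parameter k plays the role of c (so "training" just means keeping the training points), and the helper name knn_predict is hypothetical.

```python
import random


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


rng = random.Random(0)
# Synthetic data (assumed): class 1 tends to have larger x than class 0
data = [(rng.gauss(0, 1), 0) for _ in range(50)]
data += [(rng.gauss(2, 1), 1) for _ in range(50)]
rng.shuffle(data)

# Single hold-out split: 70% for training, 30% for testing
train, test = data[:70], data[70:]

results = {}
for c in range(1, 11):  # steps 1 to 4: try c = 1, ..., 10
    correct = sum(knn_predict(train, x, c) == y for x, y in test)
    results[c] = correct / len(test)  # accuracy on the test set

# Step 5: keep the parameter value with the best test accuracy
best_c = max(results, key=results.get)
print(best_c, results[best_c])
```

Because the same test set is used both to choose c and to report the final score, the reported accuracy tends to be optimistic; this is one motivation for the k-fold schemes that follow.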
# Simple cross-validation
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the iris data
iris_data = datasets.load_iris()
X = iris_data.data[:, [0, 1]]
y = iris_data.target

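# random_state=1 makes the split reproducible; stratify=y keeps the class proportions the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

print('Samples per class (all data): {}'.format(np.bincount(y)))
print('Samples per class (training data): {}'.format(np.bincount(y_train)))
print('Samples per class (test data): {}'.format(np.bincount(y_test)))
Samples per class (all data): [50 50 50]
Samples per class (training data): [35 35 35]
Samples per class (test data): [15 15 15]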

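4.1.2 Stratified k-fold cross-validation


  1. Split the data into \(k\) subsets
  2. Train the model on \(k-1\) of the subsets
  3. Test the model on the remaining subset
  4. Repeat steps 2 and 3 until there are \(k\) models
  5. Select the best-performing model among the \(k\) models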
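
The splitting in steps 1 to 3 can be sketched without any library. The generator below deals the indices of each class round-robin into k folds so that every fold keeps roughly the original class proportions. It is a simplified illustration with a hypothetical name (stratified_kfold), not the algorithm scikit-learn's StratifiedKFold uses.

```python
from collections import defaultdict


def stratified_kfold(labels, k):
    """Yield (train_indices, test_indices) for k class-balanced folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class round-robin

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j in range(k) if j != i for idx in folds[j])
        yield train, test


# Illustrative labels (assumed): six samples of class 0, six of class 1
labels = [0] * 6 + [1] * 6
splits = list(stratified_kfold(labels, 3))
for fold, (train, test) in enumerate(splits):
    print(fold, train, test)
```

Every sample appears in exactly one test fold, and each test fold contains two samples of each class.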
# Stratified k-fold cross-validation
import numpy as np
from sklearn import datasets
# StratifiedKFold stratifies the data according to the original label distribution
from sklearn.model_selection import StratifiedKFold

# Load the iris data
iris_data = datasets.load_iris()
X = iris_data.data[:, [0, 1]]
y = iris_data.target

# n_splits=3 corresponds to k=3. random_state only has an effect when shuffle=True,
# and newer scikit-learn releases raise an error if it is set without shuffling
kfold = StratifiedKFold(n_splits=3)
kfold = kfold.split(X, y)

for k, (train_data, test_data) in enumerate(kfold):
    print('Iteration: {}'.format(k), 'Training set size: {}'.format(
        len(train_data)), 'Test set size: {}'.format(len(test_data)))
Iteration: 0 Training set size: 99 Test set size: 51
Iteration: 1 Training set size: 99 Test set size: 51
Iteration: 2 Training set size: 102 Test set size: 48

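4.1.3 Shuffle-split (random permutation) cross-validation

# Random permutation cross-validation
import numpy as np
from sklearn import datasets
# ShuffleSplit draws an independent random train/test split on each iteration
from sklearn.model_selection import ShuffleSplit

# Load the iris data
iris_data = datasets.load_iris()
X = iris_data.data[:, [0, 1]]
y = iris_data.target

# n_splits=3 produces three random splits; by default 10% of the samples go to the test set
kfold = ShuffleSplit(n_splits=3, random_state=1)
kfold = kfold.split(X)

for k, (train_data, test_data) in enumerate(kfold):
    # Print the training indices of this split, then the split sizes
    print(train_data)
    print('Iteration: {}'.format(k), 'Training set size: {}'.format(
        len(train_data)), 'Test set size: {}'.format(len(test_data)))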
[ 42  92  66  31  35  90  84  77  40 125  99  33  19  73 146  91 135  69
 128 114  48  53  28  54 108 112  17 119 103  58 118  18   4  45  59  39
  36 117 139 107 132 126  85 122  95  11 113 123  12   2 104   6 127 110
  65  55 144 138  46  62  74 116  93 100  89  10  34  32 124  38  83 111
 149  27  23  67   9 130  97 105 145  87 148 109  64  15  82  41  80  52
  26  76  43  24 136 121 143  49  21  70   3 142  30 147 106  47 115  13
  88   8  81  60   0   1  57  22  61  63   7  86  96  68  50 101  20  25
 134  71 129  79 133 137  72 140  37]
Iteration: 0 Training set size: 135 Test set size: 15
[ 18  37  59 111  65 119 127 102 121 118  90 146   3  51 100 133 105  23
  57 123  49   9  72 126 124 145  68 143   6  13 120  89 135  22  99  92
 130  39  58  81  52 117   4  17 138  97  70 109 148  42  73 115   5  76
  38  86 122  80  95  34  60 129 112   7  26  19  14  30  15  44  20 137
 107  64  41  79  50 131 108 144 104   8  74  94 103  31  82  55 125  32
  54  48  83 149   2  33  93 136  35  75  63  29   0  46  78  66 140  67
 128 106  28  16  87  45  47 113  77  40  21 101  69  53  24 134  43 116
 141 142  25 147  56  61  96  10  84]
Iteration: 1 Training set size: 135 Test set size: 15
[ 62 135  20  56  77  55  65  87   5  97 117  10 142  74  17  12  45 102
  50  96 124  48   8  47 122 148  29 130  71 147   7 128 104  91 140  79
  60 136  86  67  33  68   0 129  49 121  99  32  59 110 101  14   6 123
 108  37 107 111  21  26  42  58  75  78  90 145 139  63  38  18  40 119
 100 126 134  28  72 144  80  46 113 149  85   2  81 116  35 115 138 137
  16 125 105  11 120 141  76  93 109  88  57  41   9  53  95 106  92  66
  22  23  36  13 132  61  83  39  70 131 146  98  64 103  30  84  94 127
  82   1  43  27  89  52  73  69 112]
Iteration: 2 Training set size: 135 Test set size: 15

4.1.4 Leave-one-out cross-validation


# Leave-one-out cross-validation
import numpy as np
from sklearn import datasets
from sklearn.model_selection import LeaveOneOut

# Load the iris data
iris_data = datasets.load_iris()
X = iris_data.data[:, [0, 1]]
y = iris_data.target

loo = LeaveOneOut()
count = 0
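for train_index, test_index in loo.split(X):
    # Print only the first 10 splits
    if count < 10:
        print("Training set size:", len(train_index), "Test set size:", len(test_index))
    count += 1
    # After the last split, report the total number of iterations (one per sample)
    if count == loo.get_n_splits(X):
        print('...\nIterations:', count)
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
Training set size: 149 Test set size: 1
...
Iterations: 150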

4.1.5 Time series split


import numpy as np
from sklearn.model_selection import TimeSeriesSplit
X = np.array([[1, 2], [2, 4], [3, 2], [2, 4], [1, 2], [3, 2]])
y = np.array([1, 3, 3, 4, 5, 4])
# max_train_size caps the number of training samples; n_splits is the number of splits
tscv = TimeSeriesSplit(n_splits=5, max_train_size=3)
TimeSeriesSplit(max_train_size=3, n_splits=5)
for train_index, test_index in tscv.split(X):
    print("Training indices:", train_index, "Test indices:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Training indices: [0] Test indices: [1]
Training indices: [0 1] Test indices: [2]
Training indices: [0 1 2] Test indices: [3]
Training indices: [1 2 3] Test indices: [4]
Training indices: [2 3 4] Test indices: [5]
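
The pattern in the output above (the training window always precedes the test sample and never exceeds max_train_size) can be reproduced with a short plain-Python generator. This is a sketch of the idea under the assumption of equally sized test blocks, with a hypothetical function name; it is not scikit-learn's implementation.

```python
def time_series_split(n_samples, n_splits, max_train_size=None):
    """Yield (train_indices, test_indices) where training data always precede test data."""
    test_size = n_samples // (n_splits + 1)
    first_test = n_samples - n_splits * test_size
    for start in range(first_test, n_samples, test_size):
        train = list(range(start))             # everything before the test block
        if max_train_size is not None:
            train = train[-max_train_size:]    # keep only the most recent samples
        test = list(range(start, start + test_size))
        yield train, test


splits = list(time_series_split(6, n_splits=5, max_train_size=3))
for train, test in splits:
    print("Training indices:", train, "Test indices:", test)
```

Unlike the shuffled schemes above, no future sample ever leaks into the training set, which is the point of splitting time-ordered data this way.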

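5. Using cross-validation together with models


5.1 cross_val_score


from sklearn.metrics import SCORERS

# Available scoring methods (SCORERS was removed in scikit-learn 1.3;
# newer releases provide sklearn.metrics.get_scorer_names() instead)
SCORERS.keys()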
dict_keys(['explained_variance', 'r2', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'accuracy', 'roc_auc', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'brier_score_loss', 'adjusted_rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted'])
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000)
scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
array([1.        , 0.93333333, 1.        , 1.        , 0.93333333,
       0.93333333, 0.93333333, 1.        , 1.        , 1.        ])
print('Accuracy: {:.4f}(+/-{:.4f})'.format(scores.mean(), scores.std()*2))
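
The summary line reports the mean of the ten fold scores plus or minus two standard deviations. The sketch below redoes that arithmetic with the standard library, using the fold scores listed in the output above; pstdev (population standard deviation) is used because it matches NumPy's default for std().

```python
from statistics import mean, pstdev

# The ten fold accuracies shown in the output above
scores = [1.0, 0.93333333, 1.0, 1.0, 0.93333333,
          0.93333333, 0.93333333, 1.0, 1.0, 1.0]

avg = mean(scores)             # average accuracy across folds
spread = 2 * pstdev(scores)    # two population standard deviations

summary = 'Accuracy: {:.4f}(+/-{:.4f})'.format(avg, spread)
print(summary)
```

The interval is only a rough indication of variability across folds, since the ten scores come from overlapping training sets and are not independent.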

5.2 cross_validate


from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000)
cross_validate(clf, X, y, cv=10, scoring=[
    'accuracy', 'recall_weighted'], return_train_score=True)
{'fit_time': array([0.04038572, 0.06277108, 0.07863808, 0.03404975, 0.03079391,
        0.04499412, 0.04462409, 0.06048512, 0.05675983, 0.03511214]),
 'score_time': array([0.00144005, 0.00148797, 0.00143886, 0.00105596, 0.00098372,
        0.00138307, 0.00099993, 0.00111103, 0.0020051 , 0.00080705]),
 'test_accuracy': array([1.        , 0.93333333, 1.        , 1.        , 0.93333333,
        0.93333333, 0.93333333, 1.        , 1.        , 1.        ]),
 'train_accuracy': array([0.97037037, 0.97777778, 0.97037037, 0.97037037, 0.97777778,
        0.97777778, 0.98518519, 0.97037037, 0.97037037, 0.97777778]),
 'test_recall_weighted': array([1.        , 0.93333333, 1.        , 1.        , 0.93333333,
        0.93333333, 0.93333333, 1.        , 1.        , 1.        ]),
 'train_recall_weighted': array([0.97037037, 0.97777778, 0.97037037, 0.97037037, 0.97777778,
        0.97777778, 0.98518519, 0.97037037, 0.97037037, 0.97777778])}

5.3 cross_val_predict


from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=1000)
per_sample = cross_val_predict(clf, X, y, cv=10)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
from sklearn.metrics import accuracy_score

accuracy_score(y, per_sample)

6. Model-specific cross-validation


from sklearn import datasets
from sklearn.metrics import r2_score
from sklearn.linear_model import Lasso, LassoCV

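# Note: load_boston was removed in scikit-learn 1.2; on newer releases,
# substitute another regression data set such as datasets.load_diabetes()
boston = datasets.load_boston()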
X = boston.data
y = boston.target

reg = Lasso()
reg.fit(X, y)
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
y_pred = reg.predict(X)

'Coefficient of determination: {:.2f}'.format(r2_score(y, y_pred))
reg = LassoCV(cv=5)
reg.fit(X, y)
LassoCV(alphas=None, copy_X=True, cv=5, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=None, normalize=False,
    positive=False, precompute='auto', random_state=None,
    selection='cyclic', tol=0.0001, verbose=False)
y_pred = reg.predict(X)

'Coefficient of determination: {:.2f}'.format(r2_score(y, y_pred))


