The scoring parameter: defining model evaluation rules

Model selection and evaluation tools, such as grid_search.GridSearchCV and cross_validation.cross_val_score,
take a scoring parameter that controls what metric they apply to the estimators evaluated.


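All scorers follow the convention that higher values are better. For that reason, the 'mean_absolute_error' and 'mean_squared_error' scorers (which measure the distance between the model and the data) return negated values.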

Scoring                   Function                          Comment
'accuracy'                metrics.accuracy_score
'average_precision'       metrics.average_precision_score
'f1'                      metrics.f1_score                  for binary targets
'f1_micro'                metrics.f1_score                  micro-averaged
'f1_macro'                metrics.f1_score                  macro-averaged
'f1_weighted'             metrics.f1_score                  weighted average
'f1_samples'              metrics.f1_score                  by multilabel sample
'log_loss'                metrics.log_loss                  requires predict_proba support
'precision' etc.          metrics.precision_score           suffixes apply as with 'f1'
'recall' etc.             metrics.recall_score              suffixes apply as with 'f1'
'roc_auc'                 metrics.roc_auc_score
'adjusted_rand_score'     metrics.adjusted_rand_score
'mean_absolute_error'     metrics.mean_absolute_error
'mean_squared_error'      metrics.mean_squared_error
'median_absolute_error'   metrics.median_absolute_error
'r2'                      metrics.r2_score


>>> from sklearn import svm, cross_validation, datasets
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> model = svm.SVC()
>>> cross_validation.cross_val_score(model, X, y, scoring='wrong_choice')
Traceback (most recent call last):
ValueError: 'wrong_choice' is not a valid scoring value. Valid options are ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc']
>>> clf = svm.SVC(probability=True, random_state=0)
>>> cross_validation.cross_val_score(clf, X, y, scoring='log_loss')
array([-0.07..., -0.16..., -0.06...])


A callable can also be passed as the scoring parameter. To be used as a scorer, it needs to follow these two rules:

  • It can be called with parameters (estimator, X, y),
    where estimator is the model that should be evaluated, X is
    validation data, and y is the ground truth target for X (in
    the supervised case) or None (in the unsupervised case).
  • It returns a floating point number that quantifies the estimator prediction
    quality on X, with reference to y.
    Again, by convention higher numbers are better, so if your scorer returns loss, that value should be negated.
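
The two rules above can be sketched in plain Python. This is only an illustration of the calling convention: MeanPredictor and neg_mean_abs_error_scorer are hypothetical names introduced here, not part of scikit-learn.

```python
class MeanPredictor:
    """Hypothetical stand-in estimator: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def neg_mean_abs_error_scorer(estimator, X, y):
    """Scorer with the (estimator, X, y) signature that returns a float.

    Mean absolute error is a loss, so it is negated to keep the
    "higher is better" convention.
    """
    y_pred = estimator.predict(X)
    mae = sum(abs(t - p) for t, p in zip(y, y_pred)) / len(y)
    return -mae


X_val = [[0.0], [1.0], [2.0]]
y_val = [1.0, 2.0, 6.0]
model = MeanPredictor().fit(X_val, y_val)
score = neg_mean_abs_error_scorer(model, X_val, y_val)
print(score)  # -2.0
```

A callable of this shape is what the scoring parameter expects when it is not given one of the predefined strings.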


Classification metrics

The sklearn.metrics module
implements several loss, score, and utility functions to measure classification performance.

Some of these are restricted to the binary classification case:

matthews_corrcoef(y_true, y_pred)                 Compute the Matthews correlation coefficient (MCC) for binary classes
precision_recall_curve(y_true, probas_pred)       Compute precision-recall pairs for different probability thresholds
roc_curve(y_true, y_score[, pos_label, ...])      Compute Receiver operating characteristic (ROC)

Others also work in the multiclass case:

confusion_matrix(y_true, y_pred[, labels])        Compute confusion matrix to evaluate the accuracy of a classification
hinge_loss(y_true, pred_decision[, labels, ...])  Average hinge loss (non-regularized)

Some also work in the multilabel case:

accuracy_score(y_true, y_pred[, normalize, ...])  Accuracy classification score.
classification_report(y_true, y_pred[, ...])      Build a text report showing the main classification metrics
f1_score(y_true, y_pred[, labels, ...])           Compute the F1 score, also known as balanced F-score or F-measure
fbeta_score(y_true, y_pred, beta[, labels, ...])  Compute the F-beta score
hamming_loss(y_true, y_pred[, classes])           Compute the average Hamming loss.
jaccard_similarity_score(y_true, y_pred[, ...])   Jaccard similarity coefficient score
log_loss(y_true, y_pred[, eps, normalize, ...])   Log loss, aka logistic loss or cross-entropy loss.
precision_recall_fscore_support(y_true, y_pred)   Compute precision, recall, F-measure and support for each class
precision_score(y_true, y_pred[, labels, ...])    Compute the precision
recall_score(y_true, y_pred[, labels, ...])       Compute the recall
zero_one_loss(y_true, y_pred[, normalize, ...])   Zero-one classification loss.

And some work with binary and multilabel (but not multiclass) problems:

average_precision_score(y_true, y_score[, ...])   Compute average precision (AP) from prediction scores
roc_auc_score(y_true, y_score[, average, ...])    Compute Area Under the Curve (AUC) from prediction scores

In the following sub-sections, we will describe each of those functions, preceded by some notes on common API and metric definition.

2) Accuracy score:

The accuracy_score function computes the accuracy, either the fraction (default) or the count (normalize=False) of correct predictions.

>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2

In multilabel classification, a sample counts as correctly predicted only if every one of its labels is predicted correctly.


>>> accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
0.5
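
To make the counting rule concrete, here is a minimal plain-Python sketch of the same computation. The function name accuracy is an illustrative assumption, and this is not the library's implementation. In the multilabel case each row is compared as a whole, so one wrong label makes the entire sample wrong.

```python
def accuracy(y_true, y_pred, normalize=True):
    """Fraction (or count, if normalize=False) of exactly matching samples."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true) if normalize else correct


# Single-label case: two of the four predictions match.
print(accuracy([0, 1, 2, 3], [0, 2, 1, 3]))                   # 0.5
print(accuracy([0, 1, 2, 3], [0, 2, 1, 3], normalize=False))  # 2

# Multilabel case: each sample is a list of label indicators and is
# compared as a whole, so only the second sample counts as correct.
print(accuracy([[0, 1], [1, 1]], [[1, 1], [1, 1]]))           # 0.5
```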



3) Confusion matrix:

The confusion_matrix function evaluates classification accuracy by computing the confusion matrix. For example:

>>> from sklearn.metrics import confusion_matrix
>>> y_true = [2, 0, 2, 2, 0, 1]
>>> y_pred = [0, 0, 2, 2, 0, 2]
>>> confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]])

(Note: rows correspond to true labels and columns to predicted labels.)
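
The row and column convention can be checked with a small plain-Python sketch that builds the same table by counting (true, predicted) pairs. The helper name confusion_counts is an assumption made for this illustration; it is not how the library computes the matrix internally.

```python
def confusion_counts(y_true, y_pred):
    """Return a nested list where entry [i][j] counts samples whose
    true label is labels[i] and whose predicted label is labels[j]."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1  # row = true label, column = prediction
    return matrix


y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
for row in confusion_counts(y_true, y_pred):
    print(row)
# [2, 0, 0]
# [0, 0, 1]
# [1, 0, 2]
```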



4) Classification report:

The classification_report function builds a text report showing the main classification metrics. For example:

>>> from sklearn.metrics import classification_report
>>> y_true = [0, 1, 2, 2, 0]
>>> y_pred = [0, 0, 2, 2, 0]
>>> target_names = ['class 0', 'class 1', 'class 2']
>>> print(classification_report(y_true, y_pred, target_names=target_names))
             precision    recall  f1-score   support

    class 0       0.67      1.00      0.80         2
    class 1       0.00      0.00      0.00         1
    class 2       1.00      1.00      1.00         2

avg / total       0.67      0.80      0.72         5




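5) Hamming loss:

If $\hat{y}_j$ is the predicted value for the $j$-th label of a given sample, $y_j$ is the corresponding true value, and $n_\text{labels}$ is the number of classes or labels, then the Hamming loss $L_{Hamming}$ between two samples is defined as:

$$L_{Hamming}(y, \hat{y}) = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels} - 1} 1(\hat{y}_j \neq y_j)$$

where $1(x)$ is the indicator function.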
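
In other words, the Hamming loss between two label vectors is the fraction of label positions on which they disagree. A minimal plain-Python sketch (the name hamming_loss_pair is hypothetical, chosen to avoid confusion with the library function):

```python
def hamming_loss_pair(y_true, y_pred):
    """Fraction of label positions where the prediction differs from the truth."""
    n_labels = len(y_true)
    mismatches = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return mismatches / n_labels


# One of the four labels is wrong.
print(hamming_loss_pair([1, 2, 3, 4], [2, 2, 3, 4]))  # 0.25
```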

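6) Jaccard similarity coefficient score:

The Jaccard similarity coefficient of the $i$-th sample,
with a ground truth label set $y_i$ and predicted label set $\hat{y}_i$,
is defined as

$$J(y_i, \hat{y}_i) = \frac{|y_i \cap \hat{y}_i|}{|y_i \cup \hat{y}_i|}$$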
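
The Jaccard coefficient of one sample is the size of the intersection of the two label sets divided by the size of their union. A minimal plain-Python sketch follows; the name jaccard_similarity and the handling of two empty label sets are assumptions made for this illustration.

```python
def jaccard_similarity(true_labels, pred_labels):
    """Size of the intersection divided by size of the union of two label sets."""
    true_set, pred_set = set(true_labels), set(pred_labels)
    union = true_set | pred_set
    if not union:
        # Assumption: two empty label sets are treated as identical.
        return 1.0
    return len(true_set & pred_set) / len(union)


print(jaccard_similarity([1, 2], [1, 2]))  # 1.0
print(jaccard_similarity([0, 1], [1, 2]))  # one shared label out of three
```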


7) Precision, recall and F-measures:

Several functions allow you to analyze the precision, recall and F-measure scores:

average_precision_score(y_true, y_score[, ...])   Compute average precision (AP) from prediction scores
f1_score(y_true, y_pred[, labels, ...])           Compute the F1 score, also known as balanced F-score or F-measure
fbeta_score(y_true, y_pred, beta[, labels, ...])  Compute the F-beta score
precision_recall_curve(y_true, probas_pred)       Compute precision-recall pairs for different probability thresholds
precision_recall_fscore_support(y_true, y_pred)   Compute precision, recall, F-measure and support for each class
precision_score(y_true, y_pred[, labels, ...])    Compute the precision
recall_score(y_true, y_pred[, labels, ...])       Compute the recall

Note that the precision_recall_curve function
is restricted to the binary case. The average_precision_score function
works only in binary classification and multilabel indicator format.
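
For the binary case, all three quantities follow from the counts of true positives, false positives and false negatives. The sketch below is plain Python under stated assumptions: the name precision_recall_fbeta is hypothetical, and an undefined ratio (zero denominator) is reported as 0.0.

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0, pos_label=1):
    """Binary precision, recall and F-beta computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)

    # Assumption: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    denom = beta ** 2 * precision + recall
    fbeta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta


y_true = [0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1]
precision, recall, f1 = precision_recall_fbeta(y_true, y_pred)
print(precision, recall, f1)  # each is 2/3 here: tp=2, fp=1, fn=1
```

With beta=1 this is the F1 score; larger beta values weight recall more heavily than precision.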

8) Hinge loss:

9) Log loss:

10) Matthews correlation coefficient:

11) Receiver operating characteristic (ROC):

12) Zero one loss:


Multilabel ranking metrics

In multilabel learning, each sample can have any number of ground truth labels associated with it. The goal is to give
high scores and better rank to the ground truth labels.

1) Coverage error:

2) Label ranking average precision:


Regression metrics

The sklearn.metrics module
implements several loss, score, and utility functions to measure regression performance.

Some of those have been enhanced to handle the multioutput case: mean_absolute_error, mean_squared_error, median_absolute_error and r2_score.

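1) Explained variance score:

If $\hat{y}$ is the estimated target output, $y$ the corresponding (correct) target output, and $Var$ is the variance, the square of the standard deviation, then the explained variance is estimated as follows:

$$\text{explained\_variance}(y, \hat{y}) = 1 - \frac{Var\{y - \hat{y}\}}{Var\{y\}}$$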
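
As a rough plain-Python sketch of this definition: one minus the variance of the residuals divided by the variance of the true targets. The helper names are illustrative assumptions, population variance is used, and the case of constant targets (zero variance) is not handled.

```python
def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def explained_variance(y_true, y_pred):
    """1 - Var(y - y_hat) / Var(y); assumes y_true is not constant."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - variance(residuals) / variance(y_true)


y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
ev = explained_variance(y_true, y_pred)
print(round(ev, 3))  # 0.957
```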

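2) Mean absolute error:

If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the mean absolute error (MAE) estimated over $n_\text{samples}$ is defined as

$$\text{MAE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} \left| y_i - \hat{y}_i \right|$$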
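
A minimal plain-Python sketch of the mean absolute error, averaging the absolute differences between true and predicted values (the function name mae is an illustrative assumption):

```python
def mae(y_true, y_pred):
    """Mean of the absolute differences between truth and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(mae(y_true, y_pred))  # 0.5
```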

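3) Mean squared error:

If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the mean squared error (MSE) estimated over $n_\text{samples}$ is defined as

$$\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (y_i - \hat{y}_i)^2$$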
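
A minimal plain-Python sketch of the mean squared error. It also shows the negation that a scorer applies so that higher values remain better. The names mse and scorer_value are illustrative assumptions.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
error = mse(y_true, y_pred)
print(error)         # 0.375

# A scorer built on this loss would report the negated value,
# so that a smaller error yields a larger score.
scorer_value = -error
print(scorer_value)  # -0.375
```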

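4) R^2 score, the coefficient of determination:

If $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the score R^2 estimated over $n_\text{samples}$ is defined as

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=0}^{n_\text{samples} - 1} (y_i - \hat{y}_i)^2}{\sum_{i=0}^{n_\text{samples} - 1} (y_i - \bar{y})^2}$$

where $\bar{y} = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} y_i$.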
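
A plain-Python sketch of the coefficient of determination: one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The function name r2 is an illustrative assumption, and constant targets (zero total sum of squares) are not handled.

```python
def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; assumes y_true is not constant."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
score = r2(y_true, y_pred)
print(round(score, 3))  # 0.949
```

A perfect prediction gives 1.0, and a model that always predicts the mean of the true values gives 0.0.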

Clustering metrics

The sklearn.metrics module
implements several loss, score, and utility functions. For more information see the Clustering
performance evaluation section for instance clustering, and Biclustering for biclustering.

6. Dummy estimators

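For supervised learning, predictions produced by simple rules or at random make an easy baseline to compare against. DummyClassifier implements several such strategies for classification: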


  • stratified generates random predictions by respecting the training set class distribution.
  • most_frequent always predicts the most frequent label in the training set.
  • uniform generates predictions uniformly at random.
  • constant always predicts a constant label that is provided by the user. (A
    major motivation of this method is F1-scoring, when the positive class is in the minority.)

Note that with all these strategies, the predict method completely ignores the input data!
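
To show what ignoring the input data means in practice, here is a rough plain-Python sketch of the most_frequent strategy. MostFrequentBaseline is a hypothetical class written for this illustration, not the DummyClassifier implementation.

```python
from collections import Counter


class MostFrequentBaseline:
    """Hypothetical baseline: always predicts the most common training label."""

    def fit(self, X, y):
        # Only the labels are used; the features X are never inspected.
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


X_train = [[0.1], [0.4], [0.9], [0.3]]
y_train = [-1, -1, 1, -1]
baseline = MostFrequentBaseline().fit(X_train, y_train)
predictions = baseline.predict([[5.0], [6.0], [7.0]])
print(predictions)  # [-1, -1, -1]
```

On an imbalanced dataset such a baseline can reach a high accuracy, which is why it is a useful reference point for a real classifier.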


To illustrate DummyClassifier, first let’s create an imbalanced dataset:


>>> from sklearn.datasets import load_iris
>>> from sklearn.cross_validation import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> y[y != 1] = -1
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Next, let’s compare the accuracy of SVC and most_frequent:


>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.svm import SVC
>>> clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
>>> clf = DummyClassifier(strategy='most_frequent', random_state=0)
>>> clf.fit(X_train, y_train)
DummyClassifier(constant=None, random_state=0, strategy='most_frequent')
>>> clf.score(X_test, y_test)

We see that SVC doesn’t do much better than a dummy classifier. Now, let’s change the kernel:


>>> clf = SVC(kernel='rbf', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)


DummyRegressor also
implements four simple rules of thumb for regression:

  • mean always predicts the mean of the training targets.
  • median always predicts the median of the training targets.
  • quantile always predicts a user provided quantile of the training targets.
  • constant always predicts a constant value that is provided by the user.

In all these strategies, the predict method completely ignores the input data.
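
The mean and median strategies can be sketched in plain Python as follows. ConstantBaselineRegressor is a hypothetical class written for this illustration, not the DummyRegressor implementation, and it covers only two of the four strategies.

```python
import statistics


class ConstantBaselineRegressor:
    """Hypothetical baseline: predicts the training mean or median for every input."""

    def __init__(self, strategy="mean"):
        if strategy not in ("mean", "median"):
            raise ValueError("strategy must be 'mean' or 'median'")
        self.strategy = strategy

    def fit(self, X, y):
        # Only the targets are used; the features X are ignored.
        if self.strategy == "mean":
            self.value_ = statistics.mean(y)
        else:
            self.value_ = statistics.median(y)
        return self

    def predict(self, X):
        return [self.value_ for _ in X]


X_train = [[0.0], [1.0], [2.0]]
y_train = [1.0, 2.0, 9.0]
mean_pred = ConstantBaselineRegressor("mean").fit(X_train, y_train).predict([[3.0]])
median_pred = ConstantBaselineRegressor("median").fit(X_train, y_train).predict([[3.0]])
print(mean_pred, median_pred)  # [4.0] [2.0]
```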

Time: 2024-10-19 17:48:21
