scikit-learn: 7. Computational Performance (prediction latency and throughput)

Reference: http://scikit-learn.org/stable/modules/computational_performance.html

For some applications the computational performance of estimators (mainly the latency and the throughput when predicting new samples) is critical. Training performance also matters, but since training can usually be done offline, the focus here is on performance at prediction time.

Prediction latency: the elapsed time necessary to make a prediction for a new sample.

Prediction throughput: the number of predictions the software can deliver in a given amount of time.
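
Before going into the details, a quick way to get a feel for both numbers on your own estimator is simply to time predict() calls. Below is a minimal sketch (not from the original doc) using an illustrative Ridge model and synthetic data; the absolute numbers depend entirely on your machine.

import time
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X_train, y_train = make_regression(n_samples=1000, n_features=100, random_state=0)
model = Ridge().fit(X_train, y_train)

# Latency: time to predict a single new sample.
x_one = X_train[:1]
start = time.perf_counter()
model.predict(x_one)
latency = time.perf_counter() - start

# Throughput: predictions delivered per second in bulk mode.
n_bulk = 10000
X_bulk = X_train[np.random.randint(0, len(X_train), n_bulk)]
start = time.perf_counter()
model.predict(X_bulk)
throughput = n_bulk / (time.perf_counter() - start)

print("latency: %.2e s, throughput: %.0f predictions/s" % (latency, throughput))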

Higher computational performance usually comes at the cost of predictive power (simpler models do run faster, but they cannot capture as many properties of the data as more complex ones). Below we review the order of magnitude of the computational performance for a number of estimators, and list some ways to work around the main performance bottlenecks.

Since most readers prefer practical advice over introductions, let's start with the tips and tricks.

3、Tips and Tricks

1) Linear algebra libraries

Pay attention to the versions of NumPy/SciPy and their linear algebra libraries, and make sure they are built using an optimized BLAS / LAPACK library.

Not every operation benefits from this, though: the inner loops of (randomized) decision trees and of kernel SVMs (SVC, SVR, NuSVC, NuSVR) are not affected, while linear models (via numpy.dot) improve dramatically.

Commands to display which BLAS / LAPACK libraries your NumPy / SciPy / scikit-learn install is using:

# Show which optimized BLAS / LAPACK implementation NumPy was built against.
from numpy.distutils.system_info import get_info
print(get_info('blas_opt'))
print(get_info('lapack_opt'))
Optimized BLAS / LAPACK implementations include:
  • Atlas (needs hardware-specific tuning by rebuilding on the target machine)
  • OpenBLAS
  • MKL
  • Apple Accelerate and vecLib frameworks (OSX only)
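
On recent NumPy versions numpy.distutils is deprecated, so the snippet above may stop working. An alternative check (my own suggestion, not the command from the original doc) that should work on most installations:

import numpy as np
import sklearn

# Prints the BLAS / LAPACK configuration NumPy was built against.
np.show_config()

# scikit-learn >= 0.20 also offers a summary of the whole stack.
sklearn.show_versions()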

2) Model compression

Model compression in scikit-learn only concerns linear models for the moment. Here it means converting the model's coefficient vector to a sparse representation, so that both the model and the data end up sparse.

Here is sample code that illustrates the use of the sparsify() method:

# X_train, y_train and X_test are assumed to be defined already.
from sklearn.linear_model import SGDRegressor

clf = SGDRegressor(penalty='elasticnet', l1_ratio=0.25)
clf.fit(X_train, y_train).sparsify()   # convert coef_ to a sparse representation
clf.predict(X_test)

In this example we prefer the elasticnet penalty as
it is often a good compromise between model compactness and prediction power. One can also further tune the l1_ratio parameter
(in combination with the regularization strength alpha) to control this tradeoff.
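
For completeness, here is a self-contained variant of the snippet above with synthetic data (the dataset and hyper-parameters are illustrative assumptions, not from the doc); it also shows how to verify that the coefficients really became sparse:

from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

X_train, y_train = make_regression(n_samples=500, n_features=300, random_state=0)
X_test = X_train[:10]

clf = SGDRegressor(penalty='elasticnet', l1_ratio=0.25, alpha=0.01, max_iter=1000)
clf.fit(X_train, y_train).sparsify()   # coef_ is now a scipy.sparse matrix

print(type(clf.coef_))                                       # sparse CSR matrix
print("non-zero coefficients:", clf.coef_.nnz, "out of", X_train.shape[1])
print(clf.predict(X_test))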

3) Model reshaping

Model reshaping consists in selecting only a portion of the available features to fit a model. In other words, if a model discards features during the learning phase we can then strip those from the input.

I won't dwell on the benefits here, because at the moment this trick can only be performed manually in scikit-learn; a hand-rolled sketch follows below.
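
Here is what the manual procedure could look like (model, data and the alpha value are illustrative assumptions): fit an L1-penalized model, keep only the features with non-zero coefficients, and strip the rest from the input before refitting and predicting.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X_train, y_train = make_regression(n_samples=500, n_features=200,
                                   n_informative=10, random_state=0)

full_model = Lasso(alpha=1.0).fit(X_train, y_train)
kept = np.flatnonzero(full_model.coef_)      # features the model actually uses
print("keeping %d of %d features" % (kept.size, X_train.shape[1]))

# Refit on the reduced input; at prediction time only the kept columns
# need to be extracted and stored.
small_model = Lasso(alpha=1.0).fit(X_train[:, kept], y_train)
X_new = X_train[:5]
print(small_model.predict(X_new[:, kept]))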

4) The remaining tricks are left for the experts.

1、Prediction Latency

The main factors that influence prediction latency are:
  1. Predictions in bulk or one-at-a-time mode
  2. Number of features
  3. Input data representation and sparsity
  4. Model complexity
  5. Feature extraction

1) Predictions in bulk or one-at-a-time mode

Bulk prediction is typically one to two orders of magnitude faster than predicting one sample at a time (reasons: branching predictability, CPU caches, linear algebra library optimizations, etc.); see the two comparison figures at http://scikit-learn.org/stable/modules/computational_performance.html.

The good news: to benchmark different estimators for your own case you can simply change the n_features parameter in the Prediction Latency example from the scikit-learn gallery. This should give you an estimate of the order of magnitude of the prediction latency.
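
A rough do-it-yourself comparison of the two modes (illustrative model and data; the exact speed-up depends on your machine):

import time
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

X, y = make_regression(n_samples=10000, n_features=100, random_state=0)
model = SGDRegressor(max_iter=1000).fit(X, y)

# Atomic mode: one sample per predict() call.
start = time.perf_counter()
for i in range(1000):
    model.predict(X[i:i + 1])
atomic = time.perf_counter() - start

# Bulk mode: a single predict() call for 1000 samples.
start = time.perf_counter()
model.predict(X[:1000])
bulk = time.perf_counter() - start

print("atomic: %.4fs  bulk: %.4fs  speed-up: %.0fx" % (atomic, bulk, atomic / bulk))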

2) Number of features

For the measured effect, see the figures at http://scikit-learn.org/stable/modules/computational_performance.html.

Overall you can expect the prediction time to increase at least linearly with the number of features (non-linear cases can happen depending on the global memory footprint and the estimator).
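
A quick way to check this on your own machine (synthetic data and a plain Ridge model, for illustration only):

import time
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

for n_features in (100, 1000, 10000):
    X, y = make_regression(n_samples=500, n_features=n_features, random_state=0)
    model = Ridge().fit(X, y)
    start = time.perf_counter()
    for _ in range(100):
        model.predict(X[:1])          # single-sample latency
    print("n_features=%5d  mean latency: %.2e s"
          % (n_features, (time.perf_counter() - start) / 100))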

3) Input data representation and sparsity

This point is mainly about the difference between sparse and dense input representations.

The benefit of a sparse format: if you have 100 non-zeros in a 1e6-dimensional space, you only need 100 multiply-and-add operations instead of 1e6.

When should you use a sparse format? When at most 10% of the elements are non-zero; in other words, if the sparsity ratio is greater than 90% you can probably benefit from sparse formats.

Check SciPy's sparse matrix formats documentation for more information on how to build (or convert your data to) sparse matrix formats. Most of the time the CSR and CSC formats work best.

A quick reminder of what CSR/CSC are:

  • csc_matrix(arg1[, shape, dtype, copy]): Compressed Sparse Column matrix
  • csr_matrix(arg1[, shape, dtype, copy]): Compressed Sparse Row matrix

A helper function to measure the sparsity of your data:

import numpy as np

def sparsity_ratio(X):
    # fraction of zero entries in the dense array X
    return 1.0 - np.count_nonzero(X) / float(X.shape[0] * X.shape[1])
print("input sparsity ratio:", sparsity_ratio(X))

4) Model complexity

Generally speaking, when model complexity increases, both predictive power and latency are supposed to increase: more complex models predict better but respond more slowly.

For sklearn.linear_model (e.g. Lasso, ElasticNet, SGDClassifier/Regressor, Ridge & RidgeClassifier, PassiveAggressiveClassifier/Regressor, LinearSVC, LogisticRegression, ...), the decision function applied at prediction time is the same (a dot product of the coefficients with the input values), so the latency is essentially identical regardless of model complexity.

For other model families, see the comparison figures at http://scikit-learn.org/stable/modules/computational_performance.html for concrete experimental results.
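
For non-linear models the complexity parameters do change the latency. For example, with gradient-boosted trees the prediction cost grows with n_estimators; a small illustrative sketch (model and data are my own assumptions, not the doc's benchmark):

import time
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
for n_estimators in (10, 100, 500):
    model = GradientBoostingRegressor(n_estimators=n_estimators).fit(X, y)
    start = time.perf_counter()
    for _ in range(100):
        model.predict(X[:1])
    print("n_estimators=%3d  mean latency: %.2e s"
          % (n_estimators, (time.perf_counter() - start) / 100))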

5) Feature extraction

(In some applications feature extraction takes a lot of time, and unlike training it happens at prediction time, so it cannot simply be pushed offline.) In many real-world applications the feature extraction process (i.e. turning raw data like database rows or network packets into numpy arrays) governs the overall prediction time. For example, on the Reuters text classification task the whole preparation (reading and parsing SGML files, tokenizing the text and hashing it into a common vector space) takes 100 to 500 times longer than the actual prediction code, depending on the chosen model. See the figures at http://scikit-learn.org/stable/modules/computational_performance.html.
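
To get a feel for this on text data, one can time the vectorization step against the prediction step. A minimal sketch (the toy corpus, HashingVectorizer settings and SGD model are illustrative assumptions, not the Reuters benchmark from the doc):

import time
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

docs = ["the quick brown fox jumps over the lazy dog"] * 10000
labels = [0, 1] * 5000

vectorizer = HashingVectorizer(n_features=2 ** 18)
clf = SGDClassifier(max_iter=1000).fit(vectorizer.transform(docs), labels)

start = time.perf_counter()
X_new = vectorizer.transform(docs)    # feature extraction (tokenizing + hashing)
t_extract = time.perf_counter() - start

start = time.perf_counter()
clf.predict(X_new)                    # the actual prediction
t_predict = time.perf_counter() - start

print("extraction: %.3fs  prediction: %.3fs  ratio: %.0fx"
      % (t_extract, t_predict, t_extract / t_predict))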

2、Prediction Throughput

For the measured throughput of various estimators, see the figures at http://scikit-learn.org/stable/modules/computational_performance.html.
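
A simple way to measure throughput for a few estimators on the same data (illustrative models; the absolute numbers depend entirely on your hardware):

import time
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, SGDRegressor

X, y = make_regression(n_samples=5000, n_features=50, random_state=0)
for model in (Ridge(), SGDRegressor(max_iter=1000),
              RandomForestRegressor(n_estimators=100)):
    model.fit(X, y)
    start = time.perf_counter()
    model.predict(X)
    elapsed = time.perf_counter() - start
    print("%s: %.0f predictions/s" % (model.__class__.__name__, X.shape[0] / elapsed))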
