scikit-learn: 7. Computational Performance (prediction latency and throughput)

Reference: http://scikit-learn.org/stable/modules/computational_performance.html

For some applications the computational performance of estimators (mainly the latency and the throughput when predicting new samples) is critical. Training performance also matters, but since training can usually be done offline, the focus here is on performance at prediction time.

Prediction latency: the elapsed time necessary to make a prediction for a new sample.

Prediction throughput: the number of predictions the software can deliver in a given amount of time.
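
Before going into the details, a quick way to get a feel for both numbers on your own estimator is simply to time predict() calls. Below is a minimal sketch (not from the original doc) using an illustrative Ridge model and synthetic data; the absolute numbers depend entirely on your machine.

import time
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X_train, y_train = make_regression(n_samples=1000, n_features=100, random_state=0)
model = Ridge().fit(X_train, y_train)

# Latency: time to predict a single new sample.
x_one = X_train[:1]
start = time.perf_counter()
model.predict(x_one)
latency = time.perf_counter() - start

# Throughput: predictions delivered per second in bulk mode.
n_bulk = 10000
X_bulk = X_train[np.random.randint(0, len(X_train), n_bulk)]
start = time.perf_counter()
model.predict(X_bulk)
throughput = n_bulk / (time.perf_counter() - start)

print("latency: %.2e s, throughput: %.0f predictions/s" % (latency, throughput))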

Higher computational performance usually comes at the cost of predictive power (simpler models do run faster, but they cannot capture as many properties of the data as more complex ones). Below we review the order of magnitude of the computational performance for a number of estimators, and list some ways to work around the main performance bottlenecks.

Since most readers prefer practical advice over introductions, let's start with the tips and tricks.

3、Tips and Tricks

1) Linear algebra libraries

Pay attention to the versions of NumPy/SciPy and their linear algebra libraries, and make sure they are built using an optimized BLAS / LAPACK library.

Not every operation benefits from this, though: the inner loops of (randomized) decision trees and of kernel SVMs (SVC, SVR, NuSVC, NuSVR) are not affected, while linear models (via numpy.dot) improve dramatically.

Commands to display which BLAS / LAPACK libraries your NumPy / SciPy / scikit-learn install is using:

# Show which optimized BLAS / LAPACK implementation NumPy was built against.
from numpy.distutils.system_info import get_info
print(get_info('blas_opt'))
print(get_info('lapack_opt'))
Optimized BLAS / LAPACK implementations include:
  • Atlas (needs hardware-specific tuning by rebuilding on the target machine)
  • OpenBLAS
  • MKL
  • Apple Accelerate and vecLib frameworks (OSX only)
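
On recent NumPy versions numpy.distutils is deprecated, so the snippet above may stop working. An alternative check (my own suggestion, not the command from the original doc) that should work on most installations:

import numpy as np
import sklearn

# Prints the BLAS / LAPACK configuration NumPy was built against.
np.show_config()

# scikit-learn >= 0.20 also offers a summary of the whole stack.
sklearn.show_versions()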

2) Model compression

Model compression in scikit-learn only concerns linear models for the moment. Here it means converting the model's coefficient vector to a sparse representation, so that both the model and the data end up sparse.

Here is sample code that illustrates the use of the sparsify() method:

# X_train, y_train and X_test are assumed to be defined already.
from sklearn.linear_model import SGDRegressor

clf = SGDRegressor(penalty='elasticnet', l1_ratio=0.25)
clf.fit(X_train, y_train).sparsify()   # convert coef_ to a sparse representation
clf.predict(X_test)

In this example we prefer the elasticnet penalty as
it is often a good compromise between model compactness and prediction power. One can also further tune the l1_ratio parameter
(in combination with the regularization strength alpha) to control this tradeoff.
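
For completeness, here is a self-contained variant of the snippet above with synthetic data (the dataset and hyper-parameters are illustrative assumptions, not from the doc); it also shows how to verify that the coefficients really became sparse:

from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

X_train, y_train = make_regression(n_samples=500, n_features=300, random_state=0)
X_test = X_train[:10]

clf = SGDRegressor(penalty='elasticnet', l1_ratio=0.25, alpha=0.01, max_iter=1000)
clf.fit(X_train, y_train).sparsify()   # coef_ is now a scipy.sparse matrix

print(type(clf.coef_))                                       # sparse CSR matrix
print("non-zero coefficients:", clf.coef_.nnz, "out of", X_train.shape[1])
print(clf.predict(X_test))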

3) Model reshaping

Model reshaping consists in selecting only a portion of the available features to fit a model. In other words, if a model discards features during the learning phase we can then strip those from the input.

I won't dwell on the benefits here, because at the moment this trick can only be performed manually in scikit-learn; a hand-rolled sketch follows below.
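
Here is what the manual procedure could look like (model, data and the alpha value are illustrative assumptions): fit an L1-penalized model, keep only the features with non-zero coefficients, and strip the rest from the input before refitting and predicting.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X_train, y_train = make_regression(n_samples=500, n_features=200,
                                   n_informative=10, random_state=0)

full_model = Lasso(alpha=1.0).fit(X_train, y_train)
kept = np.flatnonzero(full_model.coef_)      # features the model actually uses
print("keeping %d of %d features" % (kept.size, X_train.shape[1]))

# Refit on the reduced input; at prediction time only the kept columns
# need to be extracted and stored.
small_model = Lasso(alpha=1.0).fit(X_train[:, kept], y_train)
X_new = X_train[:5]
print(small_model.predict(X_new[:, kept]))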

4) The remaining tricks are left for the experts.

1、Prediction Latency

The main factors that influence prediction latency are:
  1. Predictions in bulk or one-at-a-time mode
  2. Number of features
  3. Input data representation and sparsity
  4. Model complexity
  5. Feature extraction

1) Predictions in bulk or one-at-a-time mode

Bulk prediction is typically one to two orders of magnitude faster than predicting one sample at a time (reasons: branching predictability, CPU caches, linear algebra library optimizations, etc.); see the two comparison figures at http://scikit-learn.org/stable/modules/computational_performance.html.

The good news: to benchmark different estimators for your own case you can simply change the n_features parameter in the Prediction Latency example from the scikit-learn gallery. This should give you an estimate of the order of magnitude of the prediction latency.
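
A rough do-it-yourself comparison of the two modes (illustrative model and data; the exact speed-up depends on your machine):

import time
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

X, y = make_regression(n_samples=10000, n_features=100, random_state=0)
model = SGDRegressor(max_iter=1000).fit(X, y)

# Atomic mode: one sample per predict() call.
start = time.perf_counter()
for i in range(1000):
    model.predict(X[i:i + 1])
atomic = time.perf_counter() - start

# Bulk mode: a single predict() call for 1000 samples.
start = time.perf_counter()
model.predict(X[:1000])
bulk = time.perf_counter() - start

print("atomic: %.4fs  bulk: %.4fs  speed-up: %.0fx" % (atomic, bulk, atomic / bulk))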

2) Number of features

For the measured effect, see the figures at http://scikit-learn.org/stable/modules/computational_performance.html.

Overall you can expect the prediction time to increase at least linearly with the number of features (non-linear cases can happen depending on the global memory footprint and the estimator).
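
A quick way to check this on your own machine (synthetic data and a plain Ridge model, for illustration only):

import time
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

for n_features in (100, 1000, 10000):
    X, y = make_regression(n_samples=500, n_features=n_features, random_state=0)
    model = Ridge().fit(X, y)
    start = time.perf_counter()
    for _ in range(100):
        model.predict(X[:1])          # single-sample latency
    print("n_features=%5d  mean latency: %.2e s"
          % (n_features, (time.perf_counter() - start) / 100))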

3) Input data representation and sparsity

This point is mainly about the difference between sparse and dense input representations.

The benefit of a sparse format: if you have 100 non-zeros in a 1e6-dimensional space, you only need 100 multiply-and-add operations instead of 1e6.

When should you use a sparse format? When at most 10% of the elements are non-zero; in other words, if the sparsity ratio is greater than 90% you can probably benefit from sparse formats.

Check SciPy's sparse matrix formats documentation for more information on how to build (or convert your data to) sparse matrix formats. Most of the time the CSR and CSC formats work best.

A quick reminder of what CSR/CSC are:

  • csc_matrix(arg1[, shape, dtype, copy]): Compressed Sparse Column matrix
  • csr_matrix(arg1[, shape, dtype, copy]): Compressed Sparse Row matrix

A helper function to measure the sparsity of your data:

import numpy as np

def sparsity_ratio(X):
    # fraction of zero entries in the dense array X
    return 1.0 - np.count_nonzero(X) / float(X.shape[0] * X.shape[1])
print("input sparsity ratio:", sparsity_ratio(X))

4) Model complexity

Generally speaking, when model complexity increases, both predictive power and latency are supposed to increase: more complex models predict better but respond more slowly.

For sklearn.linear_model (e.g. Lasso, ElasticNet, SGDClassifier/Regressor, Ridge & RidgeClassifier, PassiveAggressiveClassifier/Regressor, LinearSVC, LogisticRegression, ...), the decision function applied at prediction time is the same (a dot product of the coefficients with the input values), so the latency is essentially identical regardless of model complexity.

For other model families, see the comparison figures at http://scikit-learn.org/stable/modules/computational_performance.html for concrete experimental results.
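
For non-linear models the complexity parameters do change the latency. For example, with gradient-boosted trees the prediction cost grows with n_estimators; a small illustrative sketch (model and data are my own assumptions, not the doc's benchmark):

import time
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
for n_estimators in (10, 100, 500):
    model = GradientBoostingRegressor(n_estimators=n_estimators).fit(X, y)
    start = time.perf_counter()
    for _ in range(100):
        model.predict(X[:1])
    print("n_estimators=%3d  mean latency: %.2e s"
          % (n_estimators, (time.perf_counter() - start) / 100))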

5) Feature extraction

(In some applications feature extraction takes a lot of time, and unlike training it happens at prediction time, so it cannot simply be pushed offline.) In many real-world applications the feature extraction process (i.e. turning raw data like database rows or network packets into numpy arrays) governs the overall prediction time. For example, on the Reuters text classification task the whole preparation (reading and parsing SGML files, tokenizing the text and hashing it into a common vector space) takes 100 to 500 times longer than the actual prediction code, depending on the chosen model. See the figures at http://scikit-learn.org/stable/modules/computational_performance.html.
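
To get a feel for this on text data, one can time the vectorization step against the prediction step. A minimal sketch (the toy corpus, HashingVectorizer settings and SGD model are illustrative assumptions, not the Reuters benchmark from the doc):

import time
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

docs = ["the quick brown fox jumps over the lazy dog"] * 10000
labels = [0, 1] * 5000

vectorizer = HashingVectorizer(n_features=2 ** 18)
clf = SGDClassifier(max_iter=1000).fit(vectorizer.transform(docs), labels)

start = time.perf_counter()
X_new = vectorizer.transform(docs)    # feature extraction (tokenizing + hashing)
t_extract = time.perf_counter() - start

start = time.perf_counter()
clf.predict(X_new)                    # the actual prediction
t_predict = time.perf_counter() - start

print("extraction: %.3fs  prediction: %.3fs  ratio: %.0fx"
      % (t_extract, t_predict, t_extract / t_predict))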

2、Prediction Throughput

For the measured throughput of various estimators, see the figures at http://scikit-learn.org/stable/modules/computational_performance.html.
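
A simple way to measure throughput for a few estimators on the same data (illustrative models; the absolute numbers depend entirely on your hardware):

import time
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, SGDRegressor

X, y = make_regression(n_samples=5000, n_features=50, random_state=0)
for model in (Ridge(), SGDRegressor(max_iter=1000),
              RandomForestRegressor(n_estimators=100)):
    model.fit(X, y)
    start = time.perf_counter()
    model.predict(X)
    elapsed = time.perf_counter() - start
    print("%s: %.0f predictions/s" % (model.__class__.__name__, X.shape[0] / elapsed))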
