Standardizing Data: StandardScaler

StandardScaler computes the mean and standard deviation of the training set so that the same transformation can later be applied to the test set.

Official documentation:

class sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)

Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

  where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
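As a quick check of the formula above, the following sketch computes z by hand with NumPy and compares it with the scaler's output. The array `X` is made-up illustration data, and the comparison assumes the population standard deviation (`ddof=0`), which is what `np.std` uses by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training data: 4 samples, 2 features.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

u = X.mean(axis=0)       # per-feature mean
s = X.std(axis=0)        # per-feature standard deviation (ddof=0)
z_manual = (X - u) / s   # z = (x - u) / s

z_sklearn = StandardScaler().fit_transform(X)
print(np.allclose(z_manual, z_sklearn))
```

After the transform, each column has mean 0 and standard deviation 1.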

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using the transform method.
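A minimal sketch of this workflow: the statistics are learned from the training data only and then reused on later data. The arrays `X_train` and `X_test` are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up train/test split for illustration.
X_train = np.array([[0.0, 100.0], [2.0, 300.0], [4.0, 500.0]])
X_test = np.array([[1.0, 200.0], [6.0, 700.0]])

scaler = StandardScaler().fit(X_train)    # statistics come from X_train only
X_test_scaled = scaler.transform(X_test)  # reuses the stored mean_ and scale_

# The test data is shifted and scaled with the training statistics.
expected = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_test_scaled, expected))
```

Note that the scaled test set does not necessarily have zero mean or unit variance, because its own statistics are never used.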

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
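To make the "dominating feature" effect concrete, here is a small sketch using Euclidean distance, the quantity an RBF kernel is computed from. The data and the helper `dist` are made up for illustration: feature 0 lies in [0, 1], feature 1 is in the thousands.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: feature 0 is in [0, 1], feature 1 is in the thousands.
X = np.array([[0.0, 1000.0], [1.0, 1010.0], [0.0, 3000.0], [1.0, 3010.0]])

def dist(a, b):
    """Euclidean distance between two samples (illustrative helper)."""
    return float(np.linalg.norm(a - b))

# Rows 0 and 1 differ by the full range of feature 0,
# rows 0 and 2 differ by roughly the full range of feature 1.
raw_ratio = dist(X[0], X[1]) / dist(X[0], X[2])

Z = StandardScaler().fit_transform(X)
scaled_ratio = dist(Z[0], Z[1]) / dist(Z[0], Z[2])

print(raw_ratio)     # tiny: feature 1 dominates the raw distances
print(scaled_ratio)  # close to 1: both features now contribute comparably
```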

This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.
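A short sketch of the sparse case, assuming SciPy is available; the matrix is made-up data. With `with_mean=False` the values are only divided by the per-feature standard deviation, so zeros stay zeros.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Made-up sparse data (mostly zeros).
X = sparse.csr_matrix(np.array([[0.0, 2.0], [0.0, 0.0], [4.0, 0.0], [0.0, 6.0]]))

scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)

print(sparse.issparse(X_scaled))  # the result stays sparse
print(X_scaled.nnz == X.nnz)      # zeros remain zeros

# With the default with_mean=True, fitting sparse input raises an error.
try:
    StandardScaler().fit(X)
    centering_failed = False
except ValueError:
    centering_failed = True
print(centering_failed)
```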

Read more in the User Guide.

Parameters:
copy : boolean, optional, default True

If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

with_mean : boolean, True by default

If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

with_std : boolean, True by default

If True, scale the data to unit variance (or equivalently, unit standard deviation).

Attributes:
scale_ : ndarray or None, shape (n_features,)

Per feature relative scaling of the data. This is calculated using np.sqrt(var_). Equal to None when with_std=False.

New in version 0.17: scale_

mean_ : ndarray or None, shape (n_features,)

The mean value for each feature in the training set. Equal to None when with_mean=False.

var_ : ndarray or None, shape (n_features,)

The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.

n_samples_seen_ : int or array, shape (n_features,)

The number of samples processed by the estimator for each feature. If there are no missing samples, n_samples_seen_ will be an integer; otherwise it will be an array. It is reset on new calls to fit, but increments across partial_fit calls.
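The fitted attributes can be inspected directly. The sketch below uses made-up data and checks the relationship between `scale_` and `var_` described above, along with the `with_std=False` case.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration.
X = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0], [7.0, 14.0]])

scaler = StandardScaler().fit(X)
print(scaler.mean_)            # per-feature mean
print(scaler.var_)             # per-feature variance
print(scaler.scale_)           # equals np.sqrt(scaler.var_) for this data
print(scaler.n_samples_seen_)  # number of rows seen; nothing is missing here

# With with_std=False no scaling statistics are kept.
no_std = StandardScaler(with_std=False).fit(X)
print(no_std.scale_, no_std.var_)
```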

See also

scale
Equivalent function without the estimator API.
sklearn.decomposition.PCA
Further removes the linear correlation across features with ‘whiten=True’.
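For a one-off transformation, the `scale` function mentioned above gives the same result as fitting and transforming with `StandardScaler`. A minimal sketch on the data from the official example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

X = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]])

# scale() standardizes in one call but keeps no fitted statistics.
same = np.allclose(scale(X), StandardScaler().fit_transform(X))
print(same)
```

Because `scale` returns only the transformed array, it cannot reapply the training statistics to new data; that is what the estimator API is for.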

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.
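A small sketch of this NaN behavior, using made-up data with one missing value. It assumes a scikit-learn version that accepts NaN in `StandardScaler`, as the note above describes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data with one missing value in feature 0.
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])

scaler = StandardScaler().fit(X)
print(scaler.mean_)            # the NaN is skipped: feature 0 mean is 2.0
print(scaler.n_samples_seen_)  # per-feature counts, since one value is missing

Z = scaler.transform(X)
print(np.isnan(Z))             # the NaN stays in the same position
```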

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import StandardScaler
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[3. 3.]]

Methods

fit(X[, y]) Compute the mean and std to be used for later scaling.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
inverse_transform(X[, copy]) Scale back the data to the original representation.
partial_fit(X[, y]) Online computation of mean and std on X for later scaling.
set_params(**params) Set the parameters of this estimator.
transform(X[, y, copy]) Perform standardization by centering and scaling.

__init__(copy=True, with_mean=True, with_std=True)

fit(X, y=None)

Compute the mean and std to be used for later scaling.

Parameters:
X : {array-like, sparse matrix}, shape [n_samples, n_features]

The data used to compute the mean and standard deviation used for later scaling along the features axis.

y

Ignored

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.
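To illustrate how fit_transform relates to fit and transform, the sketch below shows that calling it once gives the same result as calling the two methods in sequence on the same data. The arrays are made-up values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration.
X_train = np.array([[1.0, 5.0], [2.0, 7.0], [3.0, 9.0]])
X_new = np.array([[4.0, 11.0]])

one_step = StandardScaler().fit_transform(X_train)

scaler = StandardScaler().fit(X_train)
two_steps = scaler.transform(X_train)
print(np.allclose(one_step, two_steps))

# New data is only transformed with the stored statistics, never re-fitted.
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```

A common convention is to call fit_transform on the training set and plain transform on everything that comes afterwards.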

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

inverse_transform(X, copy=None)

Scale back the data to the original representation

Parameters:
X : array-like, shape [n_samples, n_features]

The data used to scale along the features axis.

copy : bool, optional (default: None)

Copy the input X or not.

Returns:
X_tr : array-like, shape [n_samples, n_features]

Transformed array.
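A round-trip sketch on made-up data: inverse_transform undoes transform by multiplying by `scale_` and adding `mean_` back.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration.
X = np.array([[1.0, 100.0], [2.0, 200.0], [4.0, 400.0]])

scaler = StandardScaler().fit(X)
Z = scaler.transform(X)
X_back = scaler.inverse_transform(Z)  # same as Z * scale_ + mean_ here

print(np.allclose(X_back, X))
```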

partial_fit(X, y=None)

Online computation of mean and std on X for later scaling. All of X is processed as a single batch. This is intended for cases when fit is not feasible because n_samples is very large or because X is read from a continuous stream.

The algorithm for incremental mean and std is given in Equation 1.5a,b in Chan, Tony F., Gene H. Golub, and Randall J. LeVeque. “Algorithms for computing the sample variance: Analysis and recommendations.” The American Statistician 37.3 (1983): 242-247.

Parameters:
X : {array-like, sparse matrix}, shape [n_samples, n_features]

The data used to compute the mean and standard deviation used for later scaling along the features axis.

y

Ignored
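The sketch below feeds randomly generated data to partial_fit in ten batches and compares the result with a single call to fit on the full array. The batch count and the random data are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Randomly generated data standing in for a large dataset or a stream.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 3))

full = StandardScaler().fit(X)

online = StandardScaler()
for batch in np.array_split(X, 10):  # each call processes one batch
    online.partial_fit(batch)

print(np.allclose(full.mean_, online.mean_))
print(np.allclose(full.var_, online.var_))
print(online.n_samples_seen_)        # accumulates across partial_fit calls
```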

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
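A brief sketch of get_params and set_params, first on a plain scaler and then on a nested object. The pipeline, its step names "scaler" and "clf", and the choice of LogisticRegression are assumptions made for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simple estimator: parameters are set by name.
scaler = StandardScaler()
scaler.set_params(with_mean=False)
params = scaler.get_params()
print(params)

# Nested object: use <component>__<parameter>.
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
returned = pipe.set_params(scaler__with_std=False)
print(pipe.named_steps["scaler"].with_std)
print(returned is pipe)  # set_params returns the estimator itself
```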

transform(X, y='deprecated', copy=None)

Perform standardization by centering and scaling

Parameters:
X : array-like, shape [n_samples, n_features]

The data used to scale along the features axis.

y : (ignored)

Deprecated since version 0.19: This parameter will be removed in 0.21.

copy : bool, optional (default: None)

Copy the input X or not.

Original article: https://www.cnblogs.com/cola-1998/p/10218276.html

Time: 2024-10-28 09:59:27
