Main modules and basic usage of scikit-learn

1. Data Loading

Assume the input is a feature matrix or a CSV file. The first step is to load the data into memory.

scikit-learn is built on NumPy arrays, so NumPy is used here to load the CSV file.
The data below is downloaded from the UCI Machine Learning Repository.

# data loading
import numpy as np
from urllib.request import urlopen  # Python 3; the Python 2 call was urllib.urlopen
# URL of the dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file
raw_data = urlopen(url)
# load the CSV file as a NumPy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the features from the target attribute
# (columns 0-6 are used as features; column 8 holds the class label)
X = dataset[:, 0:7]
y = dataset[:, 8]
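
The same loading pattern can be tried without a network connection. The sketch below parses a small in-memory CSV with np.loadtxt and slices the features from the label. The three rows and the column layout are made up for illustration and are not taken from the UCI file.

```python
import io
import numpy as np

# Hypothetical CSV content: two feature columns followed by a label column.
csv_text = "1.0,2.0,0\n3.0,4.0,1\n5.0,6.0,0\n"

# np.loadtxt accepts any file-like object, not only a downloaded stream.
dataset = np.loadtxt(io.StringIO(csv_text), delimiter=",")

# All columns except the last are features; the last column is the target.
X = dataset[:, :-1]
y = dataset[:, -1]

print(X.shape, y.shape)
```

Slicing with :-1 and -1 avoids hard-coding the number of columns, which may be convenient when the file layout changes.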

2. Data Normalization

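The gradient-based methods used by many machine learning algorithms are sensitive to the scale of the data, so the features should be normalized or standardized before training. scikit-learn provides functions for both: normalization rescales each sample to unit norm, and standardization rescales each feature to zero mean and unit variance.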

# data normalization
from sklearn import preprocessing
# normalize: rescale each sample (row) to unit norm
normalized_X = preprocessing.normalize(X)
# standardize: rescale each feature (column) to zero mean and unit variance
standardized_X = preprocessing.scale(X)
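
To see what each function does to the numbers, here is a small sketch on a made-up matrix. MinMaxScaler does not appear in the code above; it is included here as an assumption about what one might use when a strict [0, 1] range per feature is wanted.

```python
import numpy as np
from sklearn import preprocessing

# Small made-up matrix: rows are samples, columns are features.
X_demo = np.array([[3.0, 4.0], [6.0, 8.0], [0.0, 5.0]])

# Each row is scaled to unit L2 norm.
normalized = preprocessing.normalize(X_demo)
# Each column is shifted and scaled to mean 0 and variance 1.
standardized = preprocessing.scale(X_demo)
# Each column is mapped linearly onto the range [0, 1].
minmax = preprocessing.MinMaxScaler().fit_transform(X_demo)

print(np.linalg.norm(normalized, axis=1))
print(standardized.mean(axis=0), standardized.std(axis=0))
print(minmax.min(axis=0), minmax.max(axis=0))
```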

3. Feature Selection

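When solving a real problem, the ability to select suitable features, or to construct them, is especially important. This is called feature selection or feature engineering.
Feature selection is a highly creative process that relies largely on intuition and domain knowledge, although many ready-made algorithms also exist for selecting features.
The tree-based algorithm below computes the relative importance of each feature: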

#feature selection
from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X, y)
#display the relative importance of each attribute
print(model.feature_importances_)

Output:

[ 0.12315529  0.25870914  0.11863867  0.08749797  0.08296516  0.1840623
  0.14497146]
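
The raw importance array is easier to read once it is sorted. The following sketch ranks features from most to least important. It uses synthetic data from make_classification as a stand-in for the diabetes data, and n_estimators=100 and random_state=0 are arbitrary choices made here for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data: 7 features, of which 3 are informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=3, n_redundant=0, random_state=0
)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_demo, y_demo)

# Sort feature indices from most to least important.
ranking = np.argsort(model.feature_importances_)[::-1]
for idx in ranking:
    print(f"feature {idx}: {model.feature_importances_[idx]:.3f}")
```

The importances sum to 1, so each value can be read as a share of the total.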

4. Using the Algorithms

Logistic regression

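Many problems can be reduced to binary classification. An advantage of this algorithm is that it outputs the probability of each class for a given sample.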

#logistic regression
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
print(model)
#make predictions
expected = y
predicted = model.predict(X)
#summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Output:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
             precision    recall  f1-score   support

        0.0       0.79      0.89      0.84       500
        1.0       0.74      0.55      0.63       268

avg / total       0.77      0.77      0.77       768

[[447  53]
 [120 148]]
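
The class probabilities mentioned for logistic regression are available through predict_proba. A minimal sketch, assuming synthetic data from make_classification in place of the diabetes data, with max_iter=1000 chosen here only to give the solver room to converge:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data used as a stand-in.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_demo, y_demo)

# One row per sample, one column per class (ordered as in model.classes_).
proba = model.predict_proba(X_demo[:3])
print(model.classes_)
print(proba)
```

Each row of proba sums to 1, and predict returns the class whose column holds the largest value.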

Naive Bayes

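This method works by estimating the probability density of the training data for each class, and it often performs well on multi-class classification.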

#GaussianNB
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)
print(model)
#make predictions
expected = y
predicted = model.predict(X)
#summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Output:

GaussianNB()
             precision    recall  f1-score   support

        0.0       0.80      0.86      0.83       500
        1.0       0.69      0.60      0.64       268

avg / total       0.76      0.77      0.76       768

[[429  71]
 [108 160]]
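
Since the text mentions multi-class problems, here is a sketch of GaussianNB on a three-class dataset. The iris data bundled with scikit-learn is an assumption made for illustration; it is not used elsewhere in this article.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris: 150 samples, 4 features, 3 classes.
X_iris, y_iris = load_iris(return_X_y=True)

model = GaussianNB()
model.fit(X_iris, y_iris)

print(model.class_prior_)  # estimated prior probability of each class
print(model.theta_)        # per-class mean of each feature
```

The fitted means in theta_, one row per class, are part of the per-class Gaussian densities that the classifier estimates from the training data.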

k-nearest neighbors

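The k-nearest neighbors (kNN) algorithm is often used as one component of a larger classification approach. For example, it can be used to evaluate features, which makes it useful in feature selection.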

#KNN
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
expected = y
predicted = model.predict(X)
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Output:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
             precision    recall  f1-score   support

        0.0       0.82      0.90      0.86       500
        1.0       0.77      0.63      0.69       268

avg / total       0.80      0.80      0.80       768

[[448  52]
 [ 98 170]]
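
One possible reading of "evaluating features with kNN" is to score each feature on its own with cross-validation. The sketch below does that on the bundled iris data. The dataset, the 5-fold split, and the single-feature scoring scheme are all assumptions made here, not something the article specifies.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Score each feature on its own: higher cross-validated accuracy
# suggests the feature is more useful to a kNN classifier.
scores = {}
for col in range(X_iris.shape[1]):
    knn = KNeighborsClassifier(n_neighbors=5)
    scores[col] = cross_val_score(knn, X_iris[:, [col]], y_iris, cv=5).mean()

for col, score in scores.items():
    print(f"feature {col}: mean accuracy {score:.3f}")
```

Scoring features one at a time ignores interactions between them, so this is a rough screening step rather than a full selection procedure.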

Decision tree

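The Classification and Regression Trees (CART) algorithm is commonly used for classification or regression problems whose features carry categorical information, and it is well suited to multi-class classification.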

#decision tree
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
expected = y
predicted = model.predict(X)
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Output:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
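
The perfect scores above are computed on the same rows the tree was fit on, so they reflect memorization of the training data rather than performance on new data. A minimal sketch of a held-out evaluation follows. The synthetic data, the 10% label noise (flip_y=0.1), and the 70/30 split are stand-ins chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise, used as a stand-in.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, flip_y=0.1, random_state=0
)
# Hold out 30% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(train_acc, test_acc)
```

An unpruned tree typically reaches a training accuracy of 1.0 while scoring lower on the held-out rows, which is why the held-out number is the more informative one.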

SVM

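SVM is a very popular machine learning algorithm, used mainly for classification. As with logistic regression, it can handle multi-class classification through a one-vs-rest approach.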

#SVM
from sklearn.svm import SVC
model = SVC()
model.fit(X, y)
print(model)
expected = y
predicted = model.predict(X)
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Output:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
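
The one-vs-rest strategy mentioned for SVM can be made explicit with OneVsRestClassifier. The sketch below uses the bundled three-class iris data as an assumed stand-in. SVC on its own handles multi-class input with a one-vs-one scheme internally, so the wrapper is what forces one classifier per class here.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Iris has 3 classes.
X_iris, y_iris = load_iris(return_X_y=True)

# Wrap a binary SVC so that one classifier is trained per class.
ovr = OneVsRestClassifier(SVC())
ovr.fit(X_iris, y_iris)

print(len(ovr.estimators_))   # number of underlying binary classifiers
print(ovr.predict(X_iris[:5]))
```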

5. How to Tune Algorithm Parameters

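A harder task is building an effective method for choosing the right parameters, which requires searching over candidate values. scikit-learn provides functions for this purpose.

The following example selects a regularization parameter: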

#parameter selection
from sklearn.linear_model import Ridge
#sklearn.grid_search was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import GridSearchCV
#prepare a range of alpha values to test
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
#create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
#summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

Output:

GridSearchCV(cv=None, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': array([  1.00000e+00,   1.00000e-01,   1.00000e-02,   1.00000e-03,
         1.00000e-04,   0.00000e+00])},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
0.282118955686
1.0
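
Beyond the single best score, the score for every candidate can be inspected through cv_results_. The sketch below assumes synthetic regression data from make_regression, a log-spaced alpha grid, and 5-fold cross-validation, none of which come from the run above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data used as a stand-in.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=0
)

# Candidate alphas spaced evenly on a log scale from 1e-4 to 1e2.
alphas = np.logspace(-4, 2, 7)
grid = GridSearchCV(Ridge(), param_grid={"alpha": alphas}, cv=5)
grid.fit(X_demo, y_demo)

# Mean cross-validated R^2 for every alpha that was tried.
for alpha, score in zip(alphas, grid.cv_results_["mean_test_score"]):
    print(f"alpha={alpha:g}: {score:.4f}")
print(grid.best_params_)
```

Looking at the full score table shows whether the best value sits at the edge of the grid, in which case the grid may need to be widened.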

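Sometimes it is effective to sample parameter values at random from a given range, evaluate the algorithm with each sampled value, and keep the best one.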

from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
#sklearn.grid_search was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import RandomizedSearchCV
#prepare a uniform distribution on [0, 1] to sample the alpha parameter from
param_grid = {'alpha': sp_rand()}
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

Output:

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
          fit_params={}, iid=True, n_iter=100, n_jobs=1,
          param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000000008739C18>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          scoring=None, verbose=0)
0.282118896925
0.997818886895
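
Random search gives different results on each run unless the sampler is seeded. The sketch below fixes random_state and widens the sampling range to [0, 10]. The synthetic data, the range, and n_iter=20 are assumptions chosen for illustration.

```python
from scipy.stats import uniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data used as a stand-in.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=0
)

# uniform(loc, scale) samples from [loc, loc + scale], here [0, 10].
param_distributions = {"alpha": uniform(loc=0, scale=10)}
rsearch = RandomizedSearchCV(
    Ridge(),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    random_state=0,  # seed the sampler so the same alphas are drawn each run
)
rsearch.fit(X_demo, y_demo)

print(rsearch.best_params_)
print(rsearch.best_score_)
```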

Time: 2024-10-12 03:31:48
