How to tune hyperparameters with Python and scikit-learn?

Hyperparameter Optimization

In the context of machine learning, hyperparameter optimization (or model selection) is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent data set. Cross-validation is often used to estimate this generalization performance. Hyperparameter optimization contrasts with the actual learning problem, which is also often cast as an optimization problem, but one that optimizes a loss function on the training set alone. In effect, learning algorithms learn parameters that model/reconstruct their inputs well, while hyperparameter optimization ensures the model does not overfit its data by tuning, e.g., the strength of regularization.

Using the k-NN algorithm, we obtained 57.58% classification accuracy on the Kaggle Dogs vs. Cats dataset challenge.

The question is: “Can we do better?”

Of course we can! Obtaining higher accuracy for nearly any machine learning algorithm boils down to tweaking various knobs and levers.

In the case of k-NN, we can tune k, the number of nearest neighbors. We can also tune our distance metric/similarity function as well.

Of course, hyperparameter tuning has implications outside of the k-NN algorithm as well. In the context of Deep Learning and Convolutional Neural Networks, we can easily have hundreds of hyperparameters to tune and play with (although in practice we try to limit the number of variables to tune to a small handful), each affecting our overall classification accuracy to some (potentially unknown) degree.

Because of this, it’s important to understand the concept of hyperparameter tuning and how your choice in hyperparameters can dramatically impact your classification accuracy.

How to tune hyperparameters with Python and scikit-learn

In the remainder of today’s post, I’ll be demonstrating how to tune k-NN hyperparameters for the Dogs vs. Cats dataset. We’ll start with a discussion on what hyperparameters are, followed by viewing a concrete example on tuning k-NN hyperparameters.

We’ll then explore how to tune k-NN hyperparameters using two search methods: Grid Search and Randomized Search.

As our results will demonstrate, we can improve our classification accuracy from 57.58% to over 64%!

What are hyperparameters?

Hyperparameters are simply the knobs and levers you turn when building a machine learning classifier. The process of tuning hyperparameters is more formally called hyperparameter optimization.

So what’s the difference between a normal “model parameter” and a “hyperparameter”?

Well, a standard “model parameter” is normally an internal variable that is optimized in some fashion. In the context of Linear Regression, Logistic Regression, and Support Vector Machines, we would think of parameters as the weight vector coefficients found by the learning algorithm.

On the other hand, “hyperparameters” are normally set by a human designer or tuned via algorithmic approaches. Examples of hyperparameters include the number of neighbors k in the k-Nearest Neighbor algorithm, the learning rate alpha of a Neural Network, or the number of filters learned in a given convolutional layer in a CNN.

In general, model parameters are optimized according to some loss function, while hyperparameters are instead searched for by exploring various settings to see which values provide the highest accuracy.

Because of this, it tends to be easier to tune model parameters (since we’re optimizing some objective function based on our training data) whereas hyperparameters can require a nearly blind search to find optimal ones.
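
To make the distinction concrete, here is a minimal sketch on hypothetical synthetic data (not the Dogs vs. Cats features) showing a hyperparameter we set by hand versus the parameters the learning algorithm finds on its own:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy data purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# C (the inverse regularization strength) is a hyperparameter: we choose it up front
model = LogisticRegression(C=1.0)
model.fit(X, y)

# coef_ holds the model parameters: the weight vector found by the learning algorithm
print(model.coef_)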

k-NN hyperparameters

As a concrete example of tuning hyperparameters, let’s consider the k-Nearest Neighbor classification algorithm. For your standard k-NN implementation, there are two primary hyperparameters that you’ll want to tune:

  1. The number of neighbors k.
  2. The distance metric/similarity function.

Both of these values can dramatically affect the accuracy of your k-NN classifier. To demonstrate this in the context of image classification, let’s apply hyperparameter tuning to our Kaggle Dogs vs. Cats dataset from last week.
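
Both of these knobs are exposed directly on scikit-learn's KNeighborsClassifier, so "tuning" them simply means constructing the classifier with different argument values. A quick sketch (the values here are arbitrary placeholders, not the tuned ones):

from sklearn.neighbors import KNeighborsClassifier

# both hyperparameters are plain constructor arguments
model = KNeighborsClassifier(n_neighbors=5, metric="euclidean")

# changing them changes the classifier's behavior without touching the data
# or the fit/predict calls
model = KNeighborsClassifier(n_neighbors=15, metric="cityblock")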

Open up a new file, name it knn_tune.py, and insert the following code:

 1 # import the necessary packages
 2 from sklearn.neighbors import KNeighborsClassifier
 3 from sklearn.model_selection import RandomizedSearchCV
 4 from sklearn.model_selection import GridSearchCV
 5 from sklearn.model_selection import train_test_split
 6 from imutils import paths
 7 import numpy as np
 8 import argparse
 9 import imutils
10 import time
11 import cv2
12 import os

Lines 2-12 start by importing our required Python packages. We'll be making heavy use of the scikit-learn library. (Note: in modern scikit-learn, RandomizedSearchCV, GridSearchCV, and train_test_split all live in sklearn.model_selection; in the older releases this post was originally written against, they were in sklearn.grid_search and sklearn.cross_validation.)

We'll also be using the imutils library, so make sure you have it installed as well:

$ pip install imutils 

Next, we'll define our extract_color_histogram function:

14 def extract_color_histogram(image, bins=(8, 8, 8)):
15     # extract a 3D color histogram from the HSV color space using
16     # the supplied number of `bins` per channel
17     hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
18     hist = cv2.calcHist([hsv], [0, 1, 2], None, bins,
19         [0, 180, 0, 256, 0, 256])
20
21     # handle normalizing the histogram if we are using OpenCV 2.4.X
22     if imutils.is_cv2():
23         hist = cv2.normalize(hist)
24
25     # otherwise, perform "in place" normalization in OpenCV 3 (I
26     # personally hate the way this is done)
27     else:
28         cv2.normalize(hist, hist)
29
30     # return the flattened histogram as the feature vector
31     return hist.flatten()

This function accepts an input image along with a number of bins for each channel of the image.

We convert the image to the HSV color space and compute a 3D color histogram to characterize the color distribution of the image (Lines 17-19).

This histogram is then flattened into a single 8 x 8 x 8 = 512-d feature vector that is returned to the calling function.
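
As a quick sanity check (this assumes the extract_color_histogram function above is already defined in your session), even a dummy image produces a 512-d vector:

import numpy as np

# a random 100x100 BGR image stands in for a real photo
image = np.random.randint(0, 256, size=(100, 100, 3), dtype="uint8")
hist = extract_color_histogram(image)
print(hist.shape)  # (512,) since 8 * 8 * 8 = 512

With the feature extractor defined, the next block parses our command line arguments and starts gathering the dataset: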

33 # construct the argument parse and parse the arguments
34 ap = argparse.ArgumentParser()
35 ap.add_argument("-d", "--dataset", required=True,
36     help="path to input dataset")
37 ap.add_argument("-j", "--jobs", type=int, default=-1,
38     help="# of jobs for k-NN distance (-1 uses all available cores)")
39 args = vars(ap.parse_args())
40
41 # grab the list of images that we'll be describing
42 print("[INFO] describing images...")
43 imagePaths = list(paths.list_images(args["dataset"]))
44
45 # initialize the data matrix and labels list
46 data = []
47 labels = []

Lines 34-39 handle parsing our command line arguments. We only need two switches here:

  • --dataset : The path to our input Dogs vs. Cats dataset from the Kaggle challenge.
  • --jobs : The number of processors/cores to utilize when computing the nearest neighbors for a particular data point. Setting this value to -1 indicates that all available processors/cores should be used. Again, for a more detailed review of these arguments, please refer to last week's tutorial. An example invocation is shown just after this list.
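
For example, assuming the Kaggle images were unpacked into a directory named kaggle_dogs_vs_cats (a hypothetical path; point --dataset at wherever your copy actually lives), the script would be launched like this:

$ python knn_tune.py --dataset kaggle_dogs_vs_cats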

Line 43 grabs the paths to our 25,000 input images while Lines 46 and 47 initialize the data list (where we'll store the color histogram extracted from each image) and labels list (either “dog” or “cat” for each input image), respectively.

Next, we can loop over our imagePaths  and describe them:

49 # loop over the input images
50 for (i, imagePath) in enumerate(imagePaths):
51     # load the image and extract the class label (assuming that our
52     # path has the format: /path/to/dataset/{class}.{image_num}.jpg)
53     image = cv2.imread(imagePath)
54     label = imagePath.split(os.path.sep)[-1].split(".")[0]
55
56     # extract a color histogram from the image, then update the
57     # data matrix and labels list
58     hist = extract_color_histogram(image)
59     data.append(hist)
60     labels.append(label)
61
62     # show an update every 1,000 images
63     if i > 0 and i % 1000 == 0:
64         print("[INFO] processed {}/{}".format(i, len(imagePaths)))

Line 50 starts looping over each of the imagePaths . For each imagePath , we load the image from disk and extract its label (Lines 53 and 54).

Now that we have our image , we compute a color histogram (Line 58), followed by updating the data  and labels  lists (Lines 59 and 60).

Finally, Lines 63 and 64 display the feature extraction progress to our screen.

In order to train and evaluate our k-NN classifier, we’ll need to partition our data  into two splits: a training split and a testing split:

66 # partition the data into training and testing splits, using 75%
67 # of the data for training and the remaining 25% for testing
68 print("[INFO] constructing training/testing split...")
69 (trainData, testData, trainLabels, testLabels) = train_test_split(
70     data, labels, test_size=0.25, random_state=42)

Here we’ll be using 75% of our data for training and the remaining 25% for evaluation.

Finally, let’s define the set of hyperparameters we are going to optimize over:

72 # construct the set of hyperparameters to tune
73 params = {"n_neighbors": np.arange(1, 31, 2),
74     "metric": ["euclidean", "cityblock"]}

The above code block defines a params  dictionary which contains two keys:

  • n_neighbors : The number of nearest neighbors k in the k-NN algorithm. Here we'll search over the odd integers in the range [1, 29] (keep in mind that np.arange excludes the stop value of 31; see the quick check after this list).
  • metric : This is the distance function/similarity metric for k-NN. Normally this defaults to the Euclidean distance, but we could also use any function that returns a single floating point value representing how “similar” two images are. In this case, we’ll search over both the Euclidean distance and Manhattan/City block distance.
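
A quick check confirms exactly which values of k end up in the search space:

import numpy as np

# np.arange excludes the stop value, so this yields the odd integers 1, 3, ..., 29
print(np.arange(1, 31, 2))
# [ 1  3  5  7  9 11 13 15 17 19 21 23 25 27 29]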

Now that we have defined the hyperparameters we want to search over, we need a method that actually applies the search. Luckily, the scikit-learn library already has two methods that can perform hyperparameter search for us: Grid Search and Randomized Search.

As we'll find out, it's normally preferable to use Randomized Search over Grid Search in nearly all circumstances.

Grid Search hyperparameters

The Grid Search tuning algorithm will methodically (and exhaustively) train and evaluate a machine learning classifier for each and every combination of hyperparameter values.

The primary benefit of the Grid Search algorithm is also its major drawback: because the search is exhaustive, the number of combinations to evaluate explodes as the number of hyperparameters and candidate values increases.

Sure, you get to evaluate each and every combination of hyperparameters, but you pay for it in time. And in most cases, it's hardly worth it.
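
To see where that cost comes from, we can enumerate the grid for our (quite small) params dictionary. The sketch below uses scikit-learn's ParameterGrid utility, the same machinery GridSearchCV relies on internally:

import numpy as np
from sklearn.model_selection import ParameterGrid

params = {"n_neighbors": np.arange(1, 31, 2),
    "metric": ["euclidean", "cityblock"]}

# an exhaustive grid search trains and evaluates one model per combination:
# 15 values of k x 2 distance metrics = 30 models
grid = ParameterGrid(params)
print(len(grid))
print(list(grid)[:2])

Thirty models is manageable here, but add a third hyperparameter with ten candidate values and the count jumps to 300; the growth is multiplicative.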

As explained in the “Use Randomized Search for hyperparameter tuning (in most situations)” section below, there is rarely just one set of hyperparameters that obtains the highest accuracy.

Instead, there are “hot zones” of hyperparameters that all obtain near-identical accuracy. The goal is to explore the hyperparameter space as quickly as possible and land in one of these “hot zones”. It turns out that a random search is a great way to do this.

The Random Search approach to hyperparameter tuning will sample hyperparameters from our params  dictionary via a random, uniform distribution. Given a set of randomly sampled parameters, a model is then trained and evaluated.

We repeat this random sampling and model construction/evaluation a preset number of times. Set the number of evaluations based on how long you're willing to wait: if you're impatient and in a hurry, make this value low; if you have the time to spend on a longer experiment, increase the number of iterations.

In either case, the goal of a Randomized Search is to explore a large space of possible hyperparameter values quickly, and the best way to accomplish this is via simple random sampling. In practice, it works quite well!
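
Conceptually, this sampling step is what RandomizedSearchCV performs under the hood before training one model per draw. A small sketch using scikit-learn's ParameterSampler (with n_iter playing the role of the evaluation budget) illustrates it:

import numpy as np
from sklearn.model_selection import ParameterSampler

params = {"n_neighbors": np.arange(1, 31, 2),
    "metric": ["euclidean", "cityblock"]}

# draw 5 random hyperparameter settings; the randomized search trains and
# evaluates one model for each draw
for p in ParameterSampler(params, n_iter=5, random_state=42):
    print(p)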

You can find the code to perform both a Grid Search and a Randomized Search over the k-NN hyperparameters below:

# tune the hyperparameters via a cross-validated grid search
print("[INFO] tuning hyperparameters via grid search")
model = KNeighborsClassifier(n_jobs=args["jobs"])
grid = GridSearchCV(model, params)
start = time.time()
grid.fit(trainData, trainLabels)

# evaluate the best grid searched model on the testing data
print("[INFO] grid search took {:.2f} seconds".format(
	time.time() - start))
acc = grid.score(testData, testLabels)
print("[INFO] grid search accuracy: {:.2f}%".format(acc * 100))
print("[INFO] grid search best parameters: {}".format(
	grid.best_params_))

# tune the hyperparameters via a randomized search
grid = RandomizedSearchCV(model, params)
start = time.time()
grid.fit(trainData, trainLabels)

# evaluate the best randomized searched model on the testing
# data
print("[INFO] randomized search took {:.2f} seconds".format(
	time.time() - start))
acc = grid.score(testData, testLabels)
print("[INFO] grid search accuracy: {:.2f}%".format(acc * 100))
print("[INFO] randomized search best parameters: {}".format(
	grid.best_params_))

Use Randomized Search for hyperparameter tuning (in most situations)

Unless your search space is small and can easily be enumerated, a Randomized Search will tend to be more efficient and yield better results faster.

As our experiments demonstrated, Randomized Search was able to obtain 64.03% accuracy in < 5 minutes while an exhaustive Grid Search took a much longer 13 minutes to obtain an identical 64.03% accuracy — that’s a 202% increase in evaluation time for identical accuracy!

In general, there isn't just one set of hyperparameters that obtains optimal results; instead, there is usually a set of them lying along the bottom of a bowl-shaped region of the optimization surface.

As long as you hit just one of these parameter settings near the bottom of the bowl, you'll still obtain roughly the same accuracy as if you had enumerated all possibilities along the bowl. Furthermore, you'll be able to explore the various regions of this bowl faster by applying a Randomized Search.

Overall, this will lead to faster, more efficient hyperparameter tuning in most situations.

Reference

http://www.pyimagesearch.com/2016/08/15/how-to-tune-hyperparameters-with-python-and-scikit-learn/
