Linear Regression with Scikit Learn

Before you read

This is a demo of how to use simple linear regression in scikit-learn with Python. The package versions I use are listed below:

Python version: 3.6.2

NumPy version: 1.8.0rc1

scikit-learn version: 0.19.0

Matplotlib version: 2.0.2
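If you want to confirm that your own environment matches, a quick version check (an addition, not from the original post) looks like this:

import sys

import matplotlib
import numpy
import sklearn

# Print the interpreter and package versions used in this demo.
print(sys.version)
print(numpy.__version__, sklearn.__version__, matplotlib.__version__)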

Training Data

Here is the training data describing the relationship between pizza diameter and price:

Training instance    Diameter (inch)    Price ($)
1                    6                  7
2                    8                  9
3                    10                 13
4                    14                 17.5
5                    18                 18

First, we can plot diameter against price:

import matplotlib.pyplot as plt

def run_plt():
    plt.figure()
    plt.title('Pizza Price with diameter.')
    plt.xlabel('diameter(inch)')
    plt.ylabel('price($)')
    plt.axis([0, 25, 0, 25])
    plt.grid(True)
    return plt

X = [[6], [8], [10], [14], [18]]
y = [[7], [9], [13], [17.5], [18]]

plt = run_plt()
plt.plot(X, y, 'k.')
plt.show()

The resulting figure shows the five training points as a scatter plot.

Next, we fit a linear regression model to this data.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
# X and y are the training data from the previous code.
model.fit(X, y)
# Predict the price of a 12-inch pizza; predict expects a 2D array.
price = model.predict([[12]])[0]
print('The 12-inch pizza price: $%.2f' % price)
# The 12-inch pizza price: $13.68

Definition of simple linear regression:

Simple linear regression assumes that a linear relationship exists between the response variable and explanatory variable; it models this relationship with a linear surface called a hyperplane. A hyperplane is a subspace that has one dimension less than the ambient space that contains it. In simple linear regression, there is one dimension for the response variable and another dimension for the explanatory variable, making a total of two dimensions. The regression hyperplane therefore has one dimension; a hyperplane with one dimension is a line.

The simple linear regression model that scikit-learn uses is:

\(y = \alpha + \beta * x\)

\(y\) is the predicted value of the response variable. \(x\) is the explanatory variable. \(\alpha\) and \(\beta\) are learned by the learning algorithm.
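Once the model has been fit, scikit-learn exposes the learned parameters as attributes; this quick check is an addition, not part of the original demo:

# After model.fit(X, y), inspect the learned parameters.
print(model.intercept_)  # alpha, approximately 1.9655
print(model.coef_)       # beta, approximately 0.9762

These match the values derived by hand at the end of this post.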

Suppose we have new data \(X_{2}\):

\(X_{2}\) = [[0], [10], [14], [25]]

We want to use linear regression to predict the pizza prices for this data and plot the figure. There are two steps:

  1. Fit the model with the previous \(X\) and \(y\).
  2. Predict the pizza prices for \(X_{2}\), as in the code below.

model = LinearRegression()
# X and y are the previous training data.
model.fit(X, y)

X2 = [[0], [10], [14], [25]]
y2 = model.predict(X2)

plt = run_plt()
plt.plot(X, y, 'k.')
plt.plot(X2, y2, 'g-')
plt.show()

The resulting figure shows the fitted regression line (green) through the training points.

Summary

The method used above is called ordinary least squares. The process is:

  1. Define the cost function and fit the training data.
  2. Predict on new data.

Evaluating the fitness of a model with a cost function

Several lines can be created with different parameters, which raises a question: which one is the best-fitting regression line?

plt = run_plt()
plt.plot(X, y, 'k.')

# Three alternative candidate lines:
y3 = [14.25, 14.25, 14.25, 14.25]  # a horizontal line
y4 = y2 * 0.5 + 5                  # the fitted line, scaled and shifted

model.fit(X[1:-1], y[1:-1])        # refit using only the middle three points
y5 = model.predict(X2)

plt.plot(X2, y2, 'g-.')            # the original fitted line
plt.plot(X2, y3, 'r-.')
plt.plot(X2, y4, 'y-.')
plt.plot(X2, y5, 'o-')
plt.show()

Definition of a cost function

A cost function, also called a loss function, is used to define and measure the error of a model. The differences between the prices predicted by the model and the observed prices of the pizzas in the training set are called residuals or training errors. Later, we will evaluate a model on a separate set of test data; the differences between the predicted and observed values in the test data are called prediction errors or test errors.
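As a minimal sketch of that train/test distinction (train_test_split and the variable names below are illustrative additions, not part of the original demo):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np

# Hold out part of the pizza data as a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
m = LinearRegression().fit(X_train, y_train)

# Residuals (training errors) and prediction errors (test errors).
residuals = m.predict(X_train) - np.array(y_train)
prediction_errors = m.predict(X_test) - np.array(y_test)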

The resulting figure shows the training points and the four candidate lines.

The original data are the black points. As we can see, the green line is the best-fitting regression line. But how does the computer know that?

We need a mathematical method to tell the computer which line fits best.

model.fit(X, y)
yr = model.predict(X)

# Draw a red vertical segment from each observation to its prediction.
for idx, x in enumerate(X):
    plt.plot([x, x], [y[idx], yr[idx]], 'r-')

This plots the residuals: each red segment connects an observed price to the price predicted by the model.

We can use the residual sum of squares to measure the fit.

\(SS_{res} = \sum _{i =1}^n(y_{i} - f(x_{i}))^{2}\)

Using NumPy, we can calculate \(SS_{res}\); for this model it comes to 8.75 (equivalently, the mean of the squared residuals is 1.75):

import numpy as np
# Residual sum of squares over the training data (np.mean would give 1.75).
SSres = np.sum((model.predict(X) - np.array(y)) ** 2)
# SSres = 8.75

Solving ordinary least squares for simple linear regression

Recall the simple linear regression model:

\(y = \alpha + \beta * x\)

Our goal is to find the values of \(\alpha\) and \(\beta\). We will solve for \(\beta\) first; to do so, we need the variance of \(x\) and the covariance of \(x\) and \(y\).

Variance is a measure of how far a set of values is spread out. If all of the numbers in the set are equal, the variance of the set is zero.

\(var(x) = \frac{\sum_{i=1}^n(x_{i} - \overline{x})^{2}}{n-1}\)

\(\overline{x}\) is the mean of \(x\).

var = np.var(X, ddof=1)
# var = 23.2
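As a quick check (an addition, not in the original post), the variance can be computed directly from the formula above:

x = [6, 8, 10, 14, 18]
xbar = sum(x) / len(x)  # 11.2
# Sum of squared deviations divided by n - 1.
var = sum((xi - xbar) ** 2 for xi in x) / (len(x) - 1)
# var = 23.2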

Covariance is a measure of how much two variables change together. If the variables increase together, their covariance is positive. If one variable tends to increase while the other decreases, their covariance is negative. If there is no linear relationship between the two variables, their covariance is zero.

\(cov(x,y) = \frac{\sum_{i=1}^n(x_{i}-\overline{x})(y_{i}-\overline{y})}{n-1}\)

import numpy as np
cov = np.cov([6, 8, 10, 14, 18], [7, 9, 13, 17.5, 18])[0][1]
# cov = 22.65
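Likewise (again an illustrative addition), the covariance follows directly from its formula:

x = [6, 8, 10, 14, 18]
y_obs = [7, 9, 13, 17.5, 18]
xbar, ybar = sum(x) / len(x), sum(y_obs) / len(y_obs)  # 11.2, 12.9
# Sum of products of paired deviations divided by n - 1.
cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y_obs)) / (len(x) - 1)
# cov = 22.65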

There is a formula to solve for \(\beta\):

\(\beta = \frac{cov(x,y)}{var(x)}\)

\(\beta = \frac{22.65}{23.2} = 0.9762\)

We can then solve for \(\alpha\) with the following formula:

\(\alpha = \overline{y} - \beta * \overline{x}\)

\(\alpha = 12.9 - 0.9762 * 11.2 = 1.9655\)

Summary

The resulting regression formula is:

\(y = 1.9655 + 0.9762 * x\)
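As a final check (an addition to the original post), plugging the 12-inch pizza into this formula reproduces the prediction from the beginning:

alpha, beta = 1.9655, 0.9762
price = alpha + beta * 12
print('%.2f' % price)  # 13.68, matching model.predict([[12]])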

Original post: https://www.cnblogs.com/xiyin/p/8485982.html
