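Data processing and machine learning in Python: fetching data online and locally, parsing, preprocessing, training, prediction, cross-validation, and visualization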

http://blog.csdn.net/pipisorry/article/details/44833603

In [1]:

%matplotlib inline

Fetching data

A simple HTTP request

In [2]:

import requests

print(requests.get("http://example.com").text)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.</p>
    <p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

Talking to an API

In [3]:

response = requests.get("https://www.googleapis.com/books/v1/volumes", params={"q": "machine learning"})
raw_data = response.json()
titles = [item['volumeInfo']['title'] for item in raw_data['items']]
titles

Out[3]:

[u'C4.5',
 u'Machine Learning',
 u'Machine Learning',
 u'Machine Learning',
 u'A First Course in Machine Learning',
 u'Machine Learning',
 u'Elements of Machine Learning',
 u'Introduction to Machine Learning',
 u'Pattern Recognition and Machine Learning',
 u'Machine Learning and Its Applications']
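
As a rough sketch of what happens in the cell above: requests encodes the params dictionary into the URL's query string, and the parsed JSON is an ordinary nest of dictionaries and lists. The payload below is a made-up stand-in for a real response (only the items / volumeInfo / title keys come from the cell above), so the snippet runs without network access.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/books/v1/volumes"

# Build the query string by hand; spaces are encoded as '+'.
url = BASE_URL + "?" + urlencode({"q": "machine learning"})

# Hypothetical, heavily trimmed payload (not real API output).
sample_payload = json.loads("""
{"items": [{"volumeInfo": {"title": "Machine Learning"}},
           {"volumeInfo": {"title": "Elements of Machine Learning"}},
           {"volumeInfo": {}}]}
""")

# .get() keeps the loop from failing on an item that lacks a title.
titles = [item.get("volumeInfo", {}).get("title")
          for item in sample_payload.get("items", [])]
titles = [t for t in titles if t is not None]

print(url)
print(titles)
```

Using .get() instead of direct indexing is a defensive choice for responses where some items may be incomplete.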

In [4]:

import lxml.html

page = lxml.html.parse("http://www.blocket.se/stockholm?q=apple")
# ^ This is probably illegal. Blocket, please don't sue me!
items_data = []
for el in page.getroot().find_class("item_row"):
    links = el.find_class("item_link")
    images = el.find_class("item_image")
    prices = el.find_class("list_price")
    # Keep only rows that have a link, an image and a non-empty price
    if links and images and prices and prices[0].text:
        items_data.append({"name": links[0].text,
                           "image": images[0].attrib['src'],
                           "price": int(prices[0].text.split(":")[0].replace(" ", ""))})
items_data

Out[4]:

[{'image': 'http://cdn.blocket.com/static/2/lithumbs/98/9864322297.jpg',
  'name': 'Macbook laddare 60w',
  'price': 250},
 {'image': 'http://cdn.blocket.com/static/2/lithumbs/43/4338840758.jpg',
  'name': u'Apple iPhone 5S 16GB - Ol\xe5st - 12 m\xe5n garanti',
  'price': 3999},
 {'image': 'http://cdn.blocket.com/static/0/lithumbs/98/9838946223.jpg',
  'name': u'Ol\xe5st iPhone 5 64 GB med n\xe4stan nytt batteri',
  'price': 3000},
 {'image': 'http://cdn.blocket.com/static/1/lithumbs/79/7906971367.jpg',
  'name': u'Apple iPhone 5C 16GB - Ol\xe5st - 12 m\xe5n garanti',
  'price': 3099},
 {'image': 'http://cdn.blocket.com/static/0/lithumbs/79/7926951568.jpg',
  'name': u'HP Z620 Workstation - 1 \xe5rs garanti',
  'price': 12494},
 {'image': 'http://cdn.blocket.com/static/0/lithumbs/97/9798755036.jpg',
  'name': 'HP ProBook 6450b - Andrasortering',
  'price': 1699},
 {'image': 'http://cdn.blocket.com/static/1/lithumbs/98/9898462036.jpg',
  'name': 'Macbook pro 13 retina, 256 gb ssd',
  'price': 12000}]
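
The price conversion in the scraping loop can be pulled out into a small helper, sketched below. The function name parse_price and the example string "3 999:-" are assumptions about how the listing page rendered prices; only the split-on-colon and strip-spaces logic comes from the cell above.

```python
def parse_price(text):
    """Return the integer before the first ':' or None if it is not a number."""
    digits = text.split(":")[0].replace(" ", "").replace("\xa0", "")
    return int(digits) if digits.isdigit() else None


print(parse_price("3 999:-"))   # assumed price format
print(parse_price("n/a"))       # anything non-numeric yields None
```

Returning None instead of raising lets a scraper skip malformed rows rather than abort the whole loop.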

Reading local data

In [5]:

import pandas

df = pandas.read_csv('sample.csv')

In [6]:

# Display the DataFrame
df

Out[6]:

	Year	Make	Model	Description	Price
0	1997	Ford	E350	ac, abs, moon	3000
1	1999	Chevy	Venture "Extended Edition"	NaN	4900
2	1999	Chevy	Venture "Extended Edition, Very Large"	NaN	5000
3	1996	Jeep	Grand Cherokee	MUST SELL!\nair, moon roof, loaded	NaN

In [7]:

# DataFrame's columns
df.columns

Out[7]:

Index([u'Year', u'Make', u'Model', u'Description', u'Price'], dtype='object')

In [8]:

# Values of a given column
df.Model

Out[8]:

0                                      E350
1                Venture "Extended Edition"
2    Venture "Extended Edition, Very Large"
3                            Grand Cherokee
Name: Model, dtype: object

Analyzing the DataFrame

In [9]:

# Any missing values?
df['Price']

Out[9]:

0    3000
1    4900
2    5000
3     NaN
Name: Price, dtype: float64

In [10]:

df['Description']

Out[10]:

0                         ac, abs, moon
1                                   NaN
2                                   NaN
3    MUST SELL!\nair, moon roof, loaded
Name: Description, dtype: object

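In [11]:

# Replace missing descriptions with a placeholder string
df['Description'] = df['Description'].fillna("No description is available.")
# Fill missing prices by a linear interpolation
df['Price'] = df['Price'].interpolate()

df

Out[11]:

	Year	Make	Model	Description	Price
0	1997	Ford	E350	ac, abs, moon	3000
1	1999	Chevy	Venture "Extended Edition"	No description is available.	4900
2	1999	Chevy	Venture "Extended Edition, Very Large"	No description is available.	5000
3	1996	Jeep	Grand Cherokee	MUST SELL!\nair, moon roof, loaded	5000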
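
To see why the missing price in the last row became 5000, here is a minimal plain-Python sketch of linear interpolation. It is an illustration, not pandas' implementation: the function name interpolate_linear is hypothetical, and it only mirrors the behaviour visible above (gaps between known values are filled on a straight line, trailing gaps repeat the last known value, leading gaps are left alone).

```python
import math


def interpolate_linear(values):
    """Fill NaN gaps linearly; trailing NaNs repeat the last known value."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if not math.isnan(v)]
    # Fill each gap that lies between two known values.
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    # There is nothing to interpolate towards after the last known value.
    if known:
        for i in range(known[-1] + 1, len(filled)):
            filled[i] = filled[known[-1]]
    return filled


prices = [3000.0, 4900.0, 5000.0, float("nan")]
print(interpolate_linear(prices))
print(interpolate_linear([1.0, float("nan"), float("nan"), 4.0]))
```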

Exploring data

In [12]:

import matplotlib.pyplot as plt

df = pandas.read_csv('sample2.csv')

df

Out[12]:

	Office	Year	Sales
0	Stockholm	2004	200
1	Stockholm	2005	250
2	Stockholm	2006	255
3	Stockholm	2007	260
4	Stockholm	2008	264
5	Stockholm	2009	274
6	Stockholm	2010	330
7	Stockholm	2011	364
8	New York	2004	432
9	New York	2005	469
10	New York	2006	480
11	New York	2007	438
12	New York	2008	330
13	New York	2009	280
14	New York	2010	299
15	New York	2011	230

In [13]:

# This table has 3 columns: Office, Year, Sales
print(df.columns)

# It's really easy to query data with Pandas:
print(df[(df['Office'] == 'Stockholm') & (df['Sales'] > 260)])

# It's also easy to do aggregations...
# (select Sales explicitly so the text column Office is not summed)
aggregated_sales = df.groupby('Year')[['Sales']].sum()
print(aggregated_sales)

Index([u'Office', u'Year', u'Sales'], dtype='object')
      Office  Year  Sales
4  Stockholm  2008    264
5  Stockholm  2009    274
6  Stockholm  2010    330
7  Stockholm  2011    364
      Sales
Year
2004    632
2005    719
2006    735
2007    698
2008    594
2009    554
2010    629
2011    594
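
The filter and the group-by above are shorthand for loops like the following. This is a plain-Python sketch on four rows taken from the table, not how pandas works internally; the variable names are illustrative.

```python
from collections import defaultdict

# A few rows from the sales table: (Office, Year, Sales).
rows = [("Stockholm", 2004, 200), ("Stockholm", 2005, 250),
        ("New York", 2004, 432), ("New York", 2005, 469)]

# Boolean filtering: keep rows that satisfy both conditions.
stockholm_over_225 = [r for r in rows if r[0] == "Stockholm" and r[2] > 225]

# Group by year and sum the sales of every office.
sales_by_year = defaultdict(int)
for office, year, sales in rows:
    sales_by_year[year] += sales

print(stockholm_over_225)
for year in sorted(sales_by_year):
    print(year, sales_by_year[year])
```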

In [14]:

# ... and generate plots
%matplotlib inline
aggregated_sales.plot(kind='bar')

Out[14]:

<matplotlib.axes._subplots.AxesSubplot at 0x1089dcc10>

Machine learning

Feature extraction

In [15]:

from sklearn import feature_extraction

Extracting features from text

In [16]:

corpus = ['All the cats really are great.',
          'I like the cats but I still prefer the dogs.',
          'Dogs are the best.',
          'I like all the trains',
          ]

tfidf = feature_extraction.text.TfidfVectorizer()

# One row per document, one column per word in the vocabulary
print(tfidf.fit_transform(corpus).toarray())
print(tfidf.get_feature_names())

[[ 0.38761905  0.38761905  0.          0.          0.38761905  0.
   0.49164562  0.          0.          0.49164562  0.          0.25656108
   0.        ]
 [ 0.          0.          0.          0.4098205   0.32310719  0.32310719
   0.          0.32310719  0.4098205   0.          0.4098205   0.42772268
   0.        ]
 [ 0.          0.4970962   0.6305035   0.          0.          0.4970962
   0.          0.          0.          0.          0.          0.32902288
   0.        ]
 [ 0.4970962   0.          0.          0.          0.          0.          0.
   0.4970962   0.          0.          0.          0.32902288  0.6305035 ]]
[u'all', u'are', u'best', u'but', u'cats', u'dogs', u'great', u'like', u'prefer', u'really', u'still', u'the', u'trains']
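
The numbers in the matrix above can be reproduced by hand. The sketch below assumes the vectorizer's default settings (lowercasing, tokens of two or more word characters, smoothed IDF, L2-normalised rows); it is an illustration of the formula, not scikit-learn's own code, and the helper names are made up.

```python
import math
import re
from collections import Counter

corpus = ["All the cats really are great.",
          "I like the cats but I still prefer the dogs.",
          "Dogs are the best.",
          "I like all the trains"]


def tokenize(text):
    # Lowercase words of two or more characters, so "I" is dropped.
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [Counter(tokenize(text)) for text in corpus]
vocabulary = sorted(set(term for doc in docs for term in doc))
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + sum(term in doc for doc in docs))) + 1
       for term in vocabulary}


def tfidf_row(doc):
    # Term count times IDF, then scale the row to unit Euclidean length.
    raw = [doc[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]


matrix = [tfidf_row(doc) for doc in docs]
print(vocabulary)
print([round(w, 4) for w in matrix[2]])
```

A word that appears in every document, such as "the", gets the lowest IDF, which is why its weight is the smallest non-zero entry in each row.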

Dict vectorizer

In [17]:

import json

data = [json.loads("""{"weight": 194.0, "sex": "female", "student": true}"""),
        {"weight": 60., "sex": 'female', "student": True},
        {"weight": 80.1, "sex": 'male', "student": False},
        {"weight": 65.3, "sex": 'male', "student": True},
        {"weight": 58.5, "sex": 'female', "student": False}]

vectorizer = feature_extraction.DictVectorizer(sparse=False)

# String values are one-hot encoded; numbers and booleans are kept as numbers
vectors = vectorizer.fit_transform(data)
print(vectors)
print(vectorizer.get_feature_names())

[[   1.     0.     1.   194. ]
 [   1.     0.     1.    60. ]
 [   0.     1.     0.    80.1]
 [   0.     1.     1.    65.3]
 [   1.     0.     0.    58.5]]
[u'sex=female', 'sex=male', u'student', u'weight']
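
The encoding shown above can be sketched in a few lines of plain Python. This is an approximation for illustration only: the function name vectorize is hypothetical, and it covers just the two cases that occur in this data (string values and numeric or boolean values).

```python
def vectorize(records):
    """One-hot encode string values; pass numbers and booleans through as floats."""
    def feature(key, value):
        if isinstance(value, str):
            return "%s=%s" % (key, value), 1.0
        return key, float(value)

    # Sorted feature names define the column order.
    names = sorted({feature(k, v)[0] for rec in records for k, v in rec.items()})
    index = {name: i for i, name in enumerate(names)}
    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for k, v in rec.items():
            name, val = feature(k, v)
            row[index[name]] = val
        rows.append(row)
    return names, rows


data = [{"weight": 60.0, "sex": "female", "student": True},
        {"weight": 80.1, "sex": "male", "student": False}]
names, rows = vectorize(data)
print(names)
print(rows)
```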

In [18]:

class A:
    def __init__(self, x):
        self.x = x
        self.blabla = 'test'

# An object's attributes are available as a dictionary
a = A(20)
a.__dict__

Out[18]:

{'blabla': 'test', 'x': 20}

Preprocessing

Scaling

In [19]:

from sklearn import preprocessing

data = [[10., 2345., 0., 2.],
        [3., -3490., 0.1, 1.99],
        [13., 3903., -0.2, 2.11]]

# Rescale each row (sample) to unit norm
print(preprocessing.normalize(data))

[[  4.26435200e-03   9.99990544e-01   0.00000000e+00   8.52870400e-04]
 [  8.59598396e-04  -9.99999468e-01   2.86532799e-05   5.70200269e-04]
 [  3.33075223e-03   9.99994306e-01  -5.12423421e-05   5.40606709e-04]]
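
Each row of the output above has Euclidean length 1, which is why the second column dominates. A minimal sketch of that row-wise L2 normalisation in plain Python follows; the function name l2_normalize is hypothetical and the code assumes the default settings of the library call.

```python
import math


def l2_normalize(rows):
    """Divide every row by its Euclidean norm (all-zero rows stay zero)."""
    result = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm if norm else 0.0 for v in row])
    return result


data = [[10., 2345., 0., 2.],
        [3., -3490., 0.1, 1.99],
        [13., 3903., -0.2, 2.11]]

normalized = l2_normalize(data)
print(normalized[0])
```

Note that this scales samples, not features: a column with large values still outweighs the others within each row.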

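Dimensionality reduction

In [20]:

from sklearn import decomposition

data = [[0.3, 0.2, 0.4,  0.32],
        [0.3, 0.5, 1.0, 0.19],
        [0.3, -0.4, -0.8, 0.22]]

pca = decomposition.PCA()
# Project the data onto its principal components
print(pca.fit_transform(data))
# Share of the total variance carried by each component
print(pca.explained_variance_ratio_)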

[[ -2.23442295e-01  -7.71447891e-02   8.06250485e-17]
 [ -8.94539226e-01   5.14200202e-02   8.06250485e-17]
 [  1.11798152e+00   2.57247689e-02   8.06250485e-17]]
[  9.95611223e-01   4.38877684e-03   9.24548594e-33]

Machine learning models

Classification (SVM)

In [21]:

from sklearn import datasets
from sklearn import svm

In [22]:

iris = datasets.load_iris()

# Use only the first two features
X = iris.data[:, :2]
y = iris.target

# Training the model
clf = svm.SVC(kernel='rbf')
clf.fit(X, y)

# Doing predictions
new_data = [[4.85, 3.1], [5.61, 3.02]]
print(clf.predict(new_data))

[0 1]

Regression (linear regression)

In [23]:

import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

def f(x):
    # The identity plus uniform noise in [0, 3)
    return x + np.random.random() * 3.

X = np.arange(0, 5, 0.5)
X = X.reshape((len(X), 1))
y = [f(x) for x in X]

clf = linear_model.LinearRegression()
clf.fit(X, y)

Out[23]:

LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [24]:

new_X = np.arange(0.2, 5.2, 0.3)
new_X = new_X.reshape((len(new_X), 1))
new_y = clf.predict(new_X)

plt.scatter(X, y, color='g', label='Training data')

plt.plot(new_X, new_y, '.-', label='Predicted')
plt.legend()

Out[24]:

<matplotlib.legend.Legend at 0x10a38f290>
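
With a single feature, the line fitted above is the ordinary least-squares line, which has a closed form. The sketch below computes it in plain Python on noise-free made-up data (y = 2x + 1) so the result is easy to check; the function name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2 * x + 1 for x in xs]   # made-up data on an exact line
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

In the notebook's example the noise has mean 1.5, so the fitted slope should come out close to 1 and the intercept close to 1.5, up to random variation.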

Clustering (DBSCAN)

In [25]:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=200, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)

In [26]:

# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
db.labels_

Out[26]:

array([-1,  0,  2,  1,  1,  2, -1,  0,  0, -1, -1,  0,  0,  2, -1, -1,  2,
        0,  1,  0,  0,  2, -1, -1,  0, -1, -1,  1, -1,  2,  1, -1,  1, -1,
        1,  0,  1,  0,  0,  2,  2, -1,  2,  1,  0,  1,  0,  1,  2,  1,  1,
        2, -1,  2,  1, -1,  0,  0, -1,  1,  0,  0,  1,  2,  0, -1,  2,  1,
       -1,  0,  0,  1,  1,  0, -1,  2, -1,  1,  2,  2,  0,  2,  1,  0, -1,
        0,  2,  1, -1,  2,  0, -1,  1,  1,  2,  0,  2,  1,  2,  1,  2,  2,
       -1,  2,  0,  1,  0, -1,  2,  0,  1,  0,  0, -1,  1,  0,  2,  2,  0,
        1,  0, -1,  1,  0,  1,  1,  1, -1,  1,  2,  1, -1, -1,  0,  0,  2,
        1,  1, -1,  0,  1,  2,  1,  0,  0, -1,  2,  1,  1,  1,  2,  2,  0,
        0,  2, -1,  1,  0,  1,  1,  2,  1,  2,  1,  0, -1,  2,  0,  2,  1,
        2,  1,  0,  1,  2,  0,  1, -1,  2,  0,  0,  1,  1,  1, -1,  0,  1,
        0,  1,  2, -1, -1,  2,  1,  0,  0,  2, -1,  2,  0])
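
In the label array above, -1 marks points that DBSCAN treats as noise; the other integers are cluster ids. A short sketch of how to summarise such labels, using only the first 17 values of the array so it runs on its own:

```python
from collections import Counter

# First 17 labels from the DBSCAN output.
labels = [-1, 0, 2, 1, 1, 2, -1, 0, 0, -1, -1, 0, 0, 2, -1, -1, 2]

counts = Counter(labels)
n_noise = counts.pop(-1, 0)   # -1 is the noise label, not a cluster
n_clusters = len(counts)

print("clusters:", n_clusters)
print("noise points:", n_noise)
print("cluster sizes:", dict(counts))
```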

In [27]:

import matplotlib.pyplot as plt
# Colour each point by its cluster label
plt.scatter(X[:, 0], X[:, 1], c=db.labels_)

Out[27]:

<matplotlib.collections.PathCollection at 0x10a6bc110>

Cross-validation

In [28]:

from sklearn import svm, cross_validation, datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

model = svm.SVC()
# One score per fold for each metric
print(cross_validation.cross_val_score(model, X, y, scoring='precision'))
print(cross_validation.cross_val_score(model, X, y, scoring='mean_squared_error'))

[ 0.98148148  0.96491228  0.98039216]
[-0.01960784 -0.03921569 -0.02083333]
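
Each of the three numbers per metric above comes from one fold: the model is trained on the other folds and scored on the held-out one. The sketch below shows only the index bookkeeping of a plain, unshuffled K-fold split. It is a simplification: for classifiers scikit-learn uses a stratified split that keeps class proportions, which is not reproduced here, and the function name k_fold_indices is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k consecutive, near-equal folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample lands in exactly one test fold, so each one is used for evaluation once and for training k - 1 times.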

ref: Data-processing and machine learning with Python

http://nbviewer.ipython.org/github/halflings/python-data-workshop/blob/master/data-workshop-notebook.ipynb

Time: 2024-10-03 19:21:44
