XGBoost使用教程(进阶篇)三

一、Importing all the libraries

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.metrics import accuracy_score
二、Reading the file

还是蘑菇数据集,直接采用Kaggle竞赛中22维特征 https://www.kaggle.com/uciml/mushroom-classification

数据集下载地址:http://download.csdn.net/download/u011630575/10266626

# path to where the data lies
dpath = ‘./data/‘
data = pd.read_csv(dpath+"mushrooms.csv")
data.head(6)
三、Let us check if there is any null values
data.isnull().sum() #检查数据有没有空值
四、check if we have two claasification. Either the mushroom is poisonous or edible
data[‘class‘].unique() #检查是否只有蘑菇的种类,有毒,可使用
print(data.dtypes)
五、check if 22 features(1st one is label) and 8124 instances
data.shape #22个特征 8124个样例 第一个是标签
六、The dataset has values in strings. We need to convert all the unique values to integers. Thus we perform label encoding on the data 标准化标签

from sklearn.preprocessing import LabelEncoder
labelencoder=LabelEncoder() #标准化标签,将标签值统一转换成range(标签值个数-1)范围内
for col in data.columns:
data[col] = labelencoder.fit_transform(data[col])

data.head()
Separating features and label

X = data.iloc[:,1:23] # 获取1-23行特征
y = data.iloc[:, 0] # 获取0行标签
X.head()
y.head()
Splitting the data into training and testing dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=4)
七、default Logistic Regression

from sklearn.linear_model import LogisticRegression
model_LR= LogisticRegression()
model_LR.fit(X_train,y_train)
y_prob = model_LR.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities
y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.
model_LR.score(X_test, y_pred)
注:np.where(condition,x,y) 是三元运算符,conditon条件成立则结果为x,否则为y。

accuracy

auc_roc=metrics.roc_auc_score(y_test,y_pred)
print(auc_roc)
八、Logistic Regression(Tuned model) 调整模型

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn import metrics

LR_model= LogisticRegression()

tuned_parameters = {‘C‘: [0.001, 0.01, 0.1, 1, 10, 100, 1000] ,
‘penalty‘: [‘l1‘,‘l2‘]
}
九、CV

from sklearn.model_selection import GridSearchCV

LR= GridSearchCV(LR_model, tuned_parameters,cv=10)
LR.fit(X_train,y_train)
print(LR.best_params_)
y_prob = LR.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities
y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.
LR.score(X_test, y_pred)
auc_roc=metrics.roc_auc_score(y_test,y_pred)
print(auc_roc)
十、Default Decision Tree model

from sklearn.tree import DecisionTreeClassifier

model_tree = DecisionTreeClassifier()
model_tree.fit(X_train, y_train)
y_prob = model_tree.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities
y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.
model_tree.score(X_test, y_pred)
auc_roc=metrics.roc_auc_score(y_test,y_pred)
auc_roc
十一、Let us tune the hyperparameters of the Decision tree model

from sklearn.tree import DecisionTreeClassifier

model_DD = DecisionTreeClassifier()

tuned_parameters= { ‘max_features‘: ["auto","sqrt","log2"],
‘min_samples_leaf‘: range(1,100,1) , ‘max_depth‘: range(1,50,1)
}
#tuned_parameters= { ‘max_features‘: ["auto","sqrt","log2"] }

#If “auto”, then max_features=sqrt(n_features).
from sklearn.model_selection import GridSearchCV
DD = GridSearchCV(model_DD, tuned_parameters,cv=10)
DD.fit(X_train, y_train)
print(DD.grid_scores_)
print(DD.best_score_)
print(DD.best_params_)
y_prob = DD.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities
y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.
DD.score(X_test, y_pred)
auc_roc=metrics.classification_report(y_test,y_pred)
print(auc_roc)
十二、Default Random Forest
from sklearn.ensemble import RandomForestClassifier

model_RR=RandomForestClassifier()
model_RR.fit(X_train,y_train)
y_prob = model_RR.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities
y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.
model_RR.score(X_test, y_pred)
auc_roc=metrics.roc_auc_score(y_test,y_pred)
auc_roc
十三、Let us tuned the parameters of Random Forest just for the purpose of knowledge
1) max_features

2) n_estimators  估计量

3) min_sample_leaf

from sklearn.ensemble import RandomForestClassifier

model_RR=RandomForestClassifier()

tuned_parameters = {‘min_samples_leaf‘ range(10,100,10), ‘n_estimators‘ : range(10,100,10),
‘max_features‘:[‘auto‘,‘sqrt‘,‘log2‘]
}

from sklearn.model_selection import GridSearchCV
RR = GridSearchCV(model_RR, tuned_parameters,cv=10)

RR.fit(X_train,y_train)

print(RR.grid_scores_)

print(RR.best_score_)

print(RR.best_params_)

y_prob = RR.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities
y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.
RR_model.score(X_test, y_pred)

auc_roc=metrics.roc_auc_score(y_test,y_pred)
auc_roc
十四、Default  XGBoost

from xgboost import XGBClassifier
model_XGB=XGBClassifier()
model_XGB.fit(X_train,y_train)
y_prob = model_XGB.predict_proba(X_test)[:,1] # This will give you positive class prediction probabilities
y_pred = np.where(y_prob > 0.5, 1, 0) # This will threshold the probabilities to give class predictions.
model_XGB.score(X_test, y_pred)
auc_roc=metrics.roc_auc_score(y_test,y_pred)
auc_roc
十五、特征重要性
在XGBoost中特征重要性已经自动算好,存放在featureimportances

print(model_XGB.feature_importances_)
from matplotlib import pyplot
pyplot.bar(range(len(model_XGB.feature_importances_)), model_XGB.feature_importances_)
pyplot.show()
# plot feature importance using built-in function
from xgboost import plot_importance
plot_importance(model_XGB)
pyplot.show()
可以根据特征重要性进行特征选择
from numpy import sort
from sklearn.feature_selection import SelectFromModel

# Fit model using each importance as a threshold
thresholds = sort(model_XGB.feature_importances_)
for thresh in thresholds:
# select features using threshold
selection = SelectFromModel(model_XGB, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
# eval model
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1],
accuracy*100.0))

原文地址:https://www.cnblogs.com/tan2810/p/11154658.html

时间: 2024-10-21 17:16:33

XGBoost使用教程(进阶篇)三的相关文章

Maya基础与建模教程 AE教程进阶篇 3DS MAX影视特效教程 Flash CS4案例教程

热门推荐电脑办公计算机基础知识教程 Excel2010基础教程 Word2010基础教程 PPT2010基础教程 五笔打字视频教程 Excel函数应用教程 Excel VBA基础教程 WPS2013表格教程 更多>平面设计PhotoshopCS5教程 CorelDRAW X5视频教程 Photoshop商业修图教程 Illustrator CS6视频教程 更多>室内设计3Dsmax2012教程 效果图实例提高教程 室内设计实战教程 欧式效果图制作实例教程 AutoCAD2014室内设计 Aut

Java快速教程--vamei 学习笔记(进阶篇)

感谢vamei,学习链接:http://www.cnblogs.com/vamei/archive/2013/03/31/2991531.html Java进阶01 String类 学习链接:http://www.cnblogs.com/vamei/archive/2013/04/08/3000914.html 字符串操作 ---------------------------------------------------------------------------------------

java web进阶篇(三) 表达式语言

表达式语言(Expression Language ,EL)是jsp2.0中新增的功能.可以避免出现许多的Scriptlet代码 格式: ${ 属性名称 },  使用表达式语言可以方便的访问对象中的属性,提交的参数或者进行各种数学运算,而且使用表达式语言最大的特点是如果输出的内容是null,则会自动使用空字符串("")表示. <%request.setAttribute("name", "info");%> <h1>${n

GSON使用的学习笔记,进阶篇(三)

本篇笔记内容比较杂乱,没有专门去整理. TypeAdapter 现在轮到TypeAdapter类上场,但考虑到gson默认行为已足够强大,加上项目实践中应用json时场景不会太复杂,所以一般不需要自定义TypeAdapter.TypeAdapter优点是集成了JsonWriter和JsonReader两个类,定义了一套与gson框架交互的良好接口,同时便于管理编码和解码的实现代码,不至于太零碎.因而在了解JsonReader和JsonWriter的使用方法之后,自定义TypeAdapter类来完

JSON进阶第三篇 apache多域名及JSON的跨域问题(JSONP)

本文先介绍如何为apache配置多域名,然后再用JSONP(JSON with Padding)来解决JSON的跨域问题. 阅读本文之前,推荐先参阅<JSON进阶第二篇AJAX方式传递JSON数据>. 一.apache配置多域名 在apache的conf目录下找到httpd.conf,然后在该文件最后增加如下内容: # 声明使用虚拟主机过滤规则 NameVirtualHost*:80 #虚拟主机localhost <VirtualHost*:80> ServerName  loca

《手把手教你》系列进阶篇之2-python+ selenium自动化测试 - python基础扫盲(详细教程)

1. 简介 这篇文章主要是分享讲解一下,如何封装自己用到的方法和类.以便方便自己和别人的调用,这样就可以避免重复地再造轮子. 封装(Encapsulation)是面向对象的三大特征之一(另外两个是继承和多态),它指的是将对象的状态信息隐藏在对象内部,不允许外部程序直接访问对象内部信息,而是通过该类所提供的方法来实现对内部信息的操作和访问. 就好比使用计算机,我们只需要使用计算机提供的键盘,就可以达到操作计算机的目的,至于在敲击键盘时计算机内部是如何工作,我们根本不需要知道. 封装机制保证了类内部

[转]抢先Mark!微信公众平台开发进阶篇资源集锦

FROM : http://www.csdn.net/article/2014-08-01/2820986 由CSDN和<程序员>杂志联合主办的 2014年微信开发者大会 将于8月23日在北京举行.作为一线微信开发商云集.专注在开发实践方面的顶级技术活动,演讲话题极为丰富,涵盖了微信开发不同维度的多个层内容 (首批议程发布),包括:企业服务号开发和高级应用.企业号开发.如何与业务系统对接.各种高级接口功能.智能客服与LBS.HTML5社交应用.微信支付.微信电商开发等多方面(查看 参加微信开发

WDA-文档-基础篇/进阶篇/讨论篇

本文介绍SAP官方Dynpro开发文档NET310,以及资深开发顾问编写的完整教程. 链接:http://pan.baidu.com/s/1eR9axpg 密码:kf5m NET310 ABAP Web Dynpro目录 单元1: Web Dynpro 简介. . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 单元2: Web Dynpro 控制器. . . . . . . . . . . . . . . . . . . . . .

Python之路【第十七篇】:Django【进阶篇 】

Python之路[第十七篇]:Django[进阶篇 ] Model 到目前为止,当我们的程序涉及到数据库相关操作时,我们一般都会这么搞: 创建数据库,设计表结构和字段 使用 MySQLdb 来连接数据库,并编写数据访问层代码 业务逻辑层去调用数据访问层执行数据库操作 import MySQLdb def GetList(sql): db = MySQLdb.connect(user='root', db='wupeiqidb', passwd='1234', host='localhost')

C++混合编程之idlcpp教程Python篇(4)

上一篇在这 C++混合编程之idlcpp教程Python篇(3) 第一篇在这 C++混合编程之idlcpp教程(一) 与前面的工程相似,工程PythonTutorial2中,同样加入了三个文件 PythonTutorial2.cpp, Tutorial2.i, tutorial2.py.其中PythonTutorial2.cpp的内容基本和PythonTutorial1.cpp雷同,不再赘述.首先看一下Tutorial2.i的内容: namespace tutorial { struct Poi