Kaggle比赛(二)House Prices: Advanced Regression Techniques

房价预测是我入门Kaggle的第二个比赛,参考学习了他人的一篇优秀教程:https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

通过Serigne的这篇notebook,我学习到了关于数据分析、特征工程、集成学习等等很多有用的知识,在这里感谢一下这位大佬。

本篇文章立足于Serigne的教程,将他的大部分代码实现了一遍,修正了个别小错误,也加入了自己的一些视角和思考,做了一些自认为reasonable的“改进”。虽然最后结果反而变差了一点(没有道理啊!QAQ),但我觉得整个实践的过程仍然值得记录一下。

废话不多说,下面进入正文。

数据集概览

导入相关Python包:

#import some necessary librairies

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt  # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)

from scipy import stats
from scipy.stats import norm, skew #for some statistics

pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8")) #check the files available in the directory

sample_submission.csv
test.csv
train.csv

读取csv文件:

train = pd.read_csv('datasets/train.csv')
test = pd.read_csv('datasets/test.csv')

查看数据集前5行:

train.head(5)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.000 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.000 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.000 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.000 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.000 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

test.head(5)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.000 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.000 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.000 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.000 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.000 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

查看训练、测试集的大小:

#check the numbers of samples and features
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))

#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']

#Now drop the  'Id' colum since it's unnecessary for  the prediction process.
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

#check again the data size after dropping the 'Id' variable
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape))
print("The test data size after dropping Id feature is : {} ".format(test.shape))

The train data size before dropping Id feature is : (1460, 81)
The test data size before dropping Id feature is : (1459, 80)

The train data size after dropping Id feature is : (1460, 80)
The test data size after dropping Id feature is : (1459, 79)

特征工程

离群值处理

通过绘制散点图可以直观地看出特征是否有离群值,这里以GrLivArea为例。

fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

我们可以看到图像右下角的两个点有着很大的GrLivArea,但相应的SalePrice却异常地低,我们有理由相信它们是离群值,要将其剔除。

#Deleting outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)

#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

需要注意删除离群值并不总是安全的,我们不能也不必将所有的离群值全部剔除,因为测试集中依然会有一些离群值。用带有一定噪声的数据训练出的模型会具有更高的鲁棒性,从而在测试集中表现得更好。

目标值分析

SalePrice是我们将要预测的目标,有必要对其进行分析和处理。

我们画出SalePrice的分布图和QQ图(Quantile Quantile Plot)。这里简单说一下QQ图,它是由标准正态分布的分位数为横坐标,样本值为纵坐标的散点图。如果QQ图上的点在一条直线附近,则说明数据近似于正态分布,且该直线的斜率为标准差,截距为均值。对于QQ图的详细介绍可以参考这篇文章:https://blog.csdn.net/hzwwpgmwy/article/details/79178485

sns.distplot(train['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

mu = 180932.92 and sigma = 79467.79

SalePrice的分布呈正偏态,而线性回归模型要求因变量服从正态分布。我们对其做对数变换,让数据接近正态分布。

#We use the numpy fuction log1p which  applies log(1+x) to all elements of the column
train["SalePrice"] = np.log1p(train["SalePrice"])

#Check the new distribution
sns.distplot(train['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

mu = 12.02 and sigma = 0.40

正态分布的数据有很多好的性质,使得后续的模型训练有更好的效果。另一方面,由于这次比赛最终是对预测值的对数的误差进行评估,所以我们在本地测试的时候也应该用同样的标准。

特征相关性

相关性矩阵热图表现特征与目标值之间以及两两特征之间的相关程度,对特征的处理有指导意义。

#Correlation map to see how features are correlated with SalePrice
corrmat = train.corr()
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=0.9, square=True)

缺失值处理

首先将训练集和测试集合并在一起:

ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))

all_data size is : (2917, 79)

统计各个特征的缺失情况:

all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)
Missing Ratio
PoolQC 99.691
MiscFeature 96.400
Alley 93.212
Fence 80.425
FireplaceQu 48.680
LotFrontage 16.661
GarageQual 5.451
GarageCond 5.451
GarageFinish 5.451
GarageYrBlt 5.451
GarageType 5.382
BsmtExposure 2.811
BsmtCond 2.811
BsmtQual 2.777
BsmtFinType2 2.743
BsmtFinType1 2.708
MasVnrType 0.823
MasVnrArea 0.788
MSZoning 0.137
BsmtFullBath 0.069
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)

data_description.txt中已有说明,一部分特征值的缺失是因为这些房子根本没有该项特征,对于这种情况我们统一用“None”或者“0”来填充。

all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
all_data["Alley"] = all_data["Alley"].fillna("None")
all_data["Fence"] = all_data["Fence"].fillna("None")
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')

对于缺失较少的离散型特征,可以用众数填补缺失值。

all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])

对于LotFrontage项,由于每个Neighborhood的房子的LotFrontage很可能是比较相近的,所以我们可以用各个房子所在Neighborhood的LotFrontage的中位数作为填充值。

#Group by neighborhood and fill in missing value by the median LotFrontage of all the neighborhood
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

data_description.txt中还提到过,Functional默认是“Typ”。

all_data["Functional"] = all_data["Functional"].fillna("Typ")

Utilities特征有两个缺失值,且只有一个样本是“NoSeWa”,除此之外全部都是“AllPub”,因此该项特征的方差非常小,我们可以直接将其删去。

all_data = all_data.drop(['Utilities'], axis=1)

最后确认缺失值是否已全部处理完毕:

all_data.isnull().sum().max()

0

进一步挖掘特征

我们注意到有些特征虽然是数值型的,但其实表征的只是不同类别,其数值的大小并没有实际意义,因此我们将其转化为类别特征。

all_data['MSSubClass'] = all_data['MSSubClass'].astype(str)
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)

反过来,有些类别特征实际上有高低好坏之分,这些特征的质量越高,就可能在一定程度导致房价越高。我们将这些特征的类别映射成有大小的数字,以此来表征这种潜在的偏序关系。

all_data['FireplaceQu'] = all_data['FireplaceQu'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0})
all_data['GarageQual'] = all_data['GarageQual'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0})
all_data['GarageCond'] = all_data['GarageCond'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0})
all_data['GarageFinish'] = all_data['GarageFinish'].map({'Fin': 3, 'RFn': 2, 'Unf': 1, 'None': 0})
all_data['BsmtQual'] = all_data['BsmtQual'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0})
all_data['BsmtCond'] = all_data['BsmtCond'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0})
all_data['BsmtExposure'] = all_data['BsmtExposure'].map({'Gd': 4, 'Av': 3, 'Mn': 2, 'No': 1, 'None': 0})
all_data['BsmtFinType1'] = all_data['BsmtFinType1'].map({'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'LwQ': 2, 'Unf': 1, 'None': 0})
all_data['BsmtFinType2'] = all_data['BsmtFinType2'].map({'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'LwQ': 2, 'Unf': 1, 'None': 0})
all_data['ExterQual'] = all_data['ExterQual'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0})
all_data['ExterCond'] = all_data['ExterCond'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0})
all_data['HeatingQC'] = all_data['HeatingQC'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0})
all_data['PoolQC'] = all_data['PoolQC'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0})
all_data['KitchenQual'] = all_data['KitchenQual'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0})
all_data['Functional'] = all_data['Functional'].map({'Typ': 8, 'Min1': 7, 'Min2': 6, 'Mod': 5, 'Maj1': 4, 'Maj2': 3, 'Sev': 2, 'Sal': 1, 'None': 0})
all_data['Fence'] = all_data['Fence'].map({'GdPrv': 4, 'MnPrv': 3, 'GdWo': 2, 'MnWw': 1, 'None': 0})
all_data['LandSlope'] = all_data['LandSlope'].map({'Gtl': 3, 'Mod': 2, 'Sev': 1, 'None': 0})
all_data['LotShape'] = all_data['LotShape'].map({'Reg': 4, 'IR1': 3, 'IR2': 2, 'IR3': 1, 'None': 0})
all_data['PavedDrive'] = all_data['PavedDrive'].map({'Y': 3, 'P': 2, 'N': 1, 'None': 0})
all_data['Street'] = all_data['Street'].map({'Pave': 2, 'Grvl': 1, 'None': 0})
all_data['Alley'] = all_data['Alley'].map({'Pave': 2, 'Grvl': 1, 'None': 0})
all_data['CentralAir'] = all_data['CentralAir'].map({'Y': 1, 'N': 0})

利用一些重要的特征构造更多的特征:

all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
all_data['OverallQual_TotalSF'] = all_data['OverallQual'] * all_data['TotalSF']
all_data['OverallQual_GrLivArea'] = all_data['OverallQual'] * all_data['GrLivArea']
all_data['OverallQual_TotRmsAbvGrd'] = all_data['OverallQual'] * all_data['TotRmsAbvGrd']
all_data['GarageArea_YearBuilt'] = all_data['GarageArea'] + all_data['YearBuilt']

Box-Cox变换

对于数值型特征,我们希望它们尽量服从正态分布,也就是不希望这些特征出现正负偏态。

那么我们先来计算一下各个特征的偏度:

numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewness = pd.DataFrame({'Skew': skewed_feats})
skewness.head(10)
Skew
MiscVal 21.940
PoolQC 19.549
PoolArea 17.689
LotArea 13.109
LowQualFinSF 12.085
3SsnPorch 11.372
KitchenAbvGr 4.301
BsmtFinSF2 4.145
Alley 4.137
EnclosedPorch 4.002

可以看到这些特征的偏度较高,需要做适当的处理。这里我们对数值型特征做Box-Cox变换,以改善数据的正态性、对称性和方差相等性。更多关于Box-Cox变换的知识可以参考这篇博客:https://blog.csdn.net/sinat_26917383/article/details/77864582

skewness = skewness[abs(skewness['Skew']) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    all_data[feat] = boxcox1p(all_data[feat], lam)

There are 41 skewed numerical features to Box Cox transform

独热编码

对于类别特征,我们将其转化为独热编码,这样既解决了模型不好处理属性数据的问题,在一定程度上也起到了扩充特征的作用。

all_data = pd.get_dummies(all_data)
print(all_data.shape)

(2917, 254)

现在我们有了经过处理后的训练集和测试集:

train = all_data[:ntrain]
test = all_data[ntrain:]

至此,特征工程就算完成了。

建立模型

导入算法包:

from sklearn.linear_model import ElasticNet, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

标准化

由于数据集中依然存在一定的离群点,我们首先用Robustscaler对数据进行标准化处理。

scaler = RobustScaler()
train = scaler.fit_transform(train)
test = scaler.transform(test)

评价函数

先定义一个评价函数。我们采用5折交叉验证。与比赛的评价标准一致,我们用Root-Mean-Squared-Error (RMSE)来为每个模型打分。

#Validation function
n_folds = 5

def rmsle_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
    rmse= np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)

基本模型

  • 套索回归
lasso = Lasso(alpha=0.0005, random_state=1)
  • 弹性网络
ENet = ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3)
  • 核岭回归
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
  • 梯度提升回归
GBoost = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10,
                                   loss='huber', random_state =5)
  • XGBoost
model_xgb = xgb.XGBRegressor(colsample_bytree=0.5, gamma=0.05,
                             learning_rate=0.05, max_depth=3,
                             min_child_weight=1.8, n_estimators=1000,
                             reg_alpha=0.5, reg_lambda=0.8,
                             subsample=0.5, silent=1,
                             random_state =7, nthread = -1)
  • LightGBM
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=1000,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)

看看它们的表现如何:

score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(model_xgb)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(model_lgb)
print("LGBM score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))

Lasso score: 0.1115 (0.0073)

ElasticNet score: 0.1115 (0.0073)

Kernel Ridge score: 0.1189 (0.0045)

Gradient Boosting score: 0.1140 (0.0085)

Xgboost score: 0.1185 (0.0081)

LGBM score: 0.1175 (0.0079)

Stacking方法

集成学习往往能进一步提高模型的准确性,Stacking是其中一种效果颇好的方法,简单来说就是学习各个基本模型的预测值来预测最终的结果。详细步骤可参考:https://www.jianshu.com/p/59313f43916f

这里我们用ENetKRRGBoost作为第一层学习器,用Lasso作为第二层学习器:

class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)

        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred

        # Now train the cloned  meta-model using the out-of-fold predictions as new feature
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    #Do the predictions of all base models on the test data and use the averaged predictions as
    #meta-features for the final prediction which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)

Stacking的交叉验证评分:

stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR),
                                                 meta_model = lasso)
score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))

Stacking Averaged models score: 0.1081 (0.0085)

我们得到了比单个基学习器更好的分数。

建立最终模型

我们将XGBoostLightGBMStackedRegressor加权平均的方式融合在一起,建立最终的预测模型。

先定义一个评价函数:

def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

用整个训练集训练模型,预测测试集的房价,并给出模型在训练集上的评分。

  • StackedRegressor
stacked_averaged_models.fit(train, y_train)
stacked_train_pred = stacked_averaged_models.predict(train)
stacked_pred = np.expm1(stacked_averaged_models.predict(test))
print(rmsle(y_train, stacked_train_pred))

0.08464515778854238

  • XGBoost
model_xgb.fit(train, y_train)
xgb_train_pred = model_xgb.predict(train)
xgb_pred = np.expm1(model_xgb.predict(test))
print(rmsle(y_train, xgb_train_pred))

0.08362948457258125

  • LightGBM
model_lgb.fit(train, y_train)
lgb_train_pred = model_lgb.predict(train)
lgb_pred = np.expm1(model_lgb.predict(test))
print(rmsle(y_train, lgb_train_pred))

0.06344397467222622

融合模型的评分:

print('RMSLE score on train data:')
print(rmsle(y_train,stacked_train_pred*0.70 + xgb_train_pred*0.15 + lgb_train_pred*0.15))

RMSLE score on train data:
0.07939492590501797

预测

ensemble = stacked_pred*0.70 + xgb_pred*0.15 + lgb_pred*0.15

生成提交文件

sub = pd.DataFrame()
sub['Id'] = test_ID
sub['SalePrice'] = ensemble
sub.to_csv('submission.csv',index=False)

原文地址:https://www.cnblogs.com/adamding/p/11390487.html

时间: 2024-10-11 21:16:16

Kaggle比赛(二)House Prices: Advanced Regression Techniques的相关文章

Kaggle比赛总结

做完 Kaggle 比赛已经快五个月了,今天来总结一下,为秋招做个准备. 题目要求:根据主办方提供的超过 4 天约 2 亿次的点击数据,建立预测模型预测用户是否会在点击移动应用广告后下载应用程序. 数据集特点: 数据量很大,有 2 亿条之多 数据是不平衡的,点击下载的数量远远小于没有点击下载的数量 不平衡数据集的处理思路: 一般对样本进行 上采样 和 下采样,顾名思义就是 多的样本少采一点,少的样本多采一点.极端情况下,就是样本太多的时候,就可以做增强学习,就是我给我的少样本增加噪音.但是由于我

Kaggle比赛:从何着手?

介绍 参加Kaggle比赛,我必须有哪些技能呢? 你有没有面对过这样的问题?最少在我大二的时候,我有过.过去我仅仅想象Kaggle比赛的困难度,我就感觉害怕.这种恐惧跟我怕水的感觉相似.怕水,让我无法参加一些游泳课程.然而,后来,我得到的教训是只要你不真的跨进水里,你就不知道水有多深.相同的哲学对Kaggle也一样适用.没有试过之前不要下结论. Kaggle,数据科学的家园,为竞赛参与者,客户解决方案和招聘求职提供了一个全球性的平台.这是Kaggle的特殊吸引力,它提供的竞赛不仅让你站到不同的高

7 Types of Regression Techniques you should know!

7 Types of Regression Techniques you should know! Introduction Linear and Logistic regressions are usually the first algorithms people learn in predictive modeling. Due to their popularity, a lot of analysts even end up thinking that they are the onl

7 Types of Regression Techniques

https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/ What is Regression Analysis? Why do we use Regression Analysis? What are the types of Regressions? Linear Regression Logistic Regression Polynomial Regression Stepwise Regre

记一次失败的kaggle比赛(3):失败在什么地方,贪心筛选特征、交叉验证、blending

今天这个比赛结束了,结果可以看:https://www.kaggle.com/c/santander-customer-satisfaction/leaderboard public结果: private结果: 首先对比private和public的结果,可以发现: 1)几乎所有的人都overfitting了:或者说private的另一半测试数据比public的那一半测试数据更不规律. 2)private的前十名有5个是在public中排不进前几百,有四个甚至排在1000名到2000名之间:说明

codevs1959拔河比赛(二维费用背包)

1959 拔河比赛 时间限制: 1 s 空间限制: 128000 KB 题目等级 : 黄金 Gold 题目描述 Description 一个学校举行拔河比赛,所有的人被分成了两组,每个人必须(且只能够)在其中的一组,要求两个组的人数相差不能超过1,且两个组内的所有人体重加起来尽可能地接近. 输入描述 Input Description 数据的第1行是一个n,表示参加拔河比赛的总人数,n<=100,接下来的n行表示第1到第n个人的体重,每个人的体重都是整数(1<=weight<=450).

kaggle比赛流程

一.比赛概述 不同比赛有不同的任务,分类.回归.推荐.排序等.比赛开始后训练集和测试集就会开放下载. 比赛通常持续 2 ~ 3 个月,每个队伍每天可以提交的次数有限,通常为 5 次. 比赛结束前一周是一个 Deadline,在这之后不能再组队,也不能再新加入比赛.所以想要参加比赛请务必在这一 Deadline 之前有过至少一次有效的提交. 一般情况下在提交后会立刻得到得分的反馈.不同比赛会采取不同的评分基准,可以在分数栏最上方看到使用的评分方法. 反馈的分数是基于测试集的一部分计算的,剩下的另一

我的第一个 Kaggle 比赛学习 - Titanic

背景 Titanic: Machine Learning from Disaster - Kaggle 2 年前就被推荐照着这个比赛做一下,结果我打开这个页面便蒙了,完全不知道该如何下手. 两年后,再次打开这个页面,看到清清楚楚的Titanic Tutorial - Kaggle,完全傻瓜式的照着做就能做下来.当年是什么蒙蔽了我的眼睛~ Target use machine learning to create a model that predicts which passengers sur

收集几个打算做的kaggle比赛

1.  Display Advertising Challenge 目测是关于广告CTR预估的. .新实习看起来是做广告的,恶补一个. 2. Quora Question Pairs 美丽的小姐姐给我推荐的比赛,判断两个quora上的问题是否是一样的.一颗赛艇.