背着锄头的互联网农民


Kaggle House Price Prediction (1)

This post makes some initial explorations on the Kaggle House Prices competition.

Feature Correlation

First, use the DataFrame corr function to compute the correlation between SalePrice and the other features; by default it uses the Pearson correlation coefficient. Some features are stored as strings, which corr cannot handle until they are converted to numeric features, so for now we ignore the non-numeric ones. Ranked by correlation with SalePrice, the selected features are: OverallQual, GarageCars, GarageArea, GrLivArea, TotalBsmtSF, 1stFlrSF, YearBuilt, YearRemodAdd, FullBath, TotRmsAbvGrd, which is essentially every feature with correlation above 0.5 or below -0.5.

import pandas as pd

train_data = pd.read_csv('./data/train.csv')
corrmat = train_data.corr()  # Pearson correlation between all numeric columns
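The >0.5 / <-0.5 filter described above can be sketched on a toy frame; the column names below come from the real dataset, but the values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for train.csv: invented numbers, real column names.
df = pd.DataFrame({
    "SalePrice":   [100, 150, 200, 250, 300],
    "OverallQual": [3, 5, 6, 8, 9],
    "MoSold":      [7, 1, 12, 3, 6],
})
corr = df.corr()["SalePrice"].drop("SalePrice")
selected = corr[corr.abs() > 0.5].index.tolist()
print(selected)  # only OverallQual clears the 0.5 threshold
```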

Preprocessing

Fill in missing values and normalize the features selected above; normalization uses min-max scaling.

import numpy as np
import pandas as pd

def pre_process(df):
    # One-hot encode OverallQual
    overallQualOneHot = pd.get_dummies(df["OverallQual"], prefix="OverallQual")
    df = pd.concat([df, overallQualOneHot], axis=1)
    df = df.drop(["OverallQual"], axis=1)

    # Min-max scaling: (x - min) / (max - min)
    max_min_scaler = lambda x: (x - np.min(x)) / (np.max(x) - np.min(x))

    # Fill missing values with the median, then scale
    df["GarageCars"] = df["GarageCars"].fillna(df["GarageCars"].median())
    df["GarageCars"] = df[["GarageCars"]].apply(max_min_scaler)

    df["GarageArea"] = df["GarageArea"].fillna(df["GarageArea"].median())
    df["GarageArea"] = df[["GarageArea"]].apply(max_min_scaler)

    df["GrLivArea"] = df[["GrLivArea"]].apply(max_min_scaler)

    df["TotalBsmtSF"] = df["TotalBsmtSF"].fillna(df["TotalBsmtSF"].median())
    df["TotalBsmtSF"] = df[["TotalBsmtSF"]].apply(max_min_scaler)

    df["1stFlrSF"] = df[["1stFlrSF"]].apply(max_min_scaler)
    df["YearBuilt"] = df[["YearBuilt"]].apply(max_min_scaler)
    df["YearRemodAdd"] = df[["YearRemodAdd"]].apply(max_min_scaler)
    df["FullBath"] = df[["FullBath"]].apply(max_min_scaler)
    df["TotRmsAbvGrd"] = df[["TotRmsAbvGrd"]].apply(max_min_scaler)

    print(df.info())
    return df

Model Training

A plain linear regression model is used for prediction; submitting to the Kaggle platform gives an error of 0.171.

from sklearn.linear_model import LinearRegression

def linear_regress(x_train, y_train):
    model = LinearRegression()
    model.fit(x_train, y_train)
    return model

Normalizing the Regression Target

Linear regression assumes the noise is normally distributed, which in turn means the prediction target should be roughly normal. When the target does not follow a normal distribution, transform it with np.log; the transformed values are then approximately normal. This step improves the regression: after applying it, the error drops to 0.156.

train_data['SalePrice'] = np.log(train_data['SalePrice'])

predict = np.exp(model.predict(X_test))  # at prediction time, apply the exponential to undo the log
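The normality claim can be checked numerically. Below is a minimal sketch using synthetic log-normal "prices" (made-up data, not the actual SalePrice column): the raw values are strongly right-skewed, while their logarithm has skewness near zero.

```python
import numpy as np

# Synthetic right-skewed "prices" drawn from a log-normal distribution.
rng = np.random.default_rng(0)
prices = np.exp(rng.normal(12, 0.4, size=2000))

def skewness(x):
    # Sample skewness: third central moment over cubed standard deviation.
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

print(skewness(prices))          # clearly positive: right-skewed
print(skewness(np.log(prices)))  # near zero: roughly normal after log
```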

Adding Features

The model above was trained on numeric features only, but the dataset also contains many categorical features. Since the corr function cannot measure correlation with categorical features directly, we use sklearn's chi2, f_classif and mutual_info_classif functions to score the relationship between the categorical features and the numeric target.

from sklearn import preprocessing
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

cat_features = ["Neighborhood", "ExterQual", "KitchenQual", "Foundation", "HeatingQC"]

# Encode each categorical column as integer labels
le = preprocessing.LabelEncoder()
for col in cat_features:
    combined_data[col] = le.fit_transform(combined_data[col])

train_data = combined_data[:1460].copy()
train_data['SalePrice'] = np.log(train_data['SalePrice'])

# Score each categorical feature against SalePrice with three different tests
for score_func in (f_classif, chi2, mutual_info_classif):
    for col in cat_features:
        print(score_func(train_data[["SalePrice"]], train_data[col]))
    print("")

This scoring identifies the more relevant categorical features as Neighborhood, ExterQual, KitchenQual, Foundation and HeatingQC. Adding these features to the model with One-Hot encoding and retraining gives an error of 0.150.
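The One-Hot step itself is not shown above; a minimal sketch with pd.get_dummies follows the same pattern used for OverallQual earlier (the sample values here are invented, but the column name is from the real dataset):

```python
import pandas as pd

# Invented sample values for a real column name.
df = pd.DataFrame({"KitchenQual": ["Gd", "TA", "Ex", "Gd"]})
dummies = pd.get_dummies(df["KitchenQual"], prefix="KitchenQual")
# Replace the raw column with its one-hot columns
df = pd.concat([df.drop(columns=["KitchenQual"]), dummies], axis=1)
print(sorted(df.columns))  # KitchenQual_Ex, KitchenQual_Gd, KitchenQual_TA
```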

Model Ensembling

Using stacking to ensemble several models brings the error down to 0.148.

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from mlxtend.regressor import StackingCVRegressor

svr = SVR(kernel='linear')
regressor = LinearRegression()
lasso = Lasso(alpha=0.06)
ridge = Ridge(alpha=1)
rf = RandomForestRegressor(n_estimators=5, random_state=42)
xgboost = XGBRegressor(learning_rate=0.01, n_estimators=3460,
                       max_depth=3, min_child_weight=0,
                       gamma=0, subsample=0.7,
                       colsample_bytree=0.7,
                       objective='reg:linear', nthread=-1,
                       scale_pos_weight=1, seed=27,
                       reg_alpha=0.00006)
stack = StackingCVRegressor(regressors=(ridge, regressor, svr, lasso, rf),
                            meta_regressor=regressor, random_state=42)
stack.fit(X_train, Y_train)