背着锄头的互联网农民


Kaggle House Price Prediction (1)

This post makes some initial explorations on the Kaggle House Prices competition.

Feature Correlation

First, use the DataFrame corr function to compute the correlation between SalePrice and the other features; by default it uses the Pearson correlation coefficient. Some features are stored as strings, which corr cannot handle until they are converted to numeric features, so for now we ignore the non-numeric ones. Ranked by correlation with SalePrice, the selected features are: OverallQual, GarageCars, GarageArea, GrLivArea, TotalBsmtSF, 1stFlrSF, YearBuilt, YearRemodAdd, FullBath, TotRmsAbvGrd, which is essentially every feature with correlation above 0.5 or below -0.5.

import pandas as pd

train_data = pd.read_csv('./data/train.csv')
corrmat = train_data.corr()  # Pearson correlation between all numeric columns
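The >0.5 / <-0.5 filter described above can be sketched on a toy frame; the column names below come from the real dataset, but the values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for train.csv: invented numbers, real column names.
df = pd.DataFrame({
    "SalePrice":   [100, 150, 200, 250, 300],
    "OverallQual": [3, 5, 6, 8, 9],
    "MoSold":      [7, 1, 12, 3, 6],
})
corr = df.corr()["SalePrice"].drop("SalePrice")
selected = corr[corr.abs() > 0.5].index.tolist()
print(selected)  # only OverallQual clears the 0.5 threshold
```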

Preprocessing

Fill in missing values and normalize the features selected above; normalization uses min-max scaling.

import numpy as np
import pandas as pd

def pre_process(df):
    # One-hot encode OverallQual
    overallQualOneHot = pd.get_dummies(df["OverallQual"], prefix="OverallQual")
    df = pd.concat([df, overallQualOneHot], axis=1)
    df = df.drop(["OverallQual"], axis=1)

    # Min-max scaling: (x - min) / (max - min)
    max_min_scaler = lambda x: (x - np.min(x)) / (np.max(x) - np.min(x))

    # Fill missing values with the median, then scale
    df["GarageCars"] = df["GarageCars"].fillna(df["GarageCars"].median())
    df["GarageCars"] = df[["GarageCars"]].apply(max_min_scaler)

    df["GarageArea"] = df["GarageArea"].fillna(df["GarageArea"].median())
    df["GarageArea"] = df[["GarageArea"]].apply(max_min_scaler)

    df["GrLivArea"] = df[["GrLivArea"]].apply(max_min_scaler)

    df["TotalBsmtSF"] = df["TotalBsmtSF"].fillna(df["TotalBsmtSF"].median())
    df["TotalBsmtSF"] = df[["TotalBsmtSF"]].apply(max_min_scaler)

    df["1stFlrSF"] = df[["1stFlrSF"]].apply(max_min_scaler)
    df["YearBuilt"] = df[["YearBuilt"]].apply(max_min_scaler)
    df["YearRemodAdd"] = df[["YearRemodAdd"]].apply(max_min_scaler)
    df["FullBath"] = df[["FullBath"]].apply(max_min_scaler)
    df["TotRmsAbvGrd"] = df[["TotRmsAbvGrd"]].apply(max_min_scaler)

    print(df.info())
    return df

Model Training

A plain linear regression model is used for prediction; submitting to the Kaggle platform gives an error of 0.171.

from sklearn.linear_model import LinearRegression

def linear_regress(x_train, y_train):
    model = LinearRegression()
    model.fit(x_train, y_train)
    return model

Normalizing the Regression Target

Linear regression assumes the noise is normally distributed, which in turn means the prediction target should be roughly normal. When the target does not follow a normal distribution, transform it with np.log; the transformed values are then approximately normal. This step improves the regression: after applying it, the error drops to 0.156.

train_data['SalePrice'] = np.log(train_data['SalePrice'])

predict = np.exp(model.predict(X_test))  # at prediction time, apply the exponential to undo the log
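The normality claim can be checked numerically. Below is a minimal sketch using synthetic log-normal "prices" (made-up data, not the actual SalePrice column): the raw values are strongly right-skewed, while their logarithm has skewness near zero.

```python
import numpy as np

# Synthetic right-skewed "prices" drawn from a log-normal distribution.
rng = np.random.default_rng(0)
prices = np.exp(rng.normal(12, 0.4, size=2000))

def skewness(x):
    # Sample skewness: third central moment over cubed standard deviation.
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

print(skewness(prices))          # clearly positive: right-skewed
print(skewness(np.log(prices)))  # near zero: roughly normal after log
```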

Adding Features

The model above was trained on numeric features only, but the dataset also contains many categorical features. Since the corr function cannot measure correlation with categorical features directly, we use sklearn's chi2, f_classif and mutual_info_classif functions to score the relationship between the categorical features and the numeric target.

from sklearn import preprocessing
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

cat_features = ["Neighborhood", "ExterQual", "KitchenQual", "Foundation", "HeatingQC"]

# Encode each categorical column as integer labels
le = preprocessing.LabelEncoder()
for col in cat_features:
    combined_data[col] = le.fit_transform(combined_data[col])

train_data = combined_data[:1460].copy()
train_data['SalePrice'] = np.log(train_data['SalePrice'])

# Score each categorical feature against SalePrice with three different tests
for score_func in (f_classif, chi2, mutual_info_classif):
    for col in cat_features:
        print(score_func(train_data[["SalePrice"]], train_data[col]))
    print("")

This scoring identifies the more relevant categorical features as Neighborhood, ExterQual, KitchenQual, Foundation and HeatingQC. Adding these features to the model with One-Hot encoding and retraining gives an error of 0.150.
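The One-Hot step itself is not shown above; a minimal sketch with pd.get_dummies follows the same pattern used for OverallQual earlier (the sample values here are invented, but the column name is from the real dataset):

```python
import pandas as pd

# Invented sample values for a real column name.
df = pd.DataFrame({"KitchenQual": ["Gd", "TA", "Ex", "Gd"]})
dummies = pd.get_dummies(df["KitchenQual"], prefix="KitchenQual")
# Replace the raw column with its one-hot columns
df = pd.concat([df.drop(columns=["KitchenQual"]), dummies], axis=1)
print(sorted(df.columns))  # KitchenQual_Ex, KitchenQual_Gd, KitchenQual_TA
```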

Model Ensembling

Using stacking to ensemble several models brings the error down to 0.148.

from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from mlxtend.regressor import StackingCVRegressor

svr = SVR(kernel='linear')
regressor = LinearRegression()
lasso = Lasso(alpha=0.06)
ridge = Ridge(alpha=1)
rf = RandomForestRegressor(n_estimators=5, random_state=42)
xgboost = XGBRegressor(learning_rate=0.01, n_estimators=3460,
                       max_depth=3, min_child_weight=0,
                       gamma=0, subsample=0.7,
                       colsample_bytree=0.7,
                       objective='reg:linear', nthread=-1,
                       scale_pos_weight=1, seed=27,
                       reg_alpha=0.00006)
stack = StackingCVRegressor(regressors=(ridge, regressor, svr, lasso, rf),
                            meta_regressor=regressor, random_state=42)
stack.fit(X_train, Y_train)