Titanic Survival Prediction (1)

This post is a beginner's walkthrough of the Titanic problem on Kaggle. It builds a simple xgboost baseline model that reaches 75% accuracy.

Preprocessing

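The Titanic data has 12 columns: PassengerId (passenger ID), Survived (whether the passenger survived), Pclass (ticket class), Name (passenger name), Sex, Age, SibSp (number of siblings and spouses aboard), Parch (number of parents and children aboard), Ticket (ticket number), Fare (ticket fare), Cabin (cabin), and Embarked (port of embarkation). xgboost requires numeric input, so the non-numeric columns have to be converted to numeric ones. xgboost can in fact handle missing values on its own, so filling them is not strictly necessary; the code here fills them anyway. The preprocessing code for each column is as follows: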
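Before writing a handler per column, it helps to check which columns are non-numeric and which contain missing values. The following is a minimal sketch with pandas; the rows are made-up illustrative values, not real records from train.csv.

```python
import pandas as pd

# Made-up rows that mimic a few Titanic columns (illustrative only)
sample = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 30.0],
    "Fare": [7.25, 71.28, None],
    "Embarked": ["S", "C", None],
})

# Columns that are not numeric and so need encoding before xgboost sees them
non_numeric = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]

# Number of missing values in each column
missing = sample.isnull().sum()

print(non_numeric)
print(missing[missing > 0])
```

On the real data the same two checks can be run on the DataFrame loaded from train.csv. The per-column functions that follow deal with each of these cases.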

def pre_process_passengerId(df):
    return df.drop(['PassengerId'], axis=1)

def pre_process_pclass(df):
    return df

# Preprocess Name: for now, assume the name is unrelated to survival and drop the column
def pre_process_name(df):
    return df.drop(['Name'], axis=1)

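# Preprocess Sex: encode as 0/1 (female = 1, everything else = 0)
def pre_process_sex(df):
    # Assign the whole column at once; chained indexing such as
    # df.Sex1[mask] = 1 may silently write to a copy
    df["Sex1"] = (df["Sex"] == 'female').astype(int)
    return df.drop(['Sex'], axis=1)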

# Fill missing Age values with the mean age (hard-coded here as 29.69)
def pre_process_age(df):
    df.loc[df.Age.isnull(), 'Age'] = 29.69
    return df

def pre_process_sibesp(df):
    return df

def pre_process_parch(df):
    return df

# For now, assume Ticket is unrelated to survival and drop the column
def pre_process_ticket(df):
    return df.drop(['Ticket'], axis=1)

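# Fill missing Fare values with the mode (the most frequent fare)
def pre_process_fare(df):
    # Reassign instead of using inplace=True on a column selection
    df["Fare"] = df["Fare"].fillna(df["Fare"].mode()[0])
    return df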

# For now, assume Cabin is unrelated to survival and drop the column
def pre_process_cabin(df):
    return df.drop(["Cabin"], axis=1)

# Convert the Embarked strings to 0/1/2 codes (S = 0, C = 1, Q = 2)
def pre_process_embark(df):
    # Fill missing ports with the most frequent one before encoding
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    df["Embarked1"] = df["Embarked"].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
    return df.drop(["Embarked"], axis=1)
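The 0/1/2 codes for Embarked impose an arbitrary order on the three ports. One alternative, which the code in this post does not use, is one-hot encoding with pandas.get_dummies. The sketch below runs on a made-up toy frame; whether it actually helps a tree model such as xgboost would have to be checked on the real data.

```python
import pandas as pd

# Toy frame with made-up values (illustrative only)
toy = pd.DataFrame({
    "Embarked": ["S", None, "Q", "S", "C"],
    "Pclass": [3, 1, 2, 3, 1],
})

# Fill the missing port with the most frequent one, as in pre_process_embark
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Replace Embarked with one 0/1 indicator column per port
one_hot = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

print(one_hot)
```

Each port gets its own column (Embarked_C, Embarked_Q, Embarked_S), so no port is treated as "larger" than another.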

The xgboost model

import xgboost as xgb

def xgboostClassify(X_train, y_train, X_test, y_test):
    # X_test and y_test are accepted but not used yet
    model = xgb.XGBClassifier(max_depth=10, subsample=0.1, colsample_bytree=0.1, learning_rate=0.4, n_estimators=20)
    model.fit(X_train, y_train)
    return model
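xgboostClassify accepts X_test and y_test but never uses them. One way to put them to work is to hold out part of the labeled training data and measure accuracy on it before submitting. The sketch below is an assumption about how that could look, not code from this post: split_holdout, accuracy and AlwaysZero are hypothetical names, and AlwaysZero is only a stand-in for a fitted classifier so the example runs without training anything.

```python
import pandas as pd

def split_holdout(df, frac=0.2, seed=0):
    """Return (train, holdout), with `frac` of the rows held out at random.

    Assumes df has a unique index.
    """
    holdout = df.sample(frac=frac, random_state=seed)
    train = df.drop(holdout.index)
    return train, holdout

def accuracy(model, X, y):
    """Fraction of rows where model.predict(X) equals the label in y."""
    preds = pd.Series(list(model.predict(X)), index=y.index)
    return float((preds == y).mean())

class AlwaysZero:
    """Stand-in for a fitted classifier (illustration only): predicts 0 everywhere."""
    def predict(self, X):
        return [0] * len(X)

# Made-up labeled rows (illustrative only)
toy = pd.DataFrame({
    "Sex1":     [0, 1, 1, 0, 0, 1, 0, 0, 1, 0],
    "Survived": [0, 1, 1, 0, 0, 1, 0, 0, 0, 0],
})

train, holdout = split_holdout(toy)
score = accuracy(AlwaysZero(), holdout.drop(columns=["Survived"]), holdout["Survived"])
print(len(train), len(holdout), score)
```

With the real data, the model returned by xgboostClassify would take the place of AlwaysZero, and the hold-out score gives a rough estimate of the leaderboard accuracy.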

Main flow

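import pandas as pd

train_data = pd.read_csv('./data/train.csv')
test_data = pd.read_csv('./data/test.csv')
# test.csv has no Survived column; add a placeholder so both frames share the same columns
test_data["Survived"] = 0
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
combined_data = pd.concat([train_data, test_data], ignore_index=True)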

combined_data = pre_process_passengerId(combined_data)
combined_data = pre_process_pclass(combined_data)
combined_data = pre_process_name(combined_data)
combined_data = pre_process_sex(combined_data)
combined_data = pre_process_age(combined_data)
combined_data = pre_process_sibesp(combined_data)
combined_data = pre_process_parch(combined_data)
combined_data = pre_process_ticket(combined_data)
combined_data = pre_process_fare(combined_data)
combined_data = pre_process_cabin(combined_data)
combined_data = pre_process_embark(combined_data)
# info() prints its summary itself and returns None
combined_data.info()

# The first 891 rows come from train.csv, the rest from test.csv
train_data = combined_data[:891]
test_data = combined_data[891:]

X_train = train_data.drop(['Survived'], axis=1)
Y_train = train_data['Survived']
model = xgboostClassify(X_train, Y_train, None, None)

X_output = test_data.drop(['Survived'], axis=1)
# Write the predictions to submission.csv (output() is a helper not shown in this post)
output(model, X_output)
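The output helper is not shown in this post. As a rough sketch of what such a helper might do, the function below writes a two-column file with PassengerId and Survived, which is the format Kaggle expects for this competition. The name write_submission and its signature are assumptions. It takes the passenger IDs as a separate argument because the main flow drops PassengerId before prediction. StubModel is a stand-in used only so the example runs without a trained model.

```python
import os
import tempfile

import pandas as pd

def write_submission(model, X, passenger_ids, path="submission.csv"):
    """Predict on X and write PassengerId/Survived rows to `path` (hypothetical helper)."""
    preds = [int(p) for p in model.predict(X)]
    submission = pd.DataFrame({
        "PassengerId": list(passenger_ids),
        "Survived": preds,
    })
    submission.to_csv(path, index=False)
    return submission

class StubModel:
    """Stand-in for a fitted classifier (illustration only): predicts Sex1 as the label."""
    def predict(self, X):
        return list(X["Sex1"])

# Demo on two made-up rows, written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
demo = write_submission(StubModel(), pd.DataFrame({"Sex1": [1, 0]}), [892, 893], demo_path)
print(demo)
```

In the main flow, the IDs could be saved from test.csv before pre_process_passengerId removes the column, then passed in together with model and X_output.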

Prediction results

Submitting submission.csv to Kaggle gives an accuracy of 75%.

Future improvements

This code only gets the pipeline running end to end; there is still a lot of optimization work left to do.

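Tune the xgboost parameters, using cross-validation to find the best values.
Explore the features that were not used above, and improve the handling of the features that were.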
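The first item, parameter search with cross-validation, can be sketched with the standard library alone. Everything here is an assumption rather than code from this post: k_fold_indices and grid_search_cv are hypothetical names, and toy_fit_and_score is a placeholder that scores a parameter set by a made-up formula. In real use that callback would train an XGBClassifier with the given parameters on the training rows and return its accuracy on the validation rows.

```python
import itertools
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for k shuffled, non-overlapping folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def grid_search_cv(param_grid, fit_and_score, n_rows, k=5):
    """Return (best_params, best_mean_score) over every combination in param_grid."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        # Average the validation score over the k folds
        scores = [fit_and_score(params, tr, va) for tr, va in k_fold_indices(n_rows, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

def toy_fit_and_score(params, train_idx, valid_idx):
    """Placeholder score (made up): highest at max_depth=3, learning_rate=0.1."""
    return -abs(params["max_depth"] - 3) - abs(params["learning_rate"] - 0.1)

grid = {"max_depth": [2, 3, 6], "learning_rate": [0.05, 0.1, 0.4]}
best_params, best_score = grid_search_cv(grid, toy_fit_and_score, n_rows=891)
print(best_params, best_score)
```

scikit-learn's GridSearchCV packages the same idea and is likely the more convenient choice once a real model is plugged in.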