This article continues exploring introductory NLP techniques through Kaggle's Bag of Words Meets Bags of Popcorn competition. The previous post covered encoding reviews with CountVectorizer and TfidfVectorizer; this one tries encoding them with Word2Vec instead.
Word2Vec Preprocessing
Word2Vec is trained on sentences, so we first need to split each review into individual sentences, using NLTK's sentence tokenizer. Each sentence still needs the usual cleaning: stripping HTML tags, removing special characters, and filtering stop words. This is wrapped in a helper, review_to_sentences(review, tokenizer); a sketch follows.
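Only the signature of review_to_sentences survives in the original code block, so the body below is a minimal reconstruction sketch. The review_to_wordlist helper is illustrative, not the original implementation; it stands in for the HTML-stripping, special-character, and stop-word cleanup just described:

```python
import re
import nltk.data
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def review_to_wordlist(raw_sentence):
    # Illustrative helper (an assumption, not the original code): strip HTML,
    # keep letters only, lowercase, and drop English stop words
    text = BeautifulSoup(raw_sentence, "html.parser").get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    return [w for w in words if w not in stops]

def review_to_sentences(review, tokenizer):
    # Split a raw review into sentences, then each sentence into a word list;
    # Word2Vec expects a list of sentences, each being a list of words
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(review_to_wordlist(raw_sentence))
    return sentences

# NLTK's pre-trained Punkt sentence splitter
# (may require nltk.download("punkt") and nltk.download("stopwords") first)
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

train = pd.read_csv("data/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
sentences = []
for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)
```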
Training the Word2Vec Model
With the sentence data from the previous section ready, we can train the Word2Vec model. We set the word embedding dimensionality to 300.

```python
from gensim.models import word2vec

num_features = 300    # Word vector dimensionality
min_word_count = 40   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model (this will take some time)
print("Training model...")
model = word2vec.Word2Vec(sentences, workers=num_workers, size=num_features,
                          min_count=min_word_count, window=context,
                          sample=downsampling)

# Normalize the vectors and discard training state, since training is done
model.init_sims(replace=True)

model_name = "300features_40minwords_10context"
model.save(model_name)
```
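The code above follows the gensim 3.x API (in gensim 4.x, size= was renamed vector_size= and init_sims() is deprecated). Once training finishes, a couple of quick sanity checks, sketched here against that same API, help confirm the embeddings are sensible:

```python
from gensim.models import word2vec

model = word2vec.Word2Vec.load("300features_40minwords_10context")

# Words whose vectors are closest to "man" should be semantically related
print(model.wv.most_similar("man"))

# The word least similar to the others in the list should stand out
print(model.wv.doesnt_match("france england germany berlin".split()))
```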
Encoding Reviews with the Word2Vec Model
This section uses the trained Word2Vec model to encode each word in a review as a 300-dimensional vector, then averages those vectors, so that every review maps to a single 300-dimensional vector.

```python
import numpy as np
import pandas as pd
from gensim.models import Word2Vec

def makeFeatureVec(words, model, num_features):
    # Average the vectors of all in-vocabulary words in a review
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0.
    index2word_set = set(model.wv.vocab)
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model.wv[word])
    # Divide the result by the number of words to get the average
    featureVec = np.divide(featureVec, nwords)
    return featureVec

def getAvgFeatureVecs(reviews, model, num_features):
    # Compute an average feature vector for each review
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")
    for review in reviews:
        if counter % 1000 == 0:
            print("Review %d of %d" % (counter, len(reviews)))
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
        counter = counter + 1
    return reviewFeatureVecs

train = pd.read_csv("data/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
test = pd.read_csv("data/testData.tsv", header=0, delimiter="\t", quoting=3)

model = Word2Vec.load("./300features_40minwords_10context")
print(model)

# review_to_words comes from the previous post; it must return a list of
# words for the averaging above to be meaningful
clean_train_reviews = []
for review in train["review"]:
    clean_train_reviews.append(review_to_words(review))
trainDataVecs = getAvgFeatureVecs(clean_train_reviews, model, num_features)
print(pd.DataFrame(trainDataVecs).info())

print("Creating average feature vecs for test reviews")
clean_test_reviews = []
for review in test["review"]:
    clean_test_reviews.append(review_to_words(review))
testDataVecs = getAvgFeatureVecs(clean_test_reviews, model, num_features)
print(pd.DataFrame(testDataVecs).info())
```
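As an aside, makeFeatureVec can be written more compactly with np.mean. The sketch below is an equivalent reformulation (not from the original tutorial) that also avoids the divide-by-zero NaN the loop version produces when a review contains no in-vocabulary words:

```python
def make_feature_vec_compact(words, model, num_features):
    # Equivalent to makeFeatureVec: mean of in-vocabulary word vectors
    vocab = set(model.wv.vocab)
    vecs = [model.wv[w] for w in words if w in vocab]
    if not vecs:
        # No known words: fall back to the zero vector instead of NaN
        return np.zeros((num_features,), dtype="float32")
    return np.mean(vecs, axis=0).astype("float32")
```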
Model Training and Prediction
Finally, we simply train a random forest on the averaged vectors and predict on the test set.

```python
from sklearn.ensemble import RandomForestClassifier

print("Training the random forest...")
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(trainDataVecs, train["sentiment"])
print("Ending Training")

# Predict sentiment from the averaged Word2Vec feature vectors
result = forest.predict(testDataVecs)

output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
output.to_csv("submission.csv", index=False, quoting=3)
```
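Before uploading submission.csv, a quick local estimate of model quality can save a wasted submission. A minimal sketch using scikit-learn's cross_val_score, scored with ROC AUC (the metric this competition uses):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 3-fold cross-validation on the training vectors as a rough local estimate
scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                         trainDataVecs, train["sentiment"],
                         cv=3, scoring="roc_auc")
print("Mean ROC AUC: %.4f" % scores.mean())
```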