Kaggle: Movie Review Sentiment Analysis (Bag of Words Meets Bags of Popcorn) (2)

This article continues exploring introductory NLP techniques through Kaggle's Bag of Words Meets Bags of Popcorn competition. The previous article covered encoding the reviews with CountVectorizer and TfidfVectorizer; this one tries encoding them with Word2Vec instead.

Word2Vec Preprocessing

Word2Vec is trained on sentences, so we first need to split each review into individual sentences; here we use NLTK's punkt tokenizer for sentence segmentation. Each sentence still needs the usual cleanup: stripping HTML tags, removing non-letter characters, and filtering out stop words.
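
If the punkt tokenizer model and the English stop-word list have not been downloaded before, a one-time setup step is needed first (a minimal sketch, assuming nltk itself is already installed):

import nltk
nltk.download("punkt")      # punkt sentence tokenizer model
nltk.download("stopwords")  # stop-word lists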

import re
import nltk.data
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

# Load the pre-trained punkt sentence tokenizer
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

def review_to_sentences(review, tokenizer):
    # Split the review into sentences on punctuation
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(review_to_words(raw_sentence))
    return sentences

def review_to_words(raw_review):
    # Strip HTML tags
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    # Replace non-letter characters with spaces
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    words = letters_only.lower().split()
    # Remove stop words
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops]
    return meaningful_words

train = pd.read_csv("data/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
test = pd.read_csv("data/testData.tsv", header=0, delimiter="\t", quoting=3)
unlabeled_train = pd.read_csv("data/unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

sentences = []  # Initialize an empty list of sentences
print("Parsing sentences from training set")
for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)

print("Parsing sentences from unlabeled set")
for review in unlabeled_train["review"]:
    sentences += review_to_sentences(review, tokenizer)

Training the Word2Vec Model

With the sentence data prepared in the previous section, we can now train the Word2Vec model. We set the word embedding dimensionality to 300.

from gensim.models import word2vec

num_features = 300    # Word vector dimensionality
min_word_count = 40   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

# Initialize and train the model (this will take some time)
# (gensim 3.x API; in gensim 4.x the `size` parameter is renamed `vector_size`)
print("Training model...")
model = word2vec.Word2Vec(sentences, workers=num_workers, size=num_features,
                          min_count=min_word_count, window=context,
                          sample=downsampling)
# Normalize the vectors in place to save memory; the model
# can no longer be trained after this call
model.init_sims(replace=True)
model_name = "300features_40minwords_10context"
model.save(model_name)
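
As a quick sanity check of the learned embeddings, we can query the model for similar words; a minimal sketch (the exact neighbors and scores vary between training runs):

# Words the model considers semantically close to "man"
print(model.wv.most_similar("man"))
# Pick the term that does not belong with the others
print(model.wv.doesnt_match("france england germany berlin".split()))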

Encoding Reviews with the Word2Vec Model

This section uses the Word2Vec model trained above to encode each word in a review as a 300-dimensional vector, then averages those vectors, so that each review ends up represented by a single 300-dimensional vector.

import numpy as np
import pandas as pd
from gensim.models import Word2Vec

def makeFeatureVec(words, model, num_features):
    # Average the Word2Vec vectors of all in-vocabulary words in a review
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0.
    index2word_set = set(model.wv.vocab)
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model.wv[word])
    # Divide the result by the number of words to get the average
    # (guard against reviews whose words are all out of vocabulary)
    if nwords > 0.:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

def getAvgFeatureVecs(reviews, model, num_features):
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")
    for review in reviews:
        if counter % 1000 == 0:
            print("Review %d of %d" % (counter, len(reviews)))
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
        counter = counter + 1
    return reviewFeatureVecs

train = pd.read_csv("data/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
test = pd.read_csv("data/testData.tsv", header=0, delimiter="\t", quoting=3)

model = Word2Vec.load("./300features_40minwords_10context")
print(model)

clean_train_reviews = []
for review in train["review"]:
    clean_train_reviews.append(review_to_words(review))

trainDataVecs = getAvgFeatureVecs(clean_train_reviews, model, num_features)
print(pd.DataFrame(trainDataVecs).info())

print("Creating average feature vecs for test reviews")
clean_test_reviews = []
for review in test["review"]:
    clean_test_reviews.append(review_to_words(review))

testDataVecs = getAvgFeatureVecs(clean_test_reviews, model, num_features)
print(pd.DataFrame(testDataVecs).info())

Model Training and Prediction

We simply train a random forest on the averaged review vectors and use it to predict the sentiment of the test set.

from sklearn.ensemble import RandomForestClassifier

print("Training the random forest...")
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(trainDataVecs, train["sentiment"])
print("Ending Training")

# Predict sentiment for the averaged test-set feature vectors
result = forest.predict(testDataVecs)
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
output.to_csv("submission.csv", index=False, quoting=3)
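
Before submitting, it can help to estimate performance locally; a minimal sketch using 5-fold cross-validation on the training vectors (this competition is scored on ROC AUC; cv=5 is an illustrative choice, not tuned):

from sklearn.model_selection import cross_val_score

# Estimate ROC AUC with 5-fold cross-validation on the training set
scores = cross_val_score(forest, trainDataVecs, train["sentiment"],
                         cv=5, scoring="roc_auc")
print("Mean cross-validated AUC: %.4f" % scores.mean())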