Chinese Sentiment Analysis Experiments with rnnlm, liblinear, and word2vec

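I have been learning how to use a few NLP tools lately, so here are some notes. This experiment mainly follows the post Tomas Mikolov made on the word2vec forum, together with the code from Ensemble of Generative and Discriminative Techniques for Sentiment Analysis of Movie Reviews. The difference is that I apply it to Chinese text and replace the Naive Bayes Support Vector Machine with plain TF-IDF.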

The dataset is the hotel review data from the 2014 NTU NLP course, 207884_hotel_training.txt, with some light preprocessing. The data is converted to the following format:

LABEL TOKENS

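Each line holds one review. It starts with the polarity label (1 for positive, 2 for negative), followed by the word-segmented review with tokens separated by spaces. After filtering out sentences where word segmentation failed, 1124 positive and 1217 negative reviews remain.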

Next, I wrote split.py to divide the data into a training set and a test set:

python3 split.py --input data/data.txt --train_pos data/train_pos.txt --train_neg data/train_neg.txt --test_pos data/test_pos.txt --test_neg data/test_neg.txt

I took one tenth of each class as test data: 112 positive and 121 negative sentences.
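
split.py itself is not reproduced in this post. As a rough illustration of the idea, the sketch below holds out one tenth of each class. The function name `split_by_label`, the fixed random seed, the rounding down, and the choice to drop the label from the output lines are all assumptions made for illustration, not a description of the actual script.

```python
import random

def split_by_label(lines, test_fraction=0.1, seed=0):
    """Split 'LABEL TOKENS' lines into per-class train/test lists.

    Hypothetical helper: label '1' is positive, '2' is negative.
    Only the tokens are kept in the output (an assumption).
    """
    by_label = {"1": [], "2": []}
    for line in lines:
        label, _, tokens = line.strip().partition(" ")
        if label in by_label and tokens:
            by_label[label].append(tokens)

    rng = random.Random(seed)
    result = {}
    for label, items in by_label.items():
        rng.shuffle(items)
        n_test = int(len(items) * test_fraction)  # rounds down
        result[label] = {"test": items[:n_test], "train": items[n_test:]}
    return result
```

With 1124 positive and 1217 negative reviews, rounding down one tenth of each class gives 112 and 121 test sentences, which agrees with the counts reported above.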

RNNLM

First, following mesnilgr/iclr15, I use rnnlm to build language models and predict the polarity of the test data. rnnlm is a convenient toolkit for building Recurrent Neural Network Language Models.

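For each class, the first 200 training sentences are held out as a validation set for tuning, and a positive and a negative language model are then trained separately: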

cd rnnlm

# construct positive language model
head -n 200 ../data/train_pos.txt > val.txt
sed '1,200d' ../data/train_pos.txt > train.txt
./rnnlm -rnnlm pos.model -train train.txt -valid val.txt -hidden 50 -direct-order 3 -direct 200 -class 100 -debug 2 -bptt 4 -bptt-block 10 -binary

# construct negative language model
head -n 200 ../data/train_neg.txt > val.txt
sed '1,200d' ../data/train_neg.txt > train.txt
./rnnlm -rnnlm neg.model -train train.txt -valid val.txt -hidden 50 -direct-order 3 -direct 200 -class 100 -debug 2 -bptt 4 -bptt-block 10 -binary

Next, concatenate the test data and prefix each line with an ID so that it matches the input format rnnlm expects:

cat ../data/test_pos.txt ../data/test_neg.txt | nl -v0 -s' ' -w1 > test.txt

Finally, score every sentence under both the positive and the negative model, and write the ratio of the two scores to a file:

./rnnlm -rnnlm pos.model -test test.txt -debug 0 -nbest > model_pos_score.txt
./rnnlm -rnnlm neg.model -test test.txt -debug 0 -nbest > model_neg_score.txt
mkdir -p ../scores
paste model_pos_score.txt model_neg_score.txt | awk '{print $1/$2;}' > ../scores/RNNLM
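
The `paste | awk` step above divides the two score files line by line. The same computation is sketched below in Python, assuming each file holds one numeric score per line. The names `read_scores` and `score_ratios` are hypothetical.

```python
def read_scores(path):
    """Read one float per line (assumed file layout)."""
    with open(path) as f:
        return [float(line.split()[0]) for line in f if line.strip()]

def score_ratios(pos_scores, neg_scores):
    """Return pos/neg for each pair of per-sentence scores."""
    if len(pos_scores) != len(neg_scores):
        raise ValueError("score lists must have the same length")
    return [p / n for p, n in zip(pos_scores, neg_scores)]
```

If the scores are log probabilities, and therefore negative, a ratio below 1 means the positive model assigns the sentence the higher probability. For example, scores of -10 (positive model) and -20 (negative model) give a ratio of 0.5.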

After rescaling the scores with normalize.py, evaluate.py reports the final accuracy:

cd ..
python3 normalize.py --input scores/RNNLM --output scores/RNNLM --type rnnlm
python3 evaluate.py --test_pos data/test_pos.txt --scores scores/RNNLM
#
# RNNLM accuracy: 87.9828%

Word2Vec: Paragraph Vectors + Logistic Regression

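Next, we process the sentences with a modified word2vec that can turn a whole sentence into a vector. One caveat: with the word2vec version shipped in iclr15, a large number of sentences makes the vocabulary too big, so many sentences are dropped and never receive an embedding.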

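To work around this, I modified the code (word2vec@shaform). Sentences that start with @@SE are used only to train word embeddings and get no paragraph vector, while sentences that start with @@SS do get a paragraph vector. This makes it possible to train on a large corpus and still obtain every requested paragraph vector. In this experiment, to keep things simple, no extra training data is brought in, so every sentence starts with @@SS.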

cd word2vec
cat ../data/train_pos.txt ../data/train_neg.txt ../data/test_pos.txt ../data/test_neg.txt | nl -v0 -s' ' -w1 | sed 's/^/@@SS-/' | shuf > all.txt
time ./word2vec -train all.txt -output vectors.txt -cbow 0 -size 400 -window 10 -negative 5 -hs 1 -sample 1e-3 -threads 24 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1
grep '@@SS-' vectors.txt | sed -e 's/^@@SS-//' | sort -n > sentence_vectors.txt
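
The `grep | sed | sort` line above keeps only the paragraph vectors and orders them by sentence ID. A Python sketch of the same selection follows, assuming the usual word2vec text output with one `token v1 v2 ...` entry per line. The function name `extract_sentence_vectors` is hypothetical, and unlike the shell version it returns only the vectors, without the ID column.

```python
def extract_sentence_vectors(lines, prefix="@@SS-"):
    """Collect vectors whose token starts with `prefix`, ordered by numeric ID.

    Assumes word2vec text output: one 'token v1 v2 ...' entry per line.
    Lines for ordinary words, and the header line, are skipped.
    """
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or not parts[0].startswith(prefix):
            continue
        sent_id = int(parts[0][len(prefix):])
        vectors[sent_id] = [float(x) for x in parts[1:]]
    return [vectors[i] for i in sorted(vectors)]
```

Sorting numerically matters because the input was shuffled with `shuf` before training, and the later steps rely on the vectors being back in the original train/test order.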

Then transform.py and train.py generate the training data, and liblinear's logistic regression is trained on it:

python3 ../transform.py --input sentence_vectors.txt --output sentence_features.txt
python3 ../train.py --features sentence_features.txt --train_pos ../data/train_pos.txt --train_neg ../data/train_neg.txt --test_pos ../data/test_pos.txt --output_train train.txt --output_test test.txt
../liblinear/train -s 0 train.txt model.logreg
../liblinear/predict -b 1 test.txt model.logreg out.logreg
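
The train.txt and test.txt files passed to liblinear use its sparse text format: a label followed by `index:value` pairs with 1-based indices. A minimal sketch of writing one dense vector in that format is shown below. The helper name `to_liblinear_line` is hypothetical, and how transform.py and train.py actually produce these files is not shown in this post.

```python
def to_liblinear_line(label, vector):
    """Format one example as '<label> <index>:<value> ...' (1-based indices).

    Zero entries are skipped, which the sparse format allows.
    """
    feats = [f"{i}:{v}" for i, v in enumerate(vector, start=1) if v != 0]
    return " ".join([str(label)] + feats)
```

For instance, `to_liblinear_line(1, [0.5, 0.0, -2.0])` yields `1 1:0.5 3:-2.0`.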

Take the probability column from the liblinear output, rescale it with normalize.py, and check the final accuracy with evaluate.py:

sed '1d' out.logreg | cut -d' ' -f3 > ../scores/DOC2VEC
cd ..
python3 normalize.py --input scores/DOC2VEC --output scores/DOC2VEC --type logreg
python3 evaluate.py --test_pos data/test_pos.txt --scores scores/DOC2VEC
#
# DOC2VEC accuracy: 84.5494%

TF-IDF

The last model is plain TF-IDF. tfidf.py generates unigram and bigram features, and the same train.py as before produces the training data.

cd tfidf
cat ../data/train_pos.txt ../data/train_neg.txt ../data/test_pos.txt ../data/test_neg.txt > all.txt
python3 ../tfidf.py --input all.txt --output features.txt
python3 ../train.py --features features.txt --train_pos ../data/train_pos.txt --train_neg ../data/train_neg.txt --test_pos ../data/test_pos.txt --output_train train.txt --output_test test.txt
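
tfidf.py is not reproduced in this post. To make the feature construction concrete, here is a small sketch of TF-IDF over unigrams and bigrams. The weighting used (raw term count times `log(N / df)`), the `_` separator for bigrams, and the function names are assumptions for illustration. The actual script may use a different variant.

```python
import math
from collections import Counter

def ngrams(tokens):
    """Unigrams plus adjacent-token bigrams (joined with '_')."""
    return tokens + [a + "_" + b for a, b in zip(tokens, tokens[1:])]

def tfidf(documents):
    """Return one {term: weight} dict per tokenized document.

    Weight = term count * log(N / document frequency). This particular
    formula is an assumption; other variants are common.
    """
    term_lists = [ngrams(doc) for doc in documents]
    df = Counter()
    for terms in term_lists:
        df.update(set(terms))  # count each term once per document
    n_docs = len(documents)
    return [
        {t: c * math.log(n_docs / df[t]) for t, c in Counter(terms).items()}
        for terms in term_lists
    ]
```

Under this weighting, a term that occurs in every document gets weight 0, so only terms that distinguish reviews from one another contribute to the features.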

Then train with liblinear's logistic regression:

../liblinear/train -s 0 train.txt model.logreg
../liblinear/predict -b 1 test.txt model.logreg out.logreg

As before, take the probability column from the liblinear output, rescale it with normalize.py, and check the final accuracy with evaluate.py:

sed '1d' out.logreg | cut -d' ' -f3 > ../scores/TFIDF
cd ..
python3 normalize.py --input scores/TFIDF --output scores/TFIDF --type logreg
python3 evaluate.py --test_pos data/test_pos.txt --scores scores/TFIDF
#
# TFIDF accuracy: 90.9817%

Ensemble

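Finally, the three models are combined with a simple arithmetic mean. The ensemble brings no noticeable gain: at 90.1288% it lands slightly below TF-IDF alone (90.9817%):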

paste scores/RNNLM scores/DOC2VEC scores/TFIDF | awk '{print ($1+$2+$3)/3;}' > scores/TOTAL
python3 evaluate.py --test_pos data/test_pos.txt --scores scores/TOTAL
#
# RNNLM accuracy: 87.9828%
# DOC2VEC accuracy: 84.5494%
# TFIDF accuracy: 90.9817%
# TOTAL accuracy: 90.1288%
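
The averaging and scoring steps can be sketched in Python as follows. Two assumptions are made here: the normalized scores lie in a range where values above 0.5 mean "positive", and the test sentences are ordered with all positives first, as in the concatenated test file. The function names are hypothetical, and evaluate.py may work differently.

```python
def average_scores(*score_lists):
    """Element-wise arithmetic mean of several equally long score lists."""
    return [sum(vals) / len(vals) for vals in zip(*score_lists)]

def accuracy(scores, n_pos, threshold=0.5):
    """Fraction correct when the first n_pos items are positive.

    Assumes a score above `threshold` means 'positive'; both the
    threshold and the direction are assumptions for this sketch.
    """
    correct = 0
    for i, s in enumerate(scores):
        predicted_pos = s > threshold
        correct += predicted_pos == (i < n_pos)
    return correct / len(scores)
```

With 233 test sentences, the reported 90.1288% corresponds to 210 correct predictions (210 / 233 ≈ 0.901288).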

Code

The related code is available on GitHub for reference: shaform/sentiment_analysis.