Caffe is a deep learning framework that can be used to develop neural network models. Although Caffe is usually used for image classification, that does not prevent us from applying it to other tasks. In this article, we outline the procedure to convert Paragraph Vectors into the LMDB format that Caffe understands, and create a simple model to train and predict the sentiment of movie reviews.
Data Preparation
While in the previous post we used a custom Chinese corpus for sentiment analysis, this time we use the scripts provided by mesnilgr/iclr15 to download the Large Movie Review Dataset, which makes the results easier to reproduce.
# get mesnilgr/iclr15
git clone https://github.com/mesnilgr/iclr15
mkdir -p iclr15_run
cd iclr15_run
# get data
../iclr15/scripts/data.sh
Afterwards, we create the Paragraph Vectors for each review. Paragraph Vectors are fixed-dimensional distributed representations of texts. Once we convert each review into a vector, we can easily feed it into a neural network.
# extract the part to create paragraph vectors from iclr15 scripts
sed -e '/liblinear/,$d' ../iclr15/scripts/paragraph.sh > paragraph.sh
# start creating the vectors
chmod +x paragraph.sh
./paragraph.sh
Finally, we copy the resulting files.
# copy the vectors
cd word2vec
cp full-train.txt test.txt ../../
cd ../../
Converting the Input Format
We now have two files, full-train.txt and test.txt, for training and testing respectively. These files use the LIBSVM data format, which cannot be used with Caffe directly, so we create a script to convert them. We will use the utilities provided by Pycaffe to do the conversion; be sure to install all the dependencies and Pycaffe itself. Note that Pycaffe currently does not work well on Python 3, so we'll use Python 2.7 here. If you don't want to install Pycaffe system-wide, you can manually set the PYTHONPATH variable as follows.
export PYTHONPATH=${PYTHONPATH}:caffe-directory/python/
In addition, install the py-lmdb Python package.
sudo pip install lmdb
Each line in the input file begins with a label followed by 100 index:value pairs, one per dimension. So we extract the data using a simple Python routine.
import random

import numpy as np

num_of_dims = 100


def load_data(path):
    items = []
    with open(path) as f:
        for l in f:
            tokens = l.rstrip().split()
            label = int(tokens[0])
            # change label `-1' to `0'
            if label == -1:
                label = 0
            # ignore the index since we already know the format
            arr = [float(dim.split(':')[1]) for dim in tokens[1:]]
            items.append((label, arr))
    random.shuffle(items)
    Y = np.array([y for y, _ in items])
    X = np.array([x for _, x in items])
    X = X.reshape((len(Y), 1, 1, num_of_dims))
    return X, Y
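To sanity-check the parsing logic, we can run the same steps on a couple of hypothetical LIBSVM-style lines (the sample values below are made up for illustration, and we use 3 dimensions instead of 100 to keep the example short):

```python
import random
import tempfile

import numpy as np

num_of_dims = 3  # 3 dimensions instead of 100, for brevity

# two hypothetical LIBSVM-style lines: a label, then index:value pairs
sample = "1 1:0.5 2:-0.25 3:0.125\n-1 1:0.0 2:1.0 3:-1.0\n"

with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
    f.write(sample)
    path = f.name

# same parsing steps as load_data above
items = []
with open(path) as f:
    for l in f:
        tokens = l.rstrip().split()
        label = int(tokens[0])
        if label == -1:
            label = 0  # Caffe expects non-negative labels
        arr = [float(dim.split(':')[1]) for dim in tokens[1:]]
        items.append((label, arr))

random.shuffle(items)
Y = np.array([y for y, _ in items])
X = np.array([x for _, x in items]).reshape((len(Y), 1, 1, num_of_dims))

assert X.shape == (2, 1, 1, 3)
assert sorted(Y.tolist()) == [0, 1]  # the -1 label was mapped to 0
```

The resulting X has one row per review with singleton channel and height axes, which matches the N×C×H×W layout Caffe expects.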
There is a well-written tutorial on creating LMDB files, Creating an LMDB database in Python, and we'll adopt a similar procedure. The only difference is that we are using floats for the data, so we'll just use array_to_datum to create the Datum for us.
import lmdb
import numpy as np

from caffe.io import array_to_datum


def save_data(path, X, Y):
    num = np.prod(X.shape)
    itemsize = np.dtype(X.dtype).itemsize
    # set a reasonable upper limit for database size
    map_size = 10240 * 1024 + num * itemsize * 2
    print 'save {} instances...'.format(len(Y))
    env = lmdb.open(path, map_size=map_size)
    # write all instances in a single transaction
    with env.begin(write=True) as txn:
        for i, (x, y) in enumerate(zip(X, Y)):
            datum = array_to_datum(x, y)
            str_id = '{:08}'.format(i)
            txn.put(str_id, datum.SerializeToString())
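The map_size formula above simply reserves generous headroom: twice the raw byte size of the data plus a 10 MB margin. As a rough sketch (assuming 25,000 reviews of 100 float64 dimensions each, which are illustrative numbers, not measured ones), the reservation works out to about 48 MiB:

```python
import numpy as np

# hypothetical sizes: 25,000 reviews, 100 float64 dimensions each
X = np.zeros((25000, 1, 1, 100))

num = np.prod(X.shape)                 # total number of elements
itemsize = np.dtype(X.dtype).itemsize  # 8 bytes for float64
map_size = 10240 * 1024 + num * itemsize * 2

assert num == 2500000
assert itemsize == 8
assert map_size == 50485760  # about 48 MiB reserved for the database
```

Since LMDB only reserves address space up to map_size rather than allocating it all up front, overestimating is cheap, while underestimating causes writes to fail once the database is full.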
Using the complete convert.py script, we convert both training and testing files into LMDB format.
python convert.py full-train.txt movie-train-lmdb
python convert.py test.txt movie-test-lmdb
Creating a Caffe Model
Finally, we create a simple NN model with nn.prototxt and nn_solver.prototxt. Executing the Caffe command-line tool, we obtain the following results.
$ caffe train --solver=nn_solver.prototxt
Iteration 10000, loss = 0.142478
Iteration 10000, Testing net (#0)
Test net output #0: accuracy = 0.88364
Test net output #1: loss = 0.284636 (* 1 = 0.284636 loss)
Optimization Done.
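For reference, a minimal nn.prototxt could look something like the following sketch. The layer names, sizes, and layer choices here are assumptions for illustration, not the exact files used above; the key points are that the Data layer reads from the LMDB we created, and the final InnerProduct layer has two outputs for the two sentiment classes.

```protobuf
# a minimal sketch, not the actual nn.prototxt used above
name: "MovieSentimentNet"
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  data_param {
    source: "movie-train-lmdb"
    backend: LMDB
    batch_size: 64
  }
  include { phase: TRAIN }
}
layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "data"
  top: "ip1"
  inner_product_param { num_output: 50 }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"
  top: "ip1"
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  inner_product_param { num_output: 2 }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}
```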
Source Code
The relevant source code is on shaform/experiments/caffe_sentiment_analysis.
