Using Caffe for Sentiment Analysis

By Shaform, Sat 06 June 2015, in category Notes

Caffe, deep learning, sentiment analysis, word2vec

Caffe is a deep learning framework that can be used to develop neural network models. Although Caffe is most often used for image classification, nothing prevents us from applying it to other tasks. In this article, we outline the procedure for converting Paragraph Vectors into the LMDB format that Caffe understands, and create a simple model to train and predict the sentiment of movie reviews.

Data Preparation

While in the previous post we used a custom Chinese corpus for sentiment analysis, this time we utilize the scripts provided by mesnilgr/iclr15 to download the Large Movie Review Dataset, making the results easier to reproduce.

# get mesnilgr/iclr15
git clone https://github.com/mesnilgr/iclr15

mkdir -p iclr15_run
cd iclr15_run

# get data
../iclr15/scripts/data.sh

Afterwards, we create the Paragraph Vectors for each review. Paragraph Vectors are fixed-dimensional distributed representations of texts. Once each review is converted into a vector, we can easily feed it into a neural network.

# extract the part to create paragraph vectors from iclr15 scripts
sed -e '/liblinear/,$d' ../iclr15/scripts/paragraph.sh > paragraph.sh

# start creating the vectors
chmod +x paragraph.sh
./paragraph.sh

Finally, we copy the resulting files.

# copy the vectors
cd word2vec
cp full-train.txt test.txt ../../
cd ../../

Converting the Input Format

We now have two files, full-train.txt and test.txt, for training and testing respectively. These files use the LIBSVM data format, which cannot be used with Caffe directly, so we create a script to convert them.
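For illustration, each line in these files consists of a label followed by index:value pairs. A minimal parsing sketch, using a hypothetical line with invented values:

```python
# A hypothetical LIBSVM-style line (the values here are made up):
line = '-1 1:0.052 2:-0.138 3:0.271'

tokens = line.split()
label = int(tokens[0])                                  # the sentiment label
values = [float(t.split(':')[1]) for t in tokens[1:]]   # drop the indices

print(label, values)
```

This is exactly the structure the conversion script below relies on.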

We will use utilities provided by Pycaffe to do the conversion, so be sure to install all of its dependencies along with Pycaffe itself. Note that Pycaffe currently does not work well on Python 3, so we'll use Python 2.7 here. If you don't want to install Pycaffe system-wide, you can instead set the PYTHONPATH variable manually as follows.

export PYTHONPATH=${PYTHONPATH}:caffe-directory/python/

In addition, install the py-lmdb Python package.

sudo pip install lmdb

Each line in the input file begins with a label followed by 100 index:value pairs, so we extract the data using a simple Python routine.

import random

import numpy as np

num_of_dims = 100

def load_data(path):
    items = []
    with open(path) as f:
        for l in f:
            tokens = l.rstrip().split()
            label = int(tokens[0])
            # change label `-1' to `0'
            if label == -1:
                label = 0
            # ignore the index since we already know the format
            arr = [float(dim.split(':')[1]) for dim in tokens[1:]]
            items.append((label, arr))

    random.shuffle(items)

    Y = np.array([y for y, _ in items])
    X = np.array([x for _, x in items])
    X = X.reshape((len(Y), 1, 1, num_of_dims))

    return X, Y
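The reshape at the end arranges the data in Caffe's N x C x H x W blob layout, with the 100 vector dimensions on the width axis. A quick sanity check with two fake instances (the labels and values below are invented):

```python
import numpy as np

num_of_dims = 100
# two fake (label, vector) instances
items = [(1, [0.5] * num_of_dims), (0, [-0.5] * num_of_dims)]

Y = np.array([y for y, _ in items])
X = np.array([x for _, x in items])
X = X.reshape((len(Y), 1, 1, num_of_dims))

print(X.shape)  # N x C x H x W, with C = H = 1
```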

There is a well-written tutorial on this topic, Creating an LMDB database in Python, and we'll adopt a similar procedure. The only difference is that we are using floats for the data, so we'll simply use array_to_datum to create the Datum for us.

import lmdb
from caffe.io import array_to_datum

def save_data(path, X, Y):
    # set a reasonable upper limit for the database size
    map_size = 10240 * 1024 + X.nbytes * 2
    print 'save {} instances...'.format(len(X))

    env = lmdb.open(path, map_size=map_size)

    # use a single write transaction for all instances
    with env.begin(write=True) as txn:
        for i, (x, y) in enumerate(zip(X, Y)):
            datum = array_to_datum(x, int(y))
            str_id = '{:08}'.format(i)
            txn.put(str_id, datum.SerializeToString())
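One detail worth noting: LMDB iterates keys in lexicographic byte order, which is why the script zero-pads the keys with '{:08}'. Without padding, key '10' would sort before key '2'. A quick illustration:

```python
# Zero-padded keys sort lexicographically in the same order as their indices.
indices = [0, 1, 2, 10]
padded = ['{:08}'.format(i) for i in indices]
unpadded = [str(i) for i in indices]

print(sorted(padded) == padded)      # padding preserves the index order
print(sorted(unpadded) == unpadded)  # without padding, '10' sorts before '2'
```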

Using the complete convert.py script, we convert both training and testing files into LMDB format.

python convert.py full-train.txt movie-train-lmdb
python convert.py test.txt movie-test-lmdb

Creating a Caffe Model

Finally, we create a simple NN model with nn.prototxt and nn_solver.prototxt. Executing the Caffe command-line tool, we obtain the following results.
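To give an idea of the network structure, here is a sketch of what such a model might look like in Caffe's prototxt format: a Data layer reading the LMDB, a hidden InnerProduct layer with ReLU, and a two-way output with softmax loss. This is an illustrative sketch with assumed layer sizes, not the exact nn.prototxt from the repository.

```protobuf
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  data_param { source: "movie-train-lmdb" backend: LMDB batch_size: 100 }
}
layer {
  name: "fc1"
  type: "InnerProduct"
  bottom: "data"
  top: "fc1"
  inner_product_param { num_output: 50 }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "fc1"
  top: "fc1"
}
layer {
  name: "fc2"
  type: "InnerProduct"
  bottom: "fc1"
  top: "fc2"
  inner_product_param { num_output: 2 }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc2"
  bottom: "label"
  top: "loss"
}
```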

$ caffe train --solver=nn_solver.prototxt

Iteration 10000, loss = 0.142478
Iteration 10000, Testing net (#0)
    Test net output #0: accuracy = 0.88364
    Test net output #1: loss = 0.284636 (* 1 = 0.284636 loss)
Optimization Done.

Source Code

The relevant source code is on shaform/experiments/caffe_sentiment_analysis.