Recently, while learning Spark, I have often encountered Snappy-compressed files.
Although we can simply use sc.textFile to read them in Spark, sometimes we may
want to download them and process them locally. However, reading these files
locally is complicated, because they are not plain Snappy-compressed files:
Hadoop stores them in its own container format.
Most existing solutions use Java and link against the Hadoop library, but the setup is rather involved. Moreover, some tools do not support empty files. Therefore, I spent some time studying the file format.
In short, Hadoop splits a file into multiple blocks, and each block is compressed with Snappy independently. Each compressed block is preceded by two big-endian 32-bit integers giving the decompressed size and the compressed size, respectively.
Since Spark splits its output into multiple partitions, some partitions may be empty. In that case, the file contains nothing but two 32-bit zeros.
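Under that layout, a decoder only takes a few dozen lines. The following is not the actual snappycat source, just a minimal C++ sketch of the idea, assuming the two length fields are big-endian (as Java's DataOutputStream writes them) and using Google's snappy library for the per-block decompression:

// Minimal sketch of a decoder for the layout described above.
// Assumptions: big-endian length fields; Google's snappy C++ library.
#include <cstdint>
#include <cstdio>
#include <string>
#include <snappy.h>

// Read one big-endian 32-bit integer; returns false on EOF.
static bool read_be32(FILE* f, uint32_t* out) {
    unsigned char b[4];
    if (fread(b, 1, 4, f) != 4) return false;
    *out = (uint32_t(b[0]) << 24) | (uint32_t(b[1]) << 16) |
           (uint32_t(b[2]) << 8)  |  uint32_t(b[3]);
    return true;
}

int main(int argc, char** argv) {
    FILE* f = argc > 1 ? fopen(argv[1], "rb") : stdin;
    if (!f) { perror("fopen"); return 1; }
    uint32_t raw_len, comp_len;
    // Each block: decompressed size, compressed size, compressed bytes.
    while (read_be32(f, &raw_len) && read_be32(f, &comp_len)) {
        if (raw_len == 0 && comp_len == 0) continue;  // empty partition
        std::string comp(comp_len, '\0');
        if (fread(&comp[0], 1, comp_len, f) != comp_len) return 1;
        std::string out;
        if (!snappy::Uncompress(comp.data(), comp.size(), &out)) return 1;
        fwrite(out.data(), 1, out.size(), stdout);
    }
    if (f != stdin) fclose(f);
    return 0;
}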
I developed a short C++ program to handle these cases: snappycat.
The usage is simple: just pass the input files as arguments:
./snappycat DIRECTORY/*.snappy
It also supports standard input:
cat DIRECTORY/*.snappy | ./snappycat
The program writes the decompressed result to standard output, so to save it to a file, redirect:
./snappycat DIRECTORY/*.snappy > output.txt