snappycat: A command line tool to decompress snappy files produced by Hadoop

By Yong-Siang Shih / Fri 06 November 2015 / In categories Projects

Hadoop, Spark, snappy, snappycat

I have often encountered Snappy-compressed files recently while learning Spark. Although we can simply use sc.textFile to read them in Spark, sometimes we might want to download them for local processing. However, reading these files locally is complicated because they are not plain Snappy-compressed files; Hadoop stores them in its own format.

Most existing solutions use Java and link against the Hadoop libraries, but the setup is rather complicated. Moreover, some tools don't support empty files. Therefore, I spent some time studying the file format.

In short, Hadoop splits each file into multiple blocks, and each block is compressed with Snappy independently. Each compressed block is preceded by two big-endian 32-bit integers that record the decompressed size and the compressed size, respectively.
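To make the layout concrete, here is a minimal sketch of a decoder for this format. It is not the actual snappycat source; it assumes the snappy C++ library is installed and that the two sizes are big-endian, as Hadoop's Java writer would produce:

// decode.cc: a minimal sketch, not the actual snappycat source.
// Reads Hadoop-style Snappy blocks from stdin and writes the
// decompressed bytes to stdout.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>
#include <snappy.h>

// Read one big-endian 32-bit integer from stdin; return false on EOF.
static bool read_be32(std::uint32_t* out) {
    unsigned char b[4];
    if (std::fread(b, 1, 4, stdin) != 4) return false;
    *out = (std::uint32_t(b[0]) << 24) | (std::uint32_t(b[1]) << 16) |
           (std::uint32_t(b[2]) << 8) | std::uint32_t(b[3]);
    return true;
}

int main() {
    std::uint32_t raw_size, comp_size;
    while (read_be32(&raw_size)) {           // decompressed size
        if (!read_be32(&comp_size)) return 1; // compressed size
        if (raw_size == 0 && comp_size == 0)
            continue;  // empty partition: nothing to output
        std::vector<char> comp(comp_size);
        if (std::fread(comp.data(), 1, comp_size, stdin) != comp_size)
            return 1;
        std::string out;
        if (!snappy::Uncompress(comp.data(), comp_size, &out))
            return 1;
        std::fwrite(out.data(), 1, out.size(), stdout);
    }
    return 0;
}

On a typical Linux setup this should compile with something like g++ -O2 -o decode decode.cc -lsnappy.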

As Spark splits data into multiple partitions, some partitions might be empty. In such cases, the corresponding files contain nothing but two 32-bit zeros.
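For example, dumping such an empty part file with xxd (the file name here is only illustrative) shows nothing but eight zero bytes:

xxd part-00000.snappy
00000000: 0000 0000 0000 0000                      ........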

I developed a short C++ program to handle these cases: snappycat.

The usage is simple: just pass the input files as arguments:

./snappycat DIRECTORY/*.snappy

It also supports standard input:

cat DIRECTORY/*.snappy | ./snappycat

The program writes the decompressed result to standard output, so to save it to a file, use:

./snappycat DIRECTORY/*.snappy > output.txt

Author

Yong-Siang Shih

Software Engineer, Machine Learning Scientist, Open Source Enthusiast. Worked at Appier building machine learning systems, and interned at Google, IBM, and Microsoft as a software engineering intern. Loves to learn and build things.
