May 22, 2015

Which gzip compression level for FASTQ files?

While implementing a pretty simple filter tool for gzipped FASTQ files, I noticed that the tool was much slower than expected. Profiling revealed that writing the gzipped stream with zlib was the bottleneck. The problem was that the default compression level, i.e. 6, is quite slow. Searching the web, I did not find any hints about which combination of gzip compression level and strategy yields the best tradeoff between file size and speed for FASTQ files.

Thus, I did a little benchmarking. Here are the results for reading and compressing a 660 MB FASTQ file on my laptop:

compression strategy   compression level   time [s]   size [MB]
Z_FILTERED             1                   23.98      205.73
Z_FILTERED             2                   27.21      197.48
Z_FILTERED             3                   39.46      187.54
Z_FILTERED             4                   46.50      179.35
Z_FILTERED             5                   65.69      174.62
Z_FILTERED             6                   124.77     167.14
Z_FILTERED             7                   182.67     164.17
Z_FILTERED             8                   318.48     161.31
Z_FILTERED             9                   464.42     160.29
Z_HUFFMAN_ONLY         1-9                 19.14      281.14
Z_RLE                  1-9                 21.04      241.57
Z_FIXED                1                   24.34      271.59
Z_FIXED                2                   26.67      254.54
Z_FIXED                3                   39.42      233.71
Z_FIXED                4                   32.91      232.42
Z_FIXED                5                   60.27      219.65
Z_FIXED                6                   121.26     206.17
Z_FIXED                7                   180.66     202.07
Z_FIXED                8                   313.01     200.06
Z_FIXED                9                   456.23     199.35

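The result is pretty clear. You probably want to use the Z_FILTERED strategy with the lowest compression level (1): it is nearly as fast as Z_HUFFMAN_ONLY and Z_RLE, but produces noticeably smaller files. If the file size is very important to you, compression levels up to 4 might be considered: level 4 roughly doubles the run time (46.50 s vs. 23.98 s) and yields a file that is about 13% smaller (179.35 MB vs. 205.73 MB).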
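
For readers who want to repeat such a measurement on their own data, here is a minimal sketch using Python's zlib module. It is not the code used for the numbers above. The helper names (gzip_compress, benchmark) and the tiny synthetic FASTQ-like record are assumptions made up for illustration, so the printed times and sizes will not match the table.

```python
import time
import zlib


def gzip_compress(data: bytes, level: int, strategy: int) -> bytes:
    """Compress data into a gzip stream with the given level and strategy."""
    # wbits=31 (16 + 15) asks zlib to emit a gzip header and trailer.
    compressor = zlib.compressobj(
        level, zlib.DEFLATED, 31, zlib.DEF_MEM_LEVEL, strategy
    )
    return compressor.compress(data) + compressor.flush()


def benchmark(data: bytes, level: int, strategy: int):
    """Return (seconds, compressed size in bytes) for one setting."""
    start = time.perf_counter()
    compressed = gzip_compress(data, level, strategy)
    return time.perf_counter() - start, len(compressed)


# Hypothetical stand-in for a real FASTQ file: one record repeated many times.
# Real reads are far less repetitive, so use your own file for real numbers.
record = b"@read1\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n"
data = record * 10000

for name, strategy in (("Z_FILTERED", zlib.Z_FILTERED),
                       ("Z_HUFFMAN_ONLY", zlib.Z_HUFFMAN_ONLY)):
    for level in (1, 4, 6):
        seconds, size = benchmark(data, level, strategy)
        print(f"{name:15s} level={level} time={seconds:.4f}s size={size}B")
```

To test a real file, replace the synthetic data with the bytes read from that file, e.g. open(path, "rb").read().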

Note: The runtime and file size for the Linux command line tool 'gzip' are similar. So, if you compress FASTQ with it, don't forget to add the '-1' argument for best speed.
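
If the output is written from a Python script rather than with the command line tool, the same low-level setting can be requested through the standard gzip module. This is only a sketch: write_gzipped and the two example records are hypothetical, and the gzip module exposes the compression level but not the zlib strategy.

```python
import gzip
import io


def write_gzipped(records, fileobj, level=1):
    """Write an iterable of bytes records to fileobj as one gzip stream."""
    # compresslevel=1 corresponds to the fastest setting, like 'gzip -1'.
    with gzip.GzipFile(fileobj=fileobj, mode="wb", compresslevel=level) as gz:
        for record in records:
            gz.write(record)


# Made-up records, written to an in-memory buffer instead of a real file.
records = [b"@read1\nACGT\n+\nIIII\n", b"@read2\nTTGA\n+\nIIII\n"]
buffer = io.BytesIO()
write_gzipped(records, buffer)

# Round trip: decompressing gives back the concatenated records.
restored = gzip.decompress(buffer.getvalue())
```

The same function works with a file opened in binary write mode in place of the in-memory buffer.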
