
====================== data compression utils ======================

Benchmarks of various data compression tools.

Index:
* results and key observations
* tested programs and results for each command
* result files

results and key observations
============================

system used: link
methodology: link

Links contain top results in each category for datasets excluding 06 and 0b-0f, which were left out due to poor compression ratios.

fast top results:
1. zstd -3 --long=31
2. zstd -8 --long=31
3. pigz -6
- zstd's killer feature is '--long=31': it makes repetitive data compress far better at the cost of higher memory usage, but it will still work on machines with at least 8 GB of RAM (example invocations for the top settings are sketched at the end of this section)
- 'zstd -3' is great when extremely fast (NVMe PCIe 4.0 level fast) compression is required
- pigz offers compression times comparable to 'zstd -3 --long=31', but mostly with much worse compression ratios

balanced top results:
1. lrzip -9
1. zstd -17 --long=31
1. pixz -9 / pixz -8
1. plzip -9 / plzip -6
- no obvious winner: pixz and lzip beat zstd most of the time in this category, but zstd beats them when the data is highly repetitive ('--long=31'); lrzip sometimes takes a very long time to decompress data and has high memory usage, but on the other hand it has the best compression ratios
- 7z and xz were excluded from this category due to very slow decompression
- pbzip2 does surprisingly well when compressing non-repetitive text data
- xz offers compression ratios and times similar to pixz, but doesn't have multi-threaded decompression yet, and this hurt its decompression times heavily

best top results:
1. lrzip --zpaq / lrzip -L 9
2. zstd -22 --long=31 --ultra / zstd -20 --long=31 --ultra
2. pixz -9 / xz -9e
2. 7z a ... -mfb=279 -md=256m / 7z a -t7z -m0=lzma2 -mx=9
- 'lrzip --zpaq' is absolutely the best on every dataset, but its compression times are the highest and decompression takes as long as compression; 'lrzip -9' doesn't have these drawbacks and its compression ratios are comparable
- zstd didn't compress 'cs: go' as well as the other utils
- lrzip and zstd with '--long=31' have no competition when compressing highly repetitive data, like datasets 05 and 07
- 'zstd -20 --long=31' is often twice as fast as 'zstd -22 --long=31', but compresses data up to 1 percentage point less; the same can be said when comparing 'zstd -17 --long=31' to 'zstd -20 --long=31'
- 'pigz -11' is terribly ineffective (its cpu time is 8x higher than the next slowest tool), so I stopped testing it after the first dataset
- 'plzip -9 -s 256MiB -m 273' has way too high memory usage, so it was excluded from the top rankings

other observations:
- lrzip has stability issues; this opinion is based on my previous experience with the tool, when I ran into bugs while compressing and/or decompressing; that was a few years back, so some of those bugs have probably been fixed, but during the benchmarking I also hit this bug: https://github.com/ckolivas/lrzip/issues/102
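The exact commands and timing setup are behind the methodology link; as a rough sketch of what the top settings look like in practice (the filenames and the tar step are my own illustration, not the benchmarked invocations):

    # fast: zstd in long-window mode, all cores
    tar -cf - dataset/ | zstd -3 --long=31 -T0 -o dataset.tar.zst

    # balanced: lrzip at maximum lzma level / pixz on all cores
    lrzip -L 9 dataset.tar
    pixz -9 dataset.tar dataset.tar.xz

    # best ratio: lrzip's zpaq backend / zstd above level 19 (needs --ultra)
    lrzip --zpaq dataset.tar
    zstd -22 --ultra --long=31 -T0 dataset.tar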

tested programs and results for each command
============================================

Links contain benchmark results for each program's various settings, performed on one set of text data and one set of binary data. The main focus of this benchmark is on multi-threaded programs.

- multi-threaded:
  pigz 2.4
  pbzip2 1.1.13
  xz 5.2.5
  pixz 1.0.7
  plzip 1.8
  lrzip 0.631
  zstd 1.4.5
    * --long=31: link
  7z 16.02
- single-threaded:
  gzip 1.10
  bzip2 1.0.8
  lzip 1.21
  lz4 1.9.2
  lzop 1.0.4
  brotli 1.0.9
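Note that "multi-threaded" doesn't always mean threaded by default: some of these tools need an explicit flag or thread count. A quick sketch of how that looks (thread counts and filenames are illustrative, not the benchmark's settings):

    pigz -6 data.bin                  # pigz and pbzip2 use all cores by default
    xz -6 -T0 data.bin                # xz only threads when given -T0 / -T N
    zstd -8 -T0 data.bin              # same for zstd; -T0 means all cores
    plzip -6 -n 8 data.bin            # plzip takes an explicit thread count
    7z a -t7z -mx=9 data.7z data.bin  # 7z threads lzma2 by default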

result files
============

How to read result tables: link.

Results for each data set:

binary
00 - qcow2 image of a fresh installation of arch linux, 12GB
01 - qcow2 image of a fresh installation of windows 7, 18GB
02 - /usr/bin directory of a system with a lot of bloat, 2GB
03 - cleaned /usr/lib directory of a system with a lot of bloat, 15GB
04 - installed counter-strike global offensive, 24GB
05 - old dos games, 14GB
06 - a bunch of bencoded torrent files, 3GB

text
07 - linux source code, versions 5.9.1, 5.8.11 and 5.8.5, 3GB
08 - dump of passwords found on the internet, 10GB
09 - parsed information from torrent files, 21GB
0a - concatenated e-books converted to txt, from the Gutenberg Project, 10GB

multimedia (just as an experiment to see compression ratios)
0b - TIFF images from the Hubble space telescope, 2GB
0c - JPG images with paintings from various sources, 2GB
0d - a bunch of mp3 music files from various sources, 3GB
0e - various PDF files containing books, 3GB
0f - video files from various sources, 5GB

The above files do not contain benchmarks for zstd with the '--long=31' setting. These are provided separately here. Compared to '--long=30', significant differences exist only for dataset 00.
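One practical note about the '--long=31' results: archives created with a 2 GiB match window also need the long-window flag (or a raised memory limit) at decompression time, otherwise zstd refuses to decode them. Illustrative filenames, but the flags are real:

    zstd -19 --long=31 big.tar            # compress with a 2 GiB window
    zstd -d big.tar.zst                   # fails: window requires too much memory
    zstd -d --long=31 big.tar.zst         # works; --memory=2048MB also does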