Compressing PGN with gzip, lz4, xz, bzip2, Brotli, and Zstandard
Published benchmarks give a good idea of the characteristics of general-purpose compression tools, but the details depend on the corpus.
Here is how they perform on large PGN files exported from the Lichess database.
Plaintext
The plaintext was prepared by decompressing the September 2022 games from database.lichess.org, taking the first GiB, and removing the incomplete game at the end.
There are 467393 games remaining, and the last looks like:
[Event "Rated Bullet tournament https://lichess.org/tournament/aOKoO6S6"]
[Site "https://lichess.org/jslt3qqY"]
[Date "2022.09.01"]
[Round "-"]
[White "anwarsoerbakti"]
[Black "gun_pwk"]
[Result "1-0"]
[UTCDate "2022.09.01"]
[UTCTime "06:21:42"]
[WhiteElo "2200"]
[BlackElo "2009"]
[WhiteRatingDiff "+3"]
[BlackRatingDiff "-3"]
[ECO "B23"]
[Opening "Sicilian Defense: Closed"]
[TimeControl "120+0"]
[Termination "Time forfeit"]
1. e4 { [%clk 0:02:00] } 1... c5 { [%clk 0:02:00] } 2. Nc3 { [%clk 0:02:00] } 2... g6 { [%clk 0:01:59] } 3. Bc4 { [%clk 0:01:59] } 3... Bg7 { [%clk 0:01:58] } 4. Nf3 { [%clk 0:01:59] } 4... e6 { [%clk 0:01:57] } 5. Qe2 { [%clk 0:01:59] } 5... Ne7 { [%clk 0:01:57] } 6. d3 { [%clk 0:01:58] } 6... O-O { [%clk 0:01:55] } 7. Bg5 { [%clk 0:01:57] } 7... Nbc6 { [%clk 0:01:55] } 8. h3 { [%clk 0:01:55] } 8... a6 { [%clk 0:01:51] } 9. Qd2 { [%clk 0:01:54] } 9... b5 { [%clk 0:01:48] } 10. Bb3 { [%clk 0:01:53] } 10... Bb7 { [%clk 0:01:47] } 11. a4 { [%clk 0:01:51] } 11... b4 { [%clk 0:01:46] } 12. Nd1 { [%clk 0:01:50] } 12... Nd4 { [%clk 0:01:44] } 13. Nxd4 { [%clk 0:01:49] } 13... Bxd4 { [%clk 0:01:43] } 14. h4 { [%clk 0:01:48] } 14... f6 { [%clk 0:01:37] } 15. Bh6 { [%clk 0:01:47] } 15... Rf7 { [%clk 0:01:34] } 16. h5 { [%clk 0:01:45] } 16... g5 { [%clk 0:01:33] } 17. f3 { [%clk 0:01:45] } 17... Be5 { [%clk 0:01:31] } 18. Nf2 { [%clk 0:01:43] } 18... Bf4 { [%clk 0:01:29] } 19. Qe2 { [%clk 0:01:38] } 19... Nc6 { [%clk 0:01:25] } 20. Nh3 { [%clk 0:01:33] } 20... Bg3+ { [%clk 0:01:22] } 21. Kd2 { [%clk 0:01:30] } 21... Nd4 { [%clk 0:01:21] } 22. Qe3 { [%clk 0:01:26] } 22... Be5 { [%clk 0:01:03] } 23. f4 { [%clk 0:01:22] } 23... gxf4 { [%clk 0:00:59] } 24. Bxf4 { [%clk 0:01:20] } 24... d5 { [%clk 0:00:55] } 25. Bxe5 { [%clk 0:01:14] } 25... fxe5 { [%clk 0:00:55] } 26. Ng5 { [%clk 0:01:07] } 26... Rg7 { [%clk 0:00:48] } 27. Nf3 { [%clk 0:00:56] } 27... Rxg2+ { [%clk 0:00:45] } 28. Kc1 { [%clk 0:00:52] } 28... Nxf3 { [%clk 0:00:41] } 29. Qxf3 { [%clk 0:00:46] } 29... Qg5+ { [%clk 0:00:40] } 30. Kb1 { [%clk 0:00:45] } 30... Rf8 { [%clk 0:00:39] } 31. Qh3 { [%clk 0:00:39] } 31... Qg4 { [%clk 0:00:28] } 32. Ka2 { [%clk 0:00:37] } 32... Qxh3 { [%clk 0:00:22] } 33. Rxh3 { [%clk 0:00:36] } 33... Rff2 { [%clk 0:00:21] } 34. Re1 { [%clk 0:00:32] } 34... Re2 { [%clk 0:00:14] } 35. Re3 { [%clk 0:00:29] } 35... Rxe1 { [%clk 0:00:11] } 36. Rxe1 { [%clk 0:00:29] } 36... 
dxe4 { [%clk 0:00:09] } 37. Bxe6+ { [%clk 0:00:27] } 37... Kg7 { [%clk 0:00:09] } 38. dxe4 { [%clk 0:00:25] } 38... Rxc2 { [%clk 0:00:07] } 39. Bb3 { [%clk 0:00:25] } 39... Rg2 { [%clk 0:00:06] } 40. Rc1 { [%clk 0:00:23] } 40... Kf6 { [%clk 0:00:05] } 41. Rxc5 { [%clk 0:00:22] } 41... Bxe4 { [%clk 0:00:05] } 42. Rc4 { [%clk 0:00:20] } 42... Kf5 { [%clk 0:00:04] } 43. Rxb4 { [%clk 0:00:18] } 43... Rg1 { [%clk 0:00:04] } 44. Rb8 { [%clk 0:00:14] } 44... Bb1+ { [%clk 0:00:04] } 45. Ka3 { [%clk 0:00:13] } 45... Be4 { [%clk 0:00:03] } 46. Kb4 { [%clk 0:00:12] } 46... Kg5 { [%clk 0:00:03] } 47. Rg8+ { [%clk 0:00:10] } 47... Kf4 { [%clk 0:00:03] } 48. Rxg1 { [%clk 0:00:10] } 48... Bf3 { [%clk 0:00:02] } 49. Rf1 { [%clk 0:00:09] } 49... e4 { [%clk 0:00:02] } 50. Bd1 { [%clk 0:00:07] } 50... Kg3 { [%clk 0:00:01] } 51. Bxf3 { [%clk 0:00:07] } 51... exf3 { [%clk 0:00:01] } 52. Rg1+ { [%clk 0:00:07] } 1-0
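The preparation step described above (take the first GiB, then drop the incomplete game at the end) could be sketched roughly as follows. This is not necessarily how the plaintext was actually produced. It assumes that games are separated by a blank line and that every game starts with an `[Event ` tag, as in Lichess exports. The function name and the file name in the usage note are hypothetical.

```python
# Sketch only: assumes Lichess-style PGN where each game starts with an
# [Event ...] tag and is followed by a blank line.
GAME_START = b"\n\n[Event "
RESULTS = (b"1-0", b"0-1", b"1/2-1/2", b"*")


def drop_incomplete_tail(data: bytes) -> bytes:
    """Return data without a trailing, partially written game."""
    stripped = data.rstrip()
    last_line = stripped[stripped.rfind(b"\n") + 1:]
    # A complete game ends with a movetext line that ends in a result token.
    if not last_line.startswith(b"[") and last_line.endswith(RESULTS):
        return data
    cut = data.rfind(GAME_START)
    if cut == -1:
        # Only a single, incomplete game was present.
        return b""
    # Keep the blank line that terminates the previous game.
    return data[: cut + 2]
```

For example, reading `f.read(1 << 30)` from a decompressed export (say, a hypothetical `games.pgn`) and passing the result to `drop_incomplete_tail` yields a slightly shorter byte string that ends on a game boundary.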
Method
These tests were performed on Debian 11 (bullseye), using an Intel Xeon E5-1650 v2 CPU @ 3.50GHz. All files were situated in a tmpfs.
Tool | Version |
---|---|
gzip | 1.10 |
pigz | 2.6 |
lz4 | 1.9.3 |
xz | 5.2.5 |
bzip2 | 1.0.8 |
pbzip2 | 1.1.13 |
brotli | 1.0.9 |
zstd | 1.4.8 |
pzstd | 1.4.8 |
Each measurement was averaged over 3 runs.
It seems reasonable to assume that people dealing with huge PGN files do not care much about nuances like peak memory usage (if it does not grow arbitrarily), so the only considered metrics are file size and wall clock time.
Where parallelism was available, both single-threaded usage and 8 threads were measured.
- pigz offers parallel gzip compression and decompression.
- lz4 is intended for fast single-threaded usage.
- xz offers parallel compression, but no parallel decompression as of now.
- pbzip2 offers parallel bzip2 compression and decompression. Parallel decompression is only effective on archives that have been created with pbzip2 in the first place.
- Parallel Brotli seems possible by design, but no official tool is available as of now.
- zstd offers parallel compression, but no parallel decompression as of now.
- pzstd offers parallel Zstandard compression and decompression. Parallel decompression is only effective on archives that have been created with pzstd in the first place.
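A measurement along these lines (wall clock time averaged over a few runs, plus the resulting file size) could be reproduced with a small harness like the one below. It is a minimal sketch and not the script used for this post. The function name `measure` and the file names in the usage note are assumptions. The command is expected to read from stdin and write to stdout.

```python
import os
import statistics
import subprocess
import time


def measure(cmd, input_path, output_path, runs=3):
    """Run cmd several times; return (mean wall clock seconds, output size)."""
    times = []
    for _ in range(runs):
        with open(input_path, "rb") as src, open(output_path, "wb") as dst:
            start = time.perf_counter()
            # The tool reads the input on stdin and writes its result to stdout.
            subprocess.run(cmd, stdin=src, stdout=dst, check=True)
            times.append(time.perf_counter() - start)
    return statistics.mean(times), os.path.getsize(output_path)
```

A call such as `measure(["zstd", "-T8", "-19", "-c"], "games.pgn", "games.pgn.zst")` would then time multi-threaded Zstandard compression, provided both files live on a tmpfs so that disk I/O does not dominate.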
Ultimately, users may also want to consider other properties of the compression formats.
Compression
Decompression
Raw measurements
Compression | Run 1 | Run 2 | Run 3 | Avg Run | File size (bytes) | Ratio | Decompression | Run 1 | Run 2 | Run 3 | Avg Run |
---|---|---|---|---|---|---|---|---|---|---|---|
cat | 0.2s | 0.2s | 0.2s | 0.2s | 1073741782 | 1.000 | cat | 0.2s | 0.2s | 0.2s | 0.2s |
gzip | 38.5s | 38.7s | 38.6s | 38.6s | 215987804 | 4.971 | gunzip | 4.8s | 4.7s | 4.7s | 4.7s |
gzip -9 | 2:47.5s | 2:47.2s | 2:47.0s | 2:47.2s | 208532036 | 5.149 | gunzip | 4.6s | 4.7s | 4.7s | 4.6s |
pigz -p8 | 5.7s | 6.0s | 6.2s | 6.0s | 216195701 | 4.967 | unpigz -p8 | 2.7s | 2.7s | 2.7s | 2.7s |
pigz -p8 -9 | 22.8s | 23.5s | 23.5s | 23.2s | 208769656 | 5.143 | unpigz -p8 | 2.6s | 2.7s | 2.7s | 2.7s |
lz4 | 2.4s | 2.4s | 2.4s | 2.4s | 348615978 | 3.080 | unlz4 | 1.0s | 1.0s | 1.0s | 1.0s |
lz4 -12 | 1:44.0s | 1:44.5s | 1:45.8s | 1:44.8s | 239186622 | 4.489 | unlz4 | 0.9s | 0.9s | 0.8s | 0.9s |
lz4 -12 --favor-decSpeed | 1:48.7s | 1:52.7s | 1:48.9s | 1:50.1s | 242041411 | 4.436 | unlz4 | 0.8s | 0.9s | 0.8s | 0.8s |
xz | 9:17.8s | 9:19.0s | 9:19.9s | 9:18.9s | 140157324 | 7.661 | unxz | 9.1s | 9.1s | 9.1s | 9.1s |
xz -e | 11:10.8s | 11:12.1s | 11:11.3s | 11:11.4s | 136859080 | 7.846 | unxz | 9.0s | 9.1s | 9.1s | 9.1s |
xz -9 | 11:38.4s | 11:40.9s | 11:38.2s | 11:39.1s | 130723072 | 8.214 | unxz | 9.1s | 9.1s | 9.1s | 9.1s |
xz -9e | 14:43.9s | 14:41.5s | 14:43.1s | 14:42.8s | 126039588 | 8.519 | unxz | 9.2s | 9.2s | 9.2s | 9.2s |
xz -T8 | 1:46.0s | 1:45.9s | 1:45.9s | 1:45.9s | 141980000 | 7.563 | unxz | 9.3s | 9.3s | 9.3s | 9.3s |
xz -T8 -e | 2:06.5s | 2:06.3s | 2:06.4s | 2:06.4s | 138846784 | 7.733 | unxz | 9.3s | 9.3s | 9.3s | 9.3s |
xz -T8 -9 | 2:33.9s | 2:33.6s | 2:33.9s | 2:33.8s | 132108232 | 8.128 | unxz | 9.0s | 9.0s | 9.0s | 9.0s |
xz -T8 -9e | 3:10.4s | 3:10.9s | 3:11.0s | 3:10.8s | 127578672 | 8.416 | unxz | 9.2s | 9.2s | 9.2s | 9.2s |
bzip2 | 1:24.3s | 1:22.7s | 1:24.3s | 1:23.7s | 126744088 | 8.472 | bunzip2 | 23.0s | 24.6s | 22.9s | 23.5s |
pbzip2 -p8 | 15.7s | 16.7s | 15.7s | 16.0s | 126828692 | 8.466 | pbzip2 -d -p8 | 7.0s | 7.0s | 7.0s | 7.0s |
brotli | 38:40.7s | 38:41.3s | 38:37.9s | 38:40.0s | 134976003 | 7.955 | brotli -d | 2.6s | 2.6s | 2.6s | 2.6s |
zstd | 5.1s | 5.2s | 5.2s | 5.2s | 221787281 | 4.841 | unzstd | 2.0s | 1.9s | 1.9s | 1.9s |
zstd -19 | 12:18.5s | 12:17.7s | 12:17.2s | 12:17.8s | 147797800 | 7.265 | unzstd | 1.7s | 1.7s | 1.7s | 1.7s |
zstd --ultra -22 | 15:27.1s | 15:35.4s | 15:32.1s | 15:31.5s | 135909097 | 7.900 | unzstd | 1.8s | 1.7s | 1.7s | 1.8s |
zstd -T8 | 1.1s | 1.1s | 1.1s | 1.1s | 221787281 | 4.841 | unzstd | 2.0s | 1.9s | 1.9s | 1.9s |
zstd -T8 -19 | 2:27.7s | 2:25.5s | 2:27.1s | 2:26.8s | 147797800 | 7.265 | unzstd | 1.8s | 1.7s | 1.7s | 1.7s |
zstd -T8 --ultra -22 | 8:49.2s | 8:52.2s | 8:51.7s | 8:51.1s | 135909097 | 7.900 | unzstd | 1.8s | 1.7s | 1.7s | 1.8s |
pzstd -p8 | 1.1s | 1.1s | 1.1s | 1.1s | 222008897 | 4.836 | pzstd -d -p8 | 0.6s | 0.6s | 0.6s | 0.6s |
pzstd -p8 -19 | 1:55.8s | 1:50.6s | 1:54.0s | 1:53.5s | 149537117 | 7.180 | pzstd -d -p8 | 0.6s | 0.7s | 0.7s | 0.7s |
pzstd -p8 --ultra -22 | 7:01.0s | 6:54.1s | 6:49.9s | 6:55.0s | 136364543 | 7.874 | pzstd -d -p8 | 1.0s | 1.1s | 1.1s | 1.0s |
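The derived columns can be recomputed from the raw values. The sketch below assumes that the ratio is the uncompressed size divided by the compressed size, which matches the numbers in the table, and that durations are written as seconds or minutes:seconds. The helper names are made up for illustration.

```python
PLAINTEXT_SIZE = 1073741782  # size of the uncompressed PGN (the "cat" row)


def parse_duration(text: str) -> float:
    """Convert a table entry like '38.6s' or '2:47.2s' to seconds."""
    minutes, _, seconds = text.strip().rstrip("s").rpartition(":")
    return (int(minutes) if minutes else 0) * 60 + float(seconds)


def compression_ratio(original: int, compressed: int) -> float:
    """Uncompressed size divided by compressed size."""
    return original / compressed


def throughput_mib_s(size: int, seconds: float) -> float:
    """Processed input in MiB per second of wall clock time."""
    return size / seconds / 2**20


# gzip row: 215987804 bytes, so the ratio rounds to 4.971 as in the table.
gzip_ratio = round(compression_ratio(PLAINTEXT_SIZE, 215987804), 3)
```

Throughput is not listed in the table, but `throughput_mib_s(PLAINTEXT_SIZE, parse_duration("38.6s"))` gives a quick way to compare tools by input processed per second.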
niklasf, 3rd November 2022.