Compressing PGN with gzip, lz4, xz, bzip2, Brotli, and Zstandard

Available benchmarks give a good idea of the characteristics of general-purpose compression tools, but the details can depend on the specifics of the corpus.

Here is how they perform on large PGNs exported from the Lichess database.

Plaintext

The plaintext was prepared by decompressing the September 2022 games from database.lichess.org, taking the first GiB, and removing the incomplete game at the end.
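The preparation step can be sketched in Python roughly as follows. This is only an illustration under assumptions: it works on already decompressed bytes, it relies on every game in the export starting with an `[Event ` tag at the beginning of a line, and the helper names `truncate_to_complete_games` and `first_gib_of_complete_games` are made up for this example.

```python
GIB = 1024 ** 3  # 1 GiB in bytes


def truncate_to_complete_games(chunk: bytes) -> bytes:
    """Drop the trailing, possibly cut-off game from a chunk of PGN.

    Assumption: each game starts with an [Event ...] tag at the start of
    a line. The last game in the chunk is always dropped, because a chunk
    cut at an arbitrary byte offset almost certainly ends mid-game.
    """
    # Position of the newline right before the last game's first tag.
    last_start = chunk.rfind(b"\n[Event ")
    if last_start == -1:
        # Zero or one game in the chunk: nothing can be kept safely.
        return b""
    # Keep everything up to and including that newline.
    return chunk[: last_start + 1]


def first_gib_of_complete_games(stream) -> bytes:
    """Read the first GiB from a binary stream and trim the cut-off game."""
    return truncate_to_complete_games(stream.read(GIB))
```

Feeding this the decompressed stream (for example via a pipe on standard input) and writing the result to a file would give a plaintext like the one used here.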

There are 467393 games remaining, and the last one looks like this:

[Event "Rated Bullet tournament https://lichess.org/tournament/aOKoO6S6"]
[Site "https://lichess.org/jslt3qqY"]
[Date "2022.09.01"]
[Round "-"]
[White "anwarsoerbakti"]
[Black "gun_pwk"]
[Result "1-0"]
[UTCDate "2022.09.01"]
[UTCTime "06:21:42"]
[WhiteElo "2200"]
[BlackElo "2009"]
[WhiteRatingDiff "+3"]
[BlackRatingDiff "-3"]
[ECO "B23"]
[Opening "Sicilian Defense: Closed"]
[TimeControl "120+0"]
[Termination "Time forfeit"]

1. e4 { [%clk 0:02:00] } 1... c5 { [%clk 0:02:00] } 2. Nc3 { [%clk 0:02:00] } 2... g6 { [%clk 0:01:59] } 3. Bc4 { [%clk 0:01:59] } 3... Bg7 { [%clk 0:01:58] } 4. Nf3 { [%clk 0:01:59] } 4... e6 { [%clk 0:01:57] } 5. Qe2 { [%clk 0:01:59] } 5... Ne7 { [%clk 0:01:57] } 6. d3 { [%clk 0:01:58] } 6... O-O { [%clk 0:01:55] } 7. Bg5 { [%clk 0:01:57] } 7... Nbc6 { [%clk 0:01:55] } 8. h3 { [%clk 0:01:55] } 8... a6 { [%clk 0:01:51] } 9. Qd2 { [%clk 0:01:54] } 9... b5 { [%clk 0:01:48] } 10. Bb3 { [%clk 0:01:53] } 10... Bb7 { [%clk 0:01:47] } 11. a4 { [%clk 0:01:51] } 11... b4 { [%clk 0:01:46] } 12. Nd1 { [%clk 0:01:50] } 12... Nd4 { [%clk 0:01:44] } 13. Nxd4 { [%clk 0:01:49] } 13... Bxd4 { [%clk 0:01:43] } 14. h4 { [%clk 0:01:48] } 14... f6 { [%clk 0:01:37] } 15. Bh6 { [%clk 0:01:47] } 15... Rf7 { [%clk 0:01:34] } 16. h5 { [%clk 0:01:45] } 16... g5 { [%clk 0:01:33] } 17. f3 { [%clk 0:01:45] } 17... Be5 { [%clk 0:01:31] } 18. Nf2 { [%clk 0:01:43] } 18... Bf4 { [%clk 0:01:29] } 19. Qe2 { [%clk 0:01:38] } 19... Nc6 { [%clk 0:01:25] } 20. Nh3 { [%clk 0:01:33] } 20... Bg3+ { [%clk 0:01:22] } 21. Kd2 { [%clk 0:01:30] } 21... Nd4 { [%clk 0:01:21] } 22. Qe3 { [%clk 0:01:26] } 22... Be5 { [%clk 0:01:03] } 23. f4 { [%clk 0:01:22] } 23... gxf4 { [%clk 0:00:59] } 24. Bxf4 { [%clk 0:01:20] } 24... d5 { [%clk 0:00:55] } 25. Bxe5 { [%clk 0:01:14] } 25... fxe5 { [%clk 0:00:55] } 26. Ng5 { [%clk 0:01:07] } 26... Rg7 { [%clk 0:00:48] } 27. Nf3 { [%clk 0:00:56] } 27... Rxg2+ { [%clk 0:00:45] } 28. Kc1 { [%clk 0:00:52] } 28... Nxf3 { [%clk 0:00:41] } 29. Qxf3 { [%clk 0:00:46] } 29... Qg5+ { [%clk 0:00:40] } 30. Kb1 { [%clk 0:00:45] } 30... Rf8 { [%clk 0:00:39] } 31. Qh3 { [%clk 0:00:39] } 31... Qg4 { [%clk 0:00:28] } 32. Ka2 { [%clk 0:00:37] } 32... Qxh3 { [%clk 0:00:22] } 33. Rxh3 { [%clk 0:00:36] } 33... Rff2 { [%clk 0:00:21] } 34. Re1 { [%clk 0:00:32] } 34... Re2 { [%clk 0:00:14] } 35. Re3 { [%clk 0:00:29] } 35... Rxe1 { [%clk 0:00:11] } 36. Rxe1 { [%clk 0:00:29] } 36... 
dxe4 { [%clk 0:00:09] } 37. Bxe6+ { [%clk 0:00:27] } 37... Kg7 { [%clk 0:00:09] } 38. dxe4 { [%clk 0:00:25] } 38... Rxc2 { [%clk 0:00:07] } 39. Bb3 { [%clk 0:00:25] } 39... Rg2 { [%clk 0:00:06] } 40. Rc1 { [%clk 0:00:23] } 40... Kf6 { [%clk 0:00:05] } 41. Rxc5 { [%clk 0:00:22] } 41... Bxe4 { [%clk 0:00:05] } 42. Rc4 { [%clk 0:00:20] } 42... Kf5 { [%clk 0:00:04] } 43. Rxb4 { [%clk 0:00:18] } 43... Rg1 { [%clk 0:00:04] } 44. Rb8 { [%clk 0:00:14] } 44... Bb1+ { [%clk 0:00:04] } 45. Ka3 { [%clk 0:00:13] } 45... Be4 { [%clk 0:00:03] } 46. Kb4 { [%clk 0:00:12] } 46... Kg5 { [%clk 0:00:03] } 47. Rg8+ { [%clk 0:00:10] } 47... Kf4 { [%clk 0:00:03] } 48. Rxg1 { [%clk 0:00:10] } 48... Bf3 { [%clk 0:00:02] } 49. Rf1 { [%clk 0:00:09] } 49... e4 { [%clk 0:00:02] } 50. Bd1 { [%clk 0:00:07] } 50... Kg3 { [%clk 0:00:01] } 51. Bxf3 { [%clk 0:00:07] } 51... exf3 { [%clk 0:00:01] } 52. Rg1+ { [%clk 0:00:07] } 1-0
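A game count like the one above can be reproduced by counting `[Event ` tag lines. The following is a minimal sketch, assuming exactly one `[Event ...]` tag per game at the start of a line; `count_games` and the file name in the usage comment are hypothetical.

```python
def count_games(lines) -> int:
    """Count games in an iterable of byte lines by counting [Event ...] tags."""
    return sum(1 for line in lines if line.startswith(b"[Event "))


# Usage with a hypothetical file name; iterating a binary file yields lines:
# with open("plain.pgn", "rb") as f:
#     print(count_games(f))
```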

Method

These tests were performed on Debian 11 (bullseye), using an Intel Xeon E5-1650 v2 CPU @ 3.50GHz. All files resided in a tmpfs, so disk I/O should not affect the timings.

| Tool | Version |
| --- | --- |
| gzip | 1.10 |
| pigz | 2.6 |
| lz4 | 1.9.3 |
| xz | 5.2.5 |
| bzip2 | 1.0.8 |
| pbzip2 | 1.1.13 |
| brotli | 1.0.9 |
| zstd | 1.4.8 |
| pzstd | 1.4.8 |

Each measurement was averaged over 3 runs.

It seems reasonable to assume that people dealing with huge PGN files do not care much about nuances like peak memory usage (as long as it does not grow without bound), so the only metrics considered are file size and wall-clock time.

Where parallelism was available, both single-threaded usage and 8 threads were measured.
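The measurement procedure (wall-clock time, averaged over 3 runs) could be scripted along these lines. This is a sketch, not the harness actually used for this post; `time_command` is a hypothetical helper, and the exact command lines behind the measurements are not given here.

```python
import subprocess
import time


def time_command(argv, runs=3):
    """Run a command several times.

    Returns (list of wall-clock seconds per run, average in seconds).
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the tool exits with an error.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings, sum(timings) / len(timings)


# Usage with a hypothetical input file (-f overwrites earlier output):
# runs, average = time_command(["zstd", "-T8", "-19", "-f", "plain.pgn"])
```

`time.perf_counter()` measures elapsed real time rather than CPU time, which matches the wall-clock metric chosen above and keeps multi-threaded runs comparable to single-threaded ones.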

Ultimately, users may also want to consider other properties of the compression formats.

Compression

Decompression

Raw measurements

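| Compression | Run 1 | Run 2 | Run 3 | Avg Run | File size (bytes) | Ratio | Decompression | Run 1 | Run 2 | Run 3 | Avg Run |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cat | 0.2s | 0.2s | 0.2s | 0.2s | 1073741782 | 1.000 | cat | 0.2s | 0.2s | 0.2s | 0.2s |
| gzip | 38.5s | 38.7s | 38.6s | 38.6s | 215987804 | 4.971 | gunzip | 4.8s | 4.7s | 4.7s | 4.7s |
| gzip -9 | 2:47.5s | 2:47.2s | 2:47.0s | 2:47.2s | 208532036 | 5.149 | gunzip | 4.6s | 4.7s | 4.7s | 4.6s |
| pigz -p8 | 5.7s | 6.0s | 6.2s | 6.0s | 216195701 | 4.967 | unpigz -p8 | 2.7s | 2.7s | 2.7s | 2.7s |
| pigz -p8 -9 | 22.8s | 23.5s | 23.5s | 23.2s | 208769656 | 5.143 | unpigz -p8 | 2.6s | 2.7s | 2.7s | 2.7s |
| lz4 | 2.4s | 2.4s | 2.4s | 2.4s | 348615978 | 3.080 | unlz4 | 1.0s | 1.0s | 1.0s | 1.0s |
| lz4 -12 | 1:44.0s | 1:44.5s | 1:45.8s | 1:44.8s | 239186622 | 4.489 | unlz4 | 0.9s | 0.9s | 0.8s | 0.9s |
| lz4 -12 --favor-decSpeed | 1:48.7s | 1:52.7s | 1:48.9s | 1:50.1s | 242041411 | 4.436 | unlz4 | 0.8s | 0.9s | 0.8s | 0.8s |
| xz | 9:17.8s | 9:19.0s | 9:19.9s | 9:18.9s | 140157324 | 7.661 | unxz | 9.1s | 9.1s | 9.1s | 9.1s |
| xz -e | 11:10.8s | 11:12.1s | 11:11.3s | 11:11.4s | 136859080 | 7.846 | unxz | 9.0s | 9.1s | 9.1s | 9.1s |
| xz -9 | 11:38.4s | 11:40.9s | 11:38.2s | 11:39.1s | 130723072 | 8.214 | unxz | 9.1s | 9.1s | 9.1s | 9.1s |
| xz -9e | 14:43.9s | 14:41.5s | 14:43.1s | 14:42.8s | 126039588 | 8.519 | unxz | 9.2s | 9.2s | 9.2s | 9.2s |
| xz -T8 | 1:46.0s | 1:45.9s | 1:45.9s | 1:45.9s | 141980000 | 7.563 | unxz | 9.3s | 9.3s | 9.3s | 9.3s |
| xz -T8 -e | 2:06.5s | 2:06.3s | 2:06.4s | 2:06.4s | 138846784 | 7.733 | unxz | 9.3s | 9.3s | 9.3s | 9.3s |
| xz -T8 -9 | 2:33.9s | 2:33.6s | 2:33.9s | 2:33.8s | 132108232 | 8.128 | unxz | 9.0s | 9.0s | 9.0s | 9.0s |
| xz -T8 -9e | 3:10.4s | 3:10.9s | 3:11.0s | 3:10.8s | 127578672 | 8.416 | unxz | 9.2s | 9.2s | 9.2s | 9.2s |
| bzip2 | 1:24.3s | 1:22.7s | 1:24.3s | 1:23.7s | 126744088 | 8.472 | bunzip2 | 23.0s | 24.6s | 22.9s | 23.5s |
| pbzip2 -p8 | 15.7s | 16.7s | 15.7s | 16.0s | 126828692 | 8.466 | pbzip2 -d -p8 | 7.0s | 7.0s | 7.0s | 7.0s |
| brotli | 38:40.7s | 38:41.3s | 38:37.9s | 38:40.0s | 134976003 | 7.955 | brotli -d | 2.6s | 2.6s | 2.6s | 2.6s |
| zstd | 5.1s | 5.2s | 5.2s | 5.2s | 221787281 | 4.841 | unzstd | 2.0s | 1.9s | 1.9s | 1.9s |
| zstd -19 | 12:18.5s | 12:17.7s | 12:17.2s | 12:17.8s | 147797800 | 7.265 | unzstd | 1.7s | 1.7s | 1.7s | 1.7s |
| zstd --ultra -22 | 15:27.1s | 15:35.4s | 15:32.1s | 15:31.5s | 135909097 | 7.900 | unzstd | 1.8s | 1.7s | 1.7s | 1.8s |
| zstd -T8 | 1.1s | 1.1s | 1.1s | 1.1s | 221787281 | 4.841 | unzstd | 2.0s | 1.9s | 1.9s | 1.9s |
| zstd -T8 -19 | 2:27.7s | 2:25.5s | 2:27.1s | 2:26.8s | 147797800 | 7.265 | unzstd | 1.8s | 1.7s | 1.7s | 1.7s |
| zstd -T8 --ultra -22 | 8:49.2s | 8:52.2s | 8:51.7s | 8:51.1s | 135909097 | 7.900 | unzstd | 1.8s | 1.7s | 1.7s | 1.8s |
| pzstd -p8 | 1.1s | 1.1s | 1.1s | 1.1s | 222008897 | 4.836 | pzstd -d -p8 | 0.6s | 0.6s | 0.6s | 0.6s |
| pzstd -p8 -19 | 1:55.8s | 1:50.6s | 1:54.0s | 1:53.5s | 149537117 | 7.180 | pzstd -d -p8 | 0.6s | 0.7s | 0.7s | 0.7s |
| pzstd -p8 --ultra -22 | 7:01.0s | 6:54.1s | 6:49.9s | 6:55.0s | 136364543 | 7.874 | pzstd -d -p8 | 1.0s | 1.1s | 1.1s | 1.0s |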
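The Ratio column appears to be the plaintext size divided by the compressed size. The sketch below recomputes it and derives a throughput figure from the timings. The helper names are hypothetical, and the throughput metric is an addition for illustration, not something reported in the table.

```python
def parse_duration(text: str) -> float:
    """Convert a table entry such as '38.6s' or '2:47.2s' to seconds."""
    minutes, _, seconds = text.rstrip("s").rpartition(":")
    return (int(minutes) if minutes else 0) * 60 + float(seconds)


def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Plaintext size divided by compressed size."""
    return original_bytes / compressed_bytes


def throughput_mib_per_s(original_bytes: int, seconds: float) -> float:
    """Plaintext MiB processed per second of wall-clock time."""
    return original_bytes / 1024 ** 2 / seconds


PLAINTEXT_BYTES = 1073741782  # size from the `cat` row

# Example using the default gzip row of the table.
ratio = compression_ratio(PLAINTEXT_BYTES, 215987804)
speed = throughput_mib_per_s(PLAINTEXT_BYTES, parse_duration("38.6s"))
print(f"gzip: ratio {ratio:.3f}, {speed:.1f} MiB/s")
```

For the gzip row this reproduces the listed ratio of 4.971.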

niklasf, 3rd November 2022