# Image Format vs Fossil Repo Size
## The Problem
Fossil has a delta compression feature which removes redundant information from a file when checking in a subsequent version.¹ That delta is then zlib-compressed before being stored in the Fossil repository database file.
These two steps have a few practical consequences when it comes to storing already-compressed files:
*   Binary data compression algorithms such as zlib turn the file data into something approaching pseudorandom noise. Typical data compression algorithms are not hash functions, where the goal is that a change to each bit of the input has a statistically even chance of changing every bit of the output, but they approach that pathological condition closely enough that pre-compressed data tends to defeat Fossil’s delta compression algorithm: there is little correlation between two different outputs of the compression algorithm, even when the inputs differ only slightly.

*   An ideal lossless binary data compression algorithm cannot be applied more than once to make the data even smaller, since random noise is incompressible. The consequence for our purposes here is that pre-compressed data doesn’t benefit from Fossil’s zlib compression.
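Both effects are easy to see with nothing more than Python’s standard zlib module. The snippet below is a small illustration of the general point, not a model of Fossil’s internals: flip one bit of a compressible input and the two compressed outputs quickly stop resembling each other, and compressing the compressed result a second time does not shrink it.

```python
import zlib

# A compressible input: a repetitive pattern, then the same data with one bit flipped.
original = b"The quick brown fox jumps over the lazy dog. " * 1500
modified = bytearray(original)
modified[100] ^= 0x01

a = zlib.compress(original, 9)
b = zlib.compress(bytes(modified), 9)

# How far into the two compressed outputs do they still agree byte-for-byte?
agree = next((i for i, (x, y) in enumerate(zip(a, b)) if x != y), min(len(a), len(b)))
print(f"compressed size: {len(a)} bytes; outputs agree for the first {agree} bytes")

# Compressing already-compressed data buys nothing; it usually grows slightly.
print(f"compressed again: {len(zlib.compress(a, 9))} bytes")
```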
## Key Advice
If you read no further, the takeaway from the prior two points is that you should not store already-compressed data in a Fossil repository. You’ll defeat both of its compression methods, ballooning the Fossil repository size.
The remainder of this article shows the consequences of ignoring this advice. We’ll use 2D image files as our example here, but realize that this advice also applies to many other file types:
*   **Microsoft Office:** The XML-based document formats used from Office 2007 onward (`.docx`, `.xlsx`, `.pptx`, etc.) are Zip files containing an XML document file and several collateral files. The same is true of LibreOffice’s ODF files.

*   **Java:** A `.jar` file is a Zip file containing JVM `.class` files, manifest files, and more.

*   **Windows Installer:** An `*.msi` file is a proprietary database format that contains, among other things, Microsoft Cabinet-compressed files, which in turn may hold Windows executables, which may themselves be compressed.

*   **SVG, PDF:** Many file formats are available in both compressed and uncompressed forms. For the same basic reason as we will illustrate below, you should use the uncompressed form with Fossil wherever practical.
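If you want to verify the container claim for yourself, Python’s standard zipfile module will open any of the Office-style documents directly; the file name below is a placeholder for one of your own documents.

```python
import zipfile

# "report.docx" is a hypothetical name; any .docx, .xlsx, .pptx, .odt, or .jar works.
with zipfile.ZipFile("report.docx") as doc:
    print(doc.namelist()[:5])   # the XML document and collateral files inside the Zip
```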
## Demonstration
The image-format-vs-repo-size.ipynb file in this directory is a
Jupyter notebook implementing the following experiment:
1.  Create an empty Fossil repository, and save its initial size.

2.  Use ImageMagick via Wand to generate a JPEG file of a particular size (currently 256 px²) filled with Gaussian noise to make data compression difficult.

3.  Check that image into the new Fossil repo, and remember that size.

4.  Change a random pixel in the image to a random RGB value, save that image, check it in, and remember the new Fossil repo size.

5.  Iterate on step 4 some number of times (currently 10) and remember the Fossil repo size at each step.

6.  Repeat the above steps for BMP, TIFF,² and PNG.

7.  Create a bar chart showing how the Fossil repository size changes with each checkin.
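If you just want the shape of the experiment without opening the notebook, the following condensed sketch runs the loop for a single format. It shells out to the fossil and ImageMagick convert command-line tools rather than using Wand, and the file and repository names here are illustrative, not the notebook’s own.

```python
import os
import random
import subprocess

def run(*cmd, **kw):
    """Run a command, raising on failure."""
    subprocess.run(cmd, check=True, **kw)

fmt, size, iterations = "jpg", 256, 10
repo, work = os.path.abspath("test.fossil"), "work"

os.makedirs(work, exist_ok=True)
run("fossil", "init", repo)                           # step 1: new, empty repo...
run("fossil", "open", repo, cwd=work)
sizes = [os.path.getsize(repo)]                       # ...and its starting size

img = f"noise.{fmt}"
run("convert", "-size", f"{size}x{size}", "xc:gray",  # step 2: Gaussian-noise image
    "+noise", "Gaussian", img, cwd=work)
run("fossil", "add", img, cwd=work)
run("fossil", "commit", "-m", "initial image", cwd=work)
sizes.append(os.path.getsize(repo))                   # step 3

for i in range(iterations):                           # steps 4 and 5
    x, y = random.randrange(size), random.randrange(size)
    r, g, b = (random.randrange(256) for _ in range(3))
    run("convert", img, "-fill", f"rgb({r},{g},{b})",
        "-draw", f"point {x},{y}", img, cwd=work)
    run("fossil", "commit", "-m", f"pixel change {i + 1}", cwd=work)
    sizes.append(os.path.getsize(repo))

print(sizes)   # repo size after each checkin; repeat per format and chart (steps 6-7)
```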
We chose to use Jupyter for this because it makes it easy for you to
modify the notebook to try different things. Want to see how the
results change with a different image size? Easy, change the size
value in the second cell of the notebook. Want to try more image
formats? You can put anything ImageMagick can recognize into the
formats list. Want to find the break-even point for images like those
in your own repository? Easily done with a small amount of code.
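For example, a break-even check can be as simple as comparing the per-checkin repo sizes the notebook already records for two formats; the function and variable names below are hypothetical, not taken from the notebook.

```python
# Hypothetical helper: given the repo sizes recorded after each checkin for a
# lossy format (e.g. JPEG) and a lossless one (e.g. BMP), report the first
# checkin at which the lossy format's repository becomes the larger of the two.
def break_even(lossy_sizes, lossless_sizes):
    for checkin, (lossy, lossless) in enumerate(zip(lossy_sizes, lossless_sizes)):
        if lossy > lossless:
            return checkin
    return None   # no crossover within the measured range

# e.g. break_even(sizes_by_format["jpg"], sizes_by_format["bmp"])
```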
## Results
Running the notebook gives a bar chart something like³ this:
There are several points of interest in that chart:
*   The initial repository size (group "0") is the same in all four cases. This is the normal overhead for an empty Fossil repository, about 200 kiB.

*   BMP and uncompressed TIFF are nearly identical in size for all checkins. A low-tech format like BMP will have a small edge in practice because TIFF metadata includes the option for multiple timestamps, UUIDs, etc., which bloat the checkin size by creating many small deltas. If you don't need the advantages of TIFF, a less capable image file format will give smaller checkin sizes for a given amount of change.

*   Because both PNG and Fossil use the zlib binary data compression algorithm, the first checkin (group “1”) is approximately the same size for PNG, BMP, and TIFF.

*   The repo size balloons on the first 3 checkins due to SQLite page overhead and such.

*   Once we’re past that initial settling-in point, the repo size goes up negligibly for BMP and TIFF due to Fossil’s delta compression algorithm: a single-pixel change results in a very small increase in the Fossil repo size, as we want. PNG and JPEG, though, show large increases on each checkin because this same tiny input change causes a large change in the computed delta.

*   Because JPEG’s lossy nature allows it to start smaller and have smaller size increases than PNG, the crossover point with BMP/TIFF isn’t until 7-9 checkins in typical runs of this test. Given a choice among these four file formats and a willingness to use lossy image compression, a rational tradeoff is to choose JPEG for repositories where each image will change fewer than that number of times.
## Automated Recompression
Since programs that produce and consume binary-compressed data files
often make it either difficult or impossible to work with the
uncompressed form, we want an automated method for producing the
uncompressed form to make Fossil happy while still having the compressed
form to keep our content creation applications happy. This Makefile
will do that for several different compressed file types:
```makefile
.SUFFIXES: .bmp .png .svg .svgz

# Suffix rules mapping each compressed format to its uncompressed
# counterpart and back again.
.svgz.svg:
	gzip -dc < $< > $@

.svg.svgz:
	gzip -9c < $< > $@

.bmp.png:
	convert -quality 95 $< $@

.png.bmp:
	convert $< $@

SS_FILES := $(wildcard spreadsheet/*)

# "make": produce the uncompressed forms that get checked into Fossil.
all: $(SS_FILES) illus.svg image.bmp doc-big.pdf

# "make reconstitute": after a fresh checkout, regenerate the compressed
# working files from the checked-in uncompressed ones.
reconstitute: illus.svgz image.png
	cd spreadsheet && zip -r -9 ../spreadsheet.xlsx .
	qpdf --stream-data=compress doc-big.pdf doc-small.pdf

$(SS_FILES): spreadsheet.xlsx
	unzip -o $< -d spreadsheet

doc-big.pdf: doc-small.pdf
	qpdf --stream-data=uncompress $< $@
```
This Makefile allows you to treat the compressed version as the
process input, but to actually check in only the changes against the
uncompressed version by typing “make” before “fossil ci”.
Because it’s based on dependency rules, only the necessary files are
generated on each make command.
You only have to run “make reconstitute” once after opening a fresh
Fossil checkout to produce those compressed sources. After that, you
work with the compressed files in your content creation programs.
The Makefile illustrates two primary strategies:
### Input and Output File Formats Differ by Extension
In the case of SVG and the bitmap image formats, the file name extension
differs between the cases, so we can use make suffix rules to get the
behavior we want. The top half of the Makefile just tells make how
to map from *.svg to *.svgz and vice versa, and the same for *.bmp
to/from *.png.
### Same Extension
We don’t have that luxury for Excel and PDF files, for different reasons:
*   **Excel:** Excel has no way to work with the unpacked Zip file contents at all, so we have to unpack it into a subdirectory, which is what we check into Fossil. On making a fresh Fossil checkout, we have to pack that subdirectory’s contents back up into an `*.xlsx` file with “make reconstitute” so we can edit it with Excel again.

*   **PDF:** All PDF readers can display an uncompressed PDF file, but many PDF-producing programs have no option for uncompressed output. Since the file name extension is the same either way, we treat the compressed PDF as the source to the process, yielding an automatically-uncompressed PDF for the benefit of Fossil. Unlike with the Excel case, there is no simple “file base name to directory name” mapping, so we just created the `-big` to `-small` name scheme here.
## Footnotes
1.  Several other programs also do delta compression, so they’ll also be affected by this problem: rsync, Unison, Git, etc. When using file copying and synchronization programs without delta compression, it’s best to use the most highly-compressed file format you can tolerate, since they copy the whole file any time any bit of it changes.

2.  We're using uncompressed TIFF here, not LZW- or Zip-compressed TIFF, either of which would give similar results to PNG, which is always zlib-compressed.

3.  The raw data changes somewhat from one run to the next due to the use of random noise in the image to make the zlib/PNG compression more difficult, and the random pixel changes. Those test design choices make this a Monte Carlo experiment. We’ve found that the overall character of the results doesn’t change much from one run to the next.