International
Tables for
Crystallography
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G. ch. 5.6, p. 552

Section 5.6.3. Compression schemes

P. J. Ellis^a and H. J. Bernstein^b*

^a Stanford Linear Accelerator Center, 2575 Sand Hill Road, Menlo Park, CA 94025, USA, and ^b Department of Mathematics and Computer Science, Kramer Science Center, Dowling College, Idle Hour Blvd, Oakdale, NY 11769, USA
Correspondence e-mail:  yaya@bernstein-plus-sons.com

5.6.3. Compression schemes

Two schemes for lossless compression of integer arrays (such as images) have been implemented in this version of CBFlib:

(i) an entropy-encoding scheme using canonical coding;

(ii) a CCP4-style packing scheme.

Both encode the difference (or error) between the current element in the array and the prior element. Parameters required for more sophisticated predictors have been included in the compression functions and will be used in a future version of the library.
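The difference step that both schemes share can be sketched as follows. This is an illustrative sketch, not CBFlib's actual code; the function names `delta_encode` and `delta_decode` are hypothetical, and the first element is assumed to be differenced against zero:

```c
#include <stddef.h>

/* Sketch: compute the error terms (differences) that both schemes
   encode.  The first element is differenced against zero (an
   assumption for this sketch). */
void delta_encode(const int *data, int *err, size_t n) {
    int prev = 0;
    for (size_t i = 0; i < n; i++) {
        err[i] = data[i] - prev;  /* error = current - prior element */
        prev = data[i];
    }
}

/* Reconstruct the original array by accumulating the errors. */
void delta_decode(const int *err, int *data, size_t n) {
    int prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev += err[i];
        data[i] = prev;
    }
}
```

Because neighbouring pixels in an image tend to have similar values, the errors cluster near zero and compress well.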

5.6.3.1. Canonical-code compression

The canonical-code compression scheme encodes errors in two ways: directly or indirectly. Errors are coded directly using a symbol corresponding to the error value. Errors are coded indirectly using a symbol for the number of bits in the (signed) error, followed by the error itself.
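The choice between the two codings can be sketched as below. This is a hedged illustration, not CBFlib's implementation; the helper names `fits_direct` and `signed_bits` are hypothetical, and n is the number of directly coded bits introduced in the next paragraph:

```c
/* Sketch: an error fits a direct code if it lies in the signed
   n-bit range -2^(n-1) .. 2^(n-1)-1; otherwise it would be coded
   indirectly as a bit-length symbol followed by the error itself. */
int fits_direct(int error, int n) {
    return error >= -(1 << (n - 1)) && error < (1 << (n - 1));
}

/* Number of bits needed to represent the signed error, i.e. the
   symbol that would precede an indirectly coded error. */
int signed_bits(int error) {
    int b = 1;  /* at least one bit (the sign) */
    while (error >= (1 << (b - 1)) || error < -(1 << (b - 1)))
        b++;
    return b;
}
```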

At the start of the compression, CBFlib constructs a table containing a set of symbols: one for each of the 2^n direct codes from −2^(n−1) to 2^(n−1) − 1, one for a stop code and one for each of the maxbits − n indirect codes, where n is chosen at compression time and maxbits is the maximum number of bits in an error. CBFlib then assigns a bit code to each symbol, using shorter bit codes for the more common symbols and longer bit codes for the less common symbols. The bit-code lengths are calculated using a Huffman-type algorithm and the actual bit codes are constructed using the canonical-code algorithm described by Moffat et al. (1997).
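The canonical-code construction can be sketched as follows: given the bit length assigned to each symbol by the Huffman-type step, symbols of a given length receive consecutive code values, and shorter codes are numerically smaller prefixes. This is a generic sketch of the canonical-code idea, not CBFlib's own routine:

```c
#include <stddef.h>

#define MAXLEN 32  /* assumed upper bound on code length for this sketch */

/* Sketch: assign canonical codes from code lengths.  len[i] is the
   bit length of symbol i's code (0 = unused symbol); code[i] receives
   the code value, to be emitted in len[i] bits. */
void canonical_codes(const unsigned char *len, unsigned *code, size_t nsym) {
    unsigned count[MAXLEN + 1] = {0};
    unsigned next[MAXLEN + 1];
    for (size_t i = 0; i < nsym; i++)
        count[len[i]]++;
    count[0] = 0;  /* unused symbols get no code */
    unsigned c = 0;
    for (int l = 1; l <= MAXLEN; l++) {
        c = (c + count[l - 1]) << 1;  /* first code of length l */
        next[l] = c;
    }
    for (size_t i = 0; i < nsym; i++)
        if (len[i])
            code[i] = next[len[i]]++;
}
```

For lengths {2, 1, 3, 3} this yields the prefix-free codes 10, 0, 110 and 111: the single-bit symbol gets 0, and each longer code starts just past the doubled value of the previous length's last code.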

The structure of the compressed data is described in Table 5.6.3.1.

Table 5.6.3.1
Structure of compressed data using the canonical-code scheme

Byte                                       Value
1 to 8                                     Number of elements (64-bit little-endian number)
9 to 16                                    Minimum element
17 to 24                                   Maximum element
25 to 32                                   (Reserved for future use)
33                                         Number of bits directly coded, n
34                                         Maximum number of bits encoded, maxbits
35 to 35 + 2^n − 1                         Number of bits in each direct code
35 + 2^n                                   Number of bits in the stop code
35 + 2^n + 1 to 35 + 2^n + maxbits − n     Number of bits in each indirect code
35 + 2^n + maxbits − n + 1 onwards         Coded data

5.6.3.2. CCP4-style compression

The CCP4-style compression writes the errors in blocks. Each block begins with a 6-bit code. The number of errors in the block is 2^n, where n is the value in bits 0 to 2. Bits 3 to 5 encode the number of bits in each error. The data structure is summarized in Table 5.6.3.2.
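Unpacking the 6-bit block header described above can be sketched as follows; the function name `decode_block_header` is hypothetical, and the bits-per-error table is taken directly from Table 5.6.3.2:

```c
/* Bits-per-error table indexed by the value in bits 3 to 5 of the
   block header (from Table 5.6.3.2). */
static const int ccp4_bits[8] = {0, 4, 5, 6, 7, 8, 16, 65};

/* Sketch: decode a 6-bit CCP4-style block header.  Bits 0 to 2 give
   n, so the block holds 2^n errors; bits 3 to 5 select the number of
   bits used for each error. */
void decode_block_header(unsigned header,
                         unsigned *nerrors, int *bits_per_error) {
    unsigned n = header & 0x07;                 /* bits 0 to 2 */
    unsigned size_code = (header >> 3) & 0x07;  /* bits 3 to 5 */
    *nerrors = 1u << n;                         /* 2^n errors */
    *bits_per_error = ccp4_bits[size_code];
}
```

For example, the header value 0x2B (binary 101011) describes a block of 2^3 = 8 errors of 8 bits each.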

Table 5.6.3.2
Structure of compressed data using the CCP4-style scheme

Value in bits 3 to 5   Number of bits in each error
0                      0
1                      4
2                      5
3                      6
4                      7
5                      8
6                      16
7                      65

Byte         Value
1 to 8       Number of elements (64-bit little-endian number)
9 to 16      Minimum element (currently unused)
17 to 24     Maximum element (currently unused)
25 to 32     (Reserved for future use)
33 onwards   Coded data
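Both schemes begin with a 64-bit little-endian element count in bytes 1 to 8. Reading it portably can be sketched as below; `read_nelements` is a hypothetical helper, not part of the CBFlib API:

```c
#include <stdint.h>

/* Sketch: read the 64-bit little-endian element count from the first
   eight bytes of either scheme's header, independently of the host
   byte order. */
uint64_t read_nelements(const unsigned char *header) {
    uint64_t n = 0;
    for (int i = 7; i >= 0; i--)
        n = (n << 8) | header[i];  /* byte 0 is least significant */
    return n;
}
```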

References

Moffat, A., Bell, T. C. & Witten, I. H. (1997). Lossless compression for text and images. Int. J. High Speed Electron. Syst. 8, 179–231.
