Byte encoding format#
The Tibs.encode() method stores an arbitrary Tibs as a sequence of bytes which can
be used to reconstruct the Tibs via decode().
There are different codecs that are used to compress the data, both a general use Zstandard codec and a Rice codec, which is particularly good at sparse data.
The raw encoding is also very efficient, and all the encoded sequences contain the bit length, which means that they can be safely concatenated without losing any information.
The base implementation does a good job at the smaller bit sequences that compression algorithms would be very inefficient at storing, for example all bit sequences up to 6 bits long are encoded into a single byte. For longer sequences the raw codec overhead is still small.
The choice between Tibs and Mutibs is not part of the encoded data, so
if a Tibs and Mutibs are equal they will encode to the same bytes.
Tibs length |
Raw encoded byte overhead |
|---|---|
0 to 6 bits |
+0 bytes |
7 to 64 bits |
+1 byte |
65 to 1016 bits |
+2 bytes |
1017 to 131064 bits |
+3 bytes |
… |
… |
1 MiB |
+4 bytes |
As an example of using the Rice codec, which is very good at sparse data, let’s compress the sparsest data possible - ten billion zero bits:
>>> b = Tibs.from_zeros(10_000_000_000).encode(Codec.Rice)
>>> b
b'\x0c\x05\xfc\xf5@\xbe?\xf0'
>>> t = Tibs.decode(b)
>>> t.count(0)
10000000000
Obviously compressing ten billion bits into eight bytes is an edge case, but for comparison the Zstandard compression (with default parameters) would use 38 kB for this sequence.
This section gives a format specification, and although it isn’t a formal spec, it should allow other implementations to encode and decode.
Overview#
In this section the notation a..b includes both endpoints a and b.
Each encoded Tibs is in one of three forms:
Single byte form (only 0..6 bits)
Short form (only 7..64 bits)
Long form (any length)
The single byte and short forms can only be used to encode 0..6 bits and 7..64 bits respectively.
The long form can be used for any length, but is required for lengths >64 bits.
The encoding and decoding methods are symmetric. Note that when decoding, any illegal or reserved values encountered are considered errors.
The bit length of the Tibs determines which of the three encodings can be used:
1. Single byte (0..6 bits)#
The first bit must be set. The remaining bits of the byte decode the data as follows:
bit0: single_byte_flag = 1
bit1: is_six_bits_flag
if is_six_bits_flag:
bit2..bit7: bit_data
else:
bit2: is_five_bits_flag
if is_five_bits_flag:
bit3..bit7: bit_data
else:
bit3: is_four_bits_flag
if is_four_bits_flag:
bit4..bit7: bit_data
else:
bit4: is_three_bits_flag:
if is_three_bits_flag:
bit5..bit7: bit_data
else:
bit5: is_two_bits_flag:
if is_two_bits_flag:
bit6..bit7: bit_data
else:
bit6: is_one_bit_flag:
if is_one_bit_flag:
bit7: bit_data
else:
bit7: 1 # Zero bit length
For this single byte, decoding the bit_data will give a sequence of zero to six bits.
The value 10000000 does not correspond to a valid encoding and is reserved.
As an example, the byte 10001110 would be decoded as:
1: single_byte_flag
0: is_six_bits_flag
0: is_five_bits_flag
0: is_four_bits_flag
1: is_three_bits_flag
110: bit_data
so this represents a 3-bit sequence with the value 110.
2. Short form (7..64 bits)#
For short bit sequences bit0 will be unset and bit1 will be set.
The rest of the byte gives the byte length and padding:
bit0: single_byte_flag = 0
bit1: short_form_flag = 1
bit2..bit4: byte_length_minus_1
bit5..bit7: bit_padding
The byte_length_minus_1 will be in the range 0..7, and so will be used for data lengths of
1 to 8 bytes. The bit_padding value is the number of bits to truncate from the end of the
data bytes, and bit lengths of 1 to 6 are reserved because they must use the single byte form.
The data for this is then stored in the next 1 to 8 bytes, left aligned.
For example, the byte 01001111 would be decoded as:
0: single_byte_flag
1: short_form_flag
001: byte_length_minus_1
111: bit_padding
byte_length_minus_1 is 1 and bit_padding is 7, so this will be followed by two
bytes of data with the final seven bits ignored. Including the header byte, the sequence
01001111_11100011_10000000 represents a 9-bit
sequence with the value 111000111.
3. Long form (65+ bits)#
The long form is required for encoding 65 bits or greater. Although it can be used on shorter sequences it is not recommended as it will be less efficient than the specialised encodings.
For long form sequences, both bit0 and bit1 will be unset.
The first byte’s format will be:
bit0: single_byte_flag = 0
bit1: short_form_flag = 0
bit2..bit4: codec
bit5..bit7: bit_padding
There are 3 bits to specify the codec.
|
Byte codec |
|---|---|
|
Raw |
|
Rice |
|
Zstd |
|
Reserved |
The bit_padding decodes as an unsigned integer in the range 0..7. This gives the number
of bits to truncate from the end of the decoded bytes, so allows all bit lengths to be stored.
After this header, a variable-length integer (varint) byte_length is decoded from one
or more bytes:
bit0: continuation_flag
bit1..bit7: length_data
Varint rules:
Each varint byte contributes 7 data bits (
length_data).These are concatenated in the order they are encountered.
continuation_flag == 1means another varint byte follows.continuation_flag == 0marks the final varint byte.
A first varint byte equal to 10000000 is not permitted and is reserved.
Raw decoding#
If the codec is ‘Raw’, this is then followed by byte_length bytes. The raw bit sequence is formed from these bytes with bit_padding bits at the end removed.
This bit sequence is just the bits of the Tibs.
Rice decoding#
If the codec is ‘Rice’, next comes a configuration byte:
bit0..4: k (unsigned int, range 0 - 31)
bit5: sparse_bit
bit6: final_bit
bit7: reserved
This is then followed by byte_length bytes, with bit_padding bits at the end removed in the same way as for raw.
This bit sequence is then decoded as follows:
The bit sequence is a concatenation of Rice-coded unsigned integers using the configured
kvalue.Each integer is decoded in the usual Rice form:
Count a unary prefix of
1bits up to and excluding the next0bit. This count is the quotientq.Consume that
0separator bit.Consume the next
kbits as an unsigned integer remainderr. Ifk == 0, then no remainder bits are consumed andr = 0.The decoded gap value is
gap = q * 2**k + r.
The decoded gaps describe runs of the opposite bit value associated with occurrences of
sparse_bit.If
sparse_bit == 1, each gap gives the number of0bits before the next1bit.If
sparse_bit == 0, each gap gives the number of1bits before the next0bit.For each decoded gap:
Append
gapcopies of the opposite bit.Append one
sparse_bit.
After all gaps have been decoded, replace the final decoded bit with
final_bit.
This means the encoded gaps always reconstruct a sequence ending in sparse_bit. The
final_bit field then overwrites that last bit so that the decoded sequence can end in
either bit value.
For example, let’s decode the four byte sequence b'\x09\x01.\xbe'.
This corresponds to the binary 00001001_00000001_00101110_10111110
byte 00001001
0: single_byte_flag
0: short_form_flag
001: codec (Rice)
001: bit_padding (=1)
byte 00000001
0: length_continuation_flag
0000001: length_data (=1)
So this is Rice encoded with 1 byte of data, with the final bit ignored (so 7 bits of encoded data).
The next byte is the Rice configuration byte 00101110
00101: k (=5)
1: sparse_bit
1: final_bit
0: reserved
and finally the encoded data, which we now know is just 7 bits 1011111
1: prefix count
0: end of prefix => q = 1
11111: r = 31
There is a single 1 bit before the first 0 bit, so the count of these bits gives us q=1.
After the 0 bit we read the next k bits to get the unsigned integer r=31.
This gives us a decoded gap of gap = q * 2**k + r = 1 * 2**5 + 31 = 63.
The sparse_bit is a 1, so the gaps are made of 0 bits. And the final_bit is a 1, so
we just have 63 zero bits followed by a one bit and the decoded sequence is
00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000001
Zstd decoding#
Larger chunks of binary data are often best compressed with a more general algorithm, and Zstandard is a modern, effective and fast option. The main byte payload is compressed, and the extra metadata needed for the Tibs is still stored in the header.
For a final example let’s encode rather than decode. Let’s say we have a 50 bit sequence of all 1.
This could be efficiently coded with Rice encoding, but let’s use Raw just to demonstrate.
First the header byte is:
0: single_byte_flag
0: short_form_flag
000: codec (Raw)
110: bit_padding (6)
Then we encode the byte length. We need 7 bytes to store our 50 bits, so we encode the number 7:
0: length_continuation_flag
0000111: length_data (7)
We don’t need any further bytes to store the byte length, so the length_continuation_flag is set to 0 and
no further length bytes are encoded.
Finally we pad the data with 0 bits up to the byte boundary and store it:
11111111_11111111_11111111_11111111_11111111_11111111_11000000
so the final sequence is
00000110_00000111_11111111_11111111_11111111_11111111_11111111_11111111_11000000
header byte_len bit_data padding
Notes#
The public decoder expects a single complete encoded Tibs value. It raises
ValueError if additional bytes remain after that value.