
Adrien Grand: How fast is bit packing?


One of the most anticipated changes in Lucene/Solr 4.0 is its improved memory efficiency. Indeed, according to several benchmarks, you could expect a 2/3 reduction in memory use for a Lucene-based application (such as Solr or ElasticSearch) compared to Lucene 3.x.

One of the techniques that Lucene uses to reduce its memory footprint is bit-packing. This means that integer array values, instead of having a fixed size (8, 16, 32 or 64 bits per value), can use any number of bits in the [1, 64] range. If you store 17-bit integers this way, the array takes (32 - 17) / 32 ≈ 47% less space than an int[]!

Here is what the interface looks like:

interface Mutable {
  long get(int index);
  void set(int index, long value);
  int size();
}
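
In Lucene, this interface lives in the org.apache.lucene.util.packed package as PackedInts.Mutable, and instances are obtained through a factory. Here is a minimal usage sketch; the exact factory signature has varied across Lucene versions, so treat it as illustrative rather than authoritative:

import org.apache.lucene.util.packed.PackedInts;

public class PackedUsage {
  public static void main(String[] args) {
    // Ask for 10 values of 17 bits each; Lucene picks an implementation
    // behind the interface based on the acceptable memory overhead.
    // (Factory signature is from Lucene 4.x and may differ elsewhere.)
    PackedInts.Mutable values =
        PackedInts.getMutable(10, 17, PackedInts.DEFAULT);
    values.set(3, (1L << 17) - 1);     // largest 17-bit value: 131071
    System.out.println(values.get(3)); // prints 131071
  }
}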

Under the hood, this interface has four families of implementations with different speed/memory trade-offs:

  1. Direct8, Direct16, Direct32 and Direct64, which simply wrap a byte[], a short[], an int[] or a long[],
  2. Packed64, which packs values contiguously in 64-bit (long) blocks,
  3. Packed64SingleBlock, which looks like Packed64 but uses padding bits to prevent values from spanning several blocks (at most 32 bits per value),
  4. Packed8ThreeBlocks and Packed16ThreeBlocks, which store each value in either 3 bytes (24 bits per value) or 3 shorts (48 bits per value).

In case you are interested, the code is available in the Lucene SVN repository.

Direct{8,16,32,64}

The methods of these classes directly translate to operations on an array:

  • Direct8: byte[],
  • Direct16: short[],
  • Direct32: int[],
  • Direct64: long[].

Java is smart enough to make operations on these classes as fast as if you directly modified the underlying array. However, these implementations have the same drawback as plain arrays: if you want to store 17-bit values, you need a Direct32, which has an 88% memory overhead ((32 - 17) / 17 ≈ 88%) for 17-bit values.
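
To make this concrete, here is a minimal sketch of what a Direct32 boils down to (illustrative, not the actual Lucene code), implementing the Mutable interface shown above: get and set translate directly into int[] loads and stores, which is why the JIT can make them as fast as raw array accesses.

class Direct32 implements Mutable {
  private final int[] values;

  Direct32(int size) {
    values = new int[size];
  }

  public long get(int index) {
    return values[index] & 0xFFFFFFFFL; // zero-extend to a long
  }

  public void set(int index, long value) {
    values[index] = (int) value; // keep the low 32 bits
  }

  public int size() {
    return values.length;
  }
}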

Packed64

This implementation stores values contiguously in 64-bit blocks. It is the most compact implementation: storing a million 17-bit values requires roughly 17 * 1000000 / 8 ~= 2MB of memory. One pitfall is that some values span two blocks (whenever the number of bits per value is not a divisor of 64). As a consequence, to avoid costly CPU branches, the get and set methods are a little tricky: they always read or write two blocks, with different shifts and masks.
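
Here is a simplified sketch of that packing layout in a hypothetical Packed64Sketch class (illustrative, not the actual Lucene code: the real implementation replaces the branch below with precomputed shift/mask tables and always touches two blocks, allocating one extra trailing block so the second access is always valid):

class Packed64Sketch implements Mutable {
  private final long[] blocks;
  private final int bitsPerValue;
  private final long mask; // has the low bitsPerValue bits set
  private final int size;

  Packed64Sketch(int size, int bitsPerValue) {
    this.size = size;
    this.bitsPerValue = bitsPerValue;
    this.mask = bitsPerValue == 64 ? ~0L : (1L << bitsPerValue) - 1;
    this.blocks = new long[(int) (((long) size * bitsPerValue + 63) / 64)];
  }

  public long get(int index) {
    final long bitPos = (long) index * bitsPerValue;
    final int block = (int) (bitPos >>> 6); // bitPos / 64
    final int endBits = (int) (bitPos & 63) + bitsPerValue - 64;
    if (endBits <= 0) { // the value fits within a single block
      return (blocks[block] >>> -endBits) & mask;
    }
    // the value spans two blocks: endBits bits overflow into the next one
    return ((blocks[block] << endBits)
        | (blocks[block + 1] >>> (64 - endBits))) & mask;
  }

  public void set(int index, long value) {
    final long bitPos = (long) index * bitsPerValue;
    final int block = (int) (bitPos >>> 6);
    final int endBits = (int) (bitPos & 63) + bitsPerValue - 64;
    if (endBits <= 0) { // single block: clear the slot, then write
      blocks[block] = (blocks[block] & ~(mask << -endBits))
          | ((value & mask) << -endBits);
      return;
    }
    // two blocks: the high bits go at the end of the first block...
    blocks[block] = (blocks[block] & ~(mask >>> endBits))
        | ((value & mask) >>> endBits);
    // ...and the low endBits bits at the start of the second one
    blocks[block + 1] = (blocks[block + 1] & (~0L >>> endBits))
        | (value << (64 - endBits));
  }

  public int size() {
    return size;
  }
}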

Packed64SingleBlock

This implementation is similar to Packed64 but does not allow values to span several blocks. If you store 21-bit values, every block consists of three 21-bit values (3 * 21 = 63 bits) plus 64 - 63 = 1 padding bit (2% space loss). Here are the different value sizes that this class accepts (a sketch of the addressing follows the table).

Bits per value   Values per block   Padding bits   Space loss
32               2                  0              0%
21               3                  1              2%
16               4                  0              0%
12               5                  4              6%
10               6                  4              6%
9                7                  1              2%
8                8                  0              0%
7                9                  1              2%
6                10                 4              6%
5                12                 4              6%
4                16                 0              0%
3                21                 1              2%
2                32                 0              0%
1                64                 0              0%
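
Since no value ever spans two blocks, get and set only need a single shift and mask. Here is a sketch of the addressing (illustrative, not the actual Lucene code, which only accepts the value sizes listed above):

class Packed64SingleBlockSketch implements Mutable {
  private final long[] blocks;
  private final int bitsPerValue;
  private final int valuesPerBlock;
  private final long mask;
  private final int size;

  Packed64SingleBlockSketch(int size, int bitsPerValue) {
    this.size = size;
    this.bitsPerValue = bitsPerValue;
    this.valuesPerBlock = 64 / bitsPerValue; // the remainder is padding
    this.mask = (1L << bitsPerValue) - 1;
    this.blocks = new long[(size + valuesPerBlock - 1) / valuesPerBlock];
  }

  public long get(int index) {
    final int shift = (index % valuesPerBlock) * bitsPerValue;
    return (blocks[index / valuesPerBlock] >>> shift) & mask;
  }

  public void set(int index, long value) {
    final int block = index / valuesPerBlock;
    final int shift = (index % valuesPerBlock) * bitsPerValue;
    blocks[block] = (blocks[block] & ~(mask << shift))
        | ((value & mask) << shift);
  }

  public int size() {
    return size;
  }
}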

Packed{8,16}ThreeBlocks

This class uses three bytes or three shorts to store a single value. It is well suited to 24-bit and 48-bit values, but has a maximum size of Integer.MAX_VALUE / 3 (so that the underlying array can be addressed by an int).
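
Here is a sketch of the byte-based, 24-bit variant (illustrative, not the actual Lucene code): each value is split across three consecutive bytes, so no value ever straddles an awkward boundary and the shifts do not depend on the index.

class Packed8ThreeBlocksSketch implements Mutable {
  private final byte[] blocks;

  Packed8ThreeBlocksSketch(int size) {
    if (size > Integer.MAX_VALUE / 3) {
      throw new IllegalArgumentException("size is too large");
    }
    blocks = new byte[3 * size];
  }

  public long get(int index) {
    final int o = index * 3; // each value owns 3 consecutive bytes
    return ((blocks[o] & 0xFFL) << 16)
        | ((blocks[o + 1] & 0xFFL) << 8)
        | (blocks[o + 2] & 0xFFL);
  }

  public void set(int index, long value) {
    final int o = index * 3;
    blocks[o] = (byte) (value >>> 16);
    blocks[o + 1] = (byte) (value >>> 8);
    blocks[o + 2] = (byte) value;
  }

  public int size() {
    return blocks.length / 3;
  }
}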

How do they compare?

For every number of bits per value, there are 2 to 4 available implementations. One important criterion to select the one that best suits your needs is the memory overhead.

Here are the memory overheads for every number of bits per value and bit-packing scheme. The X-axis is the number of bits per value while the Y-axis is the memory overhead (space loss / actually used space).

For every bit-packing scheme, I only considered the most compact implementation. I could use a Direct64 to store 20-bit values, but it would very likely perform similarly to a Direct32 (probably slightly worse, since the CPU cache could hold fewer values) while requiring twice as much space.

For example, there are 4 available implementations to store 20-bit values:

  • Direct32 (32 bits per value), which has 60% memory overhead
  • Packed64 (20 bits per value), which has 0% memory overhead
  • Packed64SingleBlock (21 bits + 1/3 padding bit per value), which has 7% memory overhead
  • Packed8ThreeBlocks (24 bits per value), which has 20% memory overhead

Even if we now know how compact the different implementations are, it is still very difficult to decide which one to use without knowing their relative performance. This is why I wrote a simple benchmark (sketched after the list below) that, for every number of bits per value in [1, 64]:

  • creates 2 to 4 packed integer arrays (one per implementation) of size 10,000,000
  • tests their random write performance (offsets are randomly chosen in the [0, 10000000) range),
  • tests their random read performance.
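
Here is a rough sketch of what such a harness can look like (my assumption about its shape, not the author's actual code; a serious benchmark would also warm up the JIT and average several runs). It reuses the hypothetical Packed64Sketch class from the earlier sketch:

import java.util.Random;

public class PackedBench {
  private static final int SIZE = 10000000;

  public static void main(String[] args) {
    final Mutable arr = new Packed64Sketch(SIZE, 17); // 17 bits per value
    final Random random = new Random(42);
    final int[] offsets = new int[SIZE];
    for (int i = 0; i < offsets.length; ++i) {
      offsets[i] = random.nextInt(SIZE); // random offsets in [0, 10000000)
    }

    long start = System.nanoTime();
    for (int offset : offsets) {
      arr.set(offset, offset & 0x1FFFF); // random 17-bit writes
    }
    final long writeNanos = System.nanoTime() - start;

    start = System.nanoTime();
    long sum = 0;
    for (int offset : offsets) {
      sum += arr.get(offset); // random reads
    }
    final long readNanos = System.nanoTime() - start;

    // the checksum keeps the JIT from optimizing the reads away
    System.out.println("writes/s: " + (offsets.length * 1e9 / writeNanos));
    System.out.println("reads/s: " + (offsets.length * 1e9 / readNanos)
        + " (checksum: " + sum + ")");
  }
}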

The X-axis is the number of bits per value while the Y-axis is the number of read/written values per second.

The Direct* implementations are clearly faster than the packed implementations (~3x faster than Packed64 and ~2x faster than Packed64SingleBlock). However, it is interesting to observe that the Packed*ThreeBlocks implementations are almost as fast as the Direct* implementations.

Packed64 and Packed64SingleBlock are much faster with small values (1 or 2 bits per value), because the CPU caches can hold many more values at once, resulting in fewer cache misses when accessing the data.

Now, how do read operations compare?

This time the results are very different. The fastest implementations are still the Direct* ones, but they are only ~18% faster than Packed64 and Packed*ThreeBlocks, and only ~8% faster than Packed64SingleBlock on average. This means that for read-only use cases, you could save a lot of memory by switching your arrays to a packed implementation while keeping performance at almost the same level.

Conclusion

Although bit-packing can help reduce memory use significantly, it is very rarely used in practice, probably because:

  • people usually don’t know how many bits per value they actually need,
  • 8, 16, 32 and 64-bit arrays are language built-ins, while packed arrays require some extra coding.

However, this experiment shows that you can achieve significant reductions in memory use with packed integer arrays without sacrificing too much performance, since packed arrays can be almost as fast as raw arrays, especially for read operations.
