Algorithms behind Popcount

Intro

popcount, i.e. “population count”, is a CPU instruction that counts the number of 1-bits in a machine word. With the -mpopcnt option, compilers like GCC and Clang will generate this instruction when needed. For example, std::bitset::count (or std::popcount since C++20). See this example. There’s even a more powerful instruction extension called AVX512VPOPCNTDQ, which process 512bits simultaneously. You can enable it via -mavx512vpopcntdq.

The question I want to share today is what if our platform doesn’t support these instructions? Only software-level algorithms can solve it.

Brian Kernighan’s Algorithm

I’ll first show you a smart way using bitwise operations.

int popcount1(int n) {
    int cnt = 0;
    while (n) {
        n &= n - 1; // key point
        ++cnt;
    }
    return cnt;
}

The key point here is n &= n - 1; Whenever executing this operation, the rightmost 1-bit and 0-bits after are inverted, so, the number of it is the count of 1-bits. See an example with n = 13 (1101),

`n`	`n-1`	`n & (n -1)`	`cnt`
1101	1100	1100	1
1100	1011	1000	2
1000	0111	0	3

Interestingly, modern compilers can recognize this pattern and turn it into popcnt instruction (if supported). See demo.

SWAR Algorithm

For a 32-bit integer, grouped by 2-bit and sum the 1-bits in each group separately, and for 4-bit, 8-bit… finally the 32-bit as a whole. It’s a divide-and-conquer strategy. The summing problem for 32bits is divided into 2 summing problems for 16bits, then 4 problems for 8 bits, and so on. There are only $log_2(32) = 5$ steps.

The code looks as follows,

int popcount2(int n) {
    n = (n & 0x55555555) + ((n >> 1) & 0x55555555); // (1) add 2 bits
    n = (n & 0x33333333) + ((n >> 2) & 0x33333333); // (2) add 4 bits
    n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F); // (3)
    n = (n & 0x00FF00FF) + ((n >> 8) & 0x00FF00FF); // (4)
    n = (n & 0x0000FFFF) + ((n >> 16) & 0x0000FFFF); // (5)
    return n;
}

You may get confused at the first sight of these hardcoded magic numbers. It’ll be clear after my explanation.

The binary representation of 0x5555555 is an 8-times repeated 0b0101, i.e. a 16-times repeated 0b01. Since the integer is split into groups of 2-bit, every group is mapped to a 0b01. After (1) is finished, every 2-bit field stores the number of 1-bits inside. 0x33333333 can be seen as an 8-times repeated 0b0011 for every quad. After (2) is finished, every 4-bit field stores the number of 1-bits inside. And the rest steps are similar.

OK. This shows the rough idea and let’s do some optimizations for more run-time efficiency.

An optimized way of (1) is:

n = n - ((n >> 1) & 0x55555555);

Think about it from the perspective of a 2-bit field. The right part will calculate if its even(left) bit is set. If the even bit is set to 1 and it’s on the second bit, since it’s representing 2, minus the extra 1 is the correct result. If not, nothing happens; the result is if the odd bit is set. The following explains it,

bit pair	`n - ((n >> 1) & 0b01)`
0b00	0b00
0b01	0b01
0b10	0b01
0b11	0b10

There’s a more general equation for this rule. $$ popcount(x) = x - \sum_{n=1}^{31}\lfloor\frac{x}{2^n}\rfloor \qquad (x \ge 0) $$ When there’s no danger that the sum of smaller bit fields will carry over into the next larger fields, the and operation can be omitted. For 4-bit fields, there are up to 4 bits; for 8-bit fields, the maximum value 4+4=8 is enough to be stored in 4-bit fields. It’s safe to change (3) into and no carry kicks in.

n = (n + (n >> 4)) & 0x0F0F0F0F;

From here, you can continue omitting the unnecessary ands for 16-bit and 32-bit fields, but there’s a more efficient optimization using fast multiply instruction.

n = (n * 0x01010101) >> 24;

It is the same as

n = (n + (n << 8) + (n << 16) + (n <<24)) >> 24;

It aims to sum all bit counts into one byte. Because the maximum value is 32(all bits set to 1), one byte is wide enough to fit the value.

The final code is

int popcount2(int n) {
    n = n - ((n >>1) & 0x55555555);
    n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
    n = (n + (n >> 4)) & 0x0F0F0F0F;
    return (n * 0x01010101) >> 24;
}

This algorithm is called SWAR(SIMD Within A Register) because it performs parallel operations on 2/4/8/16-bit fields inside an integer. I was surprised by this amazing algorithm when first met. This algorithm is also how GCC’s __builtin_popcount is implemented for platforms without popcnt instruction. There are also two versions for wider inputs, __builtin_popcountl, and __builtin_popcountll.

The algorithm is similar except for some subtle magic numbers. Say popcountll for a 64-bit integer,

int popcountll (long long n) {
    n = n - ((n >> 1) & 0x5555555555555555);
    n = (n & 0x3333333333333333) + ((n >> 2) & 0x3333333333333333);
    n = (n + (n >> 4)) & 0xF0F0F0F0F0F0F0F;
    return (n * 0x101010101010101) >> 56;
}

All magic numbers are promoted to 64-bit, and the last right-shift operand from $32-8=24$ to $64-8=56$.

Lookup Table

In some older versions of GCC, a lookup table strategy is employed (still used in some rare platforms like RL78 or AVR)

int popcount3(int n) {
    alignas(64) static const uint8_t popcount_tab[256] = {
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
        1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
        1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
        2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
        1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
        2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
        2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
        3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8
    };
    int res = 0;
    for (int i = 0; i < 32; i += 8) {
        res += popcount_tab[(n >> i) & 0xff];
    }
    return res;
}

popcount_tab[256] stores popcount for every value from 0x00 to 0xff. A 32-bit integer is split into four 8-bit values. In each iteration inside the loop, popcount the 8-bit via the table, and sum them up outside the loop.

You can make an even larger table for bitcounts of 0~2^16 and widen the chunk size to 16bits. The method is unbeatable if the table is in the cache. However, a cache miss will slow it down from a few cycles into hundreds of cycles, which is generally unacceptable. Unless doing millions of popcount in a tight loop, you should never consider using it.

Intro#

Brian Kernighan’s Algorithm#

SWAR Algorithm#

Lookup Table#

References#

Intro

Brian Kernighan’s Algorithm

SWAR Algorithm

Lookup Table

References