Intro
popcount, i.e. “population count”, is a CPU instruction that counts the number of 1-bits in a machine word. With the -mpopcnt option, compilers like GCC and Clang will generate this instruction when needed, for example for std::bitset::count (or std::popcount since C++20). See this example. There’s an even more powerful instruction extension called AVX512VPOPCNTDQ, which processes 512 bits at once; you can enable it via -mavx512vpopcntdq.
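For reference, here is a minimal sketch of the two standard-library routes mentioned above (the function names count_bits and count_bits_bitset are my own, not standard); built with -mpopcnt, both typically compile down to a single popcnt instruction.

#include <bit>      // std::popcount, C++20
#include <bitset>
#include <cstdint>

// Illustrative wrappers only; the names are not from any library.
int count_bits(std::uint32_t x) {
    return std::popcount(x);                              // C++20 and later
}

int count_bits_bitset(std::uint32_t x) {
    return static_cast<int>(std::bitset<32>(x).count());  // works before C++20
}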
The question I want to share today is: what if our platform doesn’t support these instructions? Then only software-level algorithms can help.
Brian Kernighan’s Algorithm
I’ll first show you a smart way using bitwise operations.
int popcount1(int n) {
int cnt = 0;
while (n) {
n &= n - 1; // key point
++cnt;
}
return cnt;
}
The key point here is n &= n - 1;. Computing n - 1 inverts the rightmost 1-bit and all the 0-bits after it, so n & (n - 1) clears exactly the rightmost 1-bit. The loop therefore runs once per set bit, and the number of iterations is the count of 1-bits. See an example with n = 13 (1101):
n | n-1 | n & (n-1) | cnt
---|---|---|---
1101 | 1100 | 1100 | 1
1100 | 1011 | 1000 | 2
1000 | 0111 | 0000 | 3
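As a tiny sanity check (my own snippet, assuming popcount1 from above is in scope), 13 = 0b1101 should indeed report three 1-bits:

// Verifies the trace shown in the table above.
int main() {
    return popcount1(13) == 3 ? 0 : 1;  // exits with 0 when the count matches
}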
Interestingly, modern compilers can recognize this pattern and turn it into the popcnt instruction (if supported). See demo.
SWAR Algorithm
For a 32-bit integer, group the bits into 2-bit fields and sum the 1-bits inside each field, then do the same for 4-bit fields, 8-bit fields, … and finally for the 32-bit word as a whole. It’s a divide-and-conquer strategy: the summing problem for 32 bits is divided into 2 summing problems for 16 bits, then 4 problems for 8 bits, and so on. There are only $\log_2(32) = 5$ steps.
The code looks as follows,
int popcount2(int n) {
n = (n & 0x55555555) + ((n >> 1) & 0x55555555); // (1) each 2-bit field holds its 1-bit count
n = (n & 0x33333333) + ((n >> 2) & 0x33333333); // (2) each 4-bit field holds its 1-bit count
n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F); // (3) each 8-bit field holds its 1-bit count
n = (n & 0x00FF00FF) + ((n >> 8) & 0x00FF00FF); // (4) each 16-bit field holds its 1-bit count
n = (n & 0x0000FFFF) + ((n >> 16) & 0x0000FFFF); // (5) the full 32-bit word holds the total
return n;
}
You may be confused at first sight by these hardcoded magic numbers; it’ll become clear after the explanation.
The binary representation of 0x55555555 is 0b0101 repeated 8 times, i.e. 0b01 repeated 16 times. Since the integer is split into 2-bit groups, every group is masked with 0b01. After (1) finishes, every 2-bit field stores the number of 1-bits inside it. Likewise, 0x33333333 is 0b0011 repeated 8 times, one 0b0011 per 4-bit group. After (2) finishes, every 4-bit field stores the number of 1-bits inside it. The remaining steps are analogous.
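To see the fields in action, here is a minimal trace of the first two steps (my own example, shrunk to a single byte for readability; 0x55 and 0x33 are just the low bytes of the full 32-bit masks):

#include <cstdio>

int main() {
    unsigned n = 0xDB;                      // 0b11011011, six 1-bits
    n = (n & 0x55) + ((n >> 1) & 0x55);     // each 2-bit pair now holds 2, 1, 1, 2
    std::printf("after (1): 0x%02X\n", n);  // prints 0x96 = 0b10010110
    n = (n & 0x33) + ((n >> 2) & 0x33);     // each nibble now holds 3, 3
    std::printf("after (2): 0x%02X\n", n);  // prints 0x33 = 0b00110011
    return 0;
}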
OK. This shows the rough idea; now let’s do some optimizations for more run-time efficiency.
An optimized way of writing (1) is:
n = n - ((n >> 1) & 0x55555555);
Think about it from the perspective of a single 2-bit field. The right-hand part, (n >> 1) & 0x55555555, extracts the field’s left (high) bit. If that bit is set, it contributes 2 to the field’s value, so subtracting the extra 1 leaves the correct count; if not, nothing is subtracted and the result is simply the right (low) bit. The following table shows all four cases,
bit pair | n - ((n >> 1) & 0b01)
---|---
0b00 | 0b00
0b01 | 0b01
0b10 | 0b01
0b11 | 0b10
There’s a more general equation for this rule:
$$
\operatorname{popcount}(x) = x - \sum_{n=1}^{31}\left\lfloor\frac{x}{2^n}\right\rfloor \qquad (x \ge 0)
$$
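As a quick check with x = 13 (my own example), only the first three terms of the sum are non-zero:
$$
13 - \left(\left\lfloor\tfrac{13}{2}\right\rfloor + \left\lfloor\tfrac{13}{4}\right\rfloor + \left\lfloor\tfrac{13}{8}\right\rfloor\right) = 13 - (6 + 3 + 1) = 3 = \operatorname{popcount}(13)
$$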
When there’s no danger that the sum of two smaller fields will carry over into the next, larger field, some of the and operations can be dropped. Before (3), each 4-bit field holds a count of at most 4, so the sum of two adjacent fields is at most 4 + 4 = 8, which still fits in a 4-bit field and no carry kicks in. It’s therefore safe to change (3) so that a single mask is applied after the addition instead of two masks before it:
n = (n + (n >> 4)) & 0x0F0F0F0F;
From here, you can continue dropping the unnecessary ands for the 16-bit and 32-bit fields, but there’s a more efficient optimization using a fast multiply instruction.
n = (n * 0x01010101) >> 24;
It is the same as
n = (n + (n << 8) + (n << 16) + (n << 24)) >> 24;
which sums all four byte counts into the top byte and then shifts it down. Because the maximum possible count is 32 (all bits set), one byte is wide enough to hold the value.
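For example (byte values chosen purely for illustration), suppose the four bytes of n hold the counts 3, 5, 2 and 6 after step (3):
$$
\begin{aligned}
n &= \texttt{0x03050206} \\
n + (n \ll 8) + (n \ll 16) + (n \ll 24) &= \texttt{0x100D0806} \\
(n \times \texttt{0x01010101}) \gg 24 &= \texttt{0x10} = 3 + 5 + 2 + 6 = 16
\end{aligned}
$$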
The final code is
int popcount2(int n) {
n = n - ((n >> 1) & 0x55555555);
n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
n = (n + (n >> 4)) & 0x0F0F0F0F;
return (n * 0x01010101) >> 24;
}
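A quick sanity check (my own snippet, assuming popcount1 and popcount2 from above are both in scope) that the SWAR version agrees with the Brian Kernighan version:

int main() {
    // Arbitrary non-negative test values; negative inputs are avoided because
    // of signed-shift subtleties.
    int tests[] = {0, 1, 13, 255, 0x12345678, 0x7FFFFFFF};
    for (int x : tests) {
        if (popcount1(x) != popcount2(x)) return 1;  // mismatch
    }
    return 0;  // the two implementations agree on every test value
}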
This algorithm is called SWAR (SIMD Within A Register) because it performs parallel operations on 2/4/8/16-bit fields inside a single register. I was surprised by this amazing algorithm when I first met it. It is also how GCC’s __builtin_popcount is implemented on platforms without the popcnt instruction. There are two more versions for wider inputs, __builtin_popcountl and __builtin_popcountll.
The algorithm is the same except for some subtle changes to the magic numbers. Take popcountll for a 64-bit integer,
int popcountll(long long n) {
n = n - ((n >> 1) & 0x5555555555555555);
n = (n & 0x3333333333333333) + ((n >> 2) & 0x3333333333333333);
n = (n + (n >> 4)) & 0x0F0F0F0F0F0F0F0F;
return (n * 0x0101010101010101) >> 56;
}
All magic numbers are widened to 64 bits, and the final right-shift amount changes from $32-8=24$ to $64-8=56$.
Lookup Table
In some older versions of GCC, a lookup-table strategy is employed instead (and it is still used on some rare platforms such as RL78 or AVR):
#include <cstdint>

int popcount3(int n) {
alignas(64) static const uint8_t popcount_tab[256] = {
0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8
};
int res = 0;
for (int i = 0; i < 32; i += 8) {
res += popcount_tab[(n >> i) & 0xff];
}
return res;
}
popcount_tab[256] stores the popcount of every value from 0x00 to 0xff. The 32-bit integer is split into four 8-bit chunks; each loop iteration looks one chunk up in the table and adds the result to res.
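The table doesn’t have to be typed out by hand. Here is a sketch (my own code, not how GCC generates it) that builds the same 256 entries from the recurrence popcount(i) = popcount(i / 2) + (i mod 2):

#include <cstdint>

static uint8_t byte_popcount_tab[256];

// Fills the table at startup; entry 0 is already zero because the array has
// static storage duration.
void init_byte_popcount_tab() {
    for (int i = 1; i < 256; ++i)
        byte_popcount_tab[i] = byte_popcount_tab[i >> 1] + (i & 1);
}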
You can also build a larger table holding the bit counts of 0 through 2^16 - 1 and widen the chunk size to 16 bits. The method is hard to beat while the table stays in cache; however, a cache miss turns a lookup from a few cycles into hundreds of cycles, which is generally unacceptable. Unless you are doing millions of popcounts in a tight loop, you should not consider it.
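For completeness, a sketch of that 16-bit-chunk variant (my own code; popcount_tab16 and popcount4 are hypothetical names); the table alone occupies 64 KiB, which is exactly why cache behaviour dominates its cost:

#include <cstdint>

static uint8_t popcount_tab16[1 << 16];  // 64 KiB, one entry per 16-bit value

void init_popcount_tab16() {
    // Same recurrence as before; entry 0 is already zero (static storage).
    for (int i = 1; i < (1 << 16); ++i)
        popcount_tab16[i] = popcount_tab16[i >> 1] + (i & 1);
}

int popcount4(int n) {
    // Two table lookups instead of four: the low half and the high half.
    return popcount_tab16[n & 0xFFFF] + popcount_tab16[(n >> 16) & 0xFFFF];
}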