Intro
popcount, i.e. “population count”, is a CPU instruction that counts the number of 1-bits in a machine word. With the -mpopcnt
option, compilers like GCC and Clang will generate this instruction when needed. For example, std::bitset::count
(or std::popcount
since C++20). See this example. There’s even a more powerful instruction extension called AVX512VPOPCNTDQ, which process 512bits simultaneously. You can enable it via -mavx512vpopcntdq
.
The question I want to share today is what if our platform doesn’t support these instructions? Only software-level algorithms can solve it.
Brian Kernighan’s Algorithm
I’ll first show you a smart way using bitwise operations.
int popcount1(int n) {
int cnt = 0;
while (n) {
n &= n - 1; // key point
++cnt;
}
return cnt;
}
The key point here is n &= n - 1;
Whenever executing this operation, the rightmost 1-bit
and 0-bit
s after are inverted, so, the number of it is the count of 1-bit
s. See an example with n = 13 (1101)
,
n |
n-1 |
n & (n -1) |
cnt |
---|---|---|---|
1101 | 1100 | 1100 | 1 |
1100 | 1011 | 1000 | 2 |
1000 | 0111 | 0 | 3 |
Interestingly, modern compilers can recognize this pattern and turn it into popcnt
instruction (if supported). See demo.
SWAR Algorithm
For a 32-bit integer, grouped by 2-bit
and sum the 1-bit
s in each group separately, and for 4-bit
, 8-bit
… finally the 32-bit
as a whole. It’s a divide-and-conquer strategy. The summing problem for 32bits is divided into 2 summing problems for 16bits, then 4 problems for 8 bits, and so on. There are only steps.
The code looks as follows,
int popcount2(int n) {
n = (n & 0x55555555) + ((n >> 1) & 0x55555555); // (1) add 2 bits
n = (n & 0x33333333) + ((n >> 2) & 0x33333333); // (2) add 4 bits
n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F); // (3)
n = (n & 0x00FF00FF) + ((n >> 8) & 0x00FF00FF); // (4)
n = (n & 0x0000FFFF) + ((n >> 16) & 0x0000FFFF); // (5)
return n;
}
You may get confused at the first sight of these hardcoded magic numbers. It’ll be clear after my explanation.
The binary representation of 0x5555555
is an 8-times repeated 0b0101
, i.e. a 16-times repeated 0b01
. Since the integer is split into groups of 2-bit, every group is mapped to a 0b01
. After (1)
is finished, every 2-bit field stores the number of 1-bit
s inside. 0x33333333
can be seen as an 8-times repeated 0b0011
for every quad. After (2)
is finished, every 4-bit field stores the number of 1-bit
s inside. And the rest steps are similar.
OK. This shows the rough idea and let’s do some optimizations for more run-time efficiency.
An optimized way of (1)
is:
n = n - ((n >> 1) & 0x55555555);
Think about it from the perspective of a 2-bit
field. The right part will calculate if its even(left) bit is set. If the even bit is set to 1
and it’s on the second bit, since it’s representing 2
, minus the extra 1
is the correct result. If not, nothing happens; the result is if the odd bit is set. The following explains it,
bit pair | n - ((n >> 1) & 0b01) |
---|---|
0b00 | 0b00 |
0b01 | 0b01 |
0b10 | 0b01 |
0b11 | 0b10 |
There’s a more general equation for this rule.
When there’s no danger that the sum of smaller bit fields will carry over into the next larger fields, the and operation can be omitted. For 4-bit
fields, there are up to 4 bits; for 8-bit
fields, the maximum value 4+4=8
is enough to be stored in 4-bit
fields. It’s safe to change (3)
into and no carry kicks in.
n = (n + (n >> 4)) & 0x0F0F0F0F;
From here, you can continue omitting the unnecessary ands for 16-bit
and 32-bit
fields, but there’s a more efficient optimization using fast multiply instruction.
n = (n * 0x01010101) >> 24;
It is the same as
n = (n + (n << 8) + (n << 16) + (n <<24)) >> 24;
It aims to sum all bit counts into one byte. Because the maximum value is 32(all bits set to 1
), one byte is wide enough to fit the value.
The final code is
int popcount2(int n) {
n = n - ((n >>1) & 0x55555555);
n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
n = (n + (n >> 4)) & 0x0F0F0F0F;
return (n * 0x01010101) >> 24;
}
This algorithm is called SWAR(SIMD Within A Register) because it performs parallel operations on 2/4/8/16-bit
fields inside an integer. I was surprised by this amazing algorithm when first met. This algorithm is also how GCC’s __builtin_popcount
is implemented for platforms without popcnt
instruction. There are also two versions for wider inputs, __builtin_popcountl,
and __builtin_popcountll
.
The algorithm is similar except for some subtle magic numbers. Say popcountll
for a 64-bit
integer,
int popcountll (long long n) {
n = n - ((n >> 1) & 0x5555555555555555);
n = (n & 0x3333333333333333) + ((n >> 2) & 0x3333333333333333);
n = (n + (n >> 4)) & 0xF0F0F0F0F0F0F0F;
return (n * 0x101010101010101) >> 56;
}
All magic numbers are promoted to 64-bit, and the last right-shift operand from to .
Lookup Table
In some older versions of GCC, a lookup table strategy is employed (still used in some rare platforms like RL78 or AVR)
int popcount3(int n) {
alignas(64) static const uint8_t popcount_tab[256] = {
0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8
};
int res = 0;
for (int i = 0; i < 32; i += 8) {
res += popcount_tab[(n >> i) & 0xff];
}
return res;
}
popcount_tab[256]
stores popcount for every value from 0x00
to 0xff
. A 32-bit integer is split into four 8-bit values. In each iteration inside the loop, popcount the 8-bit via the table, and sum them up outside the loop.
You can make an even larger table for bitcounts of 0~2^16
and widen the chunk size to 16
bits. The method is unbeatable if the table is in the cache. However, a cache miss will slow it down from a few cycles into hundreds of cycles, which is generally unacceptable. Unless doing millions of popcount in a tight loop, you should never consider using it.