### Intro

popcount, i.e. “population count”, is a CPU instruction that counts the number of 1-bits in a machine word. With the `-mpopcnt`

option, compilers like GCC and Clang will generate this instruction when needed. For example, `std::bitset::count`

(or `std::popcount`

since C++20). See this example. There’s even a more powerful instruction extension called *AVX512VPOPCNTDQ*, which process 512bits simultaneously. You can enable it via `-mavx512vpopcntdq`

.

The question I want to share today is what if our platform doesn’t support these instructions? Only software-level algorithms can solve it.

## Brian Kernighan’s Algorithm

I’ll first show you a smart way using bitwise operations.

```
int popcount1(int n) {
int cnt = 0;
while (n) {
n &= n - 1; // key point
++cnt;
}
return cnt;
}
```

The key point here is `n &= n - 1;`

Whenever executing this operation, the **rightmost** `1-bit`

and `0-bit`

s after are inverted, so, the number of it is the count of `1-bit`

s. See an example with `n = 13 (1101)`

,

`n` |
`n-1` |
`n & (n -1)` |
`cnt` |
---|---|---|---|

1101 | 1100 | 1100 | 1 |

1100 | 1011 | 1000 | 2 |

1000 | 0111 | 0 | 3 |

Interestingly, modern compilers can recognize this pattern and turn it into `popcnt`

instruction (if supported). See demo.

## SWAR Algorithm

For a 32-bit integer, grouped by `2-bit`

and sum the `1-bit`

s in each group separately, and for `4-bit`

, `8-bit`

… finally the `32-bit`

as a whole. It’s a *divide-and-conquer* strategy. The summing problem for 32bits is divided into 2 summing problems for 16bits, then 4 problems for 8 bits, and so on. There are only $log_2(32) = 5$ steps.

The code looks as follows,

```
int popcount2(int n) {
n = (n & 0x55555555) + ((n >> 1) & 0x55555555); // (1) add 2 bits
n = (n & 0x33333333) + ((n >> 2) & 0x33333333); // (2) add 4 bits
n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F); // (3)
n = (n & 0x00FF00FF) + ((n >> 8) & 0x00FF00FF); // (4)
n = (n & 0x0000FFFF) + ((n >> 16) & 0x0000FFFF); // (5)
return n;
}
```

You may get confused at the first sight of these hardcoded magic numbers. It’ll be clear after my explanation.

The binary representation of `0x5555555`

is an 8-times repeated `0b0101`

, i.e. a 16-times repeated `0b01`

. Since the integer is split into groups of 2-bit, every group is mapped to a `0b01`

. After `(1)`

is finished, every 2-bit field stores the number of `1-bit`

s inside. `0x33333333`

can be seen as an 8-times repeated `0b0011`

for every quad. After `(2)`

is finished, every 4-bit field stores the number of `1-bit`

s inside. And the rest steps are similar.

OK. This shows the rough idea and let’s do some optimizations for more run-time efficiency.

An optimized way of `(1)`

is:

```
n = n - (n >> 1) & 0x55555555;
```

Think about it from the perspective of a `2-bit`

field. The right part will calculate if its even(left) bit is set. If the even bit is set to `1`

and it’s on the second bit, since it’s representing `2`

, minus the extra `1`

is the correct result. If not, nothing happens; the result is if the odd bit is set. The following explains it,

bit pair | `n - (n >> 1) & 0b01 ` |
---|---|

0b00 | 0b00 |

0b01 | 0b01 |

0b10 | 0b01 |

0b11 | 0b10 |

There’s a more general equation for this rule.
$$
popcount(x) = x - \sum_{n=1}^{31}\lfloor\frac{x}{n}\rfloor \qquad (x \ge 0)
$$
When there’s no danger that the sum of smaller bit fields will carry over into the next larger fields, the *and operation* can be omitted. For `4-bit `

fields, there are up to 4 bits; for `8-bit`

fields, the maximum value `4+4=8`

is enough to be stored in `4-bit`

fields. It’s safe to change `(3)`

into and no carry kicks in.

```
n = (n + (n >> 4)) & 0x0F0F0F0F;
```

From here, you can continue omitting the unnecessary *and*s for `16-bit`

and `32-bit`

fields, but there’s a more efficient optimization using fast multiply instruction.

```
n = (n * 0x01010101) >> 24;
```

It is the same as

```
n = (n + (n << 8) + (n << 16) + (n <<24)) >> 24;
```

It aims to sum all bit counts into one byte. Because the maximum value is 32(all bits set to `1`

), one byte is wide enough to fit the value.

The final code is

```
int popcount2(int n) {
n = n - (n >>1) & 0x55555555;
n = (n & 0x33333333) + (n >> 2) & 0x33333333;
n = (n + n >> 4) & 0x0F0F0F0F;
return (n * 0x01010101) >> 24;
}
```

This algorithm is called *SWAR*(SIMD Within A Register) because it performs parallel operations on `2/4/8/16-bit`

fields inside an integer. I was surprised by this amazing algorithm when first met. This algorithm is also how GCC’s `__builtin_popcount`

is implemented for platforms without `popcnt`

instruction. There are also two versions for wider inputs, `__builtin_popcountl,`

and `__builtin_popcountll`

.

The algorithm is similar except for some subtle magic numbers. Say `popcountll`

for a `64-bit`

integer,

```
int popcountll (long long n) {
n = n - ((n >> 1) & 0x5555555555555555);
n = (n & 0x3333333333333333) + ((n >> 2) & 0x3333333333333333);
n = (n + (n >> 4)) & 0xF0F0F0F0F0F0F0F;
return (n * 0x101010101010101) >> 56;
}
```

All magic numbers are promoted to 64-bit, and the last right-shift operand from $32-8=24$ to $64-8=56$.

## Lookup Table

In some older versions of GCC, a lookup table strategy is employed (still used in some rare platforms like *RL78* or *AVR*)

```
int popcount3(int n) {
alignas(64) static const uint8_t popcount_tab[256] = {
0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8
};
int res = 0;
for (int i = 0; i < 32; i += 8) {
res += popcount_tab[(n >> i) & 0xff];
}
return res;
}
```

`popcount_tab[256]`

stores popcount for every value from `0x00`

to `0xff`

. A 32-bit integer is split into four 8-bit values. In each iteration inside the loop, popcount the 8-bit via the table, and sum them up outside the loop.

You can make an even larger table for bitcounts of `0~2^16`

and widen the chunk size to `16`

bits. The method is unbeatable if the table is in the cache. However, a cache miss will slow it down from a few cycles into hundreds of cycles, which is generally unacceptable. Unless doing millions of popcount in a tight loop, you should never consider using it.