Find First Set and Popcount purpose

Newsgroups: comp.arch

From: pmontgom@cwi.nl (Peter L. Montgomery)

Subject: Re: Find first bit and population count

Organization: CWI, Amsterdam

Date: Mon, 24 Jun 1996 07:44:08 GMT

In article <4qkih7$85p@agate.berkeley.edu> davids@ICSI.Berkeley.EDU

(David Petrie Stoutamire) writes:

>An application I am working on needs to quickly identity the first set

>bit in a word, starting from LSB. Sparc v9 has a population count

>instruction from which find-first-bit can be synthesized:

>

> neg in, temp

> xnor in, temp, temp

> popc temp, result

Be careful, since in = 0 and in = 2^31 yield the

same values for in ^ (-in) using 32-bit operations.

It's safer to use (in - 1) & ~in or in & (-in).

>When I replace v8 hand-written code for find-first-bit with this

>instruction sequence on an ultrasparc I, performance drops severely on

>this section of code. Based on this, I suspect "popc" is not really

>supported in hardware. I couldn't find any on-line documentation from

>Sun one way or the other. Sun provides ffs() as a library, but

>disassembly shows this to not be a clever implementation, nor does it

>use "popc".

>

>This leads to my questions:

>

> 1. Which popular ISAs support find-first-bit or pop count?

Cray (leading zero count, population count).

Intel x86 (find first bit starting from right or left).

CDC Cyber series (population count).

I think the Alliant also had these operations.

Of these, only x86 is `popular' today.

> 3. Can any common applications make use of it?

It occurs in the binary GCD algorithm (ulong = unsigned long):

ulong binary_gcd (ulong x, ulong y)

{ if (x == 0 || y == 0) {

return x | y;

} else {

ulong n1 = x >> trailing_zero_count(x);

ulong n2 = y >> trailing_zero_count(y);

while (n1 != n2) { /* Both are odd */

if (n1 > n2) {

n1 -= n2;

n1 >>= trailing_zero_count(n1);

} else {

n2 -= n1;

n2 >>= trailing_zero_count(n2);

}

}

return n1 << trailing_zero_count(x | y); } } Yes, we can replace n1 >>= trailing_zero_count(n1);

by

do {n1 >>= 1;} while (n1 & 1);

since n1 is known to be even and nonzero at that point.

But we desire an algorithm with predictable branches.

The IF-ELSE can be replaced by a branchless sequence which first

finds MIN(x, y) and MAX(x, y), on architectures with

conditional moves. It's harder to optimize

trailing_zero_count without hardware support such as

the population count function.

In a linear algebra application over the field GF(2),

I need the dot product of two binary vectors v1 and v2.

It is population_count(v1 & v2) & 1.

Using the binary method of exponentiation, computing

a power b**e takes truncated_base2_logarithm(e) squarings

and population_count(e)-1 multiplications, for integer e > 0.

In 1970, after I had taken a statistics course, and was

vending machine manager at my dormitory, I took a

survey in which everyone marked the products he liked.

Let V[v, p] = 1 if voter v likes product p,

and V[v, p] = 0 if voter v dislikes product p.

I computed sum_v V[v, p1] * V[v, p2] for all pairs (p1, p2),

as part of the formula for the correlation between p1 and p2.

A few weeks later, in an assembly language course, I learned

that the machine (a CDC 6400) lacked an integer multiply

instruction even though this was a primitive in the programming

language (Fortran) I had learned earlier.

Had I used binary AND rather than multiplication,

the program would have run faster (recall all values are 0 or 1).

This was in the days when CPU usage was expensive.

By the end of the term, I realized I could have

packed (for fixed p) all V[v, p] values into a few 60-bit words

(3 words since we had 150 voters). A few bitwise ANDs

and population counts would produce the sums I needed.

The operations of population count, finding the leftmost

nonzero bit (i.e., base 2 logarithm), and finding the rightmost

nonzero bit (i.e., finding the exponent of 2 dividing an integer)

occur frequently enough in applications that programming languages

should have these primitives and compilers should inline them.

At this time, however, their performance is not critical

for my applications.

> 4. Are there clever hacks for fast find-first-bit? The pop count hack:

One can compute

temp = in & (-in);

Then temp = 0 if in = 0; otherwise temp is a power of 2.

If p > 32 (or p > 64) is a prime for which 2 is a primitive root,

then you can compute (temp % p) and look up the answer

using a memory reference.

Alas, integer division is slow. If the high half of the

integer product is available, you can get temp/p

using one multiplication and a few simple operations.

I recall that SPARC v9 has the high half of a 32 x 32

product, but not of a 64 x 64 product.

If you are willing to use floating point, try

float f = (float) (signed long) (in | (-in));

Now look at the exponent of f (which is the negative of a

power of 2 unless in = 0).

--

Peter L. Montgomery pmontgom@cwi.nl San Rafael, California

My sex is male. I am ineligible for an abortion.

From: glew@ichips.intel.com (Andy Glew)

Newsgroups: comp.arch

Subject: Re: Find first bit and population count

Date: 30 Jun 1996 07:05:33 GMT

Organization: Intel Corporation

From: d.sand@ix.netcom.com(Duane Sand)

Subject: Re: Find first bit and population count

Newsgroups: comp.arch

Date: 24 Jun 1996 18:19:48 GMT

Organization: Netcom

davids@ICSI.Berkeley.EDU (David Petrie Stoutamire) writes:

> 3. Can any common applications make use of it?

I believe the decryption spooks at the National Security Agency are

very fond of the population-count operation. It seems to be so useful

to them that every US-made govt-subsidized supercomputer has included

this instruction even though there is little other use for it.

Actually, speech recognition people (and probably any pattern

recognition people) have a use for POP count:

POP( A AND B ) = the inner product (dot product) of one bit precision feature

vectors

POP( A XOR B ) = Huffman distance metric

A preliminary stage in many pattern recognition algorithms is to

compute a quick distance metric between the input pattern, and some

subset of the reference patterns in the database. You can do this via

Huffman distance, or you can interpret it is a projection.

When projecting, you can arbitrarily tradeoff precision for speed

- i.e. sometimes you might do the dot product in 64 bit floating

point, sometimes in the 8 bit precision that Intel's MMX and similar

"multimedia" instruction sets provide. 1 bit precision is just the

extrapolation of this.

Thus, you can use POP count to take a quick check of many simple

reference vectors, afterwards zooming in on a more full precision

check.

Come to think of it, this is probably the same reason that the NSA

uses.

---

Andy "Krazy" Glew, glew@ichips.intel.com, Intel,

M/S JF3-206, 2111 NE 25th Ave., Hillsboro, OR 97124

Place URGENT in email subject line for mail filter prioritization.

DISCLAIMER: private posting, not representative of employer.

We're looking for a few good microarchitects for our research labs.

{{ VLIW: the leading edge of the last generation of computer architecture }}

From: preston@Tera.COM (Preston Briggs)

Newsgroups: comp.arch

Subject: Re: Find first bit and population count

Date: 24 Jun 1996 09:20:25 -0700

davids@ICSI.Berkeley.EDU (David Petrie Stoutamire) writes:

> 1. Which popular ISAs support find-first-bit or pop count?

The Tera (certainly the most popular architecture around here) has

instructions for pop-count, find-first-1 and find-first-0, from left

or right).

> 3. Can any common applications make use of it?

Pop count might be used for cardinality of sets represented as bit

vectors. We use find-leftmost-1 (along with bitwise-matrix-multiply)

to support some of the common C string manipulation functions. Lets

us do things like copy or compare NULL-terminated strings at a rate of

8 bytes per 3 instructions.

Preston Briggs

From: preston@Tera.COM (Preston Briggs)

Newsgroups: comp.arch

Subject: Re: Find first bit and population count

Date: 1 Jul 1996 11:10:35 -0700

I wrote:

> We use find-leftmost-1 (along with bitwise-matrix-multiply)

> to support some of the common C string manipulation functions. Lets

> us do things like copy or compare NULL-terminated strings at a rate of

> 8 bytes per 3 instructions.

And Cliff Click wrote:

>Ok, I give. How does find-leftmost-1 help in C string functions?

Imagine we're doing the the easy one -- strlen.

We want to do it in word-sized chunks, 8 bytes per load.

We load a word, then do a bitwise matrix multiply with -1,

then find-leftmost-0 to see if there are any 0 bits in the result.

If so, we've found the end of the string and the position of the bit,

divided by 8, should be added to the accumulated length.

Obviously the trickiness is in the bitwise matrix multiply. This

operation treats a pair of 64-bit words as 8x8 bit matrices and

multiplies them using the boolean operations of AND for multiplying

two bits and either OR of XOR for summing. For this application, we

need the OR variant (the XOR version is useful for computing GF(2)).

Imagine we have 8 bytes of a string that look like

abcdefg\0

organized as an 8x8 bit matrix, we'd have

01100001 11111111 11111111

01100010 11111111 11111111

01100011 11111111 11111111

01100100 x 11111111 = 11111111

01100101 11111111 11111111

01100110 11111111 11111111

01100111 11111111 11111111

00000000 11111111 00000000

If you're counting cycles, recall that we issue 3 operations per tick,

and all operations require the same time. The operations used in the

inner loop are

load

inc

bit-mat-mul

find-leftmost-0

conditional branch

which we software pipeline into 3 instructions.

Preston Briggs

Newsgroups: comp.arch

Subject: Re: Find first bit and population count

From: Bruce@hoult.actrix.gen.nz (Bruce Hoult)

Date: Tue, 2 Jul 1996 20:22:42 +1200 (NZST)

cliffc@ami.sps.mot.com (Cliff Click) writes:

> preston@Tera.COM (Preston Briggs) writes:

>

> davids@ICSI.Berkeley.EDU (David Petrie Stoutamire) writes:

>

> > 1. Which popular ISAs support find-first-bit or pop count?

> > 3. Can any common applications make use of it?

>

> Pop count might be used for cardinality of sets represented as bit

> vectors. We use find-leftmost-1 (along with bitwise-matrix-multiply)

> to support some of the common C string manipulation functions. Lets

> us do things like copy or compare NULL-terminated strings at a rate of

> 8 bytes per 3 instructions.

>

> Preston Briggs

>

> Ok, I give. How does find-leftmost-1 help in C string functions?

The expression...

~n & (n - 0x01010101) & 0x80808080

... gives zero if and only if none of the bytes in the word are nonzero.

If the result is nonzero then the position of the first 1 tells you which

byte was nonzero.

You might want to disassemble your own company's "libmoto", where the C

string functions make use of this property. :-)

-- Bruce