rhy0lite

22 June 2011

Find First Set and Popcount purpose

Newsgroups: comp.arch
From: pmontgom@cwi.nl (Peter L. Montgomery)
Subject: Re: Find first bit and population count
Organization: CWI, Amsterdam
Date: Mon, 24 Jun 1996 07:44:08 GMT

In article <4qkih7$85p@agate.berkeley.edu> davids@ICSI.Berkeley.EDU
(David Petrie Stoutamire) writes:

>An application I am working on needs to quickly identify the first set
>bit in a word, starting from LSB. Sparc v9 has a population count
>instruction from which find-first-bit can be synthesized:
>
> neg in, temp
> xnor in, temp, temp
> popc temp, result

Be careful, since in = 0 and in = 2^31 yield the
same values for in ^ (-in) using 32-bit operations.
It's safer to use (in - 1) & ~in or in & (-in).
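A small C sketch of the pitfall (popcount32 is a software stand-in for the popc instruction, and the function names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Portable popcount (illustrative; popc would do this in hardware). */
static int popcount32(uint32_t x) {
    int c = 0;
    while (x) { x &= x - 1; c++; }
    return c;
}

/* The neg/xnor/popc sequence computes popcount(~(in ^ -in)), the
   1-based position of the lowest set bit -- but in = 0 and in = 2^31
   both make ~(in ^ -in) equal 0xFFFFFFFF, so both return 32. */
static int ffs_xnor(uint32_t in) {
    return popcount32(~(in ^ (0u - in)));
}

/* Safer: (in - 1) & ~in sets exactly the bits below the lowest set
   bit, so popcount gives the 0-based bit index, with 32 for in = 0,
   which no longer collides with in = 2^31 (which yields 31). */
static int ffs_safe(uint32_t in) {
    return popcount32((in - 1) & ~in);
}
```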

>When I replace v8 hand-written code for find-first-bit with this
>instruction sequence on an ultrasparc I, performance drops severely on
>this section of code. Based on this, I suspect "popc" is not really
>supported in hardware. I couldn't find any on-line documentation from
>Sun one way or the other. Sun provides ffs() as a library, but
>disassembly shows this to not be a clever implementation, nor does it
>use "popc".
>
>This leads to my questions:
>
> 1. Which popular ISAs support find-first-bit or pop count?

Cray (leading zero count, population count).
Intel x86 (find first bit starting from right or left).
CDC Cyber series (population count).
I think the Alliant also had these operations.

Of these, only x86 is `popular' today.

> 3. Can any common applications make use of it?

It occurs in the binary GCD algorithm (ulong = unsigned long):

ulong binary_gcd (ulong x, ulong y)
{
    if (x == 0 || y == 0) {
        return x | y;
    } else {
        ulong n1 = x >> trailing_zero_count(x);
        ulong n2 = y >> trailing_zero_count(y);
        while (n1 != n2) {              /* Both are odd */
            if (n1 > n2) {
                n1 -= n2;
                n1 >>= trailing_zero_count(n1);
            } else {
                n2 -= n1;
                n2 >>= trailing_zero_count(n2);
            }
        }
        return n1 << trailing_zero_count(x | y);
    }
}

Yes, we can replace n1 >>= trailing_zero_count(n1);
by

    do {n1 >>= 1;} while (!(n1 & 1));

since n1 is known to be even and nonzero at that point.
But we desire an algorithm with predictable branches.
The IF-ELSE can be replaced by a branchless sequence which first
finds MIN(x, y) and MAX(x, y), on architectures with
conditional moves. It's harder to optimize
trailing_zero_count without hardware support such as
the population count function.
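On a machine with conditional moves, one iteration of the inner loop might be sketched in C like this (gcd_step is a hypothetical helper name; the assumption is that the compiler turns the ternaries into conditional moves rather than branches):

```c
#include <assert.h>

typedef unsigned long ulong;

/* One branchless-style iteration of the binary GCD inner loop:
   compute MAX(n1, n2) and MIN(n1, n2) without a data-dependent
   branch, subtract, and strip trailing zeros by shifting.  On entry
   both values are odd and unequal, so the difference is even and
   nonzero and the shift loop terminates. */
static void gcd_step(ulong *n1, ulong *n2) {
    ulong a = *n1, b = *n2;
    ulong hi = a > b ? a : b;           /* likely a cmov, not a branch */
    ulong lo = a > b ? b : a;
    ulong d = hi - lo;
    do { d >>= 1; } while (!(d & 1));   /* trailing_zero_count by shifting */
    *n1 = d;
    *n2 = lo;
}
```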

In a linear algebra application over the field GF(2),
I need the dot product of two binary vectors v1 and v2.
It is population_count(v1 & v2) & 1.
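As a one-line C sketch (popcount64 stands in for the hardware instruction; the names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Portable popcount over a 64-bit word. */
static int popcount64(uint64_t x) {
    int c = 0;
    while (x) { x &= x - 1; c++; }
    return c;
}

/* Dot product over GF(2) of two 64-element binary vectors packed into
   words: AND selects the positions where both vectors have a 1,
   popcount sums them, and the low bit reduces the sum mod 2. */
static int gf2_dot(uint64_t v1, uint64_t v2) {
    return popcount64(v1 & v2) & 1;
}
```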

Using the binary method of exponentiation, computing
a power b**e takes truncated_base2_logarithm(e) squarings
and population_count(e)-1 multiplications, for integer e > 0.
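For instance, in C (the modulus is made up purely to keep intermediate values in range; the counters confirm the squaring and multiplication counts):

```c
#include <assert.h>
#include <stdint.h>

/* Left-to-right binary exponentiation of b^e mod 1e9+7, for e > 0.
   Squarings happen once per bit below the leading 1 bit, i.e.
   floor(log2(e)) times; multiplications happen once per set bit after
   the first, i.e. popcount(e) - 1 times. */
static uint64_t pow_count(uint64_t b, uint64_t e,
                          int *squarings, int *mults) {
    const uint64_t M = 1000000007u;
    *squarings = 0; *mults = 0;
    int top = 63;
    while (!(e >> top & 1)) top--;      /* position of the leading 1 bit */
    uint64_t r = b % M;
    for (int i = top - 1; i >= 0; i--) {
        r = r * r % M; (*squarings)++;
        if (e >> i & 1) { r = r * b % M; (*mults)++; }
    }
    return r;
}
```

E.g. e = 10 = 1010 in binary: 3 squarings, popcount(10) - 1 = 1 multiplication.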

In 1970, after I had taken a statistics course, and was
vending machine manager at my dormitory, I took a
survey in which everyone marked the products he liked.
Let V[v, p] = 1 if voter v likes product p,
and V[v, p] = 0 if voter v dislikes product p.
I computed sum_v V[v, p1] * V[v, p2] for all pairs (p1, p2),
as part of the formula for the correlation between p1 and p2.
A few weeks later, in an assembly language course, I learned
that the machine (a CDC 6400) lacked an integer multiply
instruction even though this was a primitive in the programming
language (Fortran) I had learned earlier.
Had I used binary AND rather than multiplication,
the program would have run faster (recall all values are 0 or 1).
This was in the days when CPU usage was expensive.
By the end of the term, I realized I could have
packed (for fixed p) all V[v, p] values into a few 60-bit words
(3 words since we had 150 voters). A few bitwise ANDs
and population counts would produce the sums I needed.
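The packed version might look like this in C (word size, word count, and helper names are mine; the original used three 60-bit CDC words for 150 voters):

```c
#include <assert.h>
#include <stdint.h>

#define NWORDS 3    /* three 64-bit words cover up to 192 voters */

static int popcount64(uint64_t x) {
    int c = 0;
    while (x) { x &= x - 1; c++; }
    return c;
}

/* With V[v][p] packed into bit vectors (one vector per product p,
   one bit per voter v), the co-liking count
   sum_v V[v][p1] * V[v][p2] is a few ANDs and population counts:
   AND replaces the 0/1 multiply, popcount replaces the summation. */
static int co_likes(const uint64_t p1[NWORDS], const uint64_t p2[NWORDS]) {
    int sum = 0;
    for (int i = 0; i < NWORDS; i++)
        sum += popcount64(p1[i] & p2[i]);
    return sum;
}
```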

The operations of population count, finding the leftmost
nonzero bit (i.e., base 2 logarithm), and finding the rightmost
nonzero bit (i.e., finding the exponent of 2 dividing an integer)
occur frequently enough in applications that programming languages
should have these primitives and compilers should inline them.
At this time, however, their performance is not critical
for my applications.

> 4. Are there clever hacks for fast find-first-bit? The pop count hack:

One can compute

temp = in & (-in);

Then temp = 0 if in = 0; otherwise temp is a power of 2.
If p > 32 (or p > 64) is a prime for which 2 is a primitive root,
then you can compute (temp % p) and look up the answer
using a memory reference.
Alas, integer division is slow. If the high half of the
integer product is available, you can get temp/p
using one multiplication and a few simple operations.
I recall that SPARC v9 has the high half of a 32 x 32
product, but not of a 64 x 64 product.
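A C sketch of the mod-p table lookup, taking p = 37 for 32-bit words (2 is a primitive root mod 37, so 2^k mod 37 is distinct for k = 0..31; the function and table names are mine, and the slow division is kept for clarity):

```c
#include <assert.h>
#include <stdint.h>

/* trailing_zero_count via the mod-prime trick: isolate the lowest set
   bit with in & (-in) -- a power of 2, or 0 -- then reduce mod 37 and
   look the answer up in a 37-entry table mapping 2^k mod 37 back to k.
   Residue 0 occurs only for in = 0, for which we return 32. */
static int tzc_mod37(uint32_t in) {
    static int table[37];
    static int built = 0;
    if (!built) {                       /* build the table once */
        table[0] = 32;
        for (int k = 0; k < 32; k++)
            table[(uint32_t)((1ull << k) % 37)] = k;
        built = 1;
    }
    return table[(in & (0u - in)) % 37];
}
```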

If you are willing to use floating point, try

float f = (float) (signed long) (in | (-in));

Now look at the exponent of f (which is the negative of a
power of 2 unless in = 0).
--
Peter L. Montgomery pmontgom@cwi.nl San Rafael, California
My sex is male. I am ineligible for an abortion.

From: glew@ichips.intel.com (Andy Glew)
Newsgroups: comp.arch
Subject: Re: Find first bit and population count
Date: 30 Jun 1996 07:05:33 GMT
Organization: Intel Corporation

From: d.sand@ix.netcom.com(Duane Sand)
Subject: Re: Find first bit and population count
Newsgroups: comp.arch
Date: 24 Jun 1996 18:19:48 GMT
Organization: Netcom

davids@ICSI.Berkeley.EDU (David Petrie Stoutamire) writes:
> 3. Can any common applications make use of it?

I believe the decryption spooks at the National Security Agency are
very fond of the population-count operation. It seems to be so useful
to them that every US-made govt-subsidized supercomputer has included
this instruction even though there is little other use for it.

Actually, speech recognition people (and probably any pattern
recognition people) have a use for POP count:

POP( A AND B ) = the inner product (dot product) of one-bit-precision
feature vectors
POP( A XOR B ) = Hamming distance metric


A preliminary stage in many pattern recognition algorithms is to
compute a quick distance metric between the input pattern, and some
subset of the reference patterns in the database. You can do this via
Hamming distance, or you can interpret it as a projection.
When projecting, you can arbitrarily tradeoff precision for speed
- i.e. sometimes you might do the dot product in 64 bit floating
point, sometimes in the 8 bit precision that Intel's MMX and similar
"multimedia" instruction sets provide. 1 bit precision is just the
extrapolation of this.

Thus, you can use POP count to take a quick check of many simple
reference vectors, afterwards zooming in on a more full precision
check.

Come to think of it, this is probably the same reason that the NSA
uses.
---
Andy "Krazy" Glew, glew@ichips.intel.com, Intel,
M/S JF3-206, 2111 NE 25th Ave., Hillsboro, OR 97124
Place URGENT in email subject line for mail filter prioritization.
DISCLAIMER: private posting, not representative of employer.

We're looking for a few good microarchitects for our research labs.
{{ VLIW: the leading edge of the last generation of computer architecture }}


From: preston@Tera.COM (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: Find first bit and population count
Date: 24 Jun 1996 09:20:25 -0700

davids@ICSI.Berkeley.EDU (David Petrie Stoutamire) writes:

> 1. Which popular ISAs support find-first-bit or pop count?

The Tera (certainly the most popular architecture around here) has
instructions for pop-count, find-first-1, and find-first-0 (from left
or right).

> 3. Can any common applications make use of it?

Pop count might be used for cardinality of sets represented as bit
vectors. We use find-leftmost-1 (along with bitwise-matrix-multiply)
to support some of the common C string manipulation functions. Lets
us do things like copy or compare NULL-terminated strings at a rate of
8 bytes per 3 instructions.

Preston Briggs

From: preston@Tera.COM (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: Find first bit and population count
Date: 1 Jul 1996 11:10:35 -0700

I wrote:
> We use find-leftmost-1 (along with bitwise-matrix-multiply)
> to support some of the common C string manipulation functions. Lets
> us do things like copy or compare NULL-terminated strings at a rate of
> 8 bytes per 3 instructions.

And Cliff Click wrote:
>Ok, I give. How does find-leftmost-1 help in C string functions?

Imagine we're doing the easy one -- strlen.
We want to do it in word-sized chunks, 8 bytes per load.
We load a word, then do a bitwise matrix multiply with -1,
then find-leftmost-0 to see if there are any 0 bits in the result.
If so, we've found the end of the string and the position of the bit,
divided by 8, should be added to the accumulated length.

Obviously the trickiness is in the bitwise matrix multiply. This
operation treats a pair of 64-bit words as 8x8 bit matrices and
multiplies them using the boolean operations of AND for multiplying
two bits and either OR or XOR for summing. For this application, we
need the OR variant (the XOR version is useful for arithmetic over GF(2)).
Imagine we have 8 bytes of a string that look like

abcdefg\0

organized as an 8x8 bit matrix, we'd have

01100001 11111111 11111111
01100010 11111111 11111111
01100011 11111111 11111111
01100100 x 11111111 = 11111111
01100101 11111111 11111111
01100110 11111111 11111111
01100111 11111111 11111111
00000000 11111111 00000000
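In portable C, the effect of multiplying by the all-ones matrix with the OR variant can be modeled one byte (row) at a time -- bmm_ones_or is my stand-in for the hardware operation:

```c
#include <assert.h>
#include <stdint.h>

/* Software model of the OR-variant bit-matrix-multiply by the
   all-ones 8x8 matrix: each result row (byte) is 11111111 iff the
   corresponding input byte has any bit set, and 00000000 iff the
   input byte is zero -- exactly the zero-byte map the strlen loop
   feeds to find-leftmost-0. */
static uint64_t bmm_ones_or(uint64_t w) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t byte = (w >> (i * 8)) & 0xFF;
        if (byte) r |= (uint64_t)0xFF << (i * 8);
    }
    return r;
}
```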

If you're counting cycles, recall that we issue 3 operations per tick,
and all operations require the same time. The operations used in the
inner loop are

load
inc
bit-mat-mul
find-leftmost-0
conditional branch

which we software pipeline into 3 instructions.

Preston Briggs

Newsgroups: comp.arch
Subject: Re: Find first bit and population count
From: Bruce@hoult.actrix.gen.nz (Bruce Hoult)
Date: Tue, 2 Jul 1996 20:22:42 +1200 (NZST)

cliffc@ami.sps.mot.com (Cliff Click) writes:
> preston@Tera.COM (Preston Briggs) writes:
>
> davids@ICSI.Berkeley.EDU (David Petrie Stoutamire) writes:
>
> > 1. Which popular ISAs support find-first-bit or pop count?
> > 3. Can any common applications make use of it?
>
> Pop count might be used for cardinality of sets represented as bit
> vectors. We use find-leftmost-1 (along with bitwise-matrix-multiply)
> to support some of the common C string manipulation functions. Lets
> us do things like copy or compare NULL-terminated strings at a rate of
> 8 bytes per 3 instructions.
>
> Preston Briggs
>
> Ok, I give. How does find-leftmost-1 help in C string functions?

The expression...

~n & (n - 0x01010101) & 0x80808080

... gives zero if and only if none of the bytes in the word are zero.
If the result is nonzero then the position of the first 1 tells you which
byte was zero.
You might want to disassemble your own company's "libmoto", where the C
string functions make use of this property. :-)
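A portable C sketch of a word-at-a-time strlen built on that expression, widened to 64 bits (all names are mine; a real libmoto-style routine would also handle pointer alignment, which this sketch sidesteps by assuming a buffer padded to a multiple of 8 bytes):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scan 8 bytes per load; the masked expression is nonzero exactly
   when the loaded word contains a zero byte.  A byte-by-byte loop
   then pins down the terminator's position, which also keeps the
   code independent of endianness. */
static size_t strlen_word(const char *s) {
    const uint64_t ones = 0x0101010101010101ull;
    const uint64_t high = 0x8080808080808080ull;
    size_t i = 0;
    for (;;) {
        uint64_t n;
        memcpy(&n, s + i, 8);            /* one 8-byte load */
        if (~n & (n - ones) & high) {    /* some byte was zero */
            while (s[i]) i++;
            return i;
        }
        i += 8;
    }
}
```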

-- Bruce

12 June 2011

Strategic Planning
  • Goal
  • Objective
  • Strategy
  • Plan
  • Execution

10 June 2011

don't generate 400 lines of IR for 5 trivial lines of source code. There's just no way LLVM could possibly optimize all of that away.

In particular, never, ever generate diamond-shaped control-flow. Don't annoy the compiler with lots of compensation code in the slow path, either. Just branch to a fallback routine. Don't assume LLVM is going to inline a branchy helper function and reduce it to nothing -- inline the fast path yourself.

The source language (or a properly designed bytecode) provides much more high-level information. You lose all of this by mindlessly lowering it into a low-level IR. LLVM would need this info for better alias-analysis. But since the info is already gone, it can't effectively weed out the forest of loads and stores.

Better forget about the C-centric features of LLVM, e.g. the standard calling conventions. And due to the nature of dynamic code, regular LICM doesn't work well, either. All those expensive optimizations in LLVM don't buy you much, since most of them won't work with dynamic code that includes tons of guards.

Basically you'd need to do a lot of optimizations yourself at a higher level, e.g. forwarding/sinking of stack slot accesses. Only pass the cleaned up IR to LLVM -- use it more like a dumb assembler. That begs the question, though, whether LLVM is the right tool for the job.

Mike Pall LLVM comment

A Rant About PHP Compilers and HipHop

08 June 2011

CloudFlare CDN + security

06 June 2011

Unladen Swallow Retrospective