Efficient bit-scanning in Lisp

In this post I’m going to describe something I found quite exciting. I am in the process of developing a modern chess engine in Common Lisp, and I am trying to implement the low-level functionality as efficiently as possible. In this case, for one of the most performance-critical aspects of the program, I managed to achieve performance comparable to C. I am always happy when I can make Lisp run as fast as C, and this experience has left me in awe of SBCL.

But first, some background on the problem I was trying to solve.

It is common for modern chess-playing programs to represent chess boards with bits. A chess board is 8 by 8, meaning it has 64 cells. This means, if you have one 64-bit unsigned integer per piece type (one for white pawns, etc), you can represent the whole board using 1′s and 0′s in a very compact and efficient way. This type of board representation is called a “bitboard.”

A huge bonus of bitboards is that there are bitwise operations you can do on boards to determine information really quickly. For example, to build a bitboard of all white pieces, you can and together the boards representing each white piece type. This can also be useful in evaluation: to determine if the white player has any pieces in the center of the board, for example, it can and the white bitboards with a pre-defined bitboard with 1′s in the center cells. If white has no pieces in that region, the operation will result in 0.

One of the big ideas in bitboards is to avoid loops whenever possible. If the program ends up looping over the bits in a number, a lot of the benefit of using bitboards is lost.

So, how would a program iterate over the pieces in a bitboard without looping over the bits? The answer is bit-scanning. It comes in two flavors: forward and backward bit-scanning. For this post I’m going to describe forward bit-scanning, but everything here applies equally to the reverse.

Essentially, forward bit-scanning returns the index of the least significant 1 bit in a number. This is what you want if you want to iterate through only the 1 bits in a number. This happens a lot in many performance-critical places in a chess program, so it’s important that it be as fast as possible. Luckily, there are several techniques for optimizing it (including using a special hardware instruction on Intel CPUs… more on that later).

For the most popular general approach, you need to be aware of something called de Bruijn sequences. This can be used to construct a hash function that determines the least significant 1 bit of a number.

A 64-bit de Bruijn sequence contains 64 overlapped 6-bit sequences, forming a circle of 64 bits, where five leading zeros overlap five hidden trailing zeros. The 1 bit from the bitboard can be isolated through the intersection of the number and its two’s complement. From there, multiplying the 64-bit de Bruijn sequence with the isolated number yields a unique 6-bit sub-sequence. Those six bits can be extracted and used as the hash key to look up the index of the least significant 1 bit in a pre-computed table.

The following code is from the Chess Programming Wiki:

The math behind it is complicated and confusing, but the implementation is simple and fast. What might have been a loop through each bit in the number is now a few instructions and a single memory reference. That’s a big improvement, certainly.

Now, this kind of thing is C’s bread and butter. The C programming language is great at low-level bit manipulation tasks like this. Lisp, on the other hand, is not usually thought of as a language good at low-level tasks.

I briefly considered using bit vectors for bitboards in Lisp, and just accepting the performance penalty of looping. I knew that SBCL could easily do unboxed arithmetic on 64-bit integers, however, so I gave that a shot.

The first thing to note about the above C code, if you didn’t notice, is that the code alternately treats the same value as signed and unsigned. In C, negation on an unsigned number works that way, which is what the algorithm calls for.

I could have come up with a different way of doing the same operation, but I wanted to mimic the above C code as closely as I could, so I specified all the types in my function such that it switches between treating the number as unsigned and signed as necessary.

I started with the bitscan-slow function, and I added optimizations incrementally, checking the generated code with (disassemble #'bitscan) after each change. SBCL helpfully highlights places in the disassembled code where it is calling generic functions for arithmetic (like negation). To ensure that it knew the exact type of the number at each step, I ended up specifying the type using the for each sub-expression. This allowed me to control its representation exactly, even switching between unsigned to signed and back again. The generated code looks like this:

You can trace through the assembly and match all the actions back to the lisp code. SBCL did exactly what I told it to do. The assembly code in this case is almost more compact than the Lisp code above, but the verbosity of the Lisp could easily be abstracted away with macros. The important point is that I was able to make Lisp yield this compact, efficient machine code. No jumps, function calls, memory allocations, or anything like that.

But I can do better.

Somewhere near the beginning of this post I mentioned that Intel CPUs have a special instruction for bit-scanning. Actually, they have two. Bit-scan forward (bsf) and bit-scan reverse (bsr).

In a language like C it’s relatively easy to include inline assembly code, though it’s dangerous. Most C compilers provide some kind of “intrinsic” support for instructions like bsf and bsr, however, and SBCL does not.

Luckily, SBCL allows you to define your own “virtual operators” that let you teach SBCL’s compiler new tricks by specifying assembly code in s-expression form. This has huge implications beyond this blog post (you can have macros generate assembly language, for example). I just want a way to invoke the Intel bsf instruction. Turns out it was pretty easy, though there’s a noticeable lack of documentation on writing VOPs.

The function definition at the bottom seems paradoxical, but it’s just to “wrap” the operator in a function. I won’t go into the details of defknown or define-vop, but I will show you the assembly code generated for my new bitscan function:

That’s more like it!

I briefly compared the performance of the Lisp code and the C code. I compared the de Bruijn versions, because using the programs using the built-in instruction were essentially identical in both cases.

  • GCC with -O0: 1250 ms
  • Ordinary GCC: 500 ms
  • SBCL: 490 ms
  • GCC with -O3: 150 ms

The results were essentially what I expected: compiled with no extra flags with GCC, the C code was slightly slower than the Lisp code. When compiled with -O3, the C code was faster.

To even detect these differences, however, I had to run the function thousands of times.

I chose to use the bsr instruction in my chess program, and I came away from this experience very impressed with SBCL as a compiler and platform: not only was it able to produce machine code similar to that of GCC for the de Bruijn function, it was extendable enough to allow me to use a custom instruction freely and naturally in my code.

5 thoughts on “Efficient bit-scanning in Lisp”

  1. Using declarations to lie to SBCL is a really bad idea; if the compiler is just clever enough, you might end up seeing your code “optimised” into nothingness.

    In addition to good support for machine integer arithmetic, SBCL is also able to identify overflowing arithmetic operations (like C implicitly does for unsigned integers), in portable CL code. There is currently no such cleverness for arithmetic negation in the master branch, but I intend to merge the 16 LOC to do so in the next release. The result is:

    CL-USER> (defun bitscan2 (bb)
    (declare (optimize speed)
    (type (unsigned-byte 64) bb))
    (let ((ashbb (ash (ldb (byte 64 0)
    (* (logand bb (- bb))
    debruijn64))
    -58)))
    (aref index64 ashbb)))
    BITSCAN2
    CL-USER> (disassemble *)
    ; disassembly for BITSCAN2
    ; Size: 62 bytes
    ; 0874B0AD: 488BC2 MOV RAX, RDX ; no-arg-parsing entry point
    ; B0: 48F7D8 NEG RAX
    ; B3: 4821C2 AND RDX, RAX
    ; B6: 48BBC2284E9AE5D5ED07 MOV RBX, 571347909858961602
    ; C0: 488BC2 MOV RAX, RDX
    ; C3: 48F7E3 MUL RAX, RBX
    ; C6: 488BC8 MOV RCX, RAX
    ; C9: 48C1E93A SHR RCX, 58
    ; CD: 48D1E1 SHL RCX, 1
    ; D0: 488B0529FFFFFF MOV RAX, [RIP-215] ; #(63 0 58
    ; 1 …)
    ; D7: 0FB7540801 MOVZX EDX, WORD PTR [RAX+RCX+1]
    ; DC: 48D1E2 SHL RDX, 1
    ; DF: 488BE5 MOV RSP, RBP
    ; E2: F8 CLC
    ; E3: 5D POP RBP
    ; E4: C3 RET
    ; E5: 0F0B0A BREAK 10 ; error trap
    ; E8: 02 BYTE #X02
    ; E9: 19 BYTE #X19 ; INVALID-ARG-COUNT-ERROR
    ; EA: 9A BYTE #X9A ; RCX

    The CL source is fully portable: it will output the same result on any conforming implementation. It does so by explicit stating what is implicit in the C code: we’re only interested in the lower 64 bits of the multiplication. FWIW, without the (LDB …) wrapped around it, the negation would still be performed in modular arithmetic (its result is immediately ANDed with a 64 bit value).

    In portable CL again, integer-length is a slightly souped up BSR.

    We try to make sure SBCL enables the development of safe, mostly portable and performant code. If you find yourself doing questionable hacks involving declarations and (safety 0), I would love to hear about your problems on the mailing list. It may be possible to rewrite the code into something more palatable; failing that, the request might provide the drive to improve things, particularly low-hanging fruits.

    1. Thank you for the comment!

      I was trying to avoid the need for calling GENERIC-NEGATE, which happens in my SBCL version when I compile your function. Sounds like your patch for the next release fixes this. I generally try to avoid (safety 0), as I’ve had it bite me before, so I can see why you would tell me it’s a bad idea.

      And thanks for the integer-length recommendation. I didn’t know about that function.

      You made me wish I had gone to the mailing list before I started investigating this. I think you could have saved me a lot of effort!

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">