Skip to content

PivCo-Huffman “merge” operations

There’s a new paper out called “PivCo-Huffman” (HTML version with annotations here) and it’s very interesting. Normal Huffman decoding (and, to a lesser extent, encoding) is inherently quite serial. We can get explicit parallelism by using multiple streams, which scales just fine to moderate numbers of streams – something like 4-8 is usually not an issue. Not very suitable for vectorization or wide vector machines like GPUs, though: every extra stream adds signaling overhead in the bitstream, and decoding from many streams at once is a gather-heavy operation. GPUs can cope with gathers okay (although spread-out data accesses still consume a lot more cycles than more compact memory access patterns do), most everything else is very unhappy with extremely gather-heavy patterns.

Especially since our usual table-based decoders will then immediately follow up with another gather, although that part is more easily fixed: canonical codes admit reasonably easy table-less decoding (at least for the critical “length determination” part), if desired. These approaches are not particularly juicy in a scalar one-at-a-time decoder, where a single load is very cheap, but if you’re working on 32-wide vectors, doing a bunch of integer math that saves you a gather is a more interesting proposition.

You can also use ANS-style interleaving to turn many logical bitstreams into a single physical bitstream. This is what GDeflate does to get a bitstream that has more potential parallelism without gathering from all over the place. Interleaving solves the “reads all over memory” problem and works great on GPUs and CPUs that provide suitable facilities (e.g. AVX-512 “EXPAND” family instructions, or just fast gathers from the same cache line), but is slow without that kind of hardware support. It also forces you to pick a magic number (the interleave factor) as part of the bitstream definition. Too low, and wide vector machines like GPUs get bad utilization when decoding. Too high, and scalar machines can’t efficiently decode the bitstream at all. Worse, numbers that are just right for say AVX-512 or GPUs overlap, but the smallest number of interleaved streams most GPUs or even CPUs with something like AVX-512 can get decent utilization with is around 32, the largest number scalar or narrower vector machines with typical register counts can comfortably decode is around 8-16 (depending on how exactly you do it), and in the middle between 16 and 32 is a valley of tears where nobody’s happy.

That’s why you don’t see much of this style of bitstream outside of GDeflate; it is inherently reliant on you picking a magic number that has a huge influence on every single implementation, yet at the same time is subject to ever-shifting hardware details. Yes, NVidia GPUs have been built around warps with 32 “threads” (more like SIMD lanes) using 32 bits per register each for a good while, but AMD used to prefer 64 (before switching to mostly-32 a few generations ago), Intel GPUs were designed around 8, some mobile GPUs used to do as little as 4, and 4 is also the number of 32-bit lanes you get for something like ARM NEON or SSE4.2 on the x86 side, later widened to 8 with AVX2. Finally, ARM SVE and the RISC-V V extension leave it implementation defined, which is one thing when the model is that you’re running a simple vector math kernel like a SAXPY, but quite another when the physical vector width ends up being a magic constant that is baked into some persistent wire format, potentially for decades.

The final major popular solution to the “this is very serial” conundrum is to completely brute-force it. Fetch a bunch of bits from the input. In parallel, start decoding at bit 0, and at a bit 1, at bit 2, 3, and so forth. Do this for a lot of sequential candidate positions. I’m talking something like 16 or more.

Eventually, we discover that the code starting at bit 0 was 6 bits long. Cool! That means we discard all the work we did at bit offsets 1 through 6, and look at what our already pre-decoded code starting at bit 6 was. That one is 5 bits long, so we discard the speculative work we did for bits 7 through 10, and find that starting at bit 11, there’s another code that’s 5 bits long. Congratulations, we’ve just decoded 3 values for the price of 16.

This works just fine. Believe it or not, this is actually the preferred method of decoding such a sequential bitstream in hardware if a throughput of more than one value per cycle is required. Especially with canonical codes and not-too-large length limits, this can be made reasonably cheap. Chasing down the sequence of actual values to emit at the end is a potential scalability problem, but 3 or 4 at a time isn’t too bad, and if that’s not enough, there’s better algorithms to chase these links that give yet more parallelism. In short, this can be done, and if you build dedicated hardware for it, it’s not bad, but it’s not something you ever want to do in SW on general-purpose hardware. I’ve implemented it out of curiosity, and even on a GPU, it turns out routinely throwing away more than 80% of the work you do is not a recipe for fast code, and let’s not even talk about wasted energy when you try to pull this kind of stunt on a power-constrained battery-powered device.

So far, that’s been our main options: we can decode sequentially one at a time, but that doesn’t scale. We can use multiple separate bitstreams, which theoretically gives us something like thread-level or perhaps instruction-level parallelism, but that amount of parallelism is baked into the bitstream and adds some amount of overhead (since each decoder threads needs to know where to start). It’s also not friendly to SIMD-style implementations because the divergent memory access patterns make us sad. We can use interleaved bitstreams which have much better memory access locality and are a good match to SIMD and GPU architectures in principle, but need us to pick a magic number at bitstream specification time that will forever limit the amount of decoder parallelism we can get, but also ruin our day if the number we pick is high enough to make some of our target HW run out of registers. And even within the same class of device (say, GPUs), there’s not anything resembling agreement on what that number ought to be. Finally, we can shrug, be extremely wasteful, and discard most of the work we did immediately. It’s parallelism all right, but it’s hardly work-efficient.

Enter PivCo

Marcin Żukowski’s “Pivco-Huffman” turns this on its side. Literally. I’ll go through this rather quickly to get to the meat of this post, refer to the paper for more details. For concreteness, suppose we want to encode the string “abracadabra”. Building the standard Huffman tree (it’s a textbook algorithm that I’m not going to explain here) yields something like this picture:

Image
Huffman tree for “abracadabra”

Ordinarily, to code any symbol, you start at the root and send the labels of the edges you need to take to get to the appropriate leaf of the tree. The “a” encodes as just “0” in this tree, “b” encodes as “100”, r as “101” and so forth. We send each symbol fully before moving on to the next one.

PivCo-Huffman instead conceptually processes the entire string at once, pushing it slowly down the tree. Starting at the root node, we see that every position that contains an “a” sends a “0”, and everything else sends a “1”. Lining the string up, we can see what bits are output for the root node:

  abracadabra
  01101010110

These bits go straight into our output bitstream. The left subtree of the root node is just the “a” leaf, so every position that sent a 0 is effectively done. But the positions that weren’t “a” aren’t done yet. Lets work out what those are, and what the next step they need to take is:

  brcdbr
  001100

These are the bits output for the right child of the root node. In general, at every internal node, we partition the symbols that got this far into two smaller subsequences: one that descends into the left subtree and one that descends into the right subtree. In our running example, the subsequence that continues left is “brbr”, while the right subsequence is just “cd”. We proceed our preorder tree traversal this way until we’ve pushed everything down to the leaves.

Note this is sending the exact same bits regular Huffman coding using this tree would, just in a different order.

To decode this, we need to know how many symbols are encoded in our bitstream, let’s say N. Every symbol must have passed through the root node, so we know the bit string for the root node has N bits in it. We can count how many 0s and 1s there are in that bit string, which tells us how many symbols make it into the left (0) and right (1) subtrees. Then, recursively, we can do the same thing with these nodes. This lets us locate where the individual nodes’ bits end up in the final bitstream, and how many symbols pass through them.

Finally, on the way back up the tree, now that we have the lengths of the subsequences, we can reconstruct what they might have been in a bottom-up merge process.

Take the parent node for the “b” and “r” leaves, for example. We know that 4 symbols “pass through there”, and the corresponding bitstream turns out to be “0101”. This means that the subsequence of symbols coded in this subtree is “brbr”. Its sibling node just has two bits, “01”, which encodes “cd”.

Their common parent is the 6-symbol node emitting the “001100” bit string that we looked at earlier. We just worked out that the 4 symbols coded by its left (“0”) subtree are “brbr”, and the 2 symbols coded by the right (“1”) subtree are “cd”. That means the symbols coded by the entire node are “brcdbr”. Finally, we keep pushing this up one more level to merge in the a’s, and we get “abracadabra” back.

Fine, but why would we ever do this? This just seems like regular Huffman coding, but in batch-processing mode and with extra steps.

The answer is that we’ve transformed Huffman encoding into a sequence of in-order list partitioning steps, and decoding into a sequence of list merges where instead of doing comparisons, we get a bitstream telling us which of the two lists to take the next element from. Regular Huffman encoding isn’t so bad, but decoding it is, as discussed above, an extremely serial process. This type of list merging is not: figuring out which list elements to grab reduces to a prefix sum, which while not trivial, is the prototype for a class of parallel algorithm (scans) that we can do quite well, and it doesn’t need the bitstream to be written with a particular vector width (or similar) in mind. It easily scales both up and down with the capabilities of the target machine.

There are many other possible optimizations: for example, the tree I showed contains a complete binary tree for the 4 symbols “b”, “r”, “c” and “d”. Sure, you could encode (or decode) this as a sequence of 3 binary list partitions (or merges, on the decoder side), but you could also just send 2 bits per symbol at once and be done with it. Variable-length-to-fixed-alphabet decoding (like Huffman does) is tricky, but a subset that is fixed-length-to-fixed-alphabet is trivially parallelized. If we don’t want to scan over all of these lists to figure out how many “1”s there are, we can store this information for some or all of the Huffman tree nodes, spending extra bits, but saving the decoder having to scan over (potentially) tens of thousands of bits. And so forth.

So while the baseline scalar implementation of this idea is not very interesting, it sits squarely in the subset of tasks that can be mapped well to essentially any data-parallel substrate, be it GPUs, vector instruction sets of any stripe, or even just threads, which makes it worth playing with. And certainly the numbers attained in the paper are encouraging, with the major caveat that the main results are, not surprisingly, on either x86-64 servers with AVX-512 or Apple Silicon CPUs – i.e. very high-end hardware.

This bodes well for the future of the algorithm, but wire formats need to work for everyone, not just the highest-end targets.

Hence the question: how cheap can we get the basic primitives? In this post, I’ll be looking at the “merge” operations in the bottom-up decoder specifically.

The merge operation

Here’s what one of those merges boils down to, in Python-like pseudocode:

  def merge(l_list, r_list, bitstream):
      output = []
      l_pos, r_pos = 0, 0
      for bit in bitstream:
          if bit == 0:
              output.append(l_list[l_pos])
              l_pos += 1
          else:
              output.append(r_list[r_pos])
              r_pos += 1
      return output

That’s all it is. That’s what we’re working with. Note that this algorithm has the invariant that l_pos + r_pos equals the number of elements processed (and output elements written) so far, so really, tracking both of them is overkill. Either implies the other. And r_pos is the (exclusive) prefix sum of all bit values encountered so far.

As noted above, this essentially reduces to a mid-semester homework problem in parallel programming 101. It’s a very simple algorithm, which is good news for us because we really need something fast, and when something is this straightforward, there’s usually many different ways of doing it, and we can pick whatever works best on our current target. But we’ll need something a bit lower-level than our quasi-Python for the remainder of this. In particular, we need to pick some actual data types. In the following, I’ll be assuming that the symbols we’re trying to code are bytes, which is both practically relevant and consistent with my earlier posts on the subject. Our bitstream will be bit-packed in the obvious way.

The most direct match exists with something like AVX512_VBMI2, which offers VPEXPANDB, giving us a type of byte-level masked load operation where active SIMD lanes (where the corresponding mask bit is 1) load from sequential addresses and inactive lanes are either left alone or zero-initialized, without the load address increasing. It’s intended to load sparse data into a vector register and is essentially a perfect match for what we need. I’ll not use the official intrinsic names in the following because the notation is rather intimidating if you’re not used to it, but the process looks something like this, switching from Pseudo-Python to Pseudo-C:

  void merge(uint8* output,
      const uint8* l_list, const uint8* r_list,
      const uint8* bits, usize count)
  {
      // None of these code snippets will handle boundary cases like the
      // last few elements, nor will they do any bounds checking.
      // A real implementation absolutely needs to do this, but that's for
      // real code, not pseudocode in a blog post. You've been warned.

      // AVX-512 can process 64 bytes per iteration!
      for (usize i = 0; i < count; i += 64)
      {
          // Mask has bits set for elements coming from r_list
          // (LE order is natural here)
          uint64 mask = read64LE(&bits[i / 8]);
        
          // Read l_list entries
          // leave rest zero-initialized for now
          uint8x64 result = zero_masked_vexpandb_from_mem(l_list, ~mask);

          // Read r_list entries
          result = masked_vexpandb_from_mem(result, mask, r_list);

          // Store the result to the output buffer
          vstore(&output[i], result);

          // Determine population count (number of 1 bits) in mask to
          // figure out how much to advance the r_list and l_list pointers by
          uint64 npop = popcount(mask);
          r_list += npop;
          l_list += 64 - npop;
      }
  }

There’s a bit of futzing around to load the masks and advance the pointers, but this is pseudocode for a functional AVX512_VBMI2 implementation of the basic merge algorithm. Not necessarily optimal, but I’m not going to get that far into the weeds in this post. It has three actual vector instructions – two loads and one store, the rest is logistics.

There’s not much of a lesson here beyond “this sure is easy when your machine has a ‘do the thing’ instruction”. That’s why I won’t be spending any more words on AVX512 versions of anything in this post; it’s just our high anchor to prove that yes, with a restrictive enough hardware target, this turns into basically nothing.

The one thing I’ll say is that the actual prefix summing turned almost invisible in this version. VPEXPANDB implicitly does a kind of prefix sum over the active lanes to figure out which bytes to route where, and then at the end we do a population count over all 64 mask bits we handled in one go to figure out how much to move the list pointers by. This is going to be how things work.

In essence, all I’ll be doing in the rest of this post is figure out ways to get as close as possible to the loop above without requiring AVX512_VBMI2. On the x86 side, that means we probably want to have some option around AVX2 (256-bit integer vectors) and another around the SSE4.2 feature level. We can do something efficient for slightly lower-end targets as well (SSSE3 is about as low as we can realistically go here), but these are minor variations. On the ARM side, I’ll mostly look at NEON/AdvSIMD; I’ll leave SVE and RVV to people with access to relevant hardware.

The basic approach: 8 bytes at a time

SSE4.2 and NEON-level vector registers are much narrower than AVX512s name-giving 512 bits. For now, for our lower-end targets, we’re stuck with registers fitting 128 bits (or 16 bytes) at a time. We’ll see how to make use of that full width later, but for now, let’s keep things straightforward so we can familiarize ourselves with our tools.

Our main workhorses will be PSHUFB (introduced with SSSE3) on the SSE side and TBL on the ARM side (I’m only looking at 64-bit ARM here, the relevant instructions also exist in 32-bit ARM, although only producing 8 bytes of output at a time, but I’m working mainly on 64-bit code these days). TBL gets up to four (consecutive) source registers that are interpreted as a table of bytes, another source register that specifies a byte vector of indices into said table, and writes the results of all those table lookups into the destination register. Which is to say, it’s a generic any-byte-to-any-byte shuffle operation. Out-of-range index values produce the number 0.

PSHUFB in x86 is quite similar, with a few differences: first, the “table” is always a single 128-bit vector register (16 bytes). No more, no less. Second, only the bottom 4 bits select the table index; bits 4 through 6 are ignored. Functionally, that means table addressing is mod 16. Third, setting the MSB (bit 7) in the index vector does result in a 0 byte being returned in that SIMD lane.

These properties mean that one-128b-vector TBL in AArch64 and PSHUFB in x86-64 are mostly interchangeable, as long as we avoid indices between 16 and 127 (inclusive) where their behavior differs.

If we process 8 bytes at a time, that also means we process 8 bits from the bitstream at a time, i.e. a single byte’s worth, which is convenient. 8 bits means 256 different combinations, which is small enough that we can reasonably precompute a shuffle index vector for all possible combinations and just store them in a table. That table will be 256*8 bytes = 2KiB. And we can speculatively grab the next 8 bytes from both l_list and r_list, combine them into a single 16-byte (128-bit) vector, and use TBL/PSHUFB to index into it. This results in something like this algorithm:

  void merge(uint8* output,
      const uint8* l_list, const uint8* r_list,
      const uint8* bits, usize count)
  {
      // 8 bytes per iteration now
      for (usize i = 0; i < count; i += 8)
      {
          // Load 1 byte's worth of bits from bitstream
          uint8 in_bits = bits[i / 8];

          // Look up the corresponding index vector
          uint8x16 index_vec = index_vec_table[in_bits];

          // Speculatively, read 8 bytes from both lists
          uint8x8 l_bytes = vload64(l_list);
          uint8x8 r_bytes = vload64(r_list);

          // Concatenate them into a single 128-bit vector
          uint8x16 lr_bytes = vcombine(l_bytes, r_bytes);

          // Shuffle/table lookup
          uint8x16 result = vshuffle(lr_bytes, index_vec);

          // Store the result to the output buffer
          vstore(&output[i], result);

          // Determine population count (number of 1 bits) in
          // mask to figure out how much to advance the r_list
          // and l_list pointers by
          usize npop = popcount(in_bits);
          r_list += npop;
          l_list += 8 - npop;
      }
  }

I’m not showing the code to generate the index vector table, but all you have to do is prepare an 8-entry L list containing the values 0 through 7, a corresponding 8-entry R list containing the values 8 through 15, and then use the scalar merge algorithm with all 256 possible 8-bit patterns. The output lists are the index vectors to use to control the shuffle instructions. This basic approach will work for all table-based techniques in this post; I won’t be calling it out explicitly every time.

Also, we might not have a 8-bit popcount instruction available, but a small 256-entry table with byte values will do in that case. Another 256 bytes added to our 2KiB table budget won’t break the bank.

This is straightforward enough, and there is no immediately obvious way to improve it further. Our 256-entry 2KiB table of precomputed index vector is fine. The next larger step up that divides neatly into the operation sizes we have available is to work 16 bits at a time. That would require a 65536-entry table with 16B per table entry (since we’re now producing 16 output bytes), for 1MiB worth of tables. A 2KiB table comfortably fits in any relevant CPU’s L1D cache; 1MiB won’t fit in anyone’s L1, and if every table access is a L1D miss (possibly L2 miss as well), this loop’s performance will take a sharp nosedive.

In short, a single table lookup to retrieve our index vector is completely practical for 8 items at a time, but if we want to get up to 16, we need something different.

Towards 16 bytes at a time

We have another problem: 64-bit ARM has 2-register TBL, which can index into a table of 32 bytes in one instruction, and 2-register TBL executes as a single uop on many (though not all) implementations, so this is not only available, but also efficient. PSHUFB on x86, however, really only exists as a 16-byte version. (AVX2 VPSHUFB does two parallel 128-bit PSHUFBs; table entries are always fetched from within the same 128-bit slice of the vector the corresponding index value lives in.)

We can synthesize wider PSHUFBs out of regular ones using extra instructions (multiple PSHUFBs into smaller sub-tables and then a branchless select to choose the right sub-table), but this is relatively expensive. It’s much better to just handle the L and R list bytes separately: while the combined L and R list candidate bytes sum up to 32, we can consume at most 16 bytes from either list. And PSHUFB lets us generate 0 bytes if we set the corresponding index MSB. That means we probably want to set up into something like

    uint8x16 l_result = vshuffle(l_bytes, l_index_vec);
    uint8x16 r_result = vshuffle(r_bytes, r_index_vec);
    uint8x16 result = vor(l_result, r_result);

But now we need to come up with two index vectors. Or do we? If we set the MSB of the index vector, it doesn’t matter what the other bits are, PSHUFB (and TBL) will always return 0. And the items coming from the L list are always exactly in the spots where the R list items are not. Therefore, we can use the same index vector for both. We choose one list (in the following, that will be the R list) to use regular shuffle indices, while the other will use shuffle indices with their MSB set. After we use that index vector for one list, we can flip all the index MSBs via exclusive-OR, which gets us the index vector to use for the other list. That yields

    uint8x16 r_result = vshuffle(r_bytes, index_vec);
    uint8x16 l_result = vshuffle(l_bytes, vxor(index_vec, 128));
    uint8x16 result = vor(r_result, l_result);

Cool. A single bitwise operation is much better than having to come up with two separate index vectors. Speaking of which, we now have a plausible way of generating 16 output bytes at a time from a suitable index vector, but where do we get that index vector from in the first place?

For the first half of the index vector, we can use the same kind of table we already have. And, in theory, we could use the same table again for the second half, with just one problem: the second half doesn’t start at the beginning of l_bytes or r_bytes, it has to start right where the first half left off.

That means that if the first byte had say 3 set bits, the index vector for the second half needs to add 3 to every index into r_bytes, and 8 – 3 = 5 to every index into l_bytes. A straightforward implementation of this idea would need to:

  • do a table lookup for the first half of the mask (8 lanes worth)
  • do another table lookup for the second half of the mask (another 8 lanes worth)
  • determine the population count for the bitstream byte corresponding to the first 8 vector lanes (probably some ANDing or shifting to isolate that byte, then a table lookup or native popcount instruction where available)
  • get this population count over to a vector register and create at least 8 copies of it
  • for the second-half-of-vector index vector, determine a bit mask saying whether a given byte comes from the L or R half (we can tell from the index MSB)
  • add the replicated population count to all the “R” lane indices for the second half (this will usually by a bitwise AND followed by an ADD)
  • calculate 8-vector_popcount
  • add 8-vector_popcount to all the “L” lanes indices for the second half (AND-NOT and ADD, or similar)
  • finally, combine the two index vector halves

That’s a lot of work. Working 16 bytes at a time is clearly worth something, but as written this is at least 8 instruction equivalents more than processing the data in 8-byte chunks. Sure, we save on some pointer manipulation and such, but on the vector side, this looks pretty dire. We need something better.

We could tune the above a bit to improve it slightly, but saving an instruction here and there is not going to cut it. What we really need is a radical simplification.

Here’s the elevator pitch: we need to add that popcount (which I’ll call pop0) to the R indices, and 8 - pop0 to the L indices, and it is that sign difference that really causes problems for us. What if the L indices were negated, though? Then we would have to add -(8 - pop0) to them, which is pop0 - 8, which means we would be adding popcount to every lane (the -8 bias portion, we’ll get to in a minute).

Negating the L values doesn’t quite work, because the L index can be 0, and we need to be able to tell from the MSB whether a lane comes from the R or L list to make our “final shuffle” logic work out. But what we can do is agree on the convention that a byte that is supposed to come from r_list[i] stores i in its location in the index vector, and bytes from l_list[i] store 255 - i = i ^ 255 as an index.

For the shuffle indices in the second half, we still need to add pop0 to the R indices to account for what happened in the first half, which is straightforward. For the L indices, we need 255 - (i + (8 - pop0)) = (255 - 8) - i + pop0.

We still expect to get the shuffle indices for the second half from a table. If we make a separate table for that second half where the precomputed L indices are (255 - 8) - i not 255 - i (R remains the same), this doesn’t cost us any extra computation in the loop. And finally, note that pop0 only depends on the first mask byte that we already use for the first-half table lookup. If we make those table entries store 16 bytes not 8, we can just store 8 copies of pop0 in the second half of every table entry. Put it all together, and we end up with this pseudocode for our 16-bytes-at-a-time loop:

  void merge(uint8* output,
      const uint8* l_list, const uint8* r_list,
      const uint8* bits, usize count)
  {
      for (usize i = 0; i < count; i += 16)
      {
          // Load 2 byte's worth of bits from the bitstream
          uint16 in_bits = read16LE(&bits[i / 8]);

          // Two table lookups
          uint8x16 index0 = index_vec_tab0[in_bits & 0xff];
          uint8x8 index1 = index_vec_tab1[in_bits >> 8];

          // index1 needs to go into the second half of the 128-bit vector
          uint8x16 index1_wide = vcombine(vzero_uint8x8(), index1);

          // Add the two values together to perform the pop count adjustment
          uint8x16 index_vec = vadd(index0, index1_wide);

          // Speculatively, read the next 16 bytes from both l_list and r_list
          uint8x16 l_bytes = vload128(l_list);
          uint8x16 r_bytes = vload128(r_list);

          // Shuffle bytes to combine
          uint8x16 r_result = vshuffle(r_bytes, index_vec);
          uint8x16 l_result = vshuffle(l_bytes, vnot(index_vec));
          uint8x16 result = vor(r_result, l_result);        

          // Store the result to the output buffer
          vstore(&output[i], result);

          // Determine population count (number of 1 bits) in mask to
          // figure out how much to advance the r_list and l_list pointers by
          usize npop = popcount(in_bits);
          r_list += npop;
          l_list += 16 - npop;
      }
  }

As mentioned in the text, index_vec_tab0 has 256 16-byte entries, where the first 8 bytes of every table entry are the index permutation from doing a scalar merge of r_list = { 0, 1, ... , 7 } and l_list = { 255 - 0, 255 - 1, ..., 255 - 7 }. The second 8 bytes are just the population count of the table index repeated 8 times.

index_vec_tab1 as accessed in this code has 256 8-byte entries corresponding to the scalar merge results with the same r_list as before and a modified l_list = { 247 - 0, 247 - 1, ..., 247 - 7 }. So this version needs around 6KiB of tables, more than before but still a reasonably comfortable amount.

If you don’t want that “combine” operation (which usually itself turns into some kind of vector shuffle internal to the CPU), you can also make the second table have 16-byte entries where every entry has 8 bytes of zeros followed by the actual shuffle indices (now properly aligned in the second half). That means you save some shuffling but add another 2KiB of table size, and, it turns out, some extra uops from address computation. Both x86 and AArch64 implementations can usually multiply indices by 8 “for free” as part of load address generation, but not by 16, so this doesn’t normally save an instruction, it just trades a vector shuffle/shift operation for an address generation operation. In practice they’re usually close, but the smaller load tends to do a bit better, and between that and the 2KiB smaller table size, it’s the variant I prefer.

For a SSSE3/SSE4.2 target, this general approach is the best I’ve managed to come up so far. The exactly implementation details get a bit tweaky, and you might want to unroll some more to save on loop overhead, but this is the gist of it.

AVX2 still does, basically, this – now just on two 128-bit vectors at once. Due to the way AVX2 works (crossing 128-bit boundaries is hard and relatively expensive), this is your best bet. In effect, it’s functionally mostly an unrolled version of the 128-bit approach. Building the index vectors is really a 128-bit vector operation (arguably, it’s really a 64-bit vector operation and we’re already combining two 64-bit halves into a 128-bit vector in the computation above, although it may not look like it), and in AVX2 we do all that work twice, then pay an extra shuffle to merge the two 128-bit halves into a 256-bit register.

Likewise, we end up having to load l_list and r_list as two separate 128-bit halves, which is also no real win over working with 128-bit vectors. The only real 256-bit vector operations are in the “shuffle bytes to combine” and “store” stage. As a result, AVX2 is a bit faster than SSE4.2, but not by anything like 2x. It’s closer to 1.2x for me, and even that much took a lot of care.

I’ve looked into other approaches for AVX2 that don’t use any big tables at all and compute the shuffle vectors instead. These approach are more “naturally 256-bit” if that makes sense, and they do have better vector width utilization, but doing it this way also takes several extra instructions, and while I managed to get fairly close to the performance of the table-based approach, I never managed to beat it. I might write it up at some point.

Ironically, I am positive that approach would easily come out ahead once you step up to 512-bit vectors, but there’s just not much reason to. With AVX512_VBMI2, we have a full hardware solution that we’re never going to beat in software, and “AVX512 but not VBMI2” is a relatively small niche at this point of almost exclusively server CPUs that are 6+ years old and tend to drastically down-clock when 512-bit vector code is used anyway. Maybe some day? But I haven’t spent much though on it.

The NEON special

The algorithm as written above works just fine on NEON, and its performance is good. There’s nothing wrong with it.

It’s just that it happens we can do a bit better on machines where 2-table 128-bit TBL is fast (which includes Apple Silicon and newer high-end ARM Cortex/Neoverse cores). Our load-shuffle-and-store portion can be just this (and this time I’m using straight NEON intrinsics because this is actually NEON-specific):

    // Load source bytes
    uint8x16x2_t src;
    src.val[0] = vld1q_u8(r_list);
    src.val[1] = vld1q_u8(l_list);

    // Shuffle to form result
    uint8x16_t result = vqtbl2q_u8(src, index_vec);

    // store results
    vst1q_u8(dest, result);

Okay, that part is very nice, but how do we get the index vectors for this? We still want fetches from r_list to use positive indices and fetches from l_list to use negative indices so that our “add pop0 to everything” gambit continues to work. But, if we’re using 2-vector TBL, we also need the final l_list indices to be 16-based, because the l_list bytes go into indices 16 through 31 of the table.

So, this time, our index vectors are actual int8s, and store either i for a byte coming from r_list[i] or -(16 + i) for an entry coming from l_list[i]. We can just use this convention, and keep the table setup otherwise the same as in the previous step. Right before we do the shuffle, we want to negate our table indices for L list entries, which boils down to taking the absolute value of everything. That’ll leave the (positive) r_list indices alone and negate the (currently negative) l_list indices like we need.

Therefore, if we did that, we would end up with something like

    // Load index vector parts
    int8x16_t index0 = vld1q_s8(index_vec_tab0[in_bits & 0xff]);
    int8x16_t index1 = vld1q_s8(index_vec_tab1[in_bits >> 8]);

    // Form merged control
    int8x16_t summed = vaddq_s8(index0, index1);
    uint8x16_t index_vec = vreinterpretq_u8_s8(vabsq_s8(summed));

This is assuming we use the variant with full 128-bit table entries. As before, we can make the second table have 64-bit entries, making it smaller but taking one more vector shuffle-type operation at runtime (usually vextq).

The final trick is realizing that NEON has an instruction for |a[i] – b[i]| for signed values, and what we need is |a[i] + b[i]|. But b[i] comes out of a table that we control. We can just negate that entire table, and now the final index vector is two loads and a single arithmetic instruction, giving this full merge operation:

    // Load index vector parts
    int8x16_t index0 = vld1q_s8(index_vec_tab0[in_bits & 0xff]);
    int8x16_t index1 = vld1q_s8(index_vec_tab1[in_bits >> 8]);

    // Form merged control
    uint8x16_t index_vec = vreinterpretq_u8_s8(vabdq_s8(index0, index1));
    
    // Load source bytes
    uint8x16x2_t src;
    src.val[0] = vld1q_u8(r_list);
    src.val[1] = vld1q_u8(l_list);

    // Shuffle to form result
    uint8x16_t result = vqtbl2q_u8(src, index_vec);

    // store results
    vst1q_u8(dest, result);

I think this 16-byte merge operation might actually be capital-O Optimal on current NEON implementations. The loads from the two lists and the store are strictly necessary; using a single 1-uop (on many uArchs) instruction to do the byte shuffling is the minimum amount of work we can do once we have those source vectors; I’ve explained above that a 1MiB LUT from full 16-bit indices is not practical; if we can’t do it in a single table load, it has to be at least two; and if we have two separate table loads, there needs to be at least one instruction to combine them.

There is some uncertainty around the rest of the loop (figuring out the popcounts, the pointer advancing, and such), the “control plane” so to speak, but I think the “data plane” side of this operation is airtight.

And I think that’s it for today! I won’t be linking to any code of my own for this, since I’ve been talking to Marcin, he’s tried the things I described in this post out in his GitHub repository for PivCo-Huffman and confirmed they’re faster in his tests, so just go there if you want to see this actually implemented. They might not be up immediately depending on when you read this, but they should land shortly.

Simple batch decoding of unary codes

Unary codes are a form of universal variable-length code (UVLC) that are sometimes used on their own but more commonly used as a building block for more general families of UVLCs like Golomb, Rice or Gamma/Exp-Golomb codes. There are multiple conventions in use, the one I’ll use in the following has codewords terminated with a “1” bit and is 0-based, i.e. the codebook goes

0 -> 1
1 -> 01
2 -> 001
3 -> 0001
4 -> 00001

and so forth. That is, some value i is encoded by sending i “0” bits and a single “1” bit at the end. You sometimes see the opposite convention (variable-length run of 1s terminated by a single 0), but having the runs be 0 bits is a bit nicer in code. The main reason is that “count leading/trailing zero bits” instructions are readily available on most machines these days; these can also be used to count leading/trailing ones by doing a bitwise complement first, but that extra work disappears when we use 0s in the first place.

In this post, I want to consider the case where we’re decoding a bunch of unary-coded values in a row. For the sake of concreteness, let’s consider an actual decoder that uses LSB-first bit packing and a Little Endian byte stream, using variant 4 from “Reading bits in far too many ways (part 2)”. That decoder will usually look something like this on a 64-bit platform:

  // refill
  uint64 next = read64LE(bitptr);
  bitbuf |= next << bitcount;
  bitptr += (63 - bitcount) >> 3;
  bitcount |= 56;

  if ((bitbuf & ((1ull << 56) - 1)) == 0) {
      return error("too many 0s in a row");
  }

  // unary code value = trailing zero count
  uint64 code = ctz64(bitbuf);
  decoded[i] = code;

  // consume bits
  // code len=coded value + 1 bit for trailing 1
  uint64 len = code + 1;
  bitcount -= len;
  bitbuf >>= len;



This should be fairly straightforward. Unary codes are usually expected to not be too long, and a common mistake is to not check properly and then run into overflow issues or UB when such inputs show up, hence the explicit limit in this version. Theoretical UVLCs can encode any non-negative integer; in practice, any format I’ve ever seen that doesn’t enforce hard limits has nasty (and frequently exploitable) overflow bugs. If you really do need to support very large codes, I would recommend still writing your loops for the expected case (short runs, unless you’re using a unary code in the wrong place) and treating long runs as an exception that you handle more carefully.

Anyway, back to our topic. The code above works, but it’s relatively expensive. As many bitstream decoding tasks, this is completely serial, and we also have a lot of overhead in the bitstream handling. The unary code corresponds to a geometric distribution with success probability 1/2, so the expected code value works out to 1, and hence the expected number of bits consumed every iteration is 2. But we still need to do the refill handling every iteration because any given code could be a long one.

In the usual setting where unary codes are arbitrarily interleaved with other codes, it’s hard to do much better than this. For example, even a basic Rice code alternates unary prefixes with a fixed-length suffix, and that combination destroys a lot of valuable structure that the constituent parts (namely, unary and fixed-length codes) have on their own. Table-driven decoders can accelerate common cases and even handle multiple symbols at once (provided their codewords are short enough), but often at the cost of increased logistical difficulties, and the fundamentally serial nature of decoding with the critical path through most of the decode stays the same.

By contrast, if we can assume a sequence of pure unary codes with nothing in between, things immediately get simpler. For example, if our bit buffer contains at least 2 set bits, we know that we can decode two code words before needing to refill, which lets us unroll the loop and amortize the refill overhead. This is cheaper than it sounds because this test, using the well-known trick to clear the lowest set bit (if any), boils down to:

  // clear lowest set bit in bitbuf:
  uint64 next = bitbuf & (bitbuf - 1);
  if (next != 0) {
      // we know we have at least two set bits in bitbuf,
      // and can use ctz on "bitbuf" and "next" to determine
      // the next two code values in parallel.
  } else {
      // only one set bit in bitbuf, do single decode as before
  }



This strategy works and does in fact result in an almost perfect 2x speed-up without needing a table-based decode (which would add at least one L1D access’s worth of time to the critical path between iterations), but I won’t spend any more time on this here, because with a pure unary code stream, we also have a much better option.

Tunstall-style decoding

Prefix codes map from a known input alphabet to variable-sized (in general) code words. Tunstall coding is the dual approach that maps a variable number of input symbols to code words of a fixed size. Decoding typically uses a table that has space for the maximum number of symbols emitted, plus their count.

Unary codes are not actually a “natural” Tunstall code; code words can get arbitrarily long. But it’s trivial to “glue them together”. Let’s see what happens when we try to just grab a fixed number of bits—8 being the most convenient choice by far, since it means we just process bytes at a time—and attempt to do a table-based decode. I’ll write our 8 individual bits from left to right and label them with letters so you can follow along for the explanation below:

0,1,1,0, 0,1,0,0
a b c d e f g h

Suppose we don’t remember what came before it, and we haven’t yet looked at what comes after that byte. What can we say about these unary code values?

At position “a”, our current run of zeros gets extended by 1, and at position “b”, we have the end of a code. We don’t know what the actual value being coded is; that depends on how many 0s preceded this byte, which we pretend to not remember for now. We’ll get back to this.

At position “c”, we have another 1 bit, immediately after “b”, so we know that’s for sure a value of 0. Likewise, at “f”, we have another 1, and this one was preceded by two “0” bits, so we also know with certainty that it’s the end of a codeword for the value 2. Finally, positions “g” and “h” start a run of at least 2 zero bits, but we don’t know how long it’s going to get until we look at the next byte (and possibly more bytes after that).

Okay, so we don’t quite have a regular Tunstall code, because we need some sort of carried state between codewords. But it’s not far off: from those 8 bits, we can definitely tell that we’re going to get 3 values out of decoding this byte, we know how many extra 0s need to get added to the tally before emitting that first code, and we know how many 0s we had left over at the end that will go into some future code that terminates at the next “1” bit in the stream, wherever that ends up being.

Note that each unary code (in our formulation) contains exactly one “1” bit, so the number of code words emitted is always the same as the population count of the input byte.

What this tells us is that we can build a 256-entry table that encodes what to do with each byte value. In that table, we store up to 8 small integers (uint8 is plenty, since the actual values contained in the table are all below 8) and the “carry-out” (the length of the final run of 0s that hasn’t found a 1 yet). We also either store the number of output values or calculate it via a “population count” instruction, if available. One special case: every byte value that’s non-0 has at least one value emitted from it, meaning the “carry” gets reset somewhere in the middle. If the byte is 0, however, the current run of 0s extends by another eight bits without emitting a code. Putting this all together, we get a decoder loop that looks something like this:

  uint8 byte = *bitptr++;
  if (byte != 0) {
      // at least 1 bit set -> we emit values
      TableEntry tab = g_unary_table[byte];

      // first value has carry added
      decoded[i] = carry + tab.value[0];

      // remaining values are copied directly
      for (j = 1; j < tab.count; j++) {
          decoded[i + j] = tab.value[j];
      }

      i += tab.count;
      carry = tab.carry;
  } else {
      // run of zeros: no code emitted just yet,
      // add 8 to current carry
      carry += 8;
      if (carry >= 57) {
          return error("too many 0s in a row");
      }
  }


That give us a different way of accomplishing the same thing we already had. But this one doesn’t involve any variable bit IO. And while it looks more complicated, it actually has a lot of potential for further simplifications.

The optimized version

This time, I’ll give the final version first, and then explain the changes to get there:

  uint8 byte = *bitptr++;
  if (byte != 0) {
      uint64 values = g_unary_table[byte];
      // emit values from table, adding carry into the first value (bottom byte)
      write64LE(decoded + i, values + carry);
      // next carry is in top byte of table
      carry = values >> 56;
      // number of values emitted is the population count
      i += popcnt32(byte);
  } else {
      carry += 8;
      if (carry >= 57) {
          return error("too many 0s in a row");
      }
  }


First, I made the table just be a table of uint64 values, and assume that decoded is an array of bytes with a bit of padding, so we can safely overshoot the expected number of code values by at least 7 without writing out of bounds.

Second, I am explicitly writing these up to 8 values using a single Little Endian 64-bit write. Third, the first value being emitted, which needs the carry correction, is in the low-order byte of values, so we can simply add carry to values to do the correction. The values contained in the table are never larger than 7 (because the first set bit in the byte can have at most seven 0s preceding it before we run out of bits in the byte!) and carry is guaranteed to be less than 57. The table values for carry are also at most 7 and the byte == 0 case explicitly ensures that it doesn’t get larger than that. Therefore, this addition never overflows out of the low byte.

Finally, fourth, we store the next carry value in the top byte of the table entry, purely so our table entries don’t need to get bigger than a single uint64. This looks a bit weird at first glance: can’t we have 8 values from the same byte, if the byte was all-1 bits, i.e. 0xff? Indeed, we can, but that is the only case where we emit 8 values, and in that case the 8 values being emitted are 0 and the carry is also 0. So yes, the two table fields are double-booked in that one case, but since they contain the same value, it doesn’t matter.

Okay, that’s a pretty short decoder loop, and it’s an interesting approach, but what did that buy us?

Well, this turns out to be much faster than the one-at-a-time decoder. The one-at-a-time decoder produces one value per iteration. This one consumes input bits at a fixed rate of eight bits per iteration, and assuming these bits are approximately uniformly distributed (as they should be, unless you’re using the unary code in a setting where it’s inappropriate), the expected number of set bits in each byte is four, meaning that each iteration of our loop averages four values written, not just one.

Does this mean we should expect around a 4x speed-up over the one-at-a-time version?

Actually, no! Because this formulation has an ace up its sleeve. Note that it always consumes a single input byte each iteration, and advances the input cursor by one byte. And the relatively-high-latency table access (usually on the order of 4-5 cycles for a L1 hit these days, compared to 1 cycle latency for an arithmetic operation) only depends on the byte value we loaded, and not on anything else in this loop.

If we look at the dataflow graph for the whole process, note that the critical paths for the loop-carried variables go as follows:

  • bitptr (the input cursor) just keeps adding 1 to previous iteration’s value; 1 cycle latency. An out-of-order CPU will know, within a single cycle of starting the current iteration, where to fetch the byte for the next iteration.
  • carry for the next iteration depends on values for this iteration (loaded from the table), but the only thing it does it get added to the new values in the next iteration. In the common path (when the byte read wasn’t 0, which occurs about 99.6% of the time) there’s a new definition of carry every loop iteration. The only thing that depends on this iteration’s carry is the next iteration’s store, subsequent iterations don’t care, so carry actually isn’t on the critical path at all.
  • i, the output index, gets incremented by the population count of byte, but all the byte loads are independent of each other (they only depend on bitptr), and i only flows into the stores, which nothing directly depends on. The length of the carried dependency is only the length of the add, 1 cycle again. The remaining latency is off the critical path; we might need to pipeline across several iterations to hide the latency of the population count, but there’s nothing that stops out-of-order CPUs from doing exactly that.

In short, the critical iteration-to-iteration path for this decoder is a single clock cycle! There’s a lot more work that happens, but it’s not on the critical path to starting new iterations; an out-of-order CPU with sufficient potential for instruction-level parallelism can go hog wild on this loop.

What does this mean in practice? Well, I benchmarked both the Tunstall-style decoder and the one-at-a-time decoder above, and on my home machine (a Zen 4 CPU), the Tunstall-style decoder is approximately 9x faster. We’re not talking small fry here; the difference is substantial.

So what?

This may seem like a random curiosity, but it’s not, because there’s something much deeper at play here.

Over the past three decades or so, we’ve gotten huge increases in computing power, but they’re all critically dependent on parallelism. Be it instruction-level parallelism that is what out-of-order CPUs set out to extract, thread-level parallelism that is unlocked by multi-core CPUs, or the massive exploitation of data parallelism that is at the heart of GPUs and SIMD instruction sets.

One thing I’ve kept running into during the past 10 years is that we have ridiculous amounts of compute power available to us, but a lot of our data formats are set up all wrong to benefit from any of it. It’s not like nobody is noticing – if you look at say the VLDB conferences, there’s been a steady stream of papers figuring out ways to put all those sources of parallelism to use for the past several decades – but it feels like large parts of the data compression and “data wrangling” world dealing with (de)serialization are stuck in the same rut they’ve been in since at least the 1970s, with variable-sized encodings that mix everything into a single stream absolutely everywhere. At best, these take heroics to extract a little parallelism; more commonly, it’s either not possible, or at least not economical, to do so at all.

What this post is trying to demonstrate, more than anything else, is just how much we’re giving up by approaching things this way. Granted, there are other factors at play than speed here, but I would argue that not only is the Tunstall-style decoder for unary codes simpler, it is also easier to make safe, and at lower relative cost. Decoding multiple values per iteration means that any bounds-checking overhead etc. is automatically amortized, so there’s much less incentive to try something fancy and subtle that might backfire later. And note we didn’t even do any SIMD or threading to get there! We got around 4x just from an approach more suited to the problem (to wit, most unary codewords are really short, so we want something that can benefit from this) and more than another 2x just because that new approach happens to also play to the strengths of current machines, whereas fully serial decoding very much does not.

Going back to the motivating examples: straight unary codes are probably not something you see a lot. But something like Rice codes or Exp-Golomb/Gamma codes does show up in scenarios like database indices and posting lists.

There are slick decoders for these that are almost exactly the same cost as just the unary decode portion. I’ve posted about some of these long ago. But you have many more options once you split them into their constituent pieces.

For Rice codes, they decompose into a unary prefix and a fixed-size suffix. As this post shows, we can decode a bunch of pure unary values much quicker than we can a stream that also has other data inside it. A batch of fixed-size codewords is likewise trivial (and can even support arbitrary random access). So, if we have a lot of Rice-coded values, it’s a very good idea to separate out the unary prefix portion from the suffix portion, splitting the whole process into 1. decoding all the unary prefixes in one go and 2. running an extra pass that glues on the suffix bits; this extra pass has completely linear memory access patterns, no data-dependent control flow, and is trivial to optimize.

Something like Exp-Golomb is a bit more complicated, because the suffix portion is variable length, but note that separating the two halves pays dividends here too: knowing all the lengths up front means the suffix loop is not constantly waiting for a L1D memory access to count-leading-zero operation or similar to finish. We once again end up with a loop where we only add a number to a bit position as a loop-carried dependency and everything else can overlap. Or, if we want to get more ambitious, we can realize that we can determine many “bit read” offsets in parallel once we have the predetermined lengths in an array, because now we can do a prefix sum computation on the lengths to get the exact position in the bit stream where each value’s suffix bits start. And suddenly we have something that is amenable to SIMD implementation. (Oodle has been using this approach on parts of the Kraken bitstream for 10 years now.)

Don’t get me wrong, I don’t mean to suggest that everything needs to look this way, but I keep seeing serialization formats that don’t even consider this kind of approach at all, as well as a steady stream of people discovering just how much room for improvement there really is from either changing implementations of wire formats to use more parallelism-friendly approaches or changing these formats to be less inherently serial in the first place.

Not everything needs to be particularly optimized, and batch processing is not suitable for every task, but it is simply astonishingly effective on current computing hardware, and sadly under-utilized.

Appendix: table generation

I did not put this in the body text because it’s fairly straightforward and I wanted to concentrate on the way the decoder works, but for what it’s worth, here’s how the tables for my test program were generated:

  for (int byte = 0; byte < 256; byte++)
  {
      U64 remainder = byte | 256; // add a dummy 1 bit at top
      U64 shift = 0;
      U64 table_entry = 0;

      for (;;)
      {
          // determine next code value (we always have a set bit,
          // so this always works)
          U64 code = ctz64(remainder);

          // shift that bit out
          remainder >>= code + 1;

          if (remainder == 0)
          {
              // if there was only 1 set bit left, that's the sentinel
              // we inserted; our code is the carry.
              //
              // if and only if byte==255, we already have a real value
              // at that position, but in that case, both that final
              // value and the carry are 0, so it doesn't matter.
              table_entry |= code << 56;
              break;
          }
          else
          {
              // not yet done; add this value to the list
              table_entry |= code << shift;
              shift += 8;
          }
      }

      g_unary_table[byte] = table_entry;
  }



Why does ASTC use ISE when almost nothing else does?

The ASTC texture compression format has its “integer sequence encoding” to send small integers with a uniform probability distribution within their range. When that value range is [0,2k-1] for some integer k, this is straightforward: just send the values with k bits each. But ISE ups the ante by supporting not just power-of-2 sizes ranges, but also allowing a single prime factor of either 3 or 5 in the size of the range. So we can, for example, have some value x in the range [0,95], 96 total, where the low 5 bits are sent regularly, and then we have the “high trits” x/32\lfloor x/32 \rfloor that are always in the range [0,2]. When the extra prime factor is 3, these trits are arranged in groups of 5. The number of possible combinations of 5 values in the range [0,2] is 35 = 243 < 256, so we can encode our 5 values in 8 bits. For 5 values in the range [0,95], that means we spend 5*5 + 8 = 33 bits; had we just used regular binary instead, we would have probably rounded up to 7 bits per value, so our 5 values would take 5*7 = 35 bits. There we go, 2 bits saved over 5 values, about 2/5 = 0.4 bits per symbol, so why even ask the question in the title?

The actual answer is at least slightly more interesting: because of course in practice this method isn’t competing with just rounding up to the nearest integer bit multiple, we could also use a prefix code. With 3 values, which in the following I’ll write as “a”, “b” and “c” to avoid confusion with actual bit strings, there is only one real choice of prefix code: one of the values gets sent with a 1-bit code and the other two values each get a 2-bit code. Something like

a -> 0
b -> 10
c -> 11

You can change which of the three codes gets the 1-bit code length, but it’s always one of them. Our assumption is that all three values are equally likely, so the expected cost of a value ends up being 1/3 * (1 + 2 + 2) = 5/3 bits, or about 1.67 bits per symbol. The ISE encoding that sends 8 bits for every group of 5 values averages 8/5 = 1.60 bits per symbol. (For reference, the theoretical minimum is log2(3) ≈ 1.585 bits per symbol.)

And suddenly the difference looks a lot less drastic. Compared to a straightforward prefix code, the expected savings shrink from 0.4 bits per symbol down to about 0.07 bits per symbol, so we expect to save a single bit slightly less often than one in every 14 values.

ASTC blocks are not large. They have 128 bits, total, and of the modes that use ISE at all, at least 17 of those bits are, effectively headers. With at most 111 bits left for payload, the largest number of trit values we can even theoretically send in a block, assuming we stick with the minimum possible [0,2] value range works out to… 69 (I swear I didn’t fudge this). If you ever managed to cram that many values into a block (not actually possible in the format for reasons that don’t matter here), ISE’s expected savings over the bog-standard prefix code work out to under 5 bits, and because we have to round to an integer size at block boundaries, that means the trits can only ever save a maximum of 4 bits over a really elementary prefix code, in the most extreme case. In practice, most blocks don’t actually have that many values, and many of them will in fact not even make it over the threshold where the expected saving crosses 1 bit!

Now, I hopefully have you convinced that the question in the title is at least somewhat meaningful. Now clearly, eventually, with enough data, the savings from something like ISE are definitely noticeable. It’s just that ASTC blocks are quite short and can’t send that many values to begin with; what do we care about the long run if we can never get there? Sure, we might save 1-3 bits over the block as a whole, but that’s in a range where we possibly could’ve used a slightly more compact encoding in the header fields, maybe gotten rid of some rarely-used options, and gotten those extra bits that way, instead of dealing with the extra complexity that ISE brings. (Which is, on the whole, not that big a deal, other parts of ASTC are also rather complex, but it’s still a question worth asking.)

And now that I’ve convinced you that this is not quite as trivial a question as it may seem, time to fess up: I’ve not been entirely honest with you in this post. I haven’t said anything outright wrong (at least, not intentionally), but I’ve been using the word “expected” multiple times, and that is doing a lot of heavy lifting here.

It’s true, for uniformly distributed random input data, we expect the a ↦ 0, b ↦ 10, c ↦ 11 prefix code (or one of its permutations) to average 1.67 bits per symbol. But it depends on the data! If all the trits we send happen to be a’s, we’ll actually only use 1-bit codes and send 1 bit per symbol for that exact input sequence. Conversely, if there’s not a single ‘a’ in the data we want to encode, we’ll actually end up using 2 bits per symbol. With ISE, the size doesn’t depend on the data. You need to send n trits, I can guarantee you, sight unseen, that ISE will encode those in 8n/5\lceil 8n/5 \rceil bits, no matter what the values are. And that is the real reason to use something like ISE in ASTCs context; it’s not so much about the superior packing density compared to what a simple prefix code could do, it’s about the predictability. ASTC actually really leans on this, in that it infers some coding parameters from however many bits are left after the headers and all the other data in the block have been accounted for. The predictability is essential.

So, now to answer the full question in the title: ASTC uses ISE because it works a block at a time, has very few bits to play with, and really benefits from the size of an ISE code stream being completely predictable without knowing the data. It’s not that the expected cost of using a simpler prefix code would change things that much by itself, it’s that the lack of predictability would complicate encoders even further and end up requiring extra header bits to compensate for the variability that would add to the cost.

Conversely, the reason we don’t see more ISE-like things elsewhere is that prefix codes are, on average, quite good and most applications are perfectly content with that. The ones that don’t want to or can’t afford either variability or the small expected waste are likely to take the plunge into “full” arithmetic coding (or related techniques) instead.

Addendum: quints

I didn’t go into the other (quint) encoding which needs a set of 5 values which I’ll call a through e. 53 = 125 < 128 so three quints fit into 7 bits with almost no waste, giving 7/3 ≈ 2.33 bits per quint, vs. the ideal log2(5) = 2.32 bits. The corresponding prefix code will assign something like

a -> 00
b -> 01
c -> 10
d -> 110
e -> 111

with two 3-bit codes and three 2-bit codes, making the expected cost (2*3 + 3*2) / 5 = 12/5 = 2.4 bits per symbol. Once again, about 0.07 wasted bits per symbol, so the rest of the discussion doesn’t materially change. The maximum number of quints that can theoretically fit into the at most 111-bit payload section of an ASTC block is 1113/7\lfloor 111 \cdot 3 / 7 \rfloor = 47, making the expected bit savings from ISE on a maximum-length list of values 47 * 0.07 bits = 3.29 bits, which rounds down to 3 (we can’t use the partial bit).

Content creator

Nobody should voluntarily call themselves that.

“Content” is the language of people on the distribution side of things. If you look at something like a 19th century newspaper it’s mostly a logistics exercise. Sure you may think of just back issues on microfilm in national archives or something like that, but the actual business was not the journalism; it was printing, distributing and ultimately selling a folded-up bunch of big sheets of paper to a large number of buyers at regular intervals.

It turns out this is only actually a viable business if those pages contain something that people are willing to pay some amount of money to read, hence: “content”!

Print and other mass media have changed substantially in the past 150 years, but this part hasn’t changed: “content” is what the people who build and run the logistics side call what gets distributed.

I don’t think there’s anything inherently wrong with that, mind. I even happen to be one of those people, having spent pretty much my entire professional life so far on “content pipelines” in one form or another.

But that language is your hint right there. Staying in the pipeline metaphor, the way I interact with content in my day job as a kind of undifferentiated sludge with mildly caustic characteristics that has an alarming tendency to gum up, corrode and spring leaks in pipes that I’m supposed to keep in working order. But that doesn’t mean that the people creating said content™ should think of it in the same terms.

Humans breathe in air and breathe our slightly warmer air that we sometimes even vibrate on the way out for our own reasons, but you wouldn’t call a trained singer a “hot air creator” to their face, would you? What is this faux-technical overly-detached nonsense?

If you’re talking about someone else, “content creator” has the feel of “I couldn’t even be bothered to figure out what it is this person makes”, and if you’re talking about your own work, then you’re either fully alienated from your own creative output, or you’re unthinkingly adopting the vantage of someone who views your work as fundamentally fungible and interchangeable.

And again, this viewpoint is not inherently bad. For example, in my day job, if some texture makes the data cooking pipeline hiccup, I truly do not care about what that image is supposed to mean in the context of a larger work or whatever. (I rarely even look at the things, it’s mostly a matter of metadata and processing flags being set.) As a “content pipeline plumber”, the nature of the job is that I engage with most of the art I see in my job not at all, or only superficially.

But if you’re the one making it, then – fuck no (speaking as someone who has also made, and continues to privately make, art on his own). “Content creator” is just not an acceptable self-descriptor. Have some self-respect for crying out loud. If your work is so heterodox as to defy any easy categorization, all the better. You make “strange experimental art” or whatever. But “content”, please no.

Oodle 2.9.14 and Intel 13th/14th gen CPUs

There’s a hardware problem affecting Intel 13th/14th gen CPUs, mostly desktop ones. This made the rounds through the press last year and has been on forums etc. for much longer than that. For months, we thought this was a rare bug in the decoder, but from stats in Epic’s crash reports for Fortnite (as well as stats for other Unreal Engine and Oodle licensees) it was fairly striking that this issue really seemed to affect only some CPUs.

Much has been written about this problem already. At RAD we have an official page about it, which includes a link to an older version (when we still weren’t sure what was causing it).

Intel’s statement on the issue confirms that it’s a hardware problem where there’s physical degeneration of the clock tree circuitry over time resulting in clock skew which ultimately makes affected CPUs behave in increasingly glitchy ways.

This is a hardware issue, and Intel has now released multiple versions of microcode updates to try and prevent or at least limit that degradation. But the symptoms are just frequent crashes or errors in certain components. For Unreal-using games, the most common symptoms are

  • Shader decompression failures (as mentioned on our official page)
  • Shader compilation errors in the graphics driver
  • Spurious “out of memory” errors reported by the graphics driver (even when there’s plenty of GPU and system memory available)

These shader (and sometimes other) decompression errors are where Oodle (the data compression library I work on) comes in. Specifically, we notice during decompression that the data we are currently decoding is corrupted because internal consistency checks trip and return an error code. Most commonly, these errors occur as a result of either corruption of on-disk data, or bad RAM. But in this particular instance, we were (and still are) seeing these errors on machines with RAM that passes memory tests just fine and where the on-disk data passes hash validation.

We just released Oodle 2.9.14 which includes an experimental workaround that seems to at least reduce the frequency of failed Oodle decompression calls on affected machines. We’re not yet sure whether it helps with overall crash rate on those machines or if crashes happen with more or less the same frequency, just elsewhere. We’ll have more data in a few months! But meanwhile, the background for that workaround is interesting, so I’m writing it down.

Background: spring 2023 to early 2024

I’ll keep this history section relatively brief, because this is not the main point of this post.

We first started hearing about Oodle crashes on the then-new Intel 13th-generation CPUs around March 2023, concurrently from three separate sources: Fortnite was seeing a big spike in crash reports especially due to shader decompression failures (as mentioned above), with a smaller spike of errors in non-shader decompression; other UE licensees as well as Oodle-but-not-UE licensees were reporting something similar in their crash reports; and we were even getting a few mails from end users asking us about frequent Oodle crashes. (This is rare since RAD/Epic Games Tools is B2B, we don’t normally interact with end users directly).

We did notice that all of the reports seemed to involve recent Intel CPUs. From there follows months of red herrings and dead ends as well as trying (and failing) to reproduce these issues consistently. Condensed version:

  • Our first lead was that we had seen a lot of instability with a bunch of new demo machines at a recent event. Those occurred on a specific motherboard/CPU combination, and ultimately turned out to be due to bad sound drivers (which corrupted the contents of vector registers in user-mode code, yikes). The first batch of reports we had were all from one single motherboard vendor, so we thought that might be it.
  • Then we had a report from an UE licensee, still on the same motherboard, but removing the faulty sound drivers didn’t seem to help. They later reported that their crashes went away after they disabled AVX support in the BIOS. Between this and the vector register content corruption for the bad sound drivers, we started going over all SIMD code in Oodle Data with a fine-toothed comb.
  • By June, we still hadn’t found anything useful; we did have some experiments that disabled vector/SIMD usage in the decoder entirely but that didn’t seem to help, so back to square one.
  • Around July, so forum posts were making the rounds that disabling the E-cores seemed to help. We also had some reports from other motherboard vendors at the time, ruling out our initial theory that it might just be one.

We still didn’t have any machines that reproduced this behavior locally (Epic uses mostly AMD CPUs in dev machines); we were in contact with Intel since at that point we were reasonably sure it was a HW problem. All of the code in question had been shipping with very little changes since 2017 – Oodle changes over time are mostly to the compressors, as well as port to new platforms. The decoders don’t change that often. The spike in crashes was also still limited to certain newer CPUs.

By October, we found an Epic employee experiencing this crash when running UE games on his private home machine. Having a mixture of graphics shader compilations on some threads with the Oodle decompression calls on some others seemed to be key (and that still seems to be the recipe today). We tried to find a reproducer that is less complicated than entire games, but with very limited success. We did eventually manage to get an infuriatingly inconsistent Oodle-only reproducer that would usually, but not always, produce a crash within 10 minutes of completely hogging the machine. We learned a few things from that, including confirmation that SSE/AVX usage seemed to be a red herring – we could turn off usage of SSE/AVX loops entirely and we’d still get crashes at the same rate, but didn’t make a lot of progress.

We did, however, notice that various of the stock BIOS settings, especially regarding CPU voltage, current limits and frequency scaling, were what I can only describe as completely insane. 500W power and 500A current limits on parts with 253W TDP, BIOS defaults aggressively overclocking past the already aggressive default base/boost clocks, etc. So for a while we thought this was mostly a problem with aggressive motherboard defaults driving the relevant parts well outside their intended operating range.

All the while the issue was picking up more and more steam; in late 2023, we started putting up our web page at RAD noting the aggressive motherboard defaults and relaying instructions from Intel on what they should be set to instead. More users were hitting this all the time, and the tech press was starting to notice. And Intel had recently released the 14th gen parts which were suffering from the same issue.

A few months later it was clear that changing the BIOS settings was, at best, a temporary fix; affected systems continued to be unstable even after the setting changes, and Intel started looking into their own voltage/frequency scaling logic and issuing new microcode updates trying to work around the issue every few months. By late summer of 2024, we had the confirmation from Intel (linked above) that the symptoms were caused by physical circuit degradation of the clock tree circuitry.

Not something we can do much about in SW. For that matter, once machines show these symptoms, their HW is physically damaged and needs replacement. The best we could hope for with Microcode updates and less aggressive BIOS defaults was to prevent the replacement parts from developing the same problems within a matter of months.

So, that’s the background up to middle of the last year; but I mentioned we just released Oodle 2.9.14 with an experimental workaround. What’s that to do with anything, and what changed from last year?

A semi-consistent repro, at last

In spring of 2025, we were contacted by a customer about some Oodle decode issues in their game that seemed random and looked like they might be a race or memory stomp.

I’ll spare you the details, but this does come up fairly regularly and it’s usually use-after-frees or similar on the app side, so we had a few weeks of back and forth to rule out the most likely causes.

At the end of that we had a smoking gun: a reasonably consistent test setup using their game where, on decompression failure, they were just retrying the exact same call with all the same input and output buffers again, and that would often work. For context: it is expected that, on a large enough sample of end user machines, decompression calls sometimes fail, be it because the data on permanent storage is actually corrupted or due to issues like bad RAM etc. In any sufficiently large sample, you’re going to get some of those. But, generally speaking, these errors happen some distance away from the CPU. Having one CPU core fail to decompress 256k of data with an error, and then frequently having the same core succeed trying to decompress those same 256k again (or having them fail again, but with a different error), rules out those causes, because on the second attempt on the same core, that data is usually coming from the L2 or at most L3 cache, which unlike consumer DRAM has fairly strong error correction everywhere these days.

It’s not impossible for a context switch or core migration to coincide with the retry (or close enough), but expecting these to line up frequently enough for “just retry it” to be an effective remedy is a stretch at best.

More to the point, retries didn’t seem to help everywhere. They seemed to help only on one particular type of test machine – you guessed it: 13th/14th gen Intel CPUs.

The customer was getting crashes at a fairly consistent rate when starting up their game. Again, the mixture encountered during loading, with lots of shader compilation plus Oodle decompression, seemed to be key. And their reasonably consistent repro involved overclocking the CPU slightly (by about 5%) – to be fair, technically out of spec, but also a common thing for gamers to do.

Anyway, this was definitely a reason to do some more debugging. Rare decompression errors caused by what seems to be actual data corruption are not particularly actionable, but repeated decompression of the same chunk of data either failing in different ways on subsequent attempts, or sometimes succeeding or sometimes failing, was definitely something we needed to look into.

Especially since it seemed like that customer might have finally found the thing we’d been trying and failing to find since spring of 2023: a reproducer for what seemed to be the Intel 13th/14th gen bug, consistent enough to do some actual debugging with. Full disclosure: we do not know whether this issue is the same as the spurious decompression errors that users with affected machines have been seeing. But at the very least, the symptoms are very similar.

Evidence!

Oodle Data decompression needs a bit of workspace memory. By default we just allocate and free that ourselves, or you can allocate that slice up front and pass it into Oodle to avoid the allocation churn. As part of the earlier triage process, we had already asked the customer to switch to statically allocated workspace memory – the idea being that if the problem was caused by workspace memory corruption due to a use-after-free or similar, moving our workspace to statically allocated memory that was never freed would have likely fixed the issue. (And made it clear what issue in the calling code to look for.)

Anyway, passing in the workspace memory manually also means the caller knows where it lives, and has its contents even after a failed call. This is great, because most of the “interesting” decoder data is in that workspace memory and in the event of a decode error, we have a few hundred kilobytes of data that contain a memory image of all the larger decoder data structures at time of failure. We don’t get a stack trace, but we don’t generally need one because Oodle Data has very predictable control flow and also logs error messages; given the log message for the error, we know exactly where in the decode flow we were when the error was first detected.

I’ve written prior blog posts on how Oodle Data works. The important part being that instead of one monolithic decoder, we run a sequence of small, specialized loops that are more modular, easier to optimize, and easy to mix-and-match as desired. The basic currency that most of these building blocks work with is arrays of bytes, and these arrays of bytes go (among other things) into that workspace memory.

In short: all we needed to get some actionable information was a dump of the decoder workspace memory after some failed decodes. The customer was able to provide multiple examples of 10 failed decodes in a row, all failing in slightly different ways. This was fantastic – we could compare the workspace memory contents with known successful decode runs and hopefully get some more information about what was actually happening. This involved writing a small impromptu Python script to load workspace memory dumps and compare them against known-good reference decoder state.

Even just the log messages were already surprising, though: none of the failed decode logs we were sent had any evidence of a bitstream desync. Normally, if you’re going to poke around in a compressed data blob then try and decompress it, that’s what’s most likely going to happen – at some point, there’s some difference that causes the decoder to get “out of phase” with the encoder, maybe read a few bits too few or too many, and from then on it’s a cascade failure where the rest of the bitstream (up to the next sync point, anyway) decodes to complete garbage and you get some “unexpected end of file” or “invalid header” or similar error.

This literally never happened in the logs of failed decode attempts we were sent. Not once. That is, by itself, incredibly suspicious. There were errors that caused the LZ portion of the decode to go out of sync – matches with the wrong offsets, or the wrong lengths, things like that, which then ultimately result in a desync at a later stage (producing the app-visible decode error). But none of the bitstream reading ever failed in that way. We always made it fully through bitstream decoding then noticed inconsistencies later.

More to the point: when diffing the decoded bitstream contents against the expected values, there were never any big diffs either. Nor were there any “burst errors”. It was always single individual bytes being corrupted, then the next 7000-10000 bytes would be completely fine (and correct) before we got another single-byte error.

This also made it unlikely that we had a memory stomp or similar. In principle these can be anything, but in practice, in a memory stomp, you’re usually seeing at least 4 adjacent bytes being overwritten with a value that clearly doesn’t belong – usually a 32-bit or 64-bit write, because those are the most common access sizes. It possible to get individual 1-byte values stomped, with thousands of bytes in between them, it’s just not very likely. You would need awfully specific memory access patterns in the writer to get that kind of thing.

Finally, the bytes that were corrupted always seemed to contain small byte values, always between 1 and 11 in fact.

Values between 1 and 11, you say?

After I made that last observation, I had a hunch that soon proved to be correct: all the arrays with incorrect decoded data were Huffman coded (itself not surprising since that’s our most common entropy coder choice by far), and all the bytes that were corrupted were not corrupted arbitrarily – instead, instead of the value assigned to each Huffman code word, the corrupted bytes stored its length (in bits) instead. I have this post describing how Huffman decoding in Oodle works, but the part you need to know is just this: we normally use a 11-bit length limit for codes, and decode using a 2048-entry table whose entries look like this:

struct HuffTableEntry {
    uint8_t len; // length of code in bits
    uint8_t sym; // symbol value
};

And I’ve already written yet another post on the details of the Huffman decoder, so I’ll skip to the end: on x86-64 CPUs with BMI2 support, the 4-instruction sequence we use to decode one byte is this:

  andn  rcx, rMask, rBits             ; peek (rMask = ~2047)
  movzx ecx, word [rTablePtr + rcx*2] ; table fetch
  shrx  rBits, rBits, rcx             ; bit buffer update
  mov   [rDest + <index>], ch         ; emit byte

If you can’t read x86 pseudo-assembly: in C-like code, these 4 instructions correspond to

HuffTableEntry e = table[bits & 2047]; // andn, movzx
bits >>= e.len; // shrx
dest[index] = e.sym; // mov

That’s all we do to decode a single byte. We do this for 6 bitstreams in parallel (to hide latency) and unroll it a bunch of times, and every few bytes decoded there’s some other things that need to happen that don’t matter for this post, but that’s the key decode step.

Well, it seems that on the affected machines, and in this particular test setup (which includes running that game and overclocking the machine a bit) some of the time (ballpark, once every 10000 bytes decoded), that store doesn’t store CH (bits [15:8] of RCX), but CL (bits [7:0] of RCX) instead.

Oof.

That was the theory, anyway. Not hard to check: instead of using the store high, we can throw in an extra shift and then store from CL instead:

  andn  rcx, rMask, rBits             ; peek (rMask = ~2047)
  movzx ecx, word [rTablePtr + rcx*2] ; table fetch
  shrx  rBits, rBits, rcx             ; bit buffer update
  shr   ecx, 8                        ; shift out len
  mov   [rDest + <index>], cl         ; emit byte

We sent a version of Oodle with that decoder to the customer and asked them to re-run the test, and yup, that seems to fix the transient decode errors for them, at least in their test setup.

So, that’s the workaround we’re shipping on Oodle 2.9.14: on the affected CPUs, throw in the extra shift instructions to avoid the stores from “byte high” registers. This is slightly slower, but with the emphasis on slightly – this particular loop was originally optimized for Haswell Generation Intel CPUs (2013). x86 CPUs have gotten wider since, supporting more instructions per clock cycle, and the main speed limiter in that loop on newer Intel CPUs is the latency along the dependency chain (see the linked article on the 6-stream Huffman decoders for analysis). Adding an extra instruction and an extra cycle of latency on the non-critical “store byte” edge is not the end of the world on these newer CPUs. It’s a minor decode speed regression, but on typical Kraken data, it’s about 0.5% slower, which is not that bad. (And the workaround is only necessary on those specific CPUs.)

Conclusion: what actually causes this?

In the hardware? Well, I don’t really know, so this is all speculation; we’re well into the weeds of uArch implementation details here.

I talked about this with some friends and our best guess boils down to the following: the data we have, as originally loaded, is just in RCX, and that store needs to store bits [7:0] if it’s a regular byte store or bits [15:8] from that register if it’s one of the rare 8-bit high byte stores. The actual register number here almost certainly doesn’t matter, this is all renamed after all; but an x86-64 CPU that supports the high byte stores needs to be able to select which byte of a register a byte store is actually writing to memory.

That means there’s gotta be a bunch of multiplexers somewhere, these multiplexers get some control signal telling them which byte to store, and that control signal seems to have very little timing slack. At least on some parts, not enough. If they’re overclocked (or at the high end of their “turbo boost” clock range), sometimes, that control signal doesn’t make it in time, and what’s stored is instead the low half of the register.

This is almost certainly not the only signal that is cutting it close, but this workaround seems to help for Oodle decoding, at least for this customer. Whether this counts as a real fix, I’m doubtful; the affected machines are already also crashing in other places (see list at the top of the post for other examples). I think best-case, this might reduce crash rate on the relevant machines somewhat, but probably not in a way that actually leads to a good user experience. It still seems worth trying.

For now, this Oodle release is brand new and hasn’t been rolled out in games (as of this writing). We’ll know in a few months whether it actually helps reduce crash rates.

Another thing we were wondering was whether the hand-written ASM code in Oodle was the problem. After all, this is using a pretty niche x86-64 feature, and the high byte registers aren’t actually used much. But a quick test on godbolt.org shows that Clang and GCC end up using the exact same trick for a C++ version of the decode dependency chain. They do end up using 5 instructions per byte though, not 4, because unlike the code that ships in Oodle, they use regular AND, not ANDN (which has a 3-operand form). Either way, the code that actually ships in Oodle is not exactly what any mainstream x86-64 compiler would produce, but using the 8-bit high registers (the esoteric feature in question) is not the distinguishing factor here, mainstream compilers use that too.

This bug made zero sense to me for the longest time, and once you have strong evidence that a misbehaving CPU is involved, debugging it purely from the SW side is pretty miserable since CPUs are complex, full of hidden internal state and generally black boxes. So it was pretty exciting to get these data dumps that indicated that what might be happening is not only comprehensible, but in fact was very suggestive of the underlying problem.

How useful this actually ends up being, and whether it actually helps users in the wild (or just this special and somewhat contrived test setup) remains to be seen, but I’m hopeful.

UNORM and SNORM to float, hardware edition

I mentioned in a previous post that doing exact UNORM or SNORM conversions to float in hardware was not particularly expensive, but didn’t go into detail how. Let’s rectify that! (If you haven’t read that post yet, please start there if you need an explanation of what the UNORM and SNORM formats are, as well as the basic idea.)

Previously, I explained how for n ≥ 1,

\displaystyle \frac{x}{2^n - 1} = \sum_{k=1}^\infty \frac{x}{2^{nk}}

using the specific case n=8, but the proof is identical for other n. For UNORM8, SNORM8, UNORM16 and SNORM16 we need the cases n=8, n=7, n=16 and n=15, respectively. This is already good news, because this boils down to just a number of bit shifts (for the divisions by powers of two) and adds. In the cases we care about for UNORM/SNORM conversions, we also have 0 ≤ x ≤ 2n – 1, meaning it’s actually even simpler than adds, because we have n-bit values shifted by multiples of n bits, making the bit ranges non-overlapping, so there’s not even real additions involved, we’re really just repeating a bit string over and over. Therefore, even though expanding our division by 2n – 1 into an infinite series looks scary at first sight, it gives us a straightforward way to get the exact quotient to as many significant digits as we want.

With the scary-looking division step turning into something fairly benign, this looks fairly doable, but there’s a few special cases to take care of first.

Special values

Ultimately, our goal is to turn our integer inputs into floating-point numbers. In this post I’m assuming we want to convert to the IEEE 754 binary32 format, other output formats work similarly though. That means we need to determine the correct values for the sign, exponent and significand fields.

Both UNORM and SNORM formats have a few particular values that need some extra care for various reasons. For UNORM8, the two special values are the extremal ones: 0x00 (which turns into 0.0) and 0xff (which turns into 1.0). Both of these produce all-zero significand bits and exponent values that don’t occur for any of the other values (biased exponent values of 0 and 127, respectively), and it’s easiest to treat them as special cases.

UNORM16 is similar, with the special values being 0x0000 (turns into 0.0) and 0xffff (turns into 1.0). Also, UNORM8 and UNORM16 are very closely related. Converting a UNORM8 value x to float is exactly the same as converting the UNORM16 value (x << 8) | x to float. (This works because 65535 = 2562 – 12 = (256 – 1) * (256 + 1) = 255 * 257 by the usual difference-of-squares formula, so instead of dividing x/255 we can also divide x*257 by 65535.) Since this is just repeating the same byte value twice, one possible, and entirely sensible, implementation of both UNORM8- and UNORM16-to-float conversions is to treat the former as the latter with the single byte input replicated twice.

The SNORM formats are slightly more complicated: the divisors 127 and 32767 don’t have a common factor, so they’re actually distinct cases as far as we’re concerned. We also need to handle the sign, and there are a few more special values to consider. For the latter, in the SNORM8 case, 0 and 127 (bit patterns 0x00 and 0x7f) turn into 0.0 and 1.0 respectively, and we gain two additional inputs, -128 and -127 (bit patterns 0x80 and 0x81 respectively) that both map to -1.0. For SNORM16, the relevant bit patterns are 0x0000 -> 0.0, 0x7fff -> 1.0, { 0x8000, 0x8001 } -> -1.0.

IEEE754 binary32 float is a sign-magnitude format; we therefore also need to convert the two’s complement SNORM input values into separate sign and magnitude, which boils down to extracting the sign bit (MSB) and computing the absolute value of the input integer. The latter requires a 15-bit adder for SNORM16 that is also sufficient for SNORM8.

For both UNORM and SNORM formats, this takes care of all the special cases and leaves us with a fraction x / (2n – 1) where we know that x is nonzero and the quotient is not an integer, since we just took care of the cases with exact integer results.

The general case

That leaves the remaining cases, where we need to determine the exponent and the significand rounded to the target precision. I’ll work through it in detail for UNORM16, and add some final remarks at the end on how to handle the SNORM8 and SNORM16 cases. (UNORM8 is trivial to reduce to UNORM16, as pointed out above.)

For UNORM16, conceptually we want to compute x/65535.0 with infinite precision and then round the result to the nearest representable binary32 float, or whatever our target format is. All binary32 floats representing UNORM16 values are outside the subnormal range, which makes things fairly easy. The main thing is that we need to align the significand so it starts with a 1 bit to determine the correct exponent. We know x≠0, so there is definitely a 1 bit somewhere in x’s binary expansion. Therefore, normalizing x boils down to doing a leading zero count and shifting x accordingly:

// x has at most 16 bits (covers SNORM8/SNORM16 cases too)
num_lz = clz16(x)
normalized_x = x << num_lz

In practice, normalized_x and num_lz can be computed at the same time, not serially. One straightforward divide-and-conquer approach goes like this, but there are others:

normalized_x = x
num_lz = 0

if (normalized_x & 0xff00) == 0:
    num_lz += 8
    normalized_x <<= 8

if (normalized_x & 0xf000) == 0:
    num_lz += 4
    normalized_x <<= 4

if (normalized_x & 0xc000) == 0:
    num_lz += 2
    normalized_x <<= 2

if (normalized_x & 0x8000) == 0:
    num_lz += 1
    normalized_x <<= 1

This is not particular to UNORM->binary32 float conversions; regular uint16->binary32 conversions perform the same normalization step (and can easily be handled on the same datapath if desired).

Afterwards, normalized_x is known to have its MSB (bit 15) set, so we can treat it as a constant 1, which simplifies things very slightly.

Next, we compute the tentative exponent (which turns out to be the final exponent because there are no rounding cases that would change it). The largest UNORM16 value we handle in this path is 0xfffe, which falls into the [0.5,1) interval, and has a binary32 biased exponent of 126. Any extra leading zeros decrement that exponent further, so we get

exponent : uint8 = 126 - num_lz

and our calculation earlier told us that to get x/65535.0, we just need to keep repeating the normalized mantissa. Sans implicit leading 1 bit, that means we get 23 bits worth of significand by concatenating

initial_significand = { normalized_x[14:0], normalized_x[15:8] }

However, we still need to perform rounding. If we’re doing truncation (or rounding towards negative infinity, which is the same for non-negative numbers), this significand is final. If we’re rounding towards positive infinity or away from zero, we’re always adding 1 – normalized_x is not 0, which means there are always non-0 digits to the right of our significand, which means we should always round up in this case. That leaves round-to-nearest, which needs to look at the next-lower unrounded fraction bit, which in our case comes from normalized_x[7]. If that bit is 1, we need to increment initial_significand, otherwise we leave it as it is, meaning the final significand calculation for round-to-nearest is:

final_significand = initial_significand + normalized_x[7]

In short, all the standard IEEE rounding modes are easy and just require a single add at the end. This does not require a full-width adder: we don’t need to handle the case where the 16-bit normalized_x is all 1s, since that’s one of the special cases we handled outside. Since the bottom bits of initial_significand all come from normalized_x (with wrap-around since we keep repeating it), and we only ever add at most 1, we know carries can’t propagate arbitrarily far – any carries will be absorbed after at most 15 bits, since we eventually have to run into at least one originally 0 bit in normalized_x. This is slightly more convenient than the situation with normal int-to-float conversions, where rounding can end up propagating a carry all the way across every significand bit and into the exponent bits. For UNORM or SNORM conversions, our initial guess at the exponent is always correct; the only case where rounding would ordinarily cause us to increment the exponent is the inputs corresponding to -1.0 and 1.0, but we handle those as special cases elsewhere.

For UNORM8/UNORM16, what’s the damage in total? A bit of multiplexing up front to handle UNORM8 and UNORM16 inputs both (to replicate the input byte into two copies turning UNORM8 into UNORM16), the detection and handling of the two special inputs 0x0000 and 0xffff, the 16-bit leading zero count/normalization, a 16-bit adder for the rounding, and the rest is wiring. Nothing particularly expensive!

Dealing with SNORM formats

I won’t work through it in detail for SNORM8/SNORM16, but the same idea applies. For these formats, there’s a few more special cases to handle up front, and (as I’ve described it) we need to also detect the sign of the input (trivial, it’s just in the operand MSB) and compute the absolute value (in effect, a 16-bit adder).

After that, the significand datapath works the exact same way as the UNORM case. If we want to support directed roundings, we now have a sign bit to worry about, but that’s bog standard IEEE stuff and not hard. If we want a datapath that can handle all four of { UNORM8, UNORM16, SNORM8, SNORM16 }, probably the easiest way to do it is to place the SNORM8/SNORM16 absolute-value inputs into the top bits of the 16-bit leading-zero-count/normalize path.

Below that, we repeat the mantissa bits with a period of 15 bits for SNORM16 and a period of 7 bits for SNORM8 – a tiny bit more muxing. The rounding works exactly the same and the argument for why rounding only needs an adder covering the low 16 mantissa bits also stays the same.

And that’s it!

Does anyone have reason to care about this that doesn’t already know it? Probably not. I just thought it was fun to write up. I also wrote a small C++ test program to demonstrate that it this approach really gives the correct results.

MRSSE

For BC6H encoding in Oodle Texture, we needed a sensible error metric to use in the encoder core. BC6H is HDR which is more challenging than LDR data since we need to handle vast differences in magnitude.

BC6H internally essentially treats the float16 bits as 16-bit integers (which is a semi-logarithmic mapping) and works with those integers throughout. It’s a bit more complicated than just grabbing the float bits, but not in a way that really matters here. The most straightforward approach is to do the same thing in the encoder, pretend we care about those 16-integers, and just work with squared errors etc. on those, which is easy to do and also quite fast.

This works reasonably well with white(-ish) light and not very oversaturated colors, but degrades horribly especially with bright, colored emissive light. My colleague Jon Olick wrote a post showing some examples back when we originally released Oodle Texture. The problem with working on log-space RGB data boils down to this: say we have some bright red glowing thing and the RGB tuples for a pixel (in whatever scaling we choose) are something like (2.34, 0.02, 0.01). Working with the logarithms per-channel in this scenario tells us that getting the value in the green channel off by 0.001 would be about as bad as getting the value in the red channel off by 0.1, so we end up producing large net brightness shifts to avoid imperceptible hue shifts, a bad choice.

We still would prefer something squared error-ish in the encoder core: the derivative of a squared error is linear, so minimizing a squared error by finding its critical points turns into solving a linear system. We find ourselves doing this hundreds of times for millions of pixels each, so a simple linear system is good.

I’ll cut to the chase, the main distance metric we use in Oodle Texture for HDR data is what we internally call MRSSE, “Mean Relative Sum of Squared Errors” (somewhat unfortunately different from the similar but distinct notion by the same name used in statistics). In essence we wanted something like our usual squared error (aka SSE, Sum of Squared Errors, or SSD, Sum of Squared Differences), but normalized to be more like a relative than absolute error, so we used

\displaystyle \text{MRSSE}(x,y) = \frac{\|x - y\|^2}{\|x\|^2 + \|y\|^2}

The numerator is self-explanatory, the denominator was chosen to avoid divisions by zero if either vector is nonzero (in our case, there’s also an extra tiny bias term on the order of the smallest normalized float16 to avoid divisions by zero) and to make the difference symmetric. The latter is not required, but convenient. In our case, x and y are 3-component vectors. The normalization here is somewhat arbitrary, it would be perfectly sensible to throw an extra factor of 2 in the numerator to normalize by the average squared length of x and y, not the sum of their squared lengths, but we only ever care about comparing errors with each other, an overall scaling of 2 changes nothing, so we went with the simpler form.

Since this is a BC6H encoder, most commonly one of the two inputs, say x, is held constant (coming from the source image) while we try many candidate choices for y. We do this a lot, and in this case we expect x \approx y anyway (after all making the two close is our objective), so we use the asymmetric relative squared error

\displaystyle \frac{\|x - y\|^2}{\|x\|^2}

instead (again, with a tiny bias term in the numerator to avoid division by zero). This has the huge advantage that since x is fixed, we can compute all the 1 / (\|x\|^2) values for a block once up front, and then just have a regular weighted squared error (with a per-pixel weight factor) per pixel. That’s what’s used in our inner loops. It makes “minimize this error”-type optimization problems actually linear, which is extremely handy.

We did not attempt to do a full formal evaluation of different error metrics. We did, however, try something like 8 or 9 different metrics in the early stages of writing the BC6H encoder (this would’ve been mostly 2019/2020) and had a surviving 3 metrics (one using ICtCp-ish variant using the SMPTE 2084 PQ transfer function, one using the “natural” albeit problematic semi-log2 encoding implied by BC6H, and this one) carried through all the way into shipping Oodle Texture, until we ultimately decided to ship with just the one always on because we never found an image that really benefited from anything else. Of the metrics we evaluated, this ended up being computationally one of the cheaper ones and simultaneously the best in terms of visual result quality in informal but careful testing; it was not close. The semi-log2 encoding is the cheapest by far, but its problems with highly saturated bright colors ultimately disqualified it.

This one was good, still fairly simple, and gave us a result quality that (in my opinion anyway) is appreciably better than all other BC6H encoders I’ve seen.

Exact UNORM8 to float

GPUs support UNORM formats that represent a number inside [0,1] as an 8-bit unsigned integer. In exact arithmetic, the conversion to a floating-point number is straightforward: take the integer and divide it by 255. 8-bit integers are for sure machine numbers (exactly represented) in float32 and so is 255, so if you’re willing to do a “proper” divide, that’s the end of it; both inputs are exact, so the result of the division is the same as the result of the computation in exact arithmetic rounded to the nearest float32 (as per active rounding mode anyway), which is the best we can hope for.

The D3D11.3 functional spec chickened out here a bit (granted, this verbiage was already in the D3D10 version as I recall) and added the disclaimer that

Requiring exact (1/2 ULP) conversion precision is acknowledged to be too expensive.

For what it’s worth, I had reason to test this a while back (as part of my GPU BCn decoding experiments) and at the time anyway, all GPUs I got my hands on for testing seemed to do exact conversions anyway. It turns out that doing the conversion exactly is not particularly expensive in HW (that might be a post for another day) and certainly doesn’t require anything like a “real” divider, which usually does not exist in GPUs; correctly rounded float32 divisions, when requested, are usually done as a lengthy instruction sequence (plus possibly an even lengthier fallback handler in rare cases).

A conversion that is close enough for essentially all practical purposes is to simply multiply by the float32 reciprocal, i.e. 1.f / 255.f, instead. This is not the same as the exact calculation, but considering you’re quantizing data into an uint8 format to begin with, is almost certainly way more precision than you need.

What if, as a matter of principle, we want an exact solution, though?

Option 1: work in double (float64)

Not much to say here. Converting the 8-bit value x to double, multiplying by 1. / 255., then converting the result to float32 happens to give the same results as the exact calc (i.e. dividing x / 255.f). This is trivial to verify by exhaustive testing. It’s just 256 possible inputs.

Option 2: but I don’t want to use doubles either!

Picky, are we? OK, fine, let’s get to the meat of this post. What if we’d like an exact conversion, would prefer to not use a divide, and would also prefer to avoid using doubles?

This is the fun part. First, let’s turn 1/255 into a geometric series:

\displaystyle \frac{1}{255} = \frac{1}{256} \frac{1}{\frac{255}{256}} = \frac{1}{256} \frac{1}{1 - \frac{1}{256}} = \sum_{k=1}^\infty \frac{1}{256^k}

and therefore

\displaystyle \frac{x}{255} = \sum_{k=1}^\infty \frac{x}{256^k}

Going from a single divide to an infinite series might seem like it’s making our problems worse, but it’s progress. The divides by 256k are easy because those are powers of 2, hence exact (and, incidentally, non-overlapping bit shifts). Of course we can’t sum infinitely many terms, but we also don’t need to. All we need is an approximation that gets close enough to round to the right value. How many series terms is that?

The hardest to round cases turn out to be powers of 2, x=2k, k = 0, …, 7. They’re all basically the same, so let’s stick with the easiest example x=1 (k=0). Float32 has 24 effective significand bits (only 23 stored, since a leading 1 is implied). To round correctly, we need 24 correct bits after the leading 1 bit (23 actual bits plus the first truncated bit for rounding)… and then actually another term after that to make sure we don’t accidentally hit an exact halfway case. Our infinite series never terminates, so we need that extra term for non-0 values to make sure the sticky bit gets set pre-normalization to indicate unambiguously that the value ought to round up (and not break the tie towards evens).

Each term contributes 8 bits, and in our x=1 case, normalizing the floating-point number shifts out the first term entirely. In short, we need 1 “sacrificial” term that might get mostly normalized away (in the power-of-2 cases), 3 more terms for the bulk of the mantissa, and then 1 more term after that to avoid the halfway-case problem and ensure our sticky bit gets set correctly come rounding time.

So am I proposing summing 5 terms of a series? That doesn’t sound fast! And, fortunately, no. We don’t need to sum terms one by one; if we’re going to be multiplying anyway, we might as well combine multiple terms into one. Here’s the actual solution I’m thinking of (note this assumes compilation with proper IEEE semantics, not fast math shenanigans):

// Constants here written for posterity but assumed
// to be evaluated at compile time.
// All values exactly representable as float32.
const float k0 = (1.f + 256.f + 65536.f) / 16777216.f;
const float k1 = k0 / 16777216.f;

// suppose x is integer in [0,255], then:
float ref = x / 255.f;

// is exactly the same (in IEEE-754 float32 with RN) as
float tmp = (float)x;
float no_div = (tmp * k0) + (tmp * k1);

That first calculation of tmp * k0 creates 3 back-to-back copies of x. That’s 24 bits of significand which is (just) exactly representable in float32 thanks to the implicit 1 bit. tmp * k1 creates another 3 copies (and thus terms of the series) scaled down by 2-24, which is likewise exact (it’s the same as the first term except for the exponent). That means we end up with two machine numbers, and adding them together has one final rounding step – the only rounding we’ll do. 6 series terms is more than we needed (and in fact 5 works fine), but the constants are easier to write down with 6 terms instead of 5 and it doesn’t really matter.

Final tally: after the int->float conversion, two multiplies and an add. When FMAs are available, the second multiply and the final add can also be replaced with a FMA, meaning this comes in at two FP ops with FMAs available, and 3 without. If there’s a way to do the exact conversion with less, I don’t know it.

UPDATE: Alexandre Mutel sent in a very slick suggestion on Mastodon based on a completely different approach that boils down to

// Constants evaluated at compile time.
const float k0 = 3.f;
const float k1 = 1.f / (255.f * 3.f);
return ((float)x * k0) * k1;

The realization here simply is that while the float32 reciprocal of 255 isn’t accurate enough to produce correct rounding, the reciprocal of 255*3 is, and multiplying integers in [0,255] by 3 is (easily) exact in float32. In this case I wrote the multiplication by 3 after floating point, but it can of course also be done on the integer side (which was Alexandre’s original suggestion); whichever is more convenient. This is two multiplications after the int->float conversion, even when no FMAs are available, so I think that’s our new winner.

BC7 optimal solid-color blocks

That’s right, it’s another texture compression blog post! I’ll keep it short. By “solid-color block”, I mean a 4×4 block of pixels that all have the same color. ASTC has a dedicated encoding for these (“void-extent blocks”), BC7 does not. Therefore we have an 8-bit RGBA input color and want to figure out how to best encode that color with the encoding options we have. (The reason I’m writing this up now is because it came up in a private conversation.)

In BC1/3, hitting a single color exactly is not always possible, and we have the added complication of the decoder being under-specified. BC7 has neither problem. If we look at our options in table 109 of the Khronos data format spec, we see that mode 5 stores color endpoints with 7 bits of precision per channel and alpha endpoints with a full 8 bits, so it’s a very solid candidate for getting our 8-bit colors through unharmed.

The alpha portion is trivial: we can send our alpha value as the first endpoint and just use index 0 for the alpha portion (mode 5 is one of the BC7 modes that code RGB and A separately, similar to BC3), which leaves the colors. Can we use the color interpolation to turn our 7 endpoint bits per channel into an effective 8?

We have 2 color index bits per pixel. Indices 0 and 3 are not useful for us, they return the dequantized endpoint value, and those are 7 bits, so that only gives us 128 out of 256 possible options for each color channel. Index 1 is interpolated between the two at a 21/64 fraction; index 2 is symmetric (exactly symmetric in BC7, unlike the situation in BC1), i.e. it’s the same as using index 1 with the two endpoints swapped, and therefore doesn’t give us any new options over just using index 1. That means we only need to consider the case where all index values are 1: if the value we need in a color channel happens to be one of the 128 values we can represent directly, we just set both endpoints to that value, otherwise we want select a pair of endpoints so that the value we actually want is between them, using that 21/64 interpolation factor (subject to the BC7 rounding rules).

For BC1, at this point, we would usually build a table where we search for ideal endpoint pairs for every possible target color. For BC7, we can do the same thing, but it turns out we don’t even need a table. Specifically, if we build that table (breaking ties to favor pairs that lie closer together) and look at it for a bit, it becomes quickly clear that we can not only hit each value in [0,255] exactly, but there’s a very simple endpoint pair that works:

// Determine e0, e1 such that (BC7 interpolation formula)
//   target == (43*expand7(e0) + 21*expand7(e1) + 32) >> 6
// where expand7(x) = (x << 1) | (x >> 6)
e0 = target >> 1
e1 = ((target < 128) ? (target + 1) : (target - 1)) >> 1

And that’s it. Do this for each of the R, G, B channels of the input color, and set all the color indices to 1. As noted earlier, the A channel is trivial since we just get to send a 8-bit value there to begin with, so we just send it as one of the endpoints and leave all alpha index bits at 0.

This is exact and the encoding we’ve used in all our shipping BC7 encoders, regular and with rate-distortion optimization both. Often there are many other possible encodings using the other modes, especially for particularly easy cases like all-white or all-black. In our experiments it didn’t particularly matter which encoding is used, so we always use the above. The one thing that does matter is that whatever choice you make should be consistent, since solid-color blocks frequently occur in runs.

Why those particular integer multiplies?

The x86 instruction set has a somewhat peculiar set of SIMD integer multiply operations, and Intel’s particular implementation of several of these operations in their headline core designs has certain idiosyncrasies that have been there for literally over 25 years at this point. I don’t actually have any inside information, but it’s fun to speculate, and it gives me an excuse to waffle about multiplier designs, so here we go!

MMX

x86 doesn’t have explicit SIMD integer operations before MMX, which first showed up in – no big surprise – the Pentium MMX. Said Pentium MMX offers exactly three SIMD integer multiply instructions, and all three of them originally took 3 cycles (fully pipelined).

The first and most basic one is PMULLW, “packed multiply low word”, which interprets its two 64-bit MMX register operands as containing four words (which in x86, if you’re not familiar, means 16-bit integers) each. The corresponding lanes in both operands are multiplied and the low 16 bits of the result written to the corresponding lane of the destination. We don’t need to say whether these integers are interpreted as signed or unsigned because for the low 16 bits, it doesn’t matter. In short, it’s a basic element-wise multiply working on 16-bit ints.

The second available integer multiply is PMULHW, “packed multiply high word”. Again, we multiply 16-bit lanes together, which (in general) yields a 32-bit product, and this time, we get the top 16 bits of the result. This time, we need to make up our mind about whether the integers in question are considered signed or unsigned; in this case, it’s signed. A fun fact about “high multiply” type operations (which exist in a lot of instruction sets) is that there’s no practical way (at least, not that I’m aware of) to compute just those high bits. Getting those high bits right generally means computing the full product (in this case, 32 bits per lane) and then throwing away the bottom half. Therefore, a datapath that can support both types of multiplies will usually end up having a full 16×16->32-bit multiplier, compute all product bits, and then throw half of them away in either case.

That brings us to the third and last of the original Pentium MMX’s multiply-type instructions, and the most fun one, which is PMADDWD. I think this originally stands for “packed multiply and add words to doublewords”. That makes it sound like it’s a multiply-add type operation, but really it’s more like a two-element dot product: in pseudocode, PMADDWD computes result.i32[i] = a.i16[i*2+0] * b.i16[i*2+0] + a.i16[i*2+1] * b.i16[i*2+1]. That is, it still does those same four signed 16×16-bit multiplies we’ve been doing for the other two instructions, but this time with a “use the whole pig” attitude where the full 32-bit results are most definitely not tossed out. If we can’t return the whole result in a 16-bit operation, well, just pair even and odd pairs of adjacent lanes together and sum across them. Because when we’re summing across pairs of adjacent lanes, we get 32 bits to return the result in, which is perfect (we don’t need to worry about overflow here because the two constituent products were signed; they can’t get too large).

Now, this description sounds like we’re still finishing computation of 32-bit results for each of the 16 bit lanes, and then doing a separate 32-bit addition after to combine the two. That’s a possible implementation, but not necessary; this is not a post about how multiplications are implemented (some other time, maybe!), but the gist of it is that multiplier hardware already breaks down N-bit by N-bit multiplies into many smaller multiplications (the “partial products”) of a N-bit number by a much smaller digit set. The obvious one would be N-bit-1 bit products, which leaves just “x*0” and “x*1” products, but in practice other options are much cheaper. The partial products are then summed together in a kind of reduction tree, and again, there’s slicker ways to do it than just throwing down a whole bunch of fast adders, but the details don’t matter here. What does matter is that you can have either of the even/odd 16-bit multipliers do their normal thing until very close to the end, and then do the “cross-over” and final 32-bit summation very late (again with plenty of hardware reuse compared with the 16-bit result paths).

In short, not only does PMADDWD let us use both 32-bit results that we already computed anyway fully, it also doesn’t touch the first 90% of the datapath at all and can be made to share plenty of logic with the regular path for the final 10% too if desired. Which is why it’s fun.

SSE

The headline item for SSE was SIMD floating point operations (not my subject today), but it also patched a hole in the original MMX design by adding PMULHUW (packed multiply high unsigned word). This one does the multiply unsigned and gives you the high word result. Once again, this is a minor change to the hardware.

SSE2

This one added 128-bit integer SIMD whereas MMX was just 64-bit. It did so, in its initial implementations, by adding 128-bit registers, but still used a 64-bit datapath and issuing instructions over two cycles. Not surprisingly, then, all the SSE2 integer multiply instructions (and in fact the vast majority of SSE/SSE2 instructions in general) can be understood as working on independent 64-bit blocks at a time. (AVX/AVX2 would later do the same thing with 128-bit blocks.)

It does however add the rather awkward-seeming PMULUDQ (packed multiply unsigned doubleword to quadword), which multiplies two pairs of unsigned 32-bit integers (in bits [31:0] and [95:64] of either source) to give two 64-bit results. And it does so with the same latency as our 16-bit multiplies! Is that a much wider multiplier at work?

Turns out, not necessarily! Let’s look at a single 32-bit product a * b, and split a = (a1 * 65536) + a0 and b = (b1 * 65536) + b0. 65536 is of course 216 and we’re really just chopping a and b in half. Multiplying that out, we get:

 a * b
=((a1 * b1) << 32) + ((a1 * b0 + a0 * b1) << 16) + (a0 * b0)

Wait a second. Those are two 16-bit by 16-bit multiplies (unsigned this time, but we added that capability back with the first SSE) and a PMADDWD-style operation (albeit also on unsigned values) in the middle. We do need four 16×16-bit multiplies total… but we’re producing a 64-bit result, so our result area covers four 16-bit lanes’ worth. So this time we do need a bit more logistics up front to route the individual pieces of a and b to four separate multipliers over our 64-bit result area to line up our individual products, and we also have a somewhat more complicated final output stage (what with the different alignments of the results) and actually need a mode in the multiplier where we run a full 64-bit add, not just 32-bit adds, to produce or results. In short, it does get more complicated, but we’re still getting to build it all around the 16×16-bit multipliers we’ve had since MMX.

SSSE3

SSSE3 adds two more multiply operations, both of which are variations on themes we’ve seen so far, but let’s start with the simple one first.

That would be PMULHRSW, Packed Multiply High Rounding Shifting Words (again, not the official name). It’s another signed 16×16 bit multiply. This one computes signed (not the official way it’s specified, but it’s equivalent) (a * b + 0x4000) >> 15. This requires a slight tweak to the reduction tree to in the multiplier to sum in one extra term somewhere that we can use for the rounding bias. Grabbing different output bits from the result is not a big deal.

The more complicated one is PMADDUBSW which is like PMADDWD but on pairs bytes not words, and to keep things interesting, it’s an unsigned*signed multiply. I think this might have been inspired by AltiVec (or maybe there’s a common ancestor I’m not aware of?) which had this type of operation in its “msum” family of instructions (alongside a PMADDWD equivalent and some other operating modes), but the AltiVec version is nicer because it’s a 4-element dot product on bytes producing a 32-bit result. PMADDUBSW produces a word result which, in turns out, does not quite work. The problem is that multiplying unsigned by signed bytes means the individual product terms are in range [-128*255, 128*255] = [-32640,32640]. Our result is supposed to be a signed word, which means its value range is [-32768,32767]. If the two individual products are either near the negative or positive end of the possible output range, the sum overflows. PMADDUBSW decides to saturate the result, i.e. clamp it to lie within [-32768,32767] instead. This is well-defined but frequently quite annoying to work with.

In terms of implementation, I’ve not actually worked out the details here. I will point out that one way of designing multipliers is to use a few levels of a recursive decomposition into smaller multipliers much as I just did with PMULUDQ; chopping the constituent 16-bit multipliers up into byte pieces would presumably work, although at this point, we don’t just have some extra muxing near the beginning or end of the datapath, we’re also starting to add a lot of constraints on the rest of the internal implementation (if we’re trying to share as much hardware as possible, that is). We’ve just about pushed this as far as we can go.

SSE4.1

SSE4.1 adds PMULDQ which is PMULUDQ, but signed. The same basic approach as PMULUDQ should work, so I’ll not get into it.

It also added the at that point long-awaited PMULLD, doubleword-by-doubleword low multiplies. To date, we have not gotten any high multiplies for them, not in SSE4.1 nor in any subsequent extension, and it seems unlikely at this point.

Curiously, with PMULLD, something’s different: these have half the throughput and twice the latency as all the other multiply-type operations on Intel’s big core designs, and take two uops whereas all the other multiply-type operations mentioned so far take one.

Once again, I think the divide-and-conquer approach described for PMULUDQ above explains both. Looking at the product terms:

 a * b
=((a1 * b1) << 32) + ((a1 * b0 + a0 * b1) << 16) + (a0 * b0)

We don’t care about the high product term for a1 * b1 since we’re only returning the low 32 bits anyway. But we do need the other three product terms, and per each 32-bit result lane, we only have two 16×16 multipliers to work with. My best guess is that the first uop is a PMADDWD-like affair that computes the (a1 * b0 + a0 * b1) portion and stashes the low 16 bits of the result, and the second uop gets issued immediately after and computes a regular a0 * b0 32-bit product in the low 16-bit lane then adds it together with the stashed product (shifted by 16, but that’s just wiring) – this is again a fairly minor variation of the logic that is used for PMADDWD. In short, I think a possible implementation of PMULLD on top of the 16-bit-multiplier-centric datapath that Intel seems to have been bolting more and more features onto for the past 25 years is using 2 uops that are slight variations of the PMADDWD flow, and it would be consistent with the somewhat odd characteristics of the instruction on Intel’s big cores.

Is any of this definitive?

Oh hell no it’s not. But it would make sense given how the whole thing seems to have developed, and it fits available data.

Would I design a new unit this way now if I had to support all these instructions? Probably not. The original MMX/SSE-era design doesn’t need to do a lot of operations and is pretty sweet, but the doubleword->quadword multiplies in SSE2 started to muddle the waters, SSSE3 with PMADDUBSW (which although useful is very odd) made the 16-bit slices have to support some pretty odd things that really break out of a conventional multiplier dataflow (like the internal saturation), and as of SSE4.1 with PMULLD, honestly, instead of teaching this old dog the umpteenth new trick, maybe just throw some actual 32×32->32 multipliers in there as well instead of adding complications. That seems to be what AMD has done, as well as Intel’s E-cores. But it’s still fun to speculate!

What about AVX-512 VPMULLQ, IFMA or VNNI?

The original VNNI is a pretty straightforward generalization of the PMADDWD/PMADDUBSW designs to include a final accumulation too, much like the aforementioned vmsum family of instructions in PowerPC. I have a suspicion this might be due to a combination of a) x86 SIMD datapaths post-AVX2 (which added FMAs) having native support for 3-operand instructions and b) quite possibly some AltiVec-era instruction patents having expired in the meantime. For a), the extra addition is not the hard part (as mentioned above, the extra term is very natural to sneak into any multiplier datapath), the actual cost is all in the extra wiring and bypassing to have a 3-operand datapath. But once it’s a sunk cost because you want float FMAs anyway, might as well use it! For b), I have no idea whether this is the case or not, it’s just funny to me that AltiVec had these integer dot product instructions from the standard while x86 took forever to add them (after people used PMADDUBSW with a follow-up PMADDWD by an all-1’s vector literally just to sum the pairs of words in a 32-bit lane together for something like a decade).

IFMA is a completely different story because even though it’s an integer multiply, it’s very clearly designed to use the double-precision multiplier in the floating-point datapath. Completely different multiplier with a different set of fun design constraints!

VPMULLQ, I have nothing to say about. Literally haven’t ever looked into, or tried to figure out, how it’s implemented. Might do so at some point, but not today!

And I think that’s all of them.