Speeding up code with vectorization

Take advantage of vector instructions, the easy way

17 Apr 2020

I’ve been writing a lot of math code with latency requirements these days. When I talk to people about my problems, they usually suggest multithreading and general-purpose GPU computing.

These both have downsides. Multithreading might not be the best option in systems that already have a lot of concurrent workloads. And GPU computing adds the overhead of a round-trip copy to GPU memory, which might end up increasing latency depending on the problem size.

Vectorization can offer a way out. Modern CPUs have several copies of key components like arithmetic logic units (ALUs), allowing them to execute multiple operations in parallel — even on the same CPU.¹ Developers can take advantage of this functionality through special vector instructions.

For the rest of this article, I’ll use Clang 10 and Intel’s AVX2 extension in all examples. However, the same ideas apply to other compilers and architectures.

Here are the rules I’ve developed for vectorization:

Option 1: Make a library do it for you

If we use a library like Eigen, chances are it’s already vectorizing code for us. We can write maintainable, high-level code using Eigen’s Matrix type.

Let’s say we want to compute the Euclidean distance between two vectors:

#include <Eigen/Core>

using MyVector = Eigen::Matrix<float, 16, 1>;

float Distance(const MyVector& a, const MyVector& b) {
  return (a - b).norm();
}

Eigen overloads the subtraction operator with template magic that automatically uses vector routines when they’re enabled.² On x64, passing -mavx2 tells the compiler that it’s safe to generate a binary that uses AVX-2 vector instructions.³

We can now check the assembly to see whether Eigen vectorized for us. If you’re not familiar with reading x64 assembly, that’s okay. All you need to know in this step is:

Vector instruction names begin with the letter “v.”
Vector registers begin with “xmm,” “ymm,” and “zmm.”
Loops are implemented using branch (“b”) or jump (“j”) instructions.

Notice that the entire function uses vector instructions and doesn’t contain any loops despite working with 16-length vectors.

Eigen even decided to use the vsqrtss instruction instead of calling the slower std::sqrt, which has the appropriate error handling for negative inputs.

Distance(Eigen::Matrix<float, 16, 1, 0, 16, 1> const&, Eigen::Matrix<float, 16, 1, 0, 16, 1> const&):
  vmovaps ymm0, ymmword ptr [rdi]
  vmovaps ymm1, ymmword ptr [rdi + 32]
  vsubps ymm0, ymm0, ymmword ptr [rsi]
  vmulps ymm0, ymm0, ymm0
  vsubps ymm1, ymm1, ymmword ptr [rsi + 32]
  vmulps ymm1, ymm1, ymm1
  vaddps ymm0, ymm0, ymm1
  vextractf128 xmm1, ymm0, 1
  vaddps xmm0, xmm0, xmm1
  vpermilpd xmm1, xmm0, 1
  vaddps xmm0, xmm0, xmm1
  vmovshdup xmm1, xmm0
  vaddss xmm0, xmm0, xmm1
  vsqrtss xmm0, xmm0, xmm0
  vzeroupper
  ret

View on Compiler Explorer

Option 2: Make the compiler do it for you

Sometimes it’s not convenient to express an algorithm using an existing library, or maybe the code is small enough that it doesn’t justify the work to bring in another dependency.

In this case, we can write a normal loop and let Clang or GCC translate it into platform-specific vector instructions. The resulting code is extremely flexible, because it can be understood by anyone who writes C++, no library knowledge required. However, the compiler’s vectorizer isn’t as smart as library authors, and there are many situations where it doesn’t generate great code.

This code does computes the same distance, but with a handwritten loop:

#include <array>
#include <cmath>
#include <iostream>
#include <numeric>

float Distance(const float* a, const float* b) {
  // Compute the squared differences between a and b.
  std::array<float, 16> diffs;
  for (int i = 0; i < 16; i++) {
    const float diff = a[i] - b[i];
    diffs[i] = diff * diff;
  }
  // Sum the squares.
  float dist = std::accumulate(diffs.begin(), diffs.end(), 0.0f);
  dist = std::sqrt(dist);
  return dist;
}

We can again read the assembly to find out whether the compiler could vectorize our code. Clang 10 figured it out, mostly.

It successfully vectorized the subtraction and multiplication to compute squared differences.
Unfortunately, Clang wasn’t smart enough to use vectorize the summation of the array of squares, generating an add (vaddss) instruction for each of the 16 items.
It still generates an unnecessary branch with call sqrtf for when the sum of squares is negative.

Distance(float const*, float const*):
  sub rsp, 72
  vmovups ymm0, ymmword ptr [rdi]
  vsubps ymm0, ymm0, ymmword ptr [rsi]
  vmulps ymm0, ymm0, ymm0
  vmovups ymmword ptr [rsp + 8], ymm0
  vmovups ymm0, ymmword ptr [rdi + 32]
  vsubps ymm0, ymm0, ymmword ptr [rsi + 32]
  vmulps ymm0, ymm0, ymm0
  vmovups ymmword ptr [rsp + 40], ymm0
  vxorps xmm1, xmm1, xmm1
  vaddss xmm2, xmm1, dword ptr [rsp + 8]
  vaddss xmm2, xmm2, dword ptr [rsp + 12]
  vaddss xmm2, xmm2, dword ptr [rsp + 16]
  vaddss xmm2, xmm2, dword ptr [rsp + 20]
  vaddss xmm2, xmm2, dword ptr [rsp + 24]
  vaddss xmm2, xmm2, dword ptr [rsp + 28]
  vaddss xmm2, xmm2, dword ptr [rsp + 32]
  vaddss xmm2, xmm2, dword ptr [rsp + 36]
  vaddss xmm2, xmm2, dword ptr [rsp + 40]
  vaddss xmm2, xmm2, dword ptr [rsp + 44]
  vaddss xmm2, xmm2, dword ptr [rsp + 48]
  vaddss xmm2, xmm2, dword ptr [rsp + 52]
  vaddss xmm2, xmm2, dword ptr [rsp + 56]
  vaddss xmm2, xmm2, dword ptr [rsp + 60]
  vaddss xmm2, xmm2, dword ptr [rsp + 64]
  vextractf128 xmm0, ymm0, 1
  vpermilps xmm0, xmm0, 231
  vaddss xmm0, xmm2, xmm0
  vucomiss xmm0, xmm1
  jb .LBB0_2
  vsqrtss xmm0, xmm0, xmm0
  add rsp, 72
  vzeroupper
  ret
.LBB0_2:
  vzeroupper
  call sqrtf
  add rsp, 72
  ret

View on Compiler Explorer

Preventing auto-vectorizer regressions

One downside of auto-vectorization is that it’s not obvious to the reader that the compiler is generating vector instructions instead of the usual loop. We can consider using compiler extensions to ensure that our code stays vectorized.

In Clang, we write #pragma clang loop vectorize(enable)⁴ above any loop we want the compiler to vectorize. The pragma also emits a warning if the code fails to vectorize. For example, someone might inadvertently prevent vectorization by adding a print statement inside the loop. To avoid merging such a change, we can configure our build system to treat this warning as an error.

LLVM’s auto-vectorization documentation has more information about the supported flags and use cases.

Designing data structures for easier vectorization

When we vectorized with Eigen, the library’s organization around its Matrix type made vectorization easier. This is because all the values for each operand are stored in contiguous memory.

Writing normal C++ code doesn’t force us to comply with this constraint. We could have stored each pair values from a and b as members of a struct:

struct Input {
  float a = 0.0f;
  float b = 0.0f;
};
std::vector<Input> my_inputs = ...;

View on Compiler Explorer

The vector my_inputs stores the values interleaved: a, b, a, b, and so on. These values will need to be deinterleaved when they are loaded into a vector register so that we can have one register which only contains values from a, and another with values from b.

In the best case, the compiler wastes a few instructions shuffling data around. In the worst case, the interleaved data prevents the compiler from vectorizing the code at all.

Option 3: Write vector code by hand

If all else fails, then we’re stuck with writing vector code by hand. This doesn’t mean writing assembly. Compilers offer built-in functions, most of which map directly to vector instructions.

The benefit of this approach is fine-grained control: for example, we can choose faster math instructions if our algorithm can tolerate an approximate answer. This comes at the cost of flexibility. We must reimplement the algorithm for each target architecture. We also need to rewrite if we want the performance improvements offered by future instruction set updates.

The process usually starts with opening the Intel Intrinsics Guide to find the relevant functions. We can filter only instructions available on our target processor, then use search to find the ones (subtraction, multiplication, and square root) needed to implement our algorithm.

This handwritten code generates the same assembly as the first example:

#include <immintrin.h>

float Distance(const float* a, const float* b) {
  // Load the 16-length inputs into two 8-length registers each.
  __m256 a1 = _mm256_load_ps(a);
  __m256 a2 = _mm256_load_ps(a + 8);
  __m256 b1 = _mm256_load_ps(b);
  __m256 b2 = _mm256_load_ps(b + 8);

  // Take the square of difference of the first 8 items.
  __m256 diff1 = _mm256_sub_ps(a1, b1);
  diff1 = _mm256_mul_ps(diff1, diff1);
  // Take the square of difference of the second 8 items.
  __m256 diff2 = _mm256_sub_ps(a2, b2);
  diff2 = _mm256_mul_ps(diff2, diff2);

  // Sum the first and second differences.
  __m256 sum_diffs = _mm256_add_ps(diff1, diff2);

  // Split 8-length register into two 4-length registers.
  __m128 upper = _mm256_extractf128_ps(sum_diffs, 1);
  __m128 lower = _mm256_extractf128_ps(sum_diffs, 0);
  // Add the upper and lower sums.
  __m128 sum1 = _mm_add_ps(lower, upper);

  // Swap the upper and lower two items.
  __m128 sum2 = _mm_permute_pd(sum1, 1);
  // Add the swapped vectors.
  sum1 = _mm_add_ps(sum1, sum2);

  // Move the 2nd item into the 1st position.
  sum2 = _mm_movehdup_ps(sum1);
  // Add the swapped vectors.
  sum1 = _mm_add_ps(sum1, sum2);

  // Take square root.
  sum1 = _mm_sqrt_ps(sum1);

  // Extract the result to a scalar array.
  float result[4];
  _mm_storeu_ps(&result[0], sum1);
  return result[0];
}

View on Compiler Explorer

It’s a good idea to keep the scalar implementation around so that a unit test can that we didn’t introduce any bugs from manual vectorization.

We can also consider a portable SIMD wrapper library like xsimd or MIPP. These libraries provide their own vector types, which the library maps to architecture-specific intrinsics. Keep in mind that libraries might not expose obscure operations that are only available on one architecture, so the portability can have a performance cost.

Conclusion

There are many ways to use vector instructions on modern processors. To write the most maintainable code possible for an application, try the options in this order:

Use an existing library, like Eigen or OpenCV. The code is extremely readable and fast, if the algorithm can be written in terms of the library’s high-level operations.
Use the compiler’s auto-vectorizer. The code is mostly readable and mostly fast, though complicated operations can confuse the compiler.
Write the vector code by hand. The code is tedious to write and maintain, but can express exactly what we want.

Superscalar processor, Wikipedia. ↩
For more information on how Eigen is implemented, check out What happens inside Eigen, on a simple example. ↩
See Eigen Vectorization FAQ for details on how to enable vectorization. ↩
Auto-Vectorization in LLVM. ↩

Thanks for reading! If you’re enjoying my writing, I’d love to send you infrequent notifications for new posts via my newsletter. You’ll receive the full text of each post, plus occasional bonus content.

You can also follow me on Twitter (@kevinchen) or subscribe via RSS.