I’ve been writing a lot of math code with latency requirements these days. When I talk to people about my problems, they usually suggest multithreading and generalpurpose GPU computing.
These both have downsides. Multithreading might not be the best option in systems that already have a lot of concurrent workloads. And GPU computing adds the overhead of a roundtrip copy to GPU memory, which might end up increasing latency depending on the problem size.
Vectorization can offer a way out. Modern CPUs have several copies of key components like arithmetic logic units (ALUs), allowing them to execute multiple operations in parallel — even on the same CPU.^{1} Developers can take advantage of this functionality through special vector instructions.
For the rest of this article, I’ll use Clang 10 and Intel’s AVX2 extension in all examples. However, the same ideas apply to other compilers and architectures.
Here are the rules I’ve developed for vectorization:
Option 1: Make a library do it for you
If we use a library like Eigen, chances are it’s already
vectorizing code for us. We can write maintainable, highlevel code using
Eigen’s Matrix
type.
Let’s say we want to compute the Euclidean distance between two vectors:
#include <Eigen/Core>
using MyVector = Eigen::Matrix<float, 16, 1>;
float Distance(const MyVector& a, const MyVector& b) {
return (a  b).norm();
}
Eigen overloads the subtraction operator with template magic that automatically
uses vector routines when they’re enabled.^{2} On x64, passing
mavx2
tells the compiler that it’s safe to generate a binary that uses AVX2
vector instructions.^{3}
We can now check the assembly to see whether Eigen vectorized for us. If you’re not familiar with reading x64 assembly, that’s okay. All you need to know in this step is:
 Vector instruction names begin with the letter “v.”
 Vector registers begin with “xmm,” “ymm,” and “zmm.”
 Loops are implemented using branch (“b”) or jump (“j”) instructions.
Notice that the entire function uses vector instructions and doesn’t contain any loops despite working with 16length vectors.
Eigen even decided to use the vsqrtss
instruction instead of
calling the slower std::sqrt
, which has the appropriate error handling for
negative inputs.
Distance(Eigen::Matrix<float, 16, 1, 0, 16, 1> const&, Eigen::Matrix<float, 16, 1, 0, 16, 1> const&):
vmovaps ymm0, ymmword ptr [rdi]
vmovaps ymm1, ymmword ptr [rdi + 32]
vsubps ymm0, ymm0, ymmword ptr [rsi]
vmulps ymm0, ymm0, ymm0
vsubps ymm1, ymm1, ymmword ptr [rsi + 32]
vmulps ymm1, ymm1, ymm1
vaddps ymm0, ymm0, ymm1
vextractf128 xmm1, ymm0, 1
vaddps xmm0, xmm0, xmm1
vpermilpd xmm1, xmm0, 1
vaddps xmm0, xmm0, xmm1
vmovshdup xmm1, xmm0
vaddss xmm0, xmm0, xmm1
vsqrtss xmm0, xmm0, xmm0
vzeroupper
ret
Option 2: Make the compiler do it for you
Sometimes it’s not convenient to express an algorithm using an existing library, or maybe the code is small enough that it doesn’t justify the work to bring in another dependency.
In this case, we can write a normal loop and let Clang or GCC translate it into platformspecific vector instructions. The resulting code is extremely flexible, because it can be understood by anyone who writes C++, no library knowledge required. However, the compiler’s vectorizer isn’t as smart as library authors, and there are many situations where it doesn’t generate great code.
This code does computes the same distance, but with a handwritten loop:
#include <array>
#include <cmath>
#include <iostream>
#include <numeric>
float Distance(const float* a, const float* b) {
// Compute the squared differences between a and b.
std::array<float, 16> diffs;
for (int i = 0; i < 16; i++) {
const float diff = a[i]  b[i];
diffs[i] = diff * diff;
}
// Sum the squares.
float dist = std::accumulate(diffs.begin(), diffs.end(), 0.0f);
dist = std::sqrt(dist);
return dist;
}
We can again read the assembly to find out whether the compiler could vectorize our code. Clang 10 figured it out, mostly.
 It successfully vectorized the subtraction and multiplication to compute squared differences.
 Unfortunately, Clang wasn’t smart enough to use vectorize the summation of
the array of squares, generating an add (
vaddss
) instruction for each of the 16 items.  It still generates an unnecessary branch with
call sqrtf
for when the sum of squares is negative.
Distance(float const*, float const*):
sub rsp, 72
vmovups ymm0, ymmword ptr [rdi]
vsubps ymm0, ymm0, ymmword ptr [rsi]
vmulps ymm0, ymm0, ymm0
vmovups ymmword ptr [rsp + 8], ymm0
vmovups ymm0, ymmword ptr [rdi + 32]
vsubps ymm0, ymm0, ymmword ptr [rsi + 32]
vmulps ymm0, ymm0, ymm0
vmovups ymmword ptr [rsp + 40], ymm0
vxorps xmm1, xmm1, xmm1
vaddss xmm2, xmm1, dword ptr [rsp + 8]
vaddss xmm2, xmm2, dword ptr [rsp + 12]
vaddss xmm2, xmm2, dword ptr [rsp + 16]
vaddss xmm2, xmm2, dword ptr [rsp + 20]
vaddss xmm2, xmm2, dword ptr [rsp + 24]
vaddss xmm2, xmm2, dword ptr [rsp + 28]
vaddss xmm2, xmm2, dword ptr [rsp + 32]
vaddss xmm2, xmm2, dword ptr [rsp + 36]
vaddss xmm2, xmm2, dword ptr [rsp + 40]
vaddss xmm2, xmm2, dword ptr [rsp + 44]
vaddss xmm2, xmm2, dword ptr [rsp + 48]
vaddss xmm2, xmm2, dword ptr [rsp + 52]
vaddss xmm2, xmm2, dword ptr [rsp + 56]
vaddss xmm2, xmm2, dword ptr [rsp + 60]
vaddss xmm2, xmm2, dword ptr [rsp + 64]
vextractf128 xmm0, ymm0, 1
vpermilps xmm0, xmm0, 231
vaddss xmm0, xmm2, xmm0
vucomiss xmm0, xmm1
jb .LBB0_2
vsqrtss xmm0, xmm0, xmm0
add rsp, 72
vzeroupper
ret
.LBB0_2:
vzeroupper
call sqrtf
add rsp, 72
ret
Preventing autovectorizer regressions
One downside of autovectorization is that it’s not obvious to the reader that the compiler is generating vector instructions instead of the usual loop. We can consider using compiler extensions to ensure that our code stays vectorized.
In Clang, we write #pragma clang loop vectorize(enable)
^{4} above any
loop we want the compiler to vectorize. The pragma also emits a warning if the
code fails to vectorize. For example, someone might inadvertently prevent
vectorization by adding a print statement inside the loop. To avoid merging such
a change, we can configure our build system to treat this warning as an error.
LLVM’s autovectorization documentation has more information about the supported flags and use cases.
Designing data structures for easier vectorization
When we vectorized with Eigen, the library’s organization around its Matrix
type made vectorization easier. This is because all the values for each operand
are stored in contiguous memory.
Writing normal C++ code doesn’t force us to comply with this constraint. We
could have stored each pair values from a
and b
as members of a struct:
struct Input {
float a = 0.0f;
float b = 0.0f;
};
std::vector<Input> my_inputs = ...;
The vector my_inputs
stores the values interleaved: a, b, a, b, and so on.
These values will need to be deinterleaved when they are loaded into a vector
register so that we can have one register which only contains values from a
,
and another with values from b
.
In the best case, the compiler wastes a few instructions shuffling data around. In the worst case, the interleaved data prevents the compiler from vectorizing the code at all.
Option 3: Write vector code by hand
If all else fails, then we’re stuck with writing vector code by hand. This doesn’t mean writing assembly. Compilers offer builtin functions, most of which map directly to vector instructions.
The benefit of this approach is finegrained control: for example, we can choose faster math instructions if our algorithm can tolerate an approximate answer. This comes at the cost of flexibility. We must reimplement the algorithm for each target architecture. We also need to rewrite if we want the performance improvements offered by future instruction set updates.
The process usually starts with opening the Intel Intrinsics Guide to find the relevant functions. We can filter only instructions available on our target processor, then use search to find the ones (subtraction, multiplication, and square root) needed to implement our algorithm.
This handwritten code generates the same assembly as the first example:
#include <immintrin.h>
float Distance(const float* a, const float* b) {
// Load the 16length inputs into two 8length registers each.
__m256 a1 = _mm256_load_ps(a);
__m256 a2 = _mm256_load_ps(a + 8);
__m256 b1 = _mm256_load_ps(b);
__m256 b2 = _mm256_load_ps(b + 8);
// Take the square of difference of the first 8 items.
__m256 diff1 = _mm256_sub_ps(a1, b1);
diff1 = _mm256_mul_ps(diff1, diff1);
// Take the square of difference of the second 8 items.
__m256 diff2 = _mm256_sub_ps(a2, b2);
diff2 = _mm256_mul_ps(diff2, diff2);
// Sum the first and second differences.
__m256 sum_diffs = _mm256_add_ps(diff1, diff2);
// Split 8length register into two 4length registers.
__m128 upper = _mm256_extractf128_ps(sum_diffs, 1);
__m128 lower = _mm256_extractf128_ps(sum_diffs, 0);
// Add the upper and lower sums.
__m128 sum1 = _mm_add_ps(lower, upper);
// Swap the upper and lower two items.
__m128 sum2 = _mm_permute_pd(sum1, 1);
// Add the swapped vectors.
sum1 = _mm_add_ps(sum1, sum2);
// Move the 2nd item into the 1st position.
sum2 = _mm_movehdup_ps(sum1);
// Add the swapped vectors.
sum1 = _mm_add_ps(sum1, sum2);
// Take square root.
sum1 = _mm_sqrt_ps(sum1);
// Extract the result to a scalar array.
float result[4];
_mm_storeu_ps(&result[0], sum1);
return result[0];
}
It’s a good idea to keep the scalar implementation around so that a unit test can that we didn’t introduce any bugs from manual vectorization.
We can also consider a portable SIMD wrapper library like xsimd or MIPP. These libraries provide their own vector types, which the library maps to architecturespecific intrinsics. Keep in mind that libraries might not expose obscure operations that are only available on one architecture, so the portability can have a performance cost.
Conclusion
There are many ways to use vector instructions on modern processors. To write the most maintainable code possible for an application, try the options in this order:
 Use an existing library, like Eigen or OpenCV. The code is extremely readable and fast, if the algorithm can be written in terms of the library’s highlevel operations.
 Use the compiler’s autovectorizer. The code is mostly readable and mostly fast, though complicated operations can confuse the compiler.
 Write the vector code by hand. The code is tedious to write and maintain, but can express exactly what we want.

Superscalar processor, Wikipedia. ↩

For more information on how Eigen is implemented, check out What happens inside Eigen, on a simple example. ↩

See Eigen Vectorization FAQ for details on how to enable vectorization. ↩