2023-06-19

SIMD in C++

I recently saw a really good conference talk about SIMD in C++ by Jeff Garland. The talk covered a variety of topics around SIMD but what I thought was really interesting was the demonstration of what your average C++ compiler is capable of without writing any SIMD intrinsics. In other words, what can you get for free?

What is SIMD

Single instruction multiple data (SIMD) refers to the vectorisation of an operation. That is, performing some operation over a vector of operands simultaneously as opposed to just one operand at a time. A form of parallel computing.

Take the two arrays of numbers below:

A: [1, 2, 3, 4, 5, 6, 7, 8]
B: [8, 7, 6, 5, 4, 3, 2, 1]

We would like to take each number in array A and the corresponding number (by position) in array B, multiply the numbers together, and put the result into another array C.

Each line above corresponds to a set of instructions to the processor. At the end, array C should look like this:

C: [8, 14, 18, 20, 20, 18, 14, 8]

But what if we could take the numbers in both arrays together and multiply them all in one shot, without having to go through each pair of numbers, one by one? In theory¹, such an operation would be faster as you would only pay the overhead that comes with the operation once.

[1, 2, 3, 4, 5, 6, 7, 8]
*
[8, 7, 6, 5, 4, 3, 2, 1]
=
[8, 14, 18, 20, 20, 18, 14, 8]

So what can the compiler do?

It’s helpful first to understand what the compiler cannot do.

The code below implements the multiplication of two arrays as described above, and populates a third result array. The key thing to note is that mult() is referring to c-style arrays:

void mult(int* a, int* b, int* result) {
    for (int i = 0; i < 8; ++i) {
        result[i] = a[i] * b[i];
    }
}

int main() { 
    int arr1[] = {1,2,3,4,5,6,7,8};
    int arr2[] = {8,7,6,5,4,3,2,1};
    int result[] = {0,0,0,0,0,0,0,0};
    
    mult(arr1, arr2, result);
 }

I used x86-64 gcc 12.2 to compile, which results in the following assembly²:

mult(int*, int*, int*):
       xor     eax, eax
.L2:
       mov     ecx, DWORD PTR [rdi+rax]
       imul    ecx, DWORD PTR [rsi+rax]
       mov     DWORD PTR [rdx+rax], ecx
       add     rax, 4
       cmp     rax, 32
       jne     .L2
       ret

It’s clear that this operation has not been vectorised by the compiler. You can tell because the jne (conditional jump) command seems to take us back to the loop .L2 until the loop condition has been satisfied. Additionally, the multiplication is being performed by imul which is for two operands.³

We can try again but this time with std:array:

#include <array>

using arr8 = std::array<int, 8>;

void mult(const arr8& a, const arr8& b, arr8& result) {
    for (int i = 0; i < 8; ++i) {
        result[i] = a[i] * b[i];
    }
}

int main() { 
    arr8 arr1 = {1,2,3,4,5,6,7,8};
    arr8 arr2 = {8,7,6,5,4,3,2,1};
    arr8 result;
    
    mult(arr1, arr2, result);
 }

mult(std::array<int, 8ul> const&, std::array<int, 8ul> const&, std::array<int, 8ul>&):
        xor     eax, eax
.L2:
        mov     ecx, DWORD PTR [rdi+rax]
        imul    ecx, DWORD PTR [rsi+rax]
        mov     DWORD PTR [rdx+rax], ecx
        add     rax, 4
        cmp     rax, 32
        jne     .L2
        ret

Still no vectorisation! Let’s try one more time but this time we will initialise the std:array that is returned inside mult(), thereby eliminating any non-const parameters to the function:

#include <array>

using arr8 = std::array<int, 8>;

arr8 mult(const arr8& a, const arr8& b) {
    arr8 result;
    for (int i = 0; i < 8; ++i) {
        result[i] = a[i] * b[i];
    }
    return result;
}

int main() { 
    arr8 arr1 = {1,2,3,4,5,6,7,8};
    arr8 arr2 = {8,7,6,5,4,3,2,1};
    
    arr8 result = mult(arr1, arr2);
 }

mult(std::array<int, 8ul> const&, std::array<int, 8ul> const&):
        vmovdqu ymm0, YMMWORD PTR [rdx]
        mov     rax, rdi
        vpmulld ymm0, ymm0, YMMWORD PTR [rsi]
        vmovdqu YMMWORD PTR [rdi], ymm0
        vzeroupper
        ret

Success! vpmulld here is the elusive vector multiplication instruction that we have been trying to tease out of the compiler. A more complete description of this command can be found here:

Performs a SIMD signed multiply of the packed signed dword/qword integers from each element of the first source operand with the corresponding element in the second source operand.

The best things in life aren’t necessarily free

It’s clear that the C++ compiler is able to figure out when operations in your code can be vectorised using SIMD instructions in certain narrow situations. However, it’s not something that you’ll likely see magically applied across your projects. For SIMD optimisations to be applied by the compiler, your code must satisfy various properties such that the compiler can apply those optimisations safely.

If you would prefer to be more intentional about your use of SIMD then there are various developer libraries available:

As for the C++ standard, std:simd is due in C++26, so don’t hold your breath.

In practice, you should be profiling your application as a whole with real-world data and workloads to verify that your optimisations actually work. Sometimes they may not help, other times they could make it worse! ↩︎
I have omitted parts of the assembly output which I feel are not necessary to show here. ↩︎
I’m not an x86 assembly expert by any means, or a C++ expert for that matter. ↩︎

trivially constructible

SIMD in C++

What is SIMD

So what can the compiler do?

The best things in life aren’t necessarily free