vectorization and parallelization of 2D convolutions

a convolution is defined as an integral that expresses the amount of overlap between two signals, represented by functions of time \(f(t)\) and \(g(t)\), as one is reflected and shifted over the other. It is usually denoted by the convolution operator \(f * g\) :

\[ (f * g)(t) = \int_{-\infty}^{\infty} f(t-\tau)g(\tau) d\tau \]

where \(f(t)\) is a one-dimensional signal and \(g(t)\) is the response of a filter. As \(t\) changes, we calculate the area under the function \(f(t-\tau)\) weighted by \(g(\tau)\). The negation of \(\tau\) rotates \(f\) 180\(^\circ\) about the origin, and \(t\) shifts the origin of \(f\) along the \(\tau\)-axis. The reflection and shift can be applied to either signal, because \( (f*g)(t) = (g*f)(t) \) :

\[ \int_{-\infty}^{\infty} f(t-\tau)g(\tau) d\tau = \int_{-\infty}^{\infty} f(\tau)g(t-\tau) d\tau \]
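the identity can be verified with the substitution \(u = t - \tau\), under which \(d\tau = -du\) and the limits of integration flip :

\[ \int_{-\infty}^{\infty} f(t-\tau)g(\tau) d\tau = \int_{\infty}^{-\infty} f(u)g(t-u) (-du) = \int_{-\infty}^{\infty} f(u)g(t-u) du \]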

in two-dimensional image processing, the continuous convolution integral can be expressed as a function over a two-dimensional spatial domain with coordinates \(x\) and \(y\) :

\[ (f * * g)(x, y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x-\tau_i, y-\tau_j)g(\tau_i, \tau_j) d\tau_i d\tau_j \]

storing an image digitally imposes some restrictions due to a limited amount of resources. The continuous signals are sampled and quantised over a finite range to discrete values represented as tensors. The operation becomes a finite sum bounded by the filter size \(K\) :

\[ (f**g)(x, y) \equiv \sum_{\tau_i=0}^{K-1} \sum_{\tau_j=0}^{K-1} f_{x-\tau_i, y-\tau_j} \cdot g_{\tau_i, \tau_j} \]

where \(f\) is the input image \(f \in \mathbb{R}^{M \times N}\) and \(g\) is the filter \(g \in \mathbb{R}^{K \times K}\), an array of floats that serve as the trainable parameters of the network, commonly initialized from a random distribution and trained using stochastic gradient descent.

the discrete value of the signal at some index \(f_{i, j}\) is usually stored in memory as a vector of three integers \( [r, g, b]^{\intercal}\) where each takes a value \( 0 \le r,g,b \lt 2^x \), where \(x\) is the number of bits allocated to store the integer. We can represent this as a third dimension, the channel dimension, denoted \(C\). Most networks also extend the filter into multiple channels in order to extract a higher number of features.

\[ f_{C \times M \times N} = \begin{bmatrix} R_{M \times N} \\ G_{M \times N} \\ B_{M \times N} \end{bmatrix}, \quad R_{M \times N} = \begin{bmatrix} r_{0, 0} & r_{0, 1} & \ldots & r_{0, N} \\ r_{1, 0} & r_{1, 1} & \ldots & r_{1, N} \\ \vdots & \vdots & \ddots & \vdots \\ r_{M, 0} & r_{M, 1} & \ldots & r_{M, N} \end{bmatrix}\]

it is also common to take as input a batch of multiple images denoted as \(B\) as opposed to a single image.

thus, \(f\) becomes a tensor of rank 4, \(f \in \mathbb{Z}^{B \times C \times M \times N}\), representing a batch of \(B\) images, each with \(C\) channels of size \(M \times N\). This yields a filter \(g \in \mathbb{R}^{1 \times C \times K \times K}\) where \(B=1\) because we intend to use the same filter on all images in the batch.
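as a concrete sketch, when such a rank-4 tensor is stored linearly in row-major order (as the C code below assumes), the offset of element \((b, c, i, j)\) follows from nesting the four dimensions; `idx4` is a hypothetical helper name:

```c
// linear offset of element (b, c, i, j) in a row-major B x C x M x N tensor
static inline int idx4(int b, int c, int i, int j, int C, int M, int N) {
    return ((b * C + c) * M + i) * N + j;
}
```

for the MNIST-shaped batch used later (\(C=1\), \(M=N=28\)), moving one row down adds 28 to the offset and moving to the next image in the batch adds 784.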

we will denote the output tensor as \(\gamma \in \mathbb{R}^{B \times C \times M_{out} \times N_{out}}\).

\[ \gamma_{b, c, x, y} = \sum^{K-1}_{\tau_i=0}\sum^{K-1}_{\tau_j=0} f_{b, c, x-\tau_i, y-\tau_j} \cdot g_{c, \tau_i, \tau_j}\]

convolution vs cross-correlation

a convolution integral that doesn't reflect the shifted signal is known as a cross-correlation, denoted as \((f \star g)(t)\) :

\[ (f \star g)(t) := \int_{-\infty}^{\infty} f(\tau)g(\tau+t) d\tau \]

this operation is also known as a sliding dot product. You will often find it defined using the complex conjugate of \(f\) to account for complex signals :

\[ (f \star g)(t) := \int_{-\infty}^{\infty} \overline{f(\tau)}g(\tau+t) d\tau\]

if the shifted signal is symmetric, cross-correlation and convolution become the same operation. We will be shifting the filter tensor \(g\), which is randomly initialized from a distribution symmetric around zero and then learned. Whether it is applied reflected or not makes no difference to what the network can learn, since the weights would simply converge to the reflected filter. We can, therefore, remove the complexity of having to reflect it.

when people refer to convolutions in the context of convolutional neural networks, they often refer to cross-correlation.
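to make the relationship concrete, here is a small 1D sketch showing that convolving with a filter is the same as cross-correlating with that filter flipped 180\(^\circ\); `conv1d` and `xcorr1d` are hypothetical helper names, evaluated only in the valid region:

```c
// discrete 1D cross-correlation at position t: g slides without flipping
static float xcorr1d(const float* f, const float* g, int K, int t) {
    float acc = 0.0f;
    for (int i = 0; i < K; i++)
        acc += f[t + i] * g[i];
    return acc;
}

// discrete 1D convolution at position t: g is reflected before sliding
static float conv1d(const float* f, const float* g, int K, int t) {
    float acc = 0.0f;
    for (int i = 0; i < K; i++)
        acc += f[t + i] * g[K - 1 - i];
    return acc;
}
```

convolving with `g = {1, 0, -1}` gives the same outputs as cross-correlating with the reversed `{-1, 0, 1}`.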

also note that, unlike convolution, cross-correlation is not commutative : \( (f \star g)(t) \neq (g \star f)(t) \)


all of the examples are running on :

we'll consider a batch of 512 images from the MNIST dataset, stored linearly in memory as an array of floats with a height and width of \(M=N=28\) and one luminosity channel.

\[ f \in \mathbb{R}^{512 \times 1 \times 28 \times 28}, \quad 0 \le f_{b,c,i,j} \lt 256\]

the filter will have 32 channels and a height and width of \(K=5\). It is initialized from a kaiming uniform distribution :

\[ g \in \mathbb{R}^{1 \times 32 \times 5 \times 5} \sim U \begin{bmatrix} -\frac{\sqrt{6}}{\sqrt{K + K + 1}}, \frac{\sqrt{6}}{\sqrt{K+K+1}} \end{bmatrix} \]


a naive implementation would be :

  void naive_conv2d(const float* f, const float* g, float* out,
                    const int K, const int N, const int M, const int B)
  {
    const int n_out = (N-K)/1+1; // (img_size-filter_size) / stride + 1
    const int m_out = (M-K)/1+1;
    const int bo = n_out*m_out;  // batch stride out
    const int bi = N*M;          // batch stride in
    int b, n, m, i, j;

    // out is assumed to be zero-initialized, since we accumulate with +=
    for(b=0; b < B; b++)
      for(n=0; n < n_out; n++)
        for(m=0; m < m_out; m++)
          for(i=0; i < K; i++)
            for(j=0; j < K; j++)
              out[b*bo+n*m_out+m] += f[b*bi+(n+i)*N+j+m] * g[i*K+j];
  }

called on heap-allocated, unaligned memory.
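a note on that allocation: because the kernel accumulates with +=, the output buffer must start zeroed. a minimal sketch with hypothetical helper names, sized for the MNIST batch used above, where calloc gives us zeroed, heap-allocated (and not necessarily 32-byte aligned) memory:

```c
#include <stdlib.h>

// output dimension for a valid convolution with stride 1
static int out_dim(int in, int K) { return (in - K) / 1 + 1; }

// zeroed output buffer for a batch of B single-channel N x M images
static float* alloc_output(int B, int N, int M, int K) {
    int n_out = out_dim(N, K), m_out = out_dim(M, K);
    return calloc((size_t)B * n_out * m_out, sizeof(float));
}
```

for \(N=M=28\) and \(K=5\) this yields 24x24 outputs per image.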

because of the high number of floating point operations, when compiled, the program makes use of SIMD vector instructions instead of scalar x87 ones. The CPU supports AVX2 and contains 16 256-bit ymm logical registers per core. The program first tries to fill as many of these registers as possible with broadcasted (vbroadcastss) values of the filter g, addressed through the general purpose register r12. Because there are 25 distinct filter values, it stores the overflow in stack variables between 392[rsp] and 104[rsp]:

    [ . . . ]
    vbroadcastss  ymm0, DWORD PTR 60[r12]
    lea rax, 16[r13]
    vbroadcastss  ymm1, DWORD PTR 56[r12]
    lea r8, 4[r13]
    mov QWORD PTR -32[rsp], rax
    lea rax, 116[r13]
    vbroadcastss  ymm15, DWORD PTR [r12]
    vbroadcastss  ymm14, DWORD PTR 4[r12]
    mov QWORD PTR -24[rsp], rax
    lea rax, 120[r13]
    vbroadcastss  ymm13, DWORD PTR 8[r12]
    lea rdi, 8[r13]
    mov QWORD PTR -16[rsp], rax
    lea rax, 124[r13]
    vbroadcastss  ymm12, DWORD PTR 12[r12]
    lea rsi, 12[r13]
    mov QWORD PTR -8[rsp], rax
    lea rax, 128[r13]
    vbroadcastss  ymm11, DWORD PTR 16[r12]
    lea rcx, 112[r13]
    mov QWORD PTR [rsp], rax
    lea rax, 224[r13]
    vbroadcastss  ymm10, DWORD PTR 20[r12]
    lea rbx, 456[r13]
    mov QWORD PTR 8[rsp], rax
    lea rax, 228[r13]
    vbroadcastss  ymm9, DWORD PTR 24[r12]
    lea r11, 460[r13]
    mov QWORD PTR 16[rsp], rax
    lea rax, 232[r13]
    vbroadcastss  ymm8, DWORD PTR 28[r12]
    lea r10, 464[r13]
    vmovaps YMMWORD PTR 392[rsp], ymm0
    vbroadcastss  ymm0, DWORD PTR 64[r12]
    mov QWORD PTR 24[rsp], rax
    lea rax, 236[r13]
    mov QWORD PTR 32[rsp], rax
    lea rax, 240[r13]
    mov QWORD PTR 40[rsp], rax
    lea rax, 336[r13]
    vbroadcastss  ymm7, DWORD PTR 32[r12]
    mov QWORD PTR 48[rsp], rax
    lea rax, 340[r13]
    vbroadcastss  ymm6, DWORD PTR 36[r12]
    mov QWORD PTR 56[rsp], rax
    lea rax, 344[r13]
    vbroadcastss  ymm5, DWORD PTR 40[r12]
    vbroadcastss  ymm4, DWORD PTR 44[r12]
    mov QWORD PTR 64[rsp], rax
    lea rax, 348[r13]
    vbroadcastss  ymm3, DWORD PTR 48[r12]
    vbroadcastss  ymm2, DWORD PTR 52[r12]
    vmovaps YMMWORD PTR 360[rsp], ymm0
    vbroadcastss  ymm0, DWORD PTR 68[r12]
    mov QWORD PTR 72[rsp], rax
    lea rax, 352[r13]
    mov QWORD PTR 80[rsp], rax
    lea rax, 448[r13]
    mov QWORD PTR -120[rsp], r14
    mov r14, r12
    mov QWORD PTR 88[rsp], rax
    lea rax, 452[r13]
    vmovaps YMMWORD PTR -88[rsp], ymm1
    mov QWORD PTR 96[rsp], rax
    xor eax, eax
    vmovaps YMMWORD PTR 328[rsp], ymm0
    vbroadcastss  ymm0, DWORD PTR 72[r12]
    vmovaps YMMWORD PTR 296[rsp], ymm0
    vbroadcastss  ymm0, DWORD PTR 76[r12]
    vmovaps YMMWORD PTR 264[rsp], ymm0
    vbroadcastss  ymm0, DWORD PTR 80[r12]
    vmovaps YMMWORD PTR 232[rsp], ymm0
    vbroadcastss  ymm0, DWORD PTR 84[r12]
    vmovaps YMMWORD PTR 200[rsp], ymm0
    vbroadcastss  ymm0, DWORD PTR 88[r12]
    vmovaps YMMWORD PTR 168[rsp], ymm0
    vbroadcastss  ymm0, DWORD PTR 92[r12]
    vmovaps YMMWORD PTR 136[rsp], ymm0
    vbroadcastss  ymm0, DWORD PTR 96[r12]
    mov r12, r15
    mov r15, r14
    mov r14, r9
    vmovaps YMMWORD PTR 104[rsp], ymm0

an example store relative to the rsp stack pointer would be:

    vbroadcastss  ymm0, DWORD PTR 72[r12]
    vmovaps YMMWORD PTR 296[rsp], ymm0 

we end up with 14 registers (ymm2 through ymm15) filled with broadcasted values of g and 11 stack variables of 32 bytes each that we can easily load. Any variable stored on the stack can be assumed to be in L1D cache. Intermittently, the program also stores pointers at different offsets into f, relative to the register r13. For example :

    lea rax, 352[r13]
    mov QWORD PTR 80[rsp], rax
    lea rax, 448[r13] 
    [ . . . ]
    mov QWORD PTR 88[rsp], rax 

lea, or load effective address, computes the address of the second operand and stores it in the first operand. It works like &variable in C. We store 17 pointers of 8 bytes each. These offsets represent almost all of the read addresses of f for a single filter position. To understand the values, we first have to look at how the compiler decided to vectorize the operation.

if we were to lay out 8 consecutive horizontal positions of the filter with a stride of 1, we would notice that the first 8 consecutive array elements of f each correspond to the same filter element across those 8 individual filter positions:

we'll designate a ymm register to accumulate into. We then load the first 8 values from f, multiply them by the first element of g and add them to the accumulator. Next, we stride by one element (4 bytes) and repeat \(K\) times. By this point we have read 12 distinct values from f. We then jump to the beginning of the next matrix row and repeat for all \(K\) rows. We end up computing 8 consecutive convolutions in parallel.

or, in pseudocode :

 \(i \leftarrow 0\)
 ymm0 \(\leftarrow\) broadcast 0
 while \(i < K\):
  \(j \leftarrow 0\)
  while \(j < K\):
   ymm1 \(\leftarrow\) \(f + j\), read 32 bytes
   ymm2 \(\leftarrow\) \(g + j\), broadcast 4 bytes to 32 bytes
   ymm0 \(\leftarrow\) (ymm1 * ymm2) + ymm0
   \(j \leftarrow j + 1\)
  \(f \leftarrow f + N*4\) bytes (112, the next image row)
  \(g \leftarrow g + K*4\) bytes (20, the next filter row)
  \(i \leftarrow i + 1\)
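the same strategy can be sketched in plain C, computing 8 output positions in lockstep with an accumulator array that maps naturally onto one ymm register; `conv8` is a hypothetical helper name, and a compiler with AVX2 enabled can lower the inner loop to a vbroadcastss plus vfmadd:

```c
// compute 8 consecutive horizontal outputs of one row at once;
// f points at the top-left input of the first of the 8 positions,
// N is the input row length in elements, K the filter size
void conv8(const float* f, const float* g, float* out, int K, int N) {
    float acc[8] = {0};                    // one ymm accumulator
    for (int i = 0; i < K; i++) {          // filter rows
        for (int j = 0; j < K; j++) {      // filter columns
            float w = g[i * K + j];        // one filter value, broadcast
            for (int v = 0; v < 8; v++)    // 8 lanes of the register
                acc[v] += f[i * N + j + v] * w;
        }
    }
    for (int v = 0; v < 8; v++)
        out[v] = acc[v];
}
```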

this yields 5 32-byte reads per row, iterated over 5 rows, resulting in a total of 25 load instructions from f. The 17 stack variables store the addresses for the majority of these loads, with the exception of some that are kept directly in registers:

    lea r8, 4[r13]
    lea rdi, 8[r13]
    lea rsi, 12[r13]
    lea rcx, 112[r13]
    lea rbx, 456[r13]
    lea r11, 460[r13]
    lea r10, 464[r13] 

r13 and the next 3 registers (r8, rdi, rsi) point to the first 4 loads of the first row, rcx points to the first address of the second row, and the rest point to the first 3 addresses of the 5th row. Together with the previous 17 stack-stored addresses, they total 25.

as we'll see later, these addresses are accessed sequentially, so loading registers with the addresses of the last row ahead of time is unnecessary. We can guess that the compiler intended to implement a form of software pipelining, an out-of-order optimization method for loops whose iterations are independent, mimicking CPU pipelining, but the iterations turned out to be dependent.

we advance into the actual computation:

    vmovups ymm0, YMMWORD PTR 0[r13+rax]
    vfmadd213ps ymm0, ymm15, YMMWORD PTR [rdx+rax]
    mov r9, QWORD PTR -32[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm14, YMMWORD PTR [r8+rax]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm13, YMMWORD PTR [rdi+rax]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm12, YMMWORD PTR [rsi+rax]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm11, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR -24[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm10, YMMWORD PTR [rcx+rax]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm9, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR -16[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm8, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR -8[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm7, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR [rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm6, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR 8[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm5, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR 16[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm4, YMMWORD PTR [r9+rax]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    mov r9, QWORD PTR 24[rsp]
    vmovaps ymm1, YMMWORD PTR -88[rsp]
    vfmadd231ps ymm0, ymm3, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR 32[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm2, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR 40[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR 48[rsp]
    vmovaps ymm1, YMMWORD PTR 392[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR 56[rsp]
    vmovaps ymm1, YMMWORD PTR 360[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR 64[rsp]
    vmovaps ymm1, YMMWORD PTR 328[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR 72[rsp]
    vmovaps ymm1, YMMWORD PTR 296[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR 80[rsp]
    vmovaps ymm1, YMMWORD PTR 264[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR 88[rsp]
    vmovaps ymm1, YMMWORD PTR 232[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR 96[rsp]
    vmovaps ymm1, YMMWORD PTR 200[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r9+rax]
    vmovaps ymm1, YMMWORD PTR 168[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [rbx+rax]
    vmovaps ymm1, YMMWORD PTR 136[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r11+rax]
    vmovaps ymm1, YMMWORD PTR 104[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r10+rax]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    add rax, 32
    cmp rax, 96
    jne .L5
    mov r9, r14
    mov r14, QWORD PTR -120[rsp]
    mov rax, r15
    mov r15, r12
    lea rdi, 112[r9]
    mov r12, rax 

the first vfmadd instruction is vfmadd213ps and all subsequent ones are vfmadd231ps. The digits indicate the order of the operands: 213 means operand 2 is multiplied by operand 1, operand 3 is added, and the result is stored in operand 1. The first operand ymm0 serves as the accumulation register; the second operand makes use of the previously broadcasted values of g in registers ymm2-ymm15, and afterwards of values loaded from memory into ymm1:

    vmovaps ymm1, YMMWORD PTR 392[rsp] 

the third operand points to the values of f and makes use of the previously stored pointers to which we add the register rax. Usually register addition is straightforward, we simply add the values stored inside. This time, however, the registers contain memory addresses:

    mov r9, QWORD PTR 96[rsp] 

The memory in question is virtual memory, which the AMD64 Architecture Programmer's Manual describes as consisting of the entire address space available to programs: a large linear address space that is translated by a combination of hardware and operating-system software to a smaller physical address space, parts of which are located in memory and parts on disk or other external storage media.

In the case of my CPU, it is a 64-bit linear memory space that is by default unsegmented. Operating systems create segments for code, stack and data for memory protection purposes. The base address of the memory space is 0, and a location in it is specified using a segment:offset logical address, often referred to as a long pointer. Inside a segment, we only need the offset part of the logical address, referred to as an effective address, or a near pointer. All of these addresses are manipulated as unsigned integer offsets from the base of the memory space and support integer arithmetic.

the register rax is used both as a displacement into f and for triggering the loop back to .L5. In the C code this loop tracks m_out. The previously computed addresses don't need to be recomputed, we simply offset from them inside the addressing mode, which adds no extra instructions. The register starts at 0 and is incremented by 32 bytes each iteration until it equals 96, so 3 iterations, repeating the 8 convolutions until 24 values are computed. But before this loop, the last operation on rax was a lea of the address 448[r13]. Why does the register start at 0?

rax is the 64-bit extension of eax. They refer to the same logical register, so the instruction that follows:

     xor eax, eax

sets the lower half of the register to 0. Writing to eax automatically clears the upper 32 bits, so the entire register is now 0.

The third operand is also the only one that is fetched directly from memory. We can guarantee the other two are in L1D, but this one could be anywhere if not properly prefetched. Thankfully, modern CPUs are really good at prefetching, so we are unlikely to incur a cache miss, but our later implementations will explicitly prefetch into L1D and L2 cache.

let us turn our attention towards instruction-level parallelism.

modern CPUs are superscalar and contain multiple execution units that can process instructions concurrently each cycle. For example, the Floating Point Unit on the Zen 2 microarchitecture manages 4 256-bit-wide execution pipes that can handle x87, MMX, SSE and AVX instructions. For single precision floats, this puts us at 32 FLOPs/cycle. But while we can launch up to 4 instructions per cycle, we are still held back by the actual execution time until each instruction is resolved and retired. This is referred to as an instruction's latency. From Agner Fog's instruction tables we learn that both VFMADDPS and VADDPS have a latency of 5 cycles, meaning we can only expect the output to be usable in another computation after 5 cycles.

in the case of a load instruction like VMOVUPS, which also has a 5 cycle latency, that figure only holds if the data is already in L1D cache. If there is a cache miss, we also have to add the latency of bringing the data into the L1 cache, which depends on where the data currently is. The worst case scenario is RAM, which adds 66 ns plus 38 cycles and is bound to stall the pipeline. You can see the cost of the other cache levels for Zen 2 here.

data dependencies are another reason we fail to schedule parallel execution. Let us imagine a loop that calls two instructions:

     add rax, 1
     add rax, 2 

the second call to add has to wait for the first one to resolve before executing. It is data dependent on the first call and cannot be processed in parallel. Another method of achieving parallelism is software pipelining, in which we unroll the loop and attempt to execute different iterations in parallel. Here we hit the second type of dependency: inter-loop, also known as loop-carried. We cannot begin work on the second iteration of the loop without waiting for the first iteration to fully resolve the value of rax. Dependencies are a property of programs and depend entirely on the logical algorithm chosen. The above example cannot be run in parallel, so we change the algorithm to use an extra register:

    add rax, 1
    add rbx, 2
    add rax, rbx

now both leading add instructions are independent and can run in parallel. The cost of a loop iteration is the latency of a single add, as opposed to two in the last example. Note this example is still inter-loop dependent.
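the same transform is what hides FMA latency in reduction loops: a single accumulator serializes on its own 5-cycle chain, while independent accumulators let iterations overlap. a scalar sketch of the idea, with hypothetical helper names:

```c
// dependent version: every add waits on the previous one
float sum_single(const float* x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

// two independent accumulator chains, combined once at the end;
// each chain depends only on itself, so iterations can overlap
float sum_dual(const float* x, int n) { // assumes n is even
    float a = 0.0f, b = 0.0f;
    for (int i = 0; i < n; i += 2) {
        a += x[i];
        b += x[i + 1];
    }
    return a + b;
}
```

note that changing the order of floating point additions can change rounding, so compilers only apply this on their own under relaxed FP flags.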

oftentimes, when we have to translate a software algorithm into a hardware circuit on a PCB, we deal with two properties: data flow and control flow. They serve as a framework for understanding dependencies.

data flow describes the logic of the program. It follows the informational input as it is digested through the instruction pipeline. It is often visualised in the form of a data flow graph where each node is an operation, or a specific instruction, and every edge is a data dependency. If we were to execute two instructions in a way that would violate a data edge relationship, we would change the logic of the program. It follows that instructions with no data edge relationship can be executed out of order.

control flow describes the implementation of the data flow, the order of operations as described by the program. That two operations execute sequentially here is not a constraint of the program's logic but a choice in how that logic was computationally implemented. It is often visualised as a control flow graph where each node is an operation, or an instruction, and each edge is the flow of control. The goal is to reduce control flow dependencies.

they are similar, but different. Consider the following example :

      jmp b
    a:
      add eax, 1
    b:
      jmp a

the jump to a does nothing to the data, yet the loop still has to yield control to it every iteration.

both are needed to fully describe a computer program. An efficient implementation is one that follows the DFG with as little sequential control overhead as possible. This is at times harder than it seems because of the sequential nature of programming languages. Modern compilers handle the parallelization and use these flows for optimisations like constant propagation and live-variable analysis.

because our two inner loops are fully unrolled, there is no control flow overhead affecting them. A full single run of the pre-unrolled loop can be described by these 4 instructions:

    mov r9, QWORD PTR 56[rsp]
    vmovaps ymm1, YMMWORD PTR 360[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r9+rax] 

we notice a couple of dependencies right away. vfmadd231ps needs the pointer into f from the mov instruction and the broadcasted values of g from the vmovaps instruction. These are called read after write or RAW dependencies. If we zoom out and consider multiple iterations of this loop, we also notice two RAW dependencies on the previous vfmadd231ps for the accumulated values in ymm0, one for the vmovups instruction that copies the values into memory, and another in the next vfmadd231ps.

also notice that across iterations, the registers ymm1, ymm0 and r9 are reused. In theory this imposes another type of dependency on the program, a false (write-after-read) dependency, forcing the next iteration of the loop to wait until vfmadd231ps is done with the registers, a latency of 5 cycles, before being able to load into them again:

    mov r9, QWORD PTR 64[rsp]
    vmovaps ymm1, YMMWORD PTR 328[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r9+rax]

    mov r9, QWORD PTR 72[rsp]
    vmovaps ymm1, YMMWORD PTR 296[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r9+rax]
    mov r9, QWORD PTR 80[rsp]
    vmovaps ymm1, YMMWORD PTR 264[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r9+rax]

    mov r9, QWORD PTR 88[rsp]
    vmovaps ymm1, YMMWORD PTR 232[rsp]
    vmovups YMMWORD PTR [rdx+rax], ymm0
    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r9+rax] 

CPUs cannot remove RAW dataflow dependencies, because changing the dataflow graph changes the underlying logic of the program. But they are good at dealing with the false dependencies caused by reuse of registers in successive instructions that have no data dependency.

the process is called register renaming. Just like the distinction between logical addresses in virtual memory and physical addresses in physical memory, there is a layer of virtualisation above the physical registers that describes the so-called logical registers available to programs. They are defined by the CPU's architecture: in the case of our CPU, 16 ymm vector registers and 16 general purpose 64-bit registers. There is no direct correlation between logical registers and physical registers on the CPU die; ymm0 can be mapped onto any physical register. Typically, the number of physical registers is bigger than the number of logical ones. In the current CPU, each core has one 160-entry register file, where each entry is 256 bits wide. Many entries are used internally for things like speculative execution. We only have control over the logical registers.

consider the following set of instructions:

     mul r12, r10, r11
     div r13, r11, r10
     add r10, r12, r11 

they are dependent on each other. We can understand these dependencies better if instead of logical register references we refer to the operands as tags, here denoted as t followed by an index. We start with t0, t1, t2, t3 abstracting r10, r11, r12, r13 respectively:

     mul t4, t0, t1
     div t5, t1, t0
     add t6, t4, t1 

every time we encounter a computation, we refer to its resulting value with a new t tag. So, instead of storing the result of mul into t2 (r12), we denote it as a new tag t4.

on die, these tags are returned from a rename file that indexes the registers referenced in the program. Each tag has an associated ready bit, set to 0 if the tag is unresolved and 1 if the instructions preceding it have executed and the value is resolved. For every tag write, a new tag is pulled from a free tag pool and a new mapping is written into the rename file, such that any future instruction reading the architectural register will be referred to this new tag.

if we apply this logic to the above asm, we get:

    mov g2, QWORD PTR 64[rsp]
    vmovaps y1, YMMWORD PTR 328[rsp]
    vmovups YMMWORD PTR [g1+g0], y0
    vfmadd231ps y2, y0, y1, YMMWORD PTR [g2+g0]

    mov g3, QWORD PTR 72[rsp]
    vmovaps y3, YMMWORD PTR 296[rsp]
    vmovups YMMWORD PTR [g1+g0], y2
    vfmadd231ps y4, y2, y3, YMMWORD PTR [g3+g0]
    mov g4, QWORD PTR 80[rsp]
    vmovaps y5, YMMWORD PTR 264[rsp]
    vmovups YMMWORD PTR [g1+g0], y4 
    vfmadd231ps y6, y4, y5, YMMWORD PTR [g4+g0]

    mov g5, QWORD PTR 88[rsp]
    vmovaps y7, YMMWORD PTR 232[rsp]
    vmovups YMMWORD PTR [g1+g0], y6 
    vfmadd231ps y8, y6, y7, YMMWORD PTR [g5+g0] 

where the g tags are pulled from a GPR pool and the y tags from a ymm register pool. I've also separated the destination operand from the first source operand in the vfmadd231ps calls to fit all the necessary tags. Not shown above, g0 is the rax register and g1 is the rdx register.

this change helps the CPU schedule instructions only when all dependencies have been resolved by just checking the ready bits. It also helps us visualise the data dependencies better. We can tell now that the only data relationship between the iterations is the output register from vfmadd231ps used to move into memory and accumulate.

the first 14 iterations of the loop do not need a vmovaps to fetch broadcasted values of g, and 8 seemingly random iterations don't need the mov to grab the f pointer. The third instruction, vmovups, is repeated on every iteration without ever changing the address to which ymm0 is moved, rdx+rax, thus overwriting it 25 times.

to understand how the instructions could be scheduled, we plot the DFG

even though we have two data dependencies, only one of them dictates when the next iteration can start: vfmadd231ps. This is called the critical path, and it is represented by the bold line annotated with the latency.

for benchmarking purposes, let's calculate the instructions per cycle, or IPC. We have 25 each of the mov, vmovaps, vmovups and fma instructions, totalling 25*4 = 100 instructions per kernel call. We expect this call to take 5+25*5+5 = 135 cycles, resulting in 100/135 = 0.74 IPC.

one good thing about this structure is that loop iterations are not inter-dependent, meaning we can theoretically run two loop iterations at the same time if we disregard the number of registers required. I've removed the dependency lines to the mov instructions because the latency is only 1 cycle:

where instructions prefixed with 2 represent the second iteration.

we have 200 instructions per 136 cycles, putting us at 1.47 IPC. Notice, though, that we have redundant load operations. vmovaps 0 and vmovaps 20 refer to the same data loaded into two different registers. If we were to load it only once, it would constitute a read after read (RAR) dependency on that register, which is not considered a hazard, but on certain architectures it can add a 1 cycle delay.

on modern chips, registers are usually packed together into physical register files, or PRFs, typically implemented as SRAM. We know from official AMD documentation that each core's FPU register file contains 160 256-bit wide entries. We also know it connects to 4 asymmetric execution pipes. Only pipe 0 and pipe 1 can execute FMA instructions, which require three operands, and each needs to borrow a PRF read port from pipe 3, stalling that pipe for one cycle. While the documentation does not specify how many such ports the PRF has, we can guess each pipe has 2 individual read ports and at least one individual write port, totalling 8 read ports and 4 write ports, plus another 2 write ports from cache, each 256 bits wide. Only pipe 2 can send data to the Store Queue, at a bandwidth of 256 bits per cycle.

the pipes are also capable of superforwarding, meaning when an instruction is done and the output has to be used in another instruction, it can feed it directly to the pipe through a bypass network, skipping a write and read into the PRF. This enables faster execution of dependent instructions.

the only instructions in our dataflow graph that read data from the PRF are fma and vmovups, and we can expect to execute a maximum of 3 per cycle. Each of our fma instructions occupies 3 read ports, and each vmovups one port. PRFs usually come with multiplexers that can handle different reading patterns. We know little about the PRF, and nothing about any multiplexers that might be in the area. We can generalize this entire structure like this:

the multiplexer is responsible for routing the data to the respective pipe. It can also route the same data multiple times, while only needing one PRF port to read it from the register. Or, if the data is available on the bypass network, represented by the dotted input lines into the multiplexers, it doesn't need any PRF reads at all. We can safely assume that for our maximum dispatch of 2 fmas per cycle, the multiplexer will be able to handle the routing. This would bring the instruction count to 175 and our IPC to ~1.28.

while we are at the hardware level, let's also talk about how data grabbed directly from memory makes its way here. Spoiler alert, but during profiling we're going to see some cache misses.

in the representation above, the Memory Load is heavily abstracted. In reality, data can come from either the Integer Execution Unit (EX) or directly from L1D cache. Early in the instruction lifecycle, any required data operand is first put through an Address Generation Unit, or AGU, of which there are 3 inside the EX, which compute the memory address where the data lives. These are separate units that can run in parallel and can be scheduled to align with the beginning of our instructions. Once the address is known, it is passed to the Load-Store, or LS, unit to actually read the data into cache. Here we have two 256-bit pipes for reading and one 256-bit pipe for writing per cycle.

the complexity of the address calculation can also differ. The AGU can handle base + displacement, base + index, and displacement only; any more complex addressing mode adds a latency of 1 cycle, because it needs to be processed by the bigger Arithmetic Logic Unit or ALU, which typically handles normal integers. This would be useful if we wanted to get rid of the mov instructions grabbing the pointers into f and simply calculate a base + index + displacement like this:

    vfmadd231ps ymm0, ymm1, YMMWORD PTR [r13+rax+448]

in order to translate virtual memory addresses into physical ones, we use a Translation Lookaside Buffer or TLB, usually implemented as a form of key-value store where the key is the virtual address and the value is the physical address. Because of size limitations, it cannot store all possible translations: the L1D TLB stores 64 entries, the L2 TLB 2048, so we only keep the most recent ones. If we miss both, we have to page walk: look up the address in the page table, adding latency.

another hardware component, the Prefetcher, issues prefetches to reduce cache misses. Both the L1 and L2 caches have their own prefetcher. These function independently of the code, but there are explicit prefetch instructions we will later use to ensure data is cached. Overall, the LS unit on each core can handle up to 22 in-flight cache misses.

modern OSs have different methods for scheduling a queue of processes on the CPU to minimize the amount of time it remains idle. Even if a process stalls, another can take its place and compute until the bottleneck is resolved. This new process could be part of our program if we employ a form of multiprocessing, or it can be any other process in the queue. Take a look at the scheduler in the latest Linux kernel release for battle-tested examples. Note that since version 6.6 the Completely Fair Scheduler (CFS) has been replaced by the Earliest Eligible Virtual Deadline First (EEVDF) scheduler; the documentation is a bit behind.

the previously broadcasted values in ymm15-2 are inside this register file and don't need to be changed during runtime. At the risk of overflowing the logical registers even more, we can theoretically extend this pipelining to 4 iterations per loop. We'll keep the redundant load instructions for graph clarity:

this would average 4 FMAs per 5 cycles, not bad. We can expect a full run to take 6+5*25+5 = ~136 cycles. This would mean a full batch would take \( (512 * 24 * 4)/4 * 136 = 1253376\) cycles, or 0.272 ms at 4.6GHz. This is a very small amount of time, so in order to better benchmark the algorithm, let's work on 516x516 pixel images (512+4). This yields \(((512*512*64)/4)*136 = 570425344\) cycles. At 4.6GHz, this means 124.005 ms with an IPC of 2.38.

but while the performance is good, running this would require 4 accumulation registers, one for each convolution iteration, and 4 registers for the loads, plus the 13 broadcast registers, totalling 21. Even counting hidden registers, we are unlikely to have access to this many. And as a general rule of thumb, we should limit ourselves to the registers defined by the ISA.

before profiling, let's take a look at the control flow graph to get an idea of potential overhead and total branches:

each line that doesn't go straight down represents a branch: a CPU prediction and, potentially, a branch miss. The first block .L2 after the function entry point is the entry of both the batch loop and the n_out loop. The next node is the m_out loop, followed by a compiler-added branch that separates the two unrolled inner loops, one implemented using 8-float ymm registers and the other using 4-float xmm SSE registers.

this section of the code is never executed. We only looked at the ymm version, but they are equivalent. Given that we did not use explicit intrinsic functions or tell the compiler which vectorization extension to target, a CPUID instruction was used to query supported features. The presence of both SSE and AVX-256 paths could be the compiler adding a failsafe.

the conditions used for the jump are interesting. Because the compiler shuffles registers a lot, we need an actual debugger like gdb to track the values compared. Here are the values for the first 5 runs:

  cmp rcx, r13 ; initial instruction
  cmp 140737352552448, 140737348333696
      140737352552449, 140737348333808  +1,   +112, -4218641
      140737352552705, 140737348333920  +256, +224, -4218785
      140737352552705, 140737348334032  +0  , +336, -4218673
      140737352552961, 140737348334144  +256, +448, -4218817

  ; the cmp instruction sets the CF flag if the first operand is smaller
  ; because the first operand is always bigger, CF is always clear
  setae cl ; this sets cl to 1 if the CF flag is 0, else cl is 0
  ; cl is now 1 

the CF, or Carry Flag, is a single bit in the FLAGS register. The ZF, or Zero Flag, is also in this register. We only jump if cl and another 8-bit register, sil, are both 0:

  or sil, cl ; clears OF and CF, updates SF, ZF, PF
  je .L9     ; only jump if ZF is set 

since cl is always 1, we never jump. the second possible jump:

  cmp rax, r14
  cmp 140737348334080, 140737352552464              -4218384
      140737348334192, 140737352552560  +112, +96,  -4218368
      140737348334304, 140737352552656  +224, +192, -4218352
      140737348334416, 140737352552752  +336, +288, -4218336
      140737348334528, 140737352552848  +448, +384, -4218320

  ; in this case rax is smaller than r14
  ; the distance is closing by 16 bytes
  setae al ; same as above 

  ; but we only ever jump if al is equal to 0

  test al, al ; this does a bitwise AND and updates the ZF
  je .L9      ; only jump if ZF is set 

but we never end up reaching that point. This doesn't hinder performance, because the branch predictor never expects the jump in the first place; it only makes the executable bigger. Branch prediction algorithms differ in implementation based on microarchitecture. You can read more in Agner Fog's microarchitecture of Intel, AMD and VIA CPUs.


we've made a lot of guesses and assumptions about how the above asm will run on hardware. Let's actually test it out with perf. A short runtime cannot be benchmarked well, so we'll focus on the 516x516 images. Starting with a simple perf stat:

            939.49 msec task-clock                #    0.999 CPUs utilized 
                10      context-switches          #   10.644 /sec         
                 0      cpu-migrations            #    0.000 /sec        
           395,397      page-faults               #  420.861 K/sec       
     4,040,664,781      cycles                    #    4.301 GHz         
       426,071,600      stalled-cycles-frontend   #   10.54% frontend cycles idle
     1,365,582,367      stalled-cycles-backend    #   33.80% backend cycles idle 
     3,483,067,228      instructions              #    0.86  insn per cycle      
                                                  #    0.39  stalled cycles per insn
       432,913,143      branches                  #  460.794 M/sec               
        46,838,132      branch-misses             #   10.82% of all branches     
     2,627,287,887      L1-dcache-loads           #    2.796 G/sec               
        59,443,914      L1-dcache-load-misses     #    2.26% of all L1-dcache accesses 
   <not supported>      LLC-loads                                                             
   <not supported>      LLC-load-misses 

longer runtime than expected, more explicitly:

       914,179,746 ns   duration_time                                 ( +-  0.22% )
       611,075,995 ns   system_time                                   ( +-  0.24% )
       301,904,726 ns   user_time                                     ( +-  0.49% ) 

where system_time tracks code run in kernel mode, or protection ring 0, and user_time tracks code run in user mode, or protection ring 3. Our asm above doesn't rely on any syscalls, so it mostly ran in user mode, with the possible exception of kernel intervention on unexpected signals. This means, with control flow, we average ~300ms. This is closest to our 2-fma loop, meaning the CPU most likely ran 2 loop iterations at once. With \(((512*512*64)/2)*136\), or 1140850688 cycles, at 4.3 GHz that's a runtime of ~265.3ms.

execution time depends on variable clock speed, resulting in a bigger standard deviation between executions; user time varies by ~0.5%. For a more accurate measurement, let's look at the exact number of cycles and instructions. Counters suffixed with :u only measure user mode, those with :k only kernel mode. No suffix is the total:

     4,035,083,395      cycles                                        ( +-  0.06% )
     1,246,281,485      cycles:u                                      ( +-  0.04% )
     2,783,423,550      cycles:k                                      ( +-  0.08% )
     3,516,820,355      instructions       #    0.87  insn per cycle  ( +-  0.01% )
     1,561,793,587      instructions:u     #    1.25  insn per cycle  ( +-  0.00% )
     1,955,045,591      instructions:k     #    0.70  insn per cycle  ( +-  0.02% )

it automatically calculated the instructions per cycle (IPC) for us. We are at ~1.25 IPC. If we compare to our previously computed values, we are again close to the 2 iterations per cycle at 1.28 IPC. We can annotate each instruction with the number of cycles it ran:

            │2a9:   vmovups      0x0(%r13,%rax,1),%ymm0
   19935745 │       vfmadd213ps  (%rdx,%rax,1),%ymm15,%ymm0
   21064722 │       mov          -0x28(%rsp),%rbx          
   21024866 │       vmovups      %ymm0,(%rdx,%rax,1)       
    3329571 │       vfmadd231ps  (%r15,%rax,1),%ymm14,%ymm0
            │       vmovups      %ymm0,(%rdx,%rax,1)       
   18819871 │       vfmadd231ps  (%r8,%rax,1),%ymm13,%ymm0 
            │       vmovups      %ymm0,(%rdx,%rax,1)       
    2190819 │       vfmadd231ps  (%rdi,%rax,1),%ymm12,%ymm0
            │       vmovups      %ymm0,(%rdx,%rax,1)       
    7761848 │       vfmadd231ps  (%rsi,%rax,1),%ymm11,%ymm0
            │       vmovups      %ymm0,(%rdx,%rax,1)       
    2221530 │       vfmadd231ps  (%rcx,%rax,1),%ymm10,%ymm0
    1090240 │       vmovups      %ymm0,(%rdx,%rax,1)       
   19928532 │       vfmadd231ps  (%rbx,%rax,1),%ymm9,%ymm0 
            │       mov          -0x20(%rsp),%rbx          
            │       vmovups      %ymm0,(%rdx,%rax,1)       
    4433909 │       vfmadd231ps  (%rbx,%rax,1),%ymm8,%ymm0 
    4442593 │       mov          -0x18(%rsp),%rbx          
            │       vmovups      %ymm0,(%rdx,%rax,1)       
    2222789 │       vfmadd231ps  (%rbx,%rax,1),%ymm7,%ymm0 
            │       mov          -0x10(%rsp),%rbx          
   16646526 │       vmovups      %ymm0,(%rdx,%rax,1)       
    3327201 │       vfmadd231ps  (%rbx,%rax,1),%ymm6,%ymm0 
            │       mov          -0x8(%rsp),%rbx           
            │       vmovups      %ymm0,(%rdx,%rax,1)       
    4437493 │       vfmadd231ps  (%rbx,%rax,1),%ymm5,%ymm0 
    4408350 │       mov          (%rsp),%rbx              
            │       vmovups      %ymm0,(%rdx,%rax,1)      
    5529617 │       vfmadd231ps  (%rbx,%rax,1),%ymm4,%ymm0
   14406060 │       mov          0x8(%rsp),%rbx           
            │       vmovups      %ymm0,(%rdx,%rax,1)      
    2220403 │       vfmadd231ps  (%rbx,%rax,1),%ymm3,%ymm0
    1111934 │       mov          0x10(%rsp),%rbx          
            │       vmovaps      -0x58(%rsp),%ymm1        
            │       vmovups      %ymm0,(%rdx,%rax,1)      
    8869069 │       vfmadd231ps  (%rbx,%rax,1),%ymm2,%ymm0
    1110835 │       mov          0x18(%rsp),%rbx          
   15510103 │       vmovups      %ymm0,(%rdx,%rax,1)         
    6654786 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0   
    4450204 │       mov          0x20(%rsp),%rbx             
            │       vmovaps      0x188(%rsp),%ymm1           
            │       vmovups      %ymm0,(%rdx,%rax,1)         
   25503474 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0   
   41043248 │       mov          0x28(%rsp),%rbx             
            │       vmovaps      0x168(%rsp),%ymm1           
            │       vmovups      %ymm0,(%rdx,%rax,1)        
   41032959 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0  
   35496073 │       mov          0x30(%rsp),%rbx             
            │       vmovaps      0x148(%rsp),%ymm1          
            │       vmovups      %ymm0,(%rdx,%rax,1)        
   55430343 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0  
   36557312 │       mov          0x38(%rsp),%rbx            
            │       vmovaps      0x128(%rsp),%ymm1          
            │       vmovups      %ymm0,(%rdx,%rax,1)        
   54343004 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0  
   23301425 │       mov          0x40(%rsp),%rbx            
            │       vmovaps      0x108(%rsp),%ymm1          
            │       vmovups      %ymm0,(%rdx,%rax,1)       
   46562715 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0 
   42099072 │       mov          0x48(%rsp),%rbx           
            │       vmovaps      0xe8(%rsp),%ymm1          
            │       vmovups      %ymm0,(%rdx,%rax,1)      
   53230580 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0
   57640664 │       vmovups      %ymm0,(%rdx,%rax,1)      
   54347370 │       mov          0x50(%rsp),%rbx          
            │       vmovaps      0xc8(%rsp),%ymm1         
            │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0
   47688482 │       mov          0x58(%rsp),%rbx         
            │       vmovaps      0xa8(%rsp),%ymm1        
            │       vmovups      %ymm0,(%rdx,%rax,1)        
   56525529 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0 
   34376712 │       mov          0x60(%rsp),%rbx              
            │       vmovaps      0x88(%rsp),%ymm1             
            │       vmovups      %ymm0,(%rdx,%rax,1)           
   48804709 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0      
   29955169 │       vmovaps      0x68(%rsp),%ymm1            
            │       vmovups      %ymm0,(%rdx,%rax,1)          
   48781630 │       vfmadd231ps  (%r10,%rax,1),%ymm1,%ymm0   
   34421537 │       vmovups      %ymm0,(%rdx,%rax,1)        
   58746184 │       add          $0x20,%rax                 
            │       cmp          $0x800,%rax                
            │     ↑ jne          2a9                        

notice the bottlenecks at the end. Also note perf uses the Linux default AT&T assembly syntax. The bolded instructions each took more than 5% of the total cycles individually. These correlate almost exactly with L1D cache misses:

            │2a9:   vmovups      0x0(%r13,%rax,1),%ymm0
     411722 │       vfmadd213ps  (%rdx,%rax,1),%ymm15,%ymm0    
     308326 │       mov          -0x28(%rsp),%rbx              
     295976 │       vmovups      %ymm0,(%rdx,%rax,1)           
      32964 │       vfmadd231ps  (%r15,%rax,1),%ymm14,%ymm0    
            │       vmovups      %ymm0,(%rdx,%rax,1)           
     212195 │       vfmadd231ps  (%r8,%rax,1),%ymm13,%ymm0     
            │       vmovups      %ymm0,(%rdx,%rax,1)           
      49213 │       vfmadd231ps  (%rdi,%rax,1),%ymm12,%ymm0    
            │       vmovups      %ymm0,(%rdx,%rax,1)           
            │       vfmadd231ps  (%rsi,%rax,1),%ymm11,%ymm0    
      32937 │       vmovups      %ymm0,(%rdx,%rax,1)           
      32290 │       vfmadd231ps  (%rcx,%rax,1),%ymm10,%ymm0    
            │       vmovups      %ymm0,(%rdx,%rax,1)           
     310704 │       vfmadd231ps  (%rbx,%rax,1),%ymm9,%ymm0     
      15672 │       mov          -0x20(%rsp),%rbx              
            │       vmovups      %ymm0,(%rdx,%rax,1)           
      33173 │       vfmadd231ps  (%rbx,%rax,1),%ymm8,%ymm0     
      31573 │       mov          -0x18(%rsp),%rbx             
            │       vmovups      %ymm0,(%rdx,%rax,1)          
      32478 │       vfmadd231ps  (%rbx,%rax,1),%ymm7,%ymm0    
      18529 │       mov          -0x10(%rsp),%rbx             
     277959 │       vmovups      %ymm0,(%rdx,%rax,1)          
      99592 │       vfmadd231ps  (%rbx,%rax,1),%ymm6,%ymm0    
            │       mov          -0x8(%rsp),%rbx              
            │       vmovups      %ymm0,(%rdx,%rax,1)          
      81671 │       vfmadd231ps  (%rbx,%rax,1),%ymm5,%ymm0    
      64154 │       mov          (%rsp),%rbx                  
            │       vmovups      %ymm0,(%rdx,%rax,1)        
      36131 │       vfmadd231ps  (%rbx,%rax,1),%ymm4,%ymm0   
     342337 │       mov          0x8(%rsp),%rbx            
            │       vmovups      %ymm0,(%rdx,%rax,1)      
     115567 │       vfmadd231ps  (%rbx,%rax,1),%ymm3,%ymm0
      17432 │       mov          0x10(%rsp),%rbx         
            │       vmovaps      -0x58(%rsp),%ymm1      
            │       vmovups      %ymm0,(%rdx,%rax,1)   
      84685 │       vfmadd231ps  (%rbx,%rax,1),%ymm2,%ymm0     
      66578 │       mov          0x18(%rsp),%rbx              
     263973 │       vmovups      %ymm0,(%rdx,%rax,1)         
     114712 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0  
      65940 │       mov          0x20(%rsp),%rbx           
            │       vmovaps      0x188(%rsp),%ymm1        
            │       vmovups      %ymm0,(%rdx,%rax,1)     
     409424 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0     
     604652 │       mov          0x28(%rsp),%rbx              
            │       vmovaps      0x168(%rsp),%ymm1           
      16251 │       vmovups      %ymm0,(%rdx,%rax,1)        
    1018632 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0 
     620862 │       mov          0x30(%rsp),%rbx          
            │       vmovaps      0x148(%rsp),%ymm1       
            │       vmovups      %ymm0,(%rdx,%rax,1)    
     872968 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0  
     887909 │       mov          0x38(%rsp),%rbx           
            │       vmovaps      0x128(%rsp),%ymm1        
            │       vmovups      %ymm0,(%rdx,%rax,1)     
     846318 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0    
     508933 │       mov          0x40(%rsp),%rbx             
            │       vmovaps      0x108(%rsp),%ymm1          
            │       vmovups      %ymm0,(%rdx,%rax,1)       
     996104 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0  
     454671 │       mov          0x48(%rsp),%rbx           
            │       vmovaps      0xe8(%rsp),%ymm1         
            │       vmovups      %ymm0,(%rdx,%rax,1)     
     675173 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0  
     686926 │       vmovups      %ymm0,(%rdx,%rax,1)       
     785595 │       mov          0x50(%rsp),%rbx          
            │       vmovaps      0xc8(%rsp),%ymm1        
            │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0  
     814627 │       mov          0x58(%rsp),%rbx           
            │       vmovaps      0xa8(%rsp),%ymm1         
            │       vmovups      %ymm0,(%rdx,%rax,1)     
     725484 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0   
     539973 │       mov          0x60(%rsp),%rbx            
            │       vmovaps      0x88(%rsp),%ymm1          
            │       vmovups      %ymm0,(%rdx,%rax,1)     
     920001 │       vfmadd231ps  (%rbx,%rax,1),%ymm1,%ymm0 
     510993 │       vmovaps      0x68(%rsp),%ymm1          
            │       vmovups      %ymm0,(%rdx,%rax,1)       
     735113 │       vfmadd231ps  (%r10,%rax,1),%ymm1,%ymm0 
     667350 │       vmovups      %ymm0,(%rdx,%rax,1)       
     919375 │       add          $0x20,%rax               
            │       cmp          $0x800,%rax             
            │     ↑ jne          2a9                    

most vfmadd231ps instructions miss the data cache. Of the three operands, two are loaded into registers and don't use the data pipeline; the third, pointer operand is responsible for all the misses. All but one mov instruction miss too, at points even more often. One of the highest miss rates belongs to the last add instruction.

it's worth taking the time to discuss how we arrive at the above numbers. Most modern chips include special hardware registers called Hardware Performance Counters (HPCs) or Performance Monitoring Counters (PMCs). Their role is to increment every time another hardware component is in use. A simple example would be the clock, where the counter increments on every cycle pulse. PMCs can either be fixed to a single component or programmable to track multiple. To control which event we track, each such register comes with a configuration register that serves as its interface.

grouped together, these registers form the Performance Monitoring Unit or PMU. PMUs are implemented as Model-Specific Registers (MSRs) and vary for each CPU. The x86 ISA defines the WRMSR instruction for writing and RDMSR for reading such registers. Some counters can also be accessed with the RDPMC instruction. The number of PMCs is usually small, for the processor at hand there are only 6, but there are many more trackable events, usually described in the vendor documentation. When more events are tracked than there are counters, the PMCs are time-multiplexed between them; this reprogramming during execution adds overhead.

throughout the lifespan of a longer program, these registers overflow, resulting in an interrupt. Such interrupts can be tracked in software and, together with metadata like the instruction pointer (IP), used to keep track of where the program is at a stable interval. This is called sampling. The interval can be set manually by initializing the counter to a known value. For example, if we wanted a sample every 100ms at 4.6GHz and the counter was 32 bits wide, we would initialize it to a value of 0xffffffff - 0x1b6b0b00 (460,000,000 decimal), the highest possible value minus the number of cycles in 100ms.

user space tools such as perf record can record samples at a specific rate per second, given in Hz, and perf report displays them nicely like above. The event is chosen by its vendor-specific hex value and a hex mask. perf attaches text labels to these values to make them easier to use; perf list --details gives each label's hex values.

keep in mind the counters are exposed to a noisy environment. Different iterations of perf record can vary quite a bit. For example, the same process can be fed differently to the CPU by the scheduler depending on other running processes, and once the instructions are dispatched to hardware, the CPU might schedule them differently depending on other running threads, availability of data, etc. This is not to say that process execution is not deterministic. We could reduce the noise by disabling simultaneous multithreading (SMT), frequency scaling, etc. Find more ways here.

let's take a look at the cache structure. We'll restrict counters to user mode for now, and later look at an strace. The second call below is to event 0x60 with mask 0xc8, giving the total L2 cache accesses from L1D misses. The label used by perf is l2_cache_accesses_from_dc_misses, but we'll use the hex for clarity. We feed this to perf as r[mask][event]:

     perf stat -e r40,rc860,r7064,r0864 ./a.out
     1,841,496,575      r40:u    - L1 Data Cache Accesses 
        29,525,366      rc860:u  - L2 Cache Accesses from L1 Data Cache Misses 
        28,569,645      r7064:u  - L2 Cache Hits from L1 Data Cache Misses 
         8,656,247      r0864:u  - L2 Cache Misses from L1 Data Cache Misses 

only 0.47% of data accesses miss the L2 cache. I've been unable to find any decent L3 events in the AMD documentation, but we'll see explicit DRAM accesses later. Let's look at prefetches:

        15,706,530      r0260:u  - L2 prefetches 
         7,373,473      rff70:u  - L2 prefetch hit in L2
            72,700      rff71:u  - L2 prefetcher hits in L3
         7,404,641      rff72:u  - L2 prefetcher misses in L3

             1,327      r0152:u  - Software PREFETCH instruction saw a DC hit 
        11,332,679      r015a:u  - Hardware Prefetch Data Cache Fills by L2
           101,799      r025a:u  - Hardware Prefetch Data Cache Fills by 
                                   another cache (home node local)
           800,413      r085a:u  - Hardware Prefetch Data Cache Fills by 
                                   DRAM (home node local) 

note these are averaged over 200 separate runs and not multiplexed. Also note the first 4 counters include both data and instruction cache; we only care about the data cache for now. Hardware prefetches are executed by the Prefetcher components on the CPU, while software prefetches are instructions inserted by the programmer or compiler; we haven't used them much, and hardware prefetches have less overhead. We can see that most hit in L2, but ~800,000 had to fill from DRAM. Since these are prefetches, they are not as likely to stall the process waiting for data, but they do total a cumulative ~50ms at 66ns latency.

x86 defines the prefetchh instruction, which takes the T0 hint for temporal data prefetched into all cache levels, the T1 hint for the second cache level and higher, the T2 hint for the third level and higher, and the NTA hint to prefetch data into a non-temporal cache structure close to the processor, minimizing cache pollution. The x86 ISA does not specify how to implement the NTA hint, so it's microarchitecture dependent. There is also a prefetchw instruction for prefetching an address we are going to write to. Each of these instructions fetches one cache line from memory, the size of which is also microarchitecture dependent. For this CPU, the cache line is 64 bytes at all levels.

since software prefetches do have more overhead than hardware ones, you should avoid manually prefetching where the hardware would have done it anyway. Linear memory access streams, or evenly strided ones, are usually prefetched properly. Irregular access patterns, like an sgemm, usually aren't. There are also cases where there are too many distinct streams for the hardware to track, in which case manual prefetching can help. If unsure, profile with perf or similar tools.

in our case, we handle a sub-matrix 12 values wide and 5 rows tall, where each row is separated by a stride in linear memory. As the source tensor gets bigger, this stride grows and becomes more difficult to handle. A bigger stride means potential cache misses, TLB misses or page faults; if the stride is big enough, each of our rows might sit in a different page. This is not a very common pattern, and it is generally augmented with software prefetches. One cache line of 64 bytes gives us 16 single precision values, enough to cover our length of 12. Some implementations take this even further and copy the data linearly into a memory buffer to access it efficiently. This technique is called packing. I've gone over it in my previous post about sgemms.

we can look at the number of demand data cache fills to know how many requests had to be made for data not found in the cache:

            53,655      r0243:u  - Demand Data Cache Fills by Hit in cache; 
                                   local CCX (not Local L2), or Remote CCX and the 
                                   address's Home Node is on this thread's die
         1,507,027      r0143:u  - Demand Data Cache Fills by Local L2 hit 
            19,610      r0843:u  - Demand Data Cache Fills by DRAM or IO 

we have a relatively low number of fills from DRAM or IO. If they are all DRAM hits this puts us at only ~1.29ms or at 4.3GHz, 5565318 cycles. Most misses hit in L2.

now let's look at the L1 and L2 data TLB:

         1,261,477      rff45:u  - All L1 DTLB Misses or Reloads 
            72,632      r0145:u  - DTLB reload to a 4K page that hit  in the L2 TLB 
         1,185,878      r1045:u  - DTLB reload to a 4K page that miss in the L2 TLB

         1,188,991      rf045:u  - L2 DTLB Misses & Data page walks 

we miss a lot of TLB loads and page walk often. Unfortunately, the only TLB-specific instruction x86 defines is invlpg, used to invalidate entries. We can force an entry with a software prefetch. Hardware prefetchers are not allowed to prefetch across page boundaries, because they might initiate invalid page walks. In order to lower misses, we need to better control our memory access patterns.

let's talk a bit about virtual memory pages and efficient access patterns. Keep in mind this might be OS dependent. For this example, we are looking at linux 6.5.0-41-generic. For a primer on the linux memory management system or virtual memory in general, check out this.

the virtual memory space is split into smaller chunks called pages for easier management. The size of a page is usually 4KB, or 4096 bytes. The same is done for physical memory. To keep track of the mapping of one virtual memory page onto a physical memory page, we assign each a number called the page frame number (PFN) and we keep a table. A virtual frame could be pointing to any physical frame. A virtual memory address contains a page number and an offset into that page. When an instruction requests data from such a virtual address, the CPU needs to translate it into the relevant position in physical memory.

at this stage the CPU tries to find this mapping by checking for the virtual page number in the L1 TLB, usually implemented as a hash map in SRAM. If we hit, we can translate and begin loading data in ~0.5-1 cycles. If we miss, we search the L2 TLB. If we also miss here, we have to page walk, go through the Page Table stored in main memory (DRAM) to find the relevant physical page. For Zen 2, a main memory access has a 38 cycles + 66ns latency, plus the latency of the table search. If the mapping is found, the data is loaded into the TLB and the CPU restarts the instruction. If a mapping is not found, we have hit a page fault. Here, we check if the virtual address is valid. If it is not, the OS kills the process to protect other processes. If it is valid, we have to load the page from disk, a very long process.

because we are dealing with a relatively small amount of data, all of it fits in DRAM and we don't have to deal with page faults. If, for example, we were dealing with bigger-resolution images or had a smaller amount of available DRAM (the average in a mobile phone is ~4GB), we would have to consider the possibility. We handle ~1.08 GB, or 1,082,163,300 bytes for single precision.

simple TLBs only store the most recent translated pages and discard older ones. Modern TLBs have more advanced eviction policies. A search in the TLB compares the page number of the virtual address with virtual page numbers stored in the cache. Multiple searches in the same page hit the same entry in the TLB. It follows that an efficient access pattern is one that reuses the same cache entries as much as possible. For this reason, the size of a page can sometimes be changed. Most x86 processors support 4KiB, 2MiB or 1GiB pages. This is determined by a hardware component called the Memory Management Unit or MMU. Read more about changing page size on linux here. Also take a look at this experiment to better understand pages. We won't be changing page size.

let's model the TLB entries of our code. Each kernel call we manage a 12x5 sub-matrix. A single cache line per row loads 16 values, enough to cover the 12 columns. This puts us at 5 cache lines. We also load 11 32-byte broadcasted filter values and 17 4-byte addresses, giving another 6 and 2 cache lines respectively, for a running total of 13. We also have a write instruction that fills the output array, whose address we also need to fetch. So 14 cache lines total.

every cache line loaded from a different page is stored as an entry in the TLB. The 5 lines from the sub-matrix are separated by a stride \(S\) given by the size of the images. In the present case of 516x516, the stride is 516 floats, or 2064 bytes. Considering 4KiB pages of 4096 bytes each, this only gives us one or two hits per page, depending on the displacement of the data. The max number of pages used is given by starting at index 4092: \[ \text{ceil}((4092+2064*5)/4096) = \text{ceil}(3.51)\] meaning 4 pages per kernel call. To this we add the filter values and addresses. Since they are stored on the stack, we can consider them in the same page, for a total of 5 pages. Lastly, we add the write line to the output array in a different page. So each kernel call handles 6 pages.

subsequent kernel calls work on sequential memory, so these 6 pages are reused. In the case of the sub-matrix, given we start at the beginning of the page, we can fit ~85 kernel calls of 48-byte (12-float) displacement into the 4096 bytes of a page before they need to request the next pages. But, since the column size is smaller than a cache line, there is some leftover data. There are only 64 cache lines in a page, and thus only 64 chances to miss. In the case of the write instruction, we can fit ~128 32-byte (8-float) writes. The stack page is consistently reused.

we expect 4 misses every 85 calls, for \(512*512*64\) calls, and we also add one page miss for the write every 128 calls. In total: \[((512*512*64)/85)*4 + ((512*512*64)/128) = 920,588.04\] page misses at worst. Keep in mind that even though we have 25 different write instructions per kernel call, they all point to the same write address. Of the original 1,188,991 page walks, only ~65.83% are attributed to our function, which would mean ~782,712 page walks. Our prediction is within the margin of error. Given 3 pages per sub-matrix (much more likely), we would have 723,209.
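the estimate can be reproduced in a few lines (the ~85-call and ~128-write page capacities are the figures derived above):

```python
calls = 512 * 512 * 64                  # kernel calls for the whole batch

calls_per_page = 4096 // (12 * 4)       # ~85 sub-matrix calls per 4KiB page
writes_per_page = 4096 // (8 * 4)       # ~128 8-float output writes per page

# 4 pages per call in the worst case, 3 in the much more likely case,
# plus one miss per output page
worst = calls / calls_per_page * 4 + calls / writes_per_page
likely = calls / calls_per_page * 3 + calls / writes_per_page
print(round(worst), round(likely))      # 920588 723209
```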

if we were to pack the sub-matrix data sequentially into memory, we could reduce the average of 3 page misses per ~85 calls to only 1 page miss per ~68 calls. At a main memory latency of 38 + 66ns, it is easy to see that for a large amount of data the savings cover the latency of moving the data into the new buffer. It would mean: \[((512*512*64)/68) + ((512*512*64)/128) = 377,795.76\]

a number which we could further alleviate with well-placed software prefetches.
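the packed-layout estimate, checked numerically under the same assumptions:

```python
calls = 512 * 512 * 64                  # kernel calls for the whole batch

# packed layout: 1 page miss per ~68 calls, plus one per ~128 output writes
packed = calls / 68 + calls / 128
print(round(packed))                    # 377796
```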

this seems to be our biggest bottleneck. A look at perf report --stdio shows the main dependency of our function is asm_exc_page_fault, at 15.26% of our runtime.

I'll manually annotate some of these trackers to get a better view. The percentage in parentheses is the share of each tracked counter's total attributed to our naive function:

   rf045:u - L2 DTLB Misses & Data page walks (65.83%)
   r0843:u - Demand Data Cache Fills by DRAM or IO (70.86%) 
   r0864:u - L2 Cache Misses from L1 Data Cache Misses (67.44%)
     cycles |   rf045:u |  r0843:u |  r0864:u |
   21186425 |      6534 |      200 |    41966 |  14e0:   vfmadd213ps
   12315021 |       580 |          |    76442 |  14e6:   mov        
   11831880 |      5890 |      255 |     3928 |  14fb:   vfmadd231ps
   12339458 |      5886 |      356 |     2949 |  1527:   vfmadd231ps
   11842702 |      6308 |      151 |     3280 |  1552:   vmovups    
    9943971 |      4780 |      240 |     4600 |  157c:   mov        
    8532289 |      5560 |      128 |     4275 |  15a7:   vmovups   
   12798017 |      7572 |      445 |     3932 |  15c5:   vfmadd231ps
   26088788 |      8253 |      210 |     5910 |  15cb:   mov    
   26913363 |     13695 |      577 |    11135 |  15de:   vfmadd231ps
   23698430 |      8515 |      790 |     9500 |  15e4:   mov   
   31903337 |     12451 |      455 |    15763 |  15f7:   vfmadd231ps
   21300825 |     11440 |      298 |     5238 |  15fd:   mov    
   31703956 |     16838 |      714 |    10834 |  1610:   vfmadd231ps
   22117032 |     13021 |      371 |     7218 |  1616:   mov    
   29790187 |     18273 |      518 |     9846 |  1629:   vfmadd231ps
   25738146 |     12351 |      388 |     7865 |  162f:   mov   
   35969861 |     15904 |      630 |     9492 |  1642:   vfmadd231ps 
   33661792 |     14778 |      876 |     8852 |  1648:   vmovups 
   35861719 |     13311 |      664 |    14431 |  164d:   mov   
   34629506 |     14216 |      537 |     6898 |  1661:   mov  
   34054336 |     15462 |      604 |    11492 |  1674:   vfmadd231ps 
   23186577 |      8957 |      502 |     5257 |  167a:   mov 
   32085922 |     16197 |      594 |    12468 |  168d:   vfmadd231ps
   18457640 |      9950 |      494 |     9847 |  1693:   vmovaps
   37925127 |     16762 |      891 |    11131 |  169e:   vfmadd231ps
   22714021 |      8991 |      355 |     6620 |  16a4:   vmovups 
   28444358 |     17139 |      836 |    13106 |  16a9:   add  

we'll take a look at the strace to examine the syscalls a bit later.

     6,888,823,080      fp_ret_sse_avx_ops.mac_flops              (36.37%)

     1,163,859,706      ex_ret_mmx_fp_instr.sse_instr             (36.81%)
                26      fp_disp_faults.ymm_fill_fault             (37.25%)
                16      fp_disp_faults.ymm_spill_fault            (37.35%)

     1,209,644,176                 (46.28%)
       295,844,295      fpu_pipe_assignment.total0                (45.84%)
       281,644,302      fpu_pipe_assignment.total1                (54.74%)
       564,730,678      fpu_pipe_assignment.total2                (54.29%)
        70,654,705      fpu_pipe_assignment.total3                (53.85%)

        22,441,966      fp_num_mov_elim_scal_op.sse_mov_ops       (53.40%) 

theoretical maximum

given an infinite number of registers with infinite bandwidth per cycle and infinite execution pipelines, we could theoretically load all required data by cycle 0 and compute all required multiplications by cycle 1, then sum it all and be happy. But for the sake of realism, let's limit the number of decoded FMA instructions to 2 per cycle. The data graph becomes:

let's calculate how much data we are pulling in the 0th cycle. One vmovaps moves 32 bytes from L1D and each mov brings in 8 bytes (hopefully from L1D). This puts us at 32*2+8*2 = 80 bytes loaded from memory. Each CPU has a limit to how much it can load in one cycle; mine is capped at 64 bytes. This means our algorithm is memory-bound.

and that's not all. let's calculate how many registers we would fill until cycle 10, when the first FMA0 finishes and we can reallocate its resources. Each cycle, vmovaps loads into 2 different ymm registers. By cycle 10, this yields 20 registers. Then, in order for each individual FMA to not be data-dependent on previous ones, we have to provide a different accumulation register. By cycle 10, we have already decoded 10 FMAs. This puts us at 30 ymm registers before we can start reusing them. The vast majority of consumer CPUs, including mine, only have 16 of these registers on the chip, so we are also compute bound.
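the register pressure argument, written out as a toy model (the 10-cycle window and 2 loads per cycle are the assumptions from the text; 16 is the architectural ymm register count on AVX2 CPUs):

```python
WINDOW = 10             # assumed cycles until the first FMA retires
loads_per_cycle = 2     # two vmovaps into fresh ymm registers each cycle
fmas_decoded = 10       # one fresh accumulator per in-flight FMA

load_regs = loads_per_cycle * WINDOW     # 20 registers filled by loads
accum_regs = fmas_decoded                # 10 independent accumulators
needed = load_regs + accum_regs          # 30

YMM_AVAILABLE = 16      # architectural ymm registers (AVX2)
print(needed, needed > YMM_AVAILABLE)    # 30 True
```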

if we did have these resources available, one 25-FMA loop would only take 18 cycles. If we ignore loop mechanics and the CPU frontend, 512*3*24*18 = 663552 cycles for the entire batch. At a clock rate of 4.6 GHz the runtime would be 0.144 milliseconds. A batch of 512 4100x4100 images would take 4.201 seconds.
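the cycle-to-runtime conversion, assuming the 4.6 GHz clock:

```python
loops = 512 * 3 * 24                     # loop count from the text
cycles = loops * 18                      # 18 cycles per 25-FMA loop
seconds = cycles / 4.6e9
print(cycles, round(seconds * 1e3, 3))   # 663552 0.144 (milliseconds)
```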

we need to split the loads into 2 cycles and only run one FMA:

this puts us at 40 bytes load per cycle and 15 ymm registers, 10 for broadcasts and 5 for accumulation. It also requires 10 general purpose registers for mov.

a full loop would take 30 cycles. Standard batch at 4.6GHz would be 0.240 ms. A 4100-pixel image batch would be 7.002 s.
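checking the arithmetic for the 30-cycle loop at the same assumed 4.6 GHz clock:

```python
loops = 512 * 3 * 24        # loop count from the text
cycles = loops * 30         # 30 cycles per loop in the constrained model
ms = cycles / 4.6e9 * 1e3
print(cycles, round(ms, 3)) # 1105920 0.24
```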

the above will do for our example, but theoretically if we were capped at 32 bytes per cycle, the standard on a Zen1 microarchitecture CPU, we would lose one cycle every four operations: