Monday, February 20, 2017

Inside AMD GCN code execution


AMD's Graphics Core Next architecture was introduced over five years ago.  Although there have been many documents written to help developers understand the architecture, and thereby write better code, I have yet to find one that is clear and concise.  AMD's best GCN documentation is often cluttered with unnecessary details on the old VLIW architecture, when the GCN architecture is already complicated enough on it's own.  I intend to summarize my research on GCN, and what that means for OpenCL and GCN assembler kernel developers.

As shown in the top diagram (GCN Compute Unit), the GPU consists of groups of four compute units.  Each CU has four SIMD units, each of which can perform 16 simultaneous 32-bit operations.  Each of these 16 SIMD "lanes" is also called a shading unit, so the R9 380 with 28 CUs has 28 * 4 * 16 = 1792 shading units.

AMD's documentation makes frequent reference to "wavefronts".  A wavefront is a group of 64 operations that executes on a single SIMD.  The SIMD operations take a minimum of four clock cycles to complete, however SIMD pipelines allow a new operation to be started every clock.  "The compute unit selects a single SIMD to decode and issue each cycle, using round-robin arbitration." (AMD GCN whitepaper pg 5, para 3).  So four cycles after SIMD0 has been issued an instruction, the CU is ready to issue it another.

In OpenCL, when the local work size is 64, the 64 work-items will be executed on a single SIMD.  Since a maximum of four SIMD units can access the same local memory (LDS), AMD GCN devices support a maximum local work size of 256.  When the local work size is 64, the OpenCL compiler can leave out barrier instructions, so performance will often (but not always) be better than using a local work size of 128, 192, or 256.

The SIMD units only perform vector operations such as mulitply, add, xor, etc.  Branching for loops or function calls is performed by the scalar unit, which is shared by all four SIMD units.  This means that when a kernel executes a branch instruction, it is executed by the scalar unit, leaving a SIMD unit available to perform a vector operation.  The two operations (scalar and vector) must come from different waves, so to ensure the SIMD units are fully utilized, the kernel must allow for 2 simultaneous wavefronts to execute.  For information on how resource usage such as registers and LDS impacts the number of simultaneous wavefronts that can execute, I suggest reading AMD's OpenCL Optimization Guide.  Note that some sources state that full SIMD occupancy requires four waves, when it is technically possible with just one wave using only vector instructions.  Most kernels will require some scalar instructions, so two waves is the practical minimum.