AMD’s GCN architecture propelled the company’s Radeon Graphics Division for almost a decade. Although it had its strong points, such as a powerful compute engine, hardware schedulers, and unified memory, it wasn’t very efficient. Hardware utilization was poor compared to rival NVIDIA parts, the CU count was capped at 64, and scaling dropped sharply after the first 11 CUs per shader engine.
In gaming especially, the IPC and latency fell short of the industry standard, which is why AMD GPUs repeatedly failed to keep up with NVIDIA’s high-end products. The RDNA-based Navi GPUs aim to rectify the drawbacks of the GCN design. The Radeon RX 5000 series graphics cards are built from the ground up for gaming and feature a much more fine-tuned pipeline.
RDNA is the core architecture and Navi is the codename of the graphics processors built using it. Similarly, GCN was the architecture and Vega was the codename of the older GPUs.
The 1st Gen RDNA architecture powering the Navi 10 and Navi 14 GPUs (Radeon RX 5500, 5600 and 5700) is based on the same building blocks as GCN: scalar compute, scalar memory, vector compute, vector memory, branches, export, and messages. The core difference is that RDNA reorganizes these fundamental components for higher IPC, lower latency, and better efficiency. That’s what Navi is all about: it does a lot more with notably less hardware!
Dual Compute Architecture and Wave32
AMD’s GCN graphics architecture executed wavefronts of 64 work-items. Each Compute Unit was divided into four SIMDs (Single Instruction, Multiple Data units), each comprising 16 ALUs. A SIMD took four clock cycles to complete one full wavefront. When all the lanes executed the same instruction, this was very efficient, as they finished their workloads simultaneously. However, unlike CPUs, GPUs handle far more divergent workloads.
GCN Compute Unit
And often, one wave consisted of different kinds of work-items; while some of them required four clock cycles to execute, many needed just one or two. This left much of the SIMD underutilized, making it hard to fully saturate.
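The cost of this underutilization can be sketched with a toy lane-occupancy model. The numbers below (16 lanes, 4 issue cycles per wave64) come from the description above; the "active items" counts are illustrative, not measured figures.

```python
# Toy model of GCN's wave64 on a 16-lane SIMD (illustrative, not cycle-accurate).

WAVE_SIZE = 64
SIMD_LANES = 16
CYCLES_PER_WAVE = WAVE_SIZE // SIMD_LANES  # 4 cycles to issue one full wavefront

def utilization(active_items: int) -> float:
    """Fraction of lane-cycles doing useful work when only
    `active_items` of the 64 wavefront slots carry real work."""
    total_lane_cycles = SIMD_LANES * CYCLES_PER_WAVE  # 64 slots always issued
    return active_items / total_lane_cycles

print(utilization(64))  # full wave: 1.0
print(utilization(20))  # divergent wave: 0.3125 -- most lane-cycles wasted
```

A fully populated wave keeps every lane busy, but a divergent wave still occupies all four issue cycles, which is exactly the saturation problem described above.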
RDNA Dual Compute Unit
The RDNA architecture leveraged in Navi uses wave32, a narrower wavefront with 32 work-items. It is much simpler and more efficient than the older wave64 design. Each SIMD consists of 32 shaders or ALUs, twice that of GCN. This allows an entire wavefront to execute in a single clock cycle, reducing bottlenecks and increasing IPC by 2x. By completing a wavefront 4x faster, the registers and cache are freed up much sooner. Furthermore, wave32 uses half the registers of wave64, reducing circuit complexity.
To accommodate the narrower wavefronts, the vector register file has also been reorganized. Each vector general-purpose register (vGPR) now contains 32 lanes that are 32 bits wide, and a SIMD contains a total of 1,024 vGPRs – 4x the number of registers as in GCN.
Overall, the narrower wave32 mode increases the throughput by increasing the IPC and the total number of wavefronts, resulting in significant performance and efficiency boosts.
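The issue-latency difference between the two modes can be worked out from the figures above. This is a rough sketch comparing a single GCN SIMD (wave64, 16 lanes, 4 cycles per wave) against a single RDNA SIMD (wave32, 32 lanes, 1 cycle per wave); real scheduling is more complex.

```python
# Rough issue-latency comparison: GCN wave64 vs RDNA wave32, per SIMD.

def cycles_to_issue(num_items: int, wave_size: int, simd_lanes: int) -> int:
    """Cycles for one SIMD to issue `num_items` work-items."""
    waves = -(-num_items // wave_size)              # ceil: wavefronts needed
    cycles_per_wave = -(-wave_size // simd_lanes)   # ceil: cycles per wavefront
    return waves * cycles_per_wave

work_items = 256
gcn  = cycles_to_issue(work_items, wave_size=64, simd_lanes=16)  # 4 waves * 4 cycles
rdna = cycles_to_issue(work_items, wave_size=32, simd_lanes=32)  # 8 waves * 1 cycle
print(gcn, rdna)  # 16 8
```

Per wavefront the speedup is 4x (four cycles down to one); per SIMD the sustained throughput doubles, since each RDNA SIMD also has twice the lanes.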
To ensure compatibility with the older GCN instruction set, the RDNA SIMDs in Navi support mixed-precision compute. This makes the new Navi GPUs suitable for not only gaming workloads (FP32), but also for scientific (FP64) and AI (FP16) applications. The RDNA SIMD improves latency by 2x in wave32 mode and by 44% in wave64 mode.
Asynchronous Compute Tunneling
One of the main highlights of the GCN architecture was the use of the Asynchronous Compute Engines way before NVIDIA integrated it into their graphics cards. RDNA retains that capability and doubles down on it.
The Command Processor receives commands from the API and issues them to the respective pipelines: the Graphics Command Processor manages the graphics pipeline (shaders and fixed-function hardware), while the Asynchronous Compute Engines (ACEs) handle compute. The Navi 10 die (RX 5700 XT) has one Graphics Command Processor and four ACEs. Each ACE has a distinct command stream, while the GCP has an individual stream for every shader type (vertex, domain, pixel, etc.).
The RDNA architecture improves parallel processing at an instruction level by introducing a new feature called Asynchronous Compute Tunneling. Both GCN and the newer Navi GPUs support asynchronous compute (simultaneous execution of graphics and compute pipelines), but RDNA takes it a step further. At times, one task (graphics or compute) becomes far more latency-sensitive than the rest.
In GCN-based Vega designs, the command processor could prioritize compute over graphics and spend less time on shaders. In the RDNA architecture, the GPU can completely suspend the graphics pipeline, dedicating all resources to high-priority compute tasks. This significantly improves performance in the most latency-sensitive workloads, such as virtual reality.
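The difference between prioritizing compute and fully tunneling it can be shown with a toy scheduler. This is purely illustrative: the task names and the arbitration rule are made up, not how the command processor actually works.

```python
from collections import deque

def run(graphics, compute, tunnel=False):
    """Toy arbiter: interleave graphics/compute, or, with tunnel=True,
    suspend graphics entirely until the compute queue drains."""
    g, c, timeline = deque(graphics), deque(compute), []
    while g or c:
        if c and (tunnel or len(timeline) % 2):  # tunneling starves graphics
            timeline.append(c.popleft())
        elif g:
            timeline.append(g.popleft())
        else:
            timeline.append(c.popleft())
    return timeline

gfx  = ['g1', 'g2', 'g3']
comp = ['c1', 'c2']
print(run(gfx, comp))               # interleaved: ['g1','c1','g2','c2','g3']
print(run(gfx, comp, tunnel=True))  # compute first: ['c1','c2','g1','g2','g3']
```

In tunneling mode the latency-critical compute work finishes as early as possible, at the cost of stalling graphics, which is the trade-off RDNA makes for workloads like VR timewarp.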
Scalar Execution for Control Flow
Most of the computation in AMD’s GCN and RDNA architectures is performed by the SIMDs, which are vector units: they apply a single instruction to multiple data elements (everything from INT to FP). However, each CU contains scalar units as well, which handle control flow and values shared by an entire wavefront.
Each SIMD contains a 10KB scalar register file, with 128 entries for each of the 20 wavefronts. A register is 32 bits wide and can hold packed 16-bit data (integer or floating-point); adjacent register pairs hold 64-bit data. When a wavefront is initiated, the scalar register file can preload up to 32 user registers to pass constants, avoiding explicit load instructions and reducing the launch time for wavefronts.
The 16KB write-back scalar cache is 4-way associative and built from two banks of 128 cache lines, 64 bytes each. Each bank can read a full cache line, and the cache can deliver 16B per clock to the scalar register file in each SIMD. For graphics shaders, the scalar cache is commonly used to store constants and work-item-independent variables.
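The sizes quoted above fit together; a quick sanity check of the arithmetic:

```python
# Sanity-check the scalar-side sizes quoted above.
wavefronts_per_simd = 20
sgprs_per_wavefront = 128
sgpr_bits = 32

scalar_regfile_bytes = wavefronts_per_simd * sgprs_per_wavefront * sgpr_bits // 8
print(scalar_regfile_bytes)  # 10240 bytes -> the 10KB scalar register file

banks, lines_per_bank, line_bytes = 2, 128, 64
scalar_cache_bytes = banks * lines_per_bank * line_bytes
print(scalar_cache_bytes)    # 16384 bytes -> the 16KB scalar cache
```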
Cache: L0 & Shared L1
While the older GCN and rival NVIDIA GPUs rely on two levels of cache, RDNA adds a third level in the Navi GPUs. Where the L0 cache is private to a CU, the new L1 cache is shared across a group of Dual Compute Units, reducing cost, latency, and power consumption, and easing the load on the L2. In GCN, all misses in the per-CU L1 cache were handled by the L2 cache. In RDNA, the new L1 cache centralizes all caching functions within each shader array.
Any cache misses in the L0 caches pass to the L1. This includes all the data from the instruction, scalar, and vector caches, in addition to the pixel cache. The L1 is a read-only, 16-way set-associative cache composed of four banks, for a total of 128KB. It is backed by the L2: a write invalidates the corresponding L1 line, and the data is written to the L2 or memory.
The L1 cache controller coordinates memory requests and forwards four per clock cycle, one to each L1 bank. As with any cache, L1 misses are serviced by the next level, the L2.
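The lookup path described above can be modeled with a minimal three-level sketch. The dict-backed "caches" and the address/value are hypothetical; real hardware uses set-associative SRAM arrays with eviction, which this deliberately omits.

```python
# Minimal model of the RDNA read path: per-CU L0, shared read-only L1
# per shader array, global L2. Illustrative only (no capacity/eviction).

def load(addr, l0, l1, l2):
    """Return (value, path); misses fill the upper levels on the way back."""
    if addr in l0:
        return l0[addr], 'L0'
    if addr in l1:
        l0[addr] = l1[addr]           # L0 miss serviced by the shared L1
        return l0[addr], 'L0<-L1'
    value = l2[addr]                  # L1 miss serviced by the L2
    l1[addr] = value
    l0[addr] = value
    return value, 'L0<-L1<-L2'

l0, l1, l2 = {}, {}, {0x40: 'texel'}
print(load(0x40, l0, l1, l2))  # ('texel', 'L0<-L1<-L2') -- cold miss
print(load(0x40, l0, l1, l2))  # ('texel', 'L0') -- now cached close to the CU
```

The second access hitting in L0 is the point of the extra level: traffic that GCN would have sent to the L2 now stays inside the shader array.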
Dual Compute Unit Front End
Each Compute Unit fetches instructions via the instruction memory fetch unit. In GCN, the instruction cache was shared between four CUs; in RDNA (Navi), the L0 instruction cache is shared amongst the four SIMDs in a Dual CU. The instruction cache is 32KB and 4-way set-associative. Like the L1 cache, it is organized into four banks of 128 cache lines, each 64 bytes long.
The fetched instructions are deposited into the wavefront controllers. Each SIMD has a separate instruction pointer and a 20-entry wavefront controller, for a total of 80 wavefronts per dual compute unit. A wavefront is not the same thing as a work-group or kernel: although many more wavefronts can be in flight, a dual compute unit runs up to 32 work-groups simultaneously.
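The in-flight bookkeeping from the figures above works out as follows (the work-item total assumes every wave runs in wave32 mode):

```python
# Wavefront bookkeeping for one dual compute unit, from the figures above.
simds_per_dual_cu = 4
wavefront_slots_per_simd = 20

wavefronts_in_flight = simds_per_dual_cu * wavefront_slots_per_simd
print(wavefronts_in_flight)               # 80 wavefronts tracked per dual CU

wave32_items = 32
print(wavefronts_in_flight * wave32_items)  # 2560 work-items in flight (wave32)
```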
As already mentioned, where GCN requested instructions once every four cycles, Navi does so every cycle (2–4 instructions per cycle). Each SIMD in an RDNA-based Navi GPU can then decode and issue instructions every cycle as well, increasing throughput and reducing latency by 4x over GCN.
To accommodate the new wave32 mode, the cache and memory pipeline in each RDNA SIMD has also been revamped. The pipeline width has been doubled compared to GCN-based Vega GPUs. Every SIMD has a 32-wide request bus that can transmit the address for each work-item in a wavefront directly to the ALUs or the vGPRs (Vector General Purpose Registers).
A pair of SIMDs shares a request and return bus; however, a single SIMD can receive two 128-byte cache lines per clock: one from the LDS (Local Data Share) and the other from the vector L0 cache.
Video Encode and Decode
Like NVIDIA’s Turing encoder, the Navi GPUs also feature a specialized engine for video encoding and decoding.
In Navi 10 (RX 5600 & 5700), unlike Vega, the video engine supports VP9 decoding. H.264 streams can be decoded at 600 frames per second at 1080p and 150 fps at 4K. It can simultaneously encode at about half that speed: 1080p at 360 fps and 4K at 90 fps. 8K decode is available at 24 fps for both HEVC and VP9.
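Those decode rates translate into stream counts. A back-of-the-envelope conversion, ignoring real-world overheads like bitstream parsing and memory bandwidth:

```python
# Ideal simultaneous-stream counts from the H.264 decode rates quoted above.
decode_1080p_fps = 600
decode_4k_fps = 150
target_fps = 60

print(decode_1080p_fps // target_fps)  # 10 simultaneous 1080p60 streams (ideal)
print(decode_4k_fps // target_fps)     # 2 simultaneous 4K60 streams (ideal)
```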
7nm Process and GDDR6 Memory Standard
While the 7nm node and GDDR6 memory are often advertised as part of the new architecture, they are third-party technologies and not strictly part of the RDNA microarchitecture; the Navi GPUs are simply designed to take full advantage of them.
TSMC’s 7nm node does, however, improve performance per watt significantly over the 14nm process behind the older GCN designs, namely Polaris and Vega. It increases performance per area by 2.3x and boosts performance per watt by 1.5x.