Deleon Karen

Posted on Jun 2

Part 7: The Evolution of Command Submission: From Ringbuffer to GuC

#architecture #linux #systems #tutorial

In the previous lecture, we explored the basic architecture of the GT (Graphics Technology) and how i915_gem_context preserves GPU execution state much like an operating system process. But with state and rendering commands in place, through what physical pathway do user-mode draw requests actually get fed into the GPU engines for execution?

The Command Submission mechanism is one of the most dramatically evolved and frequently refactored modules in the i915 driver. If you look at the kernel source directory drivers/gpu/drm/i915/gt/, you'll find that three entirely different submission flows coexist within the driver. This evolution is, at its core, a technological revolution centered on "how to free the CPU from heavy scheduling work and allow the GPU to become autonomous."

1. The Ancient Era: Legacy Ringbuffer Submission

On early Intel GPUs (before Gen8/Broadwell), the command submission mechanism was very simple and direct, known as the Ringbuffer mode. This code now resides in gt/intel_ring_submission.c.

1.1 A Simple "Producer-Consumer" Model

Early hardware was quite "dumb." Each hardware engine (such as the Render engine, Blitter engine) had only one global ring memory buffer (Ringbuffer).

CPU (Producer): Writes the memory address of the Batch Buffer (batch processing commands) sent from user mode, along with necessary configuration instructions, sequentially into the Ringbuffer, and updates the TAIL register.
GPU (Consumer): The hardware continuously reads and executes instructions from the position indicated by the HEAD register until HEAD catches up with TAIL.
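
The producer-consumer relationship above can be sketched as a toy model in plain C. This is only an illustration, not driver code: `toy_ring`, its size, and the helper names are all hypothetical, and the "GPU" here is just a function that copies dwords out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RING_DWORDS 8u /* power of two so indices can be masked */

/* Hypothetical stand-in for one engine's single global ring. */
struct toy_ring {
    uint32_t cmds[TOY_RING_DWORDS];
    uint32_t head; /* next dword the "GPU" reads  */
    uint32_t tail; /* next dword the "CPU" writes */
};

/* Free dwords. One slot is kept empty so head == tail means "empty". */
static uint32_t toy_ring_space(const struct toy_ring *r)
{
    return (r->head - r->tail - 1u) & (TOY_RING_DWORDS - 1u);
}

/* CPU (producer): write one dword, then advance TAIL. */
static int toy_ring_emit(struct toy_ring *r, uint32_t dword)
{
    if (toy_ring_space(r) == 0)
        return -1; /* full: a real driver would have to wait here */
    r->cmds[r->tail] = dword;
    r->tail = (r->tail + 1u) & (TOY_RING_DWORDS - 1u);
    return 0;
}

/* GPU (consumer): read until HEAD catches up with TAIL. */
static size_t toy_ring_drain(struct toy_ring *r, uint32_t *out, size_t max)
{
    size_t n = 0;

    assert(r != NULL && out != NULL);
    while (r->head != r->tail && n < max) {
        out[n++] = r->cmds[r->head];
        r->head = (r->head + 1u) & (TOY_RING_DWORDS - 1u);
    }
    return n;
}
```

Because every context shares this one queue, anything emitted for Context B sits behind whatever Context A already wrote, which is exactly the limitation the next section describes.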

1.2 Bottlenecks and Pain Points

The drawback of this model was obvious: a single global queue.
If you had multiple OpenGL applications (different Contexts) running simultaneously, the driver had to carefully insert lengthy instructions for "saving Context A state -> restoring Context B state" into the same Ringbuffer. This led to extremely high context switch latency and made high-priority preemption nearly impossible.

2. The Mesozoic Era: Execlists (Software-Based Scheduling)

To support virtualization and more efficient multi-tasking concurrency, starting with the Gen8 (Broadwell) architecture, Intel introduced the LRC (Logical Ring Context) and Execlists (Execution Lists) mechanisms. The core code resides in gt/intel_execlists_submission.c.

2.1 Hardware Upgrade: One Ringbuffer Per Context

By this era, the hardware had finally grown smarter. The GPU allowed each independent Context to have its own private Ringbuffer, rather than everyone cramming into a single global queue.

2.2 The Driver Bears the "Scheduler" Burden

Although the hardware now supported multiple Contexts, it did not decide on its own which Context to run next. Instead, it exposed a register port called the ELSP (ExecList Submit Port).
The i915 driver was forced to become a complex software scheduler:

The driver maintained a red-black tree on the CPU side, sorting all pending i915_request objects by priority.
The driver used the CPU to calculate who should run next, then wrote the descriptors of the 1–2 highest-priority Contexts into the ELSP.
Upon receiving the descriptors, the GPU automatically performed a hardware-level Context switch (much faster than before) and began executing the Ringbuffer corresponding to that Context.
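
A minimal sketch of the "pick the next one or two Contexts" step, under simplifying assumptions: a linear scan over an array stands in for the driver's red-black tree, the two-entry `ports` array stands in for the ELSP, and every `toy_` name is hypothetical. The sketch also assumes the same Context never occupies both ports at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_PORTS 2u

/* Hypothetical, heavily reduced stand-in for a pending request. */
struct toy_request {
    uint32_t ctx_id;
    int prio;      /* larger value = more urgent */
    int submitted; /* non-zero once handed to a port */
};

/*
 * Fill up to two ports with the highest-priority pending requests.
 * Ties go to the earlier array entry (FIFO). Returns ports filled.
 */
static size_t toy_fill_ports(struct toy_request *reqs, size_t n,
                             uint32_t ports[TOY_NUM_PORTS])
{
    size_t filled = 0;

    assert(reqs != NULL && ports != NULL);
    while (filled < TOY_NUM_PORTS) {
        struct toy_request *best = NULL;

        for (size_t i = 0; i < n; i++) {
            struct toy_request *rq = &reqs[i];

            if (rq->submitted)
                continue;
            /* Assumption: one Context may not sit in both ports. */
            if (filled == 1 && rq->ctx_id == ports[0])
                continue;
            if (best == NULL || rq->prio > best->prio)
                best = rq;
        }
        if (best == NULL)
            break;
        best->submitted = 1;
        ports[filled++] = best->ctx_id;
    }
    return filled;
}
```

In this model the function has to run again every time a port drains or a more urgent request arrives, which is where the CPU cost discussed next comes from.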

Pain Point: Excessive CPU Overhead. Every time a task completed or a higher-priority task arrived, the GPU would send an interrupt to the CPU. The CPU had to immediately respond to the interrupt, recalculate the red-black tree, and write to the ELSP again. At very high game frame rates, this CPU-side scheduling overhead (Driver Overhead) became intolerable.

3. The Modern Era: GuC Hardware Microcontroller Scheduling

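To thoroughly solve the CPU scheduling bottleneck, Intel introduced the GuC (Graphics Microcontroller). The GuC hardware itself has existed since Gen9 (Skylake), but in i915 GuC submission only became the default on later Gen12-era platforms such as Alder Lake-P and the discrete DG1/DG2 parts; Tiger Lake and Rocket Lake still default to Execlists. In the newer Xe kernel driver, GuC is the only submission method. The i915 code resides in gt/uc/intel_guc_submission.c.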

3.1 What is GuC?

GuC is a small, low-power embedded microcontroller integrated directly into the GPU. It runs proprietary, closed-source firmware provided by Intel, which the driver loads during initialization. It takes over the scheduling work that was previously done by the i915 driver on the CPU.

3.2 True Asynchrony and Autonomy

In the GuC era, the interaction between the i915 driver and the hardware becomes extremely elegant:

Workqueue: The driver and GuC share a block of memory as a communication channel. In the current i915 code this channel is the CTB described in section 3.4; a separate Workqueue is used for parallel submission (section 3.4.4).
Doorbell: When a new draw command arrives from user mode, the i915 driver simply drops the request into the corresponding Context's Ringbuffer, leaves a note in the shared memory, and then "rings the doorbell" by writing an MMIO register to notify the GuC (an extremely lightweight operation).
Hands-Off Completely: After ringing the doorbell, the CPU can go off and do other things. The remaining tasks—context selection, priority preemption, and even load balancing between engines—are all computed and assigned in real-time by the GuC's internal firmware.

3.3 A Leap in Performance

With GuC submission:

The number of interrupt requests handled by the CPU is significantly reduced (from tens of thousands per second to a few thousand per second or even lower).
Preemption latency (from initiating preemption to the GPU actually switching) drops to the microsecond level, which is critical for VR and smooth desktop compositing (Wayland/KMS).

3.4 Implementation Architecture

GuC command submission is implemented primarily through two mechanisms: messages sent over the Command Transport Buffers (CTB), and state updates to the Logical Ring Context (LRC).

3.4.1 Core Components

guc_id: Each context managed by the GuC has a unique ID. The GuC uses this ID to identify and schedule different submission streams.
LRC (Logical Ring Context): The context state stored in memory. i915 notifies the hardware of new commands by updating the TAIL register mirror in the LRC.
CTB (Command Transport Buffers): A pair of circular buffers in memory shared between the Host and the GuC, one for each direction.
- H2G (Host to GuC): The driver sends commands or notifications to the GuC.
- G2H (GuC to Host): The GuC returns execution results or status updates.

3.4.2 Submission Flow (intel_guc_submission.c)

When an i915_request is ready to be submitted, the main function called is guc_submit_request.

A. Update LRC Tail

The driver first updates the LRC mirror in memory corresponding to the request via guc_set_lrc_tail:

static inline void guc_set_lrc_tail(struct i915_request *rq)
{
    rq->context->lrc_reg_state[CTX_RING_TAIL] =
        intel_ring_set_tail(rq->ring, rq->tail);
}

This step writes the latest Ring Buffer tail position to memory, but the GuC is not yet aware of it at this point.

B. Trigger Scheduling Notification

The driver then needs to inform the GuC that the context has new work. This is achieved through __guc_add_request:

Context Enablement: If the context has not yet been enabled in the GuC, it sends an INTEL_GUC_ACTION_SCHED_CONTEXT_MODE_SET message (carrying GUC_CONTEXT_ENABLE).
New Request Notification: If the context is already active, it sends an INTEL_GUC_ACTION_SCHED_CONTEXT message.
H2G Send: It calls intel_guc_send_nb (non-blocking send) to place the action code into the H2G circular buffer.
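
The branch between the two messages can be sketched as follows. This is a simplified illustration, not the driver's code: the `TOY_` constants are placeholder values rather than the real firmware ABI numbers, `toy_context` is hypothetical, and the sketch marks the context enabled immediately, whereas a real driver would presumably wait for the GuC to acknowledge the enable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder action codes; NOT the real GuC ABI values. */
enum toy_action {
    TOY_ACTION_SCHED_CONTEXT = 1,
    TOY_ACTION_SCHED_CONTEXT_MODE_SET = 2,
};
#define TOY_CONTEXT_ENABLE 1u

/* Hypothetical per-context state tracked by the host. */
struct toy_context {
    uint32_t guc_id;
    bool enabled; /* has scheduling been enabled in the GuC? */
};

/*
 * Build the H2G action for a context that has new work.
 * Writes up to 3 dwords into action[] and returns how many.
 */
static size_t toy_build_action(struct toy_context *ce, uint32_t action[3])
{
    size_t len = 0;

    assert(ce != NULL && action != NULL);
    if (!ce->enabled) {
        /* First submission: ask the GuC to enable scheduling. */
        action[len++] = TOY_ACTION_SCHED_CONTEXT_MODE_SET;
        action[len++] = ce->guc_id;
        action[len++] = TOY_CONTEXT_ENABLE;
        ce->enabled = true; /* simplification: no ack tracking */
    } else {
        /* Already active: just say "this context has new work". */
        action[len++] = TOY_ACTION_SCHED_CONTEXT;
        action[len++] = ce->guc_id;
    }
    return len;
}
```

The resulting dword array is what would then be handed to the non-blocking send path.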

C. Task Scheduling (Tasklet vs. Direct)

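To improve efficiency, the driver tries to submit directly from the caller's context and "bypass" the tasklet (the kernel's deferred-work mechanism), falling back to the tasklet only when necessary: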

If there is no current backlog and the context is already registered with the GuC, it directly calls guc_bypass_tasklet_submit.
If the conditions are not met, the request is placed into a priority queue via queue_request, and a tasklet is scheduled to handle it later.
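
The decision can be modeled in a few lines. This is a sketch under assumptions, with hypothetical `toy_` names: the backlog is reduced to a counter and the tasklet to a flag, so only the branching logic is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_path { TOY_PATH_DIRECT, TOY_PATH_QUEUED };

/* Hypothetical scheduler state: queued requests and a tasklet flag. */
struct toy_sched {
    size_t backlog;
    bool tasklet_scheduled;
};

/* Choose between direct submission and the queue-plus-tasklet path. */
static enum toy_path toy_submit(struct toy_sched *s, bool ctx_registered)
{
    assert(s != NULL);
    if (s->backlog == 0 && ctx_registered)
        return TOY_PATH_DIRECT; /* fast path: submit right away */

    /* Slow path: queue the request and let the tasklet handle it. */
    s->backlog++;
    s->tasklet_scheduled = true;
    return TOY_PATH_QUEUED;
}

/* Tasklet body: drain everything queued; returns requests handled. */
static size_t toy_tasklet_run(struct toy_sched *s)
{
    size_t handled;

    assert(s != NULL);
    handled = s->backlog;
    s->backlog = 0;
    s->tasklet_scheduled = false;
    return handled;
}
```

Note that once anything is queued, later requests also take the slow path even if their context is registered. Otherwise a new request could overtake one that is already waiting.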

3.4.3 Communication Mechanism (intel_guc_ct.c)

Messages are ultimately encapsulated via the CTB protocol. intel_guc_ct_send is responsible for:

Message Packing: Packing the Action ID and parameters into a format compliant with the GuC firmware interface.
Writing to Buffer: Copying the data into the H2G circular buffer.
Notifying the GuC: Writing to an MMIO register (e.g., HOST2GUC_INTERRUPT) to raise an interrupt that alerts the GuC to process the H2G message.
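
The three steps can be sketched with a toy H2G buffer. Everything here is an assumption made for illustration: the header layout (payload length in the upper 16 bits, action in the lower 16) is invented and is not the firmware's real message format, the buffer size is arbitrary, and a counter stands in for the MMIO write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CTB_DWORDS 16u

/* Hypothetical H2G buffer: the host writes at tail, the GuC reads at head. */
struct toy_ctb {
    uint32_t buf[TOY_CTB_DWORDS];
    uint32_t head;          /* advanced by the reader (GuC)      */
    uint32_t tail;          /* advanced by the writer (host)     */
    unsigned notifications; /* stands in for the MMIO write      */
};

/* Free dwords; one slot stays empty to tell "full" from "empty". */
static uint32_t toy_ctb_space(const struct toy_ctb *ctb)
{
    return (ctb->head + TOY_CTB_DWORDS - ctb->tail - 1u) % TOY_CTB_DWORDS;
}

/* Pack, copy into the ring, then notify. Returns 0, or -1 if full. */
static int toy_ct_send(struct toy_ctb *ctb, uint16_t action,
                       const uint32_t *payload, uint16_t len)
{
    uint32_t tail;

    assert(ctb != NULL && (payload != NULL || len == 0));
    if (toy_ctb_space(ctb) < (uint32_t)len + 1u)
        return -1; /* no room: the caller must retry later */

    /* 1. Message packing (invented header layout). */
    tail = ctb->tail;
    ctb->buf[tail] = ((uint32_t)len << 16) | action;
    tail = (tail + 1u) % TOY_CTB_DWORDS;

    /* 2. Writing to the buffer, wrapping at the end. */
    for (uint16_t i = 0; i < len; i++) {
        ctb->buf[tail] = payload[i];
        tail = (tail + 1u) % TOY_CTB_DWORDS;
    }

    /* Publish the new tail only after the whole message is written;
     * real shared memory would also need a write barrier here. */
    ctb->tail = tail;

    /* 3. Notify the reader. */
    ctb->notifications++;
    return 0;
}
```

Rejecting a message that does not fit, instead of blocking, mirrors the non-blocking send mentioned earlier: the caller keeps the request and tries again once the reader has advanced head.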

3.4.4 Parallel Submission (Multi-LRC)

For media engines or Virtual Engines, the GuC supports Multi-LRC mode. In this mode, the driver uses a Work Queue (WQ). It appends information for multiple LRCs to the WQ (guc_wq_item_append) and then notifies the GuC all at once to perform parallel scheduling.
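
To make the idea of "one Work Queue item describing several LRCs" concrete, here is a minimal sketch. The item layout (a length header, the guc_id, the parent's ring tail, then one ring tail per child) is an assumption made for this example and should not be read as the real firmware format; all `toy_` names are hypothetical, and the queue does not wrap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_CHILDREN 4u
#define TOY_WQ_DWORDS 32u

/* Hypothetical parent context with its child LRCs. */
struct toy_parallel_ctx {
    uint32_t guc_id;
    uint32_t parent_tail;
    uint32_t child_tail[TOY_MAX_CHILDREN];
    uint32_t n_children;
};

/* Hypothetical linear work queue (no wrap-around in this sketch). */
struct toy_wq {
    uint32_t items[TOY_WQ_DWORDS];
    size_t tail; /* next free dword */
};

/* Append one item covering the parent and all children.
 * Returns the number of dwords written, or 0 if it does not fit. */
static size_t toy_wq_item_append(struct toy_wq *wq,
                                 const struct toy_parallel_ctx *p)
{
    size_t total;
    uint32_t *wqi;

    assert(wq != NULL && p != NULL);
    assert(p->n_children <= TOY_MAX_CHILDREN);

    total = 3u + p->n_children; /* header + guc_id + parent + children */
    if (wq->tail + total > TOY_WQ_DWORDS)
        return 0;

    wqi = &wq->items[wq->tail];
    *wqi++ = (uint32_t)(total - 1u); /* header: dwords that follow */
    *wqi++ = p->guc_id;
    *wqi++ = p->parent_tail;
    for (uint32_t i = 0; i < p->n_children; i++)
        *wqi++ = p->child_tail[i];

    wq->tail += total;
    return total;
}
```

Because all the tails land in a single item, the GuC can be told about the whole group with one notification and start the engines together.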

Summary

The evolution of command submission reflects a classic truth of computer architecture: offload specialized work to specialized hardware.

Ringbuffer: Simple structure, with the CPU and GPU tightly coupled.
Execlists: Hardware supports multiple processes, but the CPU is forced to become a "head steward" with a heavy scheduling burden.
GuC: Introducing a dedicated microcontroller inside the GPU achieves thorough CPU offloading and ultra-low latency.

After understanding how tasks are submitted, you may have a question: Since GPU execution is completely asynchronous (especially in the GuC era, where the CPU rings the doorbell and walks away), how does the CPU know when a task has finished executing? If I want to send this frame to the display, how do I synchronize it?

That is the mystery we will unravel in the next lecture.
