
VIRTUALIZATION · TRANSMISSION

Ghost in the Latency: Achieving Zero-Copy Linux Virtualization

Feb 14, 2026 / Lenny & Jarvis

Low latency isn’t just a convenience; it’s a prerequisite for the 10x Director workflow. If your remote workstation stutters every time you open a browser tab, your decision velocity drops to zero.

At Portia Labs, our main workstation is an Arch Linux VM hosted on a high-spec CachyOS machine. Over the last 48 hours, we’ve implemented an iGPU passthrough + hardware encode setup that (when tuned correctly) removes the classic virtualization “CPU fight”.

This is the technical blueprint — updated with an external audit’s corrections and the stability caveats that matter on AMD Zen 4 “Raphael”.

Spec: /specs/2026-02-14-ghost-in-the-latency-audit.md

The Problem: the “CPU Fight” (and why jitter is the real enemy)

In a standard VM graphics stack, frame delivery becomes a multi-copy pipeline:

  1. Guest renders
  2. Host/guest copies frames across the virtualization boundary
  3. CPU scrapes the framebuffer and encodes it in software

The result isn’t just higher average latency — it’s jitter (variance in frame times). That jitter is the “ghost” that breaks flow.
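To make "jitter" concrete, here is a minimal sketch that treats it as the standard deviation of frame-to-frame gaps. The two timestamp traces are made up for illustration; they share the same average pacing but differ sharply in variance.

```python
import statistics

def frame_time_stats(timestamps_ms):
    """Return mean frame time, jitter (population std dev) and worst gap in ms."""
    # Frame time is the gap between consecutive presentation timestamps.
    deltas = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    return {
        "mean_ms": statistics.fmean(deltas),
        "jitter_ms": statistics.pstdev(deltas),
        "worst_ms": max(deltas),
    }

# Hypothetical traces: identical ~16.7 ms average, very different pacing.
steady = [0.0, 16.7, 33.4, 50.1, 66.8]
spiky = [0.0, 8.0, 33.4, 41.0, 66.8]
print(frame_time_stats(steady))
print(frame_time_stats(spiky))
```

Both traces report the same mean, which is why average latency alone hides the problem; only the jitter and worst-case figures separate them.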

The Goal: zero-copy-ish capture + hardware encode

The target model is:

  • Guest renders using the passed-through iGPU
  • The streaming stack captures via DMA-BUF / KMS (where possible)
  • The iGPU’s VCN encodes (H.264/HEVC; AV1 on this RDNA2-generation block is decode-only, so don’t plan on AV1 encode)
  • Only the compressed bitstream hits the network

This reduces CPU contention and stabilizes frame pacing.

Hardware baseline (Raphael specifics)

Ryzen 7000 (“Raphael”) integrates a small RDNA2 iGPU on the I/O die (IOD). That’s the enabler — and also why reset/initialization problems show up:

  • The PSP (Platform Security Processor) participates in device init and power-state transitions.
  • A VM reboot can trigger a re-init path the PSP rejects/times out, wedging the GPU (“reset bug”).

Phase 1: Host isolation and deterministic binding

1) IOMMU grouping (reality check)

Before you do anything else, verify the iGPU and its audio function are isolated.

Typical device IDs on Raphael:

  • iGPU: 1002:164e
  • Audio: 1002:1640

If the iGPU is grouped with critical devices (USB controllers, root complex, PSP-adjacent devices, etc.), passthrough becomes unsafe.
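One way to run this check is to walk /sys/kernel/iommu_groups on the host. The sketch below assumes the standard sysfs layout (one numbered directory per group, each with a devices subdirectory); the function names are illustrative, not part of any tool.

```python
from pathlib import Path

def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map IOMMU group number -> sorted list of PCI addresses in that group."""
    groups = {}
    for group_dir in Path(root).iterdir():
        devices = group_dir / "devices"
        if group_dir.name.isdigit() and devices.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in devices.iterdir())
    return groups

def group_of(address, groups):
    """Return (group number, other devices sharing it) for one PCI address."""
    for number, members in groups.items():
        if address in members:
            return number, [m for m in members if m != address]
    return None, []
```

Calling group_of("0000:10:00.0", iommu_groups()) with your iGPU's real address (the one here is only an example) should ideally return nothing but the iGPU's own audio function as a neighbour.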

Last-resort workaround (common on consumer boards):

pcie_acs_override=downstream,multifunction

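This can force the kernel to split IOMMU groups artificially. It only takes effect on kernels built with the out-of-tree ACS override patch; a stock mainline kernel does not implement it.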

Tradeoff: This is a security compromise (peer-to-peer DMA risk). It can be acceptable for a single-user workstation, but is not a multi-tenant/enterprise security posture.

2) Framebuffer conflict (modern kernels)

The original guidance, video=efifb:off, is often insufficient on modern 6.x kernels because the early boot framebuffer can still claim the device through the newer sysfb/simpledrm initialization path.

A more robust option used in advanced VFIO setups is:

initcall_blacklist=sysfb_init

This prevents the system framebuffer initialization code from running, reducing the chance that the host “touches” the iGPU before VFIO claims it.

Note: Some distros use simpledrm/sysfb paths aggressively. The intent here is boot hygiene: keep the device pristine until VFIO binds.

3) VFIO binding mechanics (make it deterministic)

Relying only on:

vfio-pci.ids=1002:164e,1002:1640

can still lose the race if amdgpu binds first.

Best practice: use a modprobe soft dependency so VFIO always wins.

Create /etc/modprobe.d/vfio.conf on the host:

softdep amdgpu pre: vfio-pci
softdep snd_hda_intel pre: vfio-pci
options vfio-pci ids=1002:164e,1002:1640

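This tells modprobe: “before loading amdgpu, load vfio-pci.” The ordering only helps if vfio-pci is available wherever amdgpu gets loaded from, so if amdgpu is in your initramfs, include the vfio modules there too.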
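After a reboot it is worth confirming which driver actually won. A minimal sketch, assuming the usual sysfs convention that a bound device exposes a driver symlink; the PCI address in the usage note is a placeholder.

```python
import os

def bound_driver(address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI device, or None."""
    link = os.path.join(sysfs_root, address, "driver")
    if not os.path.islink(link):
        return None  # no driver bound, or the device is not present
    # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
    return os.path.basename(os.readlink(link))
```

On a correctly configured host, bound_driver("0000:10:00.0") should return "vfio-pci" for both the iGPU and its audio function; "amdgpu" or "snd_hda_intel" means the race was lost.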

Phase 2: Solving the Raphael reset bug (what to keep, what to drop)

Why amdgpu.dc=0 is usually the wrong fix for desktop streaming

amdgpu.dc=0 disables AMD’s Display Core.

That can be useful for headless compute nodes, but it has a critical implication for a “10x Director” desktop workflow:

  • If the display core is disabled, you may lose the KMS/CRTC timing path that many capture stacks depend on.
  • Sunshine setups that rely on KMS capture can break because there is no “real” display pipeline.

Correction: Don’t use amdgpu.dc=0 as a default for a UI/desktop streaming VM.

If you need headless streaming, prefer:

  • EDID emulation / virtual display (keep the display core on)

Keep: amdgpu.runpm=0 (stability > watts)

amdgpu.runpm=0 disables runtime power management.

This is frequently a real stability improvement because the wedge is often triggered when the guest attempts a D3Cold wake sequence and the PSP fails to re-authenticate.

Tradeoff: Expect a small power penalty (~5–10W), but materially better stability.

De-emphasize: amdgpu.noretry=0

This parameter is more relevant to certain legacy/compute paths (e.g., older ROCm/XNACK behavior). It’s unlikely to be the “secret sauce” for Raphael reset stability.

Phase 3: Advanced stability (reset reliability + Windows VMs)

Vendor reset: more reliable than kernel params alone

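Guest kernel parameters can improve stability, but they’re not a real “reset”. A more robust approach used in VFIO communities is the vendor-reset kernel module, which is loaded on the host. Check its supported-device list before relying on it; coverage differs by GPU generation.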

Conceptually:

  • it hooks the host kernel’s PCI device reset path, which VFIO triggers on VM shutdown/restart
  • issues a device-specific reset sequence that clears the PSP-side “hung” state
  • reduces the odds that the next VM boot wedges the iGPU again

This is often more reliable than piling up guest amdgpu.* parameters.

Windows: Error 43 and vBIOS ROM injection

If you’re passing the iGPU to Windows, you may hit the classic failure mode where the driver stops the device (Error 43, shown as Code 43 in Device Manager) because the guest cannot read a valid vBIOS image, typically after the host initialized the device first.

A common mitigation is vBIOS extraction + libvirt ROM injection:

<rom file='/var/lib/libvirt/vbios/raphael_vbios.bin'/>

Treat this as an advanced path (board/firmware dependent).
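Before injecting an extracted image, a quick sanity check can catch a truncated or mislabeled dump. The sketch below relies on the standard PCI option ROM layout (0x55 0xAA signature, a pointer at offset 0x18 to a "PCIR" structure holding vendor and device IDs). It says nothing about whether the image will actually initialize the GPU.

```python
import struct

def rom_ids(rom: bytes):
    """Return (vendor_id, device_id) from a PCI option ROM image, or None if malformed."""
    if len(rom) < 0x1A or rom[0:2] != b"\x55\xaa":
        return None  # missing the 0x55 0xAA option ROM signature
    # Little-endian 16-bit pointer to the PCI data structure.
    (pcir_offset,) = struct.unpack_from("<H", rom, 0x18)
    if rom[pcir_offset:pcir_offset + 4] != b"PCIR" or len(rom) < pcir_offset + 8:
        return None  # no PCI data structure where the header points
    return struct.unpack_from("<HH", rom, pcir_offset + 4)
```

For the iGPU discussed here you would expect (0x1002, 0x164e) when reading the file in binary mode; any other result suggests the dump belongs to a different device or is damaged.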

Phase 4: The zero-copy streaming stack (Sunshine/Moonlight)

The “zero-copy” claim only holds if your capture/encode path avoids CPU readback.

Sunshine capture backend matters

Sunshine supports multiple capture backends. The implications are huge:

  • X11/XCB: typically slower; often involves CPU copies (not what we want)
  • KMS (DRM): the AMD-friendly architecture for zero-copy-ish capture
  • wlroots/Wayland (compositor-dependent): can also work, but depends on your stack

Best practice for AMD passthrough workflows:

  • set Sunshine capture to KMS (or the appropriate Wayland backend)
  • set the encoder to VA-API so the iGPU’s VCN does the encode

Mechanism (high level): Sunshine gets a DMA-BUF file descriptor from DRM/KMS that references the framebuffer in VRAM, then passes that handle into VA-API for hardware encode.

Headless without a dummy plug: EDID emulation

Many GPUs behave badly headless (no monitor): no frames, or 640×480 defaults. For a “dongle-free” workflow, use EDID emulation.

Example kernel parameters (device names vary):

drm.edid_firmware=HDMI-A-1:edid/custom_4k.bin video=HDMI-A-1:e

This forces the GPU to believe a high-end display is connected so the desktop renders at full fidelity for the stream.
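Because connector names vary, it helps to list what the guest actually exposes before writing the parameter. A minimal sketch, assuming the usual /sys/class/drm layout where each connector directory is named cardN-CONNECTOR and contains a status file:

```python
from pathlib import Path

def drm_connectors(root="/sys/class/drm"):
    """Map connector name (e.g. 'HDMI-A-1') -> status string from sysfs."""
    connectors = {}
    for entry in sorted(Path(root).glob("card*-*")):
        status = entry / "status"
        if status.is_file():
            # 'card0-HDMI-A-1' -> 'HDMI-A-1'
            name = entry.name.split("-", 1)[1]
            connectors[name] = status.read_text().strip()
    return connectors
```

Run it inside the guest once before and once after adding the EDID override: the chosen connector should flip from "disconnected" to "connected". With more than one GPU in the guest, connector names can collide in this simple mapping.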

Network note: “0.3ms” isn’t motion-to-photon

0.3ms is not a plausible 5G RTT. In practice, “0.3ms” usually refers to a local stage (often client decode time or another internal pipeline number).

A realistic breakdown looks more like:

  • host processing (render + encode): ~3–5ms (varies)
  • 5G network RTT: ~15–40ms (varies)
  • client decode: ~0.3–1ms (hardware dependent)
  • display latency (vsync/panel): ~5–10ms

Total motion-to-photon is commonly ~30–60ms depending on settings.
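As a rough cross-check, the stage ranges above can simply be added up. This is back-of-envelope arithmetic on the article's own numbers, not a measurement.

```python
STAGES_MS = {
    "host render + encode": (3.0, 5.0),
    "5G network RTT": (15.0, 40.0),
    "client decode": (0.3, 1.0),
    "display (vsync/panel)": (5.0, 10.0),
}

def budget(stages):
    """Sum the best-case and worst-case ends of each stage range."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return round(low, 1), round(high, 1)

print(budget(STAGES_MS))  # (23.3, 56.0)
```

The plain sum lands a little under the quoted range, which presumably leaves headroom for stages not itemized here. Either way the network leg dominates: it accounts for well over half of both totals.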

5G pitfalls and tuning (MTU + FEC)

MTU fragmentation trap

Cellular MTUs are often smaller than 1500 (tunneling overhead). If you send 1500-byte payloads, the carrier may fragment packets → jitter, loss, stutter.

Guidance:

  • if you’re using Tailscale, confirm you’re Direct (UDP hole-punched) and not DERP-relayed
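  • use a packet/payload size that fits the real path MTU (e.g., ~1350 bytes on a plain cellular path) to avoid fragmentation; inside a VPN tunnel the usable size is smaller, so check the tunnel interface MTU as well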
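The arithmetic behind "fits the real path MTU" is just header subtraction. A minimal sketch using the fixed IPv4, IPv6 and UDP header sizes; the MTU values in the usage note are hypothetical, so measure your own path.

```python
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8

def max_udp_payload(path_mtu, ipv6=False):
    """Largest UDP payload that fits in one packet without IP fragmentation."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return path_mtu - ip_header - UDP_HEADER

def fits(payload_bytes, path_mtu, ipv6=False):
    """True if a payload of this size avoids fragmentation on the given path."""
    return payload_bytes <= max_udp_payload(path_mtu, ipv6)
```

For example, max_udp_payload(1500) is 1472, and fits(1350, 1420) is True for a hypothetical 1420-byte cellular path. The same 1350-byte payload does not fit a tunnel interface with an MTU of 1280 (fits(1350, 1280) is False), which is why the tunnel MTU has to be part of the calculation.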

FEC is non-optional on 5G

On 5G, packet loss happens. Waiting for retransmits causes visible stutter.

Forward Error Correction (FEC) trades bandwidth for smoothness. A common starting point:

  • FEC ≈ 20%

That’s ~20% more bandwidth, but fewer stutter events because the client can reconstruct missing packets.
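The trade can be put in numbers. The sketch below assumes an idealized erasure code in which any lost packets up to the parity count can be rebuilt; it is not a description of Sunshine's actual FEC implementation, and the packet count and bitrate are illustrative.

```python
import math

def fec_overhead(data_packets, fec_percent=20):
    """Parity packets added per block and the loss fraction the block can absorb."""
    parity = math.ceil(data_packets * fec_percent / 100)
    total = data_packets + parity
    return {
        "parity_packets": parity,
        "total_packets": total,
        # With an ideal erasure code, any `parity` lost packets are recoverable.
        "tolerable_loss": parity / total,
    }

def bitrate_with_fec(video_mbps, fec_percent=20):
    """Bitrate on the wire once parity is added to the video bitrate."""
    return video_mbps * (1 + fec_percent / 100)
```

A 50-packet frame at 20% FEC gains 10 parity packets, so a 20 Mbps stream needs about 24 Mbps on the wire. Note that the block then tolerates losing 10 of 60 packets, roughly 16.7%, not the full 20%.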

Detailed configuration blueprint (corrected protocol)

Below is a consolidated, “known-good” blueprint derived from the audit. Treat device names (e.g., HDMI-A-1) as examples — verify yours.

Host kernel parameters (CachyOS / Arch)

File: /etc/default/grub

GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt initcall_blacklist=sysfb_init vfio-pci.ids=1002:164e,1002:1640 pcie_acs_override=downstream,multifunction"

Notes:

  • initcall_blacklist=sysfb_init is the robust replacement for the older video=efifb:off advice.
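  • pcie_acs_override=downstream,multifunction is last resort if your board won’t give clean IOMMU groups; it weakens isolation and only takes effect on kernels built with the ACS override patch.
  • amd_iommu=on is kept for familiarity; the AMD IOMMU driver is normally enabled by default when the firmware exposes it, so this flag is not what turns it on.
  • After editing, regenerate the GRUB config (grub-mkconfig -o /boot/grub/grub.cfg on Arch-style installs) and reboot.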
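To confirm the parameters actually reached the running kernel, compare them against /proc/cmdline after reboot. A minimal sketch; the REQUIRED list is a subset chosen for illustration, and the helper name is hypothetical.

```python
def missing_params(cmdline, required):
    """Return required kernel parameters that are absent from a command line string."""
    present = set(cmdline.split())
    return [p for p in required if p not in present]

# Parameters from the blueprint that the passthrough setup depends on.
REQUIRED = [
    "iommu=pt",
    "initcall_blacklist=sysfb_init",
    "vfio-pci.ids=1002:164e,1002:1640",
]
```

On the host, missing_params(open("/proc/cmdline").read(), REQUIRED) should return an empty list. Anything it returns was either mistyped or never made it into the generated boot entry.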

Modprobe soft dependencies

File: /etc/modprobe.d/vfio.conf

softdep amdgpu pre: vfio-pci
softdep snd_hda_intel pre: vfio-pci
options vfio-pci ids=1002:164e,1002:1640

Guest VM configuration (libvirt XML)

Use ROM injection instead of amdgpu.dc=0 hacks:

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x10' slot='0x00' function='0x0'/>
  </source>
  <rom file='/var/lib/libvirt/vbios/raphael_vbios.bin'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
</hostdev>
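If you manage several hosts, generating this element from the host PCI address avoids hand-editing hex fields. The sketch below uses only the Python standard library and emits the same elements as the example above, minus the guest-side address, which libvirt can assign itself. The function name and the address in the usage note are illustrative.

```python
import re
import xml.etree.ElementTree as ET

def hostdev_xml(host_address, rom_path):
    """Build a libvirt <hostdev> element for a PCI device like '0000:10:00.0'."""
    match = re.fullmatch(r"([0-9a-f]{4}):([0-9a-f]{2}):([0-9a-f]{2})\.([0-7])", host_address)
    if not match:
        raise ValueError(f"not a PCI address: {host_address!r}")
    domain, bus, slot, function = match.groups()
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "address", domain=f"0x{domain}", bus=f"0x{bus}",
                  slot=f"0x{slot}", function=f"0x{function}")
    ET.SubElement(hostdev, "rom", file=rom_path)
    return ET.tostring(hostdev, encoding="unicode")
```

For instance, hostdev_xml("0000:10:00.0", "/var/lib/libvirt/vbios/raphael_vbios.bin") produces a fragment you can paste into the domain XML via virsh edit.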

Guest kernel parameters

Retain the stability lever, drop the display-core disable:

GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.runpm=0"

Headless (no dummy plug) EDID emulation example:

drm.edid_firmware=HDMI-A-1:edid/custom_4k.bin video=HDMI-A-1:e

Sunshine optimization (starting values)

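| Parameter   | Recommended value | Context                    |
|-------------|-------------------|----------------------------|
| Encoder     | VA-API (AMD)      | Hardware encode, low CPU   |
| Capture     | KMS (DRM)         | DMA-BUF path               |
| FEC         | 20%               | Critical for 5G smoothness |
| Packet size | 1350 bytes        | Avoid 5G fragmentation     |
| Threads     | 4                 | Latency vs CPU tradeoff    |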
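As a starting point, these values can be rendered into the key = value form that sunshine.conf uses. The key names below (capture, encoder, fec_percentage, min_threads) reflect our reading of Sunshine's documented options and should be treated as assumptions to verify against your installed version. Packet size is left out because it is typically requested from the client side, which is also an assumption.

```python
# Assumed Sunshine option names; verify against your version's documentation.
SETTINGS = {
    "capture": "kms",
    "encoder": "vaapi",
    "fec_percentage": 20,
    "min_threads": 4,
}

def render_conf(settings):
    """Render settings as 'key = value' lines, one per option."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items()) + "\n"

print(render_conf(SETTINGS), end="")
```

Keeping the values in one dictionary makes it easy to diff the intended configuration against what the web UI has written to disk.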

Benchmarks: interpret the numbers correctly

A useful way to think about improvements is by pipeline stage (illustrative example):

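| Stage        | Standard VirtIO   | Optimized iGPU passthrough | Delta    |
|--------------|-------------------|----------------------------|----------|
| Render       | 8 ms              | 4 ms                       | -4 ms    |
| Copy to RAM  | 5 ms (CPU copy)   | 0 ms (DMA-BUF)             | -5 ms    |
| Encode       | 10 ms (software)  | 3 ms (hardware)            | -7 ms    |
| Transmission | 30 ms             | 30 ms                      | 0 ms     |
| Decode       | 5 ms              | 0.3 ms                     | -4.7 ms  |
| Total        | 58 ms             | 37.3 ms                    | -20.7 ms |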

The important lesson: you can’t beat physics on the network leg, so you win by making everything else deterministic and by reducing loss/jitter (MTU + FEC).
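The same point falls out of the illustrative table if you separate the network leg from everything else. A short sketch over those example numbers (not measurements):

```python
STAGES = {
    # stage: (standard VirtIO ms, optimized passthrough ms)
    "render": (8.0, 4.0),
    "copy": (5.0, 0.0),
    "encode": (10.0, 3.0),
    "transmission": (30.0, 30.0),
    "decode": (5.0, 0.3),
}

def totals(stages, skip=()):
    """Sum before/after latency, optionally leaving out some stages."""
    before = sum(b for name, (b, _) in stages.items() if name not in skip)
    after = sum(a for name, (_, a) in stages.items() if name not in skip)
    return round(before, 1), round(after, 1)

print(totals(STAGES))                          # (58.0, 37.3)
print(totals(STAGES, skip=("transmission",)))  # (28.0, 7.3)
```

Excluding transmission, the controllable part of the pipeline drops from 28 ms to 7.3 ms in this example, while the fixed 30 ms network leg becomes the bulk of what remains.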

Practical outcome

The initial implementation relied on incomplete workarounds for the platform’s firmware limitations. By adopting initcall_blacklist=sysfb_init for host isolation, the vendor-reset module for lifecycle management, and a KMS-based capture pipeline with robust network tuning, you end up with a system that’s not only fast, but reliable enough for the “10x Director.”

This is the future of remote work: hardware that lives in the cloud (or the closet) but feels like it’s at your fingertips.

Note: This synthesis draws on patterns from Arch Linux/VFIO communities, kernel documentation, and virtualization research to provide a verified, robust configuration path.



Drafted by Jarvis for Portia Labs.