Ghost in the Latency: Achieving Zero-Copy Linux Virtualization
Low latency isn’t just a convenience; it’s a prerequisite for the 10x Director workflow. If your remote workstation stutters every time you open a browser tab, your decision velocity drops to zero.
At Portia Labs, our main workstation is an Arch Linux VM hosted on a high-spec CachyOS machine. Over the last 48 hours, we’ve implemented an iGPU passthrough + hardware encode setup that (when tuned correctly) removes the classic virtualization “CPU fight”.
This is the technical blueprint — updated with an external audit’s corrections and the stability caveats that matter on AMD Zen 4 “Raphael”.
Spec: /specs/2026-02-14-ghost-in-the-latency-audit.md
The Problem: the “CPU Fight” (and why jitter is the real enemy)
In a standard VM graphics stack, frame delivery becomes a multi-copy pipeline:
- Guest renders
- Host/guest copies frames across the virtualization boundary
- CPU encodes / scrapes framebuffers
The result isn’t just higher average latency — it’s jitter (variance in frame times). That jitter is the “ghost” that breaks flow.
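To make "jitter" concrete, here is a minimal sketch that derives it from frame presentation timestamps. The function name and the sample timestamps are illustrative assumptions, not measurements from our setup; jitter is modeled simply as the standard deviation of frame-to-frame intervals.

```python
import statistics


def frame_time_stats(timestamps_ms):
    """Return (mean interval, jitter, worst interval) in milliseconds.

    Jitter is taken as the population standard deviation of the
    intervals between consecutive frame timestamps.
    """
    intervals = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    mean = statistics.fmean(intervals)
    jitter = statistics.pstdev(intervals)
    worst = max(intervals)
    return mean, jitter, worst


# Hypothetical timestamps: both streams average one frame every 16 ms.
steady = [i * 16.0 for i in range(7)]
bursty = [0.0, 8.0, 32.0, 40.0, 64.0, 72.0, 96.0]

print(frame_time_stats(steady))  # (16.0, 0.0, 16.0)
print(frame_time_stats(bursty))  # (16.0, 8.0, 24.0)
```

Both streams have the same average frame time, but only the second one would feel like stutter, which is why the mean alone hides the problem.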
The Goal: zero-copy-ish capture + hardware encode
The target model is:
- Guest renders using the passed-through iGPU
- The streaming stack captures via DMA-BUF / KMS (where possible)
- The iGPU’s VCN encodes (H.264/HEVC/AV1)
- Only the compressed bitstream hits the network
This reduces CPU contention and stabilizes frame pacing.
Hardware baseline (Raphael specifics)
Ryzen 7000 (“Raphael”) integrates a small RDNA2 iGPU on the I/O die (IOD). That’s the enabler — and also why reset/initialization problems show up:
- The PSP (Platform Security Processor) participates in device init and power-state transitions.
- A VM reboot can trigger a re-init path that the PSP rejects or times out on, wedging the GPU (“reset bug”).
Phase 1: Host isolation and deterministic binding
1) IOMMU grouping (reality check)
Before you do anything else, verify the iGPU and its audio function are isolated.
Typical device IDs on Raphael:
- iGPU: 1002:164e
- Audio: 1002:1640
If the iGPU is grouped with critical devices (USB controllers, root complex, PSP-adjacent devices, etc.), passthrough becomes unsafe.
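One way to run that check is to walk the kernel's sysfs view of IOMMU groups. The sketch below assumes the standard /sys/kernel/iommu_groups layout; the helper names are ours, and the PCI address in the usage note is only an example (take yours from lspci).

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map IOMMU group number -> sorted list of PCI addresses in it."""
    groups = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return groups  # IOMMU disabled, or sysfs not exposed here
    for group_dir in root_path.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(
                entry.name for entry in devices_dir.iterdir()
            )
    return groups


def group_of(pci_address, groups):
    """Return (group number, members) for a PCI address, or None."""
    for number, members in groups.items():
        if pci_address in members:
            return number, members
    return None
```

Calling `group_of("0000:10:00.0", iommu_groups())` on the host should ideally return a group whose members are only the iGPU and its audio function. Anything else in that list would be handed to the guest along with it.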
Last-resort workaround (common on consumer boards):
pcie_acs_override=downstream,multifunction
This can force the kernel to split IOMMU groups artificially.
Tradeoff: This is a security compromise (peer-to-peer DMA risk). It can be acceptable for a single-user workstation, but is not a multi-tenant/enterprise security posture.
2) Framebuffer conflict (modern kernels)
The original guidance video=efifb:off is often insufficient on modern 6.x kernels because the early boot console may still claim the device through newer framebuffer initialization paths.
A more robust option used in advanced VFIO setups is:
initcall_blacklist=sysfb_init
This prevents the system framebuffer initialization code from running, reducing the chance that the host “touches” the iGPU before VFIO claims it.
Note: Some distros use simpledrm/sysfb paths aggressively. The intent here is boot hygiene: keep the device pristine until VFIO binds.
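After a reboot it is worth confirming that the parameters actually reached the running kernel. A minimal sketch, assuming a plain whitespace-separated /proc/cmdline (quoted values containing spaces are not handled); the function names are illustrative.

```python
def read_cmdline(path="/proc/cmdline"):
    """Return the running kernel's command line as a single string."""
    with open(path) as handle:
        return handle.read().strip()


def missing_kernel_params(cmdline, required):
    """Return the required parameters that are absent from cmdline.

    Matching is exact per token, so 'iommu=pt' does not match
    'amd_iommu=on'.
    """
    present = set(cmdline.split())
    return [param for param in required if param not in present]
```

For example, `missing_kernel_params(read_cmdline(), ["iommu=pt", "initcall_blacklist=sysfb_init"])` should return an empty list on a correctly configured host.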
3) VFIO binding mechanics (make it deterministic)
Relying only on:
vfio-pci.ids=1002:164e,1002:1640
can still lose the race if amdgpu binds first.
Best practice: use a modprobe soft dependency so VFIO always wins.
Create /etc/modprobe.d/vfio.conf on the host:
softdep amdgpu pre: vfio-pci
softdep snd_hda_intel pre: vfio-pci
options vfio-pci ids=1002:164e,1002:1640
This tells modprobe: “before loading amdgpu (or snd_hda_intel), load vfio-pci,” so vfio-pci gets the first chance to claim the listed device IDs.
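To check which driver won the race, you can read the `driver` symlink that sysfs exposes for each PCI device. This is a sketch under the assumption of the usual /sys/bus/pci/devices layout; the helper name is ours and the address in the usage note is an example.

```python
import os


def bound_driver(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI device, or None.

    sysfs exposes the binding as a 'driver' symlink inside the device
    directory; an unbound device has no such link.
    """
    link = os.path.join(sysfs_root, pci_address, "driver")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))
```

On the host, `bound_driver("0000:10:00.0")` should report `vfio-pci` before the VM starts. If it reports `amdgpu`, the host driver claimed the iGPU first and the binding order needs another look.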
Phase 2: Solving the Raphael reset bug (what to keep, what to drop)
Why amdgpu.dc=0 is usually the wrong fix for desktop streaming
amdgpu.dc=0 disables AMD’s Display Core.
That can be useful for headless compute nodes, but it has a critical implication for a “10x Director” desktop workflow:
- If the display core is disabled, you may lose the KMS/CRTC timing path that many capture stacks depend on.
- Sunshine setups that rely on KMS capture can break because there is no “real” display pipeline.
Correction: Don’t use amdgpu.dc=0 as a default for a UI/desktop streaming VM.
If you need headless streaming, prefer:
- EDID emulation / virtual display (keep the display core on)
Keep: amdgpu.runpm=0 (stability > watts)
amdgpu.runpm=0 disables runtime power management.
This is frequently a real stability improvement because the wedge is often triggered when the guest attempts a D3Cold wake sequence and the PSP fails to re-authenticate.
Tradeoff: Expect a small power penalty (~5–10W), but materially better stability.
De-emphasize: amdgpu.noretry=0
This parameter is more relevant to certain legacy/compute paths (e.g., older ROCm/XNACK behavior). It’s unlikely to be the “secret sauce” for Raphael reset stability.
Phase 3: Advanced stability (reset reliability + Windows VMs)
Vendor reset: more reliable than kernel params alone
Guest kernel parameters can improve stability, but they’re not a real “reset”. A more robust approach used in VFIO communities is the vendor-reset kernel module.
Conceptually:
- it hooks VM shutdown/restart paths
- issues a device-specific reset sequence that clears the PSP-side “hung” state
- reduces the odds that the next VM boot wedges the iGPU again
This is often more reliable than piling up guest amdgpu.* parameters.
Windows: Error 43 and vBIOS ROM injection
If you’re passing the iGPU to Windows, you may hit the classic failure mode where the device is stopped (Error 43) because the VBIOS is missing/corrupt due to host-first initialization.
A common mitigation is vBIOS extraction + libvirt ROM injection:
<rom file='/var/lib/libvirt/vbios/raphael_vbios.bin'/>
Treat this as an advanced path (board/firmware dependent).
Phase 4: The zero-copy streaming stack (Sunshine/Moonlight)
The “zero-copy” claim only holds if your capture/encode path avoids CPU readback.
Sunshine capture backend matters
Sunshine supports multiple capture backends. The implications are huge:
- X11/XCB: typically slower; often involves CPU copies (not what we want)
- KMS (DRM): the AMD-friendly architecture for zero-copy-ish capture
- wlroots/Wayland (compositor-dependent): can also work, but depends on your stack
Best practice for AMD passthrough workflows:
- set Sunshine capture to KMS (or the appropriate Wayland backend)
- set the encoder to VA-API so the iGPU’s VCN does the encode
Mechanism (high level): Sunshine gets a DMA-BUF file descriptor from DRM/KMS that references the framebuffer in VRAM, then passes that handle into VA-API for hardware encode.
Headless without a dummy plug: EDID emulation
Many GPUs behave badly headless (no monitor): no frames, or 640×480 defaults. For a “dongle-free” workflow, use EDID emulation.
Example kernel parameters (device names vary):
drm.edid_firmware=HDMI-A-1:edid/custom_4k.bin video=HDMI-A-1:e
This forces the GPU to believe a high-end display is connected so the desktop renders at full fidelity for the stream.
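Because connector names vary, it helps to list what the guest kernel actually exposes before writing them into the kernel parameters. A minimal sketch, assuming the usual /sys/class/drm layout where each connector appears as a `cardN-CONNECTOR` directory with a `status` file; the function name is illustrative.

```python
from pathlib import Path


def drm_connectors(root="/sys/class/drm"):
    """Return a list of (card, connector, status) tuples.

    Example entry: ('card0', 'HDMI-A-1', 'disconnected').
    """
    found = []
    for entry in sorted(Path(root).glob("card*-*")):
        status_file = entry / "status"
        if status_file.is_file():
            # 'card0-HDMI-A-1' -> ('card0', 'HDMI-A-1')
            card, connector = entry.name.split("-", 1)
            found.append((card, connector, status_file.read_text().strip()))
    return found
```

Run inside the guest, the connector column gives the names to use in `drm.edid_firmware=` and `video=`.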
Network note: “0.3ms” isn’t motion-to-photon
0.3ms is not a plausible 5G RTT. In practice, “0.3ms” usually refers to a local stage (often client decode time or another internal pipeline number).
A realistic breakdown looks more like:
- host processing (render + encode): ~3–5ms (varies)
- 5G network RTT: ~15–40ms (varies)
- client decode: ~0.3–1ms (hardware dependent)
- display latency (vsync/panel): ~5–10ms
Total motion-to-photon is commonly ~30–60ms depending on settings.
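As a sanity check, the per-stage ranges above can simply be summed. This sketch uses only the figures from the list and treats the stages as strictly sequential, which is a simplifying assumption; real pipelines overlap some of this work.

```python
# (low, high) estimates in milliseconds, copied from the breakdown above.
STAGES_MS = {
    "host processing (render + encode)": (3.0, 5.0),
    "5G network RTT": (15.0, 40.0),
    "client decode": (0.3, 1.0),
    "display latency (vsync/panel)": (5.0, 10.0),
}


def latency_budget(stages):
    """Sum the best-case and worst-case estimates across all stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


low, high = latency_budget(STAGES_MS)
network_share = STAGES_MS["5G network RTT"][1] / high
print(f"budget: {low:.1f} to {high:.1f} ms")
print(f"network share at the high end: {network_share:.0%}")
```

The sum comes out at roughly 23 to 56 ms, in the same ballpark as the quoted total, with the network leg accounting for most of it at the high end.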
5G pitfalls and tuning (MTU + FEC)
MTU fragmentation trap
Cellular MTUs are often smaller than 1500 bytes because of tunneling overhead. If you send 1500-byte packets, they may be fragmented or dropped somewhere along the path, which shows up as jitter, loss, and stutter.
Guidance:
- if you’re using Tailscale, confirm you’re Direct (UDP hole-punched) and not DERP-relayed
- use a packet/payload size that fits the real path MTU (e.g., ~1350 bytes) to avoid fragmentation
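The arithmetic behind "fits the real path MTU" is short enough to sketch. The header sizes are the standard minimums (IPv4 without options, IPv6 without extension headers, UDP); the 1400-byte path MTU and the 60-byte tunnel overhead in the examples are hypothetical values, so measure your own path before trusting any number here.

```python
IPV4_HEADER = 20  # bytes, no IP options
IPV6_HEADER = 40  # bytes, no extension headers
UDP_HEADER = 8    # bytes


def max_udp_payload(path_mtu, ipv6=False, tunnel_overhead=0):
    """Largest UDP payload that fits in one packet without fragmentation.

    tunnel_overhead is whatever an encapsulating VPN or tunnel adds per
    packet (an assumption you must supply for your own setup).
    """
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return path_mtu - tunnel_overhead - ip_header - UDP_HEADER


def fits(payload, path_mtu, ipv6=False, tunnel_overhead=0):
    """True if a UDP payload of this size avoids fragmentation."""
    return payload <= max_udp_payload(path_mtu, ipv6, tunnel_overhead)


print(max_udp_payload(1500))                   # 1472
print(fits(1350, 1400))                        # True
print(fits(1350, 1400, tunnel_overhead=60))    # False
```

The last line is the case to watch: a payload that is safe on the bare cellular path can still fragment once a tunnel adds its own headers, so the packet size should be chosen against the innermost MTU.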
FEC is non-optional on 5G
On 5G, packet loss happens. Waiting for retransmits causes visible stutter.
Forward Error Correction (FEC) trades bandwidth for smoothness. A common starting point:
- FEC ≈ 20%
That’s ~20% more bandwidth, but fewer stutter events because the client can reconstruct missing packets.
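The bandwidth-for-smoothness trade can be put in numbers. This is a generic model, not taken from Sunshine's source: it assumes an ideal erasure code (Reed-Solomon style) in which a block survives as long as the number of lost packets does not exceed the number of parity packets. The 50-packet block and 20 Mbps bitrate are hypothetical inputs.

```python
import math


def fec_plan(data_packets, fec_percent):
    """Return (parity packets, total packets) for one FEC block."""
    parity = math.ceil(data_packets * fec_percent / 100)
    return parity, data_packets + parity


def bitrate_with_fec(video_mbps, fec_percent):
    """Bandwidth on the wire once parity data is added."""
    return video_mbps * (1 + fec_percent / 100)


parity, total = fec_plan(50, 20)
print(parity, total)                    # 10 60
print(f"{parity / total:.1%}")          # 16.7%
print(f"{bitrate_with_fec(20, 20):.1f} Mbps")
```

Under these assumptions a 20% FEC setting lets a block absorb up to 10 lost packets out of 60 sent (about 16.7%) without a retransmit, at the cost of roughly 24 Mbps on the wire for a 20 Mbps stream.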
Detailed configuration blueprint (corrected protocol)
Below is a consolidated, “known-good” blueprint derived from the audit. Treat device names (e.g., HDMI-A-1) as examples — verify yours.
Host kernel parameters (CachyOS / Arch)
File: /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt initcall_blacklist=sysfb_init vfio-pci.ids=1002:164e,1002:1640 pcie_acs_override=downstream,multifunction"
Notes:
- initcall_blacklist=sysfb_init is the robust replacement for the older video=efifb:off advice.
- pcie_acs_override=downstream,multifunction is a last resort if your board won’t give clean IOMMU groups; it weakens isolation.
Modprobe soft dependencies
File: /etc/modprobe.d/vfio.conf
softdep amdgpu pre: vfio-pci
softdep snd_hda_intel pre: vfio-pci
options vfio-pci ids=1002:164e,1002:1640
Guest VM configuration (libvirt XML)
Use ROM injection instead of amdgpu.dc=0 hacks:
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x10' slot='0x00' function='0x0'/>
  </source>
  <rom file='/var/lib/libvirt/vbios/raphael_vbios.bin'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
</hostdev>
Guest kernel parameters
Retain the stability lever, drop the display-core disable:
GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.runpm=0"
Headless (no dummy plug) EDID emulation example:
drm.edid_firmware=HDMI-A-1:edid/custom_4k.bin video=HDMI-A-1:e
Sunshine optimization (starting values)
| Parameter | Recommended value | Context |
|---|---|---|
| Encoder | VA-API (AMD) | Hardware encode, low CPU |
| Capture | KMS (DRM) | DMA-BUF path |
| FEC | 20% | Critical for 5G smoothness |
| Packet size | 1350 bytes | Avoid 5G fragmentation |
| Threads | 4 | Latency vs CPU tradeoff |
Benchmarks: interpret the numbers correctly
A useful way to think about improvements is by pipeline stage (illustrative example):
| Stage | Standard VirtIO | Optimized iGPU passthrough | Delta |
|---|---|---|---|
| Render | 8ms | 4ms | -4ms |
| Copy to RAM | 5ms (CPU copy) | 0ms (DMA-BUF) | -5ms |
| Encode | 10ms (software) | 3ms (hardware) | -7ms |
| Transmission | 30ms | 30ms | 0ms |
| Decode | 5ms | 0.3ms | -4.7ms |
| Total | 58ms | 37.3ms | -20.7ms |
The important lesson: you can’t beat physics on the network leg, so you win by making everything else deterministic and by reducing loss/jitter (MTU + FEC).
Practical outcome
The initial implementation relied on incomplete workarounds for the platform’s firmware limitations. By adopting initcall_blacklist=sysfb_init for host isolation, the vendor-reset module for lifecycle management, and a KMS-based capture pipeline with robust network tuning, you end up with a system that’s not only fast, but reliable enough for the “10x Director.”
This is the future of remote work: hardware that lives in the cloud (or the closet) but feels like it’s at your fingertips.
Note: This synthesis draws on patterns from Arch Linux/VFIO communities, kernel documentation, and virtualization research to provide a verified, robust configuration path.
Drafted by Jarvis for Portia Labs.