Unusual actions when attempting to route my 5700XT through QEMU via VFIO on Arch/Gnome 41
Before I go any further, I feel it's important to note one very pertinent detail: this is not a straightforward IOMMU passthrough. I'm using a guide that lets you pass a single GPU through to a VM by using a script that shuts down your DM, then unloads the driver before handing everything over to VFIO. Others with 5700XTs are reporting success using the exact method I am using, so I know it's possible.

So here's what happens. I run my Win 10 VM and QEMU automatically fires off a script which is supposed to do the following, in this order:

1) Kills the Gnome session and shuts down GDM
2) Sends a null request to all active ttys
3) Shuts down the audio subsystem, in my case pulseaudio (since it is used by snd_hda_intel for the GPU's HDMI audio stack)
4) Waits 6 seconds to give GDM time to release the GPU
5) Uses modprobe to remove both the amdgpu and snd_hda_intel kernel modules
6) Uses virsh nodedev-detach to unbind both pieces of hardware from the host kernel, using IOMMU group IDs
7) Uses modprobe to load the VFIO modules and hands both pieces of hardware over to VFIO for use in the VM

If I start the VM using the script, it appears to work: GDM shuts down and after a few seconds the card does get passed through to the VM. However, after booting into Windows I get the dreaded Error 43.

After literally hours of Google-fu and debugging I worked out how to make the script verbose and used my laptop to run it over SSH so I could monitor the output. What happens is that modprobe falls over when it tries to unload amdgpu, so when the VM starts amdgpu is still loaded as a kernel module, meaning the Windows driver cannot initialise the card correctly. I tried manually forcing the module to unload (modprobe -rf amdgpu) but all I get is "Unable to unload module, AMDGPU is currently in use". Dmesg is not much help either: it contains a VERY verbose stack trace, but unfortunately I don't speak C so it might as well be Swahili to me.
The only plain-English warning I can find says "Unable to unmount temperature sensor". I realise this is a very niche problem, but I'm really hoping someone can give me a clue about where to go from here. Another (anecdotal) observation: all the people who have reported success seem to be using either SDDM or LDM, and the few other cases I can find of people having issues all seem to be using GDM. Is it possible GDM has some kind of hook running that's preventing amdgpu from unloading? One last (maybe) relevant thing: I'm using the TKG-PDS kernel. Thanks.
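For readers following along, the seven steps described in the post could be sketched roughly as below. This is not the guide's actual script, only an illustration under several assumptions: the `run` and `prepare_passthrough` helpers and the `DRY_RUN` switch are made up for this example, the `pci_0000_0c_00_0` / `pci_0000_0c_00_1` device names are placeholders (real ones come from `virsh nodedev-list --cap pci`), step 2 is assumed to mean unbinding the virtual console, and `pkill` is just one possible way to stop pulseaudio.

```shell
#!/usr/bin/env bash
# Sketch of a single-GPU passthrough "prepare" hook, following the seven
# steps listed in the post. Illustration only, not the guide's real script.

# DRY_RUN=1 (the default here) only prints each command. Set DRY_RUN=0 and
# run as root to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Placeholder libvirt node-device names for the GPU and its HDMI audio
# function (hypothetical values, replace with your own).
GPU_VIDEO="${GPU_VIDEO:-pci_0000_0c_00_0}"
GPU_AUDIO="${GPU_AUDIO:-pci_0000_0c_00_1}"

# Print a command, and execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

prepare_passthrough() {
    # 1) Stop the display manager (and with it the GNOME session).
    run systemctl stop gdm.service
    # 2) Release the virtual console (assumed meaning of "null request to ttys").
    run sh -c 'echo 0 > /sys/class/vtconsole/vtcon0/bind'
    # 3) Stop the audio daemon that keeps the HDMI audio device open.
    run pkill pulseaudio
    # 4) Give the display manager time to let go of the GPU.
    run sleep 6
    # 5) Unload the host drivers; abort if either module refuses to unload.
    run modprobe -r amdgpu || return 1
    run modprobe -r snd_hda_intel || return 1
    # 6) Detach both PCI functions from the host.
    run virsh nodedev-detach "$GPU_VIDEO"
    run virsh nodedev-detach "$GPU_AUDIO"
    # 7) Load the VFIO PCI driver for use by the VM.
    run modprobe vfio_pci
}
```

Calling `prepare_passthrough` with the default `DRY_RUN=1` just lists the commands in order, which is a cheap way to review the sequence over SSH. The `|| return 1` on step 5 makes the hook stop at the point described in the post, where amdgpu refuses to unload, instead of carrying on and starting the VM with the host driver still attached.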
It seems to be related to GTK and possibly GNOME 41. Mutter uses GPU acceleration, as GTK4 applications commonly do. It's worth noting that some tricks might not work anymore with the latest GTK versions. Just to check I'm reading this correctly: you're on a desktop PC running Arch Linux with Gnome 41, and you boot a Windows VM with the GPU passed through to it?
Yeah, I had the same idea as well. I have already tried disabling lm_sensors, but no dice. After spending more time on it yesterday, I think that message might be a red herring. It's hard to tell, but it looks like it is fired off during systemd initialisation (while the system is booting up): it only ever appears during boot, and no matter how many times I run the script or try manually removing amdgpu, it never fires again.

I spent hours googling yesterday and I think I have found the issue: it's apparently a bug in the drm kernel module. A kernel patch is available, but I really don't fancy manually rebuilding the kernel from scratch just to try a fix that may or may not work. Plus, since I'm on Arch, doing so would be pointless anyway: a new kernel version is released at least once a week, and each update would undo the patch.

I had reached the same conclusion before I found the post: by using "modprobe -rf amdgpu --needed-dependencies" I was able to trace the issue back to drm as the module that is refusing to unload. I guess I'll wait until the patch I found is upstreamed into the mainline kernel; possibly it will come with 5.15.

Here's the source I found for the drm issue: https://gitlab.freedesktop.org/drm/amd/-/issues/1081

Edit: also, yeah, it's a single GPU passthrough using a custom QEMU hook package.
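For anyone wanting to check which module is pinned on their own machine, here is a minimal sketch that reads the reference counts the kernel exposes under /sys/module. It assumes the modules are loadable (built-in drivers have no refcnt file). The helper names `module_refcnt` and `module_holders` and the `SYSFS_ROOT` override are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: inspect why a kernel module will not unload by reading sysfs.
# SYSFS_ROOT is overridable only so the helpers can be pointed at a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the module's reference count, or "not loaded" if there is no entry.
module_refcnt() {
    if [ -r "$SYSFS_ROOT/module/$1/refcnt" ]; then
        cat "$SYSFS_ROOT/module/$1/refcnt"
    else
        echo "not loaded"
    fi
}

# List the modules that depend on (hold a reference to) the given module.
module_holders() {
    if [ -d "$SYSFS_ROOT/module/$1/holders" ]; then
        ls "$SYSFS_ROOT/module/$1/holders"
    fi
}

# Example usage:
#   module_refcnt amdgpu    # non-zero means something still uses the driver
#   module_holders drm      # modules that must be unloaded before drm
```

A non-zero refcnt on amdgpu with an empty holders list suggests that something other than a dependent module still has the device open. If that something is a userspace process, `fuser -v /dev/dri/*` or `lsof /dev/dri/card0` (run as root) should show which one.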
5.15 has already met its merge schedule, so unless a merge request was made before yesterday there's likely no chance of the fix coming with it. It's possible to ask your distro to carry the patch in the meantime. But compiling a kernel isn't really that hard (then again, I run Gentoo and compile a new kernel about every week).