Comprehensive guide to performance optimizations for gaming on virtual machines with KVM/QEMU and PCI passthrough

Preamble

This guide describes performance optimizations for gaming on a virtual machine (VM) with GPU passthrough.

To optimize the user experience for virtualized gaming, I started to pursue low latency and high performance, especially since my main focus is demanding online multiplayer games.

The Benchmarking Process

I started benchmarking several setups (found on Level1Techs, Reddit and the Arch Wiki) using the Superposition demo to compare performance results, and LatencyMon 6.70 to measure the input lag.

Unfortunately, after several hours of testing I had to learn that those two categories weren’t enough to guarantee a decent gaming experience. Additionally, even when I had one game running smoothly it was not guaranteed that others were not stuttering.

I had to extend my observations.

Instead of testing the performance with the Superposition demo (measuring FPS only), I used three games to measure my progress.

  • PlayerUnknown’s Battlegrounds (PUBG)
    • which is CPU demanding
  • Apex Legends
    • which is GPU demanding
  • Blizzard’s Overwatch (OW)
    • which has a pretty low profile

As the previous tests had shown, FPS alone is not a reliable performance measurement in a virtual gaming environment.

Thus I came up with the following rules:

  1. Input lag – the smallest possible input latency is crucial for gaming.
  2. Consistency – no freezes and/or stuttering should occur
  3. Performance – more is better, gaming performance is measured in frames per second (fps)
  4. Stability – there shall be no crashes!
  5. Compatibility – don’t get banned, work fluently with anti-cheat tools etc.

The rules are ordered by priority: 1 is more important than 2, which is more important than 3, and so on.

Benchmarking Results

I created two libvirt XML files during the tests, one for the i440fx and one for the Q35 chipset.

I reached the better gaming experience with the i440fx, Windows 10 1803 configuration.

Passages from QEMU 4.1, i440fx chipset libvirt config file

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>win10-1803-i440fx</name>
  <uuid>073f2a4e-5ab2-4bc7-99c2-2ac006adc87e</uuid>
  <memory unit='KiB'>16777216</memory>
  <currentMemory unit='KiB'>16777216</currentMemory>
  <memoryBacking> <!-- hugepages enable -->
    <hugepages/>
  </memoryBacking>
  <vcpu placement='static'>8</vcpu>
  <iothreads>2</iothreads>
  <cputune>
    <vcpupin vcpu='0' cpuset='8'/>
    <vcpupin vcpu='1' cpuset='9'/>
    <vcpupin vcpu='2' cpuset='10'/>
    <vcpupin vcpu='3' cpuset='11'/>
    <vcpupin vcpu='4' cpuset='12'/>
    <vcpupin vcpu='5' cpuset='13'/>
    <vcpupin vcpu='6' cpuset='14'/>
    <vcpupin vcpu='7' cpuset='15'/>
    <emulatorpin cpuset='0-3'/>
    <iothreadpin iothread='1' cpuset='0-1'/>
    <iothreadpin iothread='2' cpuset='2-3'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-i440fx-4.1'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.fd</loader>
       [...]
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vpindex state='on'/>
      <synic state='on'/>
      <stimer state='on'/>
      <reset state='on'/>
      <vendor_id state='on' value='1234567890ab'/> <!-- nvidia error code 43 prevention -->
      <frequencies state='on'/>
    </hyperv>
    <kvm>
      <hidden state='on'/> <!-- nvidia error code 43 prevention -->
    </kvm>
    <vmport state='off'/>
    <ioapic driver='kvm'/> <!-- required for QEMU 4.0 or later -->
  </features>
  <cpu mode='custom' match='exact' check='none'>
    <model fallback='allow'>EPYC</model>
    <topology sockets='1' cores='4' threads='2'/>
    <feature policy='require' name='topoext'/>
    <feature policy='require' name='svm'/>
    <feature policy='require' name='apic'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='invtsc'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' present='no' tickpolicy='catchup'/>
    <timer name='pit' present='no' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
    <timer name='hypervclock' present='yes'/>
    <timer name='tsc' present='yes' mode='native'/>
  </clock>
  [...]
  <devices>
    <emulator>/usr/local/bin/qemu4.1-system-x86_64</emulator>
    [...]
  </devices>
  <qemu:commandline>
    <qemu:env name='QEMU_AUDIO_DRV' value='pa'/>
    <qemu:env name='QEMU_PA_SAMPLES' value='8192'/>
    <qemu:env name='QEMU_AUDIO_TIMER_PERIOD' value='99'/>
    <qemu:env name='QEMU_PA_SERVER' value='/run/user/1000/pulse/native'/>
  </qemu:commandline>
</domain>

With the Q35 settings I reach lower input latency (with stutters though).

These are my settings for Q35, Windows 10 1903.

Passages from QEMU 4.1, Q35 chipset libvirt config file

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>win10-1903-q35</name>
  <uuid>cc37803d-a904-44cd-a333-5830ce22d20f</uuid>
  <memory unit='KiB'>16777216</memory>
  <currentMemory unit='KiB'>16777216</currentMemory>
  <memoryBacking> <!-- hugepages enable -->
    <hugepages/>
  </memoryBacking>
  <vcpu placement='static'>8</vcpu>
  <iothreads>2</iothreads>
  <cputune>
    <vcpupin vcpu='0' cpuset='8'/>
    <vcpupin vcpu='1' cpuset='9'/>
    <vcpupin vcpu='2' cpuset='10'/>
    <vcpupin vcpu='3' cpuset='11'/>
    <vcpupin vcpu='4' cpuset='12'/>
    <vcpupin vcpu='5' cpuset='13'/>
    <vcpupin vcpu='6' cpuset='14'/>
    <vcpupin vcpu='7' cpuset='15'/>
    <emulatorpin cpuset='0-3'/>
    <iothreadpin iothread='1' cpuset='0-1'/>
    <iothreadpin iothread='2' cpuset='2-3'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-q35-4.1'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.fd</loader>
       [...]
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vpindex state='on'/>
      <synic state='on'/>
      <stimer state='on'/>
      <reset state='on'/>
      <vendor_id state='on' value='1234567890ab'/> <!-- nvidia error code 43 prevention -->
      <frequencies state='on'/>
    </hyperv>
    <kvm>
      <hidden state='on'/> <!-- nvidia error code 43 prevention -->
    </kvm>
    <vmport state='off'/>
    <ioapic driver='kvm'/> <!-- required for QEMU 4.0 or later -->
  </features>
  <cpu mode='custom' match='exact' check='none'>
    <model fallback='allow'>EPYC</model>
    <topology sockets='1' cores='4' threads='2'/>
    <feature policy='require' name='topoext'/>
    <feature policy='require' name='svm'/>
    <feature policy='require' name='apic'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='invtsc'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' present='no' tickpolicy='catchup'/>
    <timer name='pit' present='no' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
    <timer name='kvmclock' present='no'/>
    <timer name='hypervclock' present='yes'/>
    <timer name='tsc' present='yes' mode='native'/>
  </clock>
  [...]
  <devices>
    <emulator>/usr/local/bin/qemu4.1-system-x86_64</emulator>
    [...]
  </devices>
  <qemu:commandline>
    <qemu:env name='QEMU_AUDIO_DRV' value='pa'/>
    <qemu:env name='QEMU_PA_SAMPLES' value='8192'/>
    <qemu:env name='QEMU_AUDIO_TIMER_PERIOD' value='99'/>
    <qemu:env name='QEMU_PA_SERVER' value='/run/user/1000/pulse/native'/>
  </qemu:commandline>
</domain>

The following chapters give you some optimization tips for host and guest, and hopefully some insight into the reasoning behind the chosen libvirt settings.

Overview – the GPU Passthrough Setup

I am using an AMD Ryzen platform for my GPU passthrough setup. This means some optimizations are especially (or only) relevant to Ryzen CPUs, while others are relevant to any system. I will try to mark these settings throughout the article.

Hardware components

  • CPU: AMD Ryzen 7 1800x (8 Core @3.6GHz)
  • RAM: 32GB DDR4 RAM (@2800MHz)
  • Mainboard: ASUS Prime x370 pro (BIOS version 4207)

Attention! The ASUS Prime x370 pro BIOS versions that add Ryzen 3000-series support (up to the currently latest version 5220, and later) will break a PCI passthrough setup with the error “Unknown PCI header type ‘127’”. BIOS versions up to (and including) 4406 are working.

Software Components

The Ubuntu host

  • OS: (X)Ubuntu 18.04
  • Kernel: 5.3.6
  • Hypervisor: QEMU version 4.1
  • Manager: Libvirt version 4.7

The Windows virtual machine (guest)

  • Windows 10 version 1903 on Q35 chip
  • Windows 10 version 1803 on i440fx chip
  • Nvidia Driver version 436.68

Attention! A known bug for libvirt and Windows 10 1903: Do not use ich6/ich9 audio devices in the virtual machine, as they cause awful stuttering and performance loss. Using ac97 audio fixes this issue.

Host OS Optimization

CPU Governor Settings

Right after your host system has booted, the CPU governor is usually set to “ondemand”. When a process requires a boost, the CPU is allowed to clock up; otherwise it saves energy.

Unfortunately, triggering the boost from within a virtual machine did not work consistently in my tests.

Thus, I force the CPU governor to “performance” on the host while the virtual machine is running. The downside is higher energy consumption.

I use this bash script to enable performance.

#!/bin/bash
# Run as root (writes to sysfs). Prints the governors before and after the change.
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo "performance" > "$file"
done
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

I use this bash script to switch back to ondemand afterwards.

#!/bin/bash
# Run as root (writes to sysfs). Prints the governors before and after the change.
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo "ondemand" > "$file"
done
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Unfortunately I have lost the exact source to give credit for the code. source1, source2
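
Both scripts differ only in the governor name, so they could be folded into one function. The following is a minimal sketch, not the original script: the function name set_governor and the CPU_ROOT override are made up for illustration. It additionally checks scaling_available_governors before writing, because not every cpufreq driver offers “ondemand”. Like the scripts above, it has to run as root.

```shell
#!/bin/bash
# Sketch: set one governor on all CPUs, but only if every CPU offers it.
# CPU_ROOT is a hypothetical override so the function can be tried on a fake tree.
set_governor() {
    local governor="$1"
    local root="${CPU_ROOT:-/sys/devices/system/cpu}"
    local dir
    # First pass: verify that every CPU lists the requested governor.
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] || { echo "no cpufreq directory below $root" >&2; return 1; }
        # scaling_available_governors is a space-separated list of names
        if ! grep -qw -- "$governor" "$dir/scaling_available_governors"; then
            echo "governor '$governor' not available in $dir" >&2
            return 1
        fi
    done
    # Second pass: apply the governor.
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        echo "$governor" > "$dir/scaling_governor" || return 1
    done
}
```

Calling set_governor performance before starting the VM and set_governor ondemand after shutting it down mirrors the two scripts above.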

QEMU and LIBVIRT Versions

QEMU version 3.1 or higher is recommended as it adds improved SMT support for Ryzen CPUs. The QEMU version shipped with Ubuntu 18.04 LTS is currently 2.12.0, so I recommend upgrading, either by building QEMU on your own (read here) or by installing it from a custom PPA.

Why do we need a higher libvirt version? Not all libvirt versions support all XML elements. You can check the libvirt documentation for the supported elements and the version in which each one was introduced.
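
To compare an installed version against the recommended minimum, something like the following sketch can be used. The helper names version_ge and first_version are made up for illustration, and the comparison relies on GNU sort -V. The binary name in the usage comment is an assumption; a self-built QEMU may be installed under a different name.

```shell
#!/bin/bash
# version_ge A B: succeeds if version A is greater than or equal to version B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# first_version: read text on stdin and print the first dotted version number,
# e.g. "4.1.0" from "QEMU emulator version 4.1.0".
first_version() {
    grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# Usage (binary name is an assumption, adjust it to your installation):
#   qemu_ver=$(qemu-system-x86_64 --version | first_version)
#   version_ge "$qemu_ver" 3.1 && echo "QEMU $qemu_ver is new enough"
#   virsh --version    # prints the libvirt version
```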

Clocksource

Make sure ‘tsc’ is set as clock source. You can check this via:

cat /sys/devices/system/clocksource/clocksource0/current_clocksource

This should return ‘tsc’.
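
The same check can be wrapped in a small function that also lists the alternatives the kernel offers. This is only a sketch: check_clocksource is a hypothetical helper name, and the optional directory argument exists just so it can be tried against a copy of the files. If the kernel considers the TSC unreliable it falls back to another source such as hpet, which is the situation this check is meant to catch.

```shell
#!/bin/bash
# check_clocksource [DIR]: print the active and the available clock sources,
# succeed only if tsc is the active one.
check_clocksource() {
    local dir="${1:-/sys/devices/system/clocksource/clocksource0}"
    local current
    current=$(cat "$dir/current_clocksource") || return 2
    echo "current:   $current"
    echo "available: $(cat "$dir/available_clocksource")"
    [ "$current" = "tsc" ]
}
```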

Guest OS Optimization

The optimization recommendations in this chapter are not solely relevant for virtual machines. They can be applied to Windows in general.

Windows 10 1903 is used on purpose, because it brings better Ryzen SMT support.

Enable MSI Interrupts

One can enable MSI interrupts for passed-through hardware with MSI_util_v2. Get it from CHEF-KOCH’s GitHub repository.

Spectre Patches (optional)

It is possible to disable the Spectre patches in the system using the InSpectre tool (downside: less secure). Heiko Sieger has an article about the topic.

Virtual machine Configuration Optimizations

Five sections in the libvirt virtual machine configuration are crucial for optimizing the virtual machine’s performance:

  • CPU pinning
  • CPU model information
  • Hyper-V enlightenments
  • Clock settings
  • Hugepages

CPU Pinning

CPU pinning reserves CPU cores mainly (or solely) for guest tasks while the guest is running. If everything works as expected, the host will not use the CPU cores allocated to the guest.

One could go even further and restrict access to the guest cores completely, even if the guest isn’t running. This would use the isolcpus kernel command line flag at boot time. I do not use this feature as I would like to have maximum host performance if the guest is not running.

AMD Ryzen CPU architecture

Ryzen CPU architecture

The AMD Ryzen CPU used here houses 8 physical cores, each core capable of handling two threads. This leads to a total of 16 logical CPUs (threads) available for pinning. The 8 cores are separated into two complexes of 4 cores each, called CCX. Each CCX has its own L3 cache.

Core separation between host and guest system

The plan is to have one CCX for the host and one CCX for the guest. As the host runs first, I’ll assume it will use the first CCX with cores 0-3. The second CCX (cores 4-7) shall be used for the virtual machine; on my system these physical cores show up as logical CPUs 8-15, which are the values used in the pinning settings. I used 12 pinned vCPUs (6 cores) for the guest for half a year. Sometimes I encountered micro lag spikes which I couldn’t track down. Then I switched to 8 vCPUs (4 cores).

The benchmarks indicated that the 6-core pinning had better CPU Mark results and slightly higher FPS, but once I switched to the CCX separation the lag spikes went away.
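
Which logical CPUs belong to the same physical core (and to the same CCX) can differ between systems and BIOS versions, so it is worth checking before pinning. The following sketch reads the layout from sysfs. list_cpu_layout is a hypothetical helper name, and treating cache/index3 as the L3 cache is an assumption that holds on typical x86 systems; lscpu -e shows similar information.

```shell
#!/bin/bash
# list_cpu_layout [ROOT]: for each logical CPU print the thread siblings
# (same physical core) and the CPUs sharing its L3 cache (same CCX on Ryzen).
list_cpu_layout() {
    local root="${1:-/sys/devices/system/cpu}"
    local dir siblings l3
    for dir in "$root"/cpu[0-9]*; do
        [ -r "$dir/topology/thread_siblings_list" ] || continue
        siblings=$(cat "$dir/topology/thread_siblings_list")
        l3="n/a"
        # index3 is assumed to be the L3 cache
        if [ -r "$dir/cache/index3/shared_cpu_list" ]; then
            l3=$(cat "$dir/cache/index3/shared_cpu_list")
        fi
        echo "${dir##*/}: siblings=$siblings l3=$l3"
    done | sort -V
}
```

All vCPUs of the guest should end up on logical CPUs that report the same l3 value, and both siblings of a core should be pinned together.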

Here are my settings:

AMD Ryzen CPU Pinning Recommendation for Optimal Gaming Performance

In order to edit the virtual machine’s configuration use: virsh edit your-windows-vm-name

Once you’re done editing, save the changes and leave the editor (in nano: CTRL+x, then y, then Enter). First of all find the very first line, which should read:

<domain type='kvm'>

and replace it with:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

Now find the line which ends with </vcpu> and add the following block after it (the <vcpu> line itself is repeated here for orientation):

<vcpu placement='static'>8</vcpu>
<iothreads>2</iothreads>
<cputune>
    <vcpupin vcpu='0' cpuset='8'/>
    <vcpupin vcpu='1' cpuset='9'/>
    <vcpupin vcpu='2' cpuset='10'/>
    <vcpupin vcpu='3' cpuset='11'/>
    <vcpupin vcpu='4' cpuset='12'/>
    <vcpupin vcpu='5' cpuset='13'/>
    <vcpupin vcpu='6' cpuset='14'/>
    <vcpupin vcpu='7' cpuset='15'/>
    <emulatorpin cpuset='0-1'/>
    <iothreadpin iothread='1' cpuset='0-1'/>
    <iothreadpin iothread='2' cpuset='2-3'/>
</cputune>

I have tested <vcpusched vcpus='0' scheduler='fifo' priority='1'/> for every pinned core but removed it eventually. Without it, the guest felt more responsive. I have no hard benchmark numbers to prove this though.

Remark! Make sure <vcpu>, <iothreads> and <cputune> have the same indent.

Remark! Make sure the pinned cores match the CPU topology from below.

CPU Model Information

The chapter above gave us some insights into the AMD Ryzen CPU structure. It is a good thing if the guest operating system also knows about the structure.

The libvirt CPU model and topology settings are used to make the guest aware of the CPU specifications (such as CCX layout, cache size, etc.).

CPU Mode and Cache

For Ryzen CPUs the model definition “EPYC” can be used; this is recommended for QEMU version 3.1 and below. It defines the structure and caches of the CPU.

For QEMU versions above 3.1, host-passthrough is also feasible.

Attention! Windows releases newer than 1803 require the kvm option ignore_msrs=1 to be enabled on the host, otherwise a BSOD occurs.
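
Whether the option is active can be read from sysfs, where the kernel prints boolean module parameters as Y or N. A minimal sketch follows; ignore_msrs_enabled is a hypothetical helper name and the optional file argument only exists for trying it out.

```shell
#!/bin/bash
# ignore_msrs_enabled [FILE]: succeeds if the kvm module parameter is on.
ignore_msrs_enabled() {
    local file="${1:-/sys/module/kvm/parameters/ignore_msrs}"
    [ "$(cat "$file" 2>/dev/null)" = "Y" ]
}

# One common way to make the setting persistent is a modprobe option.
# The file name kvm.conf is my choice here, any *.conf name works:
#   echo "options kvm ignore_msrs=1" | sudo tee /etc/modprobe.d/kvm.conf
# It takes effect once the kvm module is reloaded or the host is rebooted.
```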

CPU Topology

This is where the actual number of cores is defined. Using 1 socket with 4 cores and 2 threads will tell the guest operating system that it has access to one 4-core CPU with hyperthreading (SMT).

It is important that the topology matches the pinning from above: sockets × cores × threads (here 1 × 4 × 2 = 8) has to equal the number of pinned vCPUs.
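
This consistency rule can be checked mechanically. The following is a rough sketch, assuming the domain XML was saved with virsh dumpxml your-windows-vm-name > vm.xml and contains one element per line. check_topology is a hypothetical helper; it only multiplies sockets, cores and threads and ignores any other topology attributes.

```shell
#!/bin/bash
# check_topology FILE: compare the <vcpu> count, the number of <vcpupin>
# lines and sockets*cores*threads from <topology> in a domain XML file.
check_topology() {
    local xml="$1" vcpus pins sockets cores threads
    vcpus=$(sed -n "s/.*<vcpu[^>]*>\([0-9]*\)<\/vcpu>.*/\1/p" "$xml" | head -n 1)
    pins=$(grep -c "<vcpupin " "$xml")
    sockets=$(sed -n "s/.*<topology[^>]*sockets='\([0-9]*\)'.*/\1/p" "$xml" | head -n 1)
    cores=$(sed -n "s/.*<topology[^>]*cores='\([0-9]*\)'.*/\1/p" "$xml" | head -n 1)
    threads=$(sed -n "s/.*<topology[^>]*threads='\([0-9]*\)'.*/\1/p" "$xml" | head -n 1)
    echo "vcpu=$vcpus vcpupin=$pins topology=$((sockets * cores * threads))"
    # All three numbers have to agree.
    [ "$vcpus" -eq "$pins" ] && [ "$vcpus" -eq $((sockets * cores * threads)) ]
}
```

For the settings in this guide, check_topology vm.xml should report 8 for all three values.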

CPU Features

In both cases <feature policy='require' name='topoext'/> should be added as a CPU feature. See the Arch Wiki. But as /u/llitz said, using EPYC is the best way to have an optimal setup.

AMD Ryzen CPU model recommendation for optimal gaming performance

Find the block <cpu> and adapt it to look like this:

<cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='4' threads='2'/>
    <cache mode='passthrough'/>
    <feature policy='require' name='topoext'/>
    <!-- add additional cpu features here-->
</cpu>

For QEMU 3.1 and below “EPYC” is preferred over “host-passthrough”:

<cpu mode='custom' match='exact' check='none'>
    <model fallback='allow'>EPYC</model>
    <topology sockets='1' cores='4' threads='2'/>
    <feature policy='require' name='topoext'/>
    <!-- add additional cpu features here--> 
</cpu>

Hyper-V Enlightenments

Hyper-V enlightenments help the guest operating system handle virtualization tasks. The guest operating system has to support those features (Win 10 1903 should handle them better than Win 10 1803).

After extensive testing I went with the settings shown below. My general rule of thumb is “the more the merrier”. The vendor_id setting is only required for Nvidia Error 43 prevention.

The function of each setting (and the version in which it became available) can be looked up in the libvirt documentation.

Find the block <features> and add the following block next to the <acpi/> element, as a sibling:

HyperV Settings Recommendation

<hyperv>
   <relaxed state='on'/>
   <vapic state='on'/>
   <spinlocks state='on' retries='8191'/>
   <vpindex state='on'/>
   <synic state='on'/>
   <stimer state='on'/>
   <reset state='on'/>
   <vendor_id state='on' value='1234567890ab'/> <!-- nvidia error code 43 prevention -->
   <frequencies state='on'/>
</hyperv>

Remark: Make sure <hyperv> and <acpi> have the same indent.

Clock Settings

todo.

Hugepages

I have written a separate article about setting up and using hugepages – you can find it here.
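
As a small worked example for sizing: the configurations in this guide give the guest 16777216 KiB of memory. Assuming a hugepage size of 2048 KiB (this is an assumption about the host, check Hugepagesize in /proc/meminfo), the number of pages to reserve can be computed as sketched below. hugepages_needed is a hypothetical helper name.

```shell
#!/bin/bash
# hugepages_needed GUEST_KIB [PAGE_KIB]: number of hugepages needed to back
# the guest memory, rounded up. PAGE_KIB defaults to 2048 (2 MiB pages).
hugepages_needed() {
    local guest_kib="$1" page_kib="${2:-2048}"
    echo $(( (guest_kib + page_kib - 1) / page_kib ))
}
```

hugepages_needed 16777216 prints 8192; with 1 GiB pages, hugepages_needed 16777216 1048576 prints 16.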

Troubleshooting

A common issues and troubleshooting article exists here.

Updates

  • 21.08.2019 – Added further information and todos
  • 30.09.2019 – Updated Hyper-V settings
  • 21.11.2019 – Rewrote the article and added further information

6 Comments

  1. JK says:

    Thanks for great configuration. Just bought Ryzen 2700 due to similar reasons – I had problems with guest latency, hope extra cores solve this issue.

    1. Mathias Hueber says:

      Let me know if it helps, I am always looking for further tweaking suggestions.

  2. […] I used AMD μProf to help with mapping out my 3700X (and this writeup on CPU-pinning): […]

  3. Ian says:

    Not sure why, but it looks like the physical cores for my 1700 are now arranged differently.
    Core 0 uses threads 0&8, core 1 uses 1&9, core 2 uses 2&10, etc.
    However the logical indexes haven’t changed, so not quite sure which one I must use

    1. Mathias Hueber says:

      Ohh, that is strange. Which BIOS version are you running? Is it still working though? I have read that some newer BIOS versions might break vfio passthrough altogether. See https://forum.level1techs.com/t/attention-amd-vfio-users-do-not-update-your-bios/142685

  4. Kckhfn says:

    Thanks for providing these helpful tips Mathias.
