This post was written collaboratively by Attilio Rao and George Dunlap.
Operating systems are generally written assuming that they are in direct control of the hardware. So when we run operating systems in virtual machines, where they share the hardware with other operating systems, this can sometimes cause problems. One of the areas addressed by a recently proposed patch series is the problem of spinlocks on a virtualized system. So what exactly is the problem here, and how does the patch solve it? And what is the effect of the patch when the kernel is running natively?
Spinlocks and virtualization
Multiprocessor systems need to be able to coordinate access to important data, to make sure that two processors don’t attempt to modify things at the same time. The most basic way to do this is with a spinlock. Before accessing data, the code will attempt to grab the spinlock. If code running on another processor is holding the spinlock, the code on this processor will “spin” waiting for the lock to be free, at which point it will continue. Because those waiting for the spinlock are doing “busy-waiting”, code should try to hold the spinlock only for relatively short periods of time.
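The basic idea can be sketched as a minimal test-and-set spinlock in C11. This is a simplified illustration, not the kernel's actual implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal test-and-set spinlock sketch (not the kernel's real code):
 * the flag is clear when the lock is free, set when it is held. */
typedef struct {
    atomic_flag locked;
} spinlock_t;

static void spin_lock(spinlock_t *l)
{
    /* Busy-wait ("spin") until we atomically flip the flag
     * from clear to set, i.e. until we acquire the lock. */
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;  /* spin */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Anyone else calling spin_lock() while the lock is held simply burns CPU cycles in that while loop, which is why critical sections protected by spinlocks must be short.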
Now let’s consider how this looks on a virtualized system. In this case, the VM has virtual cpus (vcpus) which share the physical cpus with virtual cpus from other VMs. Typically the total number of virtual cpus across all VMs exceeds the number of physical cpus; in some cases, such as cloud environments, there may be several times as many vcpus as physical cpus.
To accomplish this, the hypervisor scheduler gives timeslices of physical processor time to the vcpus, similar to the way that an operating system schedules processes. If the system has more virtual cpus wanting to run than physical processors to run them, some of them will be preempted to let others run. This allows the VMs to share the physical cpu resources effectively, but it breaks a hidden assumption in the spinlock algorithm: that the kernel code is not preempted while holding a spinlock.
Now, suppose that vcpu A grabs a spinlock, and before it finishes, is preempted by the scheduler. And then suppose vcpu B tries to grab the spinlock. Now B, instead of spinning for the short period of time that A needed the spinlock for, will be spinning until vcpu A is scheduled again — which may be anywhere from several milliseconds to hundreds of milliseconds, depending on how busy the system is. B is now using the cpu but accomplishing nothing. It’s burning up its VM’s share of CPU time, and keeping other VMs with useful work to do from running. Worse yet, it may even be that the reason A is not running is that the hypervisor’s scheduler is trying to give priority to B — so B is actively keeping A from finishing the work that it needed to do with the spinlock.
The situation gets even worse with a more advanced form of spinlock called a ticketlock. Ticketlocks have a lot of advantages for large systems, including reduced cacheline bouncing and more predictable wait time. (See this LWN article for a complete description.) The key attribute for this discussion is that ticketlocks essentially make a first-come first-served queue: if A has the lock, then B tries to grab it, and then C, B is guaranteed to get the lock before C. So now, if C is spinning waiting for the lock, it must wait for both A and B to finish before it can get the lock.
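A simplified ticketlock might look like the following sketch in C11. (The kernel's real implementation packs both counters into a single word and is more compact; this just illustrates the FIFO behavior.)

```c
#include <stdatomic.h>

/* Simplified ticketlock sketch: each arriving CPU takes the next
 * "ticket", and the lock serves tickets strictly in order (FIFO),
 * like a deli counter. */
typedef struct {
    atomic_uint next;   /* next ticket to hand out */
    atomic_uint owner;  /* ticket currently being served */
} ticketlock_t;

static void ticket_lock(ticketlock_t *l)
{
    /* Atomically take a ticket... */
    unsigned me = atomic_fetch_add_explicit(&l->next, 1,
                                            memory_order_relaxed);
    /* ...then spin until our number comes up. */
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        ;  /* spin */
}

static void ticket_unlock(ticketlock_t *l)
{
    /* Serve the next ticket in line. */
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

The strict FIFO ordering is exactly what makes the virtualized case worse: a waiter cannot jump the queue even if the vcpus ahead of it are preempted.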
The result is that on a moderately loaded system, the vast majority of the cpu cycles are actually spent waiting for ticketlocks rather than doing any useful work. This is called a “ticketlock storm”. (It was first documented by Thomas Friebel in his presentation at XenSummit 2008.)
Fixes for vCPU starvation
In order to fix this starvation problem, in 2008 Jeremy Fitzhardinge developed a Linux kernel patch series. His approach was to introduce an intermediate layer for spinlocks, allowing paravirt backends to redefine the full set of spinlock operations. He then wrote a Xen-specific implementation which:
- For the fast (uncontended) case, uses the traditional single-byte spinlock approach, overriding the ticketlock logic
- If a vCPU cannot get the lock after a specific number of iterations (probably because of lock contention), it adds itself to a per-cpu list and blocks on an event channel
- On unlock, the per-cpu list is walked and the first vCPU in line is woken up
In other words, after spinning for a certain amount of time, the code will assume that the vcpu holding the lock is not running; and instead of continuing to spin, will yield the cpu so that other work can get done.
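The spin-then-block idea can be sketched as follows. This is a hypothetical illustration (the helper names are invented, not Jeremy's actual code), with sched_yield() standing in for blocking on a Xen event channel:

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_THRESHOLD 1024  /* iterations before we give up spinning */

typedef struct { atomic_flag locked; } pv_spinlock_t;

static bool try_lock(pv_spinlock_t *l)
{
    /* Returns true if we flipped the flag from clear to set. */
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Sketch of spin-then-block. In the real Xen backend the vCPU adds
 * itself to a per-cpu list and blocks on an event channel; here a
 * plain yield stands in for "give the cpu to someone else". */
static void pv_spin_lock(pv_spinlock_t *l)
{
    for (;;) {
        for (int i = 0; i < SPIN_THRESHOLD; i++) {
            if (try_lock(l))
                return;
        }
        /* Assume the holder's vcpu was preempted: stop burning
         * cycles and yield instead of continuing to spin. */
        sched_yield();
    }
}

static void pv_spin_unlock(pv_spinlock_t *l)
{
    /* The real code would also kick the first waiting vCPU awake. */
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The threshold is the key tuning knob: short enough that a preempted holder doesn't waste much cpu time, long enough that the common uncontended case never pays the cost of blocking.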
In principle this shouldn’t penalize native Linux while still giving all the desired benefits. In practice, however, the paravirtualized spinlocks introduced a performance regression when running natively. So a specific kernel option, CONFIG_PARAVIRT_SPINLOCK, was introduced to enable them. That way, distros could choose between taking the additional spinlock indirection overhead when running a kernel natively (CONFIG_PARAVIRT_SPINLOCK=y), or risking a ticketlock storm when running a kernel as a VM (CONFIG_PARAVIRT_SPINLOCK=n).
While this approach has proven to work well, it has also shown some problems. First, the indirection layer adds some overhead, even if modest, on a critical path. Second, this model imposes some code duplication.
In order to address these issues, Jeremy worked on a new approach based on the existing ticketlock implementation. More specifically, the fast path is left untouched and some PVOPs are added to the slow paths. These PVOPs are responsible for the following:
- The __ticket_lock_spinning() PVOP is used on the lock side: it is invoked once a CPU has been spinning past a specific threshold. When invoked, __ticket_lock_spinning() marks the spinlock as being in the slowpath state and blocks the vCPU.
- The __ticket_unlock_kick() PVOP is invoked in the release slow path and kicks the next vCPU in line awake. When the lock becomes uncontended, the slowpath bit is cleared. Note that the slowpath flag is stored in the least significant bit of the lock’s tail ticket, which halves the number of CPUs the ticketlock can handle — a real cost for small ticket sizes.
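The encoding can be illustrated with a small sketch (the constant names here are hypothetical, mirroring the idea rather than the exact kernel code): with the flag in the least significant bit, tickets advance in steps of 2, so an n-bit ticket field now represents only 2^(n-1) CPUs.

```c
#include <stdbool.h>

/* Hypothetical constants illustrating the encoding: the low bit of
 * the tail ticket is the slowpath flag, so real tickets advance in
 * steps of 2 and half the ticket space is lost. */
#define TICKET_SLOWPATH_FLAG 1u
#define TICKET_INC           2u

static inline bool in_slowpath(unsigned tail)
{
    return (tail & TICKET_SLOWPATH_FLAG) != 0;
}

static inline unsigned ticket_of(unsigned tail)
{
    /* Mask off the flag bit to recover the actual ticket number. */
    return tail & ~TICKET_SLOWPATH_FLAG;
}
```

Because the flag travels inside the same word as the ticket, the unlock path can inspect it with no extra memory traffic, which is what keeps the fast path untouched.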
This allows paravirtualized ticketlocks to share most of their code with native ticketlocks, without any additional indirection layer. It also removes the need for distros to choose between slower native performance and potentially catastrophic virtualized performance.
These patches have been heavily tested in the past, and recently they have been rebased to mainline Linux and further enhanced by IBM engineers Srivatsa Vaddagiri and Raghavendra K T.
Benchmarks of the new approach
One of the key things the Linux maintainers care about when considering this kind of functionality is the effect it will have when running the kernel natively. In order to support patch inclusion in mainline Linux, I (Attilio) wanted to show evidence that the patches were not introducing a performance regression in the native configuration with real-world workloads. To reproduce real-world situations I mainly used three tools:
pgbench is a tool for benchmarking PostgreSQL. It runs the same sequence of SQL commands repeatedly and calculates the average number of transactions per second. The sequence of commands can be customized, but by default each pgbench transaction uses five SELECT, UPDATE and INSERT commands.
For my test I used a PostgreSQL 9.2-devel version (mainline) as a backend, which has a lot of important scalability and performance improvements over the stable version, aiming for higher throughput and more meaningful results. Also, I used a stock installation, with only a simple configuration change.
In order to fully characterize scalability, I ran the workload with different numbers of threads (ranging from 1 to 64), 10 times each, and used some warmup runs in order to load the whole database into memory and thus avoid actual I/O operations while the real measurements were taken. The script used for collecting the data is available here.
pbzip2 is a parallel, threaded implementation of the bzip2 file compressor. I used this in order to emulate a CPU-intensive, multithreaded application. The compressed file was a 1GB file created from /dev/urandom, and all the I/O was performed on tmpfs in order to reduce noise. Again, for a scalability characterization, the workload was run with several thread counts, and the script used is here.
kernbench is a script comparing Linux kernel compile times across several numbers of make jobs. I used the following invocation:
- kernbench -n10 -s -c16 -M -f
which basically means running the test 10 times each with 1, 8, and 16 threads. Again, in order to reduce I/O effects I used a tmpfs volume for all the I/O.
The tests were performed on top of a vanilla Linux 3.3-rc6 (commit 4704fe65e55fb088fbcb1dc0b15ff7cc8bff3685), with a monolithic configuration. It was important also to estimate the impact of the CONFIG_PARAVIRT_SPINLOCK option on these benchmarks, in order to eventually consider its removal.
The tests, then, involved 4 different kernels, covering every combination of Jeremy’s patch (applied or not) and CONFIG_PARAVIRT_SPINLOCK (enabled or disabled).
Below is information related to the system and machine used:
- Machine: a Xeon X3450, 2.6GHz, 8-way system
- System version: Debian Squeeze 6.0.4
- gcc version: 4.4.5
All the results have been analyzed with ministat, a tool that calculates fundamental statistical properties of data sets provided in files. They are summarized, divided by number of threads and kernel configuration, in the following links:
As you can easily verify, for every case ministat averaged the 10 collected values, then compared the results and calculated the difference and standard deviation for every comparison.
The final result is that the patch doesn’t seem to introduce any performance penalty for these 3 workloads. Furthermore, it seems that the option CONFIG_PARAVIRT_SPINLOCK can be removed at the present time (although this likely needs to be re-evaluated on CPUs older than the Xeon X3450).