This week, we are reblogging this excellent piece from Luis from SUSE. The article came about because of a discussion Luis had at the Linux Foundation Collaboration Summit in Napa, after which he decided to write down some basics of the xenstore, a review of its first implementation and a summary of oxenstored. The original post is available here, on Luis’ blog.
OXenstored is the default in Xen only if the OCaml compiler was installed at build time. This means that in many Linux distributions that ship Xen, cxenstored is most likely your default. You can check whether oxenstored is in use by checking whether the oxenstored process is running in your dom0.
It all started with a paper: OXenstored – An Efficient Hierarchical and Transactional Database using Functional Programming with Reference Cell Comparisons
First, a general description of the xenstore and its first implementation. The xenstore is where Xen stores information about its systems. It covers dom0 and the guests, and it uses a filesystem-like layout, much as the Linux kernel lays out a system in sysfs. The original xenstored, which the paper refers to as Cxenstored, was written in C. Since all information needs to be stored in a filesystem-like layout, any library or tool that supports building a tree as a key-value store should suffice to maintain the xenstore. The Xen folks decided to use the Trivial Database, tdb, which as it turns out was designed and implemented by the Samba folks for their own database.
Xen then has a daemon sitting in the background which listens for read / write requests against this database; that’s what you see running in the background if you ‘ps -ef | grep xen’ on dom0. dom0 is the first host, the rest are guests. dom0 uses Unix domain sockets to talk to the xenstore, while guests talk to it through the kernel using the xenbus. The code for opening a connection to the C version of the xenstore is in tools/xenstore/xs.c and the call is xs_open(). The code first tries to open the Unix domain socket with get_handle(xs_daemon_socket()) and, if that fails, it tries get_handle(xs_domain_dev()). The latter varies depending on your operating system, and you can override it by setting the environment variable XENSTORED_PATH. On Linux this is /proc/xen/xenbus.
All the xenstore is doing is brokering access to the database. The xenstore represents all data known to Xen; we build it at boot and can throw it out the window when shutting down, which is why we should just use a tmpfs for it (Debian does, OpenSUSE should be changed to do so). The actual database for the C implementation is by default stored under the directory /var/lib/xenstored, and the database file there is called tdb.
On OpenSUSE that’s /var/lib/xenstored/tdb; on Debian (as of xen-utils-4.3) that’s /run/xenstored/tdb. The C version of the xenstore therefore writes a database file that can actually be read with tdb-tools (the actual package name on Debian and SUSE). xenstored does not use libtdb, which is GPLv3+; instead Xen carries its own copy of the tdb implementation, licensed under the LGPL, in tools/xenstore/tdb.c. Although you shouldn’t be using tdb-tools to poke at the database, you can still read from it using these tools. You can read the entire database as follows:
tdbtool /run/xenstored/tdb dump
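Going back for a moment to how a client reaches the xenstore, the fallback order described earlier (Unix domain socket first, then the xenbus device, with XENSTORED_PATH overriding the device path) can be sketched in a few lines of Python. This is only an illustration of the ordering, not the code in xs.c: the function names and the socket path are assumptions of mine, and the sketch checks whether a path exists rather than actually opening it.

```python
import os

# Assumed defaults, for illustration only; real paths depend on the build and OS.
DEFAULT_SOCKET = "/var/run/xenstored/socket"   # hypothetical socket location
DEFAULT_XENBUS_DEV = "/proc/xen/xenbus"        # Linux location named in the text


def candidate_endpoints(environ=None):
    """Return (kind, path) pairs in the order a client would try them."""
    env = os.environ if environ is None else environ
    # XENSTORED_PATH, when set, replaces the default xenbus device path.
    dev = env.get("XENSTORED_PATH", DEFAULT_XENBUS_DEV)
    return [("socket", DEFAULT_SOCKET), ("xenbus", dev)]


def pick_endpoint(exists=os.path.exists, environ=None):
    """Return the first endpoint that is present, or None if neither is."""
    for kind, path in candidate_endpoints(environ):
        if exists(path):
            return kind, path
    return None


if __name__ == "__main__":
    print(pick_endpoint())
```

On dom0 you would expect the socket to win; inside a guest only the xenbus device should be there, so the second entry is the one that gets used.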
The biggest issue with the C implementation and its reliance on tdb is that you can livelock it if you have a guest or any other entity doing short, quick accesses to the xenstore. We need Xen to scale, though, and the research and development behind oxenstored was an effort to help with that. What follows is my brain dump of the paper. I don’t get into the details of the implementation because, as can be expected, I don’t want to read OCaml code. Keep in mind that if I look for a replacement, I’m also looking for something that the Samba folks might want to consider.
OXenstored has the following observed gains:
- 1/5th the size, in lines of code, of the C xenstored
- better performance as the number of guests grows: it supports 3 times the number of guests, for an upper limit of 160 guests
The performance gains come from two things:
- how it deals with transactions through an immutable prefix tree. Each transaction is associated with a triplet (T1, T2, p), where T1 is the root of the database just before the transaction, T2 is the local copy of the database with all updates made by the transaction up to that point, and p is the path to the furthest node from the root of T2 whose subtree contains all the updates made by the transaction up to that point.
- how it deals with sharing immutable subtrees and uses ‘reference cell equality’, a limited form of pointer equality, which compares the location of values instead of the values themselves. Two values are shared if they share the same location. Functional programming languages enforce that multiple copies of immutable structures share the same location in memory. oxenstored takes advantage of this functional programming feature to design a trie library which enforces sharing of subtrees as much as possible. This lets them simplify how to detect and merge / coalesce concurrent transactions.
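To make the two points above a bit more concrete, here is a small Python sketch of an immutable prefix tree with shared subtrees, plus one way the (T1, T2, p) triplet could be used at commit time. Treat it as my reading of the idea, not as the oxenstored code: the names Node, graft, write and commit are made up for illustration, and the commit rule (accept the transaction if the subtree at p is still the very same object it was in T1) is an assumption on my part.

```python
class Node:
    """Trie node, treated as immutable once built (not enforced here)."""
    __slots__ = ("value", "children")

    def __init__(self, value=None, children=None):
        self.value = value
        self.children = dict(children or {})  # child name -> Node


EMPTY = Node()


def split(path):
    """Turn '/local/domain/0' into ['local', 'domain', '0']."""
    return [part for part in path.split("/") if part]


def subtree(node, path):
    """Return the node at `path`, or None if it does not exist."""
    for name in path:
        node = node.children.get(name)
        if node is None:
            return None
    return node


def graft(node, path, replacement):
    """Return a new root with `replacement` placed at `path`.

    Only the nodes along `path` are copied; every other subtree is the
    same object in the old and the new tree.
    """
    if not path:
        return replacement
    head, rest = path[0], path[1:]
    children = dict(node.children)
    children[head] = graft(node.children.get(head, EMPTY), rest, replacement)
    return Node(node.value, children)


def write(root, path, value):
    """Return a new root in which `path` holds `value`."""
    old = subtree(root, path)
    old_children = old.children if old is not None else {}
    return graft(root, path, Node(value, old_children))


def commit(current, t1, t2, p):
    """Assumed commit rule: return the new root, or None to ask for a retry."""
    if current is t1:
        return t2  # nobody else wrote anything: T2 simply becomes the root
    if subtree(current, p) is subtree(t1, p):
        # Others only wrote outside p, so move our subtree into their tree.
        return graft(current, p, subtree(t2, p))
    return None


# The database just before the transaction (T1).
t1 = write(EMPTY, split("/local/domain/0/name"), "Domain-0")
t1 = write(t1, split("/local/domain/1/name"), "guest1")

# A transaction that only touches domain 1: (T1, T2, p).
t2 = write(t1, split("/local/domain/1/memory/target"), "1048576")
p = split("/local/domain/1")

# Meanwhile somebody else updates domain 0, or clashes on domain 1.
elsewhere = write(t1, split("/local/domain/0/memory/target"), "2097152")
clash = write(t1, split("/local/domain/1/name"), "renamed")

merged = commit(elsewhere, t1, t2, p)   # accepted, both updates present
rejected = commit(clash, t1, t2, p)     # None: subtree at p has changed
```

Python's `is` plays the role of the reference comparison here: the check in commit() never looks at the contents of a subtree, only at whether it is the same object. Note that the domain 0 subtree of t2 is literally the same object as in t1, which is the sharing the second bullet talks about.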
The complexity of the algorithms used by oxenstored depends only on the length of the path, which is rarely over 10. This gives predictable performance regardless of the number of guests present.
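A quick way to convince yourself of the path-length argument is to count how many nodes a single write has to allocate in a path-copying tree. The sketch below is hypothetical (the Node class and the counter are mine, not from the paper) and it cheats a little: copying a node's child dictionary in Python costs time proportional to the number of children, so only the number of new nodes is constant here, not every last bit of work.

```python
class Node:
    created = 0  # allocation counter, for this demonstration only

    def __init__(self, value=None, children=None):
        Node.created += 1
        self.value = value
        self.children = dict(children or {})  # child name -> Node


def write(node, path, value):
    """Path-copying update; `node` is None where the subtree is missing."""
    children = node.children if node is not None else {}
    if not path:
        return Node(value, children)
    head, rest = path[0], path[1:]
    new_children = dict(children)
    new_children[head] = write(children.get(head), rest, value)
    old_value = node.value if node is not None else None
    return Node(old_value, new_children)


def build(guests):
    """Build a store with one name entry per guest."""
    root = Node()
    for g in range(guests):
        root = write(root, ["local", "domain", str(g), "name"], f"guest{g}")
    return root


def nodes_allocated_by_one_write(guests):
    """Count the nodes created by a single write into a store of `guests`."""
    root = build(guests)
    before = Node.created
    write(root, ["local", "domain", "0", "memory", "target"], "1048576")
    return Node.created - before


if __name__ == "__main__":
    for n in (10, 1000):
        print(n, "guests:", nodes_allocated_by_one_write(n), "new nodes")
```

The path has five components, so each write creates six nodes (the new root plus one per component), whether the store holds 10 guests or 1000.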