3.3. Claims Design
3.3.1. Introduction
Xen’s page allocator supports a claims API that allows privileged domain builders to reserve an amount of available memory before populating the guest physical memory of new domains they are creating, configuring and building.
These reservations are called claims. They ensure that the claimed memory remains available for the domains when allocating it, even if other domains are allocating memory at the same time.
Installing claims is a privileged operation performed by domain builders before they populate the guest physical memory. This prevents other domains from allocating memory earmarked for domains under construction. Xen maintains the per-domain claim state for pages that are claimed but not yet allocated.
When claim installation succeeds, Xen updates the claim state to reflect the new targets and protects the claimed memory until it is allocated or the claim is released. As Xen allocates pages for the domain, claims are retired by reducing the claim state by the size of each allocation.
3.3.2. Design Goals
The design’s primary goals are:
Allow domain builders to claim memory on multiple NUMA nodes using a claim set atomically.
Preserve the existing XENMEM_claim_pages hypercall command for compatibility with existing domain builders and its legacy semantics, while introducing a new, unrestricted hypercall command for new use cases such as NUMA-aware claim sets.
Global claims are supported for compatibility with existing domain builders and for use cases where a flexible claim that can be satisfied from any node is desirable, such as on UMA machines or as a fallback for memory that comes available on any node. This means we cannot remove or replace the legacy global claim call nor the needed variables maintaining the global claim state. They are still very much needed: claims are not just for NUMA use cases, but for parallel domain builds in general.
Only on UMA machines is a global claim the same as a claim on node 0, but the same is not true for NUMA machines, where global claims can claim more memory than any single node, and the global claim can be used as a flexible fallback for claiming memory on any node, which can be useful when preferred NUMA node(s) should be claimed, but may have insufficient free memory at the time of claim installation, and the global claim can ensure that the shortfall is available from any node.
Use fast allocation-time claims protection in the allocator’s hot paths to protect claimed memory from parallel allocations from other domain builders in case of parallel domain builds, and to protect claimed memory from allocations from already running domains.
3.3.3. Design Overview
The legacy XENMEM_claim_pages hypercall is superseded by XEN_DOMCTL_claim_memory. This hypercall installs a claim set. It is an array of memory_claim_t entries, where each entry specifies a page count and a target: either a specific NUMA node ID or a special selector (for example, a global or flexible claim).
Like legacy claims, claim sets are validated and installed under domain.page_alloc_lock and heap_lock: Either the entire set is accepted, or the request fails with no side effects. Repeated calls to install claims replace any existing claims for the domain rather than accumulating.
As installing claim sets after allocations is not a supported use case, the legacy behaviour of subtracting existing allocations from installed claims is somewhat surprising and counterintuitive, and page exchanges make incremental per-node tracking of already-allocated pages on a per-node basis difficult. Therefore, claim sets do not retain the legacy behaviour of subtracting existing allocations, optionally on a per-node basis, from the installed claims across the individual claim set entries.
Summary:
Legacy domain builders can continue to use the previous (now deprecated) XENMEM_claim_pages hypercall command to install single-node claims with the legacy semantics and, aside from improvements or fixes to global claims in general, observe no changes in their behaviour.
Updated domain builders can take advantage of claim sets to install NUMA-aware claims on multiple NUMA nodes and/or globally in a single step.
For readers following the design in order, the next sections cover the following topics:
Claim Installation Paths explains how claim sets are installed.
Protection of Claims describes how claimed memory is protected during allocation.
Claim Retirement explains how claims are retired as allocations succeed.
Claims Accounting describes the accounting model that underpins those steps.
3.3.4. Key design decisions
- node_outstanding_claims[MAX_NUMNODES]
Tracks the sum of all claims on a node. get_free_buddy() checks it before scanning zones on a node, so claimed memory is protected from other allocations.
- claims_retire_allocation()
Retires claims in order: First, it retires claims from the allocation node’s claim. If this is not sufficient, it retires claims from the global claim as a fallback. This allows the global claim to be used as a flexible fallback for claiming allocations on any node. Finally, remaining claims are retired from other nodes to prevent the increase of domain_tot_pages(domain) caused by the allocation on top of domain.global_claims and domain.node_claims to exceed domain.max_pages.
- domain.global_claims (formerly domain.outstanding_claims)
Support for global claims is maintained for two reasons: first, for compatibility with existing domain builders, and second, for use cases where a flexible claim that can be satisfied from any node is desirable.
When the preferred NUMA node(s) for a domain do not have sufficient free memory to satisfy the domain’s memory requirements, global claims provide a flexible fallback for the memory shortfall from the preferred node(s) that can be satisfied from any available node.
In this case, domain builders can exploit a combination of passing the preferred node to xc_domain_populate_physmap() and NUMA node affinity to steer allocations towards the preferred NUMA node(s), while letting the global claim ensure that the shortfall is available.
This allows the domain builder to define a set of desired NUMA nodes to allocate from and even specify which nodes to prefer for an allocation, but the claim for the shortfall is flexible, not specific to any node.
3.3.5. Non-goals
3.3.5.1. Legacy behaviours
Installing claims is a privileged operation performed by domain builders before they populate guest memory. As such, tracking previous allocations is not in scope for claims.
For the following reasons, claim sets do not retain the legacy behaviour of subtracting existing allocations from installed claims:
Xen does not currently maintain a
d->node_tot_pages[node]count, and the hypercall to exchange extents of memory with new memory makes such accounting relatively complicated.The legacy behaviour is somewhat surprising and counterintuitive. Because installing claims after allocations is not a supported use case, subtracting existing allocations at installation time is unnecessary.
Claim sets are a new API and can provide more intuitive semantics without subtracting existing allocations from installed claims. This also simplifies the implementation and makes it easier to maintain.
3.3.5.2. Versioned hypercall
The domain builders using the XEN_DOMCTL_claim_memory hypercall also need to use other version-controlled hypercalls which are wrapped through the libxenctrl library.
Wrapping this call in libxenctrl is therefore a practical approach; otherwise, we would have a mix of version-controlled and unversioned hypercalls, which could be confusing for API users and for future maintenance. From the domain builders’ viewpoint, it is more consistent to expose the claims hypercall in the same way as the other calls they use.
Stable interfaces also have drawbacks: with stable syscalls, Linux needs to maintain the old interface indefinitely, which can be a maintenance burden and can limit the ability to make improvements or changes to the interface in the future. Linux carries many system call successor families, e.g., oldstat, stat, newstat, stat64, fstatat, statx, with similar examples including openat, openat2, clone3, dup3, waitid, mmap2, epoll_create1, pselect6 and many more. Glibc hides that complexity from users by providing a consistent API, but it still needs to maintain the old system calls for compatibility.
In contrast, versioned hypercalls allow for more flexibility and evolution of the API while still providing a clear path to adopt new features. The reserved fields and reserved bits in the structures of this hypercall allow for many future extensions without breaking existing callers.
3.3.6. Glossary
- claims
Reservations of memory for domains that are installed by domain builders before populating the domain’s memory. Claims ensure that the reserved memory remains available for the domains when allocating it, even if other domains are allocating memory at the same time.
- claim set
An array of memory_claim_t entries, each specifying a page count and a target (either a NUMA node ID or a special value for global claims), that can be installed atomically for a domain to reserve memory on multiple NUMA nodes. The chapter on Claim sets provides further information on the structure and semantics of claim sets.
- claim set installation
- installing claim sets
- installing claims
The process of validating and installing a claim set for a domain under domain.page_alloc_lock and heap_lock, ensuring that either the entire set is accepted and installed, or the request fails with no side effects. The chapter on Claim set installation provides further information on the structure and semantics of claim sets.
- domain builders
Privileged entities (such as toolstacks in management domains) responsible for constructing and configuring domains, including installing claims, populating memory, and setting up other resources before the domains are started.
- domains
Virtual machine instances managed by Xen, built by domain builders.
- global claims
claims that can be satisfied from any NUMA node, required for compatibility with existing domain builders and for use cases where strict node-local placement is not required or not possible, such as on UMA machines or as a fallback for memory that comes available on any node.
- libxenctrl
A library used by domain builders running in privileged domains to interact with the hypervisor, including making hypercalls to install claims and populate memory.
- libxenguest
A library used by domain builders running in privileged domains to interact with the hypervisor, including making hypercalls to install claims and populate memory.
- meminit
The phase of a domain build where the guest’s physical memory is populated, which involves allocating and mapping physical memory for the domain’s guest physmap. This should be performed after installing claims to protect the process against parallel allocations of other domain builder processes in case of parallel domain builds.
It is implemented in libxenguest and optionally installs claims to ensure the claimed memory is reserved before populating the physmap using calls to xc_domain_populate_physmap().
- nodemask
A bitmap representing a set of NUMA nodes, used for status information like node_online_map and the domain.node_affinity.
- node
- NUMA node
- NUMA nodes
A grouping of CPUs and memory in a NUMA architecture. NUMA nodes have varying access latencies to memory, and NUMA-aware claims allow domain builders to reserve memory on specific NUMA nodes for performance reasons. Platform firmware configures what constitutes a NUMA node, and Xen relies on that configuration for NUMA-related features.
When this design refers to NUMA nodes, it is referring to the NUMA nodes as defined by the platform firmware and exposed to Xen, initialized at boot time and not changing at runtime (so far).
The NUMA node ID is a numeric identifier for a NUMA node, used whenever code specifies a NUMA node, such as the target of a claim or indexing into arrays related to NUMA nodes.
NUMA node IDs start at 0 and are less than MAX_NUMNODES.
Some NUMA nodes may be offline, and the node_online_map is used to track which nodes are online. Currently, Xen does not support hotplug of NUMA nodes, so the set of online NUMA nodes is determined at boot time based on the platform firmware configuration and does not change at runtime.
- NUMA node affinity
The preference of a domain for a set of NUMA nodes, which can be used by domain builders to guide memory allocation even when not forcing the buddy allocator to only consider (or prefer) a specific node when allocating memory, but even a set of preferred NUMA nodes.
By default, domains have NUMA node auto-affinity, which means their NUMA node affinity is determined automatically by the hypervisor based on the CPU affinity of their vCPUs, but it can be disabled and configured.
- guest physical memory
- physmap
The mapping of a domain’s guest physical memory to the host’s machine address space. The physmap defines how the guest’s physical memory corresponds to the actual memory locations on the host.
- populating
The process of allocating and mapping physical memory for a domain’s guest physmap, performed by the domain builders, preferably after installing claims to protect the process against parallel allocations of other domain builder processes in case of parallel domain builds.
- toolstacks
Privileged entities (running in privileged domains) responsible for managing domains, including building, configuring, and controlling their lifecycle using domain builders. One toolstack may run multiple domain builders in parallel to build multiple domains at the same time.