Memory Claims

Overview

Xen’s page allocator supports a claims mechanism that allows a domain builder to reserve memory before allocation begins, preventing concurrent allocations from exhausting available pages mid-build. A claim can be global (host-wide) or target a specific NUMA node. A node-specific claim ensures that a domain’s memory can be allocated locally, on the same node as its vCPUs.

The host-wide claims check subtracts all outstanding claims from the total available pages. If the requesting domain holds a claim, its d->outstanding_pages are added back, because those pages are reserved for it (simplified pseudo-code):

ASSERT(spin_is_locked(&heap_lock));
unsigned long global_avail = total_avail_pages - outstanding_claims
                                                 + d->outstanding_pages;
return alloc_request <= global_avail;

Similarly, the per-node check enforces node-level claims by subtracting outstanding node claims from available node pages, and adding back the domain’s claim if allocating from the claimed node:

ASSERT(spin_is_locked(&heap_lock));
unsigned long avail = node_avail_pages(node)
                      - node_outstanding_claims(node)
                      + (node == d->claim_node ? d->outstanding_pages : 0);
return alloc_request <= avail;
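
To make the arithmetic of the two checks concrete, here is a minimal, self-contained sketch. It is not the hypervisor's code: struct host, struct dom, host_check(), node_check(), NR_NODES and NO_NODE are hypothetical names invented for this illustration, and locking is omitted entirely. The assert() calls encode an assumed invariant of this model, namely that claims never exceed the available pages they are taken from.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 2      /* hypothetical host with two NUMA nodes */
#define NO_NODE  0xffu  /* hypothetical marker: claim is not tied to a node */

/* Hypothetical, simplified stand-ins for the allocator's counters. */
struct host {
    unsigned long total_avail_pages;
    unsigned long outstanding_claims;
    unsigned long node_avail[NR_NODES];
    unsigned long node_claims[NR_NODES];
};

/* Hypothetical per-domain claim state. */
struct dom {
    unsigned long outstanding_pages;
    unsigned int claim_node;
};

/* Host-wide check: pages claimed by others are off limits, own claim is not. */
static bool host_check(const struct host *h, const struct dom *d,
                       unsigned long request)
{
    /* Assumed model invariant: claims never exceed available pages. */
    assert(h->outstanding_claims <= h->total_avail_pages);

    unsigned long avail = h->total_avail_pages - h->outstanding_claims
                          + d->outstanding_pages;
    return request <= avail;
}

/* Per-node check: the domain's claim only counts on the node it claimed. */
static bool node_check(const struct host *h, const struct dom *d,
                       unsigned int node, unsigned long request)
{
    assert(node < NR_NODES);
    assert(h->node_claims[node] <= h->node_avail[node]);

    unsigned long avail = h->node_avail[node] - h->node_claims[node]
                          + (node == d->claim_node ? d->outstanding_pages : 0);
    return request <= avail;
}
```

In this model, with 50 pages available on node 0 and 40 of them claimed by one domain, that domain may still request up to 50 pages from node 0, while any other domain is limited to the 10 unclaimed pages.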

Simplified pseudo-code for the claims checks in the buddy allocator:

struct page_info *get_free_buddy(order, memflags, d) {
    for ( ; ; ) {
        node = preferred_node_or_next_node();
        if (!node_allocatable_request(d, memflags, 1 << order, node))
            goto try_next_node;
        /* Find a zone on this node with a suitable buddy */
        for (zone = highest_zone; zone >= lowest_zone; zone--)
            for (j = order; j <= MAX_ORDER; j++)
                if (pg = remove_head(&heap(node, zone, j)))
                    return pg;
     try_next_node:
        if (req_node != NUMA_NO_NODE && (memflags & MEMF_exact_node))
            return NULL;
        /* Fall back to the next node and repeat. */
    }
}

struct page_info *alloc_heap_pages(d, order, memflags) {
    if (!host_allocatable_request(d, memflags, 1 << order))
        return NULL;
    pg = get_free_buddy(order, memflags, d);
    if (!pg) /* Retry allowing unscrubbed pages */
        pg = get_free_buddy(order, memflags|MEMF_no_scrub, d);
    if (!pg)
        return NULL;
    if (pg has dirty pages)
        scrub_dirty_pages(pg);
    return pg;
}

Note

The first get_free_buddy() pass skips unscrubbed pages and may fall back to other nodes. When MEMF_exact_node is set in memflags, no fallback occurs, so the first pass may return NULL even though the requested node still has free, unscrubbed pages. The second pass, with MEMF_no_scrub, also considers those unscrubbed pages. alloc_heap_pages() then scrubs them before returning, guaranteeing the domain gets the desired node-local pages even when scrubbing is pending.
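
The effect of the two passes can be illustrated with a toy model that counts clean and dirty free pages per node instead of tracking buddies. Everything in it is an assumption made for illustration: struct node_pool, struct result, find_node(), alloc_two_pass() and the NO_SCRUB and EXACT_NODE flags are hypothetical stand-ins, not the allocator's real types or flag values.

```c
#include <stdbool.h>

#define NR_NODES   2          /* hypothetical two-node host */
#define NO_SCRUB   (1u << 0)  /* hypothetical stand-in for MEMF_no_scrub */
#define EXACT_NODE (1u << 1)  /* hypothetical stand-in for MEMF_exact_node */

/* Free pages on one node, split by scrub state. */
struct node_pool {
    unsigned long clean;
    unsigned long dirty;
};

/* node == -1 signals failure; needs_scrub says dirty pages were taken. */
struct result {
    int node;
    bool needs_scrub;
};

/* One pass: start at req_node, optionally fall back to the other nodes. */
static int find_node(const struct node_pool pools[NR_NODES],
                     unsigned int req_node, unsigned int flags,
                     unsigned long pages)
{
    for (unsigned int i = 0; i < NR_NODES; i++) {
        unsigned int node = (req_node + i) % NR_NODES;
        unsigned long usable = pools[node].clean;

        /* Dirty pages only count when the caller accepts unscrubbed memory. */
        if (flags & NO_SCRUB)
            usable += pools[node].dirty;
        if (pages <= usable)
            return (int)node;
        /* An exact-node request never falls back to another node. */
        if (flags & EXACT_NODE)
            return -1;
    }
    return -1;
}

/* Two passes: clean pages only first, then retry allowing dirty pages. */
static struct result alloc_two_pass(struct node_pool pools[NR_NODES],
                                    unsigned int req_node, unsigned int flags,
                                    unsigned long pages)
{
    struct result r = { .node = find_node(pools, req_node, flags, pages),
                        .needs_scrub = false };

    if (r.node < 0)
        r.node = find_node(pools, req_node, flags | NO_SCRUB, pages);
    if (r.node < 0)
        return r;

    /* Take clean pages first; the remainder comes from the dirty pool. */
    unsigned long from_clean =
        pages < pools[r.node].clean ? pages : pools[r.node].clean;
    pools[r.node].clean -= from_clean;
    pools[r.node].dirty -= pages - from_clean;
    r.needs_scrub = pages > from_clean; /* caller scrubs before handing out */
    return r;
}
```

With node 0 holding only dirty pages and node 1 holding only clean ones, a request for node 0 without EXACT_NODE is satisfied from node 1 in the first pass, whereas the same request with EXACT_NODE fails the first pass and is satisfied from node 0 in the second, with the pages flagged for scrubbing.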

Therefore, toolstacks should set MEMF_exact_node in memflags when allocating for a domain with a NUMA-aware claim, for example by passing XENMEMF_exact_node(node).

For efficient scrubbing, toolstacks might want to pin the domain build to a CPU of the target NUMA node. The pages are then scrubbed on that node without cross-node traffic and with lower latency, which speeds up the domain build.

Data Structures

The following diagram shows the relationships between global, per-node, and per-domain claim counters, all protected by the global heap_lock.

        graph TB
 subgraph "Protected by the heap_lock"
    direction TB
    Global --Sum of--> Per-node
    Per-node --Sum of--> Per-domain
 end
 subgraph Per-domain
     direction LR
     claim_node["d->claim_node"]
     claim_node --claims on--> outstanding_pages["d->outstanding_pages"]
 end
 subgraph Per-node
     direction LR
     node_outstanding_claims--constrains-->node_avail_pages
 end
 subgraph Global
     direction LR
     outstanding_claims--constrains-->total_avail_pages
 end
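
One possible reading of the "Sum of" edges in the diagram is sketched below as a consistency check. It assumes that the global counter equals the sum of all per-domain claims, that each per-node counter equals the sum of the claims targeting that node, and that host-wide claims are counted only in the global counter. These are assumptions drawn from the diagram, not verified against the allocator. The names struct dom_claim, struct counters, claims_consistent(), NR_NODES and NO_NODE are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_NODES 2      /* hypothetical two-node host */
#define NO_NODE  0xffu  /* hypothetical marker for a host-wide claim */

/* Hypothetical mirror of d->outstanding_pages and d->claim_node. */
struct dom_claim {
    unsigned long outstanding_pages;
    unsigned int claim_node;
};

/* Hypothetical mirror of the global and per-node claim counters. */
struct counters {
    unsigned long outstanding_claims;
    unsigned long node_outstanding_claims[NR_NODES];
};

/* Recompute both levels from the per-domain claims and compare. */
static bool claims_consistent(const struct counters *c,
                              const struct dom_claim *doms, size_t nr_doms)
{
    unsigned long total = 0;
    unsigned long per_node[NR_NODES] = { 0 };

    for (size_t i = 0; i < nr_doms; i++) {
        total += doms[i].outstanding_pages;
        /* Assumption: only node-specific claims hit a per-node counter. */
        if (doms[i].claim_node < NR_NODES)
            per_node[doms[i].claim_node] += doms[i].outstanding_pages;
    }

    if (total != c->outstanding_claims)
        return false;
    for (unsigned int n = 0; n < NR_NODES; n++)
        if (per_node[n] != c->node_outstanding_claims[n])
            return false;
    return true;
}
```

In the real allocator such a check would only be meaningful while holding heap_lock, since all three levels of counters are updated under it.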