Packets at Line Rate: How to Actually Use AF_XDP

About 5 months ago I decided to create my own (minimal) TCP/IP implementation in user space. After spending some time looking over my options, I decided that I would build it atop AF_XDP sockets. They offer amazing performance, are very well supported, and unlike certain other options (DPDK) they don’t take ownership of the NIC in a way that is disruptive to other processes. The fact that it would also give me a chance to learn a bit more about eBPF was just the cherry on top.

A month after that I began writing this article, which is my attempt at saving you 5 months of your life. XDP documentation is scarce, so scarce that it might as well not exist, and when it does exist it is usually filled with lies meant to deceive and misguide you. Of course, I’m also a deceitful evil liar, but I do try to be good sometimes, so if you notice something wrong about this article then contact me and I’ll fix it.

This blog post will be written in Rust as that is the language I’m working in for my project. I do presume knowledge of Rust as well as an understanding of common syscalls, so those won’t be explained here. That said, I have tried to make the explanations thorough enough (and the code well written enough) that you don’t need to be a Rust expert to follow along.

A few additional quick notes before we start:

What Is It and How Does It Work?

XDP, short for eXpress Data Path, is a high-performance network path offered by the Linux* kernel for sending and receiving packets at high rates. It works by allowing you to bypass large chunks of the kernel’s networking stack and work with raw packets through a shared memory buffer. A combination of shared memory buffers, ring buffers, and direct memory access (when supported by the NIC) erases much of the cost of crossing the user/kernel space boundary with system calls.

Before we dive into making anything, here’s a brief overview of the parts that we need.

* Windows has recently introduced an equivalent feature. This post is exclusively about Linux XDP and will not have any Windows related discussion.

AF_XDP Socket

This is a simple file descriptor socket which we use to communicate with the kernel for a lot of what we’re going to do. All data sending and receiving is tied to some AF_XDP socket. These sockets are also commonly referred to as xsk, short for Xdp SocKet.

UMEM Buffer

A UMEM (or user memory) buffer is how packets are passed back and forth between user space and the kernel. Depending on how the setup is configured either the kernel or the NIC itself (through DMA) will read and write packets from/to this buffer. When used to its maximum potential it can allow skipping packet buffer allocation and (some of) the cost of crossing the user/kernel space boundary, including the cost of copying memory between the two.

RX, TX, Fill, and Completion Buffers

However, the UMEM buffer by itself isn’t enough. To coordinate access to it with the kernel we will be using 4 SPSC (single producer single consumer) ring buffers. Each buffer will look something like this:

+----------+----------+-------+-----+---------+---------+-----+-----------+
| producer | consumer | flags | ... | entry 0 | entry 1 | ... | entry n-1 |
+----------+----------+-------+-----+---------+---------+-----+-----------+
^                                   ^         ^         ^
|                                   |         |         |
+-----------------------------------+         +---------+
 The size of this chunk varies we'll           Size/type
 see how we know how much it is                specific
 later                                         to buffer

As we can see, each buffer has a producer (tail), a consumer (head), flags, and some mysterious chunk of memory, all four of which occupy some unknown amount of space. This is then followed by the buffer’s n entries (where n is a power of two). The producer and consumer are both atomic u32s* whose values are indices into the entries. The type of each entry varies depending on which buffer we’re talking about. The range of valid entries (i.e. ones produced but not yet consumed) starts at (and includes) the consumer index and ends at (but excludes) the producer index.

Note that for any given buffer you are in control of either the producer or the consumer (but never both), as such you need to make sure that the following holds true:

  * If you control the producer: never let more than n entries be outstanding, i.e. producer - consumer must never exceed n
  * If you control the consumer: never advance past the producer, i.e. only consume an entry when producer - consumer is greater than 0

Note that all the above comparisons use wrapping (modulo) arithmetic and that none of the rules require you to wrap the values back around to 0; you’re explicitly not supposed to. The counters are always incrementing, and so in that sense they contain the “total” number of entries that have been produced or consumed respectively. While you could use a modulo operation to turn a counter into the index it represents, this is inefficient. The better way is to store a mask on a per-ring basis whose value is n - 1. Because n is a power of two, cursor & mask will yield the same value as cursor % n.
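To make the mask trick concrete, here’s a small sketch (the names are mine, not from any XDP API) showing that for a power-of-two ring size the two expressions agree, even once the u32 cursors get large or wrap:

```rust
fn main() {
    const N: u32 = 1024; // ring size, must be a power of two
    let mask: u32 = N - 1; // stored once per ring

    // Cursors only ever increment (eventually wrapping around u32::MAX);
    // masking recovers the same index a modulo would.
    for cursor in [0u32, 5, 1023, 1024, 70_000, u32::MAX] {
        assert_eq!(cursor % N, cursor & mask);
    }
    println!("cursor & mask == cursor % N for every tested cursor");
}
```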

For a more complete description of the kernel’s circular buffers refer to the kernel.org documentation: https://www.kernel.org/doc/html/latest/core-api/circular-buffers.html

See the following reference chart for who is responsible for the producer and who is responsible for the consumer in each ring buffer. User space is “us” and the kernel is- well I think you can figure that one out.

Ring         Consumer     Producer
RX           user space   kernel
TX           kernel       user space
Fill         kernel       user space
Completion   user space   kernel

* Though this article contains little atomics code, atomics are something you will have to contend with, as they are the base mechanism for communicating with the kernel through these ring buffers. These operations are notoriously difficult to get right, and thus great care has to be taken when writing them, especially in multithreaded user space code with more than one socket in use. I will not be going over the details of the various memory orderings or how to use atomics, as that is far out of the scope of this article. If you need a refresher I would highly recommend watching Herb Sutter’s excellent Atomic<> Weapons talk from C++ and Beyond 2012.

RX/TX Buffers

The RX (receive) buffer is used to inform us of when ingress packets have been fully written to the UMEM buffer and are ready for us to read. Similarly, we use the TX (transmit) buffer to inform the kernel of when egress packets we’ve written to the UMEM buffer are ready to be sent out.

The entries in either buffer are of type xdp_desc.

// c code
struct xdp_desc {
	__u64 addr;
	__u32 len;
	__u32 options;
};
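For reference, the same descriptor can be mirrored on the Rust side with a `#[repr(C)]` struct (a sketch for illustration; the `XdpDesc` name is mine, and the libc crate exposes this same layout):

```rust
use std::mem::size_of;

/// Rust mirror of the kernel's `struct xdp_desc` shown above
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
struct XdpDesc {
    /// Byte offset into the UMEM buffer where the packet data starts
    addr: u64,
    /// Length of the packet data in bytes
    len: u32,
    /// Option flags
    options: u32,
}

fn main() {
    // The layout matches the C definition exactly: 8 + 4 + 4 bytes, no padding
    assert_eq!(size_of::<XdpDesc>(), 16);
    println!("xdp_desc occupies {} bytes", size_of::<XdpDesc>());
}
```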

Fill/Completion Buffers

The fill buffer informs the kernel of chunks it may write ingress packets to. The completion buffer is used by the kernel to inform us of when it has finished transmitting a packet at a given chunk.

The entries in either buffer are of type u64 where the value is an offset/index of the chunk which is being passed to either the kernel or the process depending on the buffer in question.
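Concretely, under the default aligned setup an entry is just the byte offset of a chunk. A tiny sketch of going between chunk indices and entry values (CHUNK_SIZE here stands in for whatever chunk size you registered):

```rust
fn main() {
    const CHUNK_SIZE: u64 = 4096; // stand-in for the registered UMEM chunk size

    // To hand chunk 7 to the kernel we'd write its byte offset into the fill ring:
    let chunk_index: u64 = 7;
    let entry: u64 = chunk_index * CHUNK_SIZE;
    assert_eq!(entry, 28_672);

    // Going back from an entry to the chunk it names:
    assert_eq!(entry / CHUNK_SIZE, chunk_index);
    // In the default aligned mode every offset is a multiple of the chunk size:
    assert_eq!(entry % CHUNK_SIZE, 0);
    println!("chunk {chunk_index} <-> fill/completion entry {entry}");
}
```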

Ring Buffer Pairings

If you understand the memory ownership model in Rust or in languages like it then this might all be very familiar. Essentially, when we first allocate the UMEM buffer all of its chunks are owned by the user space application that allocated it (us). Ownership of a chunk means that the owner is the only party that can read from or write to that chunk. In order to let the kernel write packets to memory or read packet data to send off we have to give the kernel ownership over certain chunks, after which it gives it back. A chunk which is owned by the kernel shall not be read from or written to by the user, and in turn the kernel will do the same for chunks which are owned by the user.

The buffer pairings which enable this ownership swapping are the RX/fill buffer and the TX/completion buffer pairs.

In the case of RX/fill:

  1. We start by owning some chunk of memory in the UMEM buffer
  2. We write an entry in the fill buffer containing the offset of the chunk
  3. We release ownership of it by incrementing the producer thus letting the kernel see the entry
  4. The kernel will then consume the fill buffer entry, and take ownership of the chunk specified
  5. Once a packet comes in and is redirected to us by an XDP program (don’t worry we’ll talk about that later), then the kernel will write its content to one of the UMEM buffer slots it owns
  6. The kernel writes an entry in the RX buffer containing the offset of the chunk
  7. The kernel releases ownership of the chunk by incrementing the producer
  8. We read the RX buffer entry to locate the chunk which we now own, after which we increment the consumer to make the RX buffer entry available for the producer to use
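The eight steps above are just the SPSC cursor discipline in action. Here’s a self-contained, single-threaded simulation of one chunk making the round trip (Ring and its methods are my toy stand-ins, not an XDP API; the real rings live in shared memory and require atomics and memory barriers):

```rust
/// Toy SPSC ring mirroring the cursor discipline of the real XDP rings
struct Ring {
    producer: u32,
    consumer: u32,
    mask: u32,
    entries: Vec<u64>,
}

impl Ring {
    fn new(n: usize) -> Self {
        assert!(n.is_power_of_two());
        Ring { producer: 0, consumer: 0, mask: n as u32 - 1, entries: vec![0; n] }
    }

    /// Producer side: publish one entry (the offset of a chunk being handed over)
    /// (a real producer must first check that producer - consumer < n)
    fn produce(&mut self, offset: u64) {
        let idx = (self.producer & self.mask) as usize;
        self.entries[idx] = offset;
        self.producer = self.producer.wrapping_add(1); // release the entry
    }

    /// Consumer side: take one entry if any are pending
    fn consume(&mut self) -> Option<u64> {
        if self.consumer == self.producer {
            return None; // nothing produced yet
        }
        let idx = (self.consumer & self.mask) as usize;
        let offset = self.entries[idx];
        self.consumer = self.consumer.wrapping_add(1); // hand the slot back
        Some(offset)
    }
}

fn main() {
    let mut fill = Ring::new(8); // we produce, the "kernel" consumes
    let mut rx = Ring::new(8);   // the "kernel" produces, we consume

    // Steps 1-3: we hand chunk 3 (offset 3 * 4096) to the kernel via the fill ring
    fill.produce(3 * 4096);
    // Steps 4-7: the kernel takes the chunk, writes a packet, and posts it on RX
    let chunk = fill.consume().expect("kernel sees the fill entry");
    rx.produce(chunk);
    // Step 8: we learn the chunk is ours again and now contains a packet
    assert_eq!(rx.consume(), Some(3 * 4096));
    println!("chunk completed the fill -> RX round trip");
}
```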

In the case of TX/completion:

  1. We start by owning some chunk as before
  2. We write some packet data to the chunk
  3. We write an entry in the TX buffer containing the offset of the chunk and the length of the data we put into it
  4. We release ownership to the kernel by incrementing the producer
  5. The kernel will consume the entry and begin transmitting the data
  6. Once the kernel has finished transmitting, it will write an entry to the completion buffer with the offset of the chunk
  7. The kernel releases ownership of the chunk by incrementing the producer
  8. We see that a new entry is made whose contents give the offset to the chunk which we now own, after which we increment the consumer to make the completion buffer entry available for the producer to use

Note on multiple sockets: One important thing to keep in mind is that there can be more than one AF_XDP socket associated with a UMEM buffer. In such a case each socket gets its own TX/RX buffers, but there can only ever be one fill ring and one completion ring per UMEM buffer.

eBPF XDP Program

eBPF programs are a special type of program that allow us to run code in kernel space that hooks into various defined event points, with one of those events being packet arrival. This program will be situated in the network stack and is invoked when a new packet arrives at the NIC* that it is bound to, upon which it will decide the packet’s fate. The program’s options are as follows:

  * XDP_ABORTED: drop the packet while signaling that an error occurred
  * XDP_DROP: silently drop the packet
  * XDP_PASS: let the packet continue on to the kernel’s normal networking stack
  * XDP_TX: send the packet back out of the NIC it arrived on
  * XDP_REDIRECT: redirect the packet elsewhere, e.g. to another NIC, another CPU, or an AF_XDP socket

The last option, XDP_REDIRECT, is the one most relevant to what we will be doing. When a packet is redirected to our socket it goes through the fill/RX ring buffer cycle described above.

Note: You will be told by the internet that you do not necessarily need an RX/fill buffer pair or an eBPF program if you don’t plan on receiving data. While nothing explicitly requires them, you might run into driver specific issues (see the driver exceptions section for more info). TL;DR: kind of, but if you want maximum portability then you need them.

* Technically the driver sees the packet before you. Additionally, XDP programs may be chained together (though the functionality for doing that is underdeveloped), so any one program doesn’t have to be the first to see a packet which gets to it.

Scaffolding

We’re going to be defining some helper functions and structs here which we’ll be using repeatedly as we go through this article.

We’ll be calling setsockopt a lot, and while this is not a perfectly safe wrapper it does make calling it a little less tedious.

/// A thin wrapper around [libc::setsockopt] that maps the return value to a [Result]
///
/// # Parameters:
///
/// * `op_name`: The name value of the operation to perform, e.g. [libc::XDP_UMEM_REG].
/// * `value` : Reference to the value that is to be passed.
///
/// # Safety
/// The caller must ensure that:
/// * `op_name` is a valid value to pass as the name parameter to [libc::setsockopt]
/// * `value` is a valid value to give to the operation associated with `op_name`
unsafe fn setsockopt<T>(fd: &XdpFd, op_name: i32, value: &T) -> Result<(), ioError> {
    let result = unsafe {
        libc::setsockopt(
            fd.as_raw_fd(),
            libc::SOL_XDP,
            op_name,
            value as *const _ as _,
            size_of::<T>() as _,
        )
    };

    if result == 0 {
        Ok(())
    } else {
        Err(ioError::last_os_error())
    }
}

For similar reasons and with a similar implementation we’ll also be creating a wrapper around getsockopt.

/// A thin wrapper around [libc::getsockopt] that maps the return value to a [Result]
///
/// # Parameters:
///
/// * `op_name`: The name value of the operation to perform, e.g. [libc::XDP_UMEM_REG].
/// * `value` : Reference to a type to be written to.
///
/// # Safety
///
/// When you call this function you have to ensure that `op_name` is a valid operation value and
/// that `value` is a valid type and value to pass as the operation's output parameter.
unsafe fn getsockopt<T>(fd: &XdpFd, op_name: i32, value: &mut T) -> Result<(), ioError> {
    let result = unsafe {
        libc::getsockopt(
            fd.as_raw_fd(),
            libc::SOL_XDP,
            op_name,
            value as *mut _ as _,
            &mut (size_of::<T>() as libc::socklen_t) as *mut _,
        )
    };

    if result == 0 {
        Ok(())
    } else {
        Err(ioError::last_os_error())
    }
}

Additionally, we’re also going to define a DerefNonNull struct which is a wrapper around NonNull that implements Deref and DerefMut. We’re going to be working with NonNull pointers that we know are valid to turn into references and this will make it much more ergonomic to write code once we start interacting with our structs. Additionally when it comes to ring buffers it can help us work around Rust’s borrow checker when it comes to self-referential structs without losing any safety guarantees.

/// A simple wrapper around a [NonNull] pointer to `T` that implements [Deref] and [DerefMut]
/// Invariant: The pointer must be convertible to a reference as specified by [NonNull::as_ref] and
/// [NonNull::as_mut]
#[repr(transparent)]
struct DerefNonNull<T>(NonNull<T>);

impl<T> DerefNonNull<T> {
    /// Returns a wrapped [NonNull] pointer
    ///
    /// # Safety
    /// When calling this method you must ensure that the pointer is convertible to a reference per
    /// the requirements specified by [NonNull] in the documentation for [NonNull::as_ref] and
    /// [NonNull::as_mut]
    unsafe fn new(ptr: NonNull<T>) -> Self {
        Self(ptr)
    }
}

impl<T> Deref for DerefNonNull<T> {
    type Target = T;

    fn deref(&self) -> &Self::Target {
        // Safety:
        // The invariant of the struct is that the pointer is convertible to a reference
        unsafe { self.0.as_ref() }
    }
}

impl<T> DerefMut for DerefNonNull<T> {
    fn deref_mut(&mut self) -> &mut Self::Target {
        // Safety:
        // The invariant of the struct is that the pointer is convertible to a reference
        unsafe { self.0.as_mut() }
    }
}

Lastly is SharedBuffer. Since we’re going to be sharing our buffers with the kernel, and some of them are concurrently modified by it, we’re going to need this struct. Holding a reference to a buffer which is modified by some external process is undefined behavior in Rust, so we’ll use this struct to offload the responsibility of safety onto the caller. It is of course possible to build safe abstractions over such access, but doing so is complicated and would take away from the focus of this article: XDP.

/// A wrapper around a [[T; N]] buffer that is shared with and concurrently modified by the kernel
/// Invariants:
/// * The wrapped pointer is convertible to a reference as specified by the docs in
///   [NonNull::as_ref] and [NonNull::as_mut]
/// * The memory pointed to by the underlying pointer is at least large enough to hold N many Ts
#[repr(transparent)]
struct SharedBuffer<T, const N: usize>(NonNull<[T; N]>);

impl<T, const N: usize> SharedBuffer<T, N> {
    /// Returns a [SharedBuffer] wrapping the [NonNull] pointer
    ///
    /// # Safety:
    /// The pointer passed must uphold the struct invariants
    pub unsafe fn new(ptr: NonNull<[T; N]>) -> Self {
        Self(ptr)
    }

    /// Gets a reference to the element at the specified buffer index
    ///
    /// # Safety
    /// You must ensure that the given index is safe to cast as a reference per the requirements
    /// specified by [NonNull] in the documentation for [NonNull::as_ref]. Especially the
    /// requirement that the data referenced may not be modified while this reference is alive,
    /// including kernel modifications.
    pub unsafe fn get_unchecked(&self, idx: usize) -> &T {
        assert!(idx < N, "Index out of range");
        // Safety:
        // we know that idx is in a valid range
        unsafe { self.0.cast::<T>().add(idx).as_ref() }
    }

    /// Gets a mutable reference to the element at the specified buffer index
    ///
    /// # Safety
    /// You must ensure that the given index is safe to cast as a reference per the requirements
    /// specified by [NonNull] in the documentation for [NonNull::as_mut]. Especially the
    /// requirement that the data referenced may not be modified while this reference is alive,
    /// including kernel modifications.
    pub unsafe fn get_unchecked_mut(&mut self, idx: usize) -> &mut T {
        assert!(idx < N, "Index out of range");
        // Safety:
        // we know that idx is in a valid range
        unsafe { self.0.cast::<T>().add(idx).as_mut() }
    }

    /// Gets a const pointer to the buffer
    pub fn as_ptr(&self) -> *const [T; N] {
        self.0.as_ptr()
    }

    /// Gets a mut pointer to the buffer
    pub fn as_mut_ptr(&self) -> *mut [T; N] {
        self.0.as_ptr()
    }
}

XdpSock

As you can see from the overview, we have a lot of resources that we need to manage, so we’re going to create an XdpSock struct to hold and coordinate all of them. We’ll add on to it as we go, for now it’s just going to be an empty stub. We’re also going to create an associated error type for better error handling.

// note: we're doing this because we'll be using std::io::Error a lot but I don't want to override
// the name Error. Any time you see ioError from now on, it's referring to this
use std::io::Error as ioError;
// other use statements are implicit

#[derive(Debug, Error)]
pub enum XdpSockError {
}

/// A ready-to-use high level wrapper around an XDP socket
pub struct XdpSock {
}

impl XdpSock {
    /// Attempts to create an [XdpSock]
    ///
    /// # Returns
    /// An [XdpSock] or an [XdpSockError] on error
    pub fn new() -> Result<XdpSock, XdpSockError> {
        // Order matters here!

        // Create the actual socket

        // Create the umem buffer

        // Create the ring buffers

        // Bind it together

        // Create the struct

        Ok(Self {})
    }
}

Creating a Socket

First thing we need to do is create an AF_XDP socket. This is as easy as creating a normal socket using the socket syscall.

let fd = unsafe { libc::socket(libc::AF_XDP, libc::SOCK_RAW, 0) };
if fd == -1 {
    // error handling goes here
} else {
    // we have a valid socket
}

Safe Wrapper

We’re going to create a safe wrapper for our socket using the newtype idiom.

/// A thin low-level wrapper around an XDP socket file descriptor
pub struct XdpFd(libc::c_int);

impl XdpFd {
    /// Attempts to create a new [XdpFd]
    ///
    /// # Returns
    /// An [XdpFd] or [std::io::Error] on error
    fn new() -> Result<XdpFd, ioError> {
        // Safety:
        // This function returns -1 on failure to create a socket, and is not unsafe to call.
        // We check that the fd returned is valid before creating an XdpFd out of it which ensures
        // a valid struct.
        let fd = unsafe { libc::socket(libc::AF_XDP, libc::SOCK_RAW, 0) };
        if fd == -1 {
            return Err(ioError::last_os_error());
        }

        Ok(Self(fd))
    }
}

impl AsRawFd for XdpFd {
    fn as_raw_fd(&self) -> RawFd {
        self.0
    }
}

impl Drop for XdpFd {
    fn drop(&mut self) {
        // Safety:
        // This function is safe to call on a file descriptor so long as it is a valid one
        // returned from a socket or open call. We have a valid file descriptor as it is
        // not possible to construct an instance of XdpFd without one.
        unsafe {
            libc::close(self.0);
        }
    }
}

Finally, we can add it to the XdpSock struct.

#[derive(Debug, Error)]
pub enum XdpSockError {
    #[error("Failed to create xdp socket: {0}")]
    Socket(std::io::Error),
}

pub struct XdpSock {
    /// The XDP socket file descriptor
    fd: XdpFd,
}

// inside XdpSock::new()

// Create the actual socket
let fd = XdpFd::new().map_err(XdpSockError::Socket)?;

Creating the UMEM Buffer

To create the UMEM buffer we need to pick out two parameters. The first is the chunk size, which is how big a packet and its associated data can be*. The second is the chunk count, which is how many chunks are allocated. The chunk size must be greater than or equal to 2048, less than or equal to the system’s page size, and it must also be a power of two. For most systems this means that your only options are 2048 or 4096. The chunk count on the other hand can be any value you want. Lastly, you will need to ensure that your buffer is aligned to the system page size**. We’re going to pick 4096 for both parameters in this example.

* This is a lie. You can split a packet over many entries, however this is a configuration option that is explained later on. Also thinking about 1 chunk = 1 packet is a useful mental model for the time being, just keep in mind it doesn’t have to be the case.

** See the registration section for unaligned buffers

Allocation

The first step once we’ve picked out our parameters is to allocate the buffer. We can easily get page aligned memory by using mmap as such:

// allocate the memory (raw pointer)
let addr = unsafe {
    libc::mmap(
        std::ptr::null_mut(),
        4096 * 4096,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_ANONYMOUS | libc::MAP_PRIVATE,
        -1,
        0,
    )
};

if addr == libc::MAP_FAILED {
    // something went wrong
} else {
    // allocation success
}

Memory returned by mmap is guaranteed to be page aligned if the address is null, which makes it a convenient way to get page aligned memory. You can of course do this by finding the page alignment programmatically and allocating memory with that information through whatever allocator you have, but this is much easier. Also, we have to use mmap later so you can’t run away from it for long.

Registration

We will also need to register the buffer with the kernel. To do this we’re going to have to perform a setsockopt call with this struct.

// c code
struct xdp_umem_reg {
    __u64 addr; /* Base address of the umem buffer */
    __u64 len; /* Length of the umem buffer */
    __u32 chunk_size; /* Size of each chunk */
    __u32 headroom; /* See below for the last three */
    __u32 flags;
    __u32 tx_metadata_len;
};

The addr field is simply the pointer we got from mmap, cast into a u64. The len and chunk_size fields are fairly self explanatory so I won’t go over those.

The headroom field tells the kernel to leave some space for our application at the start of each chunk: the kernel will write packets at addr + (n * chunk_size) + headroom when writing to the nth chunk, leaving you headroom many bytes for prepending data. This field is useful when your program needs to encapsulate a packet and re-transmit it, as it avoids the expensive operation of shifting the packet data by the desired number of bytes. The encapsulating data can instead be written into the headroom and the chunk passed straight to the TX buffer.
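A quick sanity check of that address math (the constants here are illustrative, not required values):

```rust
fn main() {
    const CHUNK_SIZE: u64 = 4096;
    const HEADROOM: u64 = 64; // space the kernel leaves for us in each chunk
    let addr: u64 = 0; // base of the UMEM buffer, treated as an offset

    let n: u64 = 2; // the kernel is writing a packet into the 3rd chunk
    // The kernel writes the packet here...
    let packet_start = addr + n * CHUNK_SIZE + HEADROOM;
    assert_eq!(packet_start, 8256);
    // ...which leaves HEADROOM bytes before it, e.g. for an encapsulation header
    let headroom_start = packet_start - HEADROOM;
    assert_eq!(headroom_start, n * CHUNK_SIZE);
    println!("packet at offset {packet_start}, headroom starts at {headroom_start}");
}
```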

The flags field allows you to define some combination of three possible flags. Those flags are:

  * XDP_UMEM_UNALIGNED_CHUNK_FLAG: allows chunks (and thus descriptor addresses) to sit at arbitrary offsets instead of being aligned to chunk_size
  * XDP_UMEM_TX_SW_CSUM: requests that the kernel compute checksums in software when transmitting (used together with TX metadata)
  * XDP_UMEM_TX_METADATA_LEN: explicitly enables the tx_metadata_len field described below

The tx_metadata_len field is used to inform the kernel that packets might provide or request transmission metadata. This field is a bit of an oddball though, as it didn’t always exist, and after it was created the associated flag didn’t exist for a while (any non-0 value implicitly enabled it). Now, attempting to use this field without the flag being enabled (if the field exists and the flag is supported) will result in the registration operation failing. So if you’re striving for maximum compatibility while using this field, you should first attempt registering the UMEM buffer without setting the associated flag; if that operation fails, attempt it again with the flag set.

This field is somewhat equivalent to headroom in terms of behavior, except it is us leaving some headroom at the start of the chunk rather than the kernel. That space is used to give the kernel metadata about packet transmission and to request data about the transmission result. The value of tx_metadata_len must be greater than or equal to sizeof(struct xsk_tx_metadata) (which we will refer to as x for brevity). Though the value may be greater than x, the kernel will only read from and write to the first x many bytes and treat the rest as padding. The following (modified) diagram from the kernel.org TX metadata documentation (see sources) shows what this looks like:

       tx_metadata_len
+---------------------------+
|                           |
v                           v
+-----------------+---------+----------------------------+
| xsk_tx_metadata | padding |          payload           |
+-----------------+---------+----------------------------+
                            ^
                            |
                      xdp_desc->addr

As we can see, the payload start is pointed to by an entry (in the TX buffer), and the metadata is obtained by the kernel through xdp_desc->addr - tx_metadata_len. Only the first x many bytes of the metadata are used, and the rest are treated as padding. Note that simply enabling the metadata option when registering the UMEM buffer is not enough to actually provide metadata. This option simply tells the kernel to look at the first x many bytes to determine what metadata, if any, a packet is providing or requesting. To actually pass metadata along with a packet see the TX buffer specifics
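The address math on our side, sketched (24 is an arbitrary illustrative value for tx_metadata_len, not the real sizeof(struct xsk_tx_metadata)):

```rust
fn main() {
    const CHUNK_SIZE: u64 = 4096;
    let tx_metadata_len: u64 = 24; // must be >= sizeof(struct xsk_tx_metadata)

    // When submitting a TX descriptor with metadata, addr points at the payload,
    // which starts tx_metadata_len bytes into the chunk...
    let chunk_start: u64 = 5 * CHUNK_SIZE;
    let desc_addr = chunk_start + tx_metadata_len;
    // ...and the kernel finds the metadata by stepping back from addr:
    let metadata_start = desc_addr - tx_metadata_len;
    assert_eq!(metadata_start, chunk_start);
    println!("payload at {desc_addr}, metadata at {metadata_start}");
}
```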

Note: when using the headroom option or the TX metadata option, address values in the ring buffer entries should not point to the beginning of a chunk. Instead, they should point to the beginning offset by the amount specified by headroom (RX/fill) or tx_metadata_len (TX/completion).

Once this struct has been instantiated, we can register the buffer through setsockopt as such:

match unsafe { setsockopt(&fd, libc::XDP_UMEM_REG, &umem_reg) } {
    Ok(_) => { /* registration successful */ }
    Err(_) => { /* something went wrong */ }
}

Safe Wrapper

We’ll definitely be wanting to wrap all that unsafe code behind a safe interface. This also gives us an opportunity to put compile time checks on buffer size parameters.

#[derive(Error, Debug)]
pub enum UmemError {
    #[error("Failed to allocate umem buffer: {0}")]
    Alloc(ioError),
    #[error("Failed to register umem buffer: {0}")]
    Register(ioError),
}

/// A safe wrapper around a page-aligned UMEM buffer.
struct Umem<const COUNT: usize, const SIZE: usize> {
    buffer: SharedBuffer<[u8; SIZE], COUNT>,
}

impl<const COUNT: usize, const SIZE: usize> Umem<COUNT, SIZE> {
    const LEN: usize = COUNT * SIZE;

    /// Allocates a page aligned buffer of size [Self::LEN] and registers it with the kernel
    pub fn new(fd: &XdpFd) -> Result<Self, UmemError> {
        const {
            assert!(
                SIZE >= 2048,
                "UMEM chunk size must be greater than or equal to 2048"
            );
            assert!(
                SIZE.is_power_of_two(),
                "UMEM chunk size must be a power of 2"
            );
        }

        let buffer = unsafe {
            let addr = libc::mmap(
                std::ptr::null_mut(),
                Self::LEN,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_ANONYMOUS | libc::MAP_PRIVATE,
                -1,
                0,
            );

            if addr == libc::MAP_FAILED {
                None
            } else {
                // mmap only allocates at null if MAP_FIXED is passed in flags
                Some(NonNull::new(addr).expect("NonNull new failed somehow"))
            }
        };
        let buffer = unsafe {
            SharedBuffer::new(
                buffer
                    .ok_or_else(|| UmemError::Alloc(ioError::last_os_error()))?
                    .cast(),
            )
        };

        Self::register_buffer(&buffer, fd).map_err(UmemError::Register)?;

        Ok(Umem { buffer })
    }

    /// Registers the [Umem] buffer with the specified [XdpFd]
    fn register_buffer(buf: &SharedBuffer<[u8; SIZE], COUNT>, fd: &XdpFd) -> Result<(), ioError> {
        let umem_reg = libc::xdp_umem_reg {
            addr: buf.as_ptr() as _,
            len: Self::LEN as _,
            chunk_size: SIZE as _,
            headroom: 0,
            flags: 0,
            tx_metadata_len: 0,
        };

        // Safety: XDP_UMEM_REG is a valid operation and umem_reg is a valid parameter to pass it
        unsafe { setsockopt(fd, libc::XDP_UMEM_REG, &umem_reg) }
    }
}

impl<const COUNT: usize, const SIZE: usize> Drop for Umem<COUNT, SIZE> {
    fn drop(&mut self) {
        // Safety:
        // We acquired this memory through mmap and therefore it's safe to pass it to munmap to
        // free it. The allocation always has the size Self::LEN
        unsafe {
            libc::munmap(self.buffer.as_mut_ptr().cast(), Self::LEN);
        }
    }
}

Finally after all that work is done, we can add Umem as a part of XdpSock, and update it accordingly.

#[derive(Debug, Error)]
pub enum XdpSockError {
    // previous variants omitted for brevity

    #[error("Failed to create umem buffer: {0}")]
    Umem(UmemError),
}

pub struct XdpSock {
    /// The XDP socket file descriptor
    fd: XdpFd,
    /// The UMEM buffer
    umem: Umem<{Self::UMEM_CHUNK_COUNT}, {Self::UMEM_CHUNK_SIZE}>,
}

// inside impl XdpSock
const UMEM_CHUNK_SIZE: usize = 4096;
const UMEM_CHUNK_COUNT: usize = 4096;

// inside XdpSock::new()

// ...

// Create the umem buffer
let umem = Umem::new(&fd).map_err(XdpSockError::Umem)?;

Ok(Self {
    fd,
    umem,
})

Note: we’re not quite done with this struct yet, we’ll came to add more to it later at the end of the next section.

Creating Ring Buffers

Now we have to create some ring buffers. This can be very repetitive as very few code changes are needed between the different kinds of ring buffers. We’ll go over how to create a ring buffer using an RX buffer as an example, after which we’ll write a generic solution to avoid code repetition, which as a bonus will double as the safe wrapper.

Registration

Registering a ring is done by telling the kernel what its size will be through a simple setsockopt call. We’re going to use a ring size of 1024 for all rings, but feel free to change it if you want.

// remember: setsockopt returns a Result<(), ioError>
unsafe { setsockopt(fd, libc::XDP_RX_RING, &1024) }
    .expect("Failed to register RX buffer size");

Getting the Offsets

Remember when I showed you what the ring buffers look like? They had a producer, consumer, flags, some unknown amount of space, and then the ring buffer itself. I promised you we’d talk more about the structure of the buffer and how to find the offsets you need, so let’s do that. There are two structs we need to talk about, the first one is xdp_ring_offset.

1// c code
2struct xdp_ring_offset {
3    __u64 producer;
4    __u64 consumer;
5    __u64 desc;
6    __u64 flags;
7};

We get xdp_ring_offset structs from the kernel; each one is associated with a ring buffer, and each field in the struct is itself an offset. Once you allocate the buffer, you add one of the above offsets to the base pointer to get a pointer to the corresponding field.

1+----------+----------+-------+-----+---------+---------+-----+-----------+
2| producer | consumer | flags | ... | entry 0 | entry 1 | ... | entry n-1 |
3+----------+----------+-------+-----+---------+---------+-----+-----------+
4                                    ^
5                                    |
6                             addr + offset->desc

One important thing to mention is that the order of and offset to each of these fields can vary, which is why you get a separate offset for every field. That is, fields don’t necessarily appear in the order shown here, and they may have padding before or after them that isn’t being shown. Never assume anything about the layout of the buffer; use the offsets that you are given to locate each field.

Note: The reason the buffers are set up to use offsets like this is to prevent false sharing.

The second struct we need to talk about is xdp_mmap_offsets.

1// c code
2struct xdp_mmap_offsets {
3    struct xdp_ring_offset rx;
4    struct xdp_ring_offset tx;
5    struct xdp_ring_offset fr; /* Fill */
6    struct xdp_ring_offset cr; /* Completion */
7};

This struct contains an instance of xdp_ring_offset for each type of ring. To get this struct you must use getsockopt to have the kernel fill out the values for you.

 1let mut offsets: libc::xdp_mmap_offsets = unsafe { std::mem::zeroed() };
 2
 3let result = unsafe {
 4    libc::getsockopt(
 5        fd.as_raw_fd(),
 6        libc::SOL_XDP,
 7        libc::XDP_MMAP_OFFSETS,
 8        &mut offsets as *mut _ as _,
 9        &mut (size_of_val(&offsets) as libc::socklen_t) as *mut _,
10    )
11};
12
13if result == 0 {
14    // offsets.rx has the offsets we want for our buffer now
15}
16else {
17    // something went wrong
18}

Allocation

Allocation is done through an mmap call with some specific flags and offset values.

 1// we need desc offset + N entries' worth of bytes: the descriptor array is always the last
 2// element, so this covers the "header" section (producer, consumer, flags) plus the entries
 3let mem_len = offsets.desc as usize + N * size_of::<libc::xdp_desc>();
 4
 5let addr = unsafe {
 6    libc::mmap(
 7        std::ptr::null_mut(),
 8        mem_len,
 9        libc::PROT_READ | libc::PROT_WRITE,
10        libc::MAP_SHARED | libc::MAP_POPULATE,
11        fd.as_raw_fd(), // some XdpFd
12        libc::XDP_PGOFF_RX_RING,
13    )
14};
15
16if addr == libc::MAP_FAILED {
17    // something went wrong
18}
19else {
20    // allocation success
21}

Then we can get the specific parts from the memory we allocated using the offsets we got.

1let producer: NonNull<AtomicU32> = unsafe { addr.add(offsets.rx.producer as _).cast() };
2let consumer: NonNull<AtomicU32> = unsafe { addr.add(offsets.rx.consumer as _).cast() };
3let flags: NonNull<AtomicU32> = unsafe { addr.add(offsets.rx.flags as _).cast() };
4let data: NonNull<[libc::xdp_desc; 1024]> = unsafe { addr.add(offsets.rx.desc as _).cast() };

Generic Safe Wrapper

Obviously we don’t want to repeat that 3 more times for the other ring buffer types, especially with all that unsafe all over the place. So let’s introduce the wrapper for the ring buffers.

  1#[derive(Debug, Error)]
  2pub enum RingBufferError {
  3    #[error("Failed to register ring size: {0}")]
  4    RegisterSize(ioError),
  5    #[error("Failed to allocate the buffer: {0}")]
  6    Alloc(ioError),
  7}
  8
  9/// A generic safe-ish wrapper around an XDP ring buffer
 10struct RingBuffer<T, const N: usize> {
 11    addr: NonNull<libc::c_void>,
 12    mem_len: usize,
 13    producer: DerefNonNull<AtomicU32>,
 14    consumer: DerefNonNull<AtomicU32>,
 15    flags: DerefNonNull<AtomicU32>,
 16    data: SharedBuffer<T, N>,
 17    cached_prod: u32,
 18    cached_cons: u32,
 19}
 20
 21impl<T, const N: usize> RingBuffer<T, N> {
 22    /// Allocates a ring buffer given an XDP socket and the offsets associated with it
 23    ///
 24    /// # Safety
 25    /// The caller must ensure that:
 26    /// * `offset_fn` returns the correct offset associated with the ring buffer type
 27    /// * `sockopt` is a valid size registration [setsockopt] operation name for registering ring
 28    ///   size with the kernel (e.g. [libc::XDP_RX_RING]) for this ring type and matches the ring
 29    ///   type specified by the value which `offset_fn` returns
 30    /// * `mmap_off` is a valid offset to pass to [libc::mmap] when allocating a ring buffer
 31    ///   (e.g. [libc::XDP_PGOFF_RX_RING]) for this ring type and matches the ring type specified
 32    ///   by `sockopt`
 33    unsafe fn new(
 34        fd: &XdpFd,
 35        offset: xdp_ring_offset,
 36        sockopt: libc::c_int,
 37        mmap_off: libc::off_t,
 38    ) -> Result<Self, RingBufferError> {
 39        const {
 40            assert!(N > 1, "Buffer size must be greater than 1");
 41            assert!(N.is_power_of_two(), "Buffer size must be a power of two")
 42        }
 43
 44        // Safety:
 45        // Ring buffer length is always a power of two, invariant ensured by constructor
 46        // sockopt is a valid size registration op name as required by function contract
 47        unsafe { setsockopt(fd, sockopt, &N) }.map_err(RingBufferError::RegisterSize)?;
 48
 49        let mem_len = offset.desc as usize + N * size_of::<T>();
 50
 51        // Safety:
 52        // These are the correct parameters to pass to mmap when allocating a ring buffer and
 53        // mmap_off is a valid offset as required by function contract
 54        let addr = unsafe {
 55            let addr = libc::mmap(
 56                std::ptr::null_mut(),
 57                mem_len,
 58                libc::PROT_READ | libc::PROT_WRITE,
 59                libc::MAP_SHARED | libc::MAP_POPULATE,
 60                fd.as_raw_fd(),
 61                mmap_off,
 62            );
 63
 64            if addr == libc::MAP_FAILED {
 65                None
 66            } else {
 67                // mmap only allocates at null if MAP_FIXED is passed in flags
 68                Some(NonNull::new(addr).expect("NonNull new failed somehow"))
 69            }
 70        };
 71
 72        let addr = addr.ok_or_else(|| RingBufferError::Alloc(ioError::last_os_error()))?;
 73
 74        // Safety:
 75        // All add operations are safe as the offsets are provided by the kernel and we allocated
 76        // memory large enough for all of them
 77        // All DerefNonNull::new are safe as the pointers are convertible to a reference which is
 78        // ensured by the fact that we got the pointers from the kernel
 79        let (producer, consumer, flags, data) = unsafe {
 80            (
 81                DerefNonNull::<AtomicU32>::new(addr.add(offset.producer as _).cast()),
 82                DerefNonNull::<AtomicU32>::new(addr.add(offset.consumer as _).cast()),
 83                DerefNonNull::<AtomicU32>::new(addr.add(offset.flags as _).cast()),
 84                SharedBuffer::new(addr.add(offset.desc as _).cast()),
 85            )
 86        };
 87
 88        Ok(Self {
 89            addr,
 90            mem_len,
 91            cached_prod: producer.load(Ordering::SeqCst),
 92            cached_cons: consumer.load(Ordering::SeqCst),
 93            producer,
 94            consumer,
 95            flags,
 96            data,
 97        })
 98    }
 99}
100
101impl<T, const N: usize> Drop for RingBuffer<T, N> {
102    fn drop(&mut self) {
103        // Safety:
104        // We acquired this memory through mmap and therefore it's safe to pass it to munmap to
105        // it. The allocation size is always stored in mem_len
106        unsafe {
107            libc::munmap(self.addr.cast().as_ptr(), self.mem_len);
108        }
109    }
110}
111
112/// A safe wrapper around an XDP socket RX buffer
113#[repr(transparent)]
114struct RxBuffer<const N: usize>(RingBuffer<libc::xdp_desc, N>);
115/// A safe wrapper around an XDP socket TX buffer
116#[repr(transparent)]
117struct TxBuffer<const N: usize>(RingBuffer<libc::xdp_desc, N>);
118/// A safe wrapper around an XDP socket fill buffer
119#[repr(transparent)]
120struct FillBuffer<const N: usize>(RingBuffer<u64, N>);
121/// A safe wrapper around an XDP socket completion buffer
122#[repr(transparent)]
123struct CompletionBuffer<const N: usize>(RingBuffer<u64, N>);
124
125impl<const N: usize> RxBuffer<N> {
126    /// Constructs a new [RxBuffer] associated with the given fd
127    ///
128    /// # Returns
129    /// An [RxBuffer] or a [RingBufferError] on error
130    pub fn new(fd: &XdpFd, offsets: xdp_mmap_offsets) -> Result<Self, RingBufferError> {
131        // Safety:
132        // the arguments provided are the correct ones for constructing rx buffers
133        unsafe { RingBuffer::new(fd, offsets.rx, libc::XDP_RX_RING, libc::XDP_PGOFF_RX_RING) }
134            .map(Self)
135    }
136}
137
138impl<const N: usize> Deref for RxBuffer<N> {
139    type Target = RingBuffer<libc::xdp_desc, N>;
140
141    fn deref(&self) -> &Self::Target {
142        &self.0
143    }
144}
145
146impl<const N: usize> DerefMut for RxBuffer<N> {
147    fn deref_mut(&mut self) -> &mut Self::Target {
148        &mut self.0
149    }
150}
151
152// repeat above impl for the other 3 buffers (omitted for brevity). Make sure to also change the
153// offsets field that is being passed to be the one matching the buffer you are creating.
154// for TX use libc::XDP_TX_RING and libc::XDP_PGOFF_TX_RING
155// for fill use libc::XDP_UMEM_FILL_RING and libc::XDP_UMEM_PGOFF_FILL_RING
156// for completion use libc::XDP_UMEM_COMPLETION_RING and libc::XDP_UMEM_PGOFF_COMPLETION_RING

Note: We cache the values of the producer and consumer. This allows us to stage changes (e.g. reserving chunks of the buffer) without actually committing them. The cached values have to be kept in sync manually, but avoiding an atomic access on every operation can have serious performance benefits.

Now when it comes to integrating these ring buffers into our existing code we’ll be adding the fill and completion ring buffers to the Umem struct we created, and the RX and TX ring buffers to the XdpSock struct directly. As mentioned earlier there can only ever be one of each of the fill and completion buffers per UMEM buffer, but there can be multiple sockets associated with the UMEM buffer. Despite the fact that this article won’t cover multi-socket setups (beyond explaining the config value), it is more correct to have those buffers inside the Umem struct.

Let’s first start with the Umem struct.

 1#[derive(Error, Debug)]
 2pub enum UmemError {
 3    // previous variants omitted for brevity
 4
 5    #[error("Fill buffer operation failure: {0}")]
 6    FillBuffer(RingBufferError),
 7    #[error("Completion buffer operation failure: {0}")]
 8    CompletionBuffer(RingBufferError)
 9}
10
11// NEW! added more const generic values as well as fill and completion buffers
12
13/// A safe wrapper around a page-aligned UMEM buffer along with its associated fill and completion
14/// buffers.
15struct Umem<const COUNT: usize, const SIZE: usize, const FILL: usize, const COMP: usize> {
16    buffer: SharedBuffer<[u8; SIZE], COUNT>,
17    fill: FillBuffer<FILL>,
18    comp: CompletionBuffer<COMP>,
19}
 1// NEW! add offsets argument to create ring buffers
 2pub fn new(fd: &XdpFd, offsets: xdp_mmap_offsets) -> Result<Self, UmemError> {
 3    // previous code omitted for brevity
 4
 5    // after buffer registration
 6    let fill = FillBuffer::new(fd, offsets).map_err(UmemError::FillBuffer)?;
 7    let comp = CompletionBuffer::new(fd, offsets).map_err(UmemError::CompletionBuffer)?;
 8
 9    Ok(Umem { buffer, fill, comp })
10}

Now we can add the RX and TX buffers into our XdpSock struct and update the UMEM buffer code in the struct to accommodate the changes we made.

1// inside impl XdpSock
2const RING_LEN: usize = 1024;
 1#[derive(Debug, Error)]
 2pub enum XdpSockError {
 3    // previous variants omitted for brevity
 4
 5    #[error("Rx buffer operation failed: {0}")]
 6    RxError(RingBufferError),
 7    #[error("Tx buffer operation failed: {0}")]
 8    TxError(RingBufferError),
 9    #[error("Failed to get ring buffer offsets: {0}")]
10    RingOffsets(ioError),
11}
10
11pub struct XdpSock {
12    /// The XDP socket file descriptor
13    fd: XdpFd,
14    /// The UMEM buffer, fill and completion buffers
15    umem: Umem<
16        { Self::UMEM_CHUNK_COUNT },
17        { Self::UMEM_CHUNK_SIZE },
18        { Self::RING_LEN },
19        { Self::RING_LEN },
20    >,
21    /// The RX buffer
22    rx: RxBuffer<{ Self::RING_LEN }>,
23    /// The TX buffer
24    tx: TxBuffer<{ Self::RING_LEN }>,
25}
 1// NEW! helper function to get the offsets struct
 2
 3/// Attempts to get a [libc::xdp_mmap_offsets] struct from the kernel
 4///
 5/// # Returns
 6/// The struct if successful or an [std::io::Error] on failure
 7fn get_ring_offsets(fd: &XdpFd) -> Result<libc::xdp_mmap_offsets, ioError> {
 8    // Safety:
 9    // This struct is composed of simple integer offsets with no invariants to maintain; its
10    // content will be filled out by the kernel
11    let mut offsets: libc::xdp_mmap_offsets = unsafe { std::mem::zeroed() };
12
13    // Safety:
14    // XDP_MMAP_OFFSETS is a valid operation and offsets is valid type and value to pass to it
15    // for initialization
16    unsafe { getsockopt(fd, libc::XDP_MMAP_OFFSETS, &mut offsets) }?;
17
18    Ok(offsets)
19}
 1// inside XdpSock::new()
 2
 3// ...
 4
 5// Create ring buffers
 6let offsets = get_ring_offsets(&fd).map_err(XdpSockError::RingOffsets)?;
 7
 8// creates the umem buffer, fill and completion rings
 9let umem = Umem::new(&fd, offsets).map_err(XdpSockError::Umem)?;
10
11let rx = RxBuffer::new(&fd, offsets).map_err(XdpSockError::RxError)?;
12let tx = TxBuffer::new(&fd, offsets).map_err(XdpSockError::TxError)?;
13
14Ok(Self {
15    fd,
16    umem,
17    rx,
18    tx,
19})

Buffer Specifics

So far we’ve seen that all ring buffers have a flags section, some ring buffers (TX and RX) also have an options value in their entries. The meaning of these fields changes depending on which buffer we’re talking about, so let’s talk about it!

Fill Buffer

Flags

If the XDP_RING_NEED_WAKEUP flag is set for this buffer, the driver will need to be kicked awake in order for it to actually start consuming entries from the fill buffer. This can be done using the following recvfrom syscall.

1// c code
2// fd is an AF_XDP fd
3recvfrom(fd, NULL, 0, MSG_DONTWAIT, NULL, NULL);

RX Buffer

Options

The options field in an RX buffer entry is used by the kernel to inform us that a packet is a multipart packet. Multipart packets are packets that are broken up over multiple UMEM buffer chunks, usually because they are too big to fit into a single chunk, though drivers may split packets for other reasons as well. The kernel marks them by setting the XDP_PKT_CONTD flag for the first n-1 entries of a packet that is split over n entries. The presence of this flag indicates that the next entry in the ring is a continuation of the same packet.

Handling such packets is optional and something you must opt into. This is discussed further when talking about the XDP_USE_SG flag, where the behavior is explained in more detail.

TX Buffer

Flags

Like the fill buffer, the TX buffer can also have the XDP_RING_NEED_WAKEUP flag enabled, meaning that you might need to kick the driver awake to have it consume entries. This can be done by using a sendto syscall.

1// fd is an AF_XDP fd
2sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);

Options

The options field in a TX buffer has two possible flags: XDP_PKT_CONTD and XDP_TX_METADATA.

As described in the RX buffer specifics XDP_PKT_CONTD marks a packet as a multipart packet, which tells the kernel that the next entry in the ring buffer points to a chunk which is a continuation of this packet. This flag should be set for the first n-1 entries in a packet that is split over n entries. Unlike the RX buffer, you do not need to opt-in to use this flag in the TX buffer.

The XDP_TX_METADATA flag may only be used if you set the tx_metadata_len field appropriately when registering the UMEM buffer. Setting it on an entry lets the kernel know that the entry includes or requests transmission metadata, and that it should look at the tx_metadata_len bytes immediately preceding the address the descriptor points to. This of course means that we have to put something in that section, which leads us to

Note: when using this flag with multipart packets, only the first entry in the chain should have this flag set, and only the first entry should leave space for the metadata. The other entries in the chain should not have this flag set, and the addresses they point to should be chunk aligned.

xsk_tx_metadata flags

In the UMEM registration section we talked about how the minimum value for tx_metadata_len is sizeof(struct xsk_tx_metadata). This is, as you might have guessed, because those bytes actually represent a xsk_tx_metadata struct, which looks like this:

 1// c code
 2struct xsk_tx_metadata {
 3	__u64 flags;
 4	union {
 5		struct {
 6			/* XDP_TXMD_FLAGS_CHECKSUM */
 7
 8			/* Offset from desc->addr where checksumming should start. */
 9			__u16 csum_start;
10			/* Offset from csum_start where checksum should be stored. */
11			__u16 csum_offset;
12
13			/* XDP_TXMD_FLAGS_LAUNCH_TIME */
14			/* Launch time in nanosecond against the PTP HW Clock */
15			__u64 launch_time;
16		} request;
17
18		struct {
19			/* XDP_TXMD_FLAGS_TIMESTAMP */
20			__u64 tx_timestamp;
21		} completion;
22	};
23};

Yeah that’s right, your flags have flags now. Before we touch on the flags though, let’s talk layout. As you can see there are two sub-structs inside a union, one labeled request, and another labeled completion. As their names indicate, the former allows you to request that the driver perform certain tasks before transmission, while the latter hands back metadata about the transmission upon completion.

To determine which of these the driver will consider, if any, we use the struct’s flags field. The flags are:

XDP_TXMD_FLAGS_CHECKSUM

This asks the driver to calculate layer 4 (transport layer, e.g. TCP, UDP) checksums for you. You will need to set values for csum_start and csum_offset to inform the device from where to start calculating the checksum and where to store it.

XDP_TXMD_FLAGS_LAUNCH_TIME

Requests that the device wait at least until the provided timestamp to send the packet. You will need to provide the time by setting a value for launch_time. Using the PTP clock is out of scope for this article, see relevant kernel.org docs for info.

XDP_TXMD_FLAGS_TIMESTAMP

Requests that the device report the timestamp of when the transmission completed. The device will place this timestamp in tx_timestamp.

The Process

Before publishing the entry to the TX buffer you would set the flags you want enabled, then fill out data in the request portion of the union if any of the flags you set warrant it. When the chunk gets handed back to you in the completion buffer you may read from the completion portion of the union if the flags you set permit it.

Driver Support

Finally, all, some, or none of these flags may be supported by your NIC. To programmatically detect which flags are supported you must use Netlink’s netdev family to query this information about the interface you want to use. That is out of the scope of this post, but you might want to look at the xsk-flags section of the netdev specs YAML.

Binding It All Together

Now that we have all our components laid out we can finally make a bind system call to bind our xdp socket, UMEM buffer, and ring buffers to a specific network interface. To do that we will need to fill out another struct of course.

1// c code
2struct sockaddr_xdp {
3	__u16 sxdp_family; // set to AF_XDP
4	__u16 sxdp_flags; // binding options
5	__u32 sxdp_ifindex; // nic index
6	__u32 sxdp_queue_id; // nic queue id
7	__u32 sxdp_shared_umem_fd; // shared UMEM owner's fd (see XDP_SHARED_UMEM)
8};

Oh hey look at that, more flags; let’s save those for last. The two values we don’t need to worry too much about are the sxdp_family and sxdp_shared_umem_fd fields. You can set the sxdp_family field to PF_XDP. sxdp_shared_umem_fd can be set to 0 unless you want to share the UMEM buffer amongst multiple AF_XDP sockets, in which case see the XDP_SHARED_UMEM flag below.

The sxdp_ifindex field is the NIC’s index. I won’t go into this too much as it is out of scope, but long story short: use getifaddrs to get a list of all the NICs, then pick one and get its index with if_nametoindex (or ioctl with SIOCGIFINDEX). Though not covered in this post, the full code linked at the bottom of this article shows how it can be done.

sxdp_queue_id is how you specify which queue on the NIC you’d like to use. Finding out how many queues there are and the IDs is out of scope for this write up. If you’re just looking to test stuff out you can set this to 0 and be fine. If you actually want to dig into how to use this field then you’re looking for the ETHTOOL_GCHANNELS ioctl request. You can read about it in the kernel.org docs.

Note if you want to be lazy: NIC queue IDs are just a 0-indexed mapping to queues, so on a 4 queue NIC there are IDs 0, 1, 2, and 3. This means you could just try creating a socket at IDs 0 to n and stop when it fails.

Flags

XDP_ZEROCOPY

This option forces zero-copy mode. There is a lot of nuance to the difference between zero-copy and copy mode, but it boils down to this: the NIC can use DMA to read and write directly from/to the UMEM buffer rather than having the data copied to/from it. For RX you will also need an eBPF program running in native or offloaded mode to avoid creating an SKB; more on that later.

This is mutually exclusive with XDP_COPY

XDP_COPY

This option forces copy mode, and if you haven’t guessed it already, it means that packet data gets copied to/from the UMEM buffer rather than DMA’d there directly.

One more thing regarding the XDP_ZEROCOPY and XDP_COPY flags: you don’t have to use either. They are only useful if you want to force a specific mode. If neither flag is set, the kernel will attempt to bind using zero-copy mode and, if that is not available, fall back to copy mode. Passing one of these flags tells the kernel to attempt binding in the specified mode only and to fail if binding with that mode fails.

Note: If you didn’t use either of the above flags you can use getsockopt with XDP_OPTIONS to fill out a struct xdp_options. You can then tell which mode the socket operates in by checking whether the XDP_ZEROCOPY flag is set in the returned flags.

XDP_SHARED_UMEM

This option allows you to share the same UMEM buffer between multiple XDP sockets. The first socket should be bound without this flag, passing 0 for sxdp_shared_umem_fd. All other sockets that you plan on sharing the buffer with must pass this flag along with the initial socket’s fd in the sxdp_shared_umem_fd field.

XDP_USE_NEED_WAKEUP

This flag indicates that the driver is allowed to sleep. The driver normally busy-polls the TX and fill rings to see if work needs to be done; enabling this flag means you have to tell the driver (by kicking it, as shown earlier) when there is work to be done.

XDP_USE_SG

This option indicates that we can handle receiving multipart packets. As discussed in RX buffer specifics, multipart packets are packets which are too large to fit into a single chunk in the UMEM buffer. Enabling this option will cause the kernel to put multipart packets in the RX buffer, whose handling and behavior is discussed in the section linked earlier. If this flag is not enabled, then the kernel will drop any packet which is larger than the effective chunk size (chunk size minus headroom, if any was specified).

Performing the Binding

So to perform the bind operation all you need to do is fill the struct and call bind as such

 1// where fd is an XdpFd
 2
 3let sockaddr = libc::sockaddr_xdp {
 4    sxdp_family: libc::AF_XDP as _,
 5    sxdp_flags: 0, // or whatever flags you choose
 6    sxdp_ifindex: ifindex, // whichever you choose
 7    sxdp_queue_id: 0, // or some other id
 8    sxdp_shared_umem_fd: 0, // only needed with XDP_SHARED_UMEM
 9};
10
11let ret = unsafe {
12    libc::bind(
13        fd.as_raw_fd(),
14        &sockaddr as *const _ as *const _,
15        size_of::<libc::sockaddr_xdp>() as _,
16    )
17};
18
19if ret == 0 {
20    // it worked!
21} else {
22    // something went wrong :(
23}

Safe Wrapper

As we have done before, we’ll create a safe wrapper around this code to make it easier to work with.

 1impl XdpSock {
 2    /// Binds `fd` and its associated resources with the specified NIC by index.
 3    /// 
 4    /// # Returns
 5    /// An empty [Ok] value to represent success, or an [std::io::Error] if bind failed.
 6    fn bind(fd: &XdpFd, ifindex: u32) -> Result<(), ioError> {
 7        let sockaddr = libc::sockaddr_xdp {
 8            sxdp_family: libc::AF_XDP as _,
 9            sxdp_flags: 0,
10            sxdp_ifindex: ifindex,
11            sxdp_queue_id: 0,
12            sxdp_shared_umem_fd: 0, // only needed with XDP_SHARED_UMEM
13        };
14
15        let ret = unsafe {
16            libc::bind(
17                fd.as_raw_fd(),
18                &sockaddr as *const _ as *const _,
19                size_of::<libc::sockaddr_xdp>() as _,
20            )
21        };
22
23        if ret == 0 {
24            Ok(())
25        } else {
26            Err(ioError::last_os_error())
27        }
28    }
29}
1#[derive(Debug, Error)]
2pub enum XdpSockError {
3    // previous variants omitted for brevity
4
5    #[error("Failed to bind xdp socket to network interface: {0}")]
6    BindError(ioError),
7}
 1// inside XdpSock::new()
 2
 3// ...
 4
 5// Bind it all together!
 6// to see how we got ifindex you can checkout the full code linked below
 7Self::bind(&fd, ifindex).map_err(XdpSockError::BindError)?;
 8
 9Ok(Self{
10    fd,
11    umem,
12    rx,
13    tx
14})

And that’s it! You just made an AF_XDP socket that you can use to read and write real packets! Congrats champ here’s a 🎉 and a 🎂 to celebrate.

Sending and Receiving Data

Alright, that’s enough, don’t let it get to your head; you haven’t actually sent or received anything over the wire yet. Let’s start with sending data, since that one is the easier of the two.

Sending Data

The process here is simple: you write data to a chunk you own, put its offset in the TX buffer, and increment the producer. The kernel will hand ownership back to you in the completion buffer once it has sent it out. Here’s the packet we’re going to be sending out:

1let data: &[u8; _] = b"\xb2X\xad&W\x16\xc4b7\x03\x01d\x08\x00E\x00\x00(\x00\x01\x00\x00@\x11\
2\xef\xfe\xc0\xa8\x04\xb9\xc0\xa8\x04\xbc\x9c\xc1c\xdd\x00\x14\xe2qhello world!";

It’s a UDP packet! Couldn’t you tell? It sends the message “hello world!” from my PC to the mini PC I have out in the living room!

Now obviously no one is actually handwriting bytes into a string or buffer to send them out, let alone hardcoding IPs and MAC addresses into ’em. Realistically you would use packed structures matching the network protocols you’re using and fill out your buffer that way, but that is out of scope and overkill for demonstrating how to send a simple packet. The quick and dirty way is to just use scapy to generate the packets for you, which is exactly what I did above.

I’m going to be using the safe interface we wrote above to demonstrate how you might send this packet; keep in mind that this is all hardcoded for the sake of demonstrating the operation. Any reasonable program should develop a cleaner way of performing these operations. This is also the end of my providing sample safe wrappers; that is left as an exercise to the reader from here on.

 1let chunk_start = 0;
 2// not *actual* end, but end of data we're using
 3let chunk_end = chunk_start + data.len();
 4
 5// Safety: we own the 0th chunk and therefore can take a reference to it.
 6unsafe { sock.umem.buffer.get_unchecked_mut(0) }.as_mut_slice()[chunk_start..chunk_end]
 7    .copy_from_slice(data);
 8
 9let idx = sock.tx.cached_prod as _;
10// Safety: we own the index at cached_prod by definition
11*unsafe { sock.tx.data.get_unchecked_mut(idx) } = libc::xdp_desc {
12    addr: chunk_start as _,
13    len: data.len() as _,
14    options: 0,
15};
16
17sock.tx.cached_prod += 1;
18sock.tx
19    .producer
20    .store(sock.tx.cached_prod, Ordering::Release);
21
22unsafe {
23    let ret = libc::sendto(
24        sock.fd.as_raw_fd(),
25        std::ptr::null(),
26        0,
27        libc::MSG_DONTWAIT,
28        std::ptr::null(),
29        0,
30    );
31
32    if ret < 0 {
33        println!("sendto error: {}", std::io::Error::last_os_error());
34    }
35}
36
37loop {
38    let new_prod = sock.umem.comp.producer.load(Ordering::Acquire);
39    if new_prod > sock.umem.comp.cached_prod {
40        sock.umem.comp.cached_prod = new_prod;
41        sock.umem.comp.consumer.store(new_prod, Ordering::Release);
42        break;
43    }
44}

And just like that you’ve submitted a packet to the driver which, assuming the packet is formatted correctly, should get sent to the specified destination.

XDP eBPF Programs

Note: If you only plan on sending data and you don’t want to receive any data, all you would need is the program below.

Before we can work on receiving packets, we need to figure out how to tell the kernel which packets to send our way. The way to do this is through attaching an XDP eBPF program to the NIC to have it examine each packet as it arrives, and decide its fate before the kernel’s network stack ever sees it.

I won’t go too deep into what eBPF is or its full capabilities. The long story short is that eBPF programs execute in kernel space and are event-driven; in our case the event is a packet arriving at the NIC. The program gets loaded by some loader, like Aya or libbpf, and attached to the appropriate interface. One thing to note is that eBPF code can be very specific to the loader that is used, so the Rust code in this section can look quite different from equivalent C code. Thankfully, this part is the best-documented of the whole process, so if you plan on using libbpf and/or writing your code in C it should be relatively easy to find resources to help.

For now though, let’s look at one of the simplest possible programs we can write and analyze it:

 1#![no_std]
 2#![no_main]
 3
 4use aya_ebpf::{bindings::xdp_action, macros::xdp, programs::XdpContext};
 5
 6#[xdp]
 7pub fn pass_all(_ctx: XdpContext) -> u32 {
 8    xdp_action::XDP_PASS
 9}
10
11#[unsafe(link_section = "license")]
12#[unsafe(no_mangle)]
13static LICENSE: [u8; 4] = *b"GPL\0";
14
15#[cfg(not(test))]
16#[panic_handler]
17fn panic(_info: &core::panic::PanicInfo) -> ! {
18    loop {}
19}

We first declare that this program does not use the standard library, as we cannot use it inside the kernel, which is where our program will run. We also tell the compiler that there is no main function, since there is no defined entry symbol; the loader (Aya) will handle loading programs and setting entry points appropriately depending on arguments passed at runtime.

We define a function, marked with the xdp macro, which takes in an XdpContext and returns a u32. In this case the macro puts the function in a specific section in the resulting binary, which is later used by the loader to locate the function. The returned integer is interpreted by the kernel/driver to determine the fate of the packet, behavior which is elaborated on further down.

We also define a section called “license”, containing a static byte string. This is completely optional; however, the presence of a license section with the value “GPL” grants access to helper functionality provided by the kernel that is gated behind licensing restrictions.

Lastly, we define a panic handler. Every Rust program needs one, and we lose the default one due to the no_std nature of our program. Of course our program should never panic, so this is simply here to keep the compiler happy.

Behavior of a Program

An XDP program (read: function annotated with the xdp macro) will be called for each packet that arrives at the NIC. It will be given metadata regarding the packet as its only argument, and will be expected to return a u32 to decide what happens with the packet.

The metadata gives us information about the packet. In our case XdpContext holds the metadata, and is merely a safe thin wrapper around an xdp_md struct pointer, with xdp_md having the following layout:

// c code
struct xdp_md {
	__u32 data;
	__u32 data_end;
	__u32 data_meta;
	/* Below access go through struct xdp_rxq_info */
	__u32 ingress_ifindex; /* rxq->dev->ifindex */
	__u32 rx_queue_index;  /* rxq->queue_index  */
	__u32 egress_ifindex;  /* txq->dev->ifindex */
};

The first three members, despite being u32s, are effectively pointers to the start, end, and metadata of the packet. ingress_ifindex and rx_queue_index form the NIC index and queue index tuple. egress_ifindex is used when a packet is being transmitted; however, that is exclusive to a different kind of program/mapping, so we won’t be talking about it here.
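Since those start and end bounds arrive as raw addresses, every read from the packet has to be explicitly range-checked, or the eBPF verifier will reject the program. The check itself is plain pointer arithmetic. Here is a sketch of the usual pattern, isolated from any eBPF crate so it can run anywhere; the name ptr_at, and taking data/data_end as bare usize values, are my own choices standing in for the context’s start and end addresses:

```rust
use core::mem;

// The bounds check every XDP packet access must perform before reading a
// T at `offset` bytes into the packet. `data` and `data_end` stand in for
// the packet's start and one-past-the-end addresses; `ptr_at` is my own
// name for this helper, not an aya_ebpf API.
fn ptr_at<T>(data: usize, data_end: usize, offset: usize) -> Result<*const T, ()> {
    let wanted_end = data
        .checked_add(offset)
        .and_then(|p| p.checked_add(mem::size_of::<T>()))
        .ok_or(())?;
    if wanted_end > data_end {
        return Err(()); // would read past the packet; the verifier rejects this
    }
    Ok((data + offset) as *const T)
}
```

In a real program the verifier insists on seeing this comparison before the dereference, which is why it cannot be hoisted into some clever abstraction that hides the branch.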

The returned integer is how the program communicates the fate it has decided for the packet. An XDP program can return any one of these values:

  XDP_ABORTED: drop the packet and signal that an error occurred (this also fires the xdp_exception tracepoint)
  XDP_DROP: silently drop the packet
  XDP_PASS: pass the packet on to the kernel’s normal network stack
  XDP_TX: transmit the packet back out of the NIC it arrived on
  XDP_REDIRECT: redirect the packet elsewhere, e.g. to another NIC, another CPU, or an AF_XDP socket (this is the one we care about)

Do note that the program is allowed to modify the packets it receives. It would be perfectly valid, for example, to intercept some UDP packet, change which port it is addressed to, and then pass it on to the kernel’s network stack.
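To make that concrete, here is what the “change which port a UDP packet is sent to” idea looks like as plain byte manipulation. This sketch works on a &mut [u8] instead of the ctx pointers, and skips the checksum update a real program would need; the offsets assume Ethernet II + IPv4 with no IP options:

```rust
// Sketch of rewriting a UDP destination port, expressed over a plain byte
// slice. In a real XDP program you'd do the same offset math against the
// packet bounds from the context, with verifier-visible bounds checks.
const ETH_HDR_LEN: usize = 14;
const IPV4_HDR_LEN: usize = 20; // assumes no IP options
const UDP_DST_PORT_OFF: usize = ETH_HDR_LEN + IPV4_HDR_LEN + 2;

fn rewrite_udp_dst_port(pkt: &mut [u8], new_port: u16) -> Result<(), ()> {
    // Is it IPv4? (EtherType 0x0800, big endian on the wire)
    if pkt.len() < ETH_HDR_LEN || pkt[12..14] != [0x08, 0x00] {
        return Err(());
    }
    // Is it UDP? (IP protocol number 17, at byte 9 of the IP header)
    if pkt.len() < UDP_DST_PORT_OFF + 2 || pkt[ETH_HDR_LEN + 9] != 17 {
        return Err(());
    }
    pkt[UDP_DST_PORT_OFF..UDP_DST_PORT_OFF + 2].copy_from_slice(&new_port.to_be_bytes());
    // A real implementation must also fix up the UDP checksum
    // (or zero it, which is legal for UDP over IPv4).
    Ok(())
}
```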

Mapping Packets To AF_XDP

Creating a map is thankfully very easy with Aya; all we have to do is add this at the top level.

use aya_ebpf::{macros::map, maps::XskMap};

#[map]
static XSK_MAP: XskMap = XskMap::with_max_entries(64, 0);

An XskMap is an array-based map, so this map in particular maps u32 -> u32. The value a key maps to should be a socket fd (if it maps to anything at all).
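To build intuition for those array semantics before looking at the full program, here is a hypothetical user-space model (not Aya code; the struct and method names are mine, though 2 and 4 are the kernel’s actual values for XDP_PASS and XDP_REDIRECT): a lookup is just an index into an array, and a missing entry means falling through to a default action.

```rust
// Hypothetical user-space model of XskMap lookup semantics (not Aya code).
// Keys are queue indices, values are socket fds.
const XDP_PASS: u32 = 2;
const XDP_REDIRECT: u32 = 4;

struct XskMapModel {
    entries: [Option<i32>; 64],
}

impl XskMapModel {
    fn redirect(&self, queue_id: u32) -> u32 {
        match self.entries.get(queue_id as usize).copied().flatten() {
            Some(_fd) => XDP_REDIRECT, // the kernel would hand the packet to this socket
            None => XDP_PASS,          // no socket bound: fall through to the stack
        }
    }
}
```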

#![no_std]
#![no_main]

use aya_ebpf::{
    bindings::xdp_action,
    macros::{map, xdp},
    maps::XskMap, programs::XdpContext
};

#[map]
static XSK_MAP: XskMap = XskMap::with_max_entries(64, 0);

#[xdp]
pub fn pass_all(ctx: XdpContext) -> u32 {
    XSK_MAP.redirect(get_queue_idx(&ctx), 0) // queue_id then flags
        .unwrap_or(xdp_action::XDP_PASS)
}

// There is a builtin method for this in aya_ebpf but it hasn't made it to a release version yet
#[inline(always)]
fn get_queue_idx(ctx: &XdpContext) -> u32 {
    // read the field through the raw pointer instead of copying the whole struct
    unsafe { (*ctx.ctx).rx_queue_index }
}

#[cfg(not(test))]
#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}

#[unsafe(link_section = "license")]
#[unsafe(no_mangle)]
static LICENSE: [u8; 4] = *b"GPL\0";

Note: on error, redirect will return the lower 2 bits of the flags argument. This is a relic of the C API which, while useful in C, is less ergonomic in Rust. There, the flag can be used to return a default action on redirection failure without writing any error handling code. However, since Rust affords us the luxury of the Result type, we get a Result<u32, u32>. This means we must handle the error case anyway, so relying on the flag value is unnecessary, and the code can be written in the form above.

The above code is structured based on the following assumptions:

  1. There is only 1 socket bound per queue
  2. The value associated with the ith key is the socket bound to the ith queue
  3. The NIC has n queues, where n <= 64

The program will attempt to forward any packet which arrives on the ith queue to the socket in the ith map entry. If no such socket exists, it returns XDP_PASS.

Setting Map Entries

Before we call it a day with eBPF, however, we still need to set entries in the map so that it knows about our socket. Thankfully this step is relatively easy; the Aya code boils down to this:

// Using our fancy safe struct from above!
let mut sock = match XdpSock::new() {
    Ok(s) => s,
    Err(e) => {
        error!("Failed to create socket: {e}");
        return;
    }
};

// load the object file
let mut bpf = aya::Ebpf::load(aya::include_bytes_aligned!(concat!(
        env!("OUT_DIR"),
        "/ebpf_prog"
    )))
    .expect("Failed to construct ebpf instance");

// find the program named "pass_all"
let program: &mut Xdp = bpf.program_mut("pass_all")
    .expect("Failed to find program")
    .try_into()
    .expect("Failed to convert program to Xdp program");
// load the actual program
program.load()
    .expect("Failed to load program");
// attach it to the nic that the socket is attached to!
program.attach(&sock.nic.name, XdpFlags::default())
    .expect("Failed to attach program to NIC");

let mut map = XskMap::try_from(bpf.map_mut("XSK_MAP").expect("Failed to load map"))
        .expect("Failed to convert map to XskMap");

map.set(0, sock.fd.as_raw_fd(), 0) // key, value, flags
    .expect("Failed to insert queue -> xsk mapping");

Again, I’m glossing over this since this isn’t an Aya tutorial and eBPF is a whole beast in and of itself. Documentation for both Aya and libbpf is excellent, so hopefully that isn’t an issue.

Receiving Data

We’re finally at the end of our journey, and we can receive a packet! The following snippet will appear just below the one above where we set the map entries.

// Safety: we own the 0th chunk so we can take a reference to it
*unsafe { sock.umem.fill.data.get_unchecked_mut(0) } = 0;
sock.umem.fill.cached_prod += 1;
sock.umem
    .fill
    .producer
    .store(sock.umem.fill.cached_prod, Ordering::Release);

loop {
    let new_prod = sock.rx.producer.load(Ordering::Acquire);
    if new_prod > sock.rx.cached_prod {
        sock.rx.cached_prod = new_prod;
        // we can now read the contents of the 0th entry in the rx buffer to get info about the
        // packet in the 0th chunk of the umem buffer
        sock.rx.consumer.store(new_prod, Ordering::Release);
        // but not any more! we give up ownership once we release
        break;
    }
}

Nice and simple, right? We write an entry into the fill ring telling the kernel to fill the 0th chunk of the UMEM buffer with the first packet that comes in on the queue we’ve bound our socket to. Once the producer index of the RX ring is incremented, we know that the kernel has finished receiving a packet and is handing it to us. We then read the RX ring entry to obtain the chunk offset and length of the packet. Once we are done, we increment the consumer index, indicating that we no longer need that descriptor entry and that the producer can reuse the slot.
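If the Acquire/Release pairing above feels magical, here is a toy, self-contained model of one ring’s handshake, with the “kernel” side played by a thread. All the names are mine and the descriptors are just u32s, but the synchronization pattern (write the slot, then Release-store the producer index; Acquire-load the producer index before reading slots) is the same one the real rings rely on:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

const RING_SIZE: u32 = 8; // must be a power of two, like the real rings

struct Ring {
    producer: AtomicU32,
    consumer: AtomicU32,
    slots: [AtomicU32; RING_SIZE as usize],
}

fn run() -> Vec<u32> {
    let ring = Arc::new(Ring {
        producer: AtomicU32::new(0),
        consumer: AtomicU32::new(0),
        slots: std::array::from_fn(|_| AtomicU32::new(0)),
    });

    // "kernel" side: fill a slot, then publish it with a Release store
    let prod_ring = Arc::clone(&ring);
    let kernel = thread::spawn(move || {
        for i in 0..4u32 {
            let prod = prod_ring.producer.load(Ordering::Relaxed);
            // write the descriptor *before* publishing it...
            prod_ring.slots[(prod % RING_SIZE) as usize].store(100 + i, Ordering::Relaxed);
            // ...the Release store pairs with the consumer's Acquire load below
            prod_ring.producer.store(prod + 1, Ordering::Release);
        }
    });

    // "userspace" side: spin until all 4 descriptors have been seen
    let mut seen = Vec::new();
    let mut cached_cons = 0u32;
    while seen.len() < 4 {
        let prod = ring.producer.load(Ordering::Acquire);
        while cached_cons < prod {
            seen.push(ring.slots[(cached_cons % RING_SIZE) as usize].load(Ordering::Relaxed));
            cached_cons += 1;
        }
        // hand the consumed slots back so the producer may reuse them
        ring.consumer.store(cached_cons, Ordering::Release);
    }
    kernel.join().unwrap();
    seen
}
```

The indices only ever grow and are masked into the slot array, which is also how the real rings distinguish “full” from “empty” without an extra flag.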

Conclusion

Now that you’ve got all of this set up, you should be ready to go on and start sending and receiving packets faster than you ever thought you could! If you want to see all the code from the above sections strung together into one program that sends and receives data, you can find it here:

https://github.com/nahla-nee/af_xdp-example

All code shown in this post, as well as the code linked above, is licensed under the GPL 2 or later license.

And if you have some unexpected trouble along the way read on for some:

Driver Exceptions

Oh what, you thought you were safe? Just because you hit the conclusion?

Seriously, there’s a lot of deviation in behavior depending on what NIC you use, on account of this whole shtick being based on driver-implemented support. This section documents some of that behavior, mostly regarding Intel drivers (specifically igb and ixgbe). I did recently buy a ConnectX-6 Dx based NIC, and while the mlx5_core implementation seems much better than Intel’s, it also has some things worth talking about. You can get around some of these issues by using copy mode, but it won’t solve all your problems and you’ll lose a good bit of performance.

Testing was performed with the following NIC/driver pairings:

As far as I can tell, large parts of Intel’s NIC driver code are shared between their various drivers, so I would assume that the statements I make here about my Intel experience apply to all of them. Do note that only the ixgbe driver supports zero-copy mode, so any remarks on zero-copy behavior come from testing that driver alone.

Do I Need RX/Fill or an eBPF Program If I Only Want to Send Data?

This is somewhat driver dependent as far as I can tell. In copy mode it seems you can get away without them. However, it seems that all three are required when running in zero-copy mode, which makes sense if you dig into the gory details. Magnus Karlsson* answered a question regarding Intel drivers, confirming that all of their zero-copy implementations require eBPF. You can find his answer in this github issue.

* for those unfamiliar: Magnus is a Linux networking kernel engineer working at Intel, and is a (the?) maintainer for AF_XDP. He also does regular work on XDP, eBPF, and the kernel’s network stack.

My Existing Connections Hang/I Can’t Make New Connections When I Use XDP

Some drivers, though probably all, have to do a soft restart of sorts when you attach an XDP program, due to needing to reconfigure some of their queues, which can interrupt connectivity. It is also likely that DHCP will be re-initiated. This is normal and expected; sometimes those connections can recover, sometimes not.

Sources

As mentioned at the start of this post, many resources online have errors or are lacking in some information. Nevertheless, I do want to cite my sources and give credit where credit is due. Do note that my writing will contradict some of the information that these sources provide, and some of the information you find below might be incorrect or outdated.

Thanks to u/arctic-alpaca for pointing out UB in a previous version of this article