Close the performance gap to kernel-bypass solutions
Operate on packets before they are converted to SKBs
Work in concert with existing network stack without kernel modifications
Native Mode XDP
SKB or Generic Mode XDP
How is this connected?
Data-plane: inside kernel, split into:
Control-plane: Userspace
XDP puts no restrictions on how eBPF bytecode is generated or loaded
eBPF bytecode loading (and map creation) all go through the BPF syscall
This talk focuses on the approach used in $KERNEL_SRC/samples/bpf
What are the basic XDP building blocks you can use?
The eBPF program (in the driver hook) returns an action or verdict
How to cooperate with network stack
Think of XDP as a software offload layer for the kernel network stack
An IP routing application is a great example:
Similar concept could be extended to accelerate any kernel datapath
Add helpers instead of duplicating kernel data in eBPF maps!
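Such a helper already exists for routing: bpf_fib_lookup() queries the kernel FIB directly from XDP, so the eBPF program does not need to mirror the routing table in a map. A hedged sketch (header parsing and error handling omitted; assumes a kernel that provides the helper, v4.18+; modeled loosely on samples/bpf/xdp_fwd_kern.c):

```
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *ctx)
{
	struct bpf_fib_lookup fib_params = {};
	int rc;

	/* ... parse Ethernet/IP headers and fill fib_params
	 * (family, ipv4_src, ipv4_dst, ifindex, etc.) ... */

	rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), 0);
	if (rc == BPF_FIB_LKUP_RET_SUCCESS) {
		/* fib_params now holds dmac/smac and the egress ifindex */
		return bpf_redirect(fib_params.ifindex, 0);
	}
	return XDP_PASS; /* let the normal stack handle lookup misses */
}
```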
How do you code these XDP programs?
SEC("xdp_drop_UDP") /* section in ELF-binary and "program_by_title" in libbpf */
int xdp_prog_drop_all_UDP(struct xdp_md *ctx) /* "name" visible with bpftool */
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data     = (void *)(long)ctx->data;
	struct ethhdr *eth = data;
	u64 nh_off;
	u32 ipproto = 0;

	nh_off = sizeof(*eth); /* ETH_HLEN == 14 */
	if (data + nh_off > data_end) /* <-- Verifier uses this boundary check */
		return XDP_ABORTED;
	if (eth->h_proto == htons(ETH_P_IP))
		ipproto = parse_ipv4(data, nh_off, data_end);
	if (ipproto == IPPROTO_UDP)
		return XDP_DROP;
	return XDP_PASS;
}
A simple XDP program that drops all IPv4 UDP packets
static __always_inline
int parse_ipv4(void *data, u64 nh_off, void *data_end)
{
	struct iphdr *iph = data + nh_off;

	/* Note: + 1 on a pointer advances it one iphdr struct size */
	if (iph + 1 > data_end) /* <-- Again the verifier checks our boundary check */
		return 0;
	return iph->protocol;
}
The simple helper function parse_ipv4 used in the previous example
The userspace program must call the BPF syscall to insert the program into the kernel
Luckily, the libbpf library was written to make this easier for developers
struct bpf_object *obj;
int prog_fd;
struct bpf_prog_load_attr prog_load_attr = {
.prog_type = BPF_PROG_TYPE_XDP,
.file = "xdp1_kern.o",
};
if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
return EXIT_FAILURE;
The eBPF bytecode and map definitions from xdp1_kern.o are now ready to use, and obj and prog_fd are set.
struct bpf_object *obj;
int prog_fd;
struct bpf_prog_load_attr prog_load_attr = {
.prog_type = BPF_PROG_TYPE_XDP,
.file = "xdp_udp_drop_kern.o",
};
if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd) == 0) {
const char *prog_name = "xdp_drop_UDP"; /* ELF "SEC" name */
struct bpf_program *prog;
prog = bpf_object__find_program_by_title(obj, prog_name);
prog_fd = bpf_program__fd(prog);
}
Possible to have several eBPF programs in one object file
Now that a program is loaded (remember prog_fd was set in the last snippet shown), attach it to a netdev
#include <net/if.h> /* if_nametoindex */

static __u32 xdp_flags = XDP_FLAGS_DRV_MODE; /* or XDP_FLAGS_SKB_MODE */
static int ifindex;

ifindex = if_nametoindex("eth0");
if (bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags) < 0) {
	printf("link set xdp fd failed\n");
	return EXIT_FAILURE;
}
If bpf_set_link_xdp_fd() is successful, the eBPF program in xdp1_kern.o is attached to eth0 and program runs each time a packet arrives on that interface.
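Detaching uses the same libbpf call with -1 as the program fd (a sketch, assuming the same ifindex and xdp_flags as above):

```
/* Remove the XDP program from the interface again */
if (bpf_set_link_xdp_fd(ifindex, -1, xdp_flags) < 0) {
	printf("link unset xdp fd failed\n");
	return EXIT_FAILURE;
}
```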
eBPF maps are created when a program is loaded. In this definition the map is a per-CPU array, but there are a variety of map types.
struct bpf_map_def SEC("maps") rxcnt = {
.type = BPF_MAP_TYPE_PERCPU_ARRAY,
.key_size = sizeof(u32),
.value_size = sizeof(long),
.max_entries = 256,
};
While the eBPF program executes, the rxcnt map can be accessed like this:
long *value;
u32 ipproto = 17; /* IPPROTO_UDP */

value = bpf_map_lookup_elem(&rxcnt, &ipproto);
if (value)
	*value += 1; /* We saw a UDP frame! */
/* BPF_MAP_TYPE_PERCPU_ARRAY maps do not need to sync between CPUs;
 * with BPF_MAP_TYPE_ARRAY use __sync_fetch_and_add(value, 1); */
eBPF maps can also be used to communicate information (statistics in this example) from the eBPF program to userspace. First locate the map:
struct bpf_map *map = bpf_object__find_map_by_name(obj, "rxcnt");
if (!map) {
printf("finding a map in obj file failed\n");
return EXIT_FAILURE;
}
map_fd = bpf_map__fd(map);
The map file descriptor (map_fd) is needed to interact with the BPF syscall
Now the map contents can be accessed via map_fd like this:
unsigned int nr_cpus = bpf_num_possible_cpus();
__u64 values[nr_cpus];
__u32 key = 17; /* IPPROTO_UDP */
__u64 sum = 0;
int cpu;

if (bpf_map_lookup_elem(map_fd, &key, values))
	return EXIT_FAILURE;
/* The kernel returns a memcpy'ed snapshot of the counters stored per CPU */
for (cpu = 0; cpu < nr_cpus; cpu++)
	sum += values[cpu];
printf("key %u value %llu\n", key, sum);
Userspace sums the counters per CPU. This allows the eBPF kernel program to run faster, since it avoids atomic operations.
XDP redirect is powerful
XDP action code XDP_REDIRECT is flexible
Remember to use the helper bpf_redirect_map to activate bulking
The devmap: BPF_MAP_TYPE_DEVMAP
The cpumap: BPF_MAP_TYPE_CPUMAP
AF_XDP - “xskmap”: BPF_MAP_TYPE_XSKMAP
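A hedged sketch of redirecting through a devmap (the map name and slot policy are illustrative; userspace fills the map slots with egress ifindexes via bpf_map_update_elem):

```
struct bpf_map_def SEC("maps") tx_port = {
	.type = BPF_MAP_TYPE_DEVMAP,
	.key_size = sizeof(int),
	.value_size = sizeof(int),
	.max_entries = 64,
};

SEC("xdp_redirect")
int xdp_redirect_prog(struct xdp_md *ctx)
{
	int vport = 0; /* slot in tx_port, chosen by some policy */

	/* bpf_redirect_map enables bulked redirect; flags must be 0
	 * on the kernels this talk covers */
	return bpf_redirect_map(&tx_port, vport, 0);
}
```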
Deep dive into the code behind XDP
Extending a driver with XDP support:
while (desc_in_rx_ring && budget_left--) {
	action = bpf_prog_run_xdp(xdp_prog, xdp_buff);
	/* helper bpf_redirect_map has set map (and index) via this_cpu_ptr */
	switch (action) {
	case XDP_PASS:
		break;
	case XDP_TX:
		res = driver_local_xmit_xdp_ring(adapter, xdp_buff);
		break;
	case XDP_REDIRECT:
		/* via xdp_do_redirect_map() pick up map info from helper */
		res = xdp_do_redirect(netdev, xdp_buff, xdp_prog);
		break;
	default:
		bpf_warn_invalid_xdp_action(action);
		/* fallthrough */
	case XDP_ABORTED:
		trace_xdp_exception(netdev, xdp_prog, action);
		/* fallthrough */
	case XDP_DROP:
		res = DRV_XDP_CONSUMED;
		break;
	} /* left out: acting on res */
}
/* At the end of the napi_poll call do: */
xdp_do_flush_map(); /* Bulk size chosen by map, which can store xdp_frame's for flushing */
driver_local_XDP_TX_flush();
Bulk via: helper bpf_redirect_map + xdp_do_redirect + xdp_do_flush_map
XDP puts certain restrictions on the RX memory model
Not supported: drivers that split a frame into several memory areas
Recent change: the memory return API
This allows drivers to implement different memory models per RX queue
Also an opportunity to share common RX-allocator code between drivers
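A hedged sketch of what a driver does with the memory return API when setting up an RX queue (the registration functions are the net/core/xdp.c API; the surrounding driver structure and field names are illustrative):

```
/* During RX-queue setup, register the queue with the XDP core */
err = xdp_rxq_info_reg(&rx_ring->xdp_rxq, netdev, rx_ring->queue_index);
if (err)
	goto err_out;

/* Tell XDP how pages from this queue are returned, e.g. via page_pool */
err = xdp_rxq_info_reg_mem_model(&rx_ring->xdp_rxq,
				 MEM_TYPE_PAGE_POOL, rx_ring->page_pool);
if (err)
	goto err_out;
```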
Thanks to all contributors