netfilter project logo people.netfilter.org developer blogs

Patrick McHardy's blog

Wed, 20 Aug 2008

nftables


After almost three weeks of silence, following is an update on the successor for iptables I've been working on. For lack of creativity, its called nftables so far, but I might still change that. Its writen from scratch and has lots of differences to the old design, so I'll start with a description from the bottom up of what happens in the kernel.

Don't read this if you're attending the netfilter workshop or you will be bored :)

The userspace interface

The userspace interface is, of course, based on netlink (nfnetlink to be precise). Three main object types exist: tables, chains and rules. As with iptables, tables are just containers for chains that belong to the same protocol family. Unlike with iptables, only tables that need special functionality (like NAT or mangle, now called route) are implemented in the kernel, normal tables can be created and deleted by userspace. Chains exist in two flavours, a base chain is a chain that registers a netfilter hook for receiving packets and is an entry point for a table. A table can contain an arbitary amount of base chains, even multiple for the same hook. There are no default counters for base chains anymore, which allows to only register the hooks when rules are present in the chain without userspace visible impact. Regular chains are just containers for rules and are similar to iptables chains. A rule is a container that contains multiple expressions and statements for matching packets and performing actions.

Expressions

One of my major dislikes with iptables was the huge amount of modules for performing very similar tasks (match on TCP ports, match on UDP ports, match on DCCP ports, match on mark value, match on connmark value. ...) with often slightly different feature sets (support for negation, support for ranges, ...) and the fact that every match or target available to the user required a kernel module implementing exactly that functionality. This is all gone :)

There is only one kind of extension, called an expression. It can represent statements (like log, verdicts), unary expressions (get TCP port), binary expressions (compare data) or (optionally runtime parameterized) targets. They communicate with the core or other expressions through a register stack. The register stack contains a number of general purpose registers and a special verdict register, which can be used to change control flow in the classification core. A single data type is used to represent any kind of data, be it ports, meta data, data gathered from stateful modules, or even verdicts, including jumps. Userspace is responsible for chaining expressions appropriately to get the desired semantic.

The modules implemented so far are:

  • A payload gathering module: this module loads payload data at a specified offset and length from the packet into a userspace specified register.
  • A meta data gathering module: this module loads meta data like mark, realm, ... into a userspace specified register.
  • A constant data module: this module loads data provided by userspace in a userspace specified register. Mainly used for returning verdicts.
  • A counter module: this modules counts bytes and packets.
  • A module for simple relational expressions: implements equality and relational expressions. When expression is false a verdict to abort evaluation of the current rule (NFT_BREAK) is returned.
  • Multiple modules for performing set lookups: these module take data from a userspace specified register and look it up in either a rbtree, a hash or a bitmap (depending on the range of the data and whether the set should be manipulable). When the value is not found, they return NFT_BREAK by default, which can be used for representing matches. They can alternatively load data associated with the key into a userspace defined register, which allows to implement data and verdict dictionaries.
  • Some additional modules like a NAT module, that are not particulary relevant for describing the new architecture.

Tieing it up

Here are a few examples of what this can be used for. Don't worry, the userspace interface presents this all in a nice fashion and users don't have to care about registers, offsets etc. :)

  • An exact match on the IP destination address:

    Syntax: ip daddr 192.168.0.1
        [ payload load 4 offset network header + 16 => reg 1 ]
        [ compare reg 1 192.168.0.1 ]
       
  • A match on a set of IP destination addresses:

    Syntax: ip daddr { 192.168.0.1, 192.168.0.2}
        [ payload load 4 offset network header + 16 => reg 1 ]
        [ set lookup reg 1 { 192.168.0.1, 192.168.0.2 } ]
       
  • A verdict dictionary for mapping the IP destination address to verdicts:

    Syntax: ip daddr { 192.168.0.1 => jump chain1, 192.168.0.2 => drop, 192.168.0.3 => jump chain2 }
        [ payload load 4 offset network header + 16 => reg 1 ]
        [ set lookup reg 1 load result in verdict register
          { "192.168.0.1" : jump chain1,
            "192.168.0.2" : drop,
            "192.168.0.3" : jump chain2 } ]
       
  • A generic dictionary used for mapping IP destination addresses to netfilter marks:

    Syntax: meta mark map ip daddr { 192.168.0.1 => 0x1, 192.168.0.2 => 0x5, 192.168.0.3 => 0x2 }
        [ payload load 4 offset network header + 16 => reg 1]
        [ set lookup reg 1 load result in reg1
          { "192.168.0.1" : 0x1,
            "192.168.0.2" : 0x5,
            "192.168.0.3" : 0x2 } ]
        [ meta mark reg 1 ]
       

More cool stuff - multidimensional exact matches in O(1)

This feature is called concatentations, which allows to dynamically concatenate multiple keys and use them for a lookup. This allows to do multidimensional exact matches in constant time when combined with a hash based set. This example shows how to use concatenations for filtering on (mac, ip saddr) combinations for antispoofing:

Syntax: mac src . ip saddr { 00:1b:21:02:6f:ad . 192.168.0.100, 00:01:36:0d:d0:71 . 192.168.0.1 }

The kernel side implementation is not done yet though :) There are two ways how to represent this, either by specifying offsets with data loads:

  [ payload load 6 offset linklayer header + 6 => reg 1 ]
  [ payload load 4 offset network header + 12  => reg 1 offset 6 ]
  [ set lookup reg 1
    { 00:1b:21:02:6f:ad . 192.168.0.100,
      00:01:36:0d:d0:71 . 192.168.0.1 } ]
 
alternatively, concatenation may be implemented as separate operation:
  [ payload load 6 offset linklayer header + 6 => reg 1 ]
  [ payload load 4 offset network header + 12  => reg 2 ]
  [ concat reg 1, reg 2 => reg 1 ]
  [ set lookup reg 1
    { 00:1b:21:02:6f:ad . 192.168.0.100,
      00:01:36:0d:d0:71 . 192.168.0.1 } ]
 
The first way would be preferrable, but it mainly depends on how much overhead this adds for the much more common case of not using concatentations in a rule.

Besides the more powerful expressions listed above, there are a few simpler ones as well, offering basic binary and logical operations etc. So far not many target modules exist, but some things that have been going through my head:

  • Parameterized hashlimit: hashlimit based on user-defined keys, something like
    hashlimit ip saddr / 24 . ip daddr / 24
    for instantiating a limit state for each combination of /24 networks talking to each other.
  • Parameterized NAT target: could be used to represent masquerading:
    snat ifaddr meta oif
    or NAT pools:
    snat map ip daddr & 0xf { 0 => addr1, 1 => addr2, 2 => addr3, ... }
    or randomized NAT pools:
    snat map random & 0xf { 0 => addr1, 1 => addr2, 2 => addr3, ... }

Userspace

Netlink communication

Low level netlink communication is entirely implemented in libnl. It allows to create tables, chains, rules, expressions and data, but doesn't know anything about higher level constructs.

The userspace frontend (nftables)

This is what the users actually interact with and where the intelligence lies. Since its still very much in flux, only a few bullet points.

From textual representation to the kernel

The parser is bison based with a real grammar. I might replace the parser later on since bison has its own set of problems, but using it during development is very useful for quickly changing grammar and verifying that its non ambiguous. It performs basic semantical validation and constructs a syntax tree, which is then post-processed. So far post-processing consists of:

  • More semantical validation for invalid constructs that can't be determined during parsing.
  • Byteorder conversions: constant values have their byteorder converted to match the byteorder of dynamically gathered data.
  • Type checks and conversions: types are checked for compatibility, when necessary (and possible) constant values are promoted or demoted to match the type of dynmically gathered data.
  • Dependency generation: some expressions have dependencies that are generated automatically when possible to save the user from unnecessary complications. This mainly affects higher level protocol matches, like "tcp dport 22", which requires to make sure the transport layer protocol is actually TCP, so the dependency is "ip protocol tcp".
There are still a few important things missing, like constant folding and propagating binary, logical and arithmetic operations from dynamically gathered data to the constant side of a relational expression, when possible.

The post processed tree is then linearized and fed into libnl. During linearization register allocation is performed to propagate values as required by an expression.

Reconstructing textual representation

A very important feature, one that is missing from all other filters that are built similar in the kernel (like BPF, TC u32 filter, ...), is reconstruction of high level constructs from the representation within the kernel. TC u32 for example allows you to specify "ip daddr X", but when dumping the filter rules it will just display an offset and length.

So when dumping a ruleset, nftables will reverse the steps performed during post-processing and linearization, which means reconstructing the syntax tree based on how expressions are chained, reconstructing high level meaning of payload and other expressions, eliminating redundant dependency expressions, etc. This works pretty well and the dump output currently looks exactly like the user specified input. This will not be 100% possible in the future though since f.i. constant folding is not reversible.

Ideas ...

A few ideas what else might be done in userspace. Most of it isn't planned for an initial release though.

  • Since we have an abstract representation of the entire ruleset, we can easily determine different match dimensions dynamically. This allows to perform optimization on a ruleset (or maybe chain) level by reordering rules, combining similar matches into sets, etc. It also allows to detect ineffective rules due to shadowing and eliminate them or warn the user.
  • Improved register allocation: so far register allocation is performed on a per rule level. This is necessary if individual rules may be added or deleted since we can't rely on register contents as set up by previous rules. With a small change to mark chains immutable however, redundant loads can be eliminated. Consider for example these two rules:
        udp dport 53 ...
        tcp dport 22 ...
       
    Both expression refer to the same payload (2b @ transport header + 2), but their preconditions differ (ip protocol 17 vs. ip protocol 6). Their representation looks like this:
        rule 1:
        [ payload load 1 offset network header + 9 => reg 1 ]
        [ compare reg 1 17 ]
        [ payload load 2 offset transport header + 2 => reg 1 ]
        [ compare reg 1 53 ]
    
        rule 2:
        [ payload load 1 offset network header + 9 => reg 1 ]
        [ compare reg 1 6 ]
        [ payload load 2 offset transport header + 2 => reg 1 ]
        [ compare reg 1 22 ]
       
    Both the protocol load and port load are redundant, so this can be written as:
        [ payload load 1 offset network header + 9 => reg 1 ]
        [ payload load 2 offset transport header + 2 => reg 2 ]
    
        rule 1:
        [ compare reg 1 17 ]
        [ compare reg 2 53 ]
    
        rule 2:
        [ compare reg 1 6 ]
        [ compare reg 2 22 ]
       
    or even more compact by using concatenations:
        [ payload load 1 offset network header + 9 => reg 1 ]
        [ payload load 2 offset transport header + 2 => reg 1 offset 1 ]
    
        rule 1:
        [ compare reg 1 17 . 53 ]
    
        rule 2:
        [ compare reg 1 6 . 22 ]
       
    which brings the number of expressions down to 1/2.

Copyright (C) 2001-2005 Patrick McHardy