Patrick McHardy's blog
Wed, 20 Aug 2008
nftables
After almost three weeks of silence, here is an update on the successor
to iptables I've been working on. For lack of creativity, it's called nftables so
far, but I might still change that. It's written from scratch and differs from
the old design in many ways, so I'll start with a description, from the bottom up,
of what happens in the kernel.
The userspace interface
The userspace interface is, of course, based on netlink (nfnetlink to be precise). Three main object types exist: tables, chains and rules.

As with iptables, tables are just containers for chains that belong to the same protocol family. Unlike with iptables, only tables that need special functionality (like NAT or mangle, now called route) are implemented in the kernel; normal tables can be created and deleted by userspace.

Chains exist in two flavours. A base chain is a chain that registers a netfilter hook for receiving packets and is an entry point for a table. A table can contain an arbitrary number of base chains, even multiple for the same hook. There are no default counters for base chains anymore, which makes it possible to register the hooks only when rules are present in the chain, without any userspace-visible impact. Regular chains are just containers for rules and are similar to iptables chains.

A rule is a container that contains multiple expressions and statements for matching packets and performing actions.

Expressions
One of my major dislikes with iptables was the huge number of modules for performing very similar tasks (match on TCP ports, match on UDP ports, match on DCCP ports, match on mark value, match on connmark value, ...) with often slightly different feature sets (support for negation, support for ranges, ...) and the fact that every match or target available to the user required a kernel module implementing exactly that functionality. This is all gone :)

There is only one kind of extension, called an expression. It can represent statements (like log, verdicts), unary expressions (get TCP port), binary expressions (compare data) or (optionally runtime parameterized) targets. They communicate with the core or other expressions through a register stack. The register stack contains a number of general purpose registers and a special verdict register, which can be used to change control flow in the classification core.
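To make the register model a bit more concrete, here is a minimal sketch in Python rather than kernel C. It models a chain of rules, where each rule is a list of expressions that share a small register file and a verdict register. All names in it (Registers, payload_load, cmp_eq, the verdict strings and the BREAK marker for "this rule did not match") are assumptions made for illustration and are not taken from the actual nftables code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical verdict values; the real encoding is not described in this post.
CONTINUE, BREAK, ACCEPT, DROP = "continue", "break", "accept", "drop"


@dataclass
class Registers:
    """A few general purpose registers plus one special verdict register."""
    data: List[bytes] = field(default_factory=lambda: [b""] * 4)
    verdict: str = CONTINUE


# An expression reads the packet and reads or writes registers.
Expression = Callable[[Dict[str, bytes], Registers], None]


def payload_load(key: str, dreg: int) -> Expression:
    """Unary expression: copy a packet field into register dreg."""
    def run(pkt: Dict[str, bytes], regs: Registers) -> None:
        regs.data[dreg] = pkt[key]
    return run


def cmp_eq(sreg: int, value: bytes) -> Expression:
    """Binary expression: on mismatch, abort the current rule."""
    def run(pkt: Dict[str, bytes], regs: Registers) -> None:
        if regs.data[sreg] != value:
            regs.verdict = BREAK
    return run


def verdict(v: str) -> Expression:
    """Statement: write a final verdict into the verdict register."""
    def run(pkt: Dict[str, bytes], regs: Registers) -> None:
        regs.verdict = v
    return run


@dataclass
class Rule:
    expressions: List[Expression]


@dataclass
class Chain:
    name: str
    rules: List[Rule]
    hook: Optional[str] = None  # only base chains are attached to a hook


def evaluate(chain: Chain, pkt: Dict[str, bytes]) -> str:
    """Run each rule's expressions in order until one sets a final verdict."""
    for rule in chain.rules:
        regs = Registers()  # fresh registers for every rule
        for expr in rule.expressions:
            expr(pkt, regs)
            if regs.verdict != CONTINUE:
                break  # the verdict register changed control flow
        if regs.verdict in (ACCEPT, DROP):
            return regs.verdict
    return CONTINUE


def port(n: int) -> bytes:
    return n.to_bytes(2, "big")


# Accept packets to TCP port 22, drop everything else.
chain = Chain("input", hook="input", rules=[
    Rule([payload_load("tcp_dport", 0), cmp_eq(0, port(22)), verdict(ACCEPT)]),
    Rule([verdict(DROP)]),
])
```

The point of the design is visible even in this toy: payload_load and cmp_eq know nothing about TCP, so the same two expressions serve UDP ports, marks or anything else that fits in a register.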
A single data type is used to represent any kind of data, be it ports, metadata, data gathered from stateful modules, or even verdicts, including jumps. Userspace is responsible for chaining expressions appropriately to get the desired semantics. The modules implemented so far are:
Tying it up
Here are a few examples of what this can be used for. Don't worry, the userspace interface presents this all in a nice fashion and users don't have to care about registers, offsets etc. :)
More cool stuff - multidimensional exact matches in O(1)
This feature is called concatenations. It allows multiple keys to be
concatenated dynamically and used for a single lookup. Combined with a hash based
set, this gives multidimensional exact matches in constant time. This example shows
how to use concatenations for filtering on (mac, ip saddr) combinations for antispoofing:
[ payload load 6 offset linklayer header + 6 => reg 1 ]
[ payload load 4 offset network header + 12 => reg 1 offset 6 ]
[ set lookup reg 1
{ 00:1b:21:02:6f:ad . 192.168.0.100,
00:01:36:0d:d0:71 . 192.168.0.1 } ]
Alternatively, concatenation may be implemented as a separate operation:
[ payload load 6 offset linklayer header + 6 => reg 1 ]
[ payload load 4 offset network header + 12 => reg 2 ]
[ concat reg 1, reg 2 => reg 1 ]
[ set lookup reg 1
{ 00:1b:21:02:6f:ad . 192.168.0.100,
00:01:36:0d:d0:71 . 192.168.0.1 } ]
The first way would be preferable, but it mainly depends on how much
overhead this adds for the much more common case of not using
concatenations in a rule.
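The two variants can be compared side by side in a small Python sketch. It assumes an untagged Ethernet frame (14 byte link layer header) carrying IPv4, and uses a Python set as a stand-in for the hash based set; the helper names (mac, ip, make_frame, lookup_variant_1, lookup_variant_2) are made up for this illustration.

```python
import ipaddress

ETH_HLEN = 14  # assumption: untagged Ethernet, no VLAN header


def mac(text: str) -> bytes:
    """Convert aa:bb:cc:dd:ee:ff into 6 raw bytes."""
    return bytes.fromhex(text.replace(":", ""))


def ip(text: str) -> bytes:
    """Convert a dotted quad into 4 raw bytes."""
    return ipaddress.IPv4Address(text).packed


# Set elements are the concatenated (mac . ip saddr) keys, 10 bytes each.
ALLOWED = {
    mac("00:1b:21:02:6f:ad") + ip("192.168.0.100"),
    mac("00:01:36:0d:d0:71") + ip("192.168.0.1"),
}


def lookup_variant_1(frame: bytes) -> bool:
    """Both payload loads write into one register at different offsets."""
    reg1 = bytearray(10)
    reg1[0:6] = frame[6:12]                               # linklayer header + 6
    reg1[6:10] = frame[ETH_HLEN + 12:ETH_HLEN + 16]       # network header + 12
    return bytes(reg1) in ALLOWED                         # hash lookup, O(1) on average


def lookup_variant_2(frame: bytes) -> bool:
    """Load into two registers, then concatenate as a separate operation."""
    reg1 = frame[6:12]
    reg2 = frame[ETH_HLEN + 12:ETH_HLEN + 16]
    reg1 = reg1 + reg2                                    # concat reg 1, reg 2 => reg 1
    return reg1 in ALLOWED


def make_frame(src_mac: str, src_ip: str) -> bytes:
    """Build a minimal Ethernet + IPv4 header pair for testing."""
    eth = bytes(6) + mac(src_mac) + b"\x08\x00"           # dst, src, ethertype IPv4
    iphdr = bytearray(20)
    iphdr[0] = 0x45                                       # version 4, header length 5 words
    iphdr[12:16] = ip(src_ip)
    return eth + bytes(iphdr)


good = make_frame("00:1b:21:02:6f:ad", "192.168.0.100")
spoofed = make_frame("00:1b:21:02:6f:ad", "192.168.0.1")
```

Both functions build the same 10 byte key, so they always agree. The difference is only in how many operations run per packet, which is the overhead question raised above.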
Besides the more powerful expressions listed above, there are a few simpler ones as well, offering basic binary and logical operations etc. So far not many target modules exist, but some things that have been going through my head:
Userspace
Netlink communication
Low level netlink communication is entirely implemented in libnl. It can create tables, chains, rules, expressions and data, but doesn't know anything about higher level constructs.

The userspace frontend (nftables)
This is what the users actually interact with and where the intelligence lies. Since it's still very much in flux, only a few bullet points.

From textual representation to the kernel
The parser is bison based with a real grammar. I might replace the parser later on since bison has its own set of problems, but using it during development is very useful for quickly changing the grammar and verifying that it's unambiguous. It performs basic semantic validation and constructs a syntax tree, which is then post-processed. So far post-processing consists of:
The post-processed tree is then linearized and fed into libnl. During linearization, register allocation is performed to propagate values as required by an expression.

Reconstructing textual representation
A very important feature, one that is missing from all other filters that are built similarly in the kernel (like BPF, TC u32 filter, ...), is reconstruction of high level constructs from the representation within the kernel. TC u32 for example allows you to specify "ip daddr X", but when dumping the filter rules it will just display an offset and length. So when dumping a ruleset, nftables will reverse the steps performed during post-processing and linearization, which means reconstructing the syntax tree based on how expressions are chained, reconstructing the high level meaning of payload and other expressions, eliminating redundant dependency expressions, etc. This works pretty well, and the dump output currently looks exactly like the user specified input. This will not be 100% possible in the future though, since for instance constant folding is not reversible.

Ideas ...
A few ideas for what else might be done in userspace. Most of it isn't planned for an initial release though.
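Coming back to linearization and reconstruction for a moment, the round trip can be illustrated with a toy Python sketch. It handles only rules of the form "field == address", uses a trivial register allocation (always register 1), and relies on a small template table mapping field names to (header, offset, length). The table, the tuple based instruction format and the names linearize and delinearize are assumptions for illustration, not the real nftables data structures.

```python
import ipaddress
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical templates: field name -> (header base, byte offset, byte length).
FIELDS: Dict[str, Tuple[str, int, int]] = {
    "ip saddr": ("network", 12, 4),
    "ip daddr": ("network", 16, 4),
}
# Reverse mapping, used to recover the high level name when dumping.
BY_LOCATION = {loc: name for name, loc in FIELDS.items()}


@dataclass(frozen=True)
class Match:
    """Syntax tree node for '<field> == <address>'."""
    field: str
    value: str


def linearize(matches: List[Match]) -> List[tuple]:
    """Turn syntax tree nodes into a flat list of register based instructions."""
    instrs: List[tuple] = []
    for m in matches:
        reg = 1  # the loaded value is consumed immediately, so reg 1 can be reused
        base, offset, length = FIELDS[m.field]
        instrs.append(("payload", base, offset, length, reg))
        instrs.append(("cmp", reg, ipaddress.IPv4Address(m.value).packed))
    return instrs


def delinearize(instrs: List[tuple]) -> List[Match]:
    """Rebuild the syntax tree by tracking what each register currently holds."""
    matches: List[Match] = []
    loaded: Dict[int, str] = {}
    for ins in instrs:
        if ins[0] == "payload":
            _, base, offset, length, reg = ins
            loaded[reg] = BY_LOCATION[(base, offset, length)]
        elif ins[0] == "cmp":
            _, reg, data = ins
            matches.append(Match(loaded[reg], str(ipaddress.IPv4Address(data))))
    return matches


rule = [Match("ip daddr", "192.168.0.1"), Match("ip saddr", "10.0.0.1")]
program = linearize(rule)
```

Without the reverse table, a dump could only show "network header + 16, 4 bytes", which is the u32 situation described above.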