people.netfilter.org developer blogs

Patrick McHardy's blog

Mon, 13 Oct 2008

iptables 1.4.2 / 2.6.28 merges


I've just released iptables 1.4.2, the iptables release for Linux 2.6.27. Apologies for the delay (I should have released it about two weeks ago); the netfilter workshop and the opening of the 2.6.28 merge window have been keeping me busy.

The most significant change in iptables 1.4.2 is the set of scalability improvements from Jesper Dangaard Brouer. Besides that there have been a lot of smaller changes and updates; check out the changelog for all the details. I'll soon release a 1.4.3-rc1 after merging the patches adding support for the new features in 2.6.28.

On the kernel front, the networking merge window for 2.6.28 is closed now, only bugfixes are going in from this point forward. The netfilter-related changes for 2.6.28 include:

  • A big network namespace update: the major netfilter components are fully covered now and netfilter network namespace has been enabled. In case someone wants to play with it, the container guys have released some userspace tools.
  • Changes to make ebtables and arptables use the xtables infrastructure, accompanied by decoupling of netfilter families from real protocol families and changes to make matches and targets usable in any family without duplicating all the registrations.
  • After 6 years, TPROXY (the 6th incarnation if I'm not mistaken) is finally merged! It currently only supports IPv4, but according to Krisztian and Balazs, IPv6 support should be real easy to add. This means we'll hopefully soon also support transparent proxying for IPv6, which so far hasn't been possible because of a lack of NAT support. Thanks to Krisztian and Balazs for staying so persistent, I probably would have given up years ago.
Since the work on nftables ate up all my time, there are only three measly patches from myself, none of which are even worth mentioning here. Speaking of which, I still haven't managed to release the nftables userspace code. I caught a flu last weekend and am not feeling too well, so it might be a couple more days.

Mon, 06 Oct 2008

Netfilter Workshop 2008


I've returned from the Netfilter Workshop yesterday and am slowly getting into shape again after an amazing, but also quite exhausting week in Paris.

The workshop was split into three parts:

  • The user day, which is open to the public and is meant to provide an opportunity for users to meet the developers and vice versa. Unfortunately I missed most of it because I arrived late.
  • The developer workshop: this is the main part of the workshop, with 20-25 participants, consisting of the core team, the major contributors and some heavy users, developing products based on netfilter.
  • The hacking days: two optional days at the end for presenting things in smaller rounds, talking about specific technical issues and so on.
We've had lots of interesting presentations about ongoing work during the developer workshop, there's good coverage on the INL workshop blog, so I won't repeat this here.

I held multiple presentations myself, starting with an overview of the netfilter developments since last year. All the presentations should be available on the workshop site soon, until then you can find it here. I was happy to learn that Jesper Brouer did some benchmarking of the RCU changes and hashing optimizations I made in 2.6.25 and they resulted in a huge performance boost on some heavily loaded systems, IIRC in the order of 50%.

The second presentation I gave was about my nftables work. There was a lot of valuable feedback and I have my work for the next months set. Some of the bigger items include:
  • Multi-family tables: often rulesets between IPv4 and IPv6 differ only slightly because only a few rules actually contain address family specific matches, yet all the other rules have to be duplicated.
  • TC integration: there's a huge benefit in being able to use the same filter syntax and features, so I want to integrate it natively with TC. It shouldn't be much work, but it should be done before merging to make sure we don't run into userspace interface limitations.
  • Userspace API for operating on sets: currently sets are populated when creating a rule, which is quite inflexible. Additionally it limits the size of sets since netlink messages and attributes are limited to 64k, so we need to be able to incrementally populate sets.
  • Automatic selection of set implementations: currently set implementations are selected by name provided by userspace. Since only the kernel knows about the available implementations and which one is best suited for some specific data characteristics, it should perform the selection.
  • Way too much work on the userspace frontend to describe here. I'll start working on a TODO list soon.
  • Harald asked for some higher level library functions in userspace to enable 3rd party applications to easily change filtering rules. It doesn't really fit in the design too well since userspace is effectively a compiler, but a standalone library providing some common helpers is indeed a good idea.
  • Additionally Harald suggested providing a description of filtering features to, let's say, UIs, so they can detect new features automatically and offer them to the user. While I initially was quite sceptical of the technical feasibility of this, it should be possible to do this for what will probably be the major kinds of new features: new primary expressions that describe things like new meta data fields and, within limits, new target modules. I like this idea more and more since we've never had any usable interface for userspace applications, which severely limited their possibilities. Offering this will hopefully result in a lot of interesting new applications.
Some people voiced concerns that the very integrated approach nftables is taking might increase the entry barrier for new contributors. While I don't consider it bad to have slightly higher barriers than what we have now, I don't believe this will happen. First of all, a lot of things that previously required hacking the kernel can now be done in userspace, which people are usually more familiar and more comfortable with. In fact I hope this will lead to a shift in contributions from new matches to new algorithmic improvements - so far this wasn't really feasible and the few people who've tried usually ended up writing an entire new userspace implementation. Second of all, new matches can be implemented even more easily than previously since you only have to add the part collecting the data and provide a description of byteorder, length etc. to the userspace frontend. It will automatically be usable with all relational expressions, sets, dictionaries, as argument for parameterized targets and so on. Userspace integration is also quite easy since the generic parser takes care of most of the things.

I've started releasing the code, the kernel part and the libnl part are available on the netfilter git server, and temporarily because of memory shortage causing git failures, on git.kernel.org as well. The userspace frontend code contains one or two spots well below my own standards which I want to clean up before releasing it. Expect an update within the next couple of days.

Besides the workshop and technical stuff, we've also had lots of nice dinners and parties, including a boat ride on the Seine with good food and a seemingly endless supply of champagne and other beverages, a nice party at an art gallery with more champagne, and even for lunch we usually enjoyed some great French cuisine. It was nice to meet all the people working on netfilter again; it's a very friendly bunch of people and a highly cooperative atmosphere, which makes working on the project a real pleasure.

It's also great to see that after 10 years, the project is still very alive and active and enjoys a great deal of support from both companies and users. This year INL took care of the organization and did a truly amazing job. A big thanks to Eric Leblond, Vincent Deffontaines and everyone else from INL.

The project also wouldn't be where it is without all the support Astaro has been providing over the years, starting with funding Harald's and my work for many years now, being the main sponsor for every workshop we've had so far and a lot more. A big thanks to Gert Hansen, Markus Hennig and Jan Hichert from Astaro.

Even with all the support from INL and Astaro, this year's workshop wouldn't have been possible without all the other sponsors (Paris is insanely expensive), thanks as well to:

Some pictures from Paris ...

The group picture

Pablo tries to convince Dave to merge his patches :)

"Oh crap, what have I done" :)

Lunch with the INL guys at a nice restaurant - paid for privately of course.

Tue, 02 Sep 2008

nftables part II


A lot has happened since my last posting about nftables, so here's an update with the current status. I unfortunately failed to reach my goal of being able to replace my iptables based firewall by the end of August, but I'm getting close :)

New kernel modules

I've added a few new kernel modules for missing functionality:

  • A conntrack module, replacing xt_state, xt_conntrack, xt_helper and xt_connmark. It loads a specified item from struct nf_conn and related structs into a user defined register.
  • A logging module, which simply calls nf_log to pass the packet to the active logging backend.
  • A limit module, replacing xt_limit. While at it, I extended the possible range of values so we can use higher limits and don't lose as much precision. Hashlimit is still missing; I'm considering merging them into a single module.
  • A module to wire up bridging with nftables. It consists only of a few lines of code to register a data structure containing the highest hooks value, a module reference and the family. Transport header payload matching doesn't work yet though since it also needs to initialize the offsets.
For all these modules I've also implemented their respective userspace counterparts in libnl and the nftables userspace frontend. With these modules, a large part of the existing iptables and ip6tables matches are covered. More about that below ...

Payload expressions

I've added tons of new payload descriptions, it now supports:

  • Ethernet header: fully covered
  • IP header: no options
  • ICMP header: fully covered
  • IPv6 header: fully covered
  • AH header: fully covered
  • ESP header: fully covered
  • COMP header: fully covered
  • UDP/UDP-Lite header: fully covered
  • TCP header: no options
  • DCCP header: only ports
  • SCTP header: only base header members (ports, vtag, checksum)
Effectively this means it's fully covering the fixed size portion of all headers. Chained header parsing can't be expressed very efficiently using offsets and lengths, so I'll probably add special modules for them. So far it seems we need an IP-style option parsing module for IP, TCP and DCCP options, as well as an IPv6 extension header parsing module. And then there is SCTP ...

Optimizations

Constant folding

Constant folding is now implemented; propagation of operations to the constant side is still missing.

Adjacent payload load merging

I've added a first rule level optimization, adjacent payload expression merging. When payload expressions refer to consecutive header fields, we can merge the load operation

  • when the payload expressions are contained in match statements and the match expressions are equality expressions. In that case, we can simply join all the LHS and the RHS expressions (after reordering them to match the order of the fields in the header).

    Syntax: ip saddr 192.168.0.1 daddr 192.168.0.100
           [ payload load 8b @ network header + 12 => reg 1 ] 
           [ cmp reg 1 0x0100a8c0 0x6400a8c0 ] 
       
  • when the payload expressions are part of a concatenation. This is by definition a merged expression.
During ruleset dumping, the merged expressions are expanded again to the original components.

So far the expressions are mindlessly merged together, without regard to maximum size, alignment or whatever. The size must of course be taken into account since the kernel data types are fixed. Alignment currently doesn't matter, but I'm planning to optimize small, aligned data loads in the kernel, either by overloading the evaluation function or just special-casing. Providing the ability to overload operations probably would be useful for other common case optimizations.
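To make the merging step a bit more concrete, here is a minimal Python sketch of the idea. It is not the nftables code: the (offset, length, value) tuple representation and the function name are assumptions made for illustration, and like the current implementation it ignores maximum size and alignment.

```python
import ipaddress

def merge_adjacent_loads(matches):
    """Merge equality matches on consecutive header fields.

    Each match is a hypothetical (offset, length, value) tuple, with the
    offset relative to the network header and the value as raw bytes.
    """
    merged = []
    # Reorder the matches to follow the order of the fields in the header.
    for offset, length, value in sorted(matches):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # This field starts exactly where the previous load ends:
            # join the loads and concatenate the RHS constants.
            prev_offset, prev_length, prev_value = merged[-1]
            merged[-1] = (prev_offset, prev_length + length, prev_value + value)
        else:
            merged.append((offset, length, value))
    return merged

# ip saddr 192.168.0.1 daddr 192.168.0.100, given in reverse order on purpose
saddr = (12, 4, ipaddress.IPv4Address("192.168.0.1").packed)
daddr = (16, 4, ipaddress.IPv4Address("192.168.0.100").packed)
merged = merge_adjacent_loads([daddr, saddr])
print(merged)  # a single 8 byte load at network header + 12
```

A real version would additionally have to stop merging once the joined load exceeds the size of the kernel data type.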

Value range tracking

This is not fully finished yet, but will allow two nice things when it is. The purpose is to track constraints of dynamic expressions when operations are applied to them. For instance, it's easy to see that the expression "ip daddr & 0xf" can only take the values 0x0-0xf. More generally, we can track constraints bitwise and use them to determine the possible input range for lookups in dynamic sets and use that to choose the optimal representation. It can also be used to determine ineffective matches, but it only works on each expression separately and thus is just a subset of ineffective rule detection.
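A rough Python sketch of bitwise constraint tracking, under the assumption that it is enough to remember which bits of a value may possibly be set. The class and method names are made up for this example and only the AND operation is modelled.

```python
class BitConstraint:
    """Tracks which bits of a dynamic value may possibly be set."""

    def __init__(self, width, maybe_set=None):
        self.width = width
        # Without further knowledge, every bit of the field may be set.
        self.maybe_set = (1 << width) - 1 if maybe_set is None else maybe_set

    def and_const(self, mask):
        # After "value & mask" only bits present in the mask can survive.
        return BitConstraint(self.width, self.maybe_set & mask)

    def value_range(self):
        # Smallest and largest value the expression can take.
        return 0, self.maybe_set

    def can_equal(self, constant):
        # A comparison is ineffective if the constant needs a bit that
        # can never be set in the dynamic value.
        return (constant & ~self.maybe_set) == 0

daddr_low = BitConstraint(32).and_const(0xf)   # ip daddr & 0xf
print(daddr_low.value_range())
print(daddr_low.can_equal(0x10))
```

The range could then be used to pick, say, a bitmap for a 16 element input range, and `can_equal` flags a match like "ip daddr & 0xf == 0x10" as one that can never succeed.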

Type checks and error reporting

Expressions are fully checked for type compatibility now. This includes sets, maps and concatenations.

  • Basic type mismatches:

    Syntax: rule add filter output ip daddr 22
        :1:33-34: Error: Datatype mismatch: expected IPv4 address, got numeric
        rule add filter output ip daddr 22
                               ~~~~~~~~ ^^
       
  • Set type mismatches:

    Syntax: rule add filter output ip daddr { 22, 23}
        :1:35-36: Error: Datatype mismatch: expected IPv4 address, got numeric
        rule add filter output ip daddr { 22, 23}
                               ~~~~~~~~   ^^     
       
  • Map type mismatches (only RHS shown, LHS is similar to set):

    Syntax: rule add filter output meta mark == ip daddr map { 192.168.0.1 => 10.0.0.1 }
        :1:67-74: Error: Datatype mismatch: expected numeric, got IPv4 address
        rule add filter output meta mark == ip daddr map { 192.168.0.1 => 10.0.0.1 }
                               ~~~~~~~~~                                  ^^^^^^^^  
       
  • Concatenation type mismatches:

    Syntax: rule add filter output ip saddr . daddr { 192.168.0.1 . 22 }
        :1:60-61: Error: Datatype mismatch: expected IPv4 address, got numeric
        rule add filter output ip saddr . daddr { 192.168.0.1 . 22 }
                                          ~~~~~                 ^^  
       
Type checks happen during type promotion, which needs to fully expand the expression, so as shown in the map example, type checks are even performed across multiple operations.
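As a simplified illustration of the set case, the following Python sketch checks every element of a set against the datatype expected by the LHS. The helper names are hypothetical, location tracking is left out, and unlike the output above it reports every mismatching element instead of a single one.

```python
import ipaddress

def datatype_of(token):
    """Classify a constant token (only two datatypes in this sketch)."""
    try:
        ipaddress.IPv4Address(token)
        return "IPv4 address"
    except ValueError:
        return "numeric" if token.isdigit() else "unknown"

def check_set(expected, elements):
    """Return one error message per element whose type doesn't match."""
    errors = []
    for element in elements:
        got = datatype_of(element)
        if got != expected:
            errors.append(f"Datatype mismatch: expected {expected}, got {got}")
    return errors

# ip daddr { 22, 23 }
errors = check_set("IPv4 address", ["22", "23"])
for error in errors:
    print("Error:", error)
```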

Open problems

There have been more changes, but I'll write about those another time. There are of course also a lot of open problems, mostly nothing too complicated, but also a few tricky ones.

Match expression parsing

So far, all header expressions except addresses are treated as numerical values. So you can't say "tcp flags SYN/SYN,ACK", but have to specify it numerically. This needs to be fixed of course. The main problem is that both sides of a match expression are constructed through separate productions; for example, the rule for a relational expression is:

  relational_expr	: expr	relational_op	expr
 
The LHS expression might be a payload expression and the RHS a constant to compare. When parsing the RHS, we don't have the necessary context to accept special tokens related to the LHS. The reason for specifying the grammar like this is that dynamic expressions can occur not only in matches, but also as arguments to targets or f.i. mappings.

There are basically two possibilities to fix this that I can think of.
  • Introduce new subtypes for every numerical type that can take symbolic values, like TCP flags, IP options, realms, etc. The downside is that the number of types is potentially huge.
  • Add special parsing callbacks to every dynamic expression for the RHS and accept unquoted strings as "to be resolved" type. The main problem with this approach is that it's pretty likely that sooner or later, keywords and arguments would clash, requiring increasing amounts of ugliness in the grammar.
Other suggestions are happily accepted because I'm not fully convinced of either possibility, though I'm leaning towards the first one. The alternative that's sounding more and more attractive is to use a hand-written parser.
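To illustrate what the first possibility could look like once the parser knows the subtype of the LHS, here is a small Python sketch of a numerical subtype with a symbol table. The bit values are the standard TCP header flag bits; the function names and the reading of "SYN/SYN,ACK" as value/mask are assumptions of this sketch, not the nftables syntax.

```python
# Symbol table for a hypothetical "TCP flags" subtype of the numeric type.
TCP_FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}

def parse_flags(text, symbols=TCP_FLAGS):
    """Resolve a comma separated list of symbolic values to a number."""
    value = 0
    for name in text.split(","):
        value |= symbols[name]
    return value

def parse_flag_match(text):
    """Parse "value/mask" with both sides given symbolically."""
    value_text, mask_text = text.split("/")
    return parse_flags(value_text), parse_flags(mask_text)

value, mask = parse_flag_match("SYN/SYN,ACK")
print(hex(value), hex(mask))
```

The downside mentioned above shows up quickly: realms, IP options and every other symbolic field would each need their own table and subtype.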

Again, match expression parsing

Another problem resulting from the fact that we don't have any context once an expression is parsed is that multiple related matches are cumbersome to specify. I've actually been cheating in my examples and shown the syntax as it should be, not as it currently is. For example you can't write "tcp sport 1024: dport 22", but have to write "tcp" twice to provide the necessary context: "tcp sport 1024: tcp dport 22". The only way I see that might *possibly* work (while keeping bison) is to do some YYBACKUP look-ahead token manipulation. Not too appealing either ...

Fri, 29 Aug 2008

Kernel Summit


I've finished my travel arrangements and I'll be in Portland from Sep. 14th until Sep. 21st. for Kernel Summit and the Linux Plumbers Conference. I'm especially looking forward to the Plumbers Conference since the topics look quite interesting and thematically the conference seems to nicely fill a void. There's even a mention of iptables in one of the presentations titled Where should the Line be drawn between kernel and user space? :

"An obvious example of this is ip_tables; it occupies a massive amount of kernel code, all for the purpose of enforcing user defined policy on incoming and outgoing network packets. The complexity of the policy (you try configuring ip_tables in all its glory) necessitates the huge size of the code base to do this. Unfortunately, IP packet routing policy can't simply be ejected to user space if one expects the IP stack to function at speed on a modern network, so we're apparently stuck with it."

I'm optimistic the code base in the kernel will become a lot smaller in the future :)

Wed, 20 Aug 2008

nftables


After almost three weeks of silence, here is an update on the successor for iptables I've been working on. For lack of creativity, it's called nftables so far, but I might still change that. It's written from scratch and has lots of differences to the old design, so I'll start with a description from the bottom up of what happens in the kernel.

Don't read this if you're attending the netfilter workshop or you will be bored :)

The userspace interface

The userspace interface is, of course, based on netlink (nfnetlink to be precise). Three main object types exist: tables, chains and rules. As with iptables, tables are just containers for chains that belong to the same protocol family. Unlike with iptables, only tables that need special functionality (like NAT or mangle, now called route) are implemented in the kernel; normal tables can be created and deleted by userspace. Chains exist in two flavours: a base chain is a chain that registers a netfilter hook for receiving packets and is an entry point for a table. A table can contain an arbitrary number of base chains, even multiple for the same hook. There are no default counters for base chains anymore, which allows registering the hooks only when rules are present in the chain without userspace visible impact. Regular chains are just containers for rules and are similar to iptables chains. A rule is a container that contains multiple expressions and statements for matching packets and performing actions.

Expressions

One of my major dislikes with iptables was the huge number of modules for performing very similar tasks (match on TCP ports, match on UDP ports, match on DCCP ports, match on mark value, match on connmark value, ...) with often slightly different feature sets (support for negation, support for ranges, ...) and the fact that every match or target available to the user required a kernel module implementing exactly that functionality. This is all gone :)

There is only one kind of extension, called an expression. It can represent statements (like log, verdicts), unary expressions (get TCP port), binary expressions (compare data) or (optionally runtime parameterized) targets. They communicate with the core or other expressions through a register stack. The register stack contains a number of general purpose registers and a special verdict register, which can be used to change control flow in the classification core. A single data type is used to represent any kind of data, be it ports, meta data, data gathered from stateful modules, or even verdicts, including jumps. Userspace is responsible for chaining expressions appropriately to get the desired semantic.

The modules implemented so far are:

  • A payload gathering module: this module loads payload data at a specified offset and length from the packet into a userspace specified register.
  • A meta data gathering module: this module loads meta data like mark, realm, ... into a userspace specified register.
  • A constant data module: this module loads data provided by userspace in a userspace specified register. Mainly used for returning verdicts.
  • A counter module: this module counts bytes and packets.
  • A module for simple relational expressions: implements equality and relational expressions. When the expression is false, a verdict to abort evaluation of the current rule (NFT_BREAK) is returned.
  • Multiple modules for performing set lookups: these modules take data from a userspace specified register and look it up in either an rbtree, a hash or a bitmap (depending on the range of the data and whether the set should be manipulable). When the value is not found, they return NFT_BREAK by default, which can be used for representing matches. They can alternatively load data associated with the key into a userspace defined register, which allows implementing data and verdict dictionaries.
  • Some additional modules like a NAT module, that are not particularly relevant for describing the new architecture.
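The way these modules cooperate through registers can be sketched as a toy interpreter in Python. This is only an illustration of the concept, not the kernel implementation: the closure-based expression representation, the dict used as register file and the NFT_CONTINUE name are assumptions of this sketch; only NFT_BREAK is taken from the description above.

```python
import ipaddress

NFT_CONTINUE = "continue"  # hypothetical name for "keep evaluating"
NFT_BREAK = "break"        # abort evaluation of the current rule

def payload_load(offset, length, dreg):
    """Load packet bytes at offset (from the network header) into dreg."""
    def op(packet, regs):
        regs[dreg] = packet[offset:offset + length]
        return NFT_CONTINUE
    return op

def compare(sreg, data):
    """Equality expression: break out of the rule on a mismatch."""
    def op(packet, regs):
        return NFT_CONTINUE if regs[sreg] == data else NFT_BREAK
    return op

def set_lookup(sreg, members):
    """Set membership: break out of the rule if the key is not found."""
    def op(packet, regs):
        return NFT_CONTINUE if regs[sreg] in members else NFT_BREAK
    return op

def eval_rule(rule, packet):
    """Run all expressions of a rule; True means every one matched."""
    regs = {}
    for op in rule:
        if op(packet, regs) == NFT_BREAK:
            return False
    return True

def ip(text):
    return ipaddress.IPv4Address(text).packed

# Minimal fake IPv4 header: saddr at offset 12, daddr at offset 16.
packet = bytes(12) + ip("10.0.0.1") + ip("192.168.0.1")
exact = [payload_load(16, 4, 1), compare(1, ip("192.168.0.1"))]
in_set = [payload_load(16, 4, 1),
          set_lookup(1, {ip("192.168.0.1"), ip("192.168.0.2")})]
other = [payload_load(16, 4, 1), compare(1, ip("192.168.0.9"))]
print(eval_rule(exact, packet), eval_rule(in_set, packet), eval_rule(other, packet))
```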

Tying it up

Here are a few examples of what this can be used for. Don't worry, the userspace interface presents this all in a nice fashion and users don't have to care about registers, offsets etc. :)

  • An exact match on the IP destination address:

    Syntax: ip daddr 192.168.0.1
        [ payload load 4 offset network header + 16 => reg 1 ]
        [ compare reg 1 192.168.0.1 ]
       
  • A match on a set of IP destination addresses:

    Syntax: ip daddr { 192.168.0.1, 192.168.0.2}
        [ payload load 4 offset network header + 16 => reg 1 ]
        [ set lookup reg 1 { 192.168.0.1, 192.168.0.2 } ]
       
  • A verdict dictionary for mapping the IP destination address to verdicts:

    Syntax: ip daddr { 192.168.0.1 => jump chain1, 192.168.0.2 => drop, 192.168.0.3 => jump chain2 }
        [ payload load 4 offset network header + 16 => reg 1 ]
        [ set lookup reg 1 load result in verdict register
          { "192.168.0.1" : jump chain1,
            "192.168.0.2" : drop,
            "192.168.0.3" : jump chain2 } ]
       
  • A generic dictionary used for mapping IP destination addresses to netfilter marks:

    Syntax: meta mark map ip daddr { 192.168.0.1 => 0x1, 192.168.0.2 => 0x5, 192.168.0.3 => 0x2 }
        [ payload load 4 offset network header + 16 => reg 1 ]
        [ set lookup reg 1 load result in reg 1
          { "192.168.0.1" : 0x1,
            "192.168.0.2" : 0x5,
            "192.168.0.3" : 0x2 } ]
        [ meta mark reg 1 ]
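The two dictionary examples differ only in where the lookup result ends up. A small Python sketch of that idea, with a plain dict standing in for the kernel set implementation; the register names and the tuple encoding of verdicts are assumptions of this sketch:

```python
import ipaddress

VERDICT_REG = "verdict"  # stands in for the special verdict register

def ip(text):
    return ipaddress.IPv4Address(text).packed

def map_lookup(regs, sreg, dreg, mapping):
    """Look up regs[sreg]; on a hit, store the associated data in regs[dreg].

    Returns False when the key is missing, which corresponds to NFT_BREAK.
    """
    key = regs[sreg]
    if key not in mapping:
        return False
    regs[dreg] = mapping[key]
    return True

verdicts = {ip("192.168.0.1"): ("jump", "chain1"),
            ip("192.168.0.2"): ("drop",),
            ip("192.168.0.3"): ("jump", "chain2")}
marks = {ip("192.168.0.1"): 0x1, ip("192.168.0.2"): 0x5, ip("192.168.0.3"): 0x2}

# Verdict dictionary: the result goes to the verdict register.
regs_a = {1: ip("192.168.0.2")}   # as if loaded by the payload expression
hit_a = map_lookup(regs_a, 1, VERDICT_REG, verdicts)

# Generic dictionary: the result replaces the key in reg 1 and can then
# be consumed by the next expression ("meta mark reg 1").
regs_b = {1: ip("192.168.0.3")}
hit_b = map_lookup(regs_b, 1, 1, marks)

# Unknown address: no entry, so evaluation of the rule stops.
regs_c = {1: ip("10.0.0.1")}
hit_c = map_lookup(regs_c, 1, VERDICT_REG, verdicts)
print(regs_a[VERDICT_REG], regs_b[1], hit_c)
```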
       

More cool stuff - multidimensional exact matches in O(1)

This feature is called concatenations, which allows dynamically concatenating multiple keys and using them for a lookup. This allows multidimensional exact matches in constant time when combined with a hash based set. This example shows how to use concatenations for filtering on (mac, ip saddr) combinations for antispoofing:

Syntax: mac src . ip saddr { 00:1b:21:02:6f:ad . 192.168.0.100, 00:01:36:0d:d0:71 . 192.168.0.1 }

The kernel side implementation is not done yet though :) There are two ways how to represent this, either by specifying offsets with data loads:

  [ payload load 6 offset linklayer header + 6 => reg 1 ]
  [ payload load 4 offset network header + 12  => reg 1 offset 6 ]
  [ set lookup reg 1
    { 00:1b:21:02:6f:ad . 192.168.0.100,
      00:01:36:0d:d0:71 . 192.168.0.1 } ]
 
Alternatively, concatenation may be implemented as a separate operation:
  [ payload load 6 offset linklayer header + 6 => reg 1 ]
  [ payload load 4 offset network header + 12  => reg 2 ]
  [ concat reg 1, reg 2 => reg 1 ]
  [ set lookup reg 1
    { 00:1b:21:02:6f:ad . 192.168.0.100,
      00:01:36:0d:d0:71 . 192.168.0.1 } ]
 
The first way would be preferable, but it mainly depends on how much overhead this adds for the much more common case of not using concatenations in a rule.
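Why this gives a multidimensional exact match in constant time can be seen in a few lines of Python: the concatenated bytes are simply one key in a hash based set. The sketch assumes untagged Ethernet frames carrying IPv4, and all helper names are made up for the example.

```python
import ipaddress

ETH_HLEN = 14  # untagged Ethernet header: dst MAC, src MAC, ethertype

def mac(text):
    return bytes.fromhex(text.replace(":", ""))

def ip(text):
    return ipaddress.IPv4Address(text).packed

# mac src . ip saddr { ... }: each element is one 10 byte key.
allowed = {mac("00:1b:21:02:6f:ad") + ip("192.168.0.100"),
           mac("00:01:36:0d:d0:71") + ip("192.168.0.1")}

def concat_key(frame):
    src_mac = frame[6:12]                        # linklayer header + 6
    saddr = frame[ETH_HLEN + 12:ETH_HLEN + 16]   # network header + 12
    return src_mac + saddr

def make_frame(src_mac, saddr):
    """Build a fake frame with only the fields this example looks at."""
    eth = bytes(6) + mac(src_mac) + b"\x08\x00"
    iphdr = bytes(12) + ip(saddr) + bytes(4)
    return eth + iphdr

good = make_frame("00:1b:21:02:6f:ad", "192.168.0.100")
spoofed = make_frame("00:1b:21:02:6f:ad", "192.168.0.1")
print(concat_key(good) in allowed, concat_key(spoofed) in allowed)
```

Both the MAC and the address of the spoofed frame appear somewhere in the set, but not as a combination, so the single hash lookup fails.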

Besides the more powerful expressions listed above, there are a few simpler ones as well, offering basic binary and logical operations etc. So far not many target modules exist, but some things that have been going through my head:

  • Parameterized hashlimit: hashlimit based on user-defined keys, something like
    hashlimit ip saddr / 24 . ip daddr / 24
    for instantiating a limit state for each combination of /24 networks talking to each other.
  • Parameterized NAT target: could be used to represent masquerading:
    snat ifaddr meta oif
    or NAT pools:
    snat map ip daddr & 0xf { 0 => addr1, 1 => addr2, 2 => addr3, ... }
    or randomized NAT pools:
    snat map random & 0xf { 0 => addr1, 1 => addr2, 2 => addr3, ... }

Userspace

Netlink communication

Low level netlink communication is entirely implemented in libnl. It allows to create tables, chains, rules, expressions and data, but doesn't know anything about higher level constructs.

The userspace frontend (nftables)

This is what the users actually interact with and where the intelligence lies. Since it's still very much in flux, only a few bullet points.

From textual representation to the kernel

The parser is bison based with a real grammar. I might replace the parser later on since bison has its own set of problems, but using it during development is very useful for quickly changing the grammar and verifying that it's unambiguous. It performs basic semantic validation and constructs a syntax tree, which is then post-processed. So far post-processing consists of:

  • More semantical validation for invalid constructs that can't be determined during parsing.
  • Byteorder conversions: constant values have their byteorder converted to match the byteorder of dynamically gathered data.
  • Type checks and conversions: types are checked for compatibility, when necessary (and possible) constant values are promoted or demoted to match the type of dynamically gathered data.
  • Dependency generation: some expressions have dependencies that are generated automatically when possible to save the user from unnecessary complications. This mainly affects higher level protocol matches, like "tcp dport 22", which requires to make sure the transport layer protocol is actually TCP, so the dependency is "ip protocol tcp".
There are still a few important things missing, like constant folding and propagating binary, logical and arithmetic operations from dynamically gathered data to the constant side of a relational expression, when possible.
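The dependency generation step can be sketched in Python as follows. The (protocol, field, value) tuples and the function name are assumptions for this illustration; 6 and 17 are the IP protocol numbers of TCP and UDP.

```python
IPPROTO = {"tcp": 6, "udp": 17}

def add_dependencies(matches):
    """Insert "ip protocol X" before the first match on transport protocol X.

    matches is a list of hypothetical (protocol, field, value) tuples.
    """
    result = []
    for proto, field, value in matches:
        if proto in IPPROTO:
            dependency = ("ip", "protocol", IPPROTO[proto])
            if dependency not in result:
                # Generated automatically, the user never wrote this match.
                result.append(dependency)
        result.append((proto, field, value))
    return result

# tcp dport 22
print(add_dependencies([("tcp", "dport", 22)]))
```

If the user already specified "ip protocol tcp" earlier in the rule, the sketch doesn't add it a second time.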

The post processed tree is then linearized and fed into libnl. During linearization register allocation is performed to propagate values as required by an expression.
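A minimal Python sketch of linearization with a trivial register allocator. The nested tuple syntax tree and the "bitwise" instruction are invented for this example; the other two instruction formats follow the notation used in the examples in this post.

```python
def linearize(expr):
    """Turn a small syntax tree into a list of textual instructions."""
    instrs = []
    next_reg = 1

    def alloc():
        # Trivial allocator: hand out registers in increasing order.
        nonlocal next_reg
        reg = next_reg
        next_reg += 1
        return reg

    def emit(node):
        kind = node[0]
        if kind == "payload":
            _, base, offset, length = node
            reg = alloc()
            instrs.append(f"payload load {length} offset {base} header + {offset} => reg {reg}")
            return reg
        if kind == "and":
            _, operand, mask = node
            reg = emit(operand)  # result of the operand stays in its register
            instrs.append(f"bitwise reg {reg} & {mask:#x} => reg {reg}")
            return reg
        raise ValueError(f"unknown node type: {kind}")

    op, lhs, rhs = expr
    if op != "eq":
        raise ValueError("only equality is handled in this sketch")
    reg = emit(lhs)
    instrs.append(f"compare reg {reg} {rhs}")
    return instrs

# ip daddr 192.168.0.1
tree = ("eq", ("payload", "network", 16, 4), "192.168.0.1")
for line in linearize(tree):
    print(line)
```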

Reconstructing textual representation

A very important feature, one that is missing from all other filters that are built similar in the kernel (like BPF, TC u32 filter, ...), is reconstruction of high level constructs from the representation within the kernel. TC u32 for example allows you to specify "ip daddr X", but when dumping the filter rules it will just display an offset and length.

So when dumping a ruleset, nftables will reverse the steps performed during post-processing and linearization, which means reconstructing the syntax tree based on how expressions are chained, reconstructing high level meaning of payload and other expressions, eliminating redundant dependency expressions, etc. This works pretty well and the dump output currently looks exactly like the user specified input. This will not be 100% possible in the future though since f.i. constant folding is not reversible.
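The reverse direction can be sketched in Python as well: map (base, offset, length) back to field names and drop protocol dependencies that are implied by a later transport header match. The instruction tuples and table contents are assumptions of this sketch and cover just a handful of fields.

```python
NETWORK_FIELDS = {(9, 1): "protocol", (12, 4): "saddr", (16, 4): "daddr"}
TRANSPORT_FIELDS = {(0, 2): "sport", (2, 2): "dport"}
PROTO_NAMES = {6: "tcp", 17: "udp"}

def delinearize(instrs):
    """Rebuild the textual form from alternating load/cmp instructions."""
    matches = []
    proto = None
    for (_, base, offset, length), (_, value) in zip(instrs[0::2], instrs[1::2]):
        if base == "network":
            field = NETWORK_FIELDS[(offset, length)]
            if field == "protocol":
                # Remember the protocol to interpret transport header loads.
                proto = PROTO_NAMES[value]
                matches.append(("ip", "protocol", proto))
            else:
                matches.append(("ip", field, value))
        else:
            matches.append((proto, TRANSPORT_FIELDS[(offset, length)], value))
    # "ip protocol tcp" is redundant once a "tcp ..." match follows.
    used = {m[0] for m in matches}
    kept = [m for m in matches
            if not (m[:2] == ("ip", "protocol") and m[2] in used)]
    return " ".join(f"{p} {f} {v}" for p, f, v in kept)

dump = [("load", "network", 9, 1), ("cmp", 6),
        ("load", "transport", 2, 2), ("cmp", 22)]
print(delinearize(dump))
```

Note that this simple rule can't tell a generated dependency from one the user typed explicitly in front of a transport match, which is one of the reasons dumps can't always be identical to the input.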

Ideas ...

A few ideas what else might be done in userspace. Most of it isn't planned for an initial release though.

  • Since we have an abstract representation of the entire ruleset, we can easily determine different match dimensions dynamically. This allows to perform optimization on a ruleset (or maybe chain) level by reordering rules, combining similar matches into sets, etc. It also allows to detect ineffective rules due to shadowing and eliminate them or warn the user.
  • Improved register allocation: so far register allocation is performed on a per rule level. This is necessary if individual rules may be added or deleted since we can't rely on register contents as set up by previous rules. With a small change to mark chains immutable however, redundant loads can be eliminated. Consider for example these two rules:
        udp dport 53 ...
        tcp dport 22 ...
       
    Both expressions refer to the same payload (2b @ transport header + 2), but their preconditions differ (ip protocol 17 vs. ip protocol 6). Their representation looks like this:
        rule 1:
        [ payload load 1 offset network header + 9 => reg 1 ]
        [ compare reg 1 17 ]
        [ payload load 2 offset transport header + 2 => reg 1 ]
        [ compare reg 1 53 ]
    
        rule 2:
        [ payload load 1 offset network header + 9 => reg 1 ]
        [ compare reg 1 6 ]
        [ payload load 2 offset transport header + 2 => reg 1 ]
        [ compare reg 1 22 ]
       
    Both the protocol load and port load are redundant, so this can be written as:
        [ payload load 1 offset network header + 9 => reg 1 ]
        [ payload load 2 offset transport header + 2 => reg 2 ]
    
        rule 1:
        [ compare reg 1 17 ]
        [ compare reg 2 53 ]
    
        rule 2:
        [ compare reg 1 6 ]
        [ compare reg 2 22 ]
       
    or even more compact by using concatenations:
        [ payload load 1 offset network header + 9 => reg 1 ]
        [ payload load 2 offset transport header + 2 => reg 1 offset 1 ]
    
        rule 1:
        [ compare reg 1 17 . 53 ]
    
        rule 2:
        [ compare reg 1 6 . 22 ]
       
    which brings the number of expressions down to 1/2.
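The first of the two rewrites (shared loads, separate compares) can be sketched in Python like this, assuming the chain is immutable and each rule is given as a list of hypothetical (load, constant) pairs:

```python
def hoist_loads(rules):
    """Share payload loads between all rules of an immutable chain.

    rules: list of rules, each a list of (load, constant) pairs, where
    load is a (base, offset, length) tuple.
    Returns the shared loads and, per rule, (register, constant) compares.
    """
    reg_of = {}
    loads = []
    compares = []
    for rule in rules:
        rule_compares = []
        for load, constant in rule:
            if load not in reg_of:
                # First use of this payload: allocate a register for it.
                reg_of[load] = len(reg_of) + 1
                loads.append((load, reg_of[load]))
            rule_compares.append((reg_of[load], constant))
        compares.append(rule_compares)
    return loads, compares

PROTOCOL = ("network", 9, 1)
DPORT = ("transport", 2, 2)
rules = [[(PROTOCOL, 17), (DPORT, 53)],   # udp dport 53
         [(PROTOCOL, 6), (DPORT, 22)]]    # tcp dport 22
loads, compares = hoist_loads(rules)
print(loads)
print(compares)
```

For the two rules above this turns eight expressions into six; folding the compares into concatenations is what gets it down to four.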

Sun, 27 Jul 2008

Lawyers night out


I was wondering what might have happened to the guy when seeing some dude in a suit without trousers walking through the city, trying to cover his hip area with a briefcase.

This somewhat explained it.

Now I'll wait for the inevitable 25 notices for violation of § 22 KunstUrhG :)

Netfilter status


I released iptables 1.4.2-rc1 last week, which will become the release for Linux 2.6.27. The biggest change is a set of scalability improvements for initial ruleset parsing from Jesper Dangaard Brouer that should greatly improve performance with large rulesets. Besides that there are some cleanups, manpage updates, case insensitive string match support (needs current -git kernel) and --goto support for ip6tables. Please test and report any issues you might notice.

On the kernel side, the 2.6.27 merge window was rather unexciting from a netfilter perspective and is now closed. Besides the usual cleanups, we have

  • a new security table for use with SELinux
  • improved support for network namespaces
  • some fixes for conntrack accounting to properly account terminating packets
  • a patch to make conntrack accounting a ct_extend extension, which means it's now possible to enable/disable it at runtime. It also switches back to using 64 bit counters, which are needed by the connbytes match.
  • case insensitive string match support
  • IPv6 support for ebtables
  • SCTP support for ctnetlink
Still missing are the completion of network namespace support and Jan Engelhardt's patches to make ebtables and arptables use the xtables infrastructure. Both patchsets were posted at the last minute and unfortunately missed the merge window.

My plans for the near future are to continue VoIP testing for the remainder of July since I have to return the test equipment by the end of the month. August is entirely reserved for finishing the missing bits of my iptables successor so I'll have something presentable for the Netfilter Workshop 2008 at the end of September. I'll probably post some details about the design during that time, so please be patient.

Wed, 09 Jul 2008

VLAN update addendum


I forgot to mention the main unsolved issue with passing the VLAN TCI to userspace over packet sockets: the mmap'ed packet socket ring doesn't include any auxiliary data and is not extendable without breaking binary compatibility. So we can't add the VLAN TCI. It also suffers from a different problem: it's not 64 bit clean due to use of an unsigned long. So it's basically a lost cause, which is why I've introduced a new version of the mmap'ed packet socket protocol that is both extensible and 64 bit clean.

The only part of the frame structure that depends on a known structure size is actually the struct sockaddr_ll that follows the struct tpacket_hdr. The new protocol version uses a new struct tpacket2_hdr, which can be extended by adding new members at the end. Userspace has to issue a getsockopt(PACKET_HDRLEN) call to get the length of struct tpacket2_hdr the kernel uses and use that length to calculate the beginning of the struct sockaddr_ll, in case it is actually interested in that struct. To switch to the new protocol version, a setsockopt(PACKET_VERSION) call specifying the protocol version has to be issued before configuring the ring parameters.
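
The offset calculation described above can be sketched in Python as plain arithmetic. This is only an illustration: it assumes the kernel places the struct sockaddr_ll at the header length rounded up to a 16 byte boundary (the TPACKET_ALIGNMENT value in the mainline header), and the header lengths used below are example inputs, not values read from a real socket.

```python
TPACKET_ALIGNMENT = 16  # assumption: alignment constant from linux/if_packet.h

def tpacket_align(length):
    """Round a length up to the next multiple of TPACKET_ALIGNMENT."""
    return (length + TPACKET_ALIGNMENT - 1) & ~(TPACKET_ALIGNMENT - 1)

def sockaddr_ll_offset(frame_start, hdrlen):
    """Start of struct sockaddr_ll inside a ring frame.

    hdrlen is whatever getsockopt(PACKET_HDRLEN) reported, so the result
    stays correct even if the kernel's header grows in a later version.
    """
    return frame_start + tpacket_align(hdrlen)

# Example inputs only: two possible header lengths for a frame at offset 4096.
offset_small = sockaddr_ll_offset(4096, 28)
offset_large = sockaddr_ll_offset(4096, 40)
```

The point of the design is that userspace never hardcodes the structure size, so new members appended by the kernel only shift the computed offset.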

With this change, all major issues are solved, so I've reposted the entire set, this time for merging.

Tue, 08 Jul 2008

VLAN update


Dave just merged the second part of my VLAN update for 2.6.27. It contained nothing particularly interesting, mainly minor cleanups, uninlining, ethtool support for querying offload settings and a minor fix for incorrect header pointer adjustments with software tagging. The more interesting part is contained in the final update, which I just sent out as RFC.

There are a number of inconsistencies when using VLAN with hardware acceleration with respect to visibility on packet sockets, which these patches attempt to cure. Linux supports three modes of VLAN hardware acceleration:

  • VLAN tagging: the VLAN code passes the VLAN TCI to the driver in the skb's cb, the hardware inserts the VLAN tag using this TCI when sending the packet out.
  • VLAN stripping: the hardware strips the VLAN tag on RX and passes the TCI to the driver in the RX descriptors. The driver uses a special variant of netif_{rx,receive_skb} that takes the TCI, extracts the VLAN ID, looks up the virtual VLAN device and calls netif_{rx,receive_skb} with the device set to the VLAN device.
  • VLAN filtering: the hardware is programmed to filter out all but the locally configured VLANs.
The inconsistencies resulting from this design are:
  • With VLAN tagging, outgoing packets travel through the stack untagged. This means they appear as regular ethernet packets to tcpdump. Additionally the use of the skb's cb is wrong since other layers may use it for their private data. The netem qdisc actually does this, corrupting the VLAN TCI.
  • With VLAN stripping, incoming packets bypass packet sockets on the real device completely. Packets for locally configured VLANs are visible on the VLAN device only, packets for unknown VLANs not at all.
  • With VLAN filtering, some drivers disable VLAN filters in promiscuous mode, some don't. With those that don't, packets for unknown VLANs are not visible. Most drivers don't enable VLAN filtering until the first VLAN is configured, which is even more inconsistent.

Since running tcpdump is an exception, fixing this behaviour should not affect or disable the optimizations provided by hardware acceleration. The approach taken by my patches is to promote the VLAN TCI from the cb to a full skb member to avoid the netem corruption and keep it intact within the packet socket code, which is also using the cb itself. A new member is added to the packet socket auxiliary data to store the VLAN TCI, allowing userspace to sense that a packet is actually a VLAN packet and (re)construct the VLAN tag. On RX, the hardware acceleration netif_{rx,receive_skb} wrappers store the VLAN TCI in the skb and manually invoke the ETH_P_ALL packet handlers before receiving the packet on the VLAN device. Combined with a patch for libpcap to perform the VLAN tag (re)construction, this fixes the first two issues. One minor remaining issue is that socket filters for VLAN packets don't work as intended since they expect a VLAN header. Since userspace needs to know about VLAN acceleration anyway, it seems reasonable to put the burden on userspace by providing a new filter instruction for getting the VLAN TCI from the skb's metadata and expecting it to construct the filters accordingly.
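
To make the tag (re)construction step concrete, here is a minimal Python sketch of what a userspace consumer could do with an untagged frame plus the TCI delivered out of band. It is not the libpcap patch; the function name and the sample frame are made up for illustration. It relies only on the standard 802.1Q layout: the 4 byte tag (TPID 0x8100 followed by the TCI) sits right after the two MAC addresses.

```python
import struct

ETH_P_8021Q = 0x8100  # 802.1Q tag protocol identifier

def reinsert_vlan_tag(frame, tci):
    """Rebuild a tagged frame from an untagged frame and its out-of-band TCI."""
    macs, rest = frame[:12], frame[12:]   # dst + src MAC, then EtherType + payload
    return macs + struct.pack("!HH", ETH_P_8021Q, tci) + rest

# Hypothetical untagged IPv4 frame as it would appear with hardware stripping.
untagged = bytes(6) + bytes([2, 0, 0, 0, 0, 1]) + b"\x08\x00" + b"payload"
tagged = reinsert_vlan_tag(untagged, 1000)
```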

To fix the third issue, the drivers need to be modified to provide the desired semantics. Their initial state should be to filter out all VLANs. Currently most of them only enable filters when adding the first VLAN, which is clearly suboptimal since before that point all VLAN packets are uninteresting, except when in promiscuous mode. When adding new VLANs, the filters should be adjusted to allow their respective IDs; this is done correctly by all drivers. Finally, in promiscuous mode, all filters should be disabled. I'm halfway done modifying the drivers to provide this behaviour (all Intel drivers); for most of the remaining ones I'm not sure about their current behaviour since it's unclear whether the promiscuous mode offered by the hardware automatically disables VLAN filtering or not.
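
The desired filter semantics can be summarised as a small state model. The following Python sketch is a toy model only, with a hypothetical class name; it says nothing about how any particular driver programs its hardware.

```python
class VlanFilter:
    """Toy model of the desired driver semantics for VLAN filtering."""

    def __init__(self):
        self.allowed = set()    # initial state: every VLAN is filtered out
        self.promisc = False

    def add_vlan(self, vid):
        self.allowed.add(vid)   # adding a VLAN opens the filter for its ID

    def set_promisc(self, on):
        self.promisc = on       # promiscuous mode disables all filters

    def accepts(self, vid):
        return self.promisc or vid in self.allowed

nic = VlanFilter()
before_any_vlan = nic.accepts(1000)     # filtered: no VLAN configured yet
nic.add_vlan(1000)
known = nic.accepts(1000)
unknown = nic.accepts(2000)
nic.set_promisc(True)
unknown_promisc = nic.accepts(2000)     # visible again for tcpdump
```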

Additionally it turned out during testing that about half the drivers performing VLAN stripping didn't provide the full TCI but only the VLAN ID to the HW acceleration RX functions. The upper bits of the TCI contain the VLAN priority, which so far is used for ingress priority mappings; with my patches it also affects tcpdump visibility. The fixes for this are already in net-next-2.6.git.
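
For readers who don't have the TCI layout in their head: per 802.1Q it is a 16 bit field with the priority in the top 3 bits, one CFI bit, and the VLAN ID in the low 12 bits. The Python sketch below (helper name is hypothetical) shows how the priority is lost when a driver passes only the VLAN ID.

```python
def split_tci(tci):
    """Split a 16 bit 802.1Q TCI into (priority, cfi, vlan_id)."""
    priority = (tci >> 13) & 0x7
    cfi = (tci >> 12) & 0x1
    vlan_id = tci & 0x0FFF
    return priority, cfi, vlan_id

full_tci = (5 << 13) | 1000           # priority 5, VLAN ID 1000
full = split_tci(full_tci)
truncated = split_tci(full_tci & 0x0FFF)   # driver passed only the VLAN ID
```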

What's next
Since with these changes the skb is able to carry the VLAN TCI across layers, we are now able to provide VLAN acceleration to virtual network devices by adding a software fallback, similar to how TSO works. This would allow using hardware tagging from within network namespaces or other virtualized environments. Additionally there are two more inconsistencies that people have been complaining about:
  • New VLAN devices can only be created when the lower device is UP. They happily continue existing when setting the lower device DOWN though. The reason for this is apparently that some drivers can't cope with having the RX filter programming callback invoked while in DOWN state. It seems the fix with the least risk for this would be to defer filter programming until the lower device is UP.
  • When setting the lower device DOWN, its VLAN devices are put in DOWN state as well, deleting all routes pointing to them. When setting the lower device UP again, the VLAN devices stay down. Even if they were automatically set UP again, the kernel can only reconstruct the automatically created routes. This one probably requires a flag day on which the behaviour will be changed, alternatively a new flag to specify the desired behaviour.
One more nice thing would be to avoid the netif_rx() scheduling overhead when receiving packets on a VLAN device, which reportedly costs around 1-5% performance. The reason for not calling netif_receive_skb() directly is to keep stack usage low. An idea that has been tossed around that would also benefit other virtual devices is to modify the packet handlers to be able to return a new packet to netif_receive_skb().
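
The last idea can be pictured as turning nested receive calls into a loop. The Python sketch below is a toy model under that reading, not kernel code: handlers are plain functions keyed by a device name, and a handler returns either a new packet to process or None when it has consumed the packet. All names are hypothetical.

```python
def receive(packet, handlers):
    """Deliver a packet; a handler may hand back a new packet to process next.

    Looping here keeps stack usage flat, where a handler calling receive()
    itself would nest one level per virtual device.
    """
    passes = 0
    while packet is not None:
        passes += 1
        packet = handlers[packet["dev"]](packet)
    return passes

delivered = []

def vlan_handler(packet):
    # Retarget the packet at the VLAN device and return it to the loop.
    return {"dev": "eth0.1000", "data": packet["data"]}

def local_handler(packet):
    delivered.append(packet)   # final consumer, nothing more to process
    return None

handlers = {"eth0": vlan_handler, "eth0.1000": local_handler}
passes = receive({"dev": "eth0", "data": b"x"}, handlers)
```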

Thu, 03 Jul 2008

GARP/GVRP


Just finished the GARP/GVRP patches and sent them out. I love the feeling when you can finally delete a tree that has been lying around for ages :) Unfortunately I have way too many of these.

A few words on GARP/GVRP for those not familiar with it. GARP is the Generic Attribute Registration Protocol and is specified in IEEE 802.1D. It is used to register and propagate attributes through the active spanning tree topology. Examples of these attributes include multicast link layer addresses and VLAN IDs. A bridge can use the attributes to configure filtering, a host can perform source pruning, meaning it can avoid sending, for instance, multicast frames no one is interested in. Source pruning requires the host to be a full participant however; my implementation covers only the applicant-only participant model, meaning it supports only the client side. The full participant belongs in userspace since it may have to create network devices etc.

GVRP is the GARP VLAN Registration Protocol, specified in IEEE 802.1Q. As the name implies, it is used to register VLANs. This is supported by many switches, even the really cheap ones. The current implementation doesn't enable GVRP by default since we're missing a way to disable it when a VLAN device gets added to a bridge. I'll probably fix that shortly, for now it has to be enabled manually using iproute (ip link set eth0.1000 type vlan gvrp on).

Dynamically sized qdisc class hashes


Just sent out the second version of my dynamically sized qdisc class hash patches. They are intended to solve scalability problems when using large numbers of classes with CBQ/HTB/HFSC (and soon DRR). Currently, all of these use a fixed hash size of 16. There are mainly two cases where this matters:

  • Classifiers that point to classes that don't exist at the time the classifier is created are "unbound", meaning they don't return a pointer to the class but the classid. This classid is then resolved by a hash lookup. With only 4096 classes this means walking 128 classes on average - for each packet.
  • The flow classifier never returns class pointers since it calculates the classid at runtime. This means the class always has to be looked up in the hash.
The patches introduce some common hash helpers and a struct Qdisc_class_common, which currently only contains the hlist_node and the classid. I'm hoping we can stuff some more things in there and do some further consolidation. Child qdisc, classifier lists etc. come to mind.
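
To put numbers on the problem: 4096 classes in 16 buckets means 256 classes per chain, so a successful lookup walks 128 of them on average. The Python sketch below reproduces that arithmetic and shows a hash that doubles as it fills. It is an illustration only: the modulo hash and the growth threshold (one class per bucket) are assumptions chosen for the example, not the values used in the patches.

```python
def average_walk(num_classes, hash_size):
    """Average chain entries visited per successful lookup, evenly spread."""
    return num_classes / hash_size / 2

class ClassHash:
    """Toy class hash that doubles its bucket count as classes are added."""

    def __init__(self, size=16):
        self.buckets = [[] for _ in range(size)]
        self.count = 0

    @staticmethod
    def _index(classid, size):
        return classid % size          # placeholder hash function

    def insert(self, classid, cls):
        if self.count + 1 > len(self.buckets):   # assumed growth threshold
            self._grow()
        self.buckets[self._index(classid, len(self.buckets))].append((classid, cls))
        self.count += 1

    def _grow(self):
        new_size = len(self.buckets) * 2
        new = [[] for _ in range(new_size)]
        for bucket in self.buckets:
            for classid, cls in bucket:
                new[self._index(classid, new_size)].append((classid, cls))
        self.buckets = new

    def find(self, classid):
        for cid, cls in self.buckets[self._index(classid, len(self.buckets))]:
            if cid == classid:
                return cls
        return None

table = ClassHash()
for i in range(4096):
    table.insert(i, "class-%d" % i)
longest_chain = max(len(b) for b in table.buckets)
```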

Now I'm off to do some final cleanups of the GARP/GVRP patches so I can hopefully send them out today as well.

Mon, 30 Jun 2008

Bored kids


These two kids were looking for something fun to do on Saturday evening. They first tried to climb the wall of the Freiburg theater (visible in the back of the picture), but the left one only made it half way up. Very disappointed, he stated he felt emasculated and both of them left. He tried to regain his manliness a few minutes later when they returned with a stolen table from the bar next door and used it to slide down the stairs.

Getting ready to launch ...

They made it down the stairs. The left one is really having fun, the right one is also beginning to feel like a man again.

End of the ride.

Also don't miss out on the movie of the waitress slapping them around and their flight.

Thanks to Elena for the pictures and redacting the face of an innocent bystander :)

Mon, 23 Jun 2008

Companies to avoid


Warning: long rant. Note to any company mentioned below: this is *my* opinion and my opinion only.

Just added Lenovo to the list of companies I won't buy from anymore. I got a Z61p about 9 months ago and had nothing but trouble since. Air circulation seems to be broken by design: from the beginning the graphics card overheated to over 100° Celsius, then started making squeaky sounds and showing flickering moving lines across the entire screen, before shutting down completely (not the notebook, only the card). There are plenty of reports on the internet of people having similar problems. The processor also often heats up until it reaches the shutdown temperature when doing CPU intensive work. A technician tried to fix it by replacing some parts, but without any success. Additionally on my trip to LinuxTag, the fan broke and it would refuse to boot. This cured itself a few days later, but now it sounds like a rusty lawn mower. Since last weekend, it doesn't detect the battery anymore, even though a voltmeter shows it's working perfectly fine. Sad, back when IBM was still producing ThinkPads I never had trouble.

The other two companies on this list of pride are (there are more, but most of them are irrelevant since they are either almost broke or small enough to avoid easily):

- Deutsche Telekom and all their subsidiaries. This is the most ridiculous company I've ever seen, the highlights of their doings include:

  • When installing a DSL connection in my apartment, the technician (also known as Telekomiker, roughly translated as Telekom comedian, but funnier in German) wasn't able to locate the wires in the switch cabinet. So since there was no way to test what he was doing, he apparently decided not to do anything at all and just left the wires in the socket unconnected, after which he told me "all done" and left. I managed to get him on the phone personally by calling a hotline, telling them my story and being forwarded like eight times. He showed up again two hours later and did it properly :)
  • Delaying and misplacing contract changes and cancellations, not sending confirmations and so on. The most recent incident was billing us for months for a canceled telephone line whose numbers had already been ported to a different company. On every call to their support, they promised to look into it, only to send a new invoice a few days later. Most ridiculous, when calling their business support late at night, we reached a call center, with a friendly telephonist who told me she was unable to even take a message, the only thing she was there for was telling us we were calling outside business hours. At least the amusement made up for it a bit :)
  • Sending incorrect invoices over months for some equipment returned to them within the deadline for returning it. As usual, their phonebots proved to be completely incompetent and weren't able to correct the mistake. It went the usual way to their law firm, Seiler & Kollegen, who did correct the mistake after we sent them a copy of the receipt. So what Telekom did after that was "correct" their invoice, remove the incorrect position, but leave all the reminder charges (for an *incorrect* position to begin with) on it. They continued sending invoices with increasing reminder charges (1 euro per invoice) for over two years. Not sure if they're still sending them, I stopped caring.
I've decided never to do business again, not only with Telekom and any of their subsidiaries, but also with any company reselling their products or being closely affiliated with them. You just live better this way.

- HP, for not fulfilling their service obligations for fixing my notebook. They first sent some clown from Deutsche Telekom to fix it, who broke it even worse. His second attempt was also unsuccessful, after which HP simply closed the request. Every time I reopened it, it was closed again without further comment. They even had the impudence to ask for my satisfaction with their service - which was pretty obvious from looking at the request. Also sad, because I liked the notebook and there are not many alternatives if you want a big display, but such behaviour is unacceptable.

Sat, 21 Jun 2008

Too busy to blog


Since I've been slacking with updating this blog lately, and probably will continue to do so for the next week, here's another drawing from Elena from a couple of weeks ago.

Tue, 17 Jun 2008

iptables 1.4.1.1 released


Just released iptables 1.4.1.1, a pure bugfix release for regressions reported against 1.4.1. Besides this, I'm mainly in bugfix mode for 2.6.26 currently, which is keeping me pretty busy.

Tue, 10 Jun 2008

iptables 1.4.1 release


Finally released iptables 1.4.1 this morning. I had the impression the -rc phase worked pretty well this time and hoped we had shaken out all the bugs. Unfortunately this hope wasn't fulfilled, the first regression report came in only 5 hours later. It's nothing terribly important, just a cosmetic problem when printing IPv6 masks, but I guess I'll release a 1.4.1.1 bugfix release in a few days.

Thu, 05 Jun 2008

Release delays


The iptables 1.4.1 release got delayed a bit by my notebook breaking 5 minutes after getting on the train to LinuxTag, so I couldn't do any real work the entire last week. I hoped we could test the header fixes last week and release on Monday, but they really need some wider testing, so I'll release another -rc today and hopefully the final release in about a week.

On the kernel side, I'm working on getting the things I would like to merge in 2.6.27 into shape. The netfilter things in my queue so far are mostly minor cleanups and feature additions, with the exception of ebtables IPv6 support from Kuo-lang Tseng.

The non-netfilter things are:

  • A GARP implementation with a GVRP application on top. Maybe I'll also add GMRP support.
  • Some VLAN patches to fix inconsistencies related to hardware tagging/stripping/filtering and packet sockets.
  • A DRR packet scheduler.
  • Some patches for dynamic packet scheduler class hash sizing for better scalability that I've been carrying for at least a year.

Wed, 28 May 2008

iptables release status


I managed to push out an iptables release candidate last week and another one this week. There are some issues with endian-annotated types in the netfilter headers when using ancient linux/types.h versions that need to be fixed before a final release, but we'll hopefully have a final 1.4.1 release by next Monday.

Leaving for LinuxTag


I'll be leaving for my train to Berlin now. Amazingly it took me more time to decide what to put on my notebook than what to pack in my suitcase :) Hope to see you there ...

Fri, 23 May 2008

Flying outside


did not work too well ...

Geek toys


Received a bunch of things to play with this morning.

A customer wants me to develop a multipath tunneling protocol. The basic idea is to encapsulate packets, distribute them over N paths, decapsulate them on the other side and restore ordering using sequence numbers. This will allow using both the combined upstream and downstream bandwidth for single connections. I wrote a prototype some time ago, but it's very basic at this point and missing all the fancy features, like microflow separation, delay aware distribution, dead path detection, etc.

I used to have two internet connections for a couple of months, but canceled the second one in March. Since there's nothing like real-life testing, I ordered a second cable connection again, which was installed this morning. So now I have two 32/2.5mbit connections, but only one of them is used currently. I probably won't be able to resist the urge to work on this for long :)
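
The basic idea (sequence numbers on the sending side, a reorder buffer on the receiving side) can be sketched in a few lines of Python. This is not the prototype mentioned above, just a toy model: round-robin distribution, the class name `Reorderer` and the simulated arrival order are all assumptions, and it ignores loss, timeouts and everything else a real protocol needs.

```python
import heapq

def distribute(packets, num_paths):
    """Tag packets with sequence numbers and spread them round-robin over paths."""
    paths = [[] for _ in range(num_paths)]
    for seq, payload in enumerate(packets):
        paths[seq % num_paths].append((seq, payload))
    return paths

class Reorderer:
    """Receive side: release payloads strictly in sequence number order."""

    def __init__(self):
        self.next_seq = 0
        self.pending = []   # min-heap of (seq, payload) that arrived early

    def receive(self, seq, payload):
        heapq.heappush(self.pending, (seq, payload))
        released = []
        while self.pending and self.pending[0][0] == self.next_seq:
            released.append(heapq.heappop(self.pending)[1])
            self.next_seq += 1
        return released

paths = distribute([b"a", b"b", b"c", b"d"], 2)
arrival = paths[1] + paths[0]      # pretend the second path is faster
rx = Reorderer()
output = []
for seq, payload in arrival:
    output.extend(rx.receive(seq, payload))
```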

Additionally I received some VoIP testing equipment. Innovaphone kindly provided me with a H.323/SIP test setup consisting of 2 * 3 different telephone models, two PBXs and some test scenarios. We already support a lot of scenarios, my goal is to extend this as far as possible within reason.

The tricky parts are things like call transfers crossing two NATs:

  Phone1-\                                              /------Phone3
          -[Registrar1]-[NAT1]-{  }-[NAT2]-[Registrar2]-
  Phone2-/                                              \------Phone4
 
When Phone1 calls Phone3 and the call is transferred to Phone2, we currently fail completely (for so far unknown reasons). The ideal outcome would be that NAT1 detects that the transferred call originated in the local network of Phone1 and the RTP streams are set up between Phone1 and Phone2 directly. This not only reduces latency, it avoids having internal calls go over the Internet. For normal calls between Phone1 and Phone2 using an external registrar this already works, provided that the registrar doesn't decide to proxy the calls.

Speaking of geek toys, ThinkGeek has an incredibly fun toy..

The two helicopters are controlled using IR remote controls. They can only go up and down by controlling main rotor speed and spin around the rotor-axis using the rear rotor, but have some small constant forward movement, which allows flying them quite precisely. The most important feature however is that they can shoot at each other using IR. On the first hit it spins a bit, on the second one it loses power for a short period of time, on the third hit it completely loses power and goes down. This is accompanied by shooting sounds. Unfortunately they break pretty fast, I wrecked four of them within only a few days - well actually three of them, the fourth one was last seen flying over my neighbour's garden :)

Being hooked on the fun, I got a different model, which is controllable on all axes and supposed to be more robust.

Unfortunately it also has a lot more power, so I managed to wreck another one within hours: one gear broke, the tail bent to a 45° angle and then broke, the flight bar also took some damage. To be fair, it really is more robust (and luckily you can also order replacement parts), I'm just not used to the two two-dimensional controls. Instead of powering down when getting too close to the walls, I tried directing it away by pushing all sticks in the desired direction, causing it to increase power and crash into the wall.

I have some replacement parts and a second helicopter, but I'll try to resist flying inside again until I can get some training in a less dangerous environment. Regardless of these problems, they are really great fun :)

Thu, 22 May 2008

LinuxTag


I'll be visiting LinuxTag in Berlin next week, probably the entire day of Thursday and Friday until sometime in the afternoon. Until a couple of years ago, I used to visit LinuxTag annually, but then the quality of the presentations declined, with a lot of the topics being along the lines of "We're company XYZ and we're using Linux, yay". This year the talks look more interesting again, and I'm looking forward to meeting with Harald and DaveM.

Anyone interested in meeting and discussing some netfilter or networking related topics, drop me an email, there should be plenty of time.

Fri, 16 May 2008

Netfilter move to git almost completed


I've completed the move of (most of) the netfilter repositories to git today. I still need to change the email notification script to make the commit emails more readable. They don't look very nice by default and I made it even worse. For today my limit of the amount of shell scripts I can look at is reached though.

These were the last SVN repositories I was using. I'm tempted to leave a long rant about SVN, but it's probably better to simply forget about it as quickly as possible :)

Next I'll see if I can manage to roll a release candidate for iptables. We're currently releasing too infrequently. Since we're usually merging at least one new extension or revision per kernel release, there also should be one iptables release per kernel version, so users can actually use new things. The ideal time for this would be shortly before kernel releases, since that allows us to merge userspace extensions for things targeted at the next kernel release early enough so they can be used for testing. So that's what I'll try to do in the future. Luckily we didn't merge anything requiring new userspace extensions during the last merge window, so we won't need a new release for 2.6.26.

Wed, 14 May 2008

Illustrations


Apparently my blog is too boring to read, so Elena kindly offered to illustrate it. Since this spares me from writing some actual content, I gladly accepted.

What you're seeing below is me resting in a deck chair, enjoying a Rothaus beer, exhaling some unidentified fumes and apparently being haunted by thoughts of ip_route_me_harder() :)

Wed, 07 May 2008

Summer office


The weather has been great the past days, so I set up my summer workplace :)

Working outside is really pleasant after a month of almost constantly grey sky. Below the balcony there's a small stream, and hundreds of birds sit in the trees and sing, which makes an amazing scenery.

I sent out a first batch of HIFN fixes today to avoid causing too many conflicts in the series in case something turns up during review. Caught a good time during which both Evgeniy and Herbert were responsive and it only took about an hour to get all patches reviewed, fix a minor bug and get them merged. The remaining ones are hopefully in shape by tomorrow, the descriptor accounting still needs a bit more work. Herbert also merged some patches from Loc Ho today for async hashing support, which is cool because I already started adding hashing support to the HIFN driver until I noticed the CryptoAPI doesn't support it asynchronously yet :)

Also sent out a few netfilter patches and fixed a slightly embarrassing bug in the macvlan driver. It would crash the kernel on module unload because cleanup was performed incorrectly, causing the kernel to jump to a NULL function pointer when receiving the next packet on the underlying device. I wonder why I've never noticed this.

Tue, 06 May 2008

Fighting the HIFN driver


What I initially hoped would be just a simple fix for a few arithmetic errors in the driver for the HIFN 795x crypto accelerator cards turned into a week-long struggle, accompanied by at least a hundred crashes and reboots.

The initial bug manifested itself by going into an endless loop when the CryptoAPI issued a request for less data than the full scatterlist, caused by an integer underflow while calculating the remaining amount of data to be processed. The fix was straightforward: only use the minimum of the scatterlist size and the crypto request size. While at it, I also fixed some endian bugs, missing error propagation for errors that shouldn't happen, but did because of the underflow, and some overly strict data alignment checks.
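
The underflow is easy to reproduce in a few lines. The Python sketch below emulates 32 bit unsigned arithmetic with a mask; the function names and the assumption that the remaining length is an unsigned 32 bit value are made up for illustration and are not taken from the driver.

```python
U32_MASK = 0xFFFFFFFF  # emulate a C unsigned int

def remaining_buggy(request_len, sg_len):
    """Subtract the whole scatterlist element, even if the request is shorter."""
    return (request_len - sg_len) & U32_MASK

def remaining_fixed(request_len, sg_len):
    """Only consume min(scatterlist size, request size)."""
    chunk = min(sg_len, request_len)
    return (request_len - chunk) & U32_MASK

# A 16 byte request backed by a 4096 byte scatterlist element.
buggy = remaining_buggy(16, 4096)   # wraps to a huge value, loop never ends
fixed = remaining_fixed(16, 4096)   # reaches zero, loop terminates
```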

Testing looked good, no more crashes, but surprisingly the testcases of the tcrypt module using algorithms provided by HIFN randomly failed. This turned out to be caused by an incorrect return value indicating synchronous processing to the CryptoAPI, while the request was in fact processed asynchronously. So when the result was not already available when returning from the driver, testcases failed.

After fixing the tcrypt failures, next was some real-life testing using IPsec. The first attempt resulted in an immediate crash in crypto_authenc_genicv(). This one was fixed fairly quickly: the asynchronous completion handler interpreted a pointer as the wrong structure type.

The second attempt looked more promising, no crashes, packets went through and looked like IPsec. The remote side failed to parse them however; a closer look revealed that they were incorrectly constructed and had 16 bytes of garbage at the end. From my last attempt to fix the driver I remembered that this was most likely caused by missing initialization vector size initialisation of the CBC modes. Naively, I changed the driver to properly initialize the ivsize. To my surprise, attempting to add SAs using cbc(aes) now failed with -ENOENT.

Figuring out the reason took me almost an entire day. When the ivsize is already initialized, the CryptoAPI attempts to spawn a new instance of the algorithm. Algorithms are identified by name, possibly combined with modes, like cbc(aes). When spawning new algorithms, the driver name is used for the lookup however, which in the case of HIFN was "hifn-aes" for all AES modes, causing the lookup to return the ofb(aes) algorithm instead of cbc(aes). Using unique driver names for the different algorithm modes fixed this problem.
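
The name clash can be modelled with a tiny registry. The Python sketch below is a toy model, not the CryptoAPI: the registry class is hypothetical, the registration order is an assumption, and the unique driver names in the second half are invented for the example.

```python
class AlgRegistry:
    """Toy registry: algorithms have an algorithm name and a driver name."""

    def __init__(self):
        self.algs = []   # (alg_name, driver_name) in registration order

    def register(self, alg_name, driver_name):
        self.algs.append((alg_name, driver_name))

    def lookup_by_driver(self, driver_name):
        for alg_name, drv in self.algs:
            if drv == driver_name:
                return alg_name      # first match wins
        return None

# Shared driver name: the lookup cannot tell the modes apart.
shared = AlgRegistry()
shared.register("ofb(aes)", "hifn-aes")
shared.register("cbc(aes)", "hifn-aes")
wrong = shared.lookup_by_driver("hifn-aes")

# Unique (hypothetical) driver names per mode resolve correctly.
unique = AlgRegistry()
unique.register("ofb(aes)", "hifn-ofb-aes")
unique.register("cbc(aes)", "hifn-cbc-aes")
right = unique.lookup_by_driver("hifn-cbc-aes")
```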

While chasing this bug, I noticed some DMA memory corruption issues in the HIFN driver. When a request contains more than a single scatterlist element, the driver programmed the hardware to perform one crypt operation per scatterlist element, but for the full request size, corrupting the memory after its tail. The fix for this was a bit more involved since using the correct length also requires performing only a single operation for all scatterlist elements, because the source and destination descriptors don't necessarily have identical lengths. This complicates keeping track of free descriptor entries. Previously, each operation needed exactly one command, source, destination and result descriptor. With only a single operation, it needs one command and result descriptor and a varying amount of source and destination descriptors. On the upside, this reduces the number of interrupts per request to exactly one instead of one per scatterlist element and gets rid of some atomic operations. Additionally tcrypt can now detect destination buffer corruption for cipher tests.
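
The change in descriptor accounting boils down to the following counts. This Python sketch only restates the paragraph above as arithmetic; the function names and dictionary keys are hypothetical.

```python
def descriptors_per_element(num_sg):
    """Old scheme: every scatterlist element was its own operation."""
    return {"cmd": num_sg, "src": num_sg, "dst": num_sg, "res": num_sg}

def descriptors_single_op(num_src_sg, num_dst_sg):
    """New scheme: one operation, with varying source/destination counts."""
    return {"cmd": 1, "src": num_src_sg, "dst": num_dst_sg, "res": 1}

# A request with 4 source elements and 2 destination elements.
old = descriptors_per_element(4)
new = descriptors_single_op(4, 2)
```

Because the source and destination counts can now differ, the free space in each descriptor ring has to be checked separately before a request is accepted.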

Continuing testing with IPsec, things now looked better, packets were properly sized and the receive side worked properly. Outgoing packets were still dropped by the receiver however. Looking more closely at the packets showed that they contained what looked like a block of unencrypted data at the end. Additionally there still were some rare random crashes in the CryptoAPI. The crashes were caused by a missing check for end-of-scatterlist in one of the CryptoAPI scatterlist helpers, the unencrypted block of data by an off-by-one in the eseqiv sequence number generator. Both problems were fixed by Herbert Xu. The first victory - IPsec now worked properly using ping. TCP connections stalled after a short period however.

Half a day later, I also figured out the reason for the stalls. The HIFN driver needs to keep some context for each request since it processes them asynchronously. The driver used the global per-transform storage for this context instead of the per-request storage, corrupting existing contexts when more than one request was outstanding. Even in flood mode, ping exhibits ping-pong behaviour, waiting for a reply before sending the next request, which is why it wasn't affected by this problem. With this also fixed, IPsec seemed to be working properly, at least on the HIFN side. There still appears to be some corruption of the XFRM CB with asynchronous processing, causing outgoing tunnel mode packets to be sent without IP_DF, but that should be easily fixed.
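
Why ping survived while TCP stalled becomes obvious once two requests are in flight at the same time. The Python sketch below is a toy model with hypothetical names, not driver code: one variant keeps the request state in storage shared by the whole transform, the other gives each request its own context.

```python
class Transform:
    """Stands in for a crypto transform with one shared context area."""

    def __init__(self):
        self.shared_ctx = {}

def submit_shared(tfm, request_id, iv):
    """Buggy variant: request state lives in the per-transform storage."""
    tfm.shared_ctx["iv"] = iv            # clobbers any outstanding request
    return (request_id, tfm.shared_ctx)

def submit_per_request(request_id, iv):
    """Fixed variant: every request carries its own context."""
    return (request_id, {"iv": iv})

tfm = Transform()
r1 = submit_shared(tfm, 1, b"iv-one")
r2 = submit_shared(tfm, 2, b"iv-two")    # second request before r1 completes
corrupted_iv = r1[1]["iv"]

p1 = submit_per_request(1, b"iv-one")
p2 = submit_per_request(2, b"iv-two")
intact_iv = p1[1]["iv"]
```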

Next was testing with dm-crypt, for which I actually purchased the card. Testing worked fine while debugging was enabled, without debugging it reproducibly crashed in the device mapper code. This was fairly nasty to debug since enabling debugging stopped the bug from happening. After following lots of dead ends and some suggestions from Evgeniy, I found the cause: when no descriptors are currently available, the request is queued and processed once enough descriptors are available again. The queue length is limited (in the case of HIFN to 1), when the limit is reached the behaviour depends on the flags specified by the caller. When using CRYPTO_TFM_REQ_MAY_SLEEP, the caller goes to sleep and waits for notification from the driver when it's ready to accept more requests. When dequeuing the crypto queue, asynchronous crypto drivers need to check for backlogged clients and wake them before continuing processing. This was missing from the HIFN driver, causing it to call the dm-crypt completion handler for a request that wasn't fully initialized.
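
The missing step can be shown with a small queue model. The Python sketch below is a simplification with hypothetical names, not the CryptoAPI queue implementation: it assumes a backlogged caller is woken with an "in progress" notification at the moment the driver dequeues, and that a caller which may not sleep is simply refused.

```python
import errno
from collections import deque

class CryptoQueue:
    """Toy model of a length-limited request queue with a backlog."""

    def __init__(self, max_len=1):
        self.max_len = max_len
        self.queue = deque()
        self.backlog = deque()

    def enqueue(self, request, may_sleep):
        if len(self.queue) >= self.max_len:
            if not may_sleep:
                return "refused"
            self.backlog.append(request)     # caller sleeps until notified
            return "backlogged"
        self.queue.append(request)
        return "queued"

    def dequeue(self, notify):
        """Pop the next request, first waking one backlogged caller if any."""
        if not self.queue:
            return None
        if self.backlog:
            waiter = self.backlog.popleft()
            notify(waiter, -errno.EINPROGRESS)   # the step the driver lacked
            self.queue.append(waiter)
        return self.queue.popleft()

woken = []
notify = lambda request, code: woken.append((request, code))

q = CryptoQueue(max_len=1)
first = q.enqueue("r1", may_sleep=True)
second = q.enqueue("r2", may_sleep=True)
third = q.enqueue("r3", may_sleep=False)
processed_first = q.dequeue(notify)
processed_second = q.dequeue(notify)
```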

With this bug also fixed, dm-crypt survived a 24 hour stress test. I'm a bit reluctant at this point to use it for real data though, all those bugs didn't exactly instill confidence. The patches are in an almost upstream-submittable state, just the descriptor accounting needs some minor cleanup. I hope to get this done today or tomorrow and then attend to the huge backlog in my inbox that has grown over the past week.

On the netfilter front, nothing too exciting has happened during the last two weeks. 2.6.25 appears to have gone pretty well, netfilter-wise, except for one nasty hashing regression on ARM, fixed by Philip Craig. The amount of patches merged during the 2.6.26 merge window was smaller than usual, the highlights are:

  • A large amount of SIP helper fixes and improvements
  • DCCP conntrack/NAT
  • UDP-Lite NAT
  • SCTP NAT
  • Completion of network namespace support for {ip,ip6,arp}_tables

I'm particularly happy about finally managing to merge the SIP helper patches, which I had queued for almost 9 months. If you've tried using it and it didn't work, now is a good time to try again and submit bug reports :)

Overcoming laziness


I decided to give blogging another try. My last attempt failed after just one or two entries because of me being too lazy to actually write something, but since I enjoy reading other people's blogs, I hope I can keep the motivation up a bit longer this time :)

Copyright (C) 2001-2005 Patrick McHardy