
Martin Josefsson's netfilter blog

Fri, 11 Aug 2006

Refcount problem fixed

I've finally found the refcount problem after adding tons of printk's everywhere. It turns out that the event cache increases the refcount for each packet in the L3 protocol handler and then decreases it in nf_conntrack_confirm(). This refcount isn't coupled to the skb in question the way the "regular" refcount is, which means that "this" refcount isn't decreased automagically when the skb is freed.

When we register the hooks we pass in an array of 'struct nf_hook_ops' and loop over the array, calling nf_register_hook() for each hook registration. nf_register_hook() calls spin_lock_bh()/spin_unlock_bh(), and spin_unlock_bh() executes any pending softirqs after it releases the lock.

This means that we can register nf_conntrack_in() at PREROUTING and, during this hook registration, get an interrupt from the NIC that schedules a softirq to be executed as soon as possible. When the hook registration of nf_conntrack_in() at PREROUTING is complete and spin_unlock_bh() is called, the pending softirq is executed. Now we have just passed a packet through nf_conntrack_in(), which calls the event cache, which increases the refcount of the conntrack entry. But nf_conntrack_confirm() wasn't registered yet, so the event cache never decreased the refcount, since the packet never passed through nf_conntrack_confirm().
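The race can be reproduced in a tiny userspace toy model. This is only a sketch under loud assumptions: the hooks are reduced to name strings, the softirq that spin_unlock_bh() runs is modelled as a plain function call right after the first registration, and final_use() and deliver_packet() are hypothetical helper names, not kernel functions.

```python
# Toy model (NOT kernel code): hooks are names, refcounts are plain ints.
CALL_ORDER = ["nf_conntrack_in", "nf_conntrack_confirm"]


def final_use(registration_order):
    """Register hooks in the given order; one packet sneaks in after the
    first registration (the softirq run by spin_unlock_bh()), and one more
    packet arrives once every hook is registered."""
    use = 1                     # reference held by the entry's timer
    registered = set()

    def deliver_packet():
        nonlocal use
        cached = False          # does the event cache hold a reference?
        for name in CALL_ORDER:
            if name not in registered:
                continue
            if name == "nf_conntrack_in":
                use += 1        # event cache takes its reference
                cached = True
            elif cached:        # nf_conntrack_confirm drops it again
                use -= 1
                cached = False

    for i, name in enumerate(registration_order):
        registered.add(name)
        if i == 0:
            deliver_packet()    # pending softirq runs mid-registration
    deliver_packet()            # ordinary packet, all hooks present
    return use


leaked = final_use(["nf_conntrack_in", "nf_conntrack_confirm"])
print(leaked)  # 2: one reference above the timer's, never dropped
```

The ordinary packet at the end is balanced; only the packet that arrived between the two registrations leaves its reference behind.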

One workaround could be to extend the spin_lock()/spin_unlock() to cover the entire nf_conntrack_hooks(). That would prevent the above scenario from occurring... but only on UP kernels. On SMP kernels one CPU could be going through the hooks at the same time as we are registering/unregistering them, since we are using RCU for the linked lists of hooks. So that's not a solution.

A better solution is to order the hook registrations in the 'struct nf_hook_ops' array in the order the hooks will be called in the net stack, and then register them in reverse order while still unregistering them in forward order. This means that nf_conntrack_confirm() is registered before nf_conntrack_in(), and unregistered after it. After each unregistration we call synchronize_net() to make sure that all packets currently in the net stack are finished before we continue to unregister the next hook. We must make sure that a packet that passed through nf_conntrack_in() completes its journey through the net stack before we unregister the nf_conntrack_confirm() hook, thus the synchronization.

Registering in reverse order and unregistering in forward order means that packets can, for example, be passed through nf_conntrack_confirm() without having been passed through nf_conntrack_in(). This is the reverse of the original problem, but it's much easier to handle :) We need to handle cases like this, where skbs are supposed to have been passed through an earlier hook but that hasn't happened because of the racy registration and unregistration.
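As a rough illustration of why this ordering is safe, here is a userspace toy model of one load/unload cycle. It is a sketch, not the actual patch: ToyStack and reload_cycle() are hypothetical names, a packet is forced through after every single registration and unregistration step, and synchronize_net() appears only as a comment because nothing is ever in flight in this sequential model.

```python
# Toy model (NOT kernel code) of reverse registration / forward unregistration.
CALL_ORDER = ["nf_conntrack_in", "nf_conntrack_confirm"]


class ToyStack:
    def __init__(self):
        self.registered = set()
        self.use = 1                      # reference held by the timer

    def deliver_packet(self):
        cached = False                    # event cache reference taken?
        for name in CALL_ORDER:
            if name not in self.registered:
                continue
            if name == "nf_conntrack_in":
                self.use += 1
                cached = True
            elif cached:
                # nf_conntrack_confirm: drop the event cache reference.
                # A packet that never saw nf_conntrack_in has nothing
                # cached, so confirm must simply leave the count alone.
                self.use -= 1
                cached = False


def reload_cycle(stack):
    # Register in reverse call order: confirm first, then in.
    for name in reversed(CALL_ORDER):
        stack.registered.add(name)
        stack.deliver_packet()            # worst case: packet at every step
    # Unregister in forward call order: in first, then confirm.
    for name in CALL_ORDER:
        stack.registered.discard(name)
        # The real code would call synchronize_net() here so that packets
        # already past nf_conntrack_in still reach nf_conntrack_confirm.
        stack.deliver_packet()
    return stack.use


stack = ToyStack()
counts = [reload_cycle(stack) for _ in range(100)]
print(set(counts))  # {1}: only the timer's reference remains
```

The only odd case left is a packet that reaches confirm without having passed through in, which the toy confirm handles by doing nothing.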

I've implemented the above solution and so far it has passed my test case, consisting of unloading/loading nf_conntrack_ipv4 in a loop, without problems. Without the fix it's very easy to trigger the refcount problem, even without my RCU patches. I've been able to trigger it with as few as 3 ICMP echo requests while reloading the module in a loop, although sometimes up to 20 packets are needed. I've only tested the fix in QEMU but I hope it also works on real SMP machines. And I need to implement the same fix for the IPv6 part as well.

Sometime in the future we'll hopefully be able to get rid of the extra refcounts the event cache brings, time will tell. Now it's time for some sleep.

Sat, 05 Aug 2006

Sneaky refcount

I've finally got QEMU working the way I want, so I can start to test my RCU patches without having to reboot my laptop all the time. I'm having a weird refcount problem with the L3/L4 RCU patch: I sometimes get a conntrack entry that has an elevated refcount which never drops down to zero. Very annoying.

Steps to reproduce: ping the test machine and, on the test machine, run

    while true; do rmmod nf_conntrack_ipv4; modprobe nf_conntrack_ipv4; done

This results in the rmmod not returning and an entry with use=1 in /proc/net/nf_conntrack_deleted. All entries to be deleted as a result of the forced kill are added to this linked list after being removed from the hashtable, and we wait until they are all dead since they contain pointers to the L3/L4 protocol handlers that we are about to unload. They die and get removed from this deleted linked list when the refcount, use, drops to zero. But this never happens. The counters of the entry always say that there's only been traffic in the ORIG direction. Since it was an ICMP echo request and 'ping' reports that it has received all responses, we know that the packet must have passed through the stack properly, which should have decreased the refcount when the packet was kfree_skb()'d. I've seen it with TCP packets as well, but it's easier to verify that all packets got through with ICMP.

We have three different "users" of refcounts. The timer for the conntrack entry holds one reference. Each packet passing through the conntrack infrastructure holds one reference. And we sometimes manually increase and decrease the refcount when we want to force an entry to stick around for an extended period of time (some of these forced refcount increases/decreases might go away with the use of RCU for the actual entries later, but that's another story).

Testing is performed on a UP kernel without preemption and without the -rt patches, which means that there's no preemption of the softirq going on. Everything should be serialized and pretty, but something somewhere increases the refcount without decreasing it. The L3/L4 patch decreases the refcount when it forcibly kills the timer for the entry, so that should be ok.

Maybe I'll see more clearly after a beer...

Tue, 01 Aug 2006

First entry in 9.5 months

It's been a long time since my last entry, about 9.5 months. So what has happened since then? Not much. I moved to a new apartment 6 months ago. It still looks a little like I recently moved in, but that is starting to change for the better.

I haven't been hacking very much on netfilter, or on anything, lately. I implemented a small hackish MySQL-backed DHCP server in Perl, but that's about it. I've been on vacation for 3 weeks and I've rediscovered the joy of hacking. I've been working on cleaning up nf_conntrack a bit in preparation for using regular spinlocks instead of rwlocks, and then moving on to RCU. So far I've split nf_conntrack_core.c into a few smaller files, since it was fairly large at 1700 lines, replaced the rwlock with a spinlock, and switched to RCU for the L3/L4 protocol handlers and helpers, along with various other cleanups and minor optimizations. The goal is to use RCU for the hashtable as well and only use a few atomic operations in the fast path. Currently we have a truckload of atomic operations in there.

Sat, 15 Oct 2005

Hashtrie goes kernel

I now have hashtrie in nf_conntrack compiling. It is still untested, and testing will have to wait until tomorrow. There are still some unresolved issues, like the unconfirmed list: as long as an entry hadn't been added to the hashtable, that list reused the same list_head that otherwise linked the entry into the hashtable. This has to change now since list_head isn't used any more. The old way of implementing the unconfirmed list was bad anyway, since it was a global list which was modified twice for each new assured connection. The other main issues are some refcounting and locking problems, and I have to implement a way to get conntrack dumping working.

Sun, 09 Oct 2005

Travelling home

Today was a fairly uneventful day travelling back home.

During the second flight, from Madrid to Copenhagen, I came up with a possible way to modify the hashtrie to do longest prefix matching. This needs a lot more thought to make sure it could actually work. Then comes the implementation, which I fear is going to become a bit tricky. It may end up being a very bad idea... time will tell.

I unfortunately caught a cold, so when I got back home I had a sore throat and a bit of a fever.

Sat, 08 Oct 2005

Hacking day 2

The second day of hacking started out a bit sluggish but it picked up speed later in the day. We spent the day at the hotel all crammed into one hotel room. Our biggest problem was the fact that the WiFi at the hotel was extremely unstable, to the point where it was often unusable.
In the evening we went out for yet another wonderful meal. Then it was time to say goodbye to everyone as I'm going to leave early in the morning.

Fri, 07 Oct 2005

Hacking day 1

The formal workshop is over but we have two more days of hacking planned. This first day we tried to divide ourselves up into small groups where each group works on or discusses one area, like {nf,ct}netlink, nfhipac, tcpwindowtracking etc... I mostly experimented with the hashtrie but I also attended the nfhipac discussions, and I have to say that nfhipac is going to kick butt when the proposed changes are made. These changes add a generic userspace to kernelspace format based on nfnetlink that is going to be documented, so you can actually have different userspace applications to manage rules, and you can even have different filtering backends in the kernel. We just don't want to paint ourselves into a corner, and we like to think we've learned from previous mistakes. Another really nice result of this discussion is a new way to pass the data needed for matches between userspace and kernelspace. Currently that is done by passing structs around, which has a lot of problems. One is the 64-bit kernel and 32-bit userspace issue, which bites when you have things like pointers in the structs. Another shows up when you want to add more members to a struct: the size of the struct definition in an old kernel and in the new userspace library doesn't match anymore, so we've broken backwards compatibility, which just isn't allowed. The new idea is to pass this data around as netlink TLVs and build your internal representation from these TLVs, and then go the other way around when you list the rules.
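To make the TLV idea concrete, here is a small userspace sketch. The layout is loosely modelled on netlink attributes (a 16-bit length that covers header plus payload, a 16-bit type, and padding to a 4-byte boundary), but everything else is an assumption made for illustration: the attribute type numbers, the helper names put_attr() and parse_attrs(), and the example fields are all hypothetical and are not the format discussed at the workshop.

```python
import struct

# Header: u16 length (header + payload), u16 type, host byte order.
HDR = struct.Struct("=HH")


def align4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def put_attr(attr_type, payload):
    """Encode one type-length-value attribute, padded to 4 bytes."""
    length = HDR.size + len(payload)
    padding = b"\0" * (align4(length) - length)
    return HDR.pack(length, attr_type) + payload + padding


def parse_attrs(buf, known_types):
    """Decode attributes, silently skipping types the parser doesn't know."""
    attrs = {}
    off = 0
    while off + HDR.size <= len(buf):
        length, attr_type = HDR.unpack_from(buf, off)
        if length < HDR.size or off + length > len(buf):
            raise ValueError("truncated attribute")
        if attr_type in known_types:
            attrs[attr_type] = buf[off + HDR.size:off + length]
        off += align4(length)
    return attrs


# Hypothetical attribute types for a port match.
ATTR_SRC_PORT = 1
ATTR_DST_PORT = 2
ATTR_COMMENT = 3          # added later by a "newer" userspace

msg = (put_attr(ATTR_SRC_PORT, struct.pack("!H", 1024))
       + put_attr(ATTR_DST_PORT, struct.pack("!H", 80))
       + put_attr(ATTR_COMMENT, b"web"))

# An "old kernel" that only knows the two port attributes still parses it.
old_kernel = parse_attrs(msg, {ATTR_SRC_PORT, ATTR_DST_PORT})
dst_port = struct.unpack("!H", old_kernel[ATTR_DST_PORT])[0]
print(dst_port)  # 80
```

Because every attribute carries its own length and the values are explicit fixed-width fields rather than pointers, the receiver never depends on the sender's struct size or word size, and an unknown attribute can be stepped over instead of breaking the parse.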

We had nothing planned for the evening so we went back to the hotel to leave all the hardware and then just go out somewhere to eat and drink some beer. 15 of us went out barhopping. Later in the evening we ended up on a fairly small street, and I have to say I've never ever seen so many people in one place before. The reason for this was that there are 4 bars located within a total distance of 10 meters. We had a really good time there, but when we finally decided to head back to the hotel we noticed that Pablo, who knew the way back to the hotel, was missing. We ended up walking for a while but we eventually found the hotel.

Thu, 06 Oct 2005

Second day of the Netfilter Workshop

We had lots of nice talks and discussions today as well. The workshop has come up with possible solutions for many current problems and issues. Hopefully many of them will result in patches :)

The pub/restaurant we ended up at in the evening was a bit interesting. Instead of tables and chairs it had kind of a bleacher where everyone sat, drank beer and ate their food. Me and Harald yet again failed to go to sleep when we got back to the hotel. Once again we ended up discussing a lot of different topics, including why it is that no one seems to have implemented an open software stack for the AC97 compatible winmodems present in almost all newer laptops. That discussion ended up covering signal processing and other weird things people have done that make writing a stack for these modems sound not all that difficult (that is, if you know what you are doing, which I don't :)
I've been hacking on the hashtrie during the day, deletes are now a lot faster and forced eviction of certain entries (with a special status) is implemented and it appears to be really fast as well. But this new feature will have to undergo a lot of testing to make sure we're not ending up evicting the wrong entries, agewise that is. Somehow I keep thinking that if I implement something new and it turns out it's fast, it must be broken in some way.

I also got some more free T-shirts, which is always welcome :)

Wed, 05 Oct 2005

First day of Netfilter Workshop

This was the first day of the Netfilter Workshop. Really nice to meet all the netfilter hackers again. Lots of nice talks. I gave a small half-improvised talk about a data structure I'm working on, a hashtrie to be used for connection tracking. I showed some performance numbers comparing the regular hashtable of ip_conntrack to itself with different configurations. And I showed some results comparing the regular hashtable with the hashtrie.

Later in the evening we went out for dinner, which was really nice. I think there were around 20 people attending the dinner, resulting in lots of discussions about many netfilter related issues. Some of us went back to the hotel early in order to get some much-needed sleep. That failed as expected, but it was worth a try. Me and Harald ended up talking about a lot of different things for a while instead, including Swedish and German food, so we didn't get more sleep anyway.

Tue, 04 Oct 2005

Today was the day for the trip to Seville for this year's Netfilter Workshop.

All travel arrangements were made several months ago, I just forgot one small detail... to include the strike of the French air traffic controllers in my calculations. Because of them my first flight, from Copenhagen to Madrid, was delayed, which led to me missing the connecting flight. Getting a new boarding card for the next flight wasn't a problem, but the next flight wasn't supposed to depart for another 1.5 hours. The monitors said it was delayed 45 minutes, so I walked to the gate in order to rest and maybe finish the presentation I was supposed to give at the NFWS2005. As I was walking to the gate I discovered a weird thing: the Madrid airport has designated smoking areas, and the problem is that those areas aren't separated from the other areas by anything other than some blue tape on the floor. I couldn't see any extra ventilation in the roof above these smoking areas either.

Anyway, this new flight ended up being delayed over an hour, so I arrived in Seville after midnight. I met Harald at the hotel room and we had a nice chat before it was time to go to bed.

Tue, 27 Sep 2005

I now have a blog!

Since Harald has installed blosxom on, I now have a blog

Copyright (C) 2001-2005 Martin Josefsson