Martin Josefssons netfilter blog

Fri, 11 Aug 2006
Refcount problem fixed
I've finally found the refcount problem after adding tons of printks everywhere. It turns out that the event cache increases the refcount for each packet in the L3 protocol handler and then decreases it in nf_conntrack_confirm(). This refcount isn't coupled to the skb in question the way the "regular" refcount is, which means that it isn't decreased automagically when the skb is freed.

When we register the hooks we pass in an array of 'struct nf_hook_ops' and loop over the array, calling nf_register_hook() for each hook registration. nf_register_hook() calls spin_lock_bh()/spin_unlock_bh(), and spin_unlock_bh() executes any pending softirqs after it unlocks the lock. This means that we can register nf_conntrack_in() at PREROUTING and, during this hook registration, get an interrupt from the NIC that schedules a softirq to be executed as soon as possible. When the hook registration of nf_conntrack_in() at PREROUTING is complete and spin_unlock_bh() is called, the pending softirq is executed. Now we have just passed a packet through nf_conntrack_in(), which calls the event cache, which increases the refcount of the conntrack entry. But nf_conntrack_confirm() wasn't registered yet, so the event cache never decreased the refcount, since the packet never passed through nf_conntrack_confirm().

One workaround could be to extend the spin_lock()/spin_unlock() to cover the entire nf_conntrack_hooks(). That would prevent the above scenario from occurring... but only on UP kernels. On SMP kernels one cpu could be going through the hooks at the same time as we are registering/unregistering them, since we are using RCU for the linked lists of hooks. So that's not a solution.

A better solution is to order the hook registrations in the 'struct nf_hook_ops' array in the order the hooks will be called in the net stack, and then register them in reverse order. We still unregister them in the forward order.
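To make the ordering argument concrete, here is a small user-space toy model of it. This is only a sketch, not kernel code: every name in it (toy_stack, toy_deliver_packet(), toy_register_all(), toy_unregister_all()) is made up for illustration, a "packet" is delivered right after each single registration or unregistration to stand in for the softirq that spin_unlock_bh() may run, and packets that are already in flight (the reason for synchronize_net()) are not modelled at all.

```c
#include <stdbool.h>

/* Toy user-space model, NOT kernel code. All names are hypothetical. */

/* Hooks listed in the order a packet meets them in the net stack. */
enum { HOOK_IN, HOOK_CONFIRM, NUM_HOOKS };

struct toy_stack {
    bool registered[NUM_HOOKS];
    int  event_refs;   /* stands in for the event cache's extra refcount */
};

/* Send one packet through whichever hooks are registered right now. */
static void toy_deliver_packet(struct toy_stack *s)
{
    bool seen_by_in = false;

    if (s->registered[HOOK_IN]) {
        seen_by_in = true;
        s->event_refs++;        /* reference taken in the "in" hook */
    }
    if (s->registered[HOOK_CONFIRM]) {
        /* Tolerate packets that never went through the earlier hook. */
        if (seen_by_in)
            s->event_refs--;    /* reference dropped in the "confirm" hook */
    }
}

/* Register the hooks one at a time, with a packet arriving after each
 * registration. Returns the number of leaked references. */
static int toy_register_all(bool reverse)
{
    struct toy_stack s = { { false, false }, 0 };

    for (int i = 0; i < NUM_HOOKS; i++) {
        int idx = reverse ? NUM_HOOKS - 1 - i : i;
        s.registered[idx] = true;
        toy_deliver_packet(&s);
    }
    return s.event_refs;
}

/* Unregister the hooks one at a time, with a packet arriving after each
 * unregistration. Returns the number of leaked references. */
static int toy_unregister_all(bool reverse)
{
    struct toy_stack s = { { true, true }, 0 };

    for (int i = 0; i < NUM_HOOKS; i++) {
        int idx = reverse ? NUM_HOOKS - 1 - i : i;
        s.registered[idx] = false;
        toy_deliver_packet(&s);
    }
    return s.event_refs;
}
```

In this model toy_register_all(false) leaks one reference while toy_register_all(true) leaks none, and for unregistration it is the other way around: forward order is the safe one. The price is the check in the "confirm" step for packets that never saw the "in" step.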
This means that nf_conntrack_confirm() is registered before nf_conntrack_in(), and unregistered after it. After each unregistration we call synchronize_net() to make sure that all packets currently in the net stack are finished before we continue to unregister the next hook. We must make sure that a packet that passed through nf_conntrack_in() completes its journey through the net stack before we unregister the nf_conntrack_confirm() hook, thus the synchronization.

Registering in reverse order and unregistering in forward order means that packets can, for example, pass through nf_conntrack_confirm() without having passed through nf_conntrack_in(). This is the reverse of the original problem, but it's much easier to handle :) We need to handle cases like this, where skbs are supposed to have passed through an earlier hook but haven't because of the racy registration and unregistration.

I've implemented the above solution and so far it has passed my testcase, consisting of unloading/loading nf_conntrack_ipv4 in a loop, without problems. Without the fix it's very easy to trigger the refcount problem, even without my RCU patches. I've been able to trigger it with as few as 3 icmp echo requests while reloading the module in a loop; sometimes up to 20 packets are needed. I've only tested the fix in QEMU but I hope it also works on real SMP machines. And I need to implement the same fix for the ipv6 part as well.

Sometime in the future we'll hopefully be able to get rid of the extra refcounts the event cache brings, time will tell. Now it's time for some sleep.

Sat, 05 Aug 2006
Sneaky refcount
I've finally got qemu working the way I want, so I can start to test my RCU patches without having to reboot my laptop all the time. I'm having a weird refcount problem with the L3/L4 RCU patch: I sometimes get a conntrack entry with an elevated refcount that never drops down to zero. Very annoying.

Steps to reproduce: ping the testmachine and run

    while true; do rmmod nf_conntrack_ipv4 ; modprobe nf_conntrack_ipv4; done

on the testmachine. This results in the rmmod not returning and an entry with use=1 in /proc/net/nf_conntrack_deleted. All entries to be deleted as a result of the forced kill are added to this linked list after being removed from the hashtable, and we wait until they are all dead since they contain pointers to the L3/L4 protocol handlers that we are about to unload. They die and get removed from this deleted linked list when the refcount, use, drops to zero. But this never happens.

The counters of the entry always say that there's only been traffic in the ORIG direction. Since it was an icmp echo-request and 'ping' reports that it has received all responses, we know that the packet must have passed through the stack properly, which should have decreased the refcount when the packet was kfree_skb()'d. I've seen it with tcp packets as well, but it's easier to verify that all packets got through with icmp.

We have three different "users" of refcounts. The timer for the conntrack entry holds one reference. Each packet passing through the conntrack infrastructure holds one reference. And we sometimes manually increase and decrease the refcount when we want to force an entry to stick around for an extended period of time (some of these forced refcount increases/decreases might go away with the use of RCU for the actual entries later, but that's another story).

Testing is performed on a UP kernel without preemption and without the -rt patches, which means that there's no preemption of the softirq going on.
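The refcount bookkeeping described above can be pictured with a tiny user-space model. This is a sketch under my own assumptions, not the nf_conntrack code: toy_conn, toy_get(), toy_put() and toy_lifecycle() are hypothetical names, and the leak_one_ref flag simply stands in for "something somewhere" taking a reference that is never released.

```c
#include <stdbool.h>

/* Toy model of the refcount "users"; hypothetical names, not nf_conntrack API. */
struct toy_conn {
    int use;   /* reference count, like the "use" value shown for the entry */
};

static void toy_get(struct toy_conn *ct) { ct->use++; }
static void toy_put(struct toy_conn *ct) { ct->use--; }

/* Walk one entry through its life and return the refcount that is left
 * once the module unload has forcibly killed the timer. */
static int toy_lifecycle(int packets, bool leak_one_ref)
{
    struct toy_conn ct = { 0 };

    toy_get(&ct);                  /* the timer's reference */

    for (int i = 0; i < packets; i++) {
        toy_get(&ct);              /* held while the packet is in the stack */
        if (leak_one_ref && i == 0)
            toy_get(&ct);          /* extra reference that nobody drops */
        toy_put(&ct);              /* dropped when the skb is freed */
    }

    toy_put(&ct);                  /* timer killed on unload drops its reference */
    return ct.use;
}
```

With balanced gets and puts, toy_lifecycle(3, false) ends at 0 and the entry could die. With a single unmatched get, toy_lifecycle(3, true) ends at 1, which is the stuck use=1 situation: the entry stays on the deleted list and the rmmod never returns.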
Everything should be serialized and pretty, but something somewhere increases the refcount without decreasing it. The L3/L4 patch decreases the refcount when it forcibly kills the timer for the entry, so that should be ok. Maybe I'll see more clearly after a beer...

Tue, 01 Aug 2006
First entry in 9.5 months
It's been a long time since my last entry, about 9.5 months. So what has happened since then? Not much. I moved to a new apartment 6 months ago; it still looks a little like I recently moved in, but that is starting to change for the better. I haven't been hacking very much on netfilter, or on anything, lately. I implemented a small hackish mysql-backed dhcp-server in perl, but that's about it.

I've been on vacation for 3 weeks and I've rediscovered the joy of hacking. I've been working on cleaning up nf_conntrack a bit in preparation for using regular spinlocks instead of rwlocks, and then moving on to RCU. So far I've split nf_conntrack_core.c into a few smaller files since it was fairly large at 1700 lines, replaced the rwlock with a spinlock, and started using RCU for the l3/l4 protocol handlers and helpers, plus various other cleanups and minor optimizations. The goal is to use RCU for the hashtable as well and only use a few atomic operations in the fast path; currently we have a truckload of atomic operations in there.

Sat, 15 Oct 2005
Hashtrie goes kernel
I now have the hashtrie compiling in nf_conntrack. It is still untested, and testing will have to wait until tomorrow. There are still some unresolved issues, like the unconfirmed list: an entry that hadn't been added to the hashtable yet was kept on that list using the same list_head that was otherwise used for the hashtable. This has to change now since list_head isn't used any more. The old way of implementing the unconfirmed list was bad anyway, since it was a global list which was modified twice for each new assured connection. The other main issues are some refcounting and locking problems, and I have to implement a way to get conntrack dumping working.

Sun, 09 Oct 2005
Travelling home
Today was a fairly uneventful day travelling back home.

Hacking day 2
The second day of hacking started out a bit sluggish but it picked up speed later in the day.
We spent the day at the hotel all crammed into one hotel room. Our biggest problem was the
fact that the WiFi at the hotel was extremely unstable, to the point where it was often unusable.

Hacking day 1
The formal workshop is over but we have two more days of hacking planned.
This first day we tried to divide ourselves up into small groups where each group
works on or discusses one area, like {nf,ct}netlink, nfhipac, tcpwindowtracking etc...
I mostly experimented with the hashtrie but I also attended the nfhipac discussions
and I have to say that nfhipac is going to kick butt when the proposed changes are made.
These changes add a generic userspace to kernelspace format based on nfnetlink that is
going to be documented so you actually can have different userspace applications to manage
rules, and you can even have different filtering backends in the kernel. We just don't want
to paint ourselves into a corner, we like to think we've learned from previous mistakes.
Another really nice result of this discussion is a new way to pass the data needed
for matches between userspace and kernelspace. Currently that is performed by passing structs
around, which has a lot of problems, one problem is the 64bit kernel and 32bit userspace issue
which becomes a problem when you have things like pointers in the structs. Another problem is
when you want to add more members to the struct, then the size of the struct definition in
an old kernel and in the new userspace library doesn't match anymore, thus we've broken
backwards compatibility which just isn't allowed. The new idea is to pass this data around
with netlink TLVs and then you build your internal representation from these TLVs and
then the other way around when you list the rules.

Second day of the Netfilter Workshop
We had lots of nice talks and discussions today as well.
The workshop has come up with possible solutions for many current problems and issues.
Hopefully many of them will result in patches :)

First day of Netfilter Workshop
This was the first day of the Netfilter Workshop. Really nice to meet all the
netfilter hackers again. Lots of nice talks, I gave a small half-improvised talk
about a data structure I'm working on, a hashtrie to be used for connection tracking.
I showed some performance numbers comparing the regular hashtable of ip_conntrack
to itself with different configurations. And I showed some results comparing the
regular hashtable with the hashtrie.

Today was the day for the trip to Seville for this year's Netfilter Workshop.
All travel arrangements were made several months ago, I just forgot one small detail...
To include the strike of the french air traffic controllers in my calculations.
Because of them my first flight from Copenhagen to Madrid was delayed, which led to me
missing the second connecting flight. Getting a new boarding card for the next flight
wasn't a problem, but the next flight wasn't supposed to depart for another 1.5 hours.
The monitors said it was delayed 45 minutes so I walked to the gate in order to
rest and maybe finish the presentation I was supposed to give at the NFWS2005.
As I'm walking to the gate I discover a weird thing, the Madrid airport has designated
smoking areas, the problem is that those areas aren't separated from the other areas by
anything other than some blue tape on the floor. I couldn't see any extra ventilation
in the ceiling above these smoking areas either.

I now have a blog!
Since Harald has installed blosxom on people.netfilter.org, I now have a blog.