Patrick McHardy's blog

Tue, 08 Jul 2008

VLAN update


Dave just merged the second part of my VLAN update for 2.6.27. It contained nothing particularly interesting: mainly minor cleanups, uninlining, ethtool support for querying offload settings, and a minor fix for incorrect header pointer adjustments with software tagging. The more interesting part is contained in the final update, which I just sent out as an RFC.

Using VLANs with hardware acceleration leads to a number of inconsistencies in what is visible on packet sockets, which these patches attempt to cure. Linux supports three modes of VLAN hardware acceleration:

  • VLAN tagging: the VLAN code passes the VLAN TCI to the driver in the skb's cb, and the hardware inserts the VLAN tag using this TCI when sending the packet out.
  • VLAN stripping: the hardware strips the VLAN tag on RX and passes the TCI to the driver in the RX descriptors. The driver uses a special variant of netif_{rx,receive_skb} that takes the TCI, extracts the VLAN ID, looks up the virtual VLAN device and calls netif_{rx,receive_skb} with the device set to the VLAN device.
  • VLAN filtering: the hardware is programmed to filter out all but the locally configured VLANs.
The inconsistencies resulting from this design are:
  • With VLAN tagging, outgoing packets travel through the stack untagged. This means they appear as regular Ethernet packets to tcpdump. Additionally, the use of the skb's cb is wrong since other layers may use it for their private data. The netem qdisc actually does this, corrupting the VLAN TCI.
  • With VLAN stripping, incoming packets bypass packet sockets on the real device completely. Packets for locally configured VLANs are visible on the VLAN device only, packets for unknown VLANs not at all.
  • With VLAN filtering, some drivers disable VLAN filters in promiscuous mode, some don't. With those that don't, packets for unknown VLANs are not visible. Most drivers don't enable VLAN filtering until the first VLAN is configured, which is even more inconsistent.
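
The cb problem from the first point can be illustrated with a toy model. This is only a sketch in Python, not kernel code: the `Skb` class, the 16-byte scratch size, and all function names are made up for the example. It shows why a value parked in a shared scratch area can be clobbered by another layer, while a dedicated member survives.

```python
import struct


class Skb:
    """Toy packet buffer with a shared scratch area, loosely modeled on skb->cb."""

    def __init__(self):
        self.cb = bytearray(16)  # size chosen arbitrarily for this sketch
        self.vlan_tci = 0        # dedicated member, as the patches propose


def vlan_store_in_cb(skb, tci):
    # The VLAN layer parks the 16-bit TCI at the start of the scratch area.
    struct.pack_into("!H", skb.cb, 0, tci)


def vlan_load_from_cb(skb):
    return struct.unpack_from("!H", skb.cb, 0)[0]


def other_layer_uses_cb(skb, timestamp):
    # Another layer (hypothetical here) treats the same bytes as its own
    # private data and overwrites them with a 64-bit value.
    struct.pack_into("!Q", skb.cb, 0, timestamp)
```

Storing the TCI both ways and then calling `other_layer_uses_cb()` leaves `skb.vlan_tci` intact, while `vlan_load_from_cb()` no longer returns the original value.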

Since running tcpdump is an exception, fixing this behaviour should not affect or disable the optimizations provided by hardware acceleration. The approach taken by my patches is to promote the VLAN TCI from the cb to a full skb member, both to avoid the netem corruption and to keep it intact within the packet socket code, which itself uses the cb. A new member is added to the packet socket auxiliary data to store the VLAN TCI, allowing userspace to detect that a packet is actually a VLAN packet and (re)construct the VLAN tag. On RX, the hardware acceleration netif_{rx,receive_skb} wrappers store the VLAN TCI in the skb and manually invoke the ETH_P_ALL packet handlers before receiving the packet on the VLAN device. Combined with a patch for libpcap to perform the VLAN tag (re)construction, this fixes the first two issues. One minor remaining issue is that socket filters for VLAN packets don't work as intended, since they expect a VLAN header. Since userspace needs to know about VLAN acceleration anyway, it seems reasonable to put the burden on userspace by providing a new filter instruction for getting the VLAN TCI from the skb's metadata and expecting it to construct the filters accordingly.
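
The userspace side of the tag (re)construction can be sketched roughly as follows. This is a minimal Python illustration, not the libpcap patch: it assumes the capture delivers the untagged Ethernet frame plus the TCI as a separate value, and the function name `reinsert_vlan_tag` is invented for the example. The only fixed facts it relies on are the 802.1Q layout: the 4-byte tag (TPID 0x8100 followed by the 16-bit TCI) sits directly after the two 6-byte MAC addresses.

```python
import struct

ETH_ALEN = 6          # length of one MAC address in bytes
ETH_P_8021Q = 0x8100  # 802.1Q tag protocol identifier


def reinsert_vlan_tag(frame: bytes, tci: int) -> bytes:
    """Rebuild a tagged frame from an untagged frame and an out-of-band TCI.

    Hypothetical helper for illustration only.
    """
    if len(frame) < 2 * ETH_ALEN + 2:
        raise ValueError("frame too short for an Ethernet header")
    if not 0 <= tci <= 0xFFFF:
        raise ValueError("TCI must fit in 16 bits")
    macs = frame[:2 * ETH_ALEN]   # destination + source MAC
    rest = frame[2 * ETH_ALEN:]   # original EtherType and payload
    tag = struct.pack("!HH", ETH_P_8021Q, tci)
    return macs + tag + rest
```

The result is 4 bytes longer than the input, with the original EtherType pushed back behind the tag, which is the form that existing filters and dissectors expect.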

To fix the third issue, the drivers need to be modified to provide the desired semantics. Their initial state should be to filter out all VLANs. Currently most of them only enable filters when adding the first VLAN, which is clearly suboptimal since before that point all VLAN packets are uninteresting, except when in promiscuous mode. When adding new VLANs, the filters should be adjusted to allow their respective IDs; this is done correctly by all drivers. Finally, in promiscuous mode, all filters should be disabled. I'm halfway done modifying the drivers to provide this behaviour (all Intel drivers). For most of the remaining ones I'm not sure about their current behaviour, since it's unclear whether the promiscuous mode offered by the hardware automatically disables VLAN filtering or not.
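
The desired filter semantics can be written down as a small model. This is a sketch in Python with invented names (`VlanFilterModel` and its methods), not driver code, and it only covers tagged frames: everything is filtered initially, configured VLAN IDs are let through, and promiscuous mode disables filtering entirely.

```python
class VlanFilterModel:
    """Toy model of the desired per-device VLAN filter semantics."""

    def __init__(self):
        self.allowed = set()   # locally configured VLAN IDs
        self.promisc = False   # promiscuous mode flag

    def add_vlan(self, vid):
        self.allowed.add(vid)

    def del_vlan(self, vid):
        self.allowed.discard(vid)

    def set_promisc(self, on):
        self.promisc = on

    def accepts(self, vid):
        """Return True if a frame tagged with `vid` passes the filter."""
        if self.promisc:
            return True        # promiscuous mode: no VLAN filtering at all
        return vid in self.allowed
```

With this model a freshly created device rejects every VLAN, accepts VLAN 100 after `add_vlan(100)`, and accepts unknown VLANs only while `set_promisc(True)` is in effect.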

Additionally, it turned out during testing that about half the drivers performing VLAN stripping didn't provide the full TCI but only the VLAN ID to the HW acceleration RX functions. The upper bits of the TCI contain the VLAN priority, which so far has only been used for ingress priority mappings; with my patches it also affects tcpdump visibility. The fixes for this are already in net-next-2.6.git.
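
To make the difference between the full TCI and the bare VLAN ID concrete, here is a short Python sketch that splits a TCI into its 802.1Q fields (3 bits of priority, 1 CFI bit, 12 bits of VLAN ID). The function and constant names are chosen for this example.

```python
VID_MASK = 0x0FFF   # low 12 bits: VLAN ID
CFI_SHIFT = 12      # bit 12: canonical format indicator
PRIO_SHIFT = 13     # top 3 bits: priority


def split_tci(tci):
    """Return (priority, cfi, vid) for a 16-bit 802.1Q TCI."""
    priority = (tci >> PRIO_SHIFT) & 0x7
    cfi = (tci >> CFI_SHIFT) & 0x1
    vid = tci & VID_MASK
    return priority, cfi, vid
```

For a frame on VLAN 100 with priority 5 the TCI is 0xA064, and `split_tci(0xA064)` gives `(5, 0, 100)`. A driver that passes only the VLAN ID effectively hands over 0x0064, so the priority reads as 0.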

What's next
Since with these changes the skb is able to carry the VLAN TCI across layers, we are now able to provide VLAN acceleration to virtual network devices by adding a software fallback, similar to how TSO works. This would allow using hardware tagging from within network namespaces or other virtualized environments. Additionally, there are two more inconsistencies that people have been complaining about:
  • New VLAN devices can only be created when the lower device is UP. They happily continue existing when setting the lower device DOWN though. The reason for this is apparently that some drivers can't cope with having the RX filter programming callback invoked while in DOWN state. It seems the fix with the least risk for this would be to defer filter programming until the lower device is UP.
  • When setting the lower device DOWN, its VLAN devices are put in DOWN state as well, deleting all routes pointing to them. When setting the lower device UP again, the VLAN devices stay down. Even if they were automatically set UP again, the kernel can only reconstruct the automatically created routes. This one probably requires a flag day on which the behaviour will be changed, or alternatively a new flag to specify the desired behaviour.
One more nice thing would be to avoid the netif_rx() scheduling overhead when receiving packets on a VLAN device, which reportedly costs around 1-5% performance. The reason for not calling netif_receive_skb() directly is to keep stack usage low. An idea that has been tossed around that would also benefit other virtual devices is to modify the packet handlers to be able to return a new packet to netif_receive_skb().
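
The last idea can be sketched as a loop: instead of a handler re-injecting the packet (and either recursing or going through the queue again), it hands a new packet back to the caller, which simply runs dispatch again. The following Python toy model uses invented names and a dictionary in place of an skb. It is not the kernel's packet handler interface, only an illustration of the control flow.

```python
def receive(packet, handlers):
    """Deliver `packet`, re-running dispatch while handlers return a new packet.

    `handlers` maps a device name to a function that takes the packet and
    returns either None (packet consumed) or a new packet to process.
    Returns the list of devices the packet was seen on, in order.
    """
    trace = []
    while packet is not None:
        trace.append(packet["dev"])
        handler = handlers.get(packet["dev"])
        if handler is None:
            break  # no handler registered: delivery up the stack would happen here
        packet = handler(packet)
    return trace


def vlan_handler(packet):
    # Retarget the packet at the (hypothetical) VLAN device and hand it back
    # to the receive loop instead of queueing or recursing.
    vid = packet["tci"] & 0x0FFF
    return {**packet, "dev": f"{packet['dev']}.{vid}"}
```

Calling `receive({"dev": "eth0", "tci": 0xA064}, {"eth0": vlan_handler})` visits `eth0` and then `eth0.100` in a single loop, so stack depth stays constant no matter how many virtual devices are layered.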

Copyright (C) 2001-2005 Patrick McHardy