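I honestly envy myself, to the point of extreme self-worship: if I could play the erhu as effortlessly as I tinker with Linux networking, I would dare to take a month of unpaid leave and spend half of every day playing the erhu in the park... Too bad that so far I can barely get a sound out of it.
A little over a month ago I did an optimization of Netfilter conntrack: I split conntrack into multiple tables to replace the current single table, in order to make lookups more efficient. I did this work independently, and I hoped the latest kernel would already contain such an optimization, but it does not. There is something similar, though: support for conntrack zones. That feature is not an optimization. It merely adds one more key to conntrack, the zone, so that identical conntrack entries or NAT rules can be placed in different zones. What is this feature good for? An article on LWN explains: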
The attached largish patch adds support for "conntrack zones",
which are virtual conntrack tables that can be used to seperate
connections from different zones, allowing to handle multiple
connections with equal identities in conntrack and NAT.
A zone is simply a numerical identifier associated with a network
device that is incorporated into the various hashes and used to
distinguish entries in addition to the connection tuples. Additionally
it is used to seperate conntrack defragmentation queues. An iptables
target for the raw table could be used alternatively to the network
device for assigning conntrack entries to zones.
This is mainly useful when connecting multiple private networks using
the same addresses (which unfortunately happens occasionally) to pass
the packets through a set of veth devices and SNAT each network to a
unique address, after which they can pass through the "main" zone and
be handled like regular non-clashing packets and/or have NAT applied a
second time based f.i. on the outgoing interface.
Something like this, with multiple tunl and veth devices, each pair
using a unique zone:
<tunl0 / zone 1>
|
PREROUTING
|
FORWARD
|
POSTROUTING: SNAT to unique network
|
<veth1 / zone 1>
<veth0 / zone 0>
|
PREROUTING
|
FORWARD
|
POSTROUTING: SNAT to eth0 address
|
<eth0>
As probably everyone has noticed, this is quite similar to what you
can do using network namespaces. The main reason for not using
network namespaces is that its an all-or-nothing approach, you can't
virtualize just connection tracking. Beside the difficulties in
managing different namespaces from f.i. an IKE or PPP daemon running
in the initial namespace, network namespaces have a quite large
overhead, especially when used with a large conntrack table.
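This is a very old article, and I wonder how I never noticed it before. Look at the diagram in the middle: isn't it the same scenario as in my article "Problems and solutions when testing an OpenVPN encrypted tunnel on a single machine" (《单独一台机器测试OpenVPN加密隧道的问题和解决》)? It is damn near identical, yet it solves a different problem!
Part of the reason for introducing conntrack zones is that the xmit function of some veth-like drivers does not clear the conntrack structure attached to the skb. After all, the NIC driver and conntrack are two independent kernel modules with no need to cooperate, so do not expect every xmit function to clear the extra structures attached to an skb. Here the zone acts as a go-between, and the grumble behind it is: I cannot guarantee that you clear the conntrack, but I can replace it myself.
Previously, conntrack matched on the tuple alone: each net namespace maintains a single conntrack table whose lookup key is the tuple. If an skb already has a conntrack attached at one end of a veth pair, and the veth driver does not clear it, then at the other end the skb keeps that conntrack. Apart from notrack, the administrator has no way to intervene; everything is maintained automatically inside the conntrack module. With conntrack zones, you can use the raw table to attach to an skb a conntrack template whose zone id has a specific value. The template's zone id tells the subsequent conntrack lookup which zone to use, and that is all. In the current mainline implementation there is still a single table, just with one extra key. In my own implementation, it is split into multiple tables.
With the extra zone key, the administrator can intervene in an skb's conntrack from the outside. Setting the implementation aside, you can think of it this way: in the iptables raw table you set a zone id for a packet, and the conntrack associated with that skb is then looked up in that zone. As for the implementation, you can treat the zone as just one more key, or you can use it as a table index and introduce multiple tables (why not? Is memory a problem? Isn't there already the touch highuser approach?).
The erhu still won't make a sound, and the questions keep piling up: track or notrack?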
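To make the "one table, one extra key" idea concrete, here is a toy Python model. It is only a sketch, not kernel code: `ConnTuple`, `ZonedTable` and `lookup_or_create` are made-up names, and the "entry" is a plain dict standing in for a conntrack structure. It shows that the same tuple seen in two zones yields two separate entries in one shared table.

```python
from typing import Dict, NamedTuple, Tuple


class ConnTuple(NamedTuple):
    """Simplified connection tuple (hypothetical, for illustration only)."""
    src: str
    sport: int
    dst: str
    dport: int
    proto: str


class ZonedTable:
    """Toy model: a single table whose lookup key is (zone, tuple)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, ConnTuple], dict] = {}

    def lookup_or_create(self, zone: int, tup: ConnTuple) -> dict:
        # The zone is just one more component of the hash key.
        key = (zone, tup)
        entry = self._entries.get(key)
        if entry is None:
            entry = {"zone": zone, "tuple": tup, "packets": 0}
            self._entries[key] = entry
        entry["packets"] += 1
        return entry

    def __len__(self) -> int:
        return len(self._entries)


table = ZonedTable()
t = ConnTuple("192.168.1.10", 40000, "10.0.0.1", 80, "tcp")

a = table.lookup_or_create(1, t)  # first packet, zone 1
b = table.lookup_or_create(2, t)  # same tuple, different zone
c = table.lookup_or_create(1, t)  # second packet, zone 1 again

print(a is b, a is c, len(table))  # False True 2
```

On a real system the zone is assigned from the raw table with the CT target, for example something like `iptables -t raw -A PREROUTING -i veth0 -j CT --zone 1`; the interface name and zone number here are placeholders.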
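The other option mentioned above, using the zone as a table index, might look roughly like the following toy Python sketch. This is an illustration of the idea only, not my actual kernel patch; `PerZoneTables`, `lookup_or_create` and `table_sizes` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Tuple

# A bare 5-tuple: (src, sport, dst, dport, proto)
FiveTuple = Tuple[str, int, str, int, str]


class PerZoneTables:
    """Toy model: the zone id selects a table, the tuple is the key inside it."""

    def __init__(self) -> None:
        # zone id -> {tuple: entry}; a table appears on first use of a zone
        self._tables: Dict[int, Dict[FiveTuple, dict]] = defaultdict(dict)

    def lookup_or_create(self, zone: int, tup: FiveTuple) -> dict:
        table = self._tables[zone]  # step 1: pick the table by zone
        entry = table.get(tup)      # step 2: ordinary tuple lookup
        if entry is None:
            entry = {"zone": zone, "tuple": tup}
            table[tup] = entry
        return entry

    def table_sizes(self) -> Dict[int, int]:
        return {zone: len(t) for zone, t in self._tables.items()}


tables = PerZoneTables()
same = ("192.168.1.10", 40000, "10.0.0.1", 80, "tcp")
other = ("192.168.1.11", 40001, "10.0.0.1", 80, "tcp")

x = tables.lookup_or_create(1, same)
y = tables.lookup_or_create(2, same)
tables.lookup_or_create(1, other)

print(x is y)                # False
print(tables.table_sizes())  # {1: 2, 2: 1}
```

Seen from outside, both variants behave the same: the same tuple in different zones maps to different entries. The difference is that each per-zone table only holds its own zone's connections, which is where the lookup-efficiency argument comes from, at the cost of extra memory for the additional tables.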