Linux Kernel Network Stack NAT Performance Optimization: FAST NAT

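Apologies to all readers: this article is written in English. Please bear with any inconvenience this causes.

The code was written at a commercial company and ships in a commercial product, so it cannot be open-sourced. Apologies again.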

This presentation highlights our efforts to optimize the Linux TCP/IP stack for networking in an OpenStack environment, as deployed at our industrial customers. Our primary goal is to provide a high-quality, high-performance TCP/IP stack.

To achieve this, we had to identify the performance bottlenecks in the Linux TCP/IP stack for networking in OpenStack. We have done a lot of Linux TCP/IP stack performance tuning, covering the NIC, CPU cache hit rate, spin locks, memory allocation, and more. However, our measurements showed that conntrack NAT uses too much CPU, for instance in the ipt_do_table function. Linux conntrack is very good, but it is heavyweight, and many of its functions are not used. Instead, we implemented FAST NAT in the Linux TCP/IP stack. We will present our efforts to reduce these performance costs.

First, instead of taking one lock on the global connection table, FAST NAT uses a spin lock per entry, which greatly reduces CPU waiting time, and user policies are stored in a hash table rather than a list. The connection table and the user policies are kept per NUMA node, which avoids cross-node accesses over QPI that waste CPU time and add latency.
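
Since the original code is not public, the following is only a user-space sketch of the idea described above: one connection table per NUMA node, with a spin lock on each entry rather than on the whole table. All names (`fn_tuple`, `fn_entry`, `fn_insert`, `fn_lookup`), the table size, and the hash function are hypothetical, a C11 `atomic_flag` stands in for the kernel's `spinlock_t`, and collisions are simply rejected instead of chained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FN_SLOTS 1024u  /* hypothetical table size; must be a power of two */
#define FN_NODES 2u     /* hypothetical number of NUMA nodes */

/* Flow key: the 5-tuple of a TCP or UDP packet. */
struct fn_tuple {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    uint8_t  proto;
};

struct fn_entry {
    atomic_flag     lock;      /* per-entry spin lock */
    int             in_use;
    struct fn_tuple key;
    uint32_t        nat_addr;  /* translated address */
    uint16_t        nat_port;  /* translated port */
};

/* One table per NUMA node, so a CPU only touches node-local memory. */
static struct fn_entry fn_tables[FN_NODES][FN_SLOTS];

static void fn_init(void)
{
    for (unsigned n = 0; n < FN_NODES; n++)
        for (unsigned i = 0; i < FN_SLOTS; i++) {
            atomic_flag_clear(&fn_tables[n][i].lock);
            fn_tables[n][i].in_use = 0;
        }
}

/* Illustrative multiplicative hash; not the hash used by the authors. */
static uint32_t fn_hash(const struct fn_tuple *t)
{
    uint32_t h = t->saddr * 2654435761u;
    h ^= t->daddr * 2246822519u;
    h ^= (((uint32_t)t->sport << 16) | t->dport) * 3266489917u;
    h ^= t->proto;
    h ^= h >> 15;
    return h & (FN_SLOTS - 1u);
}

static int fn_tuple_eq(const struct fn_tuple *a, const struct fn_tuple *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport &&
           a->proto == b->proto;
}

/* Spin on the lock of the one entry the tuple hashes to. */
static struct fn_entry *fn_lock_slot(unsigned node, const struct fn_tuple *t)
{
    assert(node < FN_NODES);
    struct fn_entry *e = &fn_tables[node][fn_hash(t)];
    while (atomic_flag_test_and_set_explicit(&e->lock, memory_order_acquire))
        ; /* busy-wait; other entries stay available to other CPUs */
    return e;
}

static void fn_unlock_slot(struct fn_entry *e)
{
    atomic_flag_clear_explicit(&e->lock, memory_order_release);
}

/* Returns 0 on success, -1 if the slot holds a different flow. */
static int fn_insert(unsigned node, const struct fn_tuple *t,
                     uint32_t nat_addr, uint16_t nat_port)
{
    struct fn_entry *e = fn_lock_slot(node, t);
    int ok = !e->in_use || fn_tuple_eq(&e->key, t);
    if (ok) {
        e->key = *t;
        e->nat_addr = nat_addr;
        e->nat_port = nat_port;
        e->in_use = 1;
    }
    fn_unlock_slot(e);
    return ok ? 0 : -1;
}

/* Returns 1 and fills the outputs on a hit, 0 on a miss. */
static int fn_lookup(unsigned node, const struct fn_tuple *t,
                     uint32_t *nat_addr, uint16_t *nat_port)
{
    struct fn_entry *e = fn_lock_slot(node, t);
    int hit = e->in_use && fn_tuple_eq(&e->key, t);
    if (hit) {
        *nat_addr = e->nat_addr;
        *nat_port = e->nat_port;
    }
    fn_unlock_slot(e);
    return hit;
}
```

Because each lock covers a single entry, two CPUs contend only when they touch the same flow, and a flow inserted on one node is deliberately invisible to lookups on another node.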

Second, FAST NAT does not track TCP state; it records only a tuple with the connection information needed for NAT forwarding. This removes many checks from the packet forwarding path. An entry in the connection table can be set to expire on either an absolute or a relative expiration time. A relative expiration time is extended by each forwarded packet.
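
The two expiration modes can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the struct layout, the names (`fn_conn`, `fn_conn_start`, `fn_conn_forward`), and the use of an abstract tick counter for time are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum fn_expire_mode { FN_EXPIRE_ABSOLUTE, FN_EXPIRE_RELATIVE };

/* A stateless NAT entry: only the tuple and its translation, no TCP state. */
struct fn_conn {
    uint32_t saddr, daddr;      /* original addresses */
    uint16_t sport, dport;      /* original ports */
    uint8_t  proto;             /* TCP or UDP */
    uint32_t nat_addr;          /* translated address */
    uint16_t nat_port;          /* translated port */
    enum fn_expire_mode mode;
    uint64_t timeout;           /* lifetime in ticks */
    uint64_t expires;           /* deadline in ticks */
};

/* Arm the entry: both modes start with the deadline now + timeout. */
static void fn_conn_start(struct fn_conn *c, enum fn_expire_mode mode,
                          uint64_t now, uint64_t timeout)
{
    assert(timeout > 0);
    c->mode = mode;
    c->timeout = timeout;
    c->expires = now + timeout;
}

static int fn_conn_expired(const struct fn_conn *c, uint64_t now)
{
    return now >= c->expires;
}

/*
 * Called for each packet that matches the entry.
 * Returns 1 if the packet may be forwarded, 0 if the entry has expired.
 * In relative mode every forwarded packet pushes the deadline out again;
 * in absolute mode the deadline never moves.
 */
static int fn_conn_forward(struct fn_conn *c, uint64_t now)
{
    if (fn_conn_expired(c, now))
        return 0;
    if (c->mode == FN_EXPIRE_RELATIVE)
        c->expires = now + c->timeout;
    return 1;
}
```

With a timeout of 30 ticks started at tick 100, a packet at tick 120 leaves an absolute entry expiring at 130 but moves a relative entry's deadline to 150, so only the relative entry still forwards a packet at tick 140.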

To reduce lock usage, the connection tables are not synchronized globally. As a result, a single TCP stream may end up with entries in more than one per-NUMA connection table. If we use an Intel ixgbe NIC with Flow Director in ATR mode, the incoming and outgoing directions of a stream get the same index across the multiple queues, and the limitation mentioned above disappears.

Limitations of FAST NAT: only TCP and UDP are supported.

Although some limitations exist, our work has paid off and resulted in a 15-20 percent improvement in packets per second (pps).

Time: 2024-12-29 15:52:09
