VXLAN Deep Dive

http://www.definethecloud.net/vxlan-deep-dive/

I’ve been spending my free time digging into network virtualization and network overlays. This is part 1 of a 2 part series, part 2 can be found here: http://www.definethecloud.net/vxlan-deep-divepart-2. By far the most popular virtualization technique in the data center is VXLAN. This has as much to do with Cisco and VMware backing the technology as the tech itself. That being said, VXLAN is targeted specifically at the data center and is one of several similar solutions, such as NVGRE and STT. VXLAN’s goal is to allow dynamic, large-scale, isolated virtual L2 networks to be created for virtualized and multi-tenant environments. It does this by encapsulating frames in VXLAN packets. The standard for VXLAN is under the scope of the IETF NVO3 working group.

The VXLAN encapsulation method is IP based and provides for a virtual L2 network. With VXLAN the full Ethernet frame (with the exception of the Frame Check Sequence: FCS) is carried as the payload of a UDP packet. VXLAN uses a 24-bit VXLAN Network Identifier (VNI), carried in an 8-byte VXLAN header, to identify virtual networks. This 24-bit identifier provides for up to 16 million virtual L2 networks.
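To make the header concrete, here is a minimal Python sketch of the 8-byte VXLAN header defined in RFC 7348: a flags byte with the I bit set, 24 reserved bits, the 24-bit VNI, and a final reserved byte. This is purely illustrative; a real VTEP builds this in hardware or in the kernel datapath.

    import struct

    def vxlan_header(vni: int) -> bytes:
        """Build the 8-byte VXLAN header: flags (I bit set), 24 reserved
        bits, 24-bit VNI, 8 reserved bits (RFC 7348 layout)."""
        flags = 0x08 << 24                  # I flag set; remaining bits reserved as zero
        return struct.pack("!II", flags, vni << 8)

    # A 24-bit VNI yields 2**24 (~16.7 million) possible virtual L2 networks.
    assert len(vxlan_header(1001)) == 8

The full packet on the wire is then outer Ethernet / outer IP / UDP / this header / the original frame (minus its FCS).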

Frame encapsulation is done by an entity known as a VXLAN Tunnel Endpoint (VTEP.) A VTEP has two logical interfaces: an uplink and a downlink. The uplink is responsible for receiving VXLAN frames and acts as a tunnel endpoint with an IP address used for routing VXLAN encapsulated frames. These IP addresses are infrastructure addresses and are separate from the tenant IP addressing for the nodes using the VXLAN fabric. VTEP functionality can be implemented in software, such as a virtual switch, or in the form of a physical switch.

VXLAN frames are sent to the IP address assigned to the destination VTEP; this IP is placed in the Outer IP DA. The IP of the VTEP sending the frame resides in the Outer IP SA. Packets received on the uplink are mapped from the VXLAN ID to a VLAN and the Ethernet frame payload is sent as an 802.1Q Ethernet frame on the downlink. During this process the inner MAC SA and VXLAN ID are learned in a local table. Packets received on the downlink are mapped to a VXLAN ID using the VLAN of the frame. A lookup is then performed within the VTEP L2 table using the VXLAN ID and destination MAC; this lookup provides the IP address of the destination VTEP. The frame is then encapsulated and sent out the uplink interface.
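A hedged sketch of that learning/lookup logic, with hypothetical names (VtepL2Table, learn, lookup) that are mine rather than from any particular VXLAN implementation:

    # Sketch of the VTEP L2 table described above: it maps
    # (VXLAN ID, inner MAC) to the IP address of the remote VTEP.

    class VtepL2Table:
        def __init__(self):
            self.entries = {}              # (vni, mac) -> remote VTEP IP

        def learn(self, vni, inner_src_mac, remote_vtep_ip):
            """Uplink direction: record which VTEP the inner source MAC sits behind."""
            self.entries[(vni, inner_src_mac)] = remote_vtep_ip

        def lookup(self, vni, inner_dst_mac):
            """Downlink direction: return the destination VTEP IP, or None on a miss."""
            return self.entries.get((vni, inner_dst_mac))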

As an example, a frame entering the downlink on VLAN 100 with a destination MAC of 11:11:11:11:11:11 will be encapsulated in a VXLAN packet with an outer destination address of 10.1.1.1 (the IP of the destination VTEP). The outer source address will be the IP of the encapsulating VTEP and the VXLAN ID will be 1001.
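Walking that example through the sketch above (the VLAN-to-VNI mapping and the pre-learned table entry are assumptions chosen to match the example values):

    table = VtepL2Table()
    vlan_to_vni = {100: 1001}                          # downlink VLAN 100 maps to VNI 1001

    # Entry previously learned from traffic that arrived on the uplink.
    table.learn(1001, "11:11:11:11:11:11", "10.1.1.1")

    # A frame enters the downlink on VLAN 100 destined to 11:11:11:11:11:11.
    vni = vlan_to_vni[100]
    outer_dst = table.lookup(vni, "11:11:11:11:11:11")
    print(vni, outer_dst)                              # -> 1001 10.1.1.1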

In a traditional L2 switch a behavior known as flood and learn is used for unknown destinations (i.e. a MAC not stored in the MAC table). This means that if there is a miss when looking up the MAC, the frame is flooded out all ports except the one on which it was received. When a response is sent the MAC is learned and written to the table. The next frame for the same MAC will not incur a miss because the table will reflect the port it exists on. VXLAN preserves this behavior over an IP network using IP multicast groups.

Each VXLAN ID has an assigned IP multicast group to use for traffic flooding (the same multicast group can be shared across VXLAN IDs.)  When a frame is received on the downlink bound for an unknown destination it is encapsulated using the IP of the assigned multicast group as the Outer DA; it’s then sent out the uplink.  Any VTEP with nodes on that VXLAN ID will have joined the multicast group and therefore receive the frame.  This maintains the traditional Ethernet flood and learn behavior.
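Continuing the table sketch, the flood decision amounts to choosing the outer destination IP: the learned VTEP address on a hit, the segment’s multicast group on a miss. The group addresses below are hypothetical.

    # Per-VNI multicast group assignment (addresses are made up; note that
    # two VNIs may share one group).
    vni_to_mcast_group = {1001: "239.1.1.1", 1002: "239.1.1.1"}

    def outer_destination(table: VtepL2Table, vni: int, inner_dst_mac: str) -> str:
        """Unicast to the learned VTEP on a table hit; flood to the VNI's
        multicast group on a miss, preserving flood-and-learn semantics."""
        remote_vtep = table.lookup(vni, inner_dst_mac)
        return remote_vtep if remote_vtep is not None else vni_to_mcast_group[vni]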

VTEPs are designed to be implemented as a logical device on an L2 switch. The L2 switch connects to the VTEP via a logical 802.1Q VLAN trunk. This trunk contains a VXLAN infrastructure VLAN in addition to the production VLANs. The infrastructure VLAN is used to carry VXLAN encapsulated traffic to the VXLAN fabric. The only member interfaces of this VLAN will be the VTEP’s logical connection to the bridge itself and the uplink to the VXLAN fabric. This interface is the ‘uplink’ described above, while the logical 802.1Q trunk is the downlink.

Summary

VXLAN is a network overlay technology designed for data center networks. It provides massively increased scalability over VLAN IDs alone while allowing for L2 adjacency over L3 networks. The VXLAN VTEP can be implemented in both virtual and physical switches, allowing the virtual network to map to physical resources and network services. VXLAN currently has both wide support and hardware adoption in switching ASICs and hardware NICs, as well as in virtualization software.

In part one of this post I covered the basic theory of operations and functionality of VXLAN (http://www.definethecloud.net/vxlan-deep-dive.)  This post will dive deeper into how VXLAN operates on the network.

Let’s start with the basic concept that VXLAN is an encapsulation technique: the Ethernet frame sent by a VXLAN connected device is encapsulated in an IP/UDP packet. The most important consequence is that the result can be carried by any IP capable device. The only place added intelligence is required is at the network bridges known as VXLAN Tunnel End-Points (VTEPs), which perform the encapsulation/de-encapsulation. This is not to say that benefit can’t be gained by adding VXLAN functionality elsewhere, just that it’s not required.
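Because the outer transport is ordinary UDP (IANA has assigned port 4789 to VXLAN), the encapsulation can even be emulated with a plain UDP socket. The sketch below is illustrative only; production VTEPs encapsulate in switch hardware or in the kernel datapath, not per-frame in user space.

    import socket
    import struct

    VXLAN_PORT = 4789                       # IANA-assigned UDP port for VXLAN

    def encapsulate_and_send(inner_frame: bytes, vni: int, remote_vtep_ip: str) -> None:
        """Wrap an Ethernet frame (minus FCS) in a VXLAN/UDP packet and send
        it to the remote VTEP; the IP network in between needs no VXLAN awareness."""
        header = struct.pack("!II", 0x08 << 24, vni << 8)   # flags + 24-bit VNI
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(header + inner_frame, (remote_vtep_ip, VXLAN_PORT))
        sock.close()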

Providing Ethernet Functionality on IP Networks:

As discussed in Part 1, the source and destination IP addresses used for VXLAN are those of the source and destination VTEPs. This means that the VTEP must know the destination VTEP in order to encapsulate the frame. One method for this would be a centralized controller/database. That being said, VXLAN is implemented in a decentralized fashion, not requiring a controller. There are advantages and drawbacks to this. While utilizing a centralized controller would provide methods for address learning and sharing, it would also potentially increase latency, require large software driven mapping tables and add network management points. We will dig deeper into the current decentralized VXLAN deployment model.

VXLAN maintains backward compatibility with traditional Ethernet and therefore must maintain some key Ethernet capabilities. One of these is flooding (broadcast) and ‘flood and learn’ behavior. I cover some of this behavior here (http://www.definethecloud.net/data-center-101-local-area-network-switching), but the summary is that when a switch receives a frame for an unknown destination (a MAC not in its table) it will flood the frame to all ports except the one on which it was received. Eventually the frame will reach the intended device and a reply will be sent, which allows the switch to learn the MAC’s location. When switches see source MACs that are not in their table they will ‘learn’ or add them.

VXLAN encapsulates over IP, and IP networks are typically designed for unicast traffic (one-to-one), so there is no inherent flood capability. In order to mimic flood and learn on an IP network VXLAN uses IP multicast. IP multicast provides a method for distributing a packet to a group. This use of IP multicast can be a contentious point within VXLAN discussions because most networks aren’t designed for IP multicast, IP multicast support can be limited, and multicast itself can be complex depending on the implementation.
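Joining a flood group is just an IGMP membership, which any multicast-capable IP network provides. A minimal sketch using standard socket options (the group address passed in would come from the per-VNI assignment discussed below):

    import socket
    import struct

    def join_flood_group(group_ip: str, local_ip: str) -> socket.socket:
        """Join the IP multicast group used to flood a VXLAN segment so this
        VTEP receives unknown-unicast/broadcast traffic for that segment."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", 4789))                               # VXLAN UDP port
        mreq = struct.pack("4s4s", socket.inet_aton(group_ip),
                           socket.inet_aton(local_ip))      # group + local interface
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        return sock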

Within VXLAN each VXLAN segment ID is subscribed to a multicast group. Multiple VXLAN segments can subscribe to the same group; this minimizes configuration but increases unneeded network traffic. When a device attaches to a VXLAN segment on a VTEP where that segment was not previously in use, the VTEP joins the IP multicast group assigned to that segment and starts receiving messages.
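The membership bookkeeping this implies can be sketched as a simple per-VNI reference count; the join happens on the first local attach, and (an assumption on my part, not spelled out above) the VTEP can leave the group again once the last local device detaches:

    class FloodGroupMembership:
        """Track how many local ports use each VNI; join the multicast group
        on the first attach and leave it after the last detach."""

        def __init__(self, join, leave):
            self.join, self.leave = join, leave   # callbacks, e.g. IGMP join/leave
            self.port_count = {}                  # vni -> number of attached ports

        def attach(self, vni):
            self.port_count[vni] = self.port_count.get(vni, 0) + 1
            if self.port_count[vni] == 1:
                self.join(vni)

        def detach(self, vni):
            self.port_count[vni] -= 1
            if self.port_count[vni] == 0:
                del self.port_count[vni]
                self.leave(vni)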

In normal operation the destination MAC is known and the frame is encapsulated in IP using the source and destination VTEP addresses. The frame is encapsulated by the source VTEP, de-encapsulated at the destination VTEP and forwarded based on bridging rules from that point. In this operation only the destination VTEP will receive the frame (with the exception of any devices in the physical path, such as a core IP switch.)

When the destination MAC is unknown (the MAC-to-VTEP mapping does not exist in the table), the source VTEP instead encapsulates the original frame in an IP multicast packet with the destination IP of the associated multicast group. This frame will be delivered to all VTEPs participating in the group. Ideally the only VTEPs participating in the group are those with connected devices attached to that VXLAN segment, but because multiple VXLAN segments can use the same IP multicast group this is not always the case. The VTEP with the connected device will de-encapsulate and forward normally, adding the mapping for the source VTEP if required. Any other VTEP that receives the packet can learn the source VTEP/MAC mapping if required and then discard it. This process is the same for other traditionally flooded frames such as ARP requests.
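The receive side of that learning step, continuing the earlier table sketch (the header offsets assume the 8-byte layout shown in Part 1; the helper name is hypothetical):

    def handle_uplink_packet(table: VtepL2Table, outer_src_ip: str, vxlan_payload: bytes):
        """De-encapsulate a packet received on the uplink: extract the VNI,
        learn (VNI, inner source MAC) -> source VTEP, and return the inner
        frame for normal bridging."""
        vni = int.from_bytes(vxlan_payload[4:7], "big")     # 24-bit VNI field
        inner_frame = vxlan_payload[8:]                     # original Ethernet frame
        inner_src_mac = inner_frame[6:12].hex(":")          # source MAC is bytes 6-11
        table.learn(vni, inner_src_mac, outer_src_ip)
        return vni, inner_frame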

As discussed in Part 1, VTEP functionality can be placed in a traditional Ethernet bridge. This is done by placing a logical VTEP construct within the bridge hardware/software. With this in place VXLANs can bridge between virtual and physical devices. This is necessary for physical server connectivity, as well as to add network services provided by physical appliances. Putting it all together, physical servers can communicate with virtual servers in a VXLAN environment: the links between VTEPs and the rest of the fabric are traditional IP links terminating on a standard L3 switch or router, and all traffic on those links is encapsulated as IP/UDP and broken out by the VTEPs.

Summary:

VXLAN provides backward compatibility with traditional VLANs by mimicking broadcast and multicast behavior through IP multicast groups.  This functionality provides for decentralized learning by the VTEPs and negates the need for a VXLAN controller.
