Network Basics for Cloud Computing

A network consists of various types of network devices. In a traditional IT system, nearly all network devices are physical devices with predictable traffic paths. For example, to enable hosts connected to two different switches to communicate with one another, the two switches must be connected using a network cable or an optical fiber. With cloud computing, many network devices are virtual entities running inside servers. These virtual devices communicate with each other not directly over a network cable but through entries in a logical forwarding table, which poses new challenges to network administrators. This chapter describes the physical and virtual devices used in cloud environments.

Network Architecture in Virtualization

Traffic on a Virtual Network

In a cloud or virtualization-based environment, traffic can be divided into north-south traffic and east-west traffic; a traditional IT system does not make such a distinction. A reference object is required to define north-south and east-west traffic. Generally, the router, whether physical or virtual, acts as the demarcation point: traffic that passes through a router is north-south traffic, while traffic that does not is east-west traffic. The following figure shows north-south and east-west traffic in a system where physical routers are deployed at the edge of an Internet Data Center (IDC). To the north, the routers are connected to an extranet, which can be the Internet or an enterprise-defined extranet. To the south, the routers are connected to business networks, such as the email system and the office system of the IDC. When the extranet accesses a business network of the IDC, north-south traffic is produced. When an employee VM in the IDC accesses a business network, east-west traffic is produced because the traffic does not pass through a router.
 
Traffic in a cloud environment


As cloud computing develops further, ever more computation is performed within the clusters of an IDC. A single request initiated by a user may trigger many computing operations between IDC business networks. If cloud desktops are adopted, the clients are also deployed in the IDC. In addition, virtualization features high availability (HA), which allows VMs to migrate across physical servers in the IDC. As a result, east-west traffic has increased from roughly 20% of total traffic to around 70% today. This shift has also produced a new network architecture. The old architecture consists of the core layer, the aggregation layer, and the access layer, while the new architecture is a large layer-2 network. The old three-layer architecture is not well suited to the data exchange patterns of a cloud environment, whereas the large layer-2 architecture handles massive east-west traffic with ease.

The VM is also a component of the network. VMs are almost always connected to the network using network bridges. To facilitate management and configuration, a virtual switch is used in most cases. The virtual switch is an advanced network bridge and will be described in detail later. A VM can have multiple virtual NICs, through which it can connect to one or more virtual switches.

Basic Network Concepts

Broadcast and Unicast

Broadcast and unicast are two types of network traffic. There is a third traffic type called multicast, which is not discussed in this document.

Broadcast, as its name implies, means "one speaks and all others listen". On a network, when two devices intend to communicate with each other for the first time, the initiator broadcasts packets to identify the receiver. These packets are flooded across the entire broadcast domain, and all network devices in the domain receive them. When a network device receives the packets, it checks their content. If the device finds that it is the intended receiver, it responds to the initiator with a unicast message; otherwise, it discards the packets.

Broadcast is not only used for starting communication but also for running many applications, including DHCP. Each service uses a specific broadcast address. For example, the broadcast address 192.168.1.255 is used for the IP address range 192.168.1.0/24, and a DHCP client uses the broadcast address 255.255.255.255 to search for DHCP servers at the initial stage.
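To see how a broadcast address relates to an address range, the following Linux iproute2 sketch assigns an address and lets the system derive the directed broadcast address (the interface name eth0 is an assumption for illustration):

    # "brd +" sets all host bits to 1, yielding 192.168.1.255 for this /24
    ip addr add 192.168.1.10/24 brd + dev eth0
    # the "brd" field in the output shows the derived broadcast address
    ip addr show dev eth0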

An analogy may help you understand this better. Years ago, each village was usually equipped with a broadcast station. When the village wanted to find somebody, an employee of the broadcast station would speak into the loudspeaker, and the employee's voice can be seen as one broadcast address. When the village wanted to publish important information, the village head would speak into the loudspeaker, and the head's voice can be seen as another broadcast address. Each type of event was broadcast by a different person, whose voice was unique.

Unicast differs from broadcast in that one party speaks while another listens. On a real network, unicast communication is performed between two network devices. If one only sends messages while the other only receives them, the process is half duplex. If both send and receive messages simultaneously, the process is full duplex.

On a network, data is usually transmitted in unicast mode. For example, to send an email, visit a web page, or play an online game, you need to establish a connection with the email server, web server, or game server. 

Broadcast packets usually contain only basic information, such as the source and destination addresses. Unicast packets, however, usually carry information vital to both ends. If broadcast packets flood a network with limited bandwidth, unicast packets will be blocked. In this case, users may find that web pages cannot be displayed, emails cannot be sent, and their game accounts go offline frequently. Broadcast packets therefore consume extensive bandwidth, and they also pose security risks: they are received by everyone, and if someone uses the information they carry for fraud or spoofing, information leakage or network breakdown may follow. Broadcast cannot simply be eliminated, however, because it is a necessary step for starting unicast communication. To mitigate the risks it imposes, network designers confine broadcast packets within a broadcast domain. Two questions then arise. First, can we make a broadcast domain smaller? A smaller broadcast domain relieves network congestion and mitigates security risks. Second, through what devices do broadcast domains communicate with each other? Next, we will consider the route and the default gateway.

Route and Default Gateway

Before mobile phones were widely used, landline telephones prevailed. When making a long-distance call, you needed to add an area code before the telephone number. To obtain an area code, you might consult the yellow pages or other sources. An area code acts like a route, directing phone calls to the corresponding area, and the yellow pages act like a route table containing routes. If communication between two broadcast domains is like a long-distance call, the way one broadcast domain locates another is like a routing process.

When many broadcast domains exist, the route table contains a large number of entries. Each time a device attempts to communicate with another device, it needs to search the route table for the route to the destination. This burdens the device where the route table is stored and reduces network communication efficiency. To address these issues, the default gateway is used. The default gateway works like a default route. An analogy may help: when you need to obtain a telephone number, you can dial the 114 directory service. The default gateway provides a function similar to 114, but with a difference. After a user dials 114 and obtains a telephone number, the user dials that number themselves. After a default gateway receives a communication request, it performs route forwarding for the initiator if its route table contains the destination address; if not, it returns a message telling the initiator that the destination address is unreachable.

The default gateway is a special routing mechanism. It is the last choice for route forwarding: the default gateway is used only when no more specific routing entry matches the destination.
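On a Linux host, the relationship between specific routes and the default route can be sketched with iproute2 (the gateway address and interface name are assumptions for illustration):

    # install a default route pointing at the gateway
    ip route add default via 192.168.1.1 dev eth0
    # list the route table; the "default" entry matches only when nothing more specific does
    ip route show
    # ask the kernel which route a given destination would actually use
    ip route get 8.8.8.8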

VLAN

A virtual local area network (VLAN) is the most common way to subdivide a physical local area network (LAN) into separate logical broadcast domains. Devices within the same VLAN can communicate with each other using their MAC addresses, while those in different VLANs cannot. Hence, broadcast packets propagate only within a VLAN, never across VLANs. For details, see the following figure.
VLAN functions

VLAN brings the following benefits:
  • Limits broadcast domains: Each VLAN is a broadcast domain. This saves bandwidth and improves network throughput.
  • Hardens LAN security: Packets from different VLANs are transmitted separately. Hosts in one VLAN cannot communicate directly with hosts in another VLAN; layer-3 forwarding based on IP addresses is required.
  • Improves network robustness: A fault in one VLAN does not affect hosts in other VLANs.
  • Flexibly defines virtual workgroups: Each VLAN can define a workgroup containing users from different geographical locations. This allows for flexible networking and easier maintenance.
A VLAN packet is a traditional Ethernet data frame with a four-byte 802.1Q tag inserted. This tag uniquely identifies a VLAN. The following figures compare an ordinary data frame with a VLAN-tagged one.

Format of an ordinary Ethernet frame

Format of a data frame with a VLAN tag

Each 802.1Q-capable switch sends data packets carrying a VLAN ID, which identifies the VLAN to which the frame belongs. Ethernet frames can be divided into the following two types depending on whether they are tagged:
  • Tagged frame: Ethernet frame with a 4-byte 802.1Q tag.
  • Untagged frame: original Ethernet frame without a 4-byte 802.1Q tag.
Although both the OS and the switch are able to add a VLAN tag to data frames, VLAN tags are usually added or removed by a switch. Accordingly, a VLAN has the following types of links:
  • Access link: connects a host to a switch. Generally, the host does not need to know which VLAN it belongs to and the host hardware often cannot identify VLAN-tagged frames. Therefore, the host can only receive or send untagged frames.
  • Trunk link: connects a switch to another switch or to a router. Data of different VLANs is transmitted along a trunk link. The two ends of a trunk link must be able to distinguish frames with different VLAN tags. Therefore, only tagged frames are transmitted along trunk links.
Since the 802.1Q standard defines the VLAN frame format, only certain interfaces of a device can identify VLAN frames. Based on this ability, interfaces are divided into two types:
  • Access interface: The access interface resides on the switch and connects to a host interface; that is, it can connect only to an access link. Only packets carrying the default VLAN ID of this interface can pass through, and Ethernet frames sent through an access interface carry no tag.
  • Trunk interface: A trunk interface is used to connect a switch to other switches and can connect only a trunk link. A trunk interface permits the tagged frames of multiple VLANs to pass through.
Interfaces of each type can be assigned a default VLAN ID, called the PVID (Port Default VLAN ID). The meaning of the default VLAN varies with the interface type. On almost all switches, the default VLAN ID is 1.

The following summarizes the methods of processing data frames on various interfaces.
Methods of processing data frames
  • Access interface: an arriving untagged frame is tagged with the PVID; an arriving tagged frame is accepted only if its VLAN ID equals the PVID, otherwise it is discarded; an outgoing frame has its tag removed before being sent.
  • Trunk interface: an arriving untagged frame is tagged with the PVID; an arriving tagged frame is accepted only if its VLAN ID is permitted on the interface, otherwise it is discarded; an outgoing frame has its tag removed if its VLAN ID equals the PVID, and keeps its tag otherwise.

The following describes how a Huawei switch processes data frames.
Configuration 1
In this configuration, the interface is an access interface and is configured with default VLAN 10. When a data frame with VLAN 10 arrives at the interface, its tag is removed and the untagged data frame is forwarded. When an untagged data frame arrives at the interface, it is tagged with VLAN 10 and is then forwarded. When a data frame with a tag different from VLAN 10 arrives at the interface, it is discarded.
Configuration 2
In this configuration, the interface is a trunk interface and is configured with default VLAN 10. This interface allows packets with VLAN 16 or VLAN 17 to pass through. When an untagged data frame arrives at the interface, it is tagged with VLAN 10. When a data frame with VLAN 10 arrives at the interface, its tag is removed and the untagged data frame is forwarded. When a data frame with VLAN 16 or VLAN 17 arrives at the interface, it is directly forwarded. When any other data frame arrives at the interface, it is discarded.
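A configuration sketch along these lines, written in Huawei VRP-style syntax (the interface names are assumptions, and exact commands vary by switch model and software version), might look as follows:

    # Configuration 1: access interface with default VLAN 10
    interface GigabitEthernet0/0/1
     port link-type access
     port default vlan 10
    # Configuration 2: trunk interface with PVID 10, permitting VLANs 16 and 17
    interface GigabitEthernet0/0/2
     port link-type trunk
     port trunk pvid vlan 10
     port trunk allow-pass vlan 16 17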

VLANs can be divided based on switch ports, MAC addresses, subnets, or policies. A VLAN can be further divided into multiple VLANs, which is an advanced VLAN function. For details, see Huawei data communication certification courses.

Physical Networks in Virtualization

In virtualization, applications run on VMs, and VMs run on physical servers. Before connecting VMs to a network, the physical servers must first be connected to the network. To do so, the following devices are required: routers, layer-3 switches, layer-2 switches, and server NICs. Before discussing these physical devices, we will review the four-layer TCP/IP model.

TCP/IP is the most widely used protocol stack on the Internet. TCP/IP divides the entire communication network into four layers, namely the application layer, transport layer, network layer, and link layer, as shown in the following figure.
TCP/IP protocol stack

Routing works at the network layer (layer 3 of the OSI model), and VLANs work at the data link layer (layer 2). Therefore, if a device has the routing function and can look up a routing table, it is considered a layer-3 device. If a device supports only VLAN configuration, it is considered a layer-2 device. A hub is a physical-layer (layer 1) device, although it appears to function like a switch; the reason is that a hub acts only as a splitter and cannot divide VLANs. Routers and layer-3 switches work at the network layer, layer-2 switches work at the data link layer, and server NICs, network cables, and optical fibers work at the physical layer.

Both routers and layer-3 switches work at the network layer. Layer-3 switches have the routing function, and many routers configured with an additional switching board can divide VLANs. The question, then, is whether a layer-3 switch and a router can substitute for each other. The answer is no.

First, a switch and a router provide different core functions. A switch uses a dedicated chip (an ASIC) for high-speed data exchange, while a router maintains a route table for route addressing across IP address ranges and inherently separates broadcast domains. Even if a router gains the switching function or a switch gains the routing function, its key function remains unchanged, with the new function serving only as a supplement.

Second, switches are mainly used in LANs, while routers are mainly used in WANs. A LAN features frequent data exchange, a single network interface type, and a large number of interfaces. A switch meets these requirements: it provides fast data forwarding, network cable (RJ45) and optical fiber interfaces, and a large number of ports. A WAN features a variety of network types and interface types. A router provides a powerful routing function, which can be used not only between LANs running one protocol but also between LANs and WANs running multiple protocols. On a WAN, a router offers the following advantages: selecting the optimal route, balancing load, backing up links, and exchanging route information with other networks.

Third, the performance of a layer-3 switch differs from that of a router. Technically speaking, the two differ distinctly in how they switch data packets. A router uses a microprocessor-based software engine to forward data packets, while a layer-3 switch uses hardware. After a layer-3 switch forwards the first packet of a data flow, it generates a mapping between MAC addresses and IP addresses. When subsequent packets of the same flow arrive, the layer-3 switch forwards them without routing again. This avoids the delay caused by route selection and improves forwarding efficiency.

In other words, the layer-3 switch performs a route lookup only for the first packet of a flow and relies on caching for the rest, lowering costs dramatically and enabling fast forwarding. A router, by contrast, uses longest-prefix matching to forward every packet. This complex process is usually implemented in software, with lower forwarding efficiency. Therefore, in terms of performance, the layer-3 switch outperforms the router and suits LANs with frequent data exchange. With its powerful routing function but lower packet forwarding efficiency, the router suits connections between different types of networks without frequent data exchange, such as the connection between a LAN and the Internet. If a router, especially a high-end router, is used inside a LAN, its powerful routing function is wasted, it cannot meet the communication requirements of the LAN, and it adversely affects subnet communication.

In cloud computing and virtualization, routers are usually deployed at the egress of an enterprise or institution to connect the intranet to the Internet. When an intranet user intends to access the Internet, the router will perform route forwarding and network address translation (NAT). When an Internet user intends to access the intranet, the traffic will also pass through the router.

The router allows the intranet to communicate with the Internet, which addresses one issue. The other issue is how to plan the intranet. Our purpose is to connect physical servers to the network. In the equipment room of a data center, servers are placed in racks. For better cooling, racks are arranged in rows. Accordingly, two popular ways of connecting servers to a network are available: Top of Rack (ToR) and End of Row (EoR). ToR and EoR are terms describing cabling in a rack. ToR means placing the switch that connects servers to the network at the top of the rack. If servers are densely deployed in racks and carry heavy traffic, one ToR switch is placed at the top of each rack. If servers are loosely deployed and carry average traffic, one ToR switch may be shared by multiple racks. When selecting a ToR switch, choose a GE or 10GE model based on your needs. The ToR switch connects the network ports of physical servers to the aggregation switch or the core switch, thereby connecting the physical servers to the network. The following figure shows the ToR cabling mode.
ToR cabling mode

EoR differs from ToR in that an independent rack in a row of racks is used to house the switch that connects servers to the network. This switch is called an EoR switch. As its name implies, End of Row (EoR) suggests that the switch is placed at the end of the row. In many cases, however, the switch is placed in the middle of the row to reduce the length of the cables required between the servers and the switch. Network cable distribution frames and optical fiber distribution frames are prepared in advance for the other racks. Network ports of servers are connected to the distribution frames, which are in turn connected to the EoR switch. The following figure shows the details.
EoR cabling mode

ToR and EoR each suit different scenarios and have their own limitations.

EoR limitations: Cables between the server racks and the network rack must be laid out in advance. Therefore, servers in the equipment room must be planned and deployed elaborately. For example, you have to determine the number of servers, the number of network ports on each server, and the types of network ports required. The farther the network rack is from a server rack, the longer the cables required in the equipment room. In addition, to keep the equipment room tidy, cables must be bundled in order. When the quantity of required cables differs from the quantity of existing cables, or when a cable fault occurs, the cables must be laid out again. As a result, cable management and maintenance workloads are heavy, and flexibility is low.

ToR limitations: Each server rack has a fixed power budget, so only a limited number of servers can be deployed per rack, typically requiring only 12 to 16 access ports. A switch, however, usually has at least 24 ports. As a result, the port usage of access switches in the rack is low. If multiple server racks share one or two access switches, the port usage improves, but this effectively becomes small-scale EoR cabling, which again increases cable management and maintenance work.

When drawing an overall network cabling plan, work out an optimal solution by considering the respective advantages and limitations of EoR and ToR.

After servers are connected to the network, traffic is divided by type into service traffic, management traffic, and storage traffic. Service traffic and storage traffic are vital to users. When users access a target service, service traffic is produced. If service data is stored on a dedicated storage device instead of a local server, storage traffic is produced when a server accesses the storage device. When users manage servers, virtualization devices, and storage devices, management traffic is produced. Currently, nearly every physical device is equipped with an independent management interface. If management traffic and service traffic are carried on different physical lines and interfaces, this is out-of-band management. If management traffic and service traffic are carried on the same physical channel, this is in-band management.

In a cloud computing data center, a high-end layer-3 switch is used as the core of the entire network during network design. The default gateways of all traffic-related network segments are configured on the switch. In this way, all inter-broadcast-domain access will pass through the switch. The reasons for this measure are as follows:
  • The high-end layer-3 switch has a high forwarding performance and can meet the requirements for forwarding traffic on the entire network.
  • The high-end layer-3 switch has a modular structure, with excellent fault tolerance and high scalability. It includes such necessary modules as the power supply and the fan, as well as a core component, the engine board. These are all deployed in 1+1 hot backup, improving device availability. The switching boards are hot-swappable, enabling users to scale out network resources at any time.
  • The high-end layer-3 switch provides boards with various interface densities, such as 10GE, 40GE, and 100GE. The high-end layer-3 switch supports large-capacity, high-density server access and ToR aggregation.
  • Apart from such basic functions as routing and switching, the high-end layer-3 switch supports other features that suit cloud computing, such as large layer 2, stacking, and virtualization.
All traffic is first transmitted to a layer-2 access switch before being transmitted to the core switch. The access modes are EoR and ToR, which we discussed earlier. Based on the types of incoming traffic, access switches are divided into management, storage, and service switches. For a data center with a huge traffic volume, it is recommended that a dedicated physical switch carry each type of traffic; that is, the switch for each traffic type is an independent device. For a data center with an average traffic volume, the same physical switch can handle several different types of traffic, with the traffic flows logically isolated from each other using VLANs. The following figure shows the case where all traffic is both physically and logically isolated.
Access layer switch

A physical server uses its own physical NICs to connect to the network, and all VM traffic enters the network through these ports. The physical NIC involves a key concept: port (link) aggregation. Port aggregation bonds multiple physical Ethernet links into one logical link to increase link bandwidth. In addition, the bonded links dynamically back each other up, greatly improving link reliability. As networks grow, users place increasingly high requirements on the bandwidth and reliability of backbone links. Originally, to increase bandwidth, users replaced old devices with high-speed ones, a solution that is costly and inflexible. The Link Aggregation Control Protocol (LACP) allows multiple physical ports to be bonded into one logical port to increase link bandwidth without upgrading hardware. In addition, the link backup mechanism of LACP provides higher link transmission reliability.

LACP has the following advantages:
  • Increased bandwidth: The maximum bandwidth of a link aggregation port reaches the total bandwidth of all member ports.
  • Higher reliability: If an active link fails, traffic is switched to another available member link, improving the reliability of the link aggregation port.
  • Load balancing: In a link aggregation group, load can be shared among the active member links.

Link aggregation can work in manual load balancing mode and LACP mode.

In manual load balancing mode, users must manually create a link aggregation group and add member interfaces to it; LACP is not involved. In this mode, all active links forward data in load-sharing mode. If an active link fails, the remaining active links evenly share the traffic. If a high link bandwidth between two directly connected devices is required but one of the devices does not support LACP, you can use the manual load balancing mode.

The manual load balancing mode cannot detect faults beyond physical link failure, such as link-layer faults and incorrect link connections. To improve fault tolerance and provide backup for high reliability of member links, LACP is introduced. The LACP mode is a link aggregation mode that uses LACP: the protocol provides a standard negotiation mechanism so that devices can automatically form an aggregated link according to their own configuration and start the link to send and receive data. After the aggregated link is set up, LACP maintains the link status and performs dynamic link aggregation and de-aggregation.
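On a Linux server, an LACP-mode bond can be sketched with iproute2 (the member names eth0 and eth1 are assumptions; 802.3ad is the kernel's name for the LACP mode):

    # create a bond interface running LACP (IEEE 802.3ad)
    ip link add bond0 type bond mode 802.3ad
    # member ports must be down before they can join the bond
    ip link set eth0 down && ip link set eth0 master bond0
    ip link set eth1 down && ip link set eth1 master bond0
    ip link set bond0 up
    # inspect the negotiation state of the aggregation
    cat /proc/net/bonding/bond0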

Virtual Networks in Virtualization

Virtual Network Architecture

As cloud computing and virtualization become increasingly popular, physical layer-2 access switches no longer form the real network access layer. Instead, the access layer moves into the servers: virtual switches running on the servers connect VMs and function as the actual network access layer.

In previous sections we discussed the advantages of cloud computing and virtualization. Thanks to these advantages, they have become mainstream IT technologies. However, every new technology brings new challenges. Before virtualization, a physical server used a dedicated network cable to connect to a switch, and the few workloads running on that server shared the cable. Now, multiple VMs run on each physical server and use one network cable to transmit various types of traffic. The new challenge is how to manage these different types of traffic and their statuses.

Traditionally, IT staff are divided into host engineers, network engineers, and software developers, each with clear-cut responsibilities. In a cloud or virtualized environment, however, virtual switches run inside servers, so when a fault occurs it is not always easy to determine who should troubleshoot the virtual switches: the network engineer or the host engineer. It is therefore necessary for both to acquaint themselves with the architecture of the virtual network.

Architecture of a virtual network
A common virtualization system uses the architecture shown in the preceding figure. In a personal or small-scale virtualization system, VMs are bound to physical NICs using bridges or NAT. In a large-scale corporate virtualization system, VMs are connected to physical networks using virtual switches.

The network bridge is not a new technology. Network bridges enable network ports to communicate with each other: a bridge interconnects multiple network ports so that packets received on one port can be replicated to the others.

In a virtualization system, the OS is responsible for interconnecting all network ports. The following figure shows bridge-based interconnection in a Linux system.
Bridge-based interconnection in a Linux system

Bridge 0 (a network bridge device) is bound to eth0 and eth1. The upper-layer network protocol stack sees only Bridge 0 and does not need to know the bridging details, because the bridge is implemented at the data link layer. When the upper-layer protocol stack needs to send packets, it hands them to Bridge 0, whose processing code determines whether to forward them to eth0, eth1, or both. Similarly, packets received on eth0 or eth1 are handed to Bridge 0, whose processing code determines whether to forward them, discard them, or deliver them to the upper-layer protocol stack.
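Such a bridge can be sketched with the iproute2 and bridge utilities (the names br0, eth0, and eth1 are assumptions for illustration):

    # create a bridge device and enslave two ports to it
    ip link add name br0 type bridge
    ip link set eth0 master br0
    ip link set eth1 master br0
    ip link set br0 up
    # show the MAC address-to-port mappings the bridge has learned
    bridge fdb show br br0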

The network bridge provides the following key functions:
  • MAC learning: Initially, the bridge has no mapping between MAC addresses and ports and floods frames like a hub. As it forwards traffic, however, it learns the source MAC addresses of incoming frames and the ports they arrived on, building up a MAC address-port mapping (the CAM table).
  • Packet forwarding: For each data packet to be sent, the bridge reads the destination MAC address and looks it up in the CAM table to determine the port through which the packet should be sent.

With virtualization, each VM has a virtual NIC. On a Linux host, this virtual NIC is backed by a TAP device: the tun/tap device driver and the virtual NIC driver run in kernel mode, and user-mode programs access them through the tun/tap character device file. Data sent from a VM passes through the character device into the kernel tun/tap and virtual NIC drivers and emerges on the host as a tap interface. The tap interface is attached directly to the network bridge, so the eth0 and eth1 in the preceding figure are replaced by the VM tap interfaces. The following figure describes the details.
Forwarding traffic over a network bridge
Log in to a CNA node and run the ifconfig command to view tap devices. Two tap devices are displayed, as shown in the following figure.
The output reveals that the MAC addresses of the two tap devices are fe:6e:d4:88:b5:06 and fe:6e:d4:88:c6:29. Log in to the VRM page and query the MAC addresses of the existing VMs. You will find that the MAC addresses of the two tap devices correspond to the MAC addresses of the existing VMs.
 
MAC addresses of the existing VMs

If only the network bridge is used, VMs can communicate with external networks in bridged mode or NAT mode. In bridged mode, the network bridge functions as a switch, and the virtual NIC is connected to a port of that switch. In NAT mode, the network bridge functions as a router, and the virtual NIC is connected to a port of that router.

When the virtual NIC is connected to a switch port, the virtual NIC and the network bridge are configured with IP addresses in the same range and communicate within one broadcast domain. When the virtual NIC is connected to a router port, the virtual NIC and the network bridge are configured with IP addresses in different ranges. In this case, the system automatically generates an IP address range; the virtual NIC communicates with other networks, including the network bridge, through layer-3 route forwarding, and address translation is performed on the network bridge. In a Linux system, the IP address range generated by default is 192.168.122.0/24, as shown in the following figure.
IP address of the network bridge using NAT in the Linux system
With NAT used for address translation, when a VM communicates with an external network through the NAT gateway (virbr0 in the preceding figure), the source IP address of the IP packet is translated into the IP address of the physical network bridge and a record of the translation is kept. When the external network responds to the VM, the NAT gateway forwards the data packets back to the VM based on that record.
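The translation is typically realized with iptables rules similar to those libvirt installs for virbr0. A minimal sketch (the subnet matches the default 192.168.122.0/24 above; the outbound interface name eth0 is an assumption):

    # allow the host to forward packets between interfaces
    sysctl -w net.ipv4.ip_forward=1
    # rewrite the source address of VM traffic leaving for the physical network
    iptables -t nat -A POSTROUTING -s 192.168.122.0/24 ! -d 192.168.122.0/24 \
             -o eth0 -j MASQUERADE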

NAT is widely used and has the following advantages:
  • When the IP address range of the physical network has insufficient available IP addresses, a new private IP address range can be added behind the NAT gateway.
  • The source IP address can be concealed. When a VM accesses an external network, the IP address of the VM is translated at the NAT gateway, so the external network never learns the VM's real address, protecting VM security.
  • Load balancing: NAT provides a redirection function. When multiple VMs running the same applications are deployed in active/standby mode, NAT can translate their IP addresses into one IP address used for communication with external networks, while load balancing software evenly distributes service access among them.
The bridge and NAT modes suit personal or small-scale systems. With a plain bridge, the status of virtual NICs cannot be viewed and their traffic cannot be monitored; the bridge supports only GRE tunnels, with limited functions; and it does not support software-defined networking (SDN), which is now widely used. Therefore, in a large-scale system, the virtual switch is used for VMs to communicate with external networks. The virtual switch is effectively an upgraded network bridge that removes these defects.

Currently, each virtualization vendor has its own virtual switching product, such as VMware vSwitch, Cisco Nexus 1000V, and Huawei DVS. The following describes open-source Open vSwitch.

Open vSwitch (OVS) is an open-source, production-quality, multi-protocol virtual switch developed by Nicira Networks and released under the Apache 2.0 open-source license. Its core code is written in portable C. Open vSwitch is designed to enable massive network automation through programmatic extension, while supporting standard management interfaces and protocols, such as NetFlow, sFlow, SPAN, RSPAN, CLI, LACP, and 802.1ag. It supports multiple Linux virtualization technologies, such as Xen and KVM.

The OVS official website answers the question "Why OVS?" as follows:
The mobility of state
All network state associated with a network entity (say a virtual machine) should be easily identifiable and migratable between different hosts. This may include traditional "soft state" (such as an entry in an L2 learning table), L3 forwarding state, policy routing state, ACLs, QoS policy, monitoring configuration (e.g. NetFlow, IPFIX, sFlow), etc. Open vSwitch has support for both configuring and migrating both slow (configuration) and fast network state between instances. For example, if a VM migrates between end-hosts, it is possible to not only migrate associated configuration (SPAN rules, ACLs, QoS) but any live network state (including, for example, existing state which may be difficult to reconstruct). Further, Open vSwitch state is typed and backed by a real data-model allowing for the development of structured automation systems.
Responding to network dynamics
Virtual environments are often characterized by high-rates of change. VMs coming and going, VMs moving backwards and forwards in time, changes to the logical network environments, and so forth. Open vSwitch supports a number of features that allow a network control system to respond and adapt as the environment changes. This includes simple accounting and visibility support such as NetFlow, IPFIX, and sFlow. But perhaps more useful, Open vSwitch supports a network state database (OVSDB) that supports remote triggers. Therefore, a piece of orchestration software can "watch" various aspects of the network and respond if/when they change. This is used heavily today, for example, to respond to and track VM migrations. Open vSwitch also supports OpenFlow as a method of exporting remote access to control traffic. There are a number of uses for this including global network discovery through inspection of discovery or link-state traffic (e.g. LLDP, CDP, OSPF, etc.).
Maintenance of logical tags
Distributed virtual switches often maintain logical context within the network through appending or manipulating tags in network packets. This can be used to uniquely identify a VM (in a manner resistant to hardware spoofing), or to hold some other context that is only relevant in the logical domain. Much of the problem of building a distributed virtual switch is to efficiently and correctly manage these tags. Open vSwitch includes multiple methods for specifying and maintaining tagging rules, all of which are accessible to a remote process for orchestration. Further, in many cases these tagging rules are stored in an optimized form so they don't have to be coupled with a heavyweight network device. This allows, for example, thousands of tagging or address remapping rules to be configured, changed, and migrated. In a similar vein, Open vSwitch supports a GRE implementation that can handle thousands of simultaneous GRE tunnels and supports remote configuration for tunnel creation, configuration, and tear-down. This, for example, can be used to connect private VM networks in different data centers.

GRE provides a mechanism for encapsulating packets of one protocol into packets of another protocol. For example, if OSPF is used on the LAN and EGP is used on the WAN, GRE can encapsulate OSPF packets into EGP packets before they are transmitted. When you mail an international letter, the address on the envelope may be written in English while the letter itself is written in your mother tongue. In this case, your mother tongue is like OSPF, the English language is like EGP, and the envelope wrapping the letter is like GRE.
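On Linux, a point-to-point GRE tunnel can be sketched as follows (the endpoint addresses come from documentation ranges and are assumptions for illustration):

    # create a GRE tunnel between two sites
    ip tunnel add gre1 mode gre local 198.51.100.1 remote 203.0.113.1 ttl 64
    ip addr add 10.0.0.1/30 dev gre1
    ip link set gre1 up
    # traffic routed into gre1 is encapsulated in GRE before transmission
    ip route add 10.10.0.0/16 dev gre1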

Hardware integration
Open vSwitch's forwarding path (the in-kernel datapath) is designed to be amenable to "offloading" packet processing to hardware chipsets, whether housed in a classic hardware switch chassis or in an end-host NIC. This allows for the Open vSwitch control path to be able to both control a pure software implementation or a hardware switch. There are many ongoing efforts to port Open vSwitch to hardware chipsets. These include multiple merchant silicon chipsets (Broadcom and Marvell), as well as a number of vendor-specific platforms. The advantage of hardware integration is not only performance within virtualized environments. If physical switches also expose the Open vSwitch control abstractions, both bare-metal and virtualized hosting environments can be managed using the same mechanism for automated network control. 

In summary, in many ways Open vSwitch targets a different point in the design space than previous hypervisor networking stacks, focusing on the need for automated and dynamic network control in large-scale Linux-based virtualization environments. The goal with Open vSwitch is to keep the in-kernel code as small as possible (as is necessary for performance) and to re-use existing subsystems when applicable (for example, Open vSwitch uses the existing QoS stack). As of Linux 3.3, Open vSwitch is included as part of the kernel, and packages for the userspace utilities are available on most popular distributions.

OVS has the following key components:
ovs-vswitchd: the main module of OVS, a daemon that implements the switch, accompanied by a Linux kernel module that performs flow-based switching.
ovsdb-server: a lightweight database server from which ovs-vswitchd obtains its configuration.
ovs-dpctl: a tool for configuring the switch kernel module.
ovs-vsctl: a tool for querying and updating the ovs-vswitchd configuration.
ovs-appctl: a tool for sending commands to running OVS daemons.
OVS also ships auxiliary scripts and specifications that allow it to be installed on Citrix XenServer as the default switch.

In addition, OVS provides the following tools:
ovs-ofctl: a tool for querying and controlling OpenFlow switches and controllers.
ovs-pki: a utility for creating and managing the public-key infrastructure for OpenFlow switches.
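A minimal OVS setup can be sketched with ovs-vsctl (the bridge, uplink, and tap names are assumptions for illustration):

    # create an OVS bridge and attach a physical uplink
    ovs-vsctl add-br br0
    ovs-vsctl add-port br0 eth1
    # attach a VM tap device and place its port in VLAN 10
    ovs-vsctl add-port br0 tap0 tag=10
    # inspect the resulting configuration
    ovs-vsctl show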

The following figure shows the process for forwarding data packets on OVS. 
 
Forwarding data packets on OVS
1. The data packets generated on the VM are sent to the virtual NIC eth0.
2. The data packets are first sent to the tun/tap device, which is vnet in the preceding figure.
3. They are then transmitted by vnet to the network bridge.
4. Finally, the data packets are forwarded by the network bridge to the physical machine NIC eth1 that is connected to the network bridge, and the data packets are forwarded by eth1 to the physical layer-2 switch.

The virtual switch can be a common virtual switch or a distributed virtual switch (DVS). A common virtual switch runs only on a single physical host, and all its network configurations apply only to the VMs on that host. A distributed virtual switch spans multiple physical hosts; you can use the virtualization management tool to configure distributed virtual switches in a unified manner. A DVS is required for VM live migration. Huawei virtualization products use distributed virtual switches, which this section describes.

Network Features of Huawei Virtualization Products

Network Solutions in Huawei Virtualization Products

Huawei distributed virtual switches can be centrally managed. The centralized management modules provide a unified portal for simplified configuration management and user management.
DVS

The virtual switches distributed on physical servers provide VMs with a range of capabilities, including layer-2 connectivity, isolation, and QoS.

The DVS model has the following characteristics:
  • Multiple DVSs can be configured, and each DVS can serve multiple CNA nodes in a cluster.
  • A DVS provides several virtual switch ports (VSP) with configurable attributes, such as the rate and statistics. Ports with the same attributes are assigned to the same port group for easy management. Ports that belong to the same port group are assigned the same VLAN.
  • Different physical ports can be configured for the management plane, storage plane, and service plane. An uplink port or an uplink port aggregation group can be configured for each DVS to enable external communication of VMs served by the DVS. An uplink aggregation group comprises multiple physical NICs working based on preconfigured load-balancing policies.
  • Each VM provides multiple virtual NIC (vNIC) ports, which connect to VSPs of the switch in one-to-one mapping.
  • Servers in a cluster that allow layer-2 migration can be specified to create a virtual layer-2 network based on service requirements, and the VLAN used by this network can be configured.
Internal structure of a virtual switch

A port group consists of multiple ports with the same attributes. Instead of configuring each VM port separately, the administrator configures attributes for the whole port group, including bandwidth QoS, layer-2 security attributes, and VLAN, thus saving considerable time. Modifying port group attributes has no impact on the proper running of VMs.

An uplink connects the DVS on a host to the physical network. Administrators can query information about an uplink, including its name, rate, mode, and status.

Uplink aggregation allows multiple physical ports on a server to be bound into one logical port, which connects the VMs on that server to the physical network. Administrators can set the bound port to load balancing mode or active/standby mode.

Huawei DVS supports the virtual switching function of open-source Open vSwitch. The
following figure shows the structure of Huawei DVS. 
Internal structure of Huawei DVS
 
Huawei DVS has the following characteristics:
  • Unified portal and centralized management simplify user management and configuration.
  • The open-source Open vSwitch is integrated to make full use of and inherit virtual switching capabilities of open source communities.
  • Rich virtual switching layer-2 features, including switching, QoS, and security isolation, are provided.

DVS Traffic Flow

VMs run on the same host but belong to different port groups. 
Traffic flow when VMs run on the same host but belong to different port groups
 
A virtual switch is essentially a layer-2 switch. A port group has a vital parameter: the VLAN ID. If two VMs belong to different port groups, they are associated with different VLANs and cannot detect each other using broadcast. Generally, when two VMs are associated with different VLANs, they are configured with IP addresses in different ranges, so enabling communication between them requires a layer-3 device, which can be a layer-3 switch or a router. In Huawei FusionCompute, layer-3 switching can be performed only by physical layer-3 devices. Therefore, when these two VMs communicate, the traffic leaves the host, arrives at the physical access switch, is forwarded to the upstream layer-3 device, and is then routed back down to the other VM.
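On the physical layer-3 switch, inter-VLAN routing is typically provided by VLAN interfaces that act as the default gateways of the two port groups. A sketch in Huawei VRP-style syntax (the VLAN IDs and addresses are assumptions for illustration):

    # gateway for the port group in VLAN 10
    interface Vlanif10
     ip address 192.168.10.1 255.255.255.0
    # gateway for the port group in VLAN 20
    interface Vlanif20
     ip address 192.168.20.1 255.255.255.0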

VMs run on the same host and belong to the same port group. 
Traffic flow when VMs run on the same host and belong to the same port group
When two VMs run on the same host and belong to the same port group, they belong to the same broadcast domain, in which case, they can communicate with each other through a virtual switch and the traffic will not enter the physical network.

VMs run on different hosts but belong to the same port group.
Traffic flow when VMs run on different hosts but belong to the same port group
When two VMs belong to the same port group, they can detect each other by sending broadcast packets. However, because the two VMs run on different hosts, a physical switch is needed to connect the hosts (unless the two physical servers are directly connected). Therefore, when two VMs that run on different hosts but belong to the same port group communicate with each other, the traffic passes through the physical switch but requires no layer-3 device. This differs from the situation where the two VMs run on different physical servers and belong to different port groups.

Multiple DVSs run on a physical host.
In real-world scenarios, multiple DVSs usually run on the same physical host. Generally, when two VMs are connected to different DVSs, the port groups on the two DVSs have different VLAN IDs, which means the two VMs use IP addresses in different ranges. In this case, the traffic between the two VMs must be routed through a layer-3 device.
 

Security Group

Users can create security groups based on VM security requirements. Each security group provides a set of access rules, and VMs added to a security group are protected by those rules. Users can add VMs to security groups for security isolation and access control when creating the VMs. A security group is a logical group consisting of instances that have the same security protection requirements and trust each other within the same region. All VM NICs in a security group communicate with each other in compliance with the security group rules. A VM NIC can be added to only one security group.
Security group

The security group provides a function similar to that of a firewall. Both use iptables to filter packets for access control.

Netfilter/iptables (iptables for short) is the packet-filtering firewall on the Linux platform. iptables is the userspace tool, located at /sbin/iptables, with which users manage firewall rules; netfilter is the Linux kernel module that implements the packet filtering itself.
Working principles of the security group
In the preceding figure, Netfilter is a packet filtering framework. When processing IP packets, it provides five hook points at which services can register their own processing functions to implement various features.
Filtering packets using rules
The iptables configuration is organized into tables, chains, and rules. Network packets that enter a chain are matched against the rules in the chain in sequence; when a packet matches a rule, it is processed according to the action specified in that rule. iptables contains four tables: Raw, Filter (packet filtering), NAT (network address translation), and Mangle (modifying packet headers). These tables create their own processing chains at the five hook points as required and mount the chain-processing entry functions onto the corresponding hooks.
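A security-group-style rule set can be sketched with iptables as follows (a minimal illustration; the subnet and port are assumptions, and real cloud platforms generate far more elaborate chains):

    # allow replies to connections the VMs initiated
    iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    # allow SSH to the VMs only from the management subnet
    iptables -A FORWARD -p tcp --dport 22 -s 192.168.10.0/24 -j ACCEPT
    # drop everything else forwarded toward the VMs
    iptables -A FORWARD -j DROP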

Virtual Switching Mode

The Huawei virtual switch provides the following virtual switching modes: common mode, SR-IOV mode, and user mode.

Common mode
In this mode, a VM's vNIC has two parts: a front-end vNIC in the VM and a back-end vNIC on the host, the latter connecting to a virtual port of the virtual switch. VM network packets are exchanged between the front-end and back-end vNICs through a ring buffer and an event channel and are then forwarded by the virtual switch connected to the back-end vNIC. The following figure shows the details.
Common mode

SR-IOV mode
Single Root I/O Virtualization (SR-IOV) is a network I/O virtualization technology proposed by Intel in 2007 and now a PCI-SIG standard. A physical NIC that supports SR-IOV can be virtualized into multiple NICs for VMs, so that each VM appears to have an independent physical NIC. This improves network I/O performance compared with software virtualization and requires fewer hardware NICs than PCI passthrough. The following figure shows the details.
SR-IOV mode
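On a Linux host, SR-IOV virtual functions (VFs) are usually created through the sysfs interface (a sketch; the device name eth0 and the VF count are assumptions, and the NIC and its driver must support SR-IOV):

    # ask the NIC driver to create 4 virtual functions
    echo 4 > /sys/class/net/eth0/device/sriov_numvfs
    # the VFs appear as separate PCI devices that can be assigned to VMs
    lspci | grep -i "Virtual Function"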

User mode
In user mode, the driver is loaded on the virtual port, and a thread started in vswitchd replaces the kernel-mode packet send and receive functions. A data packet received from a NIC is received directly in the vswitchd thread, matched against the rules in the exact-match flow table of vswitchd, and then sent from the specified port by executing the OpenFlow actions and instructions. This mode is typically implemented with the Data Plane Development Kit (DPDK), whose advantage is improved port I/O performance. In addition, packet sending and receiving and OpenFlow-based data forwarding are performed entirely in user mode, reducing the overhead of switching between kernel mode and user mode and further improving network I/O performance. Compared with SR-IOV, advanced features such as live migration and NIC hot-add are still supported. The following figure shows the details.
User mode
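With Open vSwitch, user-mode switching based on DPDK is enabled roughly as follows (a hedged sketch; the PCI address and port names are assumptions, and OVS must be built with DPDK support):

    # enable DPDK support in the OVS database
    ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
    # create a bridge whose datapath runs in user space
    ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
    # attach a DPDK-managed physical port by its PCI address
    ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
              options:dpdk-devargs=0000:01:00.0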

After performing the operations described in section 3.2.4 of the lab guide, clarify the following points:
1. What is the port group used for?
2. How does VRM collect the traffic generated by each port?

