Control and Management Planes – Part 3
In the previous installment we discussed how SDN replaced distributed routing with centralized path computation. But SDN also introduced another novel abstraction: the whitebox switch.
An SDN whitebox switch is an abstraction of the forwarding behavior of network elements in packet networks. Today’s packet networks are composed of many different network elements, such as routers, switches, NATs, and firewalls. Yet, at a high enough level of abstraction, all these network elements perform exactly the same operations. They all receive a packet at some input port, observe some fields of the packet, perform some computations (e.g., table look-up or longest prefix match), make some decisions (whether and how to forward the packet), optionally modify the packet, and finally forward the packet through some output port. The differences between network elements lie in the details: which fields they observe, which decision algorithm they run, and so on.
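To make the abstraction concrete, here is a minimal Python sketch, assuming packets are modelled as dictionaries of header fields. The `WhiteboxSwitch` class and the `router_decide`, `decrement_ttl` and `firewall_decide` functions are hypothetical illustrations, not any real switch API. The point is that a router (longest prefix match plus TTL handling) and a firewall (a single allow rule) become two instances of one generic observe, decide, modify, forward pipeline, differing only in the functions plugged into it.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Callable, Optional, Tuple

Packet = dict  # hypothetical: a packet is just a dict of header fields


@dataclass
class WhiteboxSwitch:
    """Generic forwarding element: observe fields, decide, optionally modify, forward."""
    decide: Callable[[Packet], Optional[int]]             # returns an output port, or None to drop
    modify: Callable[[Packet], Packet] = lambda pkt: pkt  # default: forward unmodified

    def process(self, in_port: int, pkt: Packet) -> Optional[Tuple[int, Packet]]:
        out_port = self.decide({**pkt, "in_port": in_port})  # observe header fields plus input port
        if out_port is None:
            return None                                      # decision: drop the packet
        return out_port, self.modify(dict(pkt))              # modify a copy, then forward it


# A "router": TTL check, longest prefix match on the destination, TTL decrement.
ROUTES = {ip_network("10.0.0.0/8"): 1, ip_network("10.1.0.0/16"): 2, ip_network("0.0.0.0/0"): 3}

def router_decide(pkt: Packet) -> Optional[int]:
    if pkt["ttl"] <= 1:
        return None
    dst = ip_address(pkt["dst_ip"])
    matches = [net for net in ROUTES if dst in net]
    return ROUTES[max(matches, key=lambda net: net.prefixlen)] if matches else None

def decrement_ttl(pkt: Packet) -> Packet:
    pkt["ttl"] -= 1
    return pkt


# A "firewall": permit only TCP to port 443 arriving on port 1, and send it out port 2.
def firewall_decide(pkt: Packet) -> Optional[int]:
    allowed = pkt["in_port"] == 1 and pkt.get("proto") == "tcp" and pkt.get("dst_port") == 443
    return 2 if allowed else None


router = WhiteboxSwitch(decide=router_decide, modify=decrement_ttl)
firewall = WhiteboxSwitch(decide=firewall_decide)
```

For example, `router.process(0, {"dst_ip": "10.1.2.3", "ttl": 64})` picks the more specific /16 route and returns `(2, {"dst_ip": "10.1.2.3", "ttl": 63})`, while the firewall drops anything that does not match its single rule.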
To sum up, SDN emphasizes two different abstractions: the first is that path computation (figuring out how to route a packet from source to destination) is a computational problem; and the second is that packet forwarding (moving a given packet in real time from source to destination) is also a computational problem.
Path computation can be solved in either a distributed or a centralized manner. Distributed routing protocols are excellent at discovering basic connectivity, and very good at optimizing a single monotonic cost function (such as hop count), but many tasks are beyond their capabilities. SDN’s omniscient controller has access to the entire network topology and state, so arbitrary path computation problems reduce to running graph optimization algorithms.
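As an illustration of what a global view buys, the following Python sketch assumes the controller holds the whole topology as a dictionary mapping each link to a (latency, available bandwidth) pair; the function name, topology and numbers are all hypothetical. It runs a constrained shortest path computation in the style of CSPF: first prune every link that cannot carry the requested bandwidth, then run Dijkstra's algorithm on latency over what remains.

```python
import heapq
from typing import Dict, List, Optional, Tuple

# Hypothetical global view held by the controller:
# (node_a, node_b) -> (latency_ms, available_bandwidth_mbps); links are bidirectional.
Topology = Dict[Tuple[str, str], Tuple[float, float]]

TOPO: Topology = {
    ("A", "B"): (5.0, 100.0),
    ("B", "D"): (5.0, 100.0),
    ("A", "C"): (10.0, 1000.0),
    ("C", "D"): (10.0, 1000.0),
}


def constrained_shortest_path(topo: Topology, src: str, dst: str,
                              demand_mbps: float) -> Optional[List[str]]:
    # Step 1: keep only links with enough spare bandwidth for this demand.
    adj: Dict[str, List[Tuple[str, float]]] = {}
    for (a, b), (latency, bandwidth) in topo.items():
        if bandwidth >= demand_mbps:
            adj.setdefault(a, []).append((b, latency))
            adj.setdefault(b, []).append((a, latency))

    # Step 2: Dijkstra's algorithm on latency over the pruned graph.
    dist = {src: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, src)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == dst:
            break
        for nbr, latency in adj.get(node, []):
            nd = d + latency
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))

    if dst not in visited:
        return None  # no path satisfies the bandwidth constraint

    # Walk predecessor links back from dst to src.
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]
```

With a 50 Mb/s demand the low-latency path A-B-D is chosen; a 500 Mb/s demand does not fit on A-B or B-D, so the sketch returns A-C-D instead. Because the whole graph sits in one place, adding a further constraint is a local change to the pruning step.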
Packet forwarding, on the other hand, is by its very definition a distributed computational problem, and a well-known theorem about distributed computational systems is directly relevant here.
The CAP theorem (also known as Brewer’s theorem) states that the three desirable characteristics of distributed computational systems, namely
- Consistency (get the same answer regardless of which elements are involved)
- Availability (get an answer without unnecessary delay)
- Partition Tolerance (get an answer even if communication between parts of the system is lost)
cannot all be simultaneously satisfied. You can get any two, but not all three.
The reasoning behind the CAP theorem is not mysterious. If we centralize all the intelligence to obtain Consistency, then the system is only Available if there is no loss of communications with the centralized intelligence (a partition). If we distribute all the intelligence to obtain Availability, the system can only be Consistent if there is no loss of communications between the intelligent agents.
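The argument can be played out on a toy replicated register. This is a minimal Python sketch under deliberately simple assumptions: two replicas, and a single `partitioned` flag standing in for lost communication; `ReplicatedRegister` and `Unavailable` are hypothetical names. In "CP" mode the register refuses to answer during a partition rather than risk divergent replies, giving up Availability. In "AP" mode it keeps answering from whichever replica is asked, so the replicas can return different values, giving up Consistency.

```python
from dataclasses import dataclass, field
from typing import List


class Unavailable(Exception):
    """Raised when a CP register refuses to answer during a partition."""


@dataclass
class Replica:
    value: int = 0


@dataclass
class ReplicatedRegister:
    mode: str                                   # "CP" or "AP"
    replicas: List[Replica] = field(default_factory=lambda: [Replica(), Replica()])
    partitioned: bool = False                   # True while the replicas cannot communicate

    def write(self, replica_id: int, value: int) -> None:
        if not self.partitioned:
            for replica in self.replicas:       # healthy: update every replica, stay consistent
                replica.value = value
        elif self.mode == "CP":
            # Keep Consistency, lose Availability.
            raise Unavailable("peer unreachable; refusing write")
        else:
            # Keep Availability, let the replicas diverge.
            self.replicas[replica_id].value = value

    def read(self, replica_id: int) -> int:
        if self.partitioned and self.mode == "CP":
            raise Unavailable("peer unreachable; cannot confirm value")
        return self.replicas[replica_id].value
```

Healing an AP partition would need a reconciliation step (for example, last-writer-wins), which is where eventual consistency comes from; it is omitted here for brevity.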
Which CAP characteristics must be satisfied in conventional three-plane networks and which can we forgo? The answer depends on the stage of the service involved.
Service providers pay dearly for service failures, not only in lost revenue but also in SLA violation penalties. So, once a service has been commissioned and delivered to the customer, the characteristics guaranteed in the SLA, notably:
- high Availability (e.g., five nines) and
- high Partition tolerance (e.g., 50 millisecond restoration times)
must be maintained. According to the CAP theorem, this means that Consistency must suffer. Indeed, distributed routing is only eventually consistent: while routers converge, packets are sometimes black-holed or caught in transient loops (the latter bounded by TTL fields). How are Availability and Partition Tolerance guaranteed? Via the distributed control plane (for example, Automatic Protection Switching).
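The transient effects of eventual consistency can be reproduced with a small Python sketch. It assumes a hypothetical four-router topology (links A-B, B-D, A-C and C-D) with a single destination, D, and models each router's forwarding table as one next hop toward D. When link B-D fails, B may recompute its route (via A) before A has updated (still via B); during that window packets bounce between A and B until their TTL runs out. If B has not yet found any alternative, packets reaching B are simply black-holed.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical forwarding tables toward destination "D": router -> next hop
# (None means the router currently has no route, so the packet is black-holed).
Fibs = Dict[str, Optional[str]]

BEFORE_FAILURE: Fibs = {"A": "B", "B": "D", "C": "D"}
B_HAS_NO_ROUTE: Fibs = {"A": "B", "B": None, "C": "D"}  # B lost B-D, no alternative yet
B_UPDATED_ONLY: Fibs = {"A": "B", "B": "A", "C": "D"}  # B reroutes via A, A not yet updated
CONVERGED: Fibs = {"A": "C", "B": "A", "C": "D"}        # every router now reaches D via C


def forward(fibs: Fibs, src: str, dst: str, ttl: int) -> Tuple[str, List[str]]:
    """Follow next hops from src; each router decrements TTL and drops the packet at zero."""
    node, path = src, [src]
    while node != dst:
        next_hop = fibs.get(node)
        if next_hop is None:
            return "black-holed", path
        ttl -= 1
        if ttl == 0:
            return "dropped (TTL expired)", path
        node = next_hop
        path.append(node)
    return "delivered", path
```

With a TTL of 8, `forward(B_UPDATED_ONLY, "A", "D", 8)` returns `("dropped (TTL expired)", ["A", "B", "A", "B", "A", "B", "A", "B"])`: the TTL does not prevent the loop, it only stops the looping packet from circulating forever. Once every table reaches `CONVERGED`, the same packet is delivered via A-C-D.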
However, when a new service is first being set up, Consistency is emphasized (that’s what commissioning testing is all about). This is accomplished by using the management plane for set-up operations. Of course, Availability suffers (which is why set-up is often a lengthy process!), and Partitions are not allowed (faults during commissioning trigger manual operations).
The use of the management plane for commissioning and of the control plane for deployed services is a conscious decision on the part of the service provider. Indeed, a precise trade-off is maintained by a judicious combination of centralized management and distributed control planes. Without both planes, current SLAs cannot be guaranteed.
Due to its software roots, SDN has emphasized Consistency, so SDN-based networks must forgo either Availability or Partition tolerance (or both). Relying solely on a single centralized controller (i.e., a pure management system) may lead to more efficient bandwidth utilization but means giving up Partition tolerance. If some mechanism is put in place to protect against controller and southbound API failure, then the CAP theorem mandates that Availability must suffer.
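A minimal Python sketch shows why a switch that relies solely on the controller stops serving new traffic when partitioned from it. It assumes a purely reactive design, in which the switch holds only the flow entries the controller has installed and must consult the controller on every table miss; `ControllerDependentSwitch`, `POLICY` and the flow keys are hypothetical. Flows already in the table keep working through a partition, but no new forwarding decision can be made until connectivity to the controller is restored.

```python
from typing import Callable, Dict, Optional


class ControllerDependentSwitch:
    """A switch with no local intelligence: every unknown flow is referred to the controller."""

    def __init__(self, ask_controller: Callable[[str], int]) -> None:
        self.flow_table: Dict[str, int] = {}  # flow key -> output port, installed by the controller
        self.ask_controller = ask_controller  # the centralized decision function
        self.controller_reachable = True      # False models a partition from the controller

    def handle(self, flow_key: str) -> Optional[int]:
        """Return the output port for a flow, or None if the packet cannot be forwarded."""
        if flow_key in self.flow_table:       # table hit: no controller needed
            return self.flow_table[flow_key]
        if not self.controller_reachable:     # table miss during a partition: no decision possible
            return None
        port = self.ask_controller(flow_key)  # table miss: ask the controller, then cache the answer
        self.flow_table[flow_key] = port
        return port


# Hypothetical controller policy, looked up by flow key.
POLICY = {"h1->h2": 1, "h1->h3": 2}
switch = ControllerDependentSwitch(ask_controller=POLICY.__getitem__)
```

After `switch.handle("h1->h2")` has installed an entry, setting `switch.controller_reachable = False` leaves that flow working, while a first packet of `"h1->h3"` gets `None`. Adding local intelligence to the switch would let it answer, but then its answers could diverge from the controller's view, which is the CAP trade-off again.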
So, two-plane SDN cannot be used to build services governed by SLAs as they are currently written. Of course, we could rewrite SLAs to emphasize Consistency at the expense of Availability, but it is not clear why customers would prefer such an SLA over Best Effort delivery.
To sum up what we have discussed in these three blog posts:
- Networking theory defines three planes – forwarding, control, and management.
- Although the original differentiation between control and management was based on whether a human was in the loop, over time human functions were replaced by automation, and the difference that remains is distributed vs. centralized functionality.
- Providing both control and management planes enables service providers to consciously trade off consistency vs. availability vs. partition tolerance, in order to commission and maintain SLA-based services.
- SDN defines only two planes, and the non-forwarding one can be better approximated as a management plane (although the SDN literature calls it the control plane).
- SDN’s success is based on its facilitation of network automation and the consequent reductions in operations costs, but its centralized, consistency-centric approach makes it ill-suited to guaranteeing the present generation of SLAs (which require at least some intelligence in the network).
About RAD's Blog
We’ll be blogging on a wide range of hot topics affecting service providers and critical infrastructure network operators. Our resident experts will be discussing vCPE, Cyber Security, 5G, Industrial IoT and much, much more.