Resilient Routing

Figure 1: SDN Control Plane Setup

Software-defined networking (SDN) simplifies network devices by moving control plane functions to a logically centralized controller, so that data plane devices become simple programmable forwarding elements. SDN controllers use OpenFlow APIs to install forwarding rules and collect statistics at the data plane, which allows controller software and data plane hardware to evolve independently. For scalability and reliability, the logically centralized control plane (“network OS”) is often realized as multiple SDN controllers (see Figure 1) that form a distributed system. These controllers use distributed consensus protocols, such as Raft, to manage the network state and present a highly available cluster to the underlying networking elements. Consequently, SDN controller liveness depends on all-to-all message delivery between cluster servers.

The design of fault-tolerant distributed consensus algorithms typically focuses on server failures alone and assumes that the underlying network will handle connectivity issues on its own. This assumption holds in classical IP networks, where distributed routing algorithms running on routers cooperate to establish new paths after failures. SDN, however, creates cyclic dependencies among control network connectivity, consensus protocols, and the control logic that manages the network. The control logic is built on top of a distributed system (e.g., ONOS), which relies on consensus protocols for consistency and on control network connectivity for communication. The network data plane (and the control network), in turn, depends on this distributed system to set up the rules that control and enforce “who can talk to whom” among networking elements. Consequently, SDN introduces new network failure scenarios that are not explicitly handled by existing consensus algorithms. We demonstrate how these failure scenarios can severely affect the correctness and efficiency of consensus protocols.
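
To make the dependence on message delivery concrete, the following minimal sketch checks which servers in a cluster could still gather a majority, as a Raft leader must, given a set of working server-to-server links. The five-server cluster, the server names, and the helper functions are hypothetical and chosen for illustration only; they are not taken from the publications or from any controller implementation.

```python
from typing import Dict, List, Set


def can_reach_majority(node: str, links: Dict[str, Set[str]]) -> bool:
    """Return True if `node` together with the peers it can exchange
    messages with forms a strict majority of the cluster."""
    cluster_size = len(links)
    reachable = {node} | links[node]
    return len(reachable) > cluster_size // 2


def viable_leaders(links: Dict[str, Set[str]]) -> List[str]:
    """List the servers that could still win an election and commit entries."""
    return sorted(n for n in links if can_reach_majority(n, links))


SERVERS = "ABCDE"  # hypothetical controller names

# Healthy control network: every server can talk to every other server.
FULL_MESH = {n: {m for m in SERVERS if m != n} for n in SERVERS}

# Partial connectivity (assumed scenario): no server has crashed, but only
# the links to and from C still work.
STAR_AROUND_C = {
    "A": {"C"},
    "B": {"C"},
    "C": {"A", "B", "D", "E"},
    "D": {"C"},
    "E": {"C"},
}

if __name__ == "__main__":
    print(viable_leaders(FULL_MESH))
    print(viable_leaders(STAR_AROUND_C))
```

In the star-shaped scenario every server is alive, yet only C can reach a majority. A leader other than C would be unable to make progress, which is the kind of network-induced situation that a failure model based on server crashes alone does not capture.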

We propose that, to fundamentally break this interdependency, it is crucial to equip the SDN control network with a resilient routing mechanism that uses only local data plane operations to achieve resiliency under arbitrary link/node failures, without any involvement from the control plane to recompute routes. This avoids the cyclic dependency between control network connectivity and management, in which controllers need to set up rules to recover from failures but cannot reach the switches because of those same failures. More details about this work can be found in the publications listed below.
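
As a rough illustration of what "only local data plane operations" can mean, the sketch below gives each switch a pre-installed, ranked list of next hops toward a controller and lets the switch fall over to the next live entry when a link fails, with no route recomputation. This is a simplified ordered-failover scheme over a loop-free next-hop graph, not the preorder-based mechanism described in the publications; the topology, node names, and function names are assumptions made for the example.

```python
from typing import Dict, List, Optional, Set, Tuple

Link = Tuple[str, str]

# Hypothetical pre-installed state: for each switch, next hops toward the
# controller "c1", in order of preference. The entries form a loop-free graph.
RANKED_NEXT_HOPS: Dict[str, List[str]] = {
    "s1": ["s2", "s3"],
    "s2": ["c1", "s3"],
    "s3": ["c1"],
}


def link_is_up(a: str, b: str, failed_links: Set[Link]) -> bool:
    """Links are treated as bidirectional; a failure may be listed either way."""
    return (a, b) not in failed_links and (b, a) not in failed_links


def local_next_hop(switch: str, failed_links: Set[Link]) -> Optional[str]:
    """Pick the first live next hop using only state local to `switch`."""
    for candidate in RANKED_NEXT_HOPS[switch]:
        if link_is_up(switch, candidate, failed_links):
            return candidate
    return None  # all local alternatives are down


def forward(src: str, dst: str, failed_links: Set[Link],
            max_hops: int = 16) -> Optional[List[str]]:
    """Trace the hop-by-hop path a packet takes, or None if it is dropped."""
    path = [src]
    current = src
    while current != dst:
        if len(path) > max_hops:
            return None  # safety bound; not expected with loop-free entries
        nxt = local_next_hop(current, failed_links)
        if nxt is None:
            return None
        path.append(nxt)
        current = nxt
    return path


if __name__ == "__main__":
    print(forward("s1", "c1", set()))
    print(forward("s1", "c1", {("s2", "c1")}))
    print(forward("s1", "c1", {("s1", "s2")}))
```

With no failures the packet follows s1, s2, c1. If the link between s2 and c1 fails, s2 alone redirects the packet through s3, and if the link between s1 and s2 fails, s1 sends it to s3 directly. In neither case does a controller have to be reachable to repair the path.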


Publications

  1. When Raft Meets SDN: How to Elect a Leader and Reach Consensus in an Unruly Network.
    Yang Zhang, Eman Ramadan, Hesham Mekky, and Zhi-Li Zhang. Asia-Pacific Workshop on Networking (APNet'17), 2017. (Best Paper Award)
     

  2. Adaptive Resilient Routing via Preorders in SDN.
    Eman Ramadan, Hesham Mekky, Braulio Dumba, and Zhi-Li Zhang. Distributed Cloud Computing Workshop (DCC'16), 2016.


Supplementary Materials

  1. Presentation Slides for APNet'17 Paper
  2. Presentation Slides for DCC'16 Paper