Course:CPSC522/Network Agent

From UBC Wiki
Jump to navigation Jump to search

Modelling Datacenter Traffic as Reinforcement Learning Problem

An investigation into applying artificial intelligence techniques to datacenter congestion control.

Principal Author: Fabian Ruffy
Collaborators: Michael Przystupa, Anand Jayarajan (not participants in this course but have been essential in getting this project to work, thanks a lot!)

Note: This is an ongoing project and we appreciate any help or feedback on the choices we made. We use deep reinforcement learning in this project as it is the most viable AI approach that is well documented as well as readily available to us. If there is a better option which may produce better results we are happy to try it out. The repository we are working on can be found here.


This work is focussed on the introduction of artificial intelligence in the space of datacenter networking. In particular, it centers around the design of an environment which can be managed by an autonomous network arbiter. This arbiter decides traffic sending rates for all elements in the network and continously re-evaluates its policy and decision making. In this article, we present a basic prototype and evaluate its effectiveness on a simple dumbbell topology.

Builds on

This project builds on reinforcement learning, in particular deep reinforcement learning. It uses convolutional and standard neural networks.

Related Pages

AntNet is one form of network optimization using reinforcement learning methods. This work is also closely related to Artificial Intelligence for Locomotion Control, which uses Deep Learning to learn robotic movement.


Motivation and Background

This section covers the motivation to introduce centralized artificial intelligence in networks. It outlines the limitations of current network algorithms, which are largely designed by humans, and explains why it is now possible to manage a network centrally and automatically.

Congestion Control in Networking

Initially, research in network congestion was dominated by the assumption of a decentralized, autonomous network. End-hosts only have control over the amount of traffic they send, and are unaware of the intentions or traffic rates of their peers. Similarly, switches are unaware of global traffic patterns and only forward based on their local notion of optimality.

In line with these assumptions, the transmission control protocol (TCP) was designed to optimize traffic globally on a simple fairness principle: nodes reduce sending rates in response to packet being lost and assume that others will behave accordingly. An equilibrium is reached when all end-hosts achieve a traffic rate that, in sum, conforms to the maximum available bandwidth of the most congested link.

The Limitations of TCP

Developments in the past decade have changed the general networking environment. Datacenters have emerged as a new research domain, which poses novel design challenges and opportunities. Driven by minimization of costs and maximization of compute power, they must run at the highest possible utilization to achieve the ideal compute/cost ratio. Inefficient routing and congestion control can quickly lead to bufferbloat and the eventual collapse of a high-load network, requiring more sophisticated solutions. In fact, recent insights into datacenter network traffic have shown that buffer overruns and small short-lived bursts are the primary factor of loss.[1][2]

Despite the innovation potential of datacenters, TCP has dominated as the congestion control algorithm of choice. TCP's algorithm is a best-effort solution to globally maximize utilization under the fairness principle. This design has not fundamentally changed over the last thirty years, and new research advancements raise the question whether TCP can still be considered the most sensible option. A substantial and noticeable increase of loss and latency in the network already portends a problem. Overflowing queues in forwarding elements or mismatched hardware capabilities imply that traffic has not been optimally managed. An ideal network should always be zero-queue. Latency will merely be induced by propagation, and not queuing delay.

Centralized Network Managers

Software-Defined Networking (SDN) provides operators with the ability to freely control and adapt their network architecture, leading to highly customized systems and fine-grained network optimization.

SDN introduces the notion of "centralized management". A single controller with global knowledge is able to automatically modify and adapt the forwarding tables of all forwarding elements, while notifying end hosts of changes in the network. This full architectural control facilitated new opportunities in the space of TCP congestion research. Traffic can now be managed in a centralized fashion based on global knowledge of the entire topology and traffic patterns.

Following SDN's rise in popularity, a new line of centralized schedulers has emerged. One limitation of these schedulers is that they are still reactive in nature. The central controller responds to changes in the network or requests by applications, which may cost valuable round-trip latency. Often, short-term flows or bursts are unaccounted for, which causes undesirable packet loss and back-propagating congestion.

Admission Control

In parallel, admission-control-based approaches have gained traction as a viable alternative to the conventional burst-and-backoff mechanism of TCP[3] [4] . While the idea of admission control and service guarantees in networks is not new (common examples are techniques such as token or leaky bucket), these designs traditionally aim to assure quality and bandwidth guarantees in contentious, decentralized, and untrusted environments such as wide-area networks. In a datacenter, these conditions do not apply. End-hosts are generally considered reliable and restricted in behaviour, which allows for great simplification of enforcement and prioritization policies. Admission-based systems can give administrators the advantage of tight control over the datacenter, mitigating the effects of queue-buildup and bursts.

Learning Network Traffic Patterns

Although it is unclear how predictable datacenter traffic truly is, various works have shown that repeated patterns may exist[5][6][7][8]. If datacenter traffic exhibits predictable patterns, it may be possible to devise a proactive mechanism which forecasts the traffic matrix of next iterations, and adjusts traffic accordingly.

An ideal culmination of these insights is a global, centralized arbiter, able to predict and fairly distribute flows in the network. By only allowing a safe sending rate for each participant it minimizes the amount of bursts and loss possible. Treating the network's compute and forwarding power as a single finite resource, the controller acts akin to a OS scheduler distributing CPU time slices to processes.

In recent years, machine learning (ML) has emerged as a promising strategy for automated management. ML has shown promising results in many networking domains, including, but not limited to, datacenter power management[9], job scheduling[10], and even cellular congestion control[11]. In the domain of datacenter networking, ML systems have been proposed in the form of routing[12] and host-local congestion control optimization[13].

Our Hypothesis

This project is an investigation of reinforcement learning in the datacenter. We explore how an agent can act and observe the environment effectively, and what is required for it to function reliably. We transpose problems and solutions from the space of reinforcement learning into the DC context. We dissect the typical architecture of RL and investigate how each component can be mapped to attributes of congestion control and DC management. Our hypothesis is that such an agent may learn to observe traffic and adapt to it. The ultimate goal is to develop an agent which is able to predict network surges and traffic patterns based on the current input of the traffic matrix. Ideally, it will be able to manage hosts predictively and adjust bandwidth of participating hosts pro-actively.

Figure 1: Scenario of a parking lot topology which experiences congestion.

An example use case can be seen in Figure 1. In this scenario, each switch link has a capacity of 10 mbit per second. H6 sends data to H1, H4, and H5 in Period 1, which will cause these hosts to reply with a large data stream in Period 3. In the normal case, every host is unrestricted in its sending rate, causing congestion at the bottleneck link of the switches. Packets are lost and latency builds up. In the second scenario, an agent, which is aware of this pattern, observes the flows from H6 and decides to act. In Period 2, it restricts the sending rate of the responsible hosts, and optimally distributes bandwidth to a fair rate. Hosts are now able to send at this bandwidth at a stable and reliable rate without experiencing packet loss and retransmit. This increases the overall goodput of the network.

Reinforcement Learning in the Datacenter


Figure 2: A sample fat tree network architecture with a central network arbiter.

Our agent is intended to act in a datacenter environment which is comprised of many switches and hosts. A typical topology of a datacenter is a fat tree (see Figure 2), in which hosts are contained in sets of racks. Each rack is connected to multiple edge switches, which in turn connect to spine switches with higher bandwidth. Hosts send traffic across the network and try to maximize their sending rate. Traffic, or rather "flows", can collide at switch interfaces, which results in loss and congestion. Typically, the TCP protocol is used to adjust sending rate dynamically.

In our scenario, a global controller manages the sending rate of all hosts. It is able to adjust the interface bandwidth by sending a control packet specifying the current maximum allowed sending rate. The goal is to maximize the utilization of the network while still preventing congestion at bottleneck.

One potential method of maximizing this goal is to reroute traffic to underutilized links. In this project, we focus on adjusting sending rate only. We consider routing to be future work. An visualization of our assumed scenario can be seen in Figure 1 on the right. The agent is able to observe system state of all switches in the network, which represents its environment. It receives reward for an increase in utilization and is penalized for queues at switches (which indicates congestion).

Why Reinforcement Learning?

We have decided to go with a reinforcement learning approach as the traffic patterns of the network are initially unknown to us. Similarly, the artifical intelligence has to gradually experience network traffic behaviour and effects. By learning to interact with this environment we hope to develop a model which eventually adapts to any traffic condition and matrix. While it is possible to define a static model (TCP algorithms are largely static) we believe that a dynamic, learned approach with central knowledge has more capabilities than human-defined formulas. For further interest, the topic of machine-learned against human algorithms is also discussed by Schapira and Winstein[14].

Translating Network Properties for Reinforcement Learning

A fundamental challenge is to formalize what the agent observes, i.e., how the environment of the data center is abstracted. The available options of data acquisition in a DC are manifold, ranging from simple switch statistics, to application flows, job deployment monitoring, or even explicit application notifications. For simplicity and since we address congestion in the network we focus on the bandwidth flowing through the interfaces and the queues building up at devices.

The agent collects receiving and sending bandwidth per port, plus the current queue length. This is a very minimal feature set, and does not include advanced collection such as link latency or flows travelling through an interface. In a very small fat-tree network with 20 switches and four ports per switch, this would already result in a 80x3 matrix, which has to be processed in milliseconds. In addition, the measure of bandwidth per second is continuous and requires discretization.

Regarding actions, we face the challenge of continuous values which are expressed in the ability to adjust the sending rate of an host's interface. Values here can be arbitrarily fine-grained and the amount of simultaneous possible actions grows linearly with the number of hosts in the network. In the domain of reinforcement learning, a similar problem is the space of robotics and motion[15], which observes a complex, continuously changing environment and has to enact complex actions quickly. The space of real-time strategy [16] and first-person shooters [17] also poses a comparable challenge.

We draw from recent research on these topics to model our agent. As our environment is highly complex and volatile, we believe that an actor-critic model is the right choice. It decouples the policy evaluation from the action selection and aids in achieving proper convergence and stability. In particular, we look at work such as the Deep Deterministic Policy Gradient (DDPG) [18] and Asynchronous-Advantage-Actor-Critic (A3C) [19] to implement our agent framework. These models use neural nets as approximators, which are apt to handle large input spaces and train for specific patterns, but suffer from instability and divergence. In theory, actor-critic systems can alleviate these drawbacks making them an interesting choice for our system.


Our network controller consists of two modules: a reinforcement learning agent, which computes the next actions, and a monitor, which collects network statistics and feeds it to the agent. As the agent model we have decided to use the DDPG architecture described by Lillicrap et al[18]. Our implementation largely follows the model described by the researchers. The agent is divided into an actor and a critic plane, each of which use an independent neural net architecture. The replay buffer is fixed at a size of one million different actions. Per iteration we save the model in order to be able to reuse it later. We currently experiment on four different variations of this model:

Agent A

Agent A is our main model. Actor and critic are represented as fully connected neural nets with two hidden layers. We use ReLU as the activation function for each of the hidden layers and Sigmoid as the final output of the network. All actions are chosen at once on the basis of the neural net's output. This algorithm is the basic reimplementation of the DDPG algorithm.

Agent B

Agent B has a similar architecture as Agent A, but uses a one-hot encoding to decide actions. Instead of predicting values and computing actions for all interfaces simultaneously, we predict on one particular interface each iteration. We assumed that, while slower, predicting on a per-interface basis can provide more accurate and reliable results as each interface is trained individually on global information.

Agent C

Agent C is a convolutional neural net with a stride of size two for the input matrix. The kernel size is currently 3, and does not use any pooling. Convolutional neural nets are said to overfit less and generalize better to many domains. Our experiment does not aim for generalization, nonetheless we use this architecture to test the capabilities of convolutional networks.

Agent D

Agent D is a simple online-learning agent without actor-critic infrastructure. It uses the same neural net architecture as Agent A. Agent D serves as a baseline to compare the basic algorithm to the actor-critic deep reinforcement learning model.

As state input we collect a traffic matrix which consists of the statistics of interfaces times a vector of queue size, receiving bandwidth, sending bandwidth, and delta values of the relative queue and bandwidth improvement. We model reward as a trade-off between bandwidth and queue length. Both metrics are normalized to the maximum possible value. We compute reward per interface and sum up the final result. Queue acts as negative penalty whereas bandwidth provides positive reward. To guarantee that reducing queue length is prioritized, we weigh the penalty proportionally to the amount of interfaces in the network. The queue length penalty is also squared, to lessen the comparative impact of smaller queues, which may occur due to measurement error:

As of now, our agent chooses an action and observes traffic behavior for three seconds to guarantee stabilization. While this time is impractical in actual real world networks, it allows us to prototype techniques faster and observe the behavior of the agent.

Our current control platform is written in ~4000 lines of Python. The traffic generators of the clients are written in C++.


Figure 3: The dumbbell scenario the agent is training on.


To test our idea of a predictive, self-learning network agent, we developed a test suite on top of Mininet[20], a networking emulator. We designed several topologies, including a dumbbell (see Figure 3), a parking lot (see Figure 1), and a fat tree topology (see Figure 2). For simplicity, we only test on the dumbbell topology. Our goal is to achieve an optimal allocation of bandwidth without incurring queueing in the network. In our experiment we continuously send UDP traffic from H1 to H3 and from H2 to H4. Initially, the hosts send at max rate, causing a collision at interface 1001-3. The controller is intended to observe this behaviour and learn to adjust the hosts until a network-optimal distribution is reached. We train our agent for ten hours, then switch from an exploration policy to an exploit policy for 5 hours. An agent is successful, if it achieves a pareto-efficient distribution of traffic.


We tested all four agents. The results can be seen in Figure 4 to 7 on the right. The first plot describes the evolving average of the reward per iteration (lower) and overall batched reward (upper). The middle plot describes the average queue size of interface per training period (lower is better and gives more reward), the last plot describes bandwidth on a per-interface basis (higher is better, the maximum is 10 mbps per interface). The bandwidth of interface 1001-eth1 and 1001-eth2 should ideally result in a maximum of 10 mbps on interface 1002-eth3.

Figure 4: Results of Agent A.
Figure 5: Results of Agent B.
Figure 6: Results of Agent C.
Figure 7: Results of Agent D.

Unfortunately, the reward plots do not show any real sign of convergence for any agent. Actions vary wildly, causing flailing behaviour. Even the online-learning agent D, which does not use experience replay, is quite volatile. However, when moving from the explore to the exploit phase, two out of four agent begin to use optimal actions. In this case, these "optimal" actions are to simple set one interface to zero bandwidth, and keep the other at maximum 10 mbit. This is an unfair distribution, but technically maximizes network utilization. Average queue length of the congested interface goes to zero while bandwidth remains high, which is the ultimate goal of our scenario.

The agents which succeeded in this emulation are Agent A and C, both of which employ the DDPG algorithm with (convolutional) neural nets. Interestingly, the agents have wildly different approaches to interface management during training. While A tries to keep one interface high and the other low (but not at zero), Agent C oscillates between maximum and minimum distributions. We are planning to investigate this difference in behaviour

Agent B which uses the one-hot encoding fails. Its reward does not even come close to reaching the maximum (a value of 2.0) and it merely learns to set all bandwidth to zero. This failure may be explained due to the fact that predicting on single interfaces does not account for the whole system state.

The online-learning agent D exhibits a curious pattern. While it achieves maximum utilization, it switches back to a congested model periodically. The exact cause is unclear to us.

Based on these results, we are (for now) moving forward with Agent A and C. While these initial graphs are promising, the scenario is far too simplistic. We intend to model more complex scenarios and flow patterns. Our first step is to factor in standard deviation as penalty to guarantee fairness. This should ultimately result in a 5 mbps distribution per host-interface.


The approach we have taken is not without problems. In fact there are many concerns and limitations which we discuss in this section. This project is a major merge of two research domains, causing constraints on both ends.

Limitations in the AI Domain

As can be seen, our results are underwhelming, even on the simplest possible model. We have several concerns on the theoretical side which may explain this performance.

Lack of Theoretical and Foundational Knowledge

Our knowledge of the reinforcement learning, agent, and AI space is limited. We have decided to go with the current "hot" approach of deep reinforcement learning. This field however requires significant expertise and study to define a proper, well-functioning model. To achieve convergence and stability, large teams are working on deep reinforcement learning algorithms over long periods of time. Our model and platform is continuously improving, but we have yet to reach a stage of confidence. Many of our attempts are trial and error and may be shortsighted due to a lack of general knowledge. Our current approach to this research is akin to alchemy. We merely tweak variables and observe the behaviour, but we do not understand the implications and the effects of any of our adjustments. A lot of our decisions are based on guesswork. In general, we are unaware of the overall space of research and methods which may be more suitable for the scenario we described. While we are not working with a "zero-knowledge" model, the amount of rules our agent has is minimal. It may certainly be beneficial to add constraints in order to produce more sensible actions, but unfortunately we currently lack the knowledge how to encode this constraints in the system appropriately.

Due to time limitations, we have also not tried other, simpler reinforcement learning approaches such as linear policy gradients or radial basis functions. As baseline, we intend to also implement simpler, easier understandable models and compare their performance to the complex neural net architectures. One potential, more advanced, method we are considering is a linear policy gradient developed by Rajeswaran, Aravind, et al. [21].

Reinforcement Learning is Hard

In general, the descriptions in this and this blog post mirror our experience quite well. Deep reinforcement learning has significant problems in reproducibility and success rate. In particular, the DDPG algorithm is said to be reliable and "deterministic", but we have yet to observe determinism in our experiments. This could be due to many reasons, such as bugs in the implementation or a confusing state and reward model. It is unclear, which change could let to a functional model. For example, replay buffer can significantly affect the performance of a deep RL model[22].

Another constraint is the domain we are working in. Networks alone are already very complex systems, with rippling effects and slow momentum. Changes do not apply immediately and positive effects are often only seen many iterations later. This makes is hard to assign reward properly and may prevent the agent from understanding the environment.

Until we fully understand the class of problem we are dealing with, it is uncertain if we can achieve proper success. However, we believe that the initial results are promising and warrant further investigation.


  1. Roy, Arjun, et al. "Inside the social network's (datacenter) network." ACM SIGCOMM Computer Communication Review. Vol. 45. No. 4. ACM, 2015.
  2. Kandula, Srikanth, et al. "The nature of data center traffic: measurements & analysis." Proceedings of the 9th ACM SIGCOMM conference on Internet measurement. ACM, 2009.
  3. Perry, Jonathan, et al. "Fastpass: A centralized zero-queue datacenter network." ACM SIGCOMM Computer Communication Review 44.4 (2015): 307-318.
  4. Cho, Inho, Keon Jang, and Dongsu Han. "Credit-Scheduled Delay-Bounded Congestion Control for Datacenters." Proceedings of the Conference of the ACM Special Interest Group on Data Communication. ACM, 2017.
  5. Sivaraman, Anirudh, et al. "An experimental study of the learnability of congestion control." ACM SIGCOMM Computer Communication Review. Vol. 44. No. 4. ACM, 2014.
  6. Benson, Theophilus, et al. "MicroTE: Fine grained traffic engineering for data centers." Proceedings of the Seventh COnference on emerging Networking EXperiments and Technologies. ACM, 2011.
  7. Kandula, Srikanth, et al. "The nature of data center traffic: measurements & analysis." Proceedings of the 9th ACM SIGCOMM conference on Internet measurement. ACM, 2009.
  8. Mirza, Mariyam, et al. "A machine learning approach to TCP throughput prediction." ACM SIGMETRICS Performance Evaluation Review. Vol. 35. No. 1. ACM, 2007.
  9. Tesauro, Gerald, et al. "Managing power consumption and performance of computing systems using reinforcement learning." Advances in Neural Information Processing Systems. 2008.
  10. Mao, Hongzi, et al. "Resource management with deep reinforcement learning." Proceedings of the 15th ACM Workshop on Hot Topics in Networks. ACM, 2016.
  11. Zaki, Yasir, et al. "Adaptive congestion control for unpredictable cellular networks." ACM SIGCOMM Computer Communication Review. Vol. 45. No. 4. ACM, 2015.
  12. Valadarsky, Asaf, et al. "Learning to route." Proceedings of the 16th ACM Workshop on Hot Topics in Networks. ACM, 2017.
  13. Dong, Mo, et al. "PCC} Vivace: Online-Learning Congestion Control}." 15th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 18). USENIX} Association}, 2018.
  14. Schapira, Michael, and Keith Winstein. "Congestion-Control Throwdown." Proceedings of the 16th ACM Workshop on Hot Topics in Networks. ACM, 2017.
  15. Gu, Shixiang, et al. "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates." Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017.
  16. Vinyals, Oriol, et al. "StarCraft II: a new challenge for reinforcement learning." arXiv preprint arXiv:1708.04782 (2017).
  17. Lample, Guillaume, and Devendra Singh Chaplot. "Playing FPS Games with Deep Reinforcement Learning." AAAI. 2017.
  18. 18.0 18.1 Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
  19. Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International Conference on Machine Learning. 2016.
  20. Lantz, Bob, Brandon Heller, and Nick McKeown. "A network in a laptop: rapid prototyping for software-defined networks." Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks. ACM, 2010.
  21. Rajeswaran, Aravind, et al. "Towards generalization and simplicity in continuous control." Advances in Neural Information Processing Systems. 2017.
  22. Zhang, Shangtong, and Richard S. Sutton. "A Deeper Look at Experience Replay." arXiv preprint arXiv:1712.01275 (2017).
Some rights reserved
Permission is granted to copy, distribute and/or modify this document according to the terms in Creative Commons License, Attribution-NonCommercial-ShareAlike 3.0. The full text of this license may be found here: CC by-nc-sa 3.0