The Application of RDMA over Converged Ethernet Data Transport for Radio-Astronomy Systems

Abstract

The need to receive and process ever higher data rates in computer clusters continues to grow. This also applies to radio-astronomy systems, which have become more distributed over the past decades, increasing data traffic between antennas and processing facilities. At the antenna, Field Programmable Gate Arrays (FPGAs) digitise the radio signals and often perform the first stage of signal processing. The data is then sent from the FPGAs to computer clusters, where further processing is done on CPUs and GPUs. The protocols currently used for data transport between FPGAs and CPUs, such as UDP, do not scale well to higher data rates because they place a heavy load on the receiving CPU.
Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) was developed to overcome the disadvantages of the standard UDP and TCP protocols: the CPU only orchestrates data transfers and is not engaged in the data path. The data thus bypasses the CPU, which lowers the CPU workload and enables throughput and latency improvements. However, data transport methods for radio telescopes must meet the specific requirements of this application area, such as public Ethernet routing, high sustained data rates, real-time processing and implementability on FPGAs.

This work examines whether RoCE can reduce system load and increase throughput when sending and receiving antenna data. For this purpose, we characterise the data transport, derive the best RoCE configuration for the intended application and assess whether RoCE can achieve the required performance. The protocol analysis concludes that the unreliable connection transport service, combined with RDMA WRITE with immediate data, is best suited to radio telescope systems.
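As an illustration of why immediate data can be useful on an unreliable connection, the following minimal sketch packs a stream identifier and a sequence number into the 32-bit immediate value that accompanies each RDMA WRITE, so that a receiver could detect lost messages from its completions alone. The field layout (8-bit stream id, 24-bit sequence number) and all function names are hypothetical and chosen only for this example; the abstract does not state how the thesis uses the immediate value.

```python
# Hypothetical layout of the 32-bit immediate value (an assumption for illustration):
#   bits 31..24: stream id, bits 23..0: sequence number
SEQ_BITS = 24
SEQ_MASK = (1 << SEQ_BITS) - 1


def pack_immediate(stream_id, seq):
    """Combine a stream id and a sequence number into one 32-bit value."""
    if not 0 <= stream_id < 256:
        raise ValueError("stream_id must fit in 8 bits")
    return (stream_id << SEQ_BITS) | (seq & SEQ_MASK)


def unpack_immediate(imm):
    """Split a 32-bit immediate value back into (stream_id, seq)."""
    return imm >> SEQ_BITS, imm & SEQ_MASK


def count_lost(prev_seq, seq):
    """Number of messages missing between two consecutive completions.

    The subtraction is done modulo 2**24 so that sequence wraparound
    is not reported as loss.
    """
    return (seq - prev_seq - 1) & SEQ_MASK


if __name__ == "__main__":
    imm = pack_immediate(stream_id=3, seq=5)
    print(unpack_immediate(imm))
    print(count_lost(prev_seq=5, seq=9))
```

Because an unreliable connection does not retransmit, a check of this kind would let the application notice gaps without inspecting the payload itself.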
This thesis defines a methodology to examine the impact of RoCE settings and RoCE's scalability on a representative cluster setup with CPU-CPU and CPU-GPU data transport. To conduct these tests, an application was developed with abstractions that make it easy to reuse. First, standard tooling demonstrated that RoCE achieves ∼2x higher goodput and 3x lower CPU utilisation compared to UDP, indicating the possible scale of the performance improvement. Further performance studies were carried out with the custom-developed application to explore various settings and network topologies (1-1, 1-N and N-1). For example, we found that using a shared receive queue can reduce CPU utilisation by 50%, and that using solicited events can yield reductions of up to 70%, with no negative impact on resources or goodput. Direct memory access from the RoCE-enabled NIC to GPU memory was also evaluated and achieved performance comparable to standard main memory in an N-1 setup. We found that RoCE can transport data from multiple transmitters over a total of 2000 queue pairs (QPs) with a 16 KiB message size to a single receiver at 90 Gbps, with a CPU load of 40% on one core in the receiver.
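To put the N-1 result in perspective, the short calculation below converts the reported rate into messages per second. It is a back-of-the-envelope sketch under stated assumptions: 1 Gbps is taken as 10^9 bit/s, the 90 Gbps figure is treated as payload only, and traffic is assumed to be spread evenly over the 2000 QPs, which the abstract does not state.

```python
# Figures taken from the abstract.
RATE_BPS = 90e9              # 90 Gbps, assuming 1 Gbps = 1e9 bit/s
MESSAGE_BYTES = 16 * 1024    # 16 KiB message size
QUEUE_PAIRS = 2000           # total number of QPs at the receiver

# Messages the receiver must complete per second (payload only, headers ignored).
messages_per_second = RATE_BPS / (MESSAGE_BYTES * 8)

# Per-QP rate, assuming an even spread over all QPs (an assumption).
messages_per_qp = messages_per_second / QUEUE_PAIRS

print(f"{messages_per_second:,.0f} messages/s in total")
print(f"{messages_per_qp:,.1f} messages/s per QP")
```

Under these assumptions the receiver handles roughly 687,000 messages per second in total, or about 343 per QP, which gives a sense of the completion rate that a single partly loaded core has to sustain.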
The feasibility and performance of transporting data between FPGA and RNIC are also investigated. The implementation could not transport the data from the FPGA into the CPU memory because of an incorrect checksum implementation. Nevertheless, we were able to confirm that it is possible to implement RoCE on an FPGA for use in radio astronomical data transport.