In federated learning (FL) systems, a server maintains a global model that is trained by a set of clients on their local datasets. Conventional synchronous FL systems are highly sensitive to system heterogeneity, since the server must wait for the slowest clients in each round. Asynchronous FL partially addresses this bottleneck by processing updates as soon as they are received. With a single server, however, performance degrades when clients are located far from the server and incur high communication costs. A single server also limits the number of clients, since it can be overloaded by heavy communication and computation workloads. Moreover, a crash of the central server is fatal to a single-server system. Multi-server FL reduces the average communication cost by decreasing the distance between servers and clients. However, the bottleneck caused by the slowest clients persists in multi-server systems that preserve synchrony, such as Hierarchical FL. The approach we follow in this paper consists of replicating the server in such a way that the global training process remains asynchronous. We propose MultiAsync, a novel asynchronous multi-server FL framework that aims to address both the single-server and the synchronous-system bottlenecks.