Towards a Distributed Federated Learning Aggregation Placement using Particle Swarm Intelligence

Amir Ali-Pour
ÉTS Montréal / Université du Québec
amir.ali-pour@etsmtl.ca

Laya Samizadeh
ÉTS Montréal / Université du Québec
laya.samizadeh.1@ens.etsmtl.ca

Sadra Bekrani
Islamic Azad University of Bojnord
sadra.bekrani@iau.ir

Julien Gascon-Samson
ÉTS Montréal / Université du Québec
julien.gascon-samson@etsmtl.ca

Abstract

Federated learning has become a promising distributed learning concept with added assurance of data privacy. Various models of federated learning have been studied extensively since the term was coined. One important derivative is hierarchical semi-decentralized federated learning (SDFL), which distributes the aggregation load over multiple nodes and parallelizes the aggregation workload across the breadth of each level of the hierarchy. Various methods have also been proposed to perform inter-cluster and intra-cluster aggregation optimally. Most of these solutions nonetheless require monitoring the nodes’ performance and resource consumption at each round, which necessitates frequent exchange of systematic data. To perform distributed aggregation in SDFL optimally with minimal reliance on systematic data, we propose Flag-Swap, a Particle Swarm Optimization (PSO) method that optimizes the aggregation placement according only to the processing delay. Our simulation results show that PSO-based placement can find the optimal placement relatively fast, even in scenarios with many clients as candidates for aggregation. Our real-world Docker-based implementation of Flag-Swap on a recently emerged FL framework outperforms black-box-based deterministic placement strategies: in terms of total processing time, it is about 43% faster than random placement and about 32% faster than uniform placement.

Index Terms:
Distributed Systems, Federated Learning, Aggregation, Task Placement, Swarm Intelligence, Black-box Optimization

I Introduction

Federated Learning (FL) has emerged as a revolutionary approach to distributed machine learning within Internet of Things (IoT) ecosystems [1, 2]. With the rapid expansion of IoT devices, vast amounts of decentralized data are being generated at the network edge, posing significant challenges for traditional centralized learning methods. These conventional approaches require transferring data to central servers, leading to high bandwidth costs, increased latency, and serious privacy concerns. In contrast, FL facilitates collaborative model training directly on edge devices, allowing them to contribute to a shared global model without transmitting raw data [3]. This capability is particularly beneficial for IoT environments, where efficient bandwidth utilization, enhanced privacy, and real-time responsiveness are crucial. By minimizing data transmission, preserving data privacy, and optimizing edge computational resources, FL effectively overcomes key limitations of centralized learning [4, 5].

The key components of FL ecosystems are the aggregators: nodes that collect model parameters or their gradients from the individual nodes and combine them using various aggregation methods. The aggregation yields a new set of model parameter values that is presumed to represent the features learned from all the contributing nodes’ data. Various FL schemes exist, which depend heavily on the underlying network topology. There are three main categories: 1) Central FL (CFL) is the conventional FL model, which is based on the client/server communication model and follows a star topology: one central unit (i.e., parameter server or aggregation server) is responsible for performing the global model update, and all contributing clients send their model parameters to that central unit. 2) Fully Decentralized FL (DFL) follows a P2P communication method, with no central unit dedicated to aggregation. Instead, model parameters are aggregated after each hop at the destination client machine. 3) Semi-Decentralized FL (SDFL) is a hybrid between CFL and DFL, in which the aggregation load is spread over multiple machines, and the aggregator machines deliver the aggregation either synchronously or asynchronously, with mutual agreement on the global model updates. This FL model promises parallelism while avoiding a single point of failure: because the aggregation is distributed across multiple nodes, the system is more resilient to node failures or connectivity issues. One well-known SDFL model is Hierarchical SDFL, in which the aggregation is spread not only spatially across the breadth of each hierarchy level but also temporally between hierarchy levels [24]. On top of the advantages of SDFL, Hierarchical SDFL promises scalability, reduced computation bottlenecks, and better adaptation to system constraints. From here on, we refer to Hierarchical SDFL as SDFL.

One of the key challenges in SDFL is finding a set of suitable machines to act as aggregators. The criteria for choosing an aggregator machine can depend on several parameters, including key systematic parameters such as the availability of the machine, its computation resources, and its communication bandwidth. Several methods have been proposed [8] that use different sets of parameters to define the criterion for finding or optimizing a suitable aggregation site. Nonetheless, most of these methods require the contributing clients to inform the coordinator of their internal performance, which can cause network congestion if such data is requested frequently, or violate the privacy of the contributing clients. In contrast, task placement methods based on black-box system optimization exist but have seldom been applied to SDFL. Such an optimizer can search for an optimal aggregation placement while avoiding the transmission and additional processing of the client machines’ internal performance data that a supervised optimization would need. In this paper, we investigate the efficacy of such optimizers. Specifically, we propose using particle swarm optimization (PSO) to progressively improve the placement of aggregation. We show that PSO can improve the aggregation placement using only the global processing delay at each FL round. We also demonstrate that PSO imposes marginal computational complexity, provided that an FL framework supporting hierarchical FL is used, which makes the optimizer a suitable candidate for constrained systems at the edge. Our contributions are as follows:

  • A black-box PSO-based aggregation placement for SDFL

  • Evaluation of the efficacy of the optimizer in various simulated SDFL scenarios with different numbers of clients and varied depth and width in the hierarchy model.

  • Evaluation of the efficacy of the optimizer in a real SDFL ecosystem based on MQTT communication deployed on docker containers.

  • Comparison with random placement and uniform placement based on round-robin algorithm.

The rest of the paper is as follows: Section II presents the motivation behind employing a black-box optimizer in an SDFL based on the Publish/Subscribe communication model. Section III explains the key features and the mechanism of the proposed optimizer for aggregation placement. Section IV describes the experimental setup and the experimental results using both simulation and real-world deployment. Section V is a discussion of the related works. Section VI concludes the paper.

II Motivation

A commonplace communication model for SDFL would be the Client/Server model similar to CFL [7]. While this model is effective for systems with substantial computational resources and stable network connections, it is not well-suited for environments with resource-constrained devices, such as those found in IoT networks. In such scenarios, dynamic role management, where devices alternate as aggregators to mitigate overload and device exhaustion, becomes essential. Implementing this in a client/server architecture would require complex mechanisms for dynamic role assignment. Alternatively, a fully decentralized peer-to-peer (P2P) approach can ensure that aggregation roles are distributed effectively, though it incurs a training time overhead due to sequential communication.

Recently, the Publish/Subscribe communication model was proposed as an alternative to Client/Server [21]. The authors integrate a service that only requires a broker at the edge to disseminate the model updates, while the FL-specific roles are delegated to the devices that need the ML services. At the edge, therefore, the only role required is that of a generic message disseminator, which needs no adaptation to the FL process. For instance, if an MQTT broker is running as a service on an edge server, we can connect to it and establish the FL roles among the devices connected to the broker. This in turn helps set up the framework faster and at a lower installation cost.

Figure 1: Overview of Parameter sharing for aggregation using Pub/Sub communication in a Clustered Semi-Decentralized Federated Learning Topology.

SDFL over MQTT is a promising practice that provides simplified orchestration, avoids a single point of failure, and increases redundancy. Role association and role management in SDFL over MQTT, as described in [21], can be handled relatively easily compared to SDFL implementations using other FL frameworks. This is because in SDFLMQ, the framework described in [21], FL roles are associated with topics. Candidates for each role subscribe to their role’s topic, and clients that want to communicate with a node holding a specific role publish to that role’s topic. The simplicity of role management in this SDFL model saves time and energy when changing the roles of FL actors during the FL process. Additionally, it leaves room for developing more sophisticated optimization algorithms.
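
The topic-based role association described above can be illustrated with a minimal in-memory sketch. It does not use the SDFLMQ API or a real MQTT client; the `Broker` class, the topic name, and the client IDs are hypothetical and only mimic publish/subscribe semantics to show why swapping a role reduces to a subscription change.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List


class Broker:
    """Toy in-memory stand-in for an MQTT broker (hypothetical, not SDFLMQ code)."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subs[topic].append(handler)

    def unsubscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subs[topic].remove(handler)

    def publish(self, topic: str, payload: str) -> None:
        # Deliver the payload to every current subscriber of the topic.
        for handler in list(self._subs[topic]):
            handler(payload)


broker = Broker()
received: Dict[str, List[str]] = defaultdict(list)


def make_handler(client_id: str) -> Callable[[str], None]:
    def handler(payload: str) -> None:
        received[client_id].append(payload)
    return handler


handlers = {cid: make_handler(cid) for cid in ("c1", "c2")}

# Hypothetical role topic: whoever subscribes to it acts as that aggregator.
ROLE_TOPIC = "roles/aggregator/cluster0"

# c1 holds the aggregator role; a trainer publishes to the role, not to c1.
broker.subscribe(ROLE_TOPIC, handlers["c1"])
broker.publish(ROLE_TOPIC, "update-from-c3")

# Role swap: c1 leaves the topic, c2 joins it. Publishers are unaffected.
broker.unsubscribe(ROLE_TOPIC, handlers["c1"])
broker.subscribe(ROLE_TOPIC, handlers["c2"])
broker.publish(ROLE_TOPIC, "update-from-c4")
```

Because publishers address the role topic rather than a specific machine, a placement optimizer only needs to change who is subscribed to each role topic between rounds.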

Numerous techniques can be used for load balancing and task scheduling. However, in the context of SDFLMQ as described in [21], clients contribute to the FL process anonymously; that is, clients do not share any information about their internal status with the coordinator to register their candidacy for aggregation. This anonymity in turn enables further expandability and upholds clients’ data privacy. Nonetheless, as mentioned earlier, most load-balancing techniques for SDFL need to process clients’ systematic data to choose suitable sites for aggregation. To perform aggregation placement without requiring such data, one can incorporate black-box optimization techniques. These techniques optimize using only macro measurements of the entire system, such as the total processing delay or the total energy consumption. Black-box optimization methods include evolution strategies, Bayesian optimization, ant colony optimization, genetic algorithms (GA), swarm intelligence, and reinforcement learning [22, 23, 25].

While most of these algorithms are potentially applicable to aggregation placement in SDFL, PSO appears the most promising, mainly due to its convergence speed. Several studies have compared PSO to other algorithms such as GA and concluded that PSO offers better performance and convergence, whereas GA tends to converge prematurely [23]. Since we aim to optimize the aggregation placement with regard to total processing delay, a better-performing optimizer can lead to a better placement, which in SDFL means a lower total processing delay. Fast convergence also means that fewer trials are needed before all suggestions (i.e., particles in PSO) lead to a local/global best placement. This motivates implementing a placement optimizer for SDFL using PSO. In the following, we explain our PSO-based aggregation placement optimizer for SDFL over MQTT.

Figure 2: Proposed PSO-based aggregation placement in SDFL.

III Proposed Method

In our black-box PSO approach, clients do not share their internal performance metrics. The coordinator records the start and end time of each round and computes the processing delay as the difference between the two. This eliminates the need for each client to inform the coordinator of its internal performance or processing delay, thus significantly reducing the communication load while preserving the privacy of each client. The core objective of our method is to progressively minimize the total processing delay (TPD) of the FL rounds through the PSO optimization loop. Fig. 2 shows a general overview of aggregation placement in SDFL using PSO.
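
The coordinator-side measurement can be sketched as follows. This is a minimal illustration and not the Flag-Swap implementation: `timed_round` and `fake_round` are hypothetical names, and the sleeping function merely stands in for a real FL round. The point is that the only signal fed to the optimizer is the wall-clock duration of the round, negated so that a shorter round yields a higher fitness.

```python
import time
from typing import Callable


def timed_round(run_round: Callable[[], None]) -> float:
    """Return the processing delay of one round as end time minus start time."""
    start = time.perf_counter()
    run_round()  # blocks until the global model for this round is produced
    end = time.perf_counter()
    return end - start


def fake_round() -> None:
    # Hypothetical stand-in for training plus hierarchical aggregation.
    time.sleep(0.01)


delay = timed_round(fake_round)
fitness = -delay  # higher fitness corresponds to a shorter round
```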

To achieve optimal placement, we update the clients’ roles by efficiently arranging them as either trainers or aggregators before the beginning of each round. By leveraging the global search capabilities of PSO, the method explores a vast solution space of possible client arrangements and identifies configurations that lead to reduced latency, critical for scalability and real-time performance. Thus, at each round, after computing the processing delay of the previous round, PSO suggests a new arrangement according to its particles. The PSO particles are also updated after each PSO fitness round according to the local and global particle fitness values.

III-A Particle Swarm Optimization for Client Placement

We employ PSO to optimize the assignment of clients to aggregator roles within the hierarchy. In this formulation:

  • Particle Representation: Each particle represents a potential arrangement solution. Each element in the vector is a client ID assigned to an aggregator slot.

  • Swarm: A population of $P$ particles explores the solution space.

  • Velocity: Each particle has a velocity vector that dictates how its position changes in each iteration.

III-B Fitness Function

The quality of a client arrangement is evaluated using a fitness function based on the Total Processing Delay (TPD). The fitness $f$ of an arrangement is:

$$f = -T \tag{1}$$

where $T$ is the TPD of the corresponding FL round. By maximizing $f$, we effectively minimize $T$. This formulation captures the bottleneck effect at each hierarchy level, ensuring that the arrangement balances the computational load across the hierarchy.

III-C Optimization Loop

The PSO for aggregation placement in SDFL is initialized as follows:

  • A swarm of $P$ particles is initialized (e.g., $P = 10$).

  • The initial position of each particle is a random permutation of client IDs assigned to aggregator roles.

  • Initial velocities are set to zero.

  • The personal best position of each particle is its initial position, and the global best position is the position yielding the highest initial fitness.
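
The initialization steps above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the `Particle` dataclass and `init_swarm` are hypothetical names, and the example sizes (31 aggregator slots, 81 clients) are chosen only for illustration. Fitness values are left unmeasured here, because in the black-box setting each one requires evaluating the particle's placement; the global best is then the particle with the highest measured fitness.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Particle:
    position: List[int]        # client ID assigned to each aggregator slot
    velocity: List[float]      # one velocity component per slot
    best_position: List[int]   # personal best, initially the start position
    best_fitness: float = float("-inf")  # unknown until the placement is evaluated


def init_swarm(num_particles: int, num_slots: int, client_count: int,
               rng: random.Random) -> List[Particle]:
    """Create particles whose positions are random, duplicate-free client IDs."""
    swarm = []
    for _ in range(num_particles):
        # Sampling without replacement equals taking the first num_slots
        # entries of a random permutation of all client IDs.
        position = rng.sample(range(client_count), num_slots)
        swarm.append(Particle(position=position,
                              velocity=[0.0] * num_slots,
                              best_position=list(position)))
    return swarm


swarm = init_swarm(num_particles=10, num_slots=31, client_count=81,
                   rng=random.Random(0))
```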

The optimization loop steps are the following:

  1. Velocity Update:

    $$v_i^{t+1} = w \cdot v_i^t + c_1 \cdot r_1 \cdot (p_i - x_i^t) + c_2 \cdot r_2 \cdot (g - x_i^t) \tag{2}$$

    where:

    • $v_i^t$: Velocity vector of particle $i$ at iteration $t$.

    • $x_i^t$: Position of particle $i$ at iteration $t$.

    • $p_i$: Personal best position of particle $i$.

    • $g$: Global best position.

    • $w$: Inertia weight (e.g., 0.01).

    • $c_1$: Cognitive coefficient (e.g., 0.01).

    • $c_2$: Social coefficient (e.g., 1).

    • $r_1, r_2$: Random numbers in [0, 1].

    Velocity components are clamped to the interval $[-V_{\max}, V_{\max}]$, where:

    $$V_{\max} = \max\left(1,\, D \times \textit{velocity\_factor}\right) \tag{3}$$

    and $D$ is the number of dimensions in the search space. For example, a typical value is $\textit{velocity\_factor} = 0.1$.

  2. Position Update: The new position is computed as:

    $$x_i^{t+1} = (x_i^t + v_i^{t+1}) \mathbin{\%} \textit{client\_count} \tag{4}$$

    Duplicates are resolved by incrementing until a unique client ID is found.

  3. Hierarchy Rearrangement: After updating a particle’s position:

    • Clients are reassigned aggregator roles based on the updated particles.

    • Remaining clients are assigned trainer roles from a buffer of available labels.

  4. Iteration and Convergence: The algorithm iterates for $M$ steps, updating the personal and global bests whenever a better fitness value is found. In practice, by the end of these iterations the TPD has usually converged to a minimum value. The final global best position represents the best client placement found. Algorithm 1 shows the iterative process of swarm optimization.
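
As a worked illustration of the velocity, clamping, and position updates for a single dimension, consider the following hypothetical values, which are not taken from the experiments: current position $x = 12$, personal best $p = 20$, global best $g = 40$, previous velocity $v = 0$, random draws $r_1 = r_2 = 0.5$, a search space of 31 dimensions, 81 clients, and the example coefficients $w = 0.01$, $c_1 = 0.01$, $c_2 = 1$.

```latex
\begin{align*}
v^{t+1} &= 0.01 \cdot 0 + 0.01 \cdot 0.5 \cdot (20 - 12) + 1 \cdot 0.5 \cdot (40 - 12)
         = 0.04 + 14 = 14.04 \\
V_{\max} &= \max(1,\; 31 \times 0.1) = 3.1
         \quad\Rightarrow\quad v^{t+1} \text{ is clamped to } 3.1 \\
x^{t+1} &= (12 + 3.1) \mathbin{\%} 81 = 15.1
\end{align*}
```

Two things stand out in this sketch. With $c_2 = 1$ and $c_1 = 0.01$, the pull toward the global best dominates the update, and the clamp is what keeps the step small. Also, since positions are client IDs, the result has to be mapped to an integer; the text does not state how, so rounding to the nearest ID (15 here) is an assumption.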

Algorithm 1 PSO Algorithm for SDFL
Inputs:
 $DEPTH$, $WIDTH$, $pop_n$, $max\_iter$, $iw$, $c1$, $c2$, $velocity\_factor$
Initialization:
 Generate hierarchy with aggregators and trainers
 Create $pop_n$ particles with positions (client assignments)
 Compute initial fitness for each particle
Main Loop:
for $iteration \leftarrow 1$ to $max\_iter$ do
     for each particle $p$ do
          Update velocity using $iw$, $c1$, $c2$
          Update position based on velocity
          Rebuild hierarchy with new assignments
          Compute new fitness
          if new fitness better than $pbest$ then
               Update $pbest$
          if new fitness better than $gbest$ then
               Update $gbest$
Processing_Fitness Function:
 Traverse hierarchy bottom-up
 Compute memory consumption and delays per level
 Sum maximum delays across levels
 Return fitness, total delay
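
Algorithm 1 can be sketched in runnable form as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the fitness is passed in as a callable (in a deployment, each call would correspond to one measured FL round), `toy_fitness` is a hypothetical stand-in, random factors are drawn per dimension, positions are rounded to the nearest integer client ID, and duplicate resolution wraps around modulo the client count.

```python
import random
from typing import Callable, List, Sequence, Tuple


def pso_placement(fitness: Callable[[Sequence[int]], float],
                  num_slots: int, client_count: int,
                  pop_n: int = 10, max_iter: int = 100,
                  iw: float = 0.01, c1: float = 0.01, c2: float = 1.0,
                  velocity_factor: float = 0.1,
                  seed: int = 0) -> Tuple[List[int], float]:
    rng = random.Random(seed)
    v_max = max(1.0, num_slots * velocity_factor)

    # Initialization: random duplicate-free positions, zero velocities.
    xs = [rng.sample(range(client_count), num_slots) for _ in range(pop_n)]
    vs = [[0.0] * num_slots for _ in range(pop_n)]
    pbest = [list(x) for x in xs]
    pbest_fit = [fitness(x) for x in xs]
    best = max(range(pop_n), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = list(pbest[best]), pbest_fit[best]

    for _ in range(max_iter):
        for i in range(pop_n):
            used = set()
            for d in range(num_slots):
                r1, r2 = rng.random(), rng.random()
                vel = (iw * vs[i][d]
                       + c1 * r1 * (pbest[i][d] - xs[i][d])
                       + c2 * r2 * (gbest[d] - xs[i][d]))
                vel = max(-v_max, min(v_max, vel))        # clamp the velocity
                cid = round(xs[i][d] + vel) % client_count
                while cid in used:                        # resolve duplicates
                    cid = (cid + 1) % client_count
                used.add(cid)
                vs[i][d], xs[i][d] = vel, cid
            fit = fitness(xs[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = list(xs[i]), fit
            if fit > gbest_fit:
                gbest, gbest_fit = list(xs[i]), fit
    return gbest, gbest_fit


# Hypothetical toy fitness: faster clients in aggregator slots score higher.
speed_rng = random.Random(1)
speeds = [speed_rng.uniform(5, 15) for _ in range(81)]


def toy_fitness(placement: Sequence[int]) -> float:
    return -sum(1.0 / speeds[c] for c in placement)


best_placement, best_fit = pso_placement(toy_fitness, num_slots=31,
                                         client_count=81, max_iter=20)
```

Because the global best is only replaced by strictly better placements, the returned fitness is never worse than that of the best initial particle.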
(a) Client number 81, DEPTH: 3, WIDTH: 5, Num particles: 5
(b) Client number 406, DEPTH: 4, WIDTH: 5, Num particles: 5
(c) Client number 853, DEPTH: 5, WIDTH: 4, Num particles: 5
(d) Client number 81, DEPTH: 3, WIDTH: 5, Num particles: 10
(e) Client number 406, DEPTH: 4, WIDTH: 5, Num particles: 10
(f) Client number 853, DEPTH: 5, WIDTH: 4, Num particles: 10
Figure 3: Simulation results of PSO optimization in aggregation placement in SDFL.

IV Experimental Setup & Results

IV-A Simulation Model

We model the FL system as a hierarchical tree with a depth $D$ and a width $W$. The hierarchy comprises clients with two distinct roles:

  • Aggregators (Agtrainers): Nodes responsible for aggregating model updates from their child clients. Each aggregator maintains a processing buffer containing its children, which can be either trainers (for layer $D-1$) or other aggregators.

  • Trainers: Leaf nodes that perform local model training and send updates to their parent aggregators.

Each client $c_i$ is defined by the following attributes:

  • Memory capacity $\textbf{memcap}_i$: The memory capacity of the client.

  • Model data size $\textbf{mdatasize}_i$: The size of the model data processed by the client (fixed at 5 units in this study).

  • Processing speed $\textbf{pspeed}_i$: The computational speed of the client, randomly assigned between 5 and 15 units.

  • Client ID $\textbf{client\_id}_i$: A unique identifier for the client.

The hierarchy is constructed recursively starting from a root aggregator at level 0. For each level $l$ (where $0 \leq l < D-1$), an aggregator has $W$ child aggregators at level $l+1$. At the leaf level ($l = D-1$), each aggregator is assigned several trainers (e.g., 2 in our simulation model). The total number of aggregator positions, or dimensions, is computed as:

$$\textit{dimensions} = \sum_{i=0}^{D-1} W^{i} \tag{5}$$

This represents the number of slots in the hierarchy where clients can be assigned as aggregators. The fitness function $f$ is implemented as follows. We first use Breadth-First Traversal (BFT) to organize the hierarchy into levels, starting from the root. Then, we calculate the TPD by processing these levels from the bottom (leaf nodes) to the top (root). For each level, we determine the maximum cluster delay among all aggregators, and the TPD is the sum of these maximum delays across all levels. For an aggregator $a$, the cluster delay $d_a$ is defined as:

$$d_a = \frac{\textit{mdatasize}_a + \sum_{c \in \textit{children}(a)} \textit{mdatasize}_c}{\textit{pspeed}_a} \tag{6}$$

where $\textit{children}(a)$ denotes the set of clients in $a$’s processing buffer. The total processing delay (TPD) $T$ is:

$$T = \sum_{\textit{levels}} \max_{a \in \textit{level}} d_a \tag{7}$$
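
The fitness computation described above can be sketched as follows. This is an illustrative reading of the cluster-delay and TPD definitions, not the authors' simulator: the `Client` dataclass and function names are hypothetical, and the small three-level tree is made up for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class Client:
    client_id: int
    pspeed: float
    mdatasize: float = 5.0
    children: List["Client"] = field(default_factory=list)  # processing buffer


def cluster_delay(a: Client) -> float:
    """Own model size plus the children's model sizes, divided by own speed."""
    return (a.mdatasize + sum(c.mdatasize for c in a.children)) / a.pspeed


def levels_bfs(root: Client) -> List[List[Client]]:
    """Group nodes by depth with a breadth-first traversal from the root."""
    levels: List[List[Client]] = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == len(levels):
            levels.append([])
        levels[depth].append(node)
        for child in node.children:
            queue.append((child, depth + 1))
    return levels


def total_processing_delay(root: Client) -> float:
    """Sum, from the bottom level up, of the slowest aggregator per level."""
    total = 0.0
    for level in reversed(levels_bfs(root)):
        delays = [cluster_delay(n) for n in level if n.children]
        if delays:  # levels containing only trainers contribute nothing
            total += max(delays)
    return total


# Made-up example: a root, two aggregators, and two trainers per aggregator.
trainers = [Client(client_id=i, pspeed=10.0) for i in range(3, 7)]
agg_slow = Client(1, pspeed=5.0, children=trainers[:2])
agg_fast = Client(2, pspeed=15.0, children=trainers[2:])
root = Client(0, pspeed=10.0, children=[agg_slow, agg_fast])

tpd = total_processing_delay(root)  # max(15/5, 15/15) + 15/10 = 3.0 + 1.5
fitness = -tpd
```

In this example the slow aggregator dominates its level, which is the bottleneck effect the level-wise maximum is meant to capture.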

IV-B Simulation setup & results

A simulation was implemented featuring an SDFL system with a hierarchical structure of depth $D \in \{3, 4, 5\}$ and width $W \in \{4, 5\}$, constructed via breadth-first traversal to ensure balanced role distribution. Clients within this hierarchy are categorized as either aggregators or trainers. Each simulated client node has a processing buffer, an array that holds its child nodes; child nodes that are themselves aggregators hold their own non-empty processing buffers. Trainer nodes also have processing buffers, which remain empty. Trainers retain these buffers because their role might change later, potentially transitioning into an aggregator position. Each client is assigned random attributes, including memory capacity $10 < m < 50$ and processing speed $5 < ps < 15$ units, along with a uniform model data size fixed at 5. The PSO-based role assignment changes the position of simulated client nodes in the hierarchy, which in turn affects the TPD. Note that the Total Processing Delay (TPD) is calculated as the sum of the maximum cluster delays in each level of the hierarchical structure. The role adjustments aim to minimize the TPD across the system.
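
As a quick consistency check on the hierarchy sizes, the number of aggregator slots and the total number of clients can be computed for the simulated configurations. The sketch below assumes two trainers per leaf-level aggregator, as stated for the simulation model; the function names are illustrative.

```python
def aggregator_slots(depth: int, width: int) -> int:
    """Number of aggregator positions: the sum of width**i for i in 0..depth-1."""
    return sum(width ** i for i in range(depth))


def total_clients(depth: int, width: int, trainers_per_leaf: int = 2) -> int:
    """Aggregators plus the trainers attached to the leaf-level aggregators."""
    leaf_aggregators = width ** (depth - 1)
    return aggregator_slots(depth, width) + trainers_per_leaf * leaf_aggregators


for depth, width in [(3, 5), (4, 5), (5, 4)]:
    print(depth, width, aggregator_slots(depth, width), total_clients(depth, width))
# prints:
# 3 5 31 81
# 4 5 156 406
# 5 4 341 853
```

These totals match the client numbers reported in the panels of Fig. 3, and the slot counts (31, 156, 341) give the dimensionality of each particle.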

Optimization of client role assignments is achieved through PSO utilizing a swarm of $P \in \{5, 10\}$ particles, each representing a potential configuration of the hierarchical structure. Note that each particle indicates the position of the aggregator clients. Trainer clients are assigned randomly as terminal nodes under the aggregators. The PSO algorithm is configured with an inertia weight of 0.01 to favor exploitation, a cognitive coefficient ($c_1$) of 0.01 for stability with the small swarm size, and a social coefficient ($c_2$) of 1 to emphasize the influence of the global best solution. It iterates for 100 generations, with a velocity factor of 0.1.

Results of the aggregation placement using PSO in simulated SDFL are shown in Fig. 3. Each plot shows the normalized TPD with respect to PSO iterations. Grey curves show the processing delay per PSO particle, and the red, green, and orange curves show the worst, best, and average processing delay at each iteration step, respectively. The key observation here is the convergence of the TPD. As expected, the PSO particles drive the TPD toward a minimum value, up to the point where all particles suggest the same placement, which results in the global minimum TPD. The convergence of all particles to one placement is needed because, when a particle is used as a new placement in an FL round, there is no assurance that it will lead to a new minimum TPD. The only way to find out is to test the particle and calculate the TPD after the global model is produced for that round. Once the particles converge, we can be confident that the optimizer has searched the potential placements in the search space while heuristically progressing toward minimizing the TPD.

Moreover, PSO adapts well to an increasing number of clients, even though the dimensionality of the particles is high in cases with large numbers of clients. This can be seen by comparing Fig. 3 (a) with Fig. 3 (b) and Fig. 3 (c), and Fig. 3 (d) with Fig. 3 (e) and Fig. 3 (f). The last observation concerns the effect of increasing the number of particles: a larger number of particles can potentially find a better placement, leading to an even lower TPD value. This can be seen by comparing Fig. 3 (a) with Fig. 3 (d), Fig. 3 (b) with Fig. 3 (e), or Fig. 3 (c) with Fig. 3 (f).

IV-C Docker-based setup & results

To evaluate the applicability of PSO and its potential use in real systems, we integrated our implementation into the code of the SDFLMQ framework, which is publicly available at [16], and compared the performance of our method with the built-in placement strategies, namely random placement and uniform round-robin-based placement. We created one scenario with 10 Docker-container clients: one client with 2 GB of dedicated memory and 3 dedicated cores; two clients with 1 GB of dedicated memory, 1 GB of memory swap capacity, and 1 core each; and seven clients with 64 MB of dedicated memory, 2 GB of memory swap capacity, and 1 dedicated core each. We gave each client a multi-layer perceptron model with 1.8 million parameters, about 30 MB in size in JSON format, which is the format SDFLMQ uses to serialize model parameters and transmit them between nodes. We ran the scenario for 50 rounds and recorded the processing delay at each round as well as the total processing delay after 50 rounds. Fig. 4 shows the processing delay per round for the three placement strategies: random placement, uniform round-robin placement, and PSO-based placement. As can be seen, PSO-based placement converged after the 10th round. After convergence, PSO-based placement is between 20 and 30 seconds faster per round than random-based and uniform-based placement. The total processing time of PSO-based placement is also significantly better: around 30 minutes faster than random-based placement and around 20 minutes faster than uniform-based placement.

Overall, the evaluation results presented here suggest that PSO is a viable choice for selecting aggregation sites in semi-decentralized federated learning. Nonetheless, further development and study are needed to ensure that PSO adapts to varying SDFL topologies and changing system characteristics.

Figure 4: Comparing aggregation placement using Random, PSO-based, and Round-Robin-based placement in SDFLMQ

V Related Work

Various proposals have been made to use PSO for task scheduling and load balancing, both in the cloud and at the edge. Below are a few of the works most closely related to placement optimization at the edge.

One key aspect is computation offloading in Mobile Edge Computing (MEC). A study explored a PSO-based task offloading strategy for 5G-enabled Industrial Internet of Things (IIoT) environments, optimizing energy efficiency and latency by distributing tasks among heterogeneous edge servers [17]. The PSO approach was compared with Genetic Algorithm (GA) and Simulated Annealing (SA), demonstrating its advantages in reducing latency and balancing energy consumption [17].

Cloud computing task scheduling is another critical area. Researchers proposed a hybrid PSO-Genetic Algorithm (PSO-PGA) incorporating a phagocytosis mechanism to expand the search space and avoid local optima in cloud task scheduling [18]. The phagocytosis mechanism, inspired by biological immune responses, allows weaker solutions to be engulfed and replaced by stronger ones, thereby maintaining diversity and preventing premature convergence. The study demonstrated improved completion times and convergence accuracy compared to traditional PSO and GA approaches [18].

Another study introduced a novel task scheduling approach in cloud computing using Dynamic Dispatch Queues (TSDQ) combined with hybrid meta-heuristic algorithms [19]. Two variations, one using Fuzzy Logic with PSO (FLPSO) and another integrating Simulated Annealing with PSO (SAPSO), were tested. The results indicated that FLPSO significantly reduced waiting time, queue length, makespan, and execution cost, beating other state-of-the-art scheduling strategies [19].

Furthermore, edge aggregation and server placement in SDFL have been explored to address device association and resource allocation challenges. A study formulated an edge aggregation optimization problem and converted it into a dynamic optimization problem based on training loss degradation [9]. It introduced a Trilateral Matching-based Association (TMA) approach for efficient device association and resource allocation, which employs the classic Hungarian algorithm to derive the ideal matching set. Additionally, a Tabu Search-based Placement (TSP) approach was proposed to optimize the placement of edge servers. The combination of TMA and TSP in an iterative manner improved device participation reliability and edge aggregation efficiency [9].

An adaptive PSO-based scheduling approach (AdPSO) was also proposed to optimize task execution in cloud computing [20]. This study introduced a new inertia weight strategy called Linearly Descending and Adaptive Inertia Weight (LDAIW) to improve the balance between local and global search. Experimental results showed that AdPSO achieved up to a 10 % improvement in makespan, a 12 % improvement in throughput, and a 60 % improvement in resource utilization compared to existing PSO-based scheduling strategies [20].
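For readers unfamiliar with inertia weight schedules, the sketch below shows the classic linearly decreasing inertia weight often used in PSO. It is not the LDAIW strategy of [20], whose adaptive component is not reproduced here. The function name `linear_inertia` and the default bounds `w_start=0.9` and `w_end=0.4` are assumptions chosen for illustration.

```python
def linear_inertia(t, t_max, w_start=0.9, w_end=0.4):
    """Classic linearly decreasing inertia weight for iteration t in [0, t_max].

    A large early weight favors global exploration; a small late weight
    favors local refinement around the best positions found so far.
    """
    if t_max <= 0:
        raise ValueError("t_max must be positive")
    t = min(max(t, 0), t_max)  # clamp t to the schedule's range
    return w_start - (w_start - w_end) * t / t_max
```

In a PSO loop, the returned value would replace a fixed inertia coefficient in the velocity update at iteration `t`.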

Overall, existing research provides various optimization techniques for task scheduling and offloading in edge and cloud environments. However, open challenges remain in balancing energy consumption, latency, and computational efficiency in SDFL systems, necessitating further exploration of hybrid meta-heuristic algorithms as black-box optimizers.

VI Conclusion

In this paper, we explored the usability of PSO as a black-box optimizer for aggregation placement in hierarchical semi-decentralized federated learning. We argued that, compared to other meta-heuristics, PSO converges faster and more accurately. Our simulations and Docker-based implementation demonstrated that PSO efficiently optimizes client placement, reducing processing delay by balancing the aggregation load across levels. We showed that PSO adapts well to large numbers of clients and outperforms random and uniform placement methods. Future work will explore adapting PSO to continuous system variations, adaptive particle sizes, and the incorporation of additional parameters into the fitness function. We will maintain PSO as a black-box solution and compare it with other meta-heuristic and learning-based approaches.

References

  • [1] Nguyen, Dinh C, Ding, Ming, Pathirana, Pubudu N, Seneviratne, Aruna, Li, Jun, Poor, H Vincent, ”Federated learning for internet of things: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 23, no. 3, pp. 1622–1658, 2021.
  • [2] Zhang, Tuo, Gao, Lei, He, Chaoyang, Zhang, Mi, Krishnamachari, Bhaskar, Avestimehr, A Salman, ”Federated learning for the internet of things: Applications, challenges, and opportunities,” IEEE Internet of Things Magazine, vol. 5, no. 1, pp. 24–29, 2022.
  • [3] Lim, Wei Yang Bryan, Luong, Nguyen Cong, Hoang, Dinh Thai, Jiao, Yutao, Liang, Ying-Chang, Yang, Qiang, Niyato, Dusit, Miao, Chunyan, ”Federated learning in mobile edge networks: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 3, pp. 2031–2063, 2020.
  • [4] Ji, Xiuzhao, Tian, Jie, Zhang, Haixia, Wu, Dalei, Li, Tiantian, ”Joint device selection and bandwidth allocation for cost-efficient federated learning in industrial internet of things,” IEEE Internet of Things Journal, vol. 10, no. 10, pp. 9148–9160, 2023.
  • [5] Guo, Yinghao, Zhao, Zichao, He, Ke, Lai, Shiwei, Xia, Junjuan, Fan, Lisheng, ”Efficient and flexible management for industrial internet of things: A federated learning approach,” Computer Networks, vol. 192, pp. 108122, 2021.
  • [6] Bonawitz, Keith, ”Towards federated learning at scale: System design,” arXiv preprint arXiv:1902.01046, 2019.
  • [7] Beltrán, Enrique Tomás Martínez, Pérez, Mario Quiles, Sánchez, Pedro Miguel Sánchez, Bernal, Sergio López, Bovet, Gérôme, Pérez, Manuel Gil, Pérez, Gregorio Martínez, Celdrán, Alberto Huertas, ”Decentralized federated learning: Fundamentals, state of the art, frameworks, trends, and challenges,” IEEE Communications Surveys & Tutorials, 2023.
  • [8] Luo, Siqi, Chen, Xu, Wu, Qiong, Zhou, Zhi, Yu, Shuai, ”HFEL: Joint edge association and resource allocation for cost-efficient hierarchical federated edge learning,” IEEE Transactions on Wireless Communications, vol. 19, no. 10, pp. 6535–6548, 2020.
  • [9] Xu, Bo, Zhao, Haitao, Cao, Haotong, Garg, Sahil, Kaddoum, Georges, Hassan, Mohammad Mehedi, ”Edge aggregation placement for semi-decentralized federated learning in Industrial Internet of Things,” Future Generation Computer Systems, vol. 150, pp. 160–170, 2024.
  • [10] Ziller, Alexander, Trask, Andrew, Lopardo, Antonio, Szymkow, Benjamin, Wagner, Bobby, Bluemke, Emma, Nounahon, Jean-Mickael, Passerat-Palmbach, Jonathan, Prakash, Kritika, Rose, Nick, others, ”Pysyft: A library for easy federated learning,” Federated Learning Systems: Towards Next-Generation AI, pp. 111–139, 2021.
  • [11] He, Chaoyang, Li, Songze, So, Jinhyun, Zeng, Xiao, Zhang, Mi, Wang, Hongyi, Wang, Xiaoyang, Vepakomma, Praneeth, Singh, Abhishek, Qiu, Hang, others, ”Fedml: A research library and benchmark for federated machine learning,” arXiv preprint arXiv:2007.13518, 2020.
  • [12] Beutel, Daniel J, Topal, Taner, Mathur, Akhil, Qiu, Xinchi, Fernandez-Marques, Javier, Gao, Yan, Sani, Lorenzo, Li, Kwing Hei, Parcollet, Titouan, de Gusmão, Pedro Porto Buarque, others, ”Flower: A friendly federated learning research framework,” arXiv preprint arXiv:2007.14390, 2020.
  • [13] Lin, Frank Po-Chen, Hosseinalipour, Seyyedali, Azam, Sheikh Shams, Brinton, Christopher G, Michelusi, Nicolo, ”Semi-decentralized federated learning with cooperative D2D local model aggregations,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 12, pp. 3851–3869, 2021.
  • [14] Beltrán, Enrique Tomás Martínez, Gómez, Ángel Luis Perales, Feng, Chao, Sánchez, Pedro Miguel Sánchez, Bernal, Sergio López, Bovet, Gérôme, Pérez, Manuel Gil, Pérez, Gregorio Martínez, Celdrán, Alberto Huertas, ”Fedstellar: A platform for decentralized federated learning,” Expert Systems with Applications, vol. 242, pp. 122861, 2024.
  • [15] Sun, Yuchang, Shao, Jiawei, Mao, Yuyi, Wang, Jessie Hui, Zhang, Jun, ”Semi-decentralized federated edge learning with data and device heterogeneity,” IEEE Transactions on Network and Service Management, vol. 20, no. 2, pp. 1487–1501, 2023.
  • [16] ”SDFLMQ Python Source code,” 2025.
  • [17] You, Qian, Tang, Bing, ”Efficient task offloading using particle swarm optimization algorithm in edge computing for industrial internet of things,” Journal of Cloud Computing, vol. 10, pp. 1–11, 2021.
  • [18] Fu, Xueliang, Sun, Yang, Wang, Haifang, Li, Honghui, ”Task scheduling of cloud computing based on hybrid particle swarm algorithm and genetic algorithm,” Cluster Computing, vol. 26, no. 5, pp. 2479–2488, 2023.
  • [19] Ben Alla, Hicham, Ben Alla, Said, Touhafi, Abdellah, Ezzati, Abdellah, ”A novel task scheduling approach based on dynamic queues and hybrid meta-heuristic algorithms for cloud computing environment,” Cluster Computing, vol. 21, no. 4, pp. 1797–1820, 2018.
  • [20] Nabi, Said, Ahmad, Masroor, Ibrahim, Muhammad, Hamam, Habib, ”AdPSO: adaptive PSO-based task scheduling approach for cloud computing,” Sensors, vol. 22, no. 3, pp. 920, 2022.
  • [21] Amir Ali-Pour, Julien Gascon-Samson, ”SDFLMQ: A Semi-Decentralized Federated Learning Framework over MQTT,” arXiv preprint, 2025.
  • [22] Meunier, Laurent, Rakotoarison, Herilalaina, Wong, Pak Kan, Roziere, Baptiste, Rapin, Jérémy, Teytaud, Olivier, Moreau, Antoine, Doerr, Carola, ”Black-box optimization revisited: Improving algorithm selection wizards through massive benchmarking,” IEEE Transactions on Evolutionary Computation, vol. 26, no. 3, pp. 490–500, 2021.
  • [23] Boveiri, Hamid Reza, Khayami, Raouf, ”On the performance of metaheuristics: A different perspective,” arXiv preprint arXiv:2001.08928, 2020.
  • [24] Liu, Lumin, Zhang, Jun, Song, SH, Letaief, Khaled B, ”Client-edge-cloud hierarchical federated learning,” pp. 1–6, 2020.
  • [25] Auger, Anne, Hansen, Nikolaus, Perez Zerpa, Jorge M, Ros, Raymond, Schoenauer, Marc, ”Experimental Comparisons of Derivative Free Optimization Algorithms: (Invited Talk),” pp. 3–15, 2009.