Enhancing MPI remote memory access model for distributed-memory systems through one-sided broadcast implementation

Efficiently processing vast and expanding data volumes is a pressing challenge. Traditional high-performance computers, built on a distributed-memory architecture and a message-passing model, grapple with synchronization overhead, hampering their ability to keep up with growing demands. Remote Memory Access (RMA), often referred to as one-sided MPI communication, offers a solution by allowing a process to directly access another process's memory, eliminating the need for explicit message exchange and significantly boosting performance. Unfortunately, the existing MPI RMA standard lacks a collective operation interface, limiting efficiency. To overcome this constraint, we introduce an algorithm design that enables efficient, parallelizable collective operations within the RMA framework. Our study focuses primarily on the advantages of collective operations, using the broadcast algorithm as a case study. Initial performance tests indicate that our implementations surpass traditional methods, highlighting the promising potential of this technique.


Introduction
High-performance computing (HPC) is constantly evolving to meet the demands of managing and transmitting vast amounts of data in distributed environments. A notable development in this field is the Remote Memory Access (RMA) model, also known as one-sided communications, within the Message Passing Interface (MPI) framework. RMA calls offer an efficient solution for data distribution and communication in distributed-memory systems, enhancing performance by reducing the need for explicit synchronization [1].
RMA operations enable processes to autonomously access memory regions in remote processes, reducing synchronization overhead and improving system efficiency. With the rising demands for computational power and data-intensive applications, realizing the potential of the MPI RMA model becomes crucial for optimizing distributed-memory systems [2].
Furthermore, adopting RMA operations leads to substantial cost savings by eliminating expenses related to tag matching, managing premature message arrivals, and handling buffer complexities in point-to-point message passing systems [3].
However, maximizing the MPI RMA model's benefits comes with challenges that require a deep understanding of both its theory and practical implementation. These challenges include preserving data integrity, efficient synchronization management, and mitigating network latency issues [2].
In many cases, program execution involves numerous identical Remote Memory Access (RMA) operations across memory segments [3][4][5][6]. Unfortunately, the existing MPI standard limits users to straightforward linear algorithms, which are suboptimal when the program logic allows parallel data access. Our research aims to develop collective operation algorithms within the RMA framework, anticipated to significantly improve performance and reduce energy consumption in MPI programs [7].
To illustrate these concepts, we focus on the widely used broadcast (one-to-all) algorithm within communication protocols. We select this algorithm for its clarity and ease of implementation, simplifying analysis and debugging processes [8].
Designing collective operations in the MPI RMA model presents unique challenges distinct from conventional message-passing interfaces. Data synchronization is a particular challenge, requiring processes to harmonize interactions and data access in remote memory during RMA-based collective operations. Simultaneous access and modification of shared data by multiple processes make efficient, conflict-free data synchronization essential [9].
PGAS languages, including Cray Chapel, IBM X10, and Unified Parallel C, belong to the category of modern programming languages that enable developers to create parallel applications without the need for explicit management of inter-process communication [10][11][12]. Unlike MPI programs, PGAS programs do not involve direct calls to communication functions. Instead, they work with distributed data structures and employ constructs for parallel task management (such as threads and activities) and synchronization. Communication is scheduled by the compiler and executed by the runtime system, which offers seamless access to the memory of remote nodes [10,11,12].
The high level of abstraction provided by the PGAS model simplifies the development of parallel programs but necessitates the creation of efficient compilation optimization methods. In essence, PGAS languages represent a novel approach to parallel programming, streamlining the process by automating communication between processes. However, this automation places a greater onus on the compiler to optimize the code for efficient execution [13].
Our study delves into the MPI model for collective operations, exploring its characteristics, potential advantages and disadvantages, and strategies to overcome its challenges. We also investigate future advancements and prospects of the MPI RMA model for collective operations, complemented by real-world applications [14].
Our goal is to provide a comprehensive guide bridging the gap between theory and practical implementation, enabling researchers and practitioners to maximize the MPI RMA model's potential, enhancing the performance and scalability of their HPC applications on distributed memory systems.
To our knowledge, no prior collective algorithms have been proposed for the MPI one-sided interface (RMA model) [9,14]. However, in RMA-based programs, particularly those using passive target synchronization, certain processes need to access specific data in the memories of all other processes. Established collective communication patterns, such as all-to-all, all-to-one, and one-to-all, are required. Despite the absence of collective operations in the current MPI one-sided interface, we anticipate a growing demand for this capability in forthcoming MPI standards, especially in future-generation HPC systems. These collectives are expected to substantially reduce communication complexity, resulting in a notable reduction in overall execution time.

Related works
Historically, extensive research efforts have been dedicated to collective algorithms. Numerous algorithms have been devised for each collective type, each tailored to address specific considerations related to message sizes and the number of processes [14,15].
In contrast to standard MPI collectives, which maintain multiple copies of replicated data for on-node processes, the collectives of [16,17] adopt a more efficient approach by retaining a single copy of shared data. To implement these collectives, MPI libraries and other parallel programming models employ diverse algorithms, including the ring algorithm, recursive doubling, recursive halving, the J. Bruck algorithm [17], and algorithms that organize processes in various tree structures, such as balanced k-trees, flat trees (linear trees), and pipelines (k-chains) [21][22][23].
The execution times of the aforementioned algorithms differ, predicated on the assumption of uniform communication channels among computational components. However, contemporary systems encompass multiple architectures and exhibit a hierarchical structure, wherein the duration of data transmission between two processor components hinges on their respective positions within the system [21,22]. Consequently, a pertinent challenge pertains to the development of topology-aware collectives, which can accommodate the multi-architecture and hierarchical nature of modern HPC systems.

A one-to-all strategy
The operations MPI_Bcast and MPI_Ibcast are implemented using a range of algorithms, including the binomial tree algorithm, the k-chain tree algorithm, the binary tree algorithm, the flat/linear tree method, and an algorithm that combines MPI_Scatter (binomial tree) with MPI_Allgather (ring, recursive doubling). Additionally, the binomial tree algorithm, the flat/linear tree method, and MPI_Scatter, MPI_Iscatter, and MPI_Scatterv are employed to execute these operations.
To the best of our research findings, no published techniques have employed the MPI one-sided interface for collective operations. Nevertheless, certain processes within RMA programs may insert or extract specific data into or from the memory of all other processes, especially in the context of passive target synchronization. This corresponds to the well-established collective communication methods of all-to-all, all-to-one, and one-to-all. In such scenarios, a straightforward linear algorithmic approach (with a complexity of O(n)) does not yield optimal results.
Although the current MPI one-sided interface lacks support for collective operations, we anticipate a high demand for this feature in upcoming MPI standards, particularly with the advent of next-generation high-performance computing systems. We expect that these collective operations will significantly reduce communication overhead (in comparison to linear algorithms), thereby leading to a notable improvement in overall execution time.

RMA Broadcast
Let p be the number of MPI processes in the MPI communicator. Let win denote an RMA window, b a buffer of size n allocated on each MPI process, and r the root process.
In the traditional message-passing model, the broadcast operation, a one-to-all communication scheme, plays a pivotal role in collective communication. In this process, a single process, known as the "root," is responsible for sending a message to all other processes within a designated group, typically defined by an MPI communicator. The root process is tasked with supplying the data to be broadcast, while the remaining processes, referred to as "non-root" or "receiver" processes, are recipients of the broadcast data.
However, in the context of RMA, as illustrated in Figure 1, the dynamics of a broadcast operation differ slightly. Here, the root process takes the lead by depositing the data into the memories of all processes within the communicator. The non-root processes, serving as receivers, invoke a synchronization routine to ensure that the operation is complete before accessing the memory window to retrieve the broadcast data. This synchronization mechanism ensures that the non-root processes refrain from accessing the memory window until the root process has populated it with the intended data.
We have implemented several RMA broadcast algorithms, shown in figure 2: Binomial, Binary Tree, and Linear. In our broadcast strategy, each root allocates space in memory for the broadcast data, which can be shared by its descendants. Adhering to MPI broadcast semantics, only the root process is allowed to modify the broadcast data; however, through a local pointer to the beginning of this shared memory area, any process on the same node can independently access it. Since the size of the broadcast message remains consistent with that of the pure MPI broadcast, as shown in figure 3, performing the across-node broadcast operation across all the roots becomes straightforward in this scenario [13,24].
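To make these mechanics concrete, the following is a minimal sketch of the linear variant of this scheme, written against the standard MPI-3 RMA interface. It assumes passive-target synchronization (MPI_Win_lock/MPI_Win_unlock) and uses a barrier as a simple, though not the cheapest, completion signal; the function name rma_bcast_linear and the overall structure are illustrative, not our exact implementation.

```c
/* A minimal sketch (not the paper's exact code) of the linear RMA broadcast:
 * the root deposits its buffer into every other process's window with MPI_Put
 * under passive-target synchronization; a barrier signals completion. */
#include <mpi.h>

void rma_bcast_linear(void *buf, int count, MPI_Datatype dtype,
                      int root, MPI_Win win, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        for (int dst = 0; dst < size; dst++) {
            if (dst == root) continue;
            MPI_Win_lock(MPI_LOCK_SHARED, dst, 0, win);   /* open an access epoch */
            MPI_Put(buf, count, dtype, dst, 0, count, dtype, win);
            MPI_Win_unlock(dst, win);                     /* complete the put     */
        }
    }
    /* Non-root processes must not read their window before the root is done;
     * a barrier is the simplest, though not the cheapest, completion signal. */
    MPI_Barrier(comm);
}
```

With p processes this performs p − 1 sequential puts from the root, which is exactly the linear cost TTotal = T · (p − 1) that the tree-based variants below are designed to avoid.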

Binary Tree Algorithm
At the first round (figure 6), process 0 puts the buffer b into the memory of processes 1 and 2; at the second round, process 1 puts the buffer into the memory of processes 3 and 4; at the third round, process 2 puts the message into the buffers of processes 5 and 6. Finally, at the fourth round, process 3 puts the message into the memory of process 7 [13]. The height of the binary tree is log2(p), so TTotal = T · log2(p); at round i the maximum number of transmitting processes is 2^i.
With each round, the number of broadcasting nodes doubles; as a result, there are fewer transmission stages [25]. Reaching the last process (for instance, rank 7 in figure 6) requires log2(p) communication rounds. The data movement (figure 9c, line 8) proceeds as follows: compute child1 (line 7), then assign src_buf to dest_buf for that rank (line 9); otherwise, repeat the same steps for child2. A code sketch of this level-by-level forwarding is given below.
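As a rough illustration of the pattern described above, the sketch below forwards the window contents level by level through a binary tree rooted at rank 0: a process at tree level i puts the buffer into the windows of its children (ranks 2·rank+1 and 2·rank+2) in round i. For brevity it uses fence synchronization to delimit the rounds rather than the passive-target epochs discussed earlier, and the names rma_bcast_binary, tree_level, and winbuf are ours; the actual routine may organize the rounds and epochs differently.

```c
#include <mpi.h>

/* Depth of a rank in a binary tree rooted at rank 0. */
static int tree_level(int rank) {
    int lvl = 0;
    while (rank > 0) { rank = (rank - 1) / 2; lvl++; }
    return lvl;
}

/* Binary-tree RMA broadcast sketch: the root's data starts in its window
 * buffer (winbuf); each level forwards it to the next level with MPI_Put. */
void rma_bcast_binary(void *winbuf, int count, MPI_Datatype dtype,
                      MPI_Win win, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int height   = tree_level(p - 1);     /* number of forwarding rounds        */
    int my_level = tree_level(rank);

    MPI_Win_fence(0, win);                /* open the first exposure epoch      */
    for (int round = 0; round < height; round++) {
        if (my_level == round) {          /* received the data in a prior round */
            int left  = 2 * rank + 1;
            int right = 2 * rank + 2;
            if (left  < p) MPI_Put(winbuf, count, dtype, left,  0, count, dtype, win);
            if (right < p) MPI_Put(winbuf, count, dtype, right, 0, count, dtype, win);
        }
        MPI_Win_fence(0, win);            /* this round's puts are now visible  */
    }
}
```

Because every level forwards in parallel, the critical path is the tree height rather than the p − 1 sequential puts of the linear scheme.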

Binomial Tree Algorithm
Let us describe our Binomial tree algorithm using MPI RMA. The algorithm consists of several rounds. Consider an example for p = 8. At the first round (figure 8), process 0 puts the buffer into the memory of process 1 (as shown in figure 1); by the second round, process 0 has put the buffer into the memories of processes 1 and 2. Finally, by the third round, process 0 has put the buffer into the memories of processes 1, 2 and 4; in addition, process 1 has put the buffer into the memories of processes 3 and 5, process 2 has put the buffer into the memory of process 6, and process 3 has put the buffer into the memory of process 7 [15,26].
With each round, the number of broadcasting nodes doubles; as a result, the number of transmission stages drops from p − 1 to log2(p), and TTotal = T · log2(p). There are two fundamental drawbacks of broadcasting down a binomial tree; first, when the communicator size is not an exact power of two, the communication time is clearly unbalanced [13]. Reaching the last process (for instance, rank 7 in figure 8) requires log2(p) communication rounds. The comp_srank algorithm computes the shifted rank srank; this rank is relative to the root (line 1). RMABcastBinomial writes the data as follows (figure 9b), as coded in the sketch after this list:
1. Compute the shifted rank srank using figure 7; this rank is relative to the root (line 1).
2. The process starts a loop over log2(p) rounds (lines 3-16).
3. Begin a synchronization epoch for each process (figure 9c, line 1).
4. If srank (line 2) is not set, compute the target rank (line 7) and assign src_buf to dest_buf for that rank (line 9); otherwise, break out of the loop (line 13).
5. Synchronize the private and public window copies of win (figure 9c, line 5).
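The sketch below captures this shifted-rank binomial scheme in standard MPI-3 RMA calls. The name comp_srank follows the text above; the remaining identifiers are ours, the line numbers in the list refer to the paper's figures rather than to this code, and fence synchronization is used to delimit the rounds purely for brevity. It should be read as an illustration of the pattern, not the exact routine of figure 9.

```c
#include <mpi.h>

/* Shifted rank: the root is mapped to 0 so the doubling pattern is uniform. */
static int comp_srank(int rank, int root, int p) {
    return (rank - root + p) % p;
}

/* Binomial-tree RMA broadcast sketch: in round k (mask = 2^k), every process
 * whose shifted rank is below mask puts the window buffer to srank + mask. */
void rma_bcast_binomial(void *winbuf, int count, MPI_Datatype dtype,
                        int root, MPI_Win win, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int srank = comp_srank(rank, root, p);

    MPI_Win_fence(0, win);                         /* open the first epoch      */
    for (int mask = 1; mask < p; mask <<= 1) {     /* about log2(p) rounds      */
        if (srank < mask && srank + mask < p) {
            int dst = (srank + mask + root) % p;   /* map shifted rank back     */
            MPI_Put(winbuf, count, dtype, dst, 0, count, dtype, win);
        }
        MPI_Win_fence(0, win);                     /* this round's puts visible */
    }
}
```

On the root, winbuf is assumed to already contain the broadcast data; every other process receives it into the same window buffer and forwards from there in later rounds.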

Experimental evaluation
In our effort to evaluate the RMA broadcast efficiency, we conducted a series of benchmarks on a cluster of 5 HP XL250a Gen9 blade servers, each containing two 12-core Intel Xeon E5-2680v3 processors clocked at 2500 MHz and 192 GB of RAM. The MPI library used was Open MPI 4.1.0. The benchmark was constructed to simulate realistic computational loads representative of our research scenarios. We subjected the cluster to tests involving 1000 messages spread across eight different data packets. The packet sizes varied in a binary progression, starting from a minimum of 16 bytes and doubling sequentially until reaching a maximum of 33 MB.
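For reference, a minimal timing harness in the spirit of this setup might look like the following. The repetition count and the doubling size progression follow the description above; the broadcast routine rma_bcast_linear (from the earlier sketch), the 33 MB cap written as 32 MiB, and the window setup are our own assumptions rather than the exact benchmark code used in the experiments.

```c
/* Hypothetical latency benchmark: times REPS broadcasts per message size,
 * with sizes doubling from 16 bytes up to ~33 MB, and reports the average. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define REPS 1000

/* Any of the RMA broadcast sketches from the previous sections. */
void rma_bcast_linear(void *buf, int count, MPI_Datatype dtype,
                      int root, MPI_Win win, MPI_Comm comm);

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t max_size = 33554432;                 /* assumed ~33 MB cap */
    for (size_t size = 16; size <= max_size; size *= 2) {
        char *buf;
        MPI_Win win;
        MPI_Win_allocate((MPI_Aint)size, 1, MPI_INFO_NULL,
                         MPI_COMM_WORLD, &buf, &win);
        if (rank == 0)
            memset(buf, 1, size);                     /* root fills the payload */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++)
            rma_bcast_linear(buf, (int)size, MPI_BYTE, 0, win, MPI_COMM_WORLD);
        double avg = (MPI_Wtime() - t0) / REPS;

        if (rank == 0)
            printf("%zu bytes: %.3f us per broadcast\n", size, avg * 1e6);
        MPI_Win_free(&win);
    }
    MPI_Finalize();
    return 0;
}
```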
Latency Benchmark: the primary objective of this test was to measure the latency of the HPC system. This was achieved by evaluating the time it takes for a single message of variable size (from 16 bytes to 33 MB) to travel from one node to another. This test was run 1000 times for each message size to generate average values, mitigating any anomalous results that might skew our assessment. Figure 10 shows the transfer times for the Binomial tree (shared), Binary tree (shared), Linear, Binary tree (one-sided), and Binomial tree (one-sided) algorithms for different data sizes, using the MPI Remote Memory Access Model. For large data transfers (>33 MB), the Binomial tree (shared) algorithm is the fastest on all node counts, followed by the Binary tree (shared) algorithm. The Linear algorithm is the slowest for large data transfers on smaller node counts (2-8 nodes). The Binary tree (one-sided) and Binomial tree (one-sided) algorithms are the slowest for large data transfers on larger node counts (9+ nodes).
For example, on 8 nodes, the Binomial tree (shared) algorithm is 7.5% faster than the Binary tree (shared) algorithm and 90% faster than the Linear algorithm for a 33 MB data transfer. The performance difference between the Binomial tree (shared) and Binary tree (shared) algorithms decreases as the number of nodes increases, but the Binomial tree (shared) algorithm remains the fastest algorithm on smaller node counts.
The Linear algorithm is the slowest algorithm for large data transfers on smaller node counts because it requires more communication rounds than the other algorithms. Additionally, the Linear algorithm may be more sensitive to the latency of the network, which can explain why it performs worse on smaller node counts. In the shared-memory model context, the Binomial tree algorithm outshines the other algorithms, demonstrating optimal performance. The Binary tree follows closely, showing nearly equivalent transfer times to the Binomial tree, thus indicating its potential efficacy for data transfers of 0.5 MB.
For example, on 4 nodes, the Binomial tree (shared) algorithm is 22% faster than the Binary tree (shared) algorithm and 87% faster than the Linear algorithm for a 33 MB data transfer. The performance difference between the Binomial tree (shared) and Binary tree (shared) algorithms decreases as the number of nodes increases, but the Binomial tree (shared) algorithm remains the fastest on smaller node counts. On the contrary, the performance of the Linear algorithm paints a different picture: for medium-sized messages it is the least efficient among the three algorithms, trailing significantly behind both the Binary (shared) and Binomial (shared) trees, which suggests it might not be the optimal selection for managing larger data volumes. Figure 12 uncovers a compelling pattern related to the transmission of small messages.
Figure 12 shows the transfer times for the Binomial tree (shared), Binary tree (shared), Linear, Binary tree (one-sided), and Binomial tree (one-sided) algorithms for 8KB data transfers, using the MPI Remote Memory Access Model.
For small data transfers, the Binomial tree (shared) algorithm is the fastest on all node counts, followed by the Linear and Binary tree (shared) algorithms. The Binary tree (one-sided) and Binomial tree (one-sided) algorithms are the slowest.
For example, on 8 nodes, the Binomial tree (shared) algorithm is 5.5% faster than the Binary tree (shared) algorithm and 75% faster than the Binary tree (one-sided) algorithm for an 8 KB data transfer. The Linear algorithm is slower than the Binomial tree (shared) algorithm because it requires more communication rounds. Additionally, the Binary tree (one-sided) and Binomial tree (one-sided) algorithms may be more sensitive to the latency of the network, which can explain why they perform worse on higher node counts. Figure 13 shows the transfer times for the Binomial tree (shared), Binary tree (shared), Linear, Binary tree (one-sided), and Binomial tree (one-sided) algorithms for small data transfers larger than 1 KB, using the MPI Remote Memory Access Model.
For small data transfers larger than 1 KB, the Binomial tree (shared) algorithm and the Binary tree (shared) algorithm perform similarly on all node counts. The Linear algorithm is slightly slower than the Binomial tree (shared) and Binary tree (shared) algorithms, but it is still faster than the Binary tree (one-sided) and Binomial tree (one-sided) algorithms. For example, on 8 nodes, the Binomial tree (shared) and Binary tree (shared) algorithms have almost the same transfer time for a 1 KB data transfer. For that same 1 KB transfer, the Linear algorithm is about 68% slower than the Binomial tree (shared) and Binary tree (shared) algorithms, the Binary tree (one-sided) algorithm is about 73% slower, and the Binomial tree (one-sided) algorithm is about 83% slower.

Conclusion
This study introduces a hybrid solution that seamlessly integrates message-forwarding (across-node) and shared-memory (on-node) techniques. This development is a response to the widespread adoption of multi-core technology and presents a significant advantage in reducing memory consumption, particularly in data-intensive operations.
In this paper, we have presented a novel approach centered around the Binomial tree algorithm for MPI_Broadcast operations. This approach harnesses shared memory, and we have constructed performance models through a series of communication experiments; the applicability of this approach has been extended to Open MPI. For larger messages, particularly those up to 33 MB in size, our experiments have demonstrated that the Binomial (shared) and Binary (shared) tree algorithms offer the highest efficiency. As both the number of processes and data sizes increase, the Binomial tree algorithm integrated with the shared memory approach emerges as a strongly recommended choice.
Our future endeavors include a continued refinement of this approach, with the integration and testing of additional algorithms. The ultimate objective is to pinpoint the optimal algorithm for handling Broadcast messages within shared memory models on MPI. This research represents a crucial step forward in achieving more efficient and effective methods for data transfer, significantly contributing to the overarching goal of optimizing large-scale data operations.

Figure 1 .
Figure 1. RMA Put operation in progress. Process 0 writes data to the memory of process 1.

Figure 3 .
Figure 3. Broadcast operation scheme.
3.1. Sequential Linear algorithm. The root r puts the buffer b into the windows of all other ranks: process 0 puts the buffer into the memory of processes 1, 2, 3, ..., p − 1. For a communicator of size 8, the total transfer time is TTotal = T · (p − 1) [11,12].

Figure 9 .
Figure 9. Broadcast algorithms for a distributed non-blocking queue (a - RMA Bcast Binomial, b - Input loop, c - Move data into the shared array, d - Compute shifted rank method).

Figure 10 .
Figure 10. Performance of Binomial Tree, Linear, and Binary Tree (one-sided and shared-memory approaches) for data size 33 MB with different numbers of processes.

Figure 11 .
Figure 11. Performance of Binomial Tree (Shared and One-Sided), Binary Tree (Shared and One-Sided), and Linear for data size 0.5 MB with different numbers of processes.

Figure 12 .
Figure 12. Performance of Binomial Tree, Linear, and Binary Tree for data size 8 KB with different numbers of processes.

Figure 13 .
Figure 13. Performance of Binomial Tree (Shared and One-Sided), Binary Tree (Shared and One-Sided), and Linear for data size 1 KB with different numbers of processes.