Parallel computing is one of the top priorities in computer science. The primary means of parallel information processing is the distributed computing system (CS): a set of elementary machines that interact through a communication medium. Modern distributed CSs implement thread-level parallelism (TLP) within a single computing node (a multi-core CS with shared memory) as well as process-level parallelism (PLP) across the entire distributed CS. The main tool for developing parallel programs for such systems is the MPI standard. The need for scalable parallel programs that use shared-memory compute nodes efficiently has driven the development of the MPI standard, which today supports hybrid multi-threaded MPI programs. A hybrid multi-threaded MPI program combines the computational capabilities of processes and threads. The standard defines four levels of thread support: Single, one thread of execution; Funneled, a multi-threaded program in which only the main thread may perform MPI operations; Serialized, only one thread at a time may call MPI functions; Multiple, any thread may call MPI functions at any time. The main challenge of the Multiple mode is the need to synchronize the communication threads within each process. This paper presents an overview of work that addresses the problem of synchronizing processes running on remote machines and synchronizing the threads within a program. A method for thread synchronization based on queues with relaxed operation semantics is proposed.
In this work we analyse the efficiency of the atomic operations compare-and-swap (CAS), fetch-and-add (FAA), swap (SWP), load and store on modern multicore processors. These operations, implemented in hardware as processor instructions, are in high demand in multithreaded programming (for the design of thread locks and non-blocking data structures). We study how the cache coherence protocol and the size and locality of the data influence the latency of the operations. We developed a benchmark for analyzing how throughput and latency depend on these parameters. We present the results of evaluating the efficiency of atomic operations on modern x86-64 processors and give recommendations for optimization. In particular, we identified the atomic operations with minimum latency (load), maximum latency (“successful CAS”, store) and comparable latency (“unsuccessful CAS”, FAA, SWP). We showed that the choice of the processor core that performs the operation and the state of the cache line change the latency by factors of 1.5 and 1.3 on average, respectively. Depending on whether these parameters are chosen well or suboptimally, the throughput of atomic operations differs by a factor of 1.1 to 7.2. Our findings may be used in the design of new concurrent data structures and synchronization primitives and in the optimization of existing ones.
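For readers unfamiliar with the five operations, the following minimal C++ sketch shows their semantics on a single shared variable, including the difference between a "successful" and an "unsuccessful" CAS. It is an illustration only and not the benchmark described above: the names `AtomicDemoResult` and `run_atomic_demo` are hypothetical, the code runs in one thread, and it measures neither latency nor throughput. It assumes that `std::atomic` maps to the corresponding hardware instructions on the target platform.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical result record for this illustration (not from the paper).
struct AtomicDemoResult {
    std::uint64_t loaded;         // value returned by load
    bool fresh_cas_succeeded;     // CAS whose expected value matched
    bool stale_cas_succeeded;     // CAS whose expected value was out of date
    std::uint64_t observed;       // value a failed CAS writes back to `expected`
    std::uint64_t faa_previous;   // value returned by fetch-and-add
    std::uint64_t swp_previous;   // value returned by swap
    std::uint64_t final_value;    // contents of the cell at the end
};

inline AtomicDemoResult run_atomic_demo() {
    std::atomic<std::uint64_t> cell{0};
    AtomicDemoResult r{};

    cell.store(10);                 // store: cell becomes 10
    r.loaded = cell.load();         // load: reads 10

    // "Successful CAS": expected matches the current value, so 20 is written.
    std::uint64_t expected = 10;
    r.fresh_cas_succeeded = cell.compare_exchange_strong(expected, 20);

    // "Unsuccessful CAS": expected (10) is stale because the cell holds 20.
    // Nothing is written; the current value is copied into `expected`.
    expected = 10;
    r.stale_cas_succeeded = cell.compare_exchange_strong(expected, 30);
    assert(expected == 20);
    r.observed = expected;

    r.faa_previous = cell.fetch_add(5);   // FAA: returns 20, cell becomes 25
    r.swp_previous = cell.exchange(7);    // SWP: returns 25, cell becomes 7
    r.final_value = cell.load();
    return r;
}
```

Calling `run_atomic_demo()` from a `main` function and inspecting the returned fields is enough to see each operation's effect. A measurement harness would additionally need multiple threads pinned to chosen cores and cache lines prepared in known coherence states, which this sketch deliberately leaves out.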