Parallel debugging faces challenges in both scalability and efficiency. A number of advanced methods have been developed to improve the efficiency of parallel debugging. As system scale increases, these methods rely heavily on a scalable communication protocol to be usable in large-scale distributed environments. This paper describes a debugging middleware that provides fundamental debugging functions and supports multiple communication protocols. Its pluggable architecture allows users to select suitable communication protocols as plug-ins when debugging on different platforms. The middleware is intended to be used by various advanced debugging technologies across different computing platforms. Its performance is evaluated on a Cray XE supercomputer with 21,760 CPU cores.

MRNet and SCI provide different tree topologies, communication features, and launch methods. We compare them as follows.
A. MRNet (Multicast/Reduction Network)

MRNet is a software overlay network that uses a tree of communication processes to connect the front-end (FE) and back-end (BE) nodes. Its communication tree can be used to broadcast or multicast messages downstream and to collect or aggregate messages upstream. The tree organization is configurable: it supports common layouts such as k-ary and k-nomial trees, as well as custom layouts tailored to specific requirements.

Communication is achieved through three components: communicators, streams, and filters. A communicator represents a group of BE nodes. A stream is a logical channel that connects the FE with the BE nodes of a communicator. A filter may be attached to a stream to modify the data transferred across it, and message aggregation is realized by programming such filters. MRNet provides both synchronous and asynchronous ways of receiving messages.

MRNet supports an attachment mode for creating the communication tree. In this mode, MRNet creates only the internal processes, while the BE processes are created by system management or job scheduling tools. After being created by such an external service, the BE processes attach to the tree instantiated by MRNet.
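To make the tree terminology concrete, the following Python sketch builds k-ary and k-nomial layouts and simulates a downstream multicast and an upstream reduction through a filter. It is not MRNet's C++ API. The function names (`kary_children`, `knomial_children`, `multicast`, `reduce_up`), the rank numbering with the FE at rank 0, the treatment of leaves as BE nodes, and the sum filter are illustrative assumptions.

```python
from typing import Callable, Dict, List, Set

# Hypothetical types for this sketch (not MRNet's API):
# a filter maps packets arriving from a node's children to packets sent upstream.
Filter = Callable[[List[int]], List[int]]
Tree = Dict[int, List[int]]  # rank -> list of child ranks


def kary_children(n: int, k: int) -> Tree:
    """k-ary tree over ranks 0..n-1 (rank 0 is the FE): parent(r) = (r - 1) // k."""
    children: Tree = {r: [] for r in range(n)}
    for r in range(1, n):
        children[(r - 1) // k].append(r)
    return children


def knomial_children(n: int, k: int) -> Tree:
    """k-nomial tree: the parent of r is r with its leading base-k digit removed."""
    children: Tree = {r: [] for r in range(n)}
    for r in range(1, n):
        p = 1
        while p * k <= r:  # largest power of k that is <= r
            p *= k
        lead = r // p  # leading base-k digit, in 1..k-1
        children[r - lead * p].append(r)
    return children


def multicast(tree: Tree, node: int, msg: str, members: Set[int],
              inbox: Dict[int, str]) -> None:
    """Forward msg downstream; leaves (BE nodes) in the communicator record it.

    For simplicity this sketch floods every subtree, even ones without members.
    """
    if not tree[node]:
        if node in members:
            inbox[node] = msg
        return
    for child in tree[node]:
        multicast(tree, child, msg, members, inbox)


def reduce_up(tree: Tree, node: int, values: Dict[int, int], members: Set[int],
              filt: Filter) -> List[int]:
    """Collect packets upstream, applying filt at every non-leaf node on the way."""
    if not tree[node]:
        return [values[node]] if node in members else []
    packets: List[int] = []
    for child in tree[node]:
        packets.extend(reduce_up(tree, child, values, members, filt))
    return filt(packets) if packets else []


if __name__ == "__main__":
    tree = knomial_children(8, 2)  # binomial tree, FE at rank 0
    backends = {r for r, kids in tree.items() if not kids}
    print(sorted(backends))  # → [4, 5, 6, 7]
    comm = {5, 6, 7}  # a communicator: a subset of the BE nodes
    inbox: Dict[int, str] = {}
    multicast(tree, 0, "ping", comm, inbox)
    print(sorted(inbox))  # → [5, 6, 7]
    values = {r: 10 * r for r in range(8)}  # per-BE data, e.g. a counter
    print(reduce_up(tree, 0, values, comm, lambda pkts: [sum(pkts)]))  # → [180]
```

Here the internal ranks stand in for MRNet's communication processes. Because the sum filter collapses each node's incoming packets into one, the FE receives a single aggregated value rather than one packet per BE, which is what keeps upstream traffic manageable as the number of BE nodes grows.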