Parallel debugging faces challenges in both scalability and efficiency. A number of advanced methods have been invented to improve the efficiency of parallel debugging. As system scale increases, these methods rely heavily on a scalable communication protocol in order to be usable in large-scale distributed environments. This paper describes a debugging middleware that provides fundamental debugging functions and supports multiple communication protocols. Its pluggable architecture allows users to select suitable communication protocols as plug-ins for debugging on different platforms. It is intended to be used by various advanced debugging technologies across different computing platforms. The performance of this debugging middleware is examined on a Cray XE supercomputer with 21,760 CPU cores.

MRNet and SCI provide different tree topologies, communication features and launch methods. We compare them as follows.

A. MRNet (Multicast/Reduction Network)

MRNet is a software overlay network that uses a tree of communication processes to connect front-end (FE) and back-end (BE) nodes. Its communication tree can be used to broadcast or multicast messages downstream and to collect or aggregate messages upstream. The tree organization is configurable: it supports common network layouts such as k-ary and k-nomial trees, as well as custom layouts tailored to specific requirements.

Communication is achieved through the filter, stream, and communicator components. A communicator represents a group of BE nodes. A stream is a logical channel that connects the FE with the BE nodes of a communicator. A filter may be attached to each stream to modify the data transferred across it, and message aggregation can be realized by programming filters. MRNet provides both synchronous and asynchronous ways of receiving messages.

MRNet supports an attachment mode for creating a communication tree. Specifically, MRNet creates only the internal processes, while the BE processes are created by system management or job scheduling tools. After being created by such an external service, the BE processes attach to the tree instantiated by MRNet.
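The downstream broadcast and upstream aggregation described above can be illustrated with a small sketch. This is not MRNet's API: the `Node` class and the functions `build_kary_tree`, `broadcast`, and `reduce_upstream` are hypothetical names chosen for illustration, and the whole tree lives in one Python process rather than across communication processes. The `filter_fn` argument plays the role of a filter that combines the values arriving from a node's children.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


@dataclass
class Node:
    """One process in the overlay tree (hypothetical model, not MRNet's types)."""
    name: str
    children: List["Node"] = field(default_factory=list)


def build_kary_tree(num_backends: int, fanout: int) -> Node:
    """Build a tree whose leaves are back-ends (BE) and whose root is the front-end (FE)."""
    if num_backends < 1 or fanout < 2:
        raise ValueError("need at least one back-end and a fanout of at least 2")
    level = [Node(f"BE{i}") for i in range(num_backends)]
    depth = 0
    while True:
        # Group the current level into parents with at most `fanout` children each.
        level = [Node(f"CP{depth}.{i // fanout}", level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
        depth += 1
        if len(level) == 1:
            break
    level[0].name = "FE"
    return level[0]


def broadcast(node: Node, message: str, delivered: List[Tuple[str, str]]) -> None:
    """Send a message downstream; only leaves (back-ends) record its delivery."""
    if not node.children:
        delivered.append((node.name, message))
        return
    for child in node.children:
        broadcast(child, message, delivered)


def reduce_upstream(node: Node,
                    leaf_value: Callable[[Node], T],
                    filter_fn: Callable[[List[T]], T]) -> T:
    """Aggregate leaf values upstream, applying the filter at every internal node."""
    if not node.children:
        return leaf_value(node)
    return filter_fn([reduce_upstream(c, leaf_value, filter_fn) for c in node.children])


if __name__ == "__main__":
    tree = build_kary_tree(num_backends=8, fanout=2)
    delivered: List[Tuple[str, str]] = []
    broadcast(tree, "halt", delivered)
    print(len(delivered))                                           # 8
    print(reduce_upstream(tree, lambda n: int(n.name[2:]), sum))    # 28
```

Because the filter runs at every internal node, the front-end receives a single aggregated value rather than one message per back-end, which is the property that makes tree overlays attractive at scale.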
Relative debugging traces software errors by comparing two concurrent executions of a program, one being a reference version and the other a faulty version. Relative debugging is particularly effective when code is migrated from one platform to another, and this is of significant interest for hybrid computer architectures containing CPUs and accelerators or coprocessors. In this paper we extend relative debugging to support porting stencil computations to a hybrid computer. We describe a generic data model that allows programmers to examine the global state across different types of applications, including MPI/OpenMP, MPI/OpenACC, and UPC programs. We present case studies using a hybrid version of the 'stellarator' particle simulation DELTA5D on Titan at ORNL, and the UPC version of the Shallow Water Equations on Crystal, an internal Cray supercomputer. These case studies used up to 5,120 GPUs and 32,768 CPU cores to illustrate that the debugger is effective and practical.
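The core comparison behind relative debugging can be sketched in a few lines. The example below is an illustration only, not the debugger described in the paper: `stencil_reference`, `stencil_ported`, and `first_divergence` are hypothetical names, the stencil is a simple three-point average, and the "ported" version contains a deliberately injected loop-bound bug. Both versions run on the same input and the first element where they disagree is reported.

```python
import math
from typing import List, Optional, Sequence, Tuple


def stencil_reference(u: Sequence[float]) -> List[float]:
    """Reference version: three-point average over all interior points."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def stencil_ported(u: Sequence[float]) -> List[float]:
    """Ported version with an injected bug: the last interior point is skipped."""
    out = list(u)
    for i in range(1, len(u) - 2):  # bug: the bound should be len(u) - 1
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def first_divergence(reference: Sequence[float],
                     suspect: Sequence[float],
                     rel_tol: float = 1e-9,
                     abs_tol: float = 0.0) -> Optional[Tuple[int, float, float]]:
    """Return (index, reference value, suspect value) of the first mismatch, or None."""
    if len(reference) != len(suspect):
        raise ValueError("the two executions produced arrays of different length")
    for i, (r, s) in enumerate(zip(reference, suspect)):
        if not math.isclose(r, s, rel_tol=rel_tol, abs_tol=abs_tol):
            return i, r, s
    return None


if __name__ == "__main__":
    u = [0.0, 3.0, 6.0, 9.0, 12.0, 0.0]
    print(first_divergence(stencil_reference(u), stencil_ported(u)))  # (4, 7.0, 12.0)
```

A tolerance-based comparison is used instead of exact equality because a port to an accelerator may legitimately change floating-point rounding; the tolerance values here are arbitrary placeholders.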
Contemporary parallel debuggers allow users to control more than one processing thread while supporting the same examination and visualisation operations as sequential debuggers. This approach restricts the use of parallel debuggers for large-scale scientific applications that run across hundreds of thousands of compute cores. First, manually observing the runtime data to detect errors becomes impractical because the data is too large. Second, performing expensive but useful debugging operations becomes infeasible as the computational codes become more complex, involving larger data structures, and as the machines become larger. This study explores the idea of a data-centric debugging approach, which could be used to make parallel debuggers more powerful. It discusses the use of ad hoc debug-time assertions that allow a user to reason about the state of a parallel computation. These assertions support the verification and validation of program state at runtime as a whole, rather than focusing on the state of a single process. Furthermore, the debugger's performance can be improved by exploiting the underlying parallel platform, because the available compute cores can execute parallel debugging functions while a program is idling at a breakpoint. We demonstrate the system with several case studies and evaluate the performance of the tool on a 20,000-core Cray XE6.

We review assertion usage under the traditional programming model, thus revealing the potential of using assertions for data-centric debugging in a parallel context. We then introduce and describe three distinctive debug-time assertion templates: general ad hoc assertions, comparative assertions and statistical assertions.

Assertions under the traditional programming model

An assertion is a statement about an intended behaviour of a system's component that must be verified during execution.
In computer programming, a programmer defines an assertion to ensure a specific state of the program at runtime. Further, using assertions, programmers can inject their design intentions, such as constraints or contracts, into the executable code. As a result, assertions are used extensively for evaluating invariants, checking input parameters and enhancing program correctness and quality (8). Assertions are an effective tool, not only for design and verification but also for debugging throughout the software-development cycle. The advantages of using assertions for debugging software come from three important attributes: their ability to perform error detection, error isolation and error notification. In addition, assertions support a more data-centric view of debugging because a user does not focus on the control path, but can assert that various data structures should be in particular states at various stages in the program execution.

A generic assertion-based parallel debugger

Assertions can also be used as a powerful tool for debugging. In particular, a programmer writes ad hoc assertions at debug time to tes...
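The three assertion templates named earlier (general ad hoc, comparative and statistical) can be sketched as follows. This is an interpretation made under assumptions, not the debugger's actual syntax or semantics: the excerpt does not define the templates in detail, so the function names, the per-rank dictionary layout, and the choice of the mean as the statistic are all hypothetical. What the sketch preserves is the data-centric idea that each assertion is evaluated over a global data structure assembled from all processes rather than over one process at a time.

```python
import math
import statistics
from typing import Callable, Dict, List, Sequence

# Hypothetical layout: each rank (process id) holds one chunk of a distributed array.
Chunks = Dict[int, Sequence[float]]


def gather(chunks: Chunks) -> List[float]:
    """Assemble the global array from per-rank chunks, ordered by rank."""
    return [x for rank in sorted(chunks) for x in chunks[rank]]


def assert_general(chunks: Chunks, predicate: Callable[[List[float]], bool]) -> bool:
    """General ad hoc assertion: an arbitrary predicate over the global array."""
    return predicate(gather(chunks))


def assert_comparative(chunks_a: Chunks, chunks_b: Chunks, tol: float = 1e-9) -> bool:
    """Comparative assertion: two global arrays agree element-wise within a tolerance."""
    a, b = gather(chunks_a), gather(chunks_b)
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=tol, abs_tol=tol) for x, y in zip(a, b)
    )


def assert_statistical(chunks: Chunks, expected_mean: float, tol: float) -> bool:
    """Statistical assertion (assumed form): the global mean is close to an expected value."""
    return abs(statistics.fmean(gather(chunks)) - expected_mean) <= tol


if __name__ == "__main__":
    density = {0: [1.0, 2.0], 1: [3.0, 4.0]}
    print(assert_general(density, lambda data: all(x > 0 for x in data)))   # True
    print(assert_comparative(density, {0: [1.0, 2.0], 1: [3.0, 4.5]}))      # False
    print(assert_statistical(density, expected_mean=2.5, tol=1e-12))        # True
```

In a real parallel debugger the gather step would itself be distributed; here it is a plain list concatenation so that the example stays self-contained.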