We describe a methodology that enables the real-time diagnosis of performance problems in complex high-performance distributed systems.

Introduction

Developers of high-speed network-based distributed systems often observe performance problems such as unexpectedly low network throughput or high latency. The causes of poor performance can be manifold and are frequently not obvious. Performance problems are often difficult to track down because of the complex interaction between the many distributed system components, and because a problem in one place may be most apparent somewhere else. Bottlenecks can occur in any of the components along the paths through which the data flow: the applications, the operating systems, the device drivers, the network adapters on any of the sending or receiving hosts, and/or in network components such as switches and routers. Sometimes bottlenecks involve interactions among several components or the interplay of protocol parameters at different points in the system, and sometimes they are due to unrelated network activity impacting the operation of the distributed system. While post-hoc diagnosis of performance problems is valuable for systemic problems, for operational problems users will have already suffered through a period of degraded performance. The ability to recognize operational problems as they occur would enable elements of the distributed system to adapt to current conditions, minimizing the impact on users.

We have developed a methodology, known as NetLogger, for monitoring, under realistic operating conditions, the behavior of all the elements of the application-to-application communication path in order to determine exactly what is happening within a complex system. Distributed application components, as well as some operating system components, are modified to perform precision timestamping and logging of "interesting" events at every critical point in the distributed system. The events are then correlated with the system's behavior in order to characterize the performance of all aspects of the system and network in detail during actual operation. The monitoring is designed to facilitate identification of bottlenecks, performance tuning, and network performance research. It also allows accurate measurement of throughput and latency characteristics for distributed application codes.

Software agents collect and filter event-based performance information, turn various monitoring options on or off to adapt the monitoring to the current system state, and manage the large amounts of log data that are generated.

The goal of this performance characterization work is to produce high-speed components that can be used as building blocks for high-performance applications, rather than having to "tune" applications top-to-bottom, as is all too common today. This method can also provide an information source for applications that adapt to component congestion problems.

NetLogger has demonstrated its usefulness with th...
Developers and users of high-performance distributed systems often observe performance problems such as unexpectedly low throughput or high latency. Determining the source of these problems requires detailed end-to-end instrumentation of all components, including the applications, operating systems, hosts, and networks. In this paper we describe a methodology that enables the real-time diagnosis of performance problems in complex high-performance distributed systems. The methodology includes tools for generating timestamped event logs that provide detailed end-to-end application- and system-level monitoring, and tools for visualizing the log data and the real-time state of the distributed system. This methodology, called NetLogger, has proven invaluable for diagnosing problems in networks and in distributed systems code. The approach is novel in that it combines network, host, and application-level monitoring, providing a complete view of the entire system. NetLogger is designed to be extremely lightweight, and includes a mechanism for reliably collecting monitoring events from multiple distributed locations. This technical report summarizes the most important points of several previous papers on NetLogger and is meant to serve as a general overview.
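As a minimal illustration of the kind of instrumentation the excerpts above describe, the sketch below emits timestamped keyword=value monitoring events in the spirit of NetLogger's ULM-style log lines. This is a sketch under stated assumptions, not NetLogger's actual API: the function name log_event, the event names, and the extra field names are hypothetical.

import socket
import sys
import time
from datetime import datetime, timezone

def log_event(stream, program, event, level="Usage", **fields):
    """Write one ULM-style keyword=value monitoring event with a
    microsecond-precision timestamp, in the spirit of NetLogger.
    (Hypothetical helper; not the actual NetLogger library API.)"""
    ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S.%f")
    parts = [
        f"DATE={ts}",
        f"HOST={socket.gethostname()}",
        f"PROG={program}",
        f"LVL={level}",
        f"NL.EVNT={event}",
    ]
    parts += [f"{key}={value}" for key, value in fields.items()]
    stream.write(" ".join(parts) + "\n")
    stream.flush()

# Example: timestamp the critical points around a blocking read so the
# resulting events can later be correlated with host and network monitoring.
log_event(sys.stdout, "testprog", "START_READ", BLOCK=1)
time.sleep(0.01)  # stand-in for the actual disk or network read
log_event(sys.stdout, "testprog", "END_READ", BLOCK=1, SZ=65536)

Pairs of START/END events like these are what make it possible to measure the latency of each step of the application-to-application path and to spot which component is the bottleneck.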
Modern scientific computing involves organizing, moving, visualizing, and ...

Introduction

High-speed data streams resulting from the operation of on-line instruments and imaging systems are a staple of modern scientific, health care, and intelligence environments. The advent of high-speed networks is providing the potential for new approaches to the collection, organization, storage, analysis, visualization, and distribution of the large data objects that result from such data streams. The result will be to make both the data and its analysis much more readily available.

For example, health care imaging systems illustrate the need for both high data rates and real-time cataloging. Medical video and image data used for diagnostic purposes - e.g., X-ray CT, MRI, and cardio-angiography - are collected at centralized facilities and may be accessed at locations other than the point of collection (e.g., the hospitals of the referring physicians). A second example is high energy physics experiments, which generate high rates and massive volumes of data that must be processed and archived in real time. This data must also be accessible to large scientific collaborations - typically hundreds of investigators at dozens of institutions around the world.

In this paper we will describe how "Computational Grid" environments can be used to help with these types of applications, and how a high-speed network cache is a particularly important component in a data-intensive grid architecture. We describe our implementation of a network cache, how we have made it "network aware," and how we adapt its operation to current network conditions.

Data Intensive Grids

The integration of the various technological approaches being used to address the problem of integrated use of dispersed resources is frequently called a "grid," or a computational grid - a name arising by analogy with the grid that supplies ubiquitous access to electric power. See, e.g., [10]. Basic grid services are those that locate, allocate, coordinate, utilize, and provide for human interaction with ...

1. The work described in this paper is supported by DARPA, Information Technology Office (http://www.darpa.mil/ito/ResearchAreas.html) and the U. S. ...
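The excerpt breaks off before the adaptation mechanism itself, but a standard way to make a data transfer "network aware," consistent with the description above, is to size TCP buffers to the bandwidth-delay product of the current path. The following is a minimal sketch under that assumption; optimal_buffer_size and open_tuned_connection are illustrative names, and the bandwidth and round-trip-time values are assumed to come from some external monitoring source.

import socket

def optimal_buffer_size(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: the number of bytes that must be in
    flight to keep a long, fast network path full."""
    return int(bandwidth_bps / 8 * rtt_seconds)

def open_tuned_connection(host, port, bandwidth_bps, rtt_seconds):
    """Open a TCP connection whose send/receive buffers are sized to
    the measured network conditions, so a single stream can sustain
    close to the available bandwidth."""
    bufsize = optimal_buffer_size(bandwidth_bps, rtt_seconds)
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # On some operating systems, buffer sizes must be set before
    # connect() in order to influence TCP window negotiation.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bufsize)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bufsize)
    sock.connect((host, port))
    return sock

# Example: a 622 Mbit/s path with a 50 ms round trip needs ~3.9 MB buffers,
# far larger than typical operating system defaults.
print(optimal_buffer_size(622_000_000, 0.05))  # -> 3887500

The design point this illustrates is that the "right" buffer size is a property of the current path, not a constant, which is why a network-aware cache must re-derive it from live measurements rather than rely on static tuning.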