Abstract-Increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena substantially affecting overall system performance. One such phenomenon is known as the "Long Tail", whereby a small proportion of task stragglers significantly impede job completion time. While existing work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root-cause and quantifies its impact upon system operation. Such analysis is critical to ascertain in-depth knowledge of straggler occurrence for focusing developmental and research efforts towards solving the Long Tail challenge. This paper provides an empirical analysis of straggler root-cause within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency and impact stragglers impose, and propose a method for conducting root-cause analysis. Results demonstrate that approximately 5% of task stragglers impact 50% of total jobs for batch processes, and that 53% of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline execution pattern modeling and online analytic agents that monitor tasks at runtime. Experiments show the approach is capable of detecting stragglers less than 11% into their execution lifecycle with 95% accuracy for short duration jobs.
Abstract-The next era of computing is the evolution of the Internet of Things (IoT) and Smart Cities with development of the Internet of Simulation (IoS). The existing technologies of Cloud, Edge, and Fog computing, as well as HPC applied to the domains of Big Data and deep learning, are not adequate to handle the scale and complexity of the systems required to facilitate a fully integrated and automated smart city. This integration of existing systems will create an explosion of data streams at a scale not yet experienced. The additional data can be combined with simulations as services (SIMaaS) to provide a shared model of reality across all integrated systems, things, devices, and individuals within the city. There are also numerous challenges in managing the security and safety of the integrated systems. This paper presents an overview of the existing state of the art in automating, augmenting, and integrating systems across the domains of smart cities, autonomous vehicles, energy efficiency, smart manufacturing in Industry 4.0, and healthcare. Additionally, the key challenges relating to Big Data, a model of reality, augmentation of systems, computation, and security are examined.
Abstract-A trend seen in many industries is the increasing reliance on modelling and simulation to facilitate design, decision making and training. Previously, these models would operate in isolation, but now there is a growing need to integrate and connect simulations together for co-simulation. In addition, the 21st century has seen the expansion of the Internet of Things (IoT), enabling the interconnectivity of smart devices across the Internet. In this paper we propose that an important, and often overlooked, domain of IoT is that of modelling and simulation. Expanding IoT to encompass interconnected simulations enables the potential for an Internet of Simulation (IoS) whereby models and simulations are exposed to the wider Internet and can be accessed on an "as-a-service" basis. The proposed IoS would need to manage simulation across heterogeneous infrastructures; temporal and causal aspects of simulations; as well as variations in data structures. Via the proposed Simulation as a Service (SIMaaS) and Workflow as a Service (WFaaS) constructs in IoS, highly complex simulation integration could be performed automatically, resulting in high fidelity system level simulations. Additionally, the potential for faster than real-time simulation afforded by IoS opens the possibility of connecting IoS to existing IoT infrastructure via a real-time bridge to facilitate decision making based on live data.
Abstract-The trend towards turning existing cities into smart cities is growing. Facilitated by advances in computing such as Cloud services and Internet of Things (IoT), smart cities propose to bring integrated, autonomous systems together to improve quality of life for their inhabitants. Systems such as autonomous vehicles, smart grids and intelligent traffic management are in the initial stages of development. However, as yet there is no holistic architecture on which to integrate these systems into a smart city. Additionally, the existing systems and infrastructure of cities are extensive and critical to their operation. We cannot simply replace these systems with smarter versions; instead, the system intelligence must augment the existing systems. In this paper we propose a service-oriented reference architecture for smart cities which can tackle these problems, and we identify some related open research questions. The abstract architecture encapsulates the way in which different aspects of the service-oriented approach span the layers of existing city infrastructure. Additionally, the extensible provision of services by individual systems allows for the organic growth of the smart city as required.
Abstract-Cloud computing systems face the substantial challenge of the Long Tail problem: a small subset of straggling tasks significantly impedes the completion of parallel jobs. This behavior results in longer service response times and degraded system utilization. Speculative execution, which creates task replicas at runtime, is a typical method deployed in large-scale distributed systems to tolerate stragglers. This approach defines stragglers by specifying a static threshold value on the temporal difference between an individual task and the average task progression for a job. However, specifying a static threshold debilitates speculation effectiveness, as it fails to consider the intrinsic diversity of job timing constraints within modern day Cloud computing systems. Capturing such heterogeneity makes it possible to impose different levels of strictness for replica creation while achieving specified levels of QoS for different application types. Furthermore, a static threshold also fails to consider system environmental constraints in terms of replication overheads and optimal system resource usage. In this paper we present an algorithm for dynamically calculating a threshold value to identify task stragglers, considering key parameters including job QoS timing constraints, task execution characteristics, and optimal system resource utilization. We study and demonstrate the effectiveness of our algorithm by simulating a number of different operational scenarios, based on real production cluster data, against state-of-the-art solutions. Results demonstrate that our approach is capable of creating 58.62% fewer replicas under high resource utilization while reducing response time by up to 17.86% for idle periods compared to a static threshold.
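The static-threshold baseline described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `find_stragglers`, the representation of task progress as a fraction in [0, 1], and the example values are all assumptions made for illustration. It only shows the general idea of flagging tasks that lag the job's average progression by more than a fixed amount.

```python
from statistics import mean


def find_stragglers(progress, threshold):
    """Flag tasks that lag the job's average progress by more than `threshold`.

    progress:  dict mapping task id -> progress score in [0, 1]
               (hypothetical representation, not taken from the paper)
    threshold: static lag, in the same units as the progress scores
    """
    if not progress:
        return []
    average = mean(progress.values())
    # A task is a speculation candidate when its lag behind the job
    # average exceeds the fixed threshold.
    return sorted(
        task for task, score in progress.items() if average - score > threshold
    )


# Illustrative job with one slow task.
job_progress = {"t1": 0.90, "t2": 0.85, "t3": 0.40, "t4": 0.88}
candidates = find_stragglers(job_progress, threshold=0.2)
```

Because `threshold` is fixed, the same value is applied regardless of job deadline or cluster load, which is the limitation the abstract attributes to static thresholds; a dynamic scheme would recompute it from those inputs.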