The diversity of Fog Computing deployment models and the lack of publicly available Fog infrastructure make the design of efficient application and resource management policies a challenging task. Such research often requires a test framework that facilitates the experimental evaluation of an application or protocol design in a repeatable and controllable manner. In this paper, we present EmuFog, an extensible emulation framework tailored to Fog Computing scenarios, which enables the from-scratch design of Fog Computing infrastructures and the emulation of real applications and workloads. EmuFog enables researchers to design a network topology according to their use case, embed Fog Computing nodes in the topology, and run Docker-based applications on those nodes, connected by an emulated network. Each sub-module of EmuFog is easily extensible, although EmuFog provides a default implementation for each of them. The scalability and efficacy of EmuFog are evaluated on both synthetic and real-world network topologies.
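The workflow described above (design a topology, embed fog nodes, attach Docker-based applications) can be pictured with a small data-model sketch. The following Python snippet is purely illustrative and is not EmuFog's actual API; the class names, fields, and values are assumptions.

```python
# Hypothetical sketch (not EmuFog's actual API): build a topology,
# place fog nodes at topology vertices, and attach Docker images to them.
from dataclasses import dataclass, field

@dataclass
class FogNode:
    name: str
    docker_image: str          # application container to run on this node
    cpu_shares: int = 1024     # relative CPU weight for the emulated node

@dataclass
class Topology:
    links: list = field(default_factory=list)     # (node_a, node_b, latency_ms)
    fog_nodes: dict = field(default_factory=dict)  # vertex -> FogNode

    def add_link(self, a: str, b: str, latency_ms: float) -> None:
        self.links.append((a, b, latency_ms))

    def place_fog_node(self, at: str, node: FogNode) -> None:
        # Embed a fog node at an existing topology vertex.
        self.fog_nodes[at] = node

topo = Topology()
topo.add_link("edge-router-1", "backbone-1", latency_ms=5.0)
topo.add_link("backbone-1", "cloud-gw", latency_ms=20.0)
topo.place_fog_node("edge-router-1",
                    FogNode("fog-1", docker_image="my-app:latest"))
print(topo.fog_nodes)
```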
Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-art results in various domains such as image recognition and natural language processing. One of the reasons for this success is the increasing size of DL models and the availability of vast amounts of training data. To keep improving the performance of DL, increasing the scalability of DL systems is necessary. In this survey, we perform a broad and thorough investigation of challenges, techniques and tools for scalable DL on distributed infrastructures. This covers infrastructures for DL, methods for parallel DL training, multi-tenant resource scheduling, and the management of training and model data. Further, we analyze and compare 11 current open-source DL frameworks and tools and investigate which of the techniques are commonly implemented in practice. Finally, we highlight future trends in DL systems that deserve further research.

One of the driving factors of the success of DL is the scale of training in three dimensions. The first dimension of scale is the size and complexity of the models themselves. Starting from simple, shallow neural networks, new breakthroughs in model accuracy were achieved with increasing depth and more sophisticated model architectures [30,38]. The second dimension of scale is the amount of training data. Model accuracy can, to a large extent, be improved by feeding more training data into the model [56,63]. In practice, tens to hundreds of terabytes (TB) of training data are reported to be used in training a DL model [27,62]. The third dimension is the scale of the infrastructure. The availability of programmable, highly parallel hardware, especially graphics processing units (GPUs), is a key enabler for training large models on large amounts of training data in a short time [30,206].

Our survey focuses on the challenges that arise when managing a large, distributed infrastructure for DL. Hosting a large number of DL models that are trained with large amounts of training data is challenging. This includes questions of parallelization, resource scheduling and elasticity, data management, and portability. The field is in rapid development, with contributions from diverse research communities such as distributed and networked systems, data management, and machine learning. At the same time, a number of open-source DL frameworks and orchestration systems are emerging [4,24,141,195]. In this survey, we bring together, classify and compare the huge body of work on distributed infrastructures for DL from the different communities that contribute to this area. Furthermore, we provide an overview and comparison of the existing open-source DL frameworks and tools that put distributed DL into practice. Finally, we highlight and discuss open research challenges in this field.

Complementary surveys: There are a number of surveys on DL that are complementary to ours. Deng [41] provides a general survey of DL architectures, algorithms and applications. LeCun et al. pro...
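To make the parallel-training dimension of the survey concrete, the sketch below shows synchronous data-parallel training in its simplest form: each worker computes a gradient on its own data shard, the gradients are averaged (the role an all-reduce plays in real systems), and every replica applies the same update. The linear model and NumPy setup are toy stand-ins chosen for illustration, not a specific DL framework.

```python
# Minimal sketch of synchronous data-parallel SGD with gradient averaging.
import numpy as np

def worker_gradient(w, x_shard, y_shard):
    # Least-squares gradient computed on this worker's shard of the data.
    pred = x_shard @ w
    return x_shard.T @ (pred - y_shard) / len(y_shard)

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 8))
y = x @ rng.normal(size=8) + 0.01 * rng.normal(size=1000)

num_workers, lr = 4, 0.1
w = np.zeros(8)
shards = list(zip(np.array_split(x, num_workers), np.array_split(y, num_workers)))

for step in range(100):
    grads = [worker_gradient(w, xs, ys) for xs, ys in shards]  # parallel in practice
    avg_grad = np.mean(grads, axis=0)   # "all-reduce": average across workers
    w -= lr * avg_grad                  # identical update on every model replica
```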
Stream Processing (SP) has evolved as the leading paradigm to process and gain value from the high volume of streaming data produced, e.g., in the domain of the Internet of Things. An SP system is a middleware that deploys a network of operators between data sources, such as sensors, and the consuming applications. SP systems typically face intense and highly dynamic data streams. Parallelization and elasticity enable SP systems to process these streams with continuously high quality of service. The current research landscape provides a broad spectrum of methods for parallelization and elasticity in SP. Each method makes specific assumptions and focuses on particular aspects of the problem. However, the literature lacks a comprehensive overview and categorization of the state of the art in SP parallelization and elasticity, which is necessary to consolidate the state of the research and to plan future research directions on this basis. Therefore, in this survey, we study the literature and develop a classification of current methods for both parallelization and elasticity in SP systems.

or even different processing nodes in a shared-nothing cloud-based infrastructure. Frequent state synchronization must not hamper parallel processing, while the processing results have to remain consistent. Research proposes different approaches for parallel, stateless and stateful SP. They differ in their assumptions about the operator functions and the state externalization mechanisms an SP system supports. This led to the development of a broad range of parallelization approaches tackling different problem cases.

Second, how to continuously adapt the level of parallelization when the conditions of the SP operators, e.g., the workload or the resources available, change at runtime. On the one hand, an SP system always needs enough resources to process the input data streams with a satisfying quality of service (QoS), e.g., latency or throughput. On the other hand, continuously provisioning computing resources for peak workloads wastes resources at off-peak hours. Thus, an elastic SP system scales its resources according to the current need. Cloud computing provides on-demand resources to realize such elasticity [9]. The pay-as-you-go business model of cloud computing allows costs to be cut by dynamically adapting resource reservations to the needs of the SP system. It is challenging to strike the right balance between resource over-provisioning, which is costly but robust to workload fluctuations, and on-demand scaling, which is cheap but vulnerable to sudden workload peaks. To this end, academia and industry have developed elasticity methods. Again, they differ in their optimization objectives and in their assumptions about the operator parallelization model employed, the target system architecture, state management, as well as timing and methodology.

While there are many works that propose methods and solutions for specific parallelization and elasticity problems in SP systems, there is a severe lack of overview, comparison, and classification of these methods. When we investigate...
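As an illustration of the kind of elasticity method the survey classifies, the snippet below sketches a simple threshold-based controller that scales the number of operator instances with the observed load. The utilization metric, thresholds, and capacity figures are assumptions chosen for illustration, not taken from any particular system.

```python
# Illustrative threshold-based elasticity controller: scale out when the load
# per operator instance exceeds an upper bound, scale in when it falls below
# a lower bound.
def adapt_parallelism(current_instances: int,
                      events_per_sec: float,
                      capacity_per_instance: float,
                      high: float = 0.8,
                      low: float = 0.3,
                      max_instances: int = 64) -> int:
    utilization = events_per_sec / (current_instances * capacity_per_instance)
    if utilization > high and current_instances < max_instances:
        return current_instances + 1   # scale out to avoid latency buildup
    if utilization < low and current_instances > 1:
        return current_instances - 1   # scale in to avoid wasting resources
    return current_instances

# Example: 5 instances, 10,000 events/s observed, 1,500 events/s per instance.
print(adapt_parallelism(5, 10_000, 1_500))   # -> 6 (scale out)
```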
The tremendous number of sensors and smart objects being deployed in the Internet of Things offers the potential for IT systems to detect and react to live situations. To exploit this hidden potential, Complex Event Processing (CEP) systems offer means to efficiently detect event patterns (complex events) in the sensor streams and thereby help to realize a "distributed intelligence" in the Internet of Things. With the increasing number of data sources and the increasing volume at which data is produced, parallelization of event detection is crucial to limit the time events need to be buffered before they can actually be processed. In this article, we propose a pattern-sensitive partitioning model for data streams that is capable of achieving a high degree of parallelism in detecting event patterns which formerly could only be detected consistently in a sequential manner or at a low parallelization degree. Moreover, we propose methods to dynamically adapt the parallelization degree to limit the buffering imposed on event detection in the presence of dynamic changes to the workload. Extensive evaluations of the system behavior show that the proposed partitioning model allows for a high degree of parallelism and that the proposed adaptation methods are able to meet a buffering limit for event detection under high and dynamic workloads.
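A hedged sketch of the general idea behind pattern-sensitive partitioning (not the paper's exact model): a splitter routes each incoming event to every parallel operator instance whose pattern context the event may contribute to, so each instance can detect its complex events independently and consistently. The context-key function and the event schema are hypothetical examples.

```python
# Sketch of pattern-sensitive stream splitting for parallel CEP.
from collections import defaultdict

def split(events, num_instances, context_keys):
    """Route each event to the instances responsible for its pattern contexts.

    context_keys(event) returns the set of pattern contexts (e.g. machine IDs)
    the event may contribute to; an event can be replicated to several
    instances, which is what enables consistent parallel pattern detection.
    """
    buffers = defaultdict(list)
    for ev in events:
        for key in context_keys(ev):
            buffers[hash(key) % num_instances].append(ev)
    return buffers

events = [{"machine": "m1", "temp": 81}, {"machine": "m2", "temp": 64},
          {"machine": "m1", "temp": 92}]
print(split(events, num_instances=4, context_keys=lambda e: {e["machine"]}))
```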
In recent years, the graph partitioning problem has gained importance as a mandatory preprocessing step for distributed graph processing on very large graphs. Existing graph partitioning algorithms minimize partitioning latency by assigning individual graph edges to partitions in a streaming manner, at the cost of reduced partitioning quality. However, we argue that the mere minimization of partitioning latency is not the optimal design choice in terms of minimizing total graph analysis latency, i.e., the sum of partitioning and processing latency. Instead, for complex and long-running graph processing algorithms that run on very large graphs, it is beneficial to invest more time into graph partitioning to reach a higher partitioning quality, which drastically reduces graph processing latency. In this paper, we propose ADWISE, a novel window-based streaming partitioning algorithm that increases partitioning quality by always choosing the best edge from a set of edges for assignment to a partition. In doing so, ADWISE controls the partitioning latency by adapting the window size dynamically at run-time. Our evaluations show that ADWISE can reach the sweet spot between graph partitioning latency and graph processing latency, reducing the total latency of partitioning plus processing by up to 23 to 47 percent compared to the state of the art.

[Fig. 1: Research gap: adaptive window-based streaming vertex-cut partitioning, spanning the space between single-edge and all-edge algorithms such as NE [40].]

We focus on vertex-cut partitioning in this paper due to its superior partitioning properties on real-world graphs compared to edge-cut partitioning [4]. In vertex-cut partitioning, each vertex can reside on multiple partitions, i.e., it can be replicated across the corresponding worker machines. However, a replicated vertex causes synchronization and communication overhead between the worker machines, inducing higher graph processing latency [2], [6], [7]. Hence, graph processing latency strongly correlates with partitioning quality, defined as the replication degree of vertices on the different worker machines. The problem of partitioning a graph optimally, i.e., with minimal vertex replication, is intractable for large graphs due to its NP-hardness [8]. In the literature, there are two basic approaches to practically address the partitioning problem: (i) single-edge streaming algorithms perform partitioning decisions on one edge at a time, minimizing the partitioning latency, or (ii) all-edge algorithms load the complete graph into memory and employ global placement heuristics to optimize the partitioning quality. Existing algorithms follow one of these two approaches; Figure 1 illustrates the landscape of state-of-the-art vertex-cut partitioning algorithms. Modern graph processing systems use streaming partitioning when loading massive graphs due to its superior scalability and minimal runtime complexity [4], [9]. In this paper, we investigate whether it is always optimal to invest minimal partitioning latency, as done by the established streaming partitioning algorithms. Clearly, there is a tradeoff between partitioning ...
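To illustrate the window-based idea, the following simplified sketch assigns edges to partitions by always picking the best (edge, partition) pair from a bounded window of candidate edges. The greedy scoring function is a plain locality-and-balance heuristic chosen for illustration; it is not ADWISE's actual scoring model, and the sketch omits the adaptive window sizing.

```python
# Simplified window-based streaming vertex-cut partitioning: keep a window of
# candidate edges and repeatedly assign the best-scoring (edge, partition) pair.
def partition(edges, k, window_size):
    parts = [set() for _ in range(k)]   # vertex replicas held by each partition
    sizes = [0] * k                     # edges per partition (for balance)
    window, assignment = [], {}
    edges = list(edges)

    def score(edge, p):
        u, v = edge
        locality = (u in parts[p]) + (v in parts[p])   # avoid new vertex replicas
        balance = -sizes[p] / (max(sizes) + 1)          # prefer smaller partitions
        return locality + balance

    i = 0
    while i < len(edges) or window:
        # Refill the window, then assign the best edge in it.
        while len(window) < window_size and i < len(edges):
            window.append(edges[i]); i += 1
        e, p = max(((e, p) for e in window for p in range(k)),
                   key=lambda ep: score(*ep))
        window.remove(e)
        parts[p].update(e); sizes[p] += 1; assignment[e] = p
    return assignment

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]
print(partition(edges, k=2, window_size=3))
```

With a window size of one, the sketch degenerates to a single-edge streaming heuristic; with a window spanning the whole edge list, it approaches an all-edge greedy placement, which is the latency-quality trade-off the paper targets.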