We study the problem of k-means clustering in the presence of outliers. The goal is to cluster a set of data points to minimize the variance of the points assigned to the same cluster, with the freedom of ignoring a small set of data points that can be labeled as outliers. Clustering with outliers has received a lot of attention in the data processing community, but practical, efficient, and provably good algorithms remain unknown for the most popular k-means objective. Our work proposes a simple local search-based algorithm for k-means clustering with outliers. We prove that this algorithm achieves constant-factor approximate solutions and can be combined with known sketching techniques to scale to large data sets. Using empirical evaluation on both synthetic and large-scale real-world data, we demonstrate that the algorithm dominates recently proposed heuristic approaches for the problem.
In this paper we study the problem of scheduling a set of dynamic multithreaded jobs with the objective of minimizing the maximum latency experienced by any job. We assume that jobs arrive online and the scheduler has no information about the arrival rate, arrival time or work distribution of the jobs. The scheduling goal is to minimize the maximum amount of time between the arrival of a job and its completion-this goal is referred to in scheduling literature as maximum flow time. While theoretical online scheduling of parallel jobs has been studied extensively, most prior work has focussed on a highly stylized model of parallel jobs called the "speedup curves model." We model parallel jobs as directed acyclic graphs, which is a more realistic way to model dynamic multithreaded jobs. In this context, we prove that a simple First-In-First-Out scheduler is (1 +)-speed O(1)-competitive for any > 0. We then develop a more practical work-stealing scheduler and show that it has a maximum flow time of O(1 2 max{OPT, ln(n)}) for n jobs, with (1 +)-speed. This result is essentially tight as we also provide a lower bound of Ω(log(n)) for work stealing. In addition, for the case where jobs have weights (typically representing priorities) and the objective is minimizing the maximum weighted flow time, we show a non-clairvoyant algorithm is (1 +)-speed O(1 2)-competitive for any > 0, which is essentially the best positive result that can be shown in the online setting for the weighted case due to strong lower bounds without resource augmentation. After establishing theoretical results, we perform an empirical study of work-stealing. Our results indicate that, on both real world and synthetic workloads, work-stealing performs almost as well as an optimal scheduler.
In this work, we study the problem of scheduling parallelizable jobs online with an objective of minimizing average flow time. Each parallel job is modeled as a DAG where each node is a sequential task and each edge represents dependence between tasks. Previous work has focused on a model of parallelizability known as the arbitrary speed-up curves setting where a scalable algorithm is known. However, the DAG model is more widely used by practitioners, since many jobs generated from parallel programming languages and libraries can be represented in this model. However, little is known for this model in the online setting with multiple jobs. The DAG model and the speed-up curve models are incomparable and algorithmic results from one do not immediately imply results for the other. Previous work has left open the question of whether an online algorithm can be O(1)-competitive with O(1)-speed for average flow time in the DAG setting. In this work, we answer this question positively by giving a scalable algorithm which is (1 + ǫ)-speed O(1 ǫ 3)-competitive for any ǫ > 0. We further introduce the first greedy algorithm for scheduling parallelizable jobs-our algorithm is a generalization of the shortest jobs first algorithm. Greedy algorithms are among the most useful in practice due to their simplicity. We show that this algorithm is (2 + ǫ)-speed O(1 ǫ 4)competitive for any ǫ > 0.
Although concurrent data structures are commonly used in practice on shared-memory machines, even the most efficient concurrent structures often lack performance theorems guaranteeing linear speedup for the enclosing parallel program. Moreover, efficient concurrent data structures are difficult to design. In contrast, parallel batched data structures do provide provable performance guarantees, since processing a batch in parallel is easier than dealing with the arbitrary asynchrony of concurrent accesses. They can limit programmability, however, since restructuring a parallel program to use batched data structure instead of a concurrent data structure can often be difficult or even infeasible. This paper presents BATCHER, a scheduler that achieves the best of both worlds through the idea of implicit batching, and a corresponding general performance theorem. BATCHER takes as input (1) a dynamically multithreaded program that makes arbitrary parallel accesses to an abstract data type, and (2) an implementation of the abstract data type as a batched data structure that need not cope with concurrent accesses. BATCHER extends a randomized work-stealing scheduler and guarantees provably good performance to parallel algorithms that use these data structures. In particular, suppose a parallel algorithm has T 1 work, T ∞ span, and n data-structure operations. Let W (n) be the total work of datastructure operations and let s(n) be the span of a size-P batch. Then BATCHER executes the program in O((T 1 + W (n) + ns(n))/P + s(n)T ∞ ) expected time on P processors. For higher-cost data structures like search trees and large enough n, this bound becomes ((T 1 +n lg n)/P+T ∞ lg n), provably matching the work of a sequential search tree but with nearly linear speedup, even though the data structure is accessed concurrently. The BATCHER runtime bound also readily extends to data structures with amortized bounds.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.