William Hasenplaugh scite author profile

Chip Multiprocessors (CMPs) allow different applications to concurrently execute on a single chip. When applications with differing demands for memory compete for a shared cache, the conventional LRU replacement policy can significantly degrade cache performance when the aggregate working set size is greater than the shared cache. In such cases, shared cache performance can be significantly improved by preserving the entire working set of applications that can co-exist in the cache and preserving some portion of the working set of the remaining applications. This paper investigates the use of adaptive insertion policies to manage shared caches. We show that directly extending the recently proposed dynamic insertion policy (DIP) is inadequate for shared caches since DIP is unaware of the characteristics of individual applications. We propose Thread-Aware Dynamic Insertion Policy (TADIP) that can take into account the memory requirements of each of the concurrently executing applications. Our evaluation with multi-programmed workloads for 2-core, 4-core, 8-core, and 16-core CMPs shows that a TADIP-managed shared cache improves overall throughput by as much as 94%, 64%, 26%, and 16% respectively (on average 14%, 18%, 15%, and 17%) over the baseline LRU policy. The performance benefit of TADIP is 2.6x compared to DIP and 1.3x compared to the recently proposed Utility-based Cache Partitioning (UCP) scheme. We also show that a TADIP-managed shared cache provides performance benefits similar to doubling the size of an LRU-managed cache. Furthermore, TADIP requires a total storage overhead of less than two bytes per core, does not require changes to the existing cache structure, and performs similarly to LRU for LRU-friendly workloads.
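The effect of the insertion position can be illustrated with a toy simulation. The sketch below is not DIP or TADIP, which additionally choose between policies at run time. It only compares conventional LRU insertion (new blocks enter at the most-recently-used position) with always inserting at the least-recently-used position, on a cyclic working set that is one block larger than a single cache set. The function name `count_hits`, the 4-way set, and the access trace are all assumptions made for illustration.

```python
def count_hits(trace, ways, insert_at_mru):
    """Simulate one cache set; index 0 is the MRU end, the last index is the LRU end."""
    stack = []
    hits = 0
    for block in trace:
        if block in stack:
            # Hit: promote the block to the MRU position under both policies.
            hits += 1
            stack.remove(block)
            stack.insert(0, block)
            continue
        if len(stack) == ways:
            stack.pop()  # Miss on a full set: evict the block at the LRU end.
        if insert_at_mru:
            stack.insert(0, block)  # conventional LRU insertion
        else:
            stack.append(block)     # insert at the LRU position instead
    return hits


# Hypothetical workload: 5 blocks accessed cyclically, 4-way set.
trace = list("abcde") * 10
lru_hits = count_hits(trace, ways=4, insert_at_mru=True)
lip_hits = count_hits(trace, ways=4, insert_at_mru=False)
print(lru_hits, lip_hits)
```

On this trace LRU insertion thrashes and never hits, while inserting at the LRU position keeps three of the five blocks resident after the first pass.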

Ordering heuristics for parallel graph coloring

Hasenplaugh, Kaler, Schardl et al. 2014

This paper introduces the largest-log-degree-first (LLF) and smallest-log-degree-last (SLL) ordering heuristics for parallel greedy graph-coloring algorithms, which are inspired by the largest-degree-first (LF) and smallest-degree-last (SL) serial heuristics, respectively. We show that although LF and SL, in practice, generate colorings with relatively small numbers of colors, they are vulnerable to adversarial inputs for which any parallelization yields a poor parallel speedup. In contrast, LLF and SLL allow for provably good speedups on arbitrary inputs while, in practice, producing colorings of competitive quality to their serial analogs. We applied LLF and SLL to the parallel greedy coloring algorithm introduced by Jones and Plassmann, referred to here as JP. Jones and Plassmann analyze the variant of JP that processes the vertices of a graph in a random order, and show that on an O(1)-degree graph G = (V, E), this JP-R variant has an expected parallel running time of O(lg V / lg lg V) in a PRAM model. We improve this bound to show, using work-span analysis, that JP-R, augmented to handle arbitrary-degree graphs, colors a graph G = (V, E) with degree ∆ using Θ(V + E) work and O(lg V + lg ∆ · min{√E, ∆ + lg ∆ lg V / lg lg V}) expected span. We prove that JP-LLF and JP-SLL (JP using the LLF and SLL heuristics, respectively) execute with the same asymptotic work as JP-R and only logarithmically more span while producing higher-quality colorings than JP-R in practice. We engineered an efficient implementation of JP for modern shared-memory multicore computers and evaluated its performance on a machine with 12 Intel Core-i7 (Nehalem) processor cores. Our implementation of JP-LLF achieves a geometric-mean speedup of 7.83 on eight real-world graphs and a geometric-mean speedup of 8.08 on ten synthetic graphs, while our implementation using SLL achieves a geometric-mean speedup of 5.36 on these real-world graphs and a geometric-mean speedup of 7.02 on these synthetic graphs. Furthermore, on one processor, JP-LLF is slightly faster than a well-engineered serial greedy algorithm using LF, and likewise, JP-SLL is slightly faster than the greedy algorithm using SL.
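A round-by-round simulation may help make the JP scheme concrete. The sketch below is a serial illustration, not the authors' multicore implementation. It assumes that the LLF priority of a vertex is the ceiling of the base-2 logarithm of its degree with ties broken randomly, a detail the abstract does not spell out, and that a vertex may be colored once every higher-priority neighbor has a color. The names `llf_priorities` and `jp_color` and the example graph are hypothetical.

```python
import math
import random


def llf_priorities(adj, seed=0):
    """Assumed LLF priority: (ceil(lg degree), random tie-breaker, vertex id)."""
    rng = random.Random(seed)
    prio = {}
    for v in sorted(adj):
        deg = len(adj[v])
        log_deg = math.ceil(math.log2(deg)) if deg > 0 else 0
        prio[v] = (log_deg, rng.random(), v)
    return prio


def jp_color(adj, prio):
    """Color vertices in rounds; each round's vertices could be colored in parallel."""
    color = {}
    uncolored = set(adj)
    rounds = 0
    while uncolored:
        # A vertex is ready when every neighbor with higher priority is colored.
        # Two adjacent uncolored vertices are never both ready.
        ready = [v for v in uncolored
                 if all(u in color or prio[u] < prio[v] for u in adj[v])]
        for v in ready:
            used = {color[u] for u in adj[v] if u in color}
            c = 0
            while c in used:
                c += 1  # smallest color not used by any colored neighbor
            color[v] = c
        uncolored.difference_update(ready)
        rounds += 1
    return color, rounds


# Hypothetical 6-vertex example graph.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (3, 4), (4, 5)]
adj = {v: set() for v in range(6)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

coloring, rounds = jp_color(adj, llf_priorities(adj))
print(coloring, rounds)
```

The number of rounds is a rough stand-in for the span: an ordering that leaves long chains of dependent vertices forces many rounds, which is the weakness the abstract attributes to LF and SL on adversarial inputs.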

Executing dynamic data-graph computations deterministically using chromatic scheduling

Kaler, Hasenplaugh, Schardl et al. 2014

A data-graph computation, popularized by such programming systems as Galois, Pregel, GraphLab, PowerGraph, and GraphChi, is an algorithm that performs local updates on the vertices of a graph. During each round of a data-graph computation, an update function atomically modifies the data associated with a vertex as a function of the vertex's prior data and that of adjacent vertices. A dynamic data-graph computation updates only an active subset of the vertices during a round, and those updates determine the set of active vertices for the next round. This paper introduces PRISM, a chromatic-scheduling algorithm for executing dynamic data-graph computations. PRISM uses a vertex-coloring of the graph to coordinate updates performed in a round, precluding the need for mutual-exclusion locks or other nondeterministic data synchronization. A multibag data structure is used by PRISM to maintain a dynamic set of active vertices as an unordered set partitioned by color. We analyze PRISM using work-span analysis. Let G = (V, E) be a degree-∆ graph colored with χ colors, and suppose that Q ⊆ V is the set of active vertices in a round. Define size(Q) = |Q| + ∑_{v∈Q} deg(v), which is proportional to the space required to store the vertices of Q using a sparse-graph layout. We show that a P-processor execution of PRISM performs updates in Q using O(χ(lg(Q/χ) + lg ∆) + lg P) span and Θ(size(Q) + χ + P) work. These theoretical guarantees are matched by good empirical performance. We modified GraphLab to incorporate PRISM and studied seven application benchmarks on a 12-core multicore machine. PRISM executes the benchmarks 1.2 to 2.1 times faster than GraphLab's nondeterministic lock-based scheduler while providing deterministic behavior. This paper also presents PRISM-R, a variation of PRISM that executes dynamic data-graph computations deterministically even when updates modify global variables with associative operations. PRISM-R satisfies the same theoretical bounds as PRISM, but its implementation is more involved, incorporating a multivector data structure to maintain an ordered set of vertices partitioned by color.
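The chromatic-scheduling idea can be sketched serially as follows. This is a simplified illustration under stated assumptions, not PRISM itself: a plain dictionary of lists stands in for the multibag, the color classes are visited one after another, and the vertices inside a class are processed in a loop where a real implementation would run them in parallel. The update function `min_label_update` (label propagation) and all other names are hypothetical.

```python
from collections import defaultdict


def min_label_update(v, adj, data):
    """Hypothetical update: take the smallest label among v and its neighbors.

    Returns the set of vertices to activate for the next round.
    """
    best = min([data[v]] + [data[u] for u in adj[v]])
    if best < data[v]:
        data[v] = best
        return set(adj[v])  # a changed vertex re-activates its neighbors
    return set()


def chromatic_schedule(adj, color, data, update, active):
    """Run rounds until no vertex is active; returns the number of rounds."""
    rounds = 0
    while active:
        # Partition the active set by color (a stand-in for the multibag).
        bags = defaultdict(list)
        for v in active:
            bags[color[v]].append(v)
        next_active = set()
        for c in sorted(bags):
            # Same-colored vertices are never adjacent, so these updates touch
            # disjoint neighborhoods' centers and need no locks.
            for v in bags[c]:
                next_active |= update(v, adj, data)
        active = next_active
        rounds += 1
    return rounds


# Hypothetical example: a path 0-1-2-3 plus a separate edge 4-5, 2-colored.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}}
color = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}
data = {v: v for v in adj}
rounds = chromatic_schedule(adj, color, data, min_label_update, set(adj))
print(data, rounds)
```

Because the color classes are visited in a fixed order and vertices within a class do not read each other's data, the final labels do not depend on how the work inside a class is scheduled.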

Ordering Heuristics for Parallel Graph Coloring

Hasenplaugh, Kaler, Schardl et al. 2022

The graph coloring problem asks for an assignment of the minimum number of distinct colors to vertices in an undirected graph with the constraint that no pair of adjacent vertices share the same color. The problem is a thoroughly studied NP-hard combinatorial problem with several real-world applications. As such, a number of greedy heuristics have been suggested that strike a good balance between coloring quality, execution time, and also parallel scalability. In this work, we introduce a graph neural network (GNN) based ordering heuristic and demonstrate that it outperforms existing greedy ordering heuristics both on quality and performance. Previous results have demonstrated that GNNs can produce high-quality colorings but at the expense of excessive running time. The current paper is the first that brings the execution time down to compete with existing greedy heuristics. Our GNN model is trained using both supervised and unsupervised techniques. The experimental results show that a 2-layer GNN model can achieve execution times between the largest degree first (LF) and smallest degree last (SL) ordering heuristics while outperforming both on coloring quality. Increasing the number of layers improves the coloring quality further, and it is only at four layers that SL becomes faster than the GNN. Finally, our GNN-based coloring heuristic achieves superior scaling in the parallel setting compared to both SL and LF.

Multiaperture imaging

Shankar, Hasenplaugh, Morrison et al. 2006, Appl. Opt.

We study the reconstruction of a high-resolution image from multiple low-resolution images by using a nonlinear iterative backprojection algorithm. We exploit diversities in the imaging channels, namely, the number of imagers, magnification, position, rotation, and fill factor, to undo the degradation caused by the optical blur, pixel blur, and additive noise. We quantify the improvements gained by these diversities in the reconstruction process and discuss the trade-off among system parameters. As an example, for a system in which the pixel size is matched to the diffraction-limited optical blur size at a moderate detector noise level, we can reduce the reconstruction root-mean-square error by 570% by using 16 cameras and a large amount of diversity. The algorithm was implemented on a 56 camera array specifically constructed to demonstrate the resolution-enhancement capabilities. Practical issues associated with building and operating this device are presented and analyzed.
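The general shape of an iterative backprojection loop can be sketched in one dimension. The abstract describes a nonlinear algorithm and does not give its details, so the code below is a simplified linear variant under assumed conditions: each imager averages two adjacent high-resolution samples (pixel blur with circular boundaries), imagers differ only by a one-sample position shift, and there is no noise. The names `observe`, `backproject`, `reconstruct`, and `mismatch`, and the test signal, are hypothetical.

```python
def observe(x, offset, factor=2):
    """One low-resolution imager: average `factor` adjacent samples from `offset`."""
    n = len(x)
    return [sum(x[(offset + i * factor + j) % n] for j in range(factor)) / factor
            for i in range(n // factor)]


def backproject(residual, offset, n, factor=2):
    """Spread each low-resolution residual back over the samples it averaged."""
    out = [0.0] * n
    for i, r in enumerate(residual):
        for j in range(factor):
            out[(offset + i * factor + j) % n] += r / factor
    return out


def reconstruct(observations, n, steps=200, step_size=1.0):
    """Repeatedly correct the estimate by the backprojected observation error."""
    x = [0.0] * n
    for _ in range(steps):
        correction = [0.0] * n
        for offset, y in observations:
            residual = [a - b for a, b in zip(y, observe(x, offset))]
            for k, c in enumerate(backproject(residual, offset, n)):
                correction[k] += c
        x = [a + step_size * c for a, c in zip(x, correction)]
    return x


def mismatch(x, observations):
    """Sum of squared differences between the observations and the estimate's images."""
    return sum((a - b) ** 2
               for offset, y in observations
               for a, b in zip(y, observe(x, offset)))


# Hypothetical high-resolution signal seen by two imagers shifted by one sample.
truth = [0.0, 1.0, 4.0, 2.0, 3.0, 5.0, 1.0, 0.0]
observations = [(0, observe(truth, 0)), (1, observe(truth, 1))]
estimate = reconstruct(observations, n=len(truth))
print(mismatch([0.0] * len(truth), observations), mismatch(estimate, observations))
```

The loop only guarantees that the estimate reproduces the low-resolution observations; how closely it matches the underlying scene depends on how much diversity the imagers provide, which is the trade-off the abstract quantifies.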
