Harsha Vardhan Simhadri scite author profile

Making efficient use of cache hierarchies is essential for achieving good performance on multicore and other shared-memory parallel machines. Unfortunately, designing algorithms for complicated cache hierarchies can be difficult and tedious. To address this, recent work has developed high-level models that expose locality in a manner that is oblivious to particular cache or processor organizations, placing the burden of making effective use of a parallel machine on a runtime task scheduler rather than the algorithm designer/programmer. This paper continues this line of work by (i) developing a new model for parallel cache cost, (ii) developing a task scheduler for irregular tasks on cache hierarchies, and (iii) proving that the scheduler assigns tasks to processors in a work-efficient manner (including cache costs) relative to the model. As with many previous models, our model allows algorithms to be analyzed using a single level of cache with parameters M (cache size) and B (cache-line size), and algorithms can be written cache obliviously (with no choices made based on machine parameters). Unlike previous models, our cost Q α (n; M, B), for problem size n, captures costs due to work-space imbalance among tasks, and we prove a lower bound that shows that some sort of penalty is needed to achieve work efficiency. Nevertheless, for many algorithms, Q α () is asymptotically equal to the standard sequential cache cost Q().Our task scheduler is a specific "space-bounded scheduler," which assigns subtasks to caches based on their space usage. Our scheduler extends prior work by efficiently scheduling "irregular" computations with arbitrary work imbalance among parallel subtasks, reflected by the Q α () cost. Moreover, in addition to proving bounds on cache complexity, we also provide a bound on the total running time of an program execution using our scheduler. Specifically, our scheduler executes a program on a homogeneous h-level parallel memory hierarchy having p processors in time O (v h /p) h i=0 Q α (n; M i , B) · C i , where M i is the size of the level-i cache, B is the cache-line size, C i is the cost of level-i cache miss, and v h is an overhead defined in the paper. Ignoring the overhead v h , which may be small (or even constant) when certain side conditions hold, this bound is optimal whenever Q α () matches the sequential cache complexity-at every level of the hierarchy it is dividing the total cache cost by the total number of processors.

show abstract

Word2Sense: Sparse Interpretable Word Embeddings

Panigrahi¹,

Simhadri²,

Bhattacharyya³

2019

View full text Add to dashboard Cite

We present an unsupervised method to generate Word2Sense word embeddings that are interpretable-each dimension of the embedding space corresponds to a fine-grained sense, and the non-negative value of the embedding along the j-th dimension represents the relevance of the j-th sense to the word. The underlying LDA-based generative model can be extended to refine the representation of a polysemous word in a short context, allowing us to use the embeddings in contextual tasks. On computational NLP tasks, Word2Sense embeddings compare well with other word embeddings generated by unsupervised methods. Across tasks such as word similarity, entailment, sense induction, and contextual interpretation, Word2Sense is competitive with the state-of-the-art method for that task. Word2Sense embeddings are at least as sparse and fast to compute as prior art.

show abstract

Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers

Dinh

Simhadri

Tang

2016

View full text Add to dashboard Cite

The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, the two composition constructs, i.e. " " (parallel) and " ; " (serial), are insufficient in expressing "partial dependencies" or "partial parallelism" in a program. We propose a new dataflow composition construct " ; " to express partial dependencies in algorithms in a processor-and cacheoblivious way, thus extending the Nested Parallel (NP) model to the Nested Dataflow (ND) model. We redesign several divide-andconquer algorithms ranging from dense linear algebra to dynamicprogramming in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e, Parallel Memory Hierarchies [4]) and provide theoretical guarantees on their ability to preserve locality and load balance. For this, we adapt space-bounded (SB) schedulers for the ND model. We show that our algorithms have increased "parallelizability" in the ND model, and that SB schedulers can use the extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model. The running time for the algorithms in this paper is, where Q * is the cache complexity of task t, C i is the cost of cache miss at level-i cache which is of size M i , σ ∈ (0, 1) is a constant, and p is the number of processors in an h-level cache hierarchy.

show abstract

Parallel and I/O efficient set covering algorithms

Blelloch

Simhadri

Tangwongsan

2012

View full text Add to dashboard Cite

This paper presents the design, analysis, and implementation of parallel and sequential I/O-efficient algorithms for set cover, tying together the line of work on parallel set cover and the line of work on efficient set cover algorithms for large, disk-resident instances.Our contributions are twofold: First, we design and analyze a parallel cache-oblivious set-cover algorithm that offers essentially the same approximation guarantees as the standard greedy algorithm, which has the optimal approximation. Our algorithm is the first efficient external-memory or cacheoblivious algorithm for when neither the sets nor the elements fit in memory, leading to I/O cost (cache complexity) equivalent to sorting in the Cache Oblivious or Parallel Cache Oblivious models. The algorithm also implies low cache misses on parallel hierarchical memories (again, equivalent to sorting). Second, building on this theory, we engineer variants of the theoretical algorithm optimized for different hardware setups. We provide experimental evaluation showing substantial speedups over existing algorithms without compromising the solution's quality.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.