Modern computer systems extract parallelism from problems at two extremes of granularity: instruction-level parallelism (ILP) and coarse-thread parallelism. VLIW and superscalar processors exploit ILP with a grain size of a single instruction, while multiprocessors extract parallelism from coarse threads with a granularity of many thousands of instructions.

The parallelism available at these two extremes is limited. The ILP in applications is restricted by control flow and data dependencies [17], and the hardware in superscalar designs is not scalable: both the instruction scheduling logic and the register file of a superscalar grow quadratically as the number of execution units is increased. For multicomputers, coarse-thread parallelism is limited at small problem sizes and in many applications.
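The quadratic-growth claim can be sketched with a standard first-order model (an illustrative approximation; the symbols w, p_r, and p_w are not from this paper). Each of w execution units needs roughly two read ports and one write port into a shared register file, and the area of a register bit cell grows with the square of its port count:

\[
A_{\text{regfile}} \;\propto\; (p_r + p_w)^2 \;\approx\; (2w + w)^2 \;=\; 9w^2 .
\]

Similarly, if the issue window is scaled in proportion to w, each of its O(w) entries must compare its source tags against the O(w) result tags produced per cycle, so the wakeup/scheduling logic also grows as O(w^2).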
This paper describes and evaluates the hardware mechanisms implemented in the MIT Multi-ALU Processor (MAP chip) for extracting fine-thread parallelism.¹ Fine-threads close the parallelism gap between the single-instruction granularity of ILP and the thousand-instruction granularity of coarse threads by extracting parallelism with a granularity of 50-1000 instructions. This parallelism is orthogonal and complementary to coarse-thread parallelism and ILP. Programs can be accelerated using coarse threads to extract parallelism from outer loops and large co-routines, fine-threads to extract parallelism from inner loops and small subcomputations, and ILP to extract parallelism from subexpressions. Because they extract parallelism from different portions of a program, coarse-threads, fine-threads, and ILP work synergistically to provide multiplicative speedup.

These three modes are also well matched to the architecture of modern multiprocessors. ILP is well suited to extracting parallelism across the execution units of a single processor. Fine-threads are appropriate for execution across multiple processors at a single node of a parallel computer, where interaction latencies are on the order of a few cycles. Coarse-threads are appropriate for execution on different nodes of a multiprocessor, where interaction latencies are inherently hundreds of cycles.

Low-overhead mechanisms for communication and synchronization are required to exploit fine-grain thread-level parallelism. The cost to initiate a task, pass it arguments, synchronize with its completion, and return results must be small compared to the work accomplished by the task. Such inter-thread interaction requires hundreds of cycles on conventional multiprocessors, and thousands of cycles on multicomputers.

¹The research described in this paper was supported by the Defense Advanced Research Projects Agency and monitored by the Air Force Electronic Systems Division under contract F19628-92-C-0045.
Because of these high overheads, most parallel applications use only coarse threads, with many thousands of instructions between interactions.

The Multi-ALU Processor (MAP) chip provides three on-chip processors and methods for quickly communicating and synchronizing among ...