MARS: A Multiprocessor-Based Programmable Accelerator

Agrawal, P.; Dally, William J.; Fischer, W.C.; Jagadish, H. V.; Krishnakumar, A. S.; Tutundjian, R.

doi:10.1109/mdt.1987.295211

Cited by 44 publications

(6 citation statements)

References 5 publications

“…Algorithm-parallel efforts aim at parallelizing the fault simulation algorithm, distributing workload and/or pipelining the tasks, such that the frequency of communication and synchronization between processors is reduced [14,2,3,4]. In contrast to these approaches, our approach is data-parallel.…”

Section: Previous Work (mentioning)

confidence: 99%

“…The approach discussed in [3] suggests a pipelined design, where each functional unit performs a specific task. MARS [4], a hardware accelerator, is based on this design. However, the application of the accelerator to fault simulation has been limited [14].…”

Section: Previous Work (mentioning)

confidence: 99%

“…Fault simulation can be parallelized by a variety of techniques. The techniques include parallelizing the fault simulation algorithm (algorithm-parallel techniques [2,3,4]), partitioning the circuit into disjoint components and simulating them in parallel (model-parallel techniques [5,6]), partitioning the fault set data and simulating faults in parallel (data-parallel techniques [7,8,9,10,11,12,13]) and a combination of one or more of these techniques [14]. Data-parallel techniques can be further classified into fault-parallel methods, wherein different faults are simulated in parallel, and pattern-parallel approaches, wherein different patterns of the same fault are simulated in parallel.…”

Section: Introduction (mentioning)

confidence: 99%
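The fault-parallel and pattern-parallel split described in the statement above can be illustrated with a small sketch. This is not the method of any cited paper: the three-input circuit, the net names, and the helper functions `simulate`, `pack_patterns`, and `detected_faults` are all hypothetical and chosen only for illustration. Patterns are packed into the bits of one integer (the pattern-parallel dimension), and the loop over faults is the dimension a fault-parallel scheme would distribute across processors.

```python
from itertools import product

# Hypothetical circuit used only for illustration: y = (a AND b) OR c.
NETS = ["a", "b", "c", "n1", "y"]

def simulate(words, width, fault=None):
    """Evaluate the circuit on `width` patterns at once.

    Each value in `words` is an int whose bit i holds that input's value in
    pattern i. `fault` is an optional (net, stuck_value) pair.
    """
    mask = (1 << width) - 1

    def inject(net, value):
        # Force the net to all 0s or all 1s if it carries the stuck-at fault.
        if fault is not None and fault[0] == net:
            return mask if fault[1] else 0
        return value

    a = inject("a", words["a"])
    b = inject("b", words["b"])
    c = inject("c", words["c"])
    n1 = inject("n1", a & b)
    return inject("y", n1 | c) & mask

def pack_patterns(patterns):
    """Pack a list of (a, b, c) tuples into one bit-vector word per input."""
    words = {"a": 0, "b": 0, "c": 0}
    for i, (a, b, c) in enumerate(patterns):
        words["a"] |= a << i
        words["b"] |= b << i
        words["c"] |= c << i
    return words

def detected_faults(patterns):
    """Return {fault: bitmask of the patterns whose output exposes it}."""
    width = len(patterns)
    words = pack_patterns(patterns)
    good = simulate(words, width)
    result = {}
    # This loop is independent per fault, so it is the part a fault-parallel
    # scheme would split across processors.
    for net, stuck in product(NETS, (0, 1)):
        faulty = simulate(words, width, fault=(net, stuck))
        result[(net, stuck)] = good ^ faulty
    return result

if __name__ == "__main__":
    all_patterns = list(product((0, 1), repeat=3))
    for fault, hits in detected_faults(all_patterns).items():
        print(fault, format(hits, "08b"))
```

Bit i of each reported mask is set when pattern i produces a different output in the faulty circuit than in the good one.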

Towards acceleration of fault simulation using graphics processing units

Gulati; Khatri. 2008. Proceedings of the 45th Annual Design Automation Conference.

In this paper, we explore the implementation of fault simulation on a Graphics Processing Unit (GPU). In particular, we implement a fault simulator that exploits thread-level parallelism. Fault simulation is inherently parallelizable, and the large number of threads that can be computed in parallel on a GPU results in a natural fit for the problem of fault simulation. Our implementation fault-simulates all the gates in a particular level of a circuit, including good and faulty circuit simulations, for all patterns, in parallel. Since GPUs have an extremely large memory bandwidth, we implement each of our fault simulation threads (which execute in parallel with no data dependencies) using memory lookup. Fault injection is also done along with gate evaluation, with each thread using a different fault injection mask. All threads compute identical instructions, but on different data, as required by the Single Instruction Multiple Data (SIMD) programming semantics of the GPU. Our results, implemented on an NVIDIA GeForce GTX 8800 GPU card, indicate that our approach is on average 35× faster when compared to a commercial fault simulation engine. With the recently announced Tesla GPU servers housing up to eight GPUs, our approach would be potentially 238× faster. The correctness of the GPU-based fault simulator has been verified by comparing its result with a CPU-based fault simulator.
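The lookup-based gate evaluation with a per-thread fault injection mask mentioned in this abstract can be sketched on the CPU as follows. The abstract does not give the actual table layout or mask encoding, so everything here is an assumption made for illustration: the `LUT` gate set, the `(inject, stuck)` mask pair, and the functions `eval_gate` and `eval_level` are hypothetical names, and the nested list comprehension merely stands in for a grid of GPU threads.

```python
# Truth tables indexed by (a << 1) | b. The gate set is an illustrative assumption.
LUT = {
    "AND": (0, 0, 0, 1),
    "OR": (0, 1, 1, 1),
    "NAND": (1, 1, 1, 0),
    "XOR": (0, 1, 1, 0),
}

def eval_gate(gate, a, b, inject, stuck):
    """One 'thread': look up the gate output, then apply the injection mask.

    inject=1 forces the output to `stuck`; inject=0 keeps the good value.
    Every thread runs the same expression with no data-dependent branch,
    which is the property SIMD-style execution relies on.
    """
    good = LUT[gate][(a << 1) | b]
    return (good & (1 - inject)) | (stuck & inject)

def eval_level(gate, inputs, masks):
    """Evaluate one gate for every (pattern, mask) pair.

    Rows correspond to input patterns and columns to injection masks.
    """
    return [
        [eval_gate(gate, a, b, inject, stuck) for (inject, stuck) in masks]
        for (a, b) in inputs
    ]

if __name__ == "__main__":
    # Columns: fault-free, output stuck-at-0, output stuck-at-1.
    masks = [(0, 0), (1, 0), (1, 1)]
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for pattern, row in zip(inputs, eval_level("AND", inputs, masks)):
        print(pattern, row)
```

A pattern detects a fault at this gate when its entry in a faulty column differs from the entry in the fault-free column.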

“…An unmodified sequential program can… (Footnote 2: Area was determined by measuring the processing components of various chips, in particular the R4600 described in [12].)…”

Section: (heading garbled in extraction) (mentioning)

confidence: 99%

“…Register-mapped network interfaces have been used previously in the Mars Machine [2], J-Machine, and iWarp [4], and have been described by *T [26] as well as Henry and Joerg [15]. However, none of these systems provide protection for user-level messages.…”

Section: (heading garbled in extraction) (mentioning)

confidence: 99%

The M-machine multicomputer

Fillo; Keckler; Dally; et al. 1997. Int J Parallel Prog.

Because of the increasing density of VLSI integrated circuits, most of the chip area of modern computers is now occupied by memory and not by processing resources. The M-Machine is an experimental multicomputer being developed to test architectural concepts motivated by these constraints of modern semiconductor technology and the demands of programming systems, such as faster execution of fixed-sized problems and easier programmability of parallel computers.

Advances in VLSI technology have resulted in computers with chip area dominated by memory and not by processing resources. The normalized area (in λ²) of a VLSI chip¹ is increasing by 50% per year, while gate speed and communication bandwidth are increasing by 20% per year [14]. As a result, a 64-bit processor with a pipelined FPU (400Mλ²)² is only 8% of a 5Gλ² 1996 0.35 µm chip. In a system with 256 MBytes of DRAM, the processor accounts for 0.13% of the silicon area in the system. The memory system, cache, TLB, controllers, and DRAM account for most of the remaining area. Technology scaling has made the memory, rather than the processor, the most area-consuming resource in a computer system.

To address this imbalance, the M-Machine increases the fraction of chip area devoted to the processor, making better use of the critical memory resources. An M-Machine multi-ALU processor (MAP) chip contains four 64-bit three-issue clusters that comprise 32% of the 5Gλ² chip and 11% of an 8 MByte (six-chip) node. The multiple execution clusters will provide better peak performance than using a single cluster and a large on-chip cache in the same chip area. The high ratio of arithmetic bandwidth to memory bandwidth (12 operations/word) allows the MAP to saturate the costly DRAM bandwidth even on code with high cache-hit ratios. A 32-node M-Machine system with 256 MBytes of memory has 128 times the peak performance of a 1996 uniprocessor with the same memory capacity at 1.5 times the area, an 85:1 improvement in peak performance/area. Even at a small fraction of this peak performance, such a machine allows the costly, fixed-sized memory to handle more problems per unit time, resulting in more cost-effective computing.

The M-Machine is intended to extract more parallelism from problems of a fixed size, rather than requiring enormous problems to achieve peak performance. To do this, nodes are designed to manage parallelism at a variety of granularities, from the instruction level to the process level. The 12 function units in a single M-Machine node are controlled using a form of Processor Coupling [18] to exploit instruction-level parallelism by executing 12 operations from the same thread, or to exploit thread-level parallelism by executing operations from up to six different threads. The fast internode communication allows collaborating threads to reside on different nodes.

The M-Machine also addresses the demand for easier programmability by providing an incremental path for increasing parallelism and performance. An unmodified sequential program can… (Footnote 2: Area was determ...
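As a quick check, the two headline ratios in this abstract follow directly from the figures it quotes. The block below is only arithmetic on those numbers, taking the area unit to be λ²; the symbols A and P (area and peak performance) are labels introduced here for illustration.

$$
\frac{A_{\text{proc}}}{A_{\text{chip}}} = \frac{400\,\mathrm{M}\lambda^2}{5\,\mathrm{G}\lambda^2} = 0.08 = 8\%,
\qquad
\frac{P_{\text{M-Machine}} / P_{\text{uni}}}{A_{\text{M-Machine}} / A_{\text{uni}}} = \frac{128}{1.5} \approx 85.3 \approx 85{:}1
$$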

Computer‐Aided Design in Electronics

Domic. 2003. Digital Encyclopedia of Applied Physics.
