2014
DOI: 10.1007/s10766-014-0319-4

Exploiting GPUs with the Super Instruction Architecture

Abstract: The Super Instruction Architecture (SIA) is a parallel programming environment designed for problems in computational chemistry involving complicated expressions defined in terms of tensors. Tensors are represented by multidimensional arrays which are typically very large. The SIA consists of a domain specific programming language, Super Instruction Assembly Language (SIAL), and its runtime system, Super Instruction Processor. An important feature of SIAL is that algorithms are expressed in terms of blocks (or…
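The abstract's central idea is that SIAL algorithms operate on blocks of large multidimensional arrays rather than on individual elements. The sketch below is not SIAL and its block size is hypothetical; it is only a minimal NumPy analogue of a block-wise contraction, where each block-level multiply is the kind of coarse-grained "super instruction" a SIA-style runtime could dispatch to a worker or a GPU.

```python
# Illustrative sketch only: a blocked contraction in plain Python/NumPy,
# mimicking the block-wise style of SIAL programs. The block size and
# loop structure here are hypothetical, not taken from the paper.
import numpy as np

BLOCK = 64  # hypothetical block edge length

def blocked_contract(A, B):
    """Compute C[i,j] = sum_k A[i,k] * B[k,j] one block at a time.

    In the SIA, each block-level operation would be a 'super instruction'
    scheduled onto a worker (or a GPU); here we simply loop serially.
    """
    n = A.shape[0]
    C = np.zeros((n, n))
    for i0 in range(0, n, BLOCK):
        for j0 in range(0, n, BLOCK):
            for k0 in range(0, n, BLOCK):
                # one block multiply-accumulate: the unit of scheduled work
                C[i0:i0+BLOCK, j0:j0+BLOCK] += (
                    A[i0:i0+BLOCK, k0:k0+BLOCK] @ B[k0:k0+BLOCK, j0:j0+BLOCK]
                )
    return C

n = 256
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(blocked_contract(A, B), A @ B)
```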

Cited by 7 publications (6 citation statements) · References 12 publications

Citation statements (ordered by relevance):
“…It has been shown that by properly selecting intermediate arrays and optimizing the loop structure, the efficiency of a CCSD code can be increased by a factor of 5. To a larger extent, the high numerical cost associated with the polynomial scaling can be effectively addressed by developing highly scalable implementations of CC methods, as evidenced by several recent benchmark calculations. Growing interest in the efficient utilization of peta- and soon-to-be exascale computational resources has stimulated intensive development of tensor libraries that can be exploited to generate scalable CC codes for homogeneous as well as many-core/multicore computer systems. Nevertheless, in all of the above-mentioned canonical CC implementations the storage requirement grows quickly with system size, becoming a storage and communication bottleneck when going from mid-scale (10² to 10³ basis functions) to large-scale (10³ to 10⁴ basis functions) CC calculations. Although it has been shown that the storage requirement can be greatly reduced by employing integral-direct algorithms, the integral-direct approach may also entail frequent I/O operations and/or recalculating atomic two-electron integrals “on the fly”, which would increase the CPU time and deteriorate the scaling with system size.…”
Section: Introduction
confidence: 99%
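To make the storage argument above concrete: if the dominant stored object is the full two-electron integral tensor, its size grows as N⁴ in the number of basis functions N. A back-of-the-envelope sketch (assuming dense storage of 8-byte doubles and ignoring the permutational symmetry that real codes may exploit) shows why the mid- to large-scale transition is where storage becomes the bottleneck:

```python
# Rough storage estimate for the N^4 two-electron integral tensor,
# stored densely as 8-byte doubles (symmetry and sparsity ignored).
for n_basis in (100, 1000, 10000):
    n_elems = n_basis ** 4
    print(f"N = {n_basis:>6}: {n_elems * 8 / 1e12:.6g} TB")
# N =    100: 0.0008 TB  (fits in a single node's memory)
# N =   1000: 8 TB       (requires distributed storage)
# N =  10000: 80000 TB   (infeasible to store; integral-direct needed)
```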
“…[4,17,61] Currently, we rely on underlying libraries such as Eigen [62] (with interfaces to BLAS implementations) or Libint2 [22] (for molecular integral calculations) to achieve parallelization in particular subcalculations. By relying on lower-level adapter-like modules [61,63–70], our APIs can, in principle, scale to the exascale regime. As GQCP's focus is to provide useful generalizations, GQCP could serve as an initiative to further improve inter-module communication in the current electronic structure software ecosystem.…”
Section: Software Development in GQCP
confidence: 99%
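The "lower-level adapter-like modules" mentioned above follow a familiar pattern: high-level quantum-chemistry code is written against a narrow backend interface, so the dense-algebra engine can be swapped without touching the calling code. The sketch below is purely hypothetical (the class and function names are not from GQCP, Eigen, or Libint2); it only illustrates the adapter idea.

```python
# Hypothetical adapter sketch: the kernel programs against a minimal
# backend interface, so the dense-algebra engine (NumPy here; Eigen/BLAS
# or a distributed tensor library in practice) is interchangeable.
from abc import ABC, abstractmethod
import numpy as np

class LinearAlgebraBackend(ABC):
    @abstractmethod
    def gemm(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        """General matrix-matrix multiply, C = A @ B."""

class NumPyBackend(LinearAlgebraBackend):
    def gemm(self, a, b):
        return a @ b  # delegates to the BLAS NumPy was built against

def fock_like_build(h: np.ndarray, d: np.ndarray,
                    backend: LinearAlgebraBackend) -> np.ndarray:
    # Stand-in for a real electronic-structure kernel: all heavy lifting
    # goes through the backend, never through NumPy directly.
    return h + backend.gemm(d, h)

h = np.eye(4)
d = np.ones((4, 4))
print(fock_like_build(h, d, NumPyBackend()))
```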
“…But since it does not have to support all possible computational workloads across domains, its implementation complexity is also reduced. Here we should mention again that examples of such domain-specific parallel runtimes have existed before; however, their architectural design either did not use the concept of a DSVP or introduced it in an ad hoc fashion, without derivation from an abstract (base) DSVP architecture supplied with a clear specification.…”
Section: Abstract DSVP
confidence: 99%
“…Although the TAVP microarchitecture has its own unique design introducing a number of novel elements such as the fully hierarchical hardware encapsulation, it can also be viewed as a generalization and evolution of earlier efforts, specifically the so‐called Super Instruction Architecture framework used in the ACES‐III and ACES‐IV software suites for expressing and executing quantum many‐body algorithms operating on large dense arrays of numbers. In this retrospective, DSVP is a variant of a Super Instruction Processor (SIP) on an abstract architectural level, but it differs from the previous concrete SIP implementations at the microarchitectural level, that is, its exposed implementation design is different. In fact, the previous SIP works did not seem to expose much of the SIP microarchitectural design, that is, the concrete SIP implementations were not derived as a specialization of a well‐defined microarchitectural design.…”
Section: Introduction
confidence: 99%