2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/hpca.2016.7446050
TABLA: A unified template-based framework for accelerating statistical machine learning

Abstract: A growing number of commercial and enterprise systems increasingly rely on compute-intensive machine learning algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. Field Programmable Gate Arrays (FPGAs) provide a promising path forward to accommodate the needs of machine learning algorithms and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors…
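The premise behind the abstract, developed in the paper, is that a broad class of statistical ML algorithms learn via stochastic gradient descent over an algorithm-specific objective, so a single hardware template can serve them all: the programmer supplies only the gradient of the objective. As a minimal software analogue of that idea (illustrative names only, not TABLA's actual interface), the sketch below shows a generic SGD loop parameterized by a pluggable gradient function, with logistic regression as the plugged-in learner:

```python
# Minimal sketch (not TABLA's API): many statistical ML learners reduce to
# the same stochastic-gradient-descent template; only the gradient of the
# objective function differs between them.
import numpy as np

def sgd(gradient, w, data, lr=0.1, epochs=10):
    """Generic SGD template: the learner is fully defined by `gradient`."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * gradient(w, x, y)   # w_{t+1} = w_t - mu * dL/dw
    return w

def logistic_gradient(w, x, y):
    """Example learner plugged into the template: logistic regression."""
    p = 1.0 / (1.0 + np.exp(-w @ x))      # predicted probability
    return (p - y) * x                     # gradient of the cross-entropy loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
    w = sgd(logistic_gradient, np.zeros(3), list(zip(X, y)))
    print("learned weights:", w)
```

Swapping `logistic_gradient` for the gradient of a linear-regression, SVM, or other objective changes the learner without touching the loop, which is the uniformity a template-based accelerator generator exploits.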

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
1
1

Citation Types

1
60
0

Year Published

2016
2016
2023
2023

Publication Types

Select...
6
2
1

Relationship

1
8

Authors

Journals

Cited by 138 publications (61 citation statements)
References 47 publications
“…The only amendment we made in the device architecture was to increase the capacity of I/O pads from 2 to 4, as our benchmarks are heavily I/O bound. Our benchmarks include Tabla [13], DnnWeaver [14], DianNao [9], Stripes [45], and Proteus [46], which are general neural network acceleration frameworks capable of optimizing various objective functions through gradient descent by supporting huge […]. Figure 10 compares the achieved power gain of different voltage scaling approaches implemented in the Tabla acceleration framework under a varying workload. We considered a synthetic workload with 40% average load (of the maximum) from [47], with λ = 1000, H = 0.76, and IDC = 500, where λ, H (0.5 < H ≤ 1), and IDC denote the average arrival rate of the whole process, the Hurst exponent, and the index of dispersion, respectively.…”
Section: A. General Setup
confidence: 99%
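For readers unfamiliar with the parameters quoted above: λ is the mean arrival rate, H is the Hurst exponent (a measure of self-similarity, i.e. burstiness across time scales), and IDC is the index of dispersion for counts. The sketch below, with helper names of my own invention, shows one conventional way to estimate these statistics from a trace of per-interval arrival counts, e.g. to check a generated workload against the quoted targets (λ = 1000, H = 0.76, IDC = 500):

```python
# Illustrative sketch (helper names are mine, not from the cited setup):
# estimate lambda, H, and IDC from a trace of per-interval arrival counts.
import numpy as np

def arrival_rate(counts, interval):
    """Mean arrivals per unit time: lambda = E[N] / interval."""
    return counts.mean() / interval

def idc(counts, block):
    """Index of dispersion for counts over windows of `block` intervals:
    IDC = Var[N(t)] / E[N(t)]."""
    n = len(counts) // block
    windows = counts[:n * block].reshape(n, block).sum(axis=1)
    return windows.var() / windows.mean()

def hurst_aggvar(counts, scales=(1, 2, 4, 8, 16, 32, 64)):
    """Aggregated-variance estimate of H: for self-similar traffic,
    Var[mean over m intervals] ~ m^(2H - 2), so H = 1 + slope / 2."""
    variances = []
    for m in scales:
        n = len(counts) // m
        agg = counts[:n * m].reshape(n, m).mean(axis=1)
        variances.append(agg.var())
    slope = np.polyfit(np.log(scales), np.log(variances), 1)[0]
    return 1.0 + slope / 2.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts = rng.poisson(lam=1000, size=4096)  # placeholder Poisson trace
    print("lambda ~", arrival_rate(counts, interval=1.0))
    print("IDC    ~", idc(counts, block=64))   # ~1 for Poisson, 500 target
    print("H      ~", hurst_aggvar(counts))    # ~0.5 for Poisson, 0.76 target
```

A plain Poisson trace yields H ≈ 0.5 and IDC ≈ 1; hitting the quoted H = 0.76 and IDC = 500 requires a long-range-dependent generator such as the one cited as [47] in the quote.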
“…Unfortunately, they are limited to a specific subset of applications, while the applications and/or implementations of data centers evolve at a high pace. Thanks to their relatively low power consumption, fine-grained parallelism, and programmability, Field-Programmable Gate Arrays (FPGAs) have in the last few years shown great performance in various applications [10], [11], [12], [13], [14]. Therefore, they have been integrated into data centers to accelerate data center applications.…”
Section: Introduction
confidence: 99%
“…Once the analyst imports the dana package, she can express the required variables. The code snippet below declares a multidimensional ML model of size [5][2] using the dana.model construct.…”
Section: Language Constructs
confidence: 99%
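The snippet referred to in the quote is not reproduced in this report. Purely as a hypothetical illustration, the sketch below shows what declaring a [5][2] model with a dana.model construct might look like; only the construct name and the model size come from the quoted text, and the stand-in class exists solely so the example runs without the real package, whose API may differ:

```python
# Hypothetical sketch: `dana.model` and the [5][2] size come from the quote;
# the stand-in implementation below is mine, not the real dana package.
import numpy as np

class _Dana:
    @staticmethod
    def model(shape, name="model"):
        """Stand-in for dana.model: declares an ML model variable with the
        given dimensions, initialized to zeros."""
        return {"name": name, "shape": shape, "weights": np.zeros(shape)}

dana = _Dana()

# Declare a multidimensional ML model of size [5][2], as in the quote.
w = dana.model((5, 2), name="w")
print(w["name"], w["shape"])
```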
“…Large-scale neural networks are both memory-intensive and computation-intensive, thereby posing stringent requirements on computing platforms when deploying those large-scale neural network models on memory-constrained and energy-constrained embedded devices. In order to overcome these limitations, hardware acceleration of deep neural networks has been extensively investigated in both industry and academia [1], [2], [3], [4], [5], [6], [7], [8]. These hardware accelerators are based on FPGA and ASIC devices and can achieve a significant improvement in energy efficiency, along with a small form factor, compared with traditional CPU- or GPU-based computing of deep neural networks.…”
Section: Introduction
confidence: 99%