Sparse modeling for signal processing and machine learning has, in general, been a focus of scientific research for over two decades. Among others, supervised sparsity-aware learning comprises two major paths paved by: a) discriminative methods, which establish a direct input-output mapping based on the optimization of a regularized cost function, and b) generative methods, which learn the underlying distributions. The latter, more widely known as Bayesian methods, enable uncertainty evaluation with respect to the performed predictions. Furthermore, they can better exploit related prior information and, in principle, can naturally introduce robustness into the model, owing to their unique capacity to marginalize out uncertainties related to the parameter estimates. Moreover, the hyper-parameters (tuning parameters) associated with the adopted priors, which correspond to cost-function regularizers, can be learned from the training data rather than via costly cross-validation techniques, as is generally the case with discriminative methods. To implement sparsity-aware learning, the crucial point lies in the choice of the regularizer for discriminative methods and in the choice of the prior distribution for Bayesian learning. Over the last decade or so, due to the intense research on deep learning, emphasis has been put on discriminative techniques. However, a comeback of Bayesian methods is taking place, shedding new light on the design of deep neural networks, which also establishes firm links with Bayesian models.