2019
DOI: 10.1007/s11263-019-01269-y

SSN: Learning Sparse Switchable Normalization via SparsestMax

Abstract: Normalization methods improve both optimization and generalization of ConvNets. To further boost performance, the recently-proposed switchable normalization (SN) provides a new perspective for deep learning: it learns to select different normalizers for different convolution layers of a ConvNet. However, SN uses the softmax function to learn importance ratios to combine normalizers, leading to redundant computations compared to a single normalizer. This work addresses this issue by presenting Sparse Switchable Norm…
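As a rough illustration of the mechanism the abstract describes (not the authors' released code), the sketch below blends instance, layer, and batch statistics with softmax "importance ratios"; the class name SwitchableNorm2d and the single-logit-per-normalizer parameterization are illustrative assumptions. SSN's point is that these softmax ratios are dense, so all three normalizers must be computed, whereas sparse ratios would let a layer fall back to a single normalizer.

```python
# Minimal sketch of a softmax-weighted switchable normalization layer.
# Hypothetical names and parameterization; a simplified stand-in, not SN/SSN's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableNorm2d(nn.Module):
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(1, num_features, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_features, 1, 1))
        # One importance logit per normalizer (IN, LN, BN), for means and variances.
        self.mean_logits = nn.Parameter(torch.zeros(3))
        self.var_logits = nn.Parameter(torch.zeros(3))

    def forward(self, x):                          # x: (N, C, H, W)
        mean_in = x.mean((2, 3), keepdim=True)     # instance statistics
        var_in = x.var((2, 3), keepdim=True, unbiased=False)
        mean_ln = x.mean((1, 2, 3), keepdim=True)  # layer statistics
        var_ln = x.var((1, 2, 3), keepdim=True, unbiased=False)
        mean_bn = x.mean((0, 2, 3), keepdim=True)  # batch statistics
        var_bn = x.var((0, 2, 3), keepdim=True, unbiased=False)

        w_mean = F.softmax(self.mean_logits, dim=0)  # dense importance ratios;
        w_var = F.softmax(self.var_logits, dim=0)    # SSN constrains these to be sparse
        mean = w_mean[0] * mean_in + w_mean[1] * mean_ln + w_mean[2] * mean_bn
        var = w_var[0] * var_in + w_var[1] * var_ln + w_var[2] * var_bn
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return x_hat * self.weight + self.bias
```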

Cited by 6 publications (6 citation statements)
References 42 publications
“…Since its output sums to one, this invariably means less weight is assigned to the relevant items, potentially harming performance and interpretability (Jain and Wallace, 2019). This has motivated a line of research on learning networks with sparse mappings (Martins and Astudillo, 2016; Niculae and Blondel, 2017; Louizos et al., 2018; Shao et al., 2019). We focus on a recently-introduced flexible family of transformations, α-entmax (Blondel et al., 2019; Peters et al., 2019), defined as:…”
Section: Sparse Attention (mentioning)
confidence: 99%
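The excerpt truncates the formula after "defined as:"; for context only, the standard α-entmax definition from the cited Blondel et al. (2019) and Peters et al. (2019) papers is reproduced below (α = 1 recovers softmax, α = 2 recovers sparsemax, and any α > 1 can produce sparse outputs).

```latex
% alpha-entmax (Blondel et al., 2019; Peters et al., 2019), reproduced for
% context because the quoted excerpt truncates the definition.
\[
\alpha\text{-}\mathrm{entmax}(\mathbf{z})
  = \operatorname*{arg\,max}_{\mathbf{p}\,\in\,\triangle^{d}}
    \mathbf{p}^{\top}\mathbf{z} + \mathsf{H}^{\mathsf{T}}_{\alpha}(\mathbf{p}),
\qquad
\mathsf{H}^{\mathsf{T}}_{\alpha}(\mathbf{p}) =
\begin{cases}
  \dfrac{1}{\alpha(\alpha-1)}\displaystyle\sum_{j}\bigl(p_{j}-p_{j}^{\alpha}\bigr), & \alpha \neq 1,\\[6pt]
  -\displaystyle\sum_{j} p_{j}\log p_{j}, & \alpha = 1,
\end{cases}
\]
% Here $\triangle^{d}$ is the probability simplex and $\mathsf{H}^{\mathsf{T}}_{\alpha}$
% is the Tsallis alpha-entropy.
```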
“…Sparse attention. Prior work has developed sparse attention mechanisms, including applications to NMT (Martins and Astudillo, 2016; Malaviya et al., 2018; Niculae and Blondel, 2017; Shao et al., 2019; Maruf et al., 2019). Peters et al. (2019) introduced the entmax function this work builds upon.…”
Section: Related Work (mentioning)
confidence: 99%
“…(6). The optimization of binary variables has been well established in the literature [22, 18, 17, 26], and these techniques can also be used to train DGConv. The gate parameters are optimized with a Straight-Through Estimator, similar to recent network quantization approaches, which is guaranteed to converge [5].…”
Section: Construction of the Relationship Matrix (mentioning)
confidence: 99%
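To make the quoted training recipe concrete, here is a minimal, hedged PyTorch-style sketch of a Straight-Through Estimator for binary gates: the forward pass hard-thresholds real-valued logits to {0, 1}, while the backward pass copies the gradient through unchanged so the logits remain trainable. The class name, the threshold at zero, and the toy objective are illustrative assumptions, not the cited DGConv implementation.

```python
# Straight-Through Estimator (STE) sketch for binary gate parameters.
# Illustrative only; not the cited paper's code.
import torch

class BinaryGate(torch.autograd.Function):
    @staticmethod
    def forward(ctx, logits):
        # Hard binarization in the forward pass: gate is 1 when the logit >= 0.
        return (logits >= 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: treat the binarization as the identity,
        # so gradients flow to the underlying real-valued logits.
        return grad_output

gate_logits = torch.zeros(8, requires_grad=True)  # learnable gate parameters
gates = BinaryGate.apply(gate_logits)             # binary {0, 1} values in forward
loss = (gates * torch.randn(8)).sum()             # toy objective
loss.backward()                                   # gradients reach gate_logits via STE
```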
“…When dealing with shallow features, since IN has good robustness in this case, it is used as the primary normalization method, so the weight of IN is larger than those of BN and LN, and all weights are learned through backpropagation. When extracting deep features, the weight of BN is relatively large, which improves the expressive ability of the proposed self-adaptive normalization after feature processing [21].…”
Section: Methods (mentioning)
confidence: 99%
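With a softmax-combined normalization layer like the hypothetical SwitchableNorm2d sketch shown after the abstract, the per-layer pattern described in this quote (IN dominating in shallow layers, BN in deep ones) can be checked by reading out the learned importance ratios after training; the helper below is illustrative and assumes that earlier sketch, not code from the cited paper.

```python
# Illustrative helper (assumes the hypothetical SwitchableNorm2d sketch above):
# collect the learned mean-importance ratios of every switchable layer in a model.
import torch.nn.functional as F

def importance_ratios(model):
    ratios = {}
    for name, module in model.named_modules():
        if isinstance(module, SwitchableNorm2d):
            w = F.softmax(module.mean_logits, dim=0).tolist()
            ratios[name] = {"IN": w[0], "LN": w[1], "BN": w[2]}
    return ratios
```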