Star-Transformer
2019 | Preprint
DOI: 10.48550/arxiv.1902.09113

Cited by 27 publications (42 citation statements) | References 0 publications
“…Sukhbaatar et al. (2019) proposed an adaptive mechanism that learns the optimal context length for each head in each layer, thus reducing the total computational and memory cost of Transformers. Guo et al. (2019) argue that the fully-connected self-attention of the Transformer is not a good inductive bias; they proposed the Star-Transformer, which links adjacent words and couples them with a central relay node to capture both local and global dependencies. With this reduction, the Star-Transformer achieved significant improvements over the standard Transformer on moderately sized datasets. However, the Star-Transformer is not suitable for auto-regressive models, in which each word should be conditioned only on its previous words, because the relay node summarizes the whole sequence.…”
Section: Lightweight Self-attention (mentioning)
confidence: 99%
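As an illustration of the star-shaped connectivity described in the statement above, here is a minimal sketch (not the authors' implementation): it builds a boolean attention mask over n token positions plus one relay node, assuming a local ring of radius r, so the number of allowed connections grows roughly linearly rather than quadratically with n.

```python
# Minimal sketch (not the authors' code) of a star-shaped attention mask:
# each of the n token positions attends to a local ring of neighbours
# (radius r, an assumed hyper-parameter) plus a relay node stored at index n;
# the relay node attends to every position. True = attention allowed.
import numpy as np

def star_attention_mask(n: int, r: int = 1) -> np.ndarray:
    size = n + 1                          # n tokens + 1 relay node
    mask = np.zeros((size, size), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        mask[i, lo:hi] = True             # local ring connections
        mask[i, n] = True                 # every token sees the relay node
    mask[n, :] = True                     # the relay node sees the whole sequence
    return mask

# roughly n * (2r + 2) + n + 1 allowed pairs instead of (n + 1)^2
print(star_attention_mask(5, r=1).astype(int))
```

Because the relay row attends to everything, such a mask cannot be used as-is for auto-regressive decoding, which is exactly the limitation the quoted statement points out.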
“…(1) Hierarchical Transformers (Miculicich et al, 2018; Liu and Lapata, 2019) use two Transformers in a hierarchical architecture: one Transformer models the sentence representation with word-level context, and another models the document representation with sentence-level context. (2) Lightweight Transformers (Child et al, 2019; Sukhbaatar et al, 2019; Guo et al, 2019; Dai et al, 2019) reduce the complexity by reconstructing the connections between tokens.…”
Section: Introduction (mentioning)
confidence: 99%
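A minimal sketch of the hierarchical scheme in (1), with illustrative dimensions and mean-pooling chosen as assumptions rather than taken from the cited papers: a word-level Transformer encodes each sentence, and the pooled sentence vectors are then contextualized by a document-level Transformer.

```python
# Minimal sketch (not from the cited papers) of a two-level hierarchy:
# a word-level Transformer encodes each sentence, its outputs are mean-pooled
# into sentence vectors, and a second Transformer models the document over
# those vectors. Sizes and pooling are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        word_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        sent_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.word_encoder = nn.TransformerEncoder(word_layer, num_layers)
        self.sent_encoder = nn.TransformerEncoder(sent_layer, num_layers)

    def forward(self, word_emb):                      # (docs, sents, words, d_model)
        d, s, w, h = word_emb.shape
        words = self.word_encoder(word_emb.view(d * s, w, h))
        sent_vecs = words.mean(dim=1).view(d, s, h)   # pool words -> sentence vectors
        return self.sent_encoder(sent_vecs)           # document-level context

doc = torch.randn(2, 6, 10, 128)                      # 2 docs, 6 sentences, 10 words
print(HierarchicalEncoder()(doc).shape)               # torch.Size([2, 6, 128])
```

The cited systems differ in how they pool and in what the word-level encoder sees; the point here is only the two-level structure.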
“…The question is therefore how to take two-dimensional locality into account. We could create two-dimensional attention patterns directly on a grid, but this would incur significant computational overhead and also prevent us from extending one-dimensional sparsifications that are known to work well [12,6]. Instead, we modify one-dimensional sparsifications to become aware of two-dimensional locality with the following trick: (i) we enumerate the pixels of the image by their Manhattan distance from the pixel at location (0, 0) (breaking ties using row priority), (ii) shift the indices of any given one-dimensional sparsification to match the Manhattan-distance enumeration instead of the reshape enumeration, and (iii) apply this new one-dimensional sparsification pattern, which respects two-dimensional locality, to the one-dimensional reshaped version of the image.…”
Section: Two-dimensional Locality (mentioning)
confidence: 99%
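The enumeration in step (i) can be made concrete with a small sketch (an assumption-laden reading of the quoted description, not the authors' code): it orders the pixels of an H x W image by Manhattan distance from (0, 0), breaking ties by row, and returns the permutation from that ordering back to row-major (reshape) indices, which is what step (ii) would shift a one-dimensional sparsification through.

```python
# Minimal sketch of step (i): enumerate the pixels of an H x W image by
# Manhattan distance from (0, 0), breaking ties by row index, and build the
# permutation from the new ordering back to row-major (reshape) positions.
import numpy as np

def manhattan_order(height: int, width: int) -> np.ndarray:
    coords = [(r, c) for r in range(height) for c in range(width)]
    # sort by Manhattan distance from (0, 0), then by row ("row priority")
    coords.sort(key=lambda rc: (rc[0] + rc[1], rc[0]))
    order = np.empty(height * width, dtype=int)
    for new_idx, (r, c) in enumerate(coords):
        order[new_idx] = r * width + c    # row-major index of the pixel
    return order

# order[k] = which row-major pixel sits at position k of the new enumeration;
# a 1-D sparsification defined over positions k can be pushed through this
# permutation before being applied to the reshaped image.
print(manhattan_order(3, 3))              # [0 1 3 2 4 6 5 7 8] for a 3 x 3 image
```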
“…Since the transformer-based approaches [14,15] adaptively assign distinct and interpretable attention to past embeddings over time, they outperform RNN-based DGNNs over long time spans. However, because the standard transformer [21] uses fully-connected attention with O(N²) connections, where N is the number of temporal patches, it incurs heavy computation on time-dependent sequences [22]. We aim to convey temporal information along the time dimension simply and effectively, and to achieve acceptable performance on inductive and transductive link prediction tasks.…”
Section: Introduction (mentioning)
confidence: 99%
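To make the O(N²) point concrete, here is a back-of-the-envelope sketch (not from the cited work) comparing the number of query-key pairs scored by fully-connected attention with a hypothetical fixed-window pattern of width w:

```python
# Back-of-the-envelope sketch (not from the cited work): the number of
# query-key pairs scored by fully-connected self-attention grows
# quadratically with the number N of temporal patches, while a hypothetical
# fixed-width window of w neighbours per patch grows only linearly
# (boundary patches are slightly overcounted here for simplicity).
def full_attention_pairs(n: int) -> int:
    return n * n

def windowed_attention_pairs(n: int, w: int = 3) -> int:
    return n * (2 * w + 1)                # each patch: itself plus w neighbours each side

for n in (128, 512, 2048):
    print(n, full_attention_pairs(n), windowed_attention_pairs(n))
# 128     16384      896
# 512    262144     3584
# 2048  4194304    14336
```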