Star-Transformer
2019 | Preprint
DOI: 10.48550/arxiv.1902.09113

Cited by 27 publications (42 citation statements) | References 0 publications
“…Sukhbaatar et al. (2019) proposed an adaptive mechanism that learns the optimal context length for each head in each layer, thus reducing the total computational and memory cost of Transformers. Guo et al. (2019) argue that the fully-connected self-attention of the Transformer is not a good inductive bias; they proposed the Star-Transformer, which links adjacent words and couples them with a central relay node to capture both local and global dependencies. With this reduction, the Star-Transformer achieved significant improvements over the standard Transformer on moderately sized datasets. However, the Star-Transformer is not suitable for auto-regressive models, in which each word should be conditioned only on its previous words, because the relay node summarizes the whole sequence.…”
Section: Lightweight Self-attention (mentioning)
confidence: 99%
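As an illustration of the star-shaped connectivity described in the statement above, here is a minimal sketch (not the authors' implementation): it builds a boolean attention mask over n token positions plus one relay node, assuming a local ring of radius r, so the number of allowed connections grows roughly linearly rather than quadratically with n.

```python
# Minimal sketch (not the authors' code) of a star-shaped attention mask:
# each of the n token positions attends to a local ring of neighbours
# (radius r, an assumed hyper-parameter) plus a relay node stored at index n;
# the relay node attends to every position. True = attention allowed.
import numpy as np

def star_attention_mask(n: int, r: int = 1) -> np.ndarray:
    size = n + 1                          # n tokens + 1 relay node
    mask = np.zeros((size, size), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        mask[i, lo:hi] = True             # local ring connections
        mask[i, n] = True                 # every token sees the relay node
    mask[n, :] = True                     # the relay node sees the whole sequence
    return mask

# roughly n * (2r + 2) + n + 1 allowed pairs instead of (n + 1)^2
print(star_attention_mask(5, r=1).astype(int))
```

Because the relay row attends to everything, such a mask cannot be used as-is for auto-regressive decoding, which is exactly the limitation the quoted statement points out.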
“…(1) Hierarchical Transformers (Miculicich et al, 2018; Liu and Lapata, 2019) use two Transformers in a hierarchical architecture: one Transformer models the sentence representation with word-level context, and another models the document representation with sentence-level context. (2) Lightweight Transformers (Child et al, 2019; Sukhbaatar et al, 2019; Guo et al, 2019; Dai et al, 2019) reduce the complexity by reconstructing the connections between tokens.…”
Section: Introduction (mentioning)
confidence: 99%
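A minimal sketch of the hierarchical scheme in (1), with illustrative dimensions and mean-pooling chosen as assumptions rather than taken from the cited papers: a word-level Transformer encodes each sentence, and the pooled sentence vectors are then contextualized by a document-level Transformer.

```python
# Minimal sketch (not from the cited papers) of a two-level hierarchy:
# a word-level Transformer encodes each sentence, its outputs are mean-pooled
# into sentence vectors, and a second Transformer models the document over
# those vectors. Sizes and pooling are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        word_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        sent_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.word_encoder = nn.TransformerEncoder(word_layer, num_layers)
        self.sent_encoder = nn.TransformerEncoder(sent_layer, num_layers)

    def forward(self, word_emb):                      # (docs, sents, words, d_model)
        d, s, w, h = word_emb.shape
        words = self.word_encoder(word_emb.view(d * s, w, h))
        sent_vecs = words.mean(dim=1).view(d, s, h)   # pool words -> sentence vectors
        return self.sent_encoder(sent_vecs)           # document-level context

doc = torch.randn(2, 6, 10, 128)                      # 2 docs, 6 sentences, 10 words
print(HierarchicalEncoder()(doc).shape)               # torch.Size([2, 6, 128])
```

The cited systems differ in how they pool and in what the word-level encoder sees; the point here is only the two-level structure.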
“…The question is therefore how to take two-dimensional locality into account. We could create two-dimensional attention patterns directly on a grid, but this would incur significant computational overhead and also prevent us from extending one-dimensional sparsifications that are known to work well [12,6]. Instead, we modify one-dimensional sparsifications to become aware of two-dimensional locality with the following trick: (i) we enumerate the pixels of the image by their Manhattan distance from the pixel at location (0, 0) (breaking ties using row priority), (ii) shift the indices of any given one-dimensional sparsification to match the Manhattan-distance enumeration instead of the reshape enumeration, and (iii) apply this new one-dimensional sparsification pattern, which respects two-dimensional locality, to the one-dimensional reshaped version of the image.…”
Section: Two-dimensional Locality (mentioning)
confidence: 99%
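The enumeration in step (i) can be made concrete with a small sketch (an assumption-laden reading of the quoted description, not the authors' code): it orders the pixels of an H x W image by Manhattan distance from (0, 0), breaking ties by row, and returns the permutation from that ordering back to row-major (reshape) indices, which is what step (ii) would shift a one-dimensional sparsification through.

```python
# Minimal sketch of step (i): enumerate the pixels of an H x W image by
# Manhattan distance from (0, 0), breaking ties by row index, and build the
# permutation from the new ordering back to row-major (reshape) positions.
import numpy as np

def manhattan_order(height: int, width: int) -> np.ndarray:
    coords = [(r, c) for r in range(height) for c in range(width)]
    # sort by Manhattan distance from (0, 0), then by row ("row priority")
    coords.sort(key=lambda rc: (rc[0] + rc[1], rc[0]))
    order = np.empty(height * width, dtype=int)
    for new_idx, (r, c) in enumerate(coords):
        order[new_idx] = r * width + c    # row-major index of the pixel
    return order

# order[k] = which row-major pixel sits at position k of the new enumeration;
# a 1-D sparsification defined over positions k can be pushed through this
# permutation before being applied to the reshaped image.
print(manhattan_order(3, 3))              # [0 1 3 2 4 6 5 7 8] for a 3 x 3 image
```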
“…Since the transformer-based approaches [14,15] adaptively assign distinct and interpretable attention to past embeddings over time, they outperform RNN-based DGNNs over long time spans. However, because the standard transformer [21] uses fully-connected attention with O(N²) connections, where N is the number of temporal patches, it incurs heavy computation on time-dependent sequences [22]. We aim to convey temporal information along the time dimension simply and effectively, and to achieve acceptable performance on inductive and transductive link prediction tasks.…”
Section: Introduction (mentioning)
confidence: 99%
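To make the O(N²) point concrete, here is a back-of-the-envelope sketch (not from the cited work) comparing the number of query-key pairs scored by fully-connected attention with a hypothetical fixed-window pattern of width w:

```python
# Back-of-the-envelope sketch (not from the cited work): the number of
# query-key pairs scored by fully-connected self-attention grows
# quadratically with the number N of temporal patches, while a hypothetical
# fixed-width window of w neighbours per patch grows only linearly
# (boundary patches are slightly overcounted here for simplicity).
def full_attention_pairs(n: int) -> int:
    return n * n

def windowed_attention_pairs(n: int, w: int = 3) -> int:
    return n * (2 * w + 1)                # each patch: itself plus w neighbours each side

for n in (128, 512, 2048):
    print(n, full_attention_pairs(n), windowed_attention_pairs(n))
# 128     16384      896
# 512    262144     3584
# 2048  4194304    14336
```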