2019
DOI: 10.48550/arxiv.1905.10630
Preprint

Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers

Abstract: In deep neural nets, lower-level embedding layers account for a large portion of the total number of parameters. Tikhonov regularization, graph-based regularization, and hard parameter sharing are approaches that introduce explicit biases into training in the hope of reducing statistical complexity. Alternatively, we propose stochastically shared embeddings (SSE), a data-driven approach to regularizing embedding layers, which stochastically transitions between embeddings during stochastic gradient descent (SGD). …
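To make the mechanism concrete, below is a minimal sketch (not the authors' code) of the SSE-SE variant in PyTorch: during training, each input index is replaced by a uniformly random index with a small probability before the embedding lookup, so parameter vectors are stochastically shared across examples. The module name, the `swap_prob` hyperparameter, and its default value are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StochasticSharedEmbedding(nn.Module):
    """Sketch of SSE-SE style regularization (illustrative, not the reference code).

    During training, each input index is replaced with a uniformly random
    index with probability `swap_prob` before the embedding lookup; at
    evaluation time the lookup is unchanged.
    """

    def __init__(self, num_embeddings: int, embedding_dim: int, swap_prob: float = 0.01):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.num_embeddings = num_embeddings
        self.swap_prob = swap_prob  # assumed hyperparameter name and default

    def forward(self, indices: torch.Tensor) -> torch.Tensor:
        if self.training and self.swap_prob > 0:
            # Bernoulli mask: True where an index should stochastically transition.
            swap = torch.rand(indices.shape, device=indices.device) < self.swap_prob
            random_idx = torch.randint_like(indices, self.num_embeddings)
            indices = torch.where(swap, random_idx, indices)
        return self.embedding(indices)
```

At evaluation time `self.training` is False, so the module reduces to a plain embedding lookup; the swap probability plays a role loosely analogous to a dropout rate and would typically be tuned on held-out data.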

Cited by 3 publications (19 citation statements)
References 19 publications
“…[16] found that adding additional personalized embeddings did not improve the performance of their Transformer model, and postulate that this is because they already use the user history and the embeddings only contribute to overfitting. Although introducing user embeddings into the model is indeed difficult with existing regularization techniques for embeddings, we show that personalization can greatly improve ranking performance with a recent regularization technique called Stochastic Shared Embeddings (SSE) [41]. The personalized Transformer (SSE-PT) model with SSE regularization works well on all 5 real-world datasets we consider, outperforming the previous state-of-the-art algorithm SASRec by almost 5% in terms of NDCG@10.…”
Section: Introduction
confidence: 90%
“…There are many other regularization techniques, including parameter sharing [5], max-norm regularization [31], gradient clipping [25], etc. Very recently, a new regularization technique called Stochastic Shared Embeddings (SSE) [41] was proposed as a means of regularizing embedding layers. [41] develops two versions of SSE, SSE-Graph and SSE-SE.…”
Section: Regularization Techniques
confidence: 99%
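For contrast with SSE-SE's uniform swaps, the sketch below illustrates the SSE-Graph idea under the assumption that related entities are supplied as a neighbor list: an index transitions to one of its graph neighbors with a small probability, so the sharing follows the graph structure. The function name, the `transition_prob` parameter, and the toy graph are hypothetical and for illustration only.

```python
import random

def sse_graph_transition(index, neighbors, transition_prob=0.05):
    """Illustrative SSE-Graph style transition (not the authors' implementation).

    With probability `transition_prob`, `index` jumps to a uniformly chosen
    neighbor in the supplied graph; indices without neighbors never move.
    """
    if neighbors.get(index) and random.random() < transition_prob:
        return random.choice(neighbors[index])
    return index

# Toy graph of related items: 0 <-> 1, 0 <-> 2; item 3 has no known neighbors.
neighbors = {0: [1, 2], 1: [0], 2: [0]}
batch = [0, 3, 2, 1]
perturbed = [sse_graph_transition(i, neighbors) for i in batch]  # applied per SGD step
```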