We study large-scale kernel methods for acoustic modeling and compare them to DNNs on performance metrics related to both acoustic modeling and recognition. Measured by perplexity and frame-level classification accuracy, kernel-based acoustic models are as effective as their DNN counterparts. However, on token error rates DNN models can be significantly better. We find that this gap may be attributed to the DNN's unique strength in reducing both the perplexity and the entropy of the predicted posterior probabilities. Motivated by these findings, we propose a new technique, entropy regularized perplexity, for model selection. This technique noticeably improves the recognition performance of both types of models and reduces the gap between them. While demonstrated on Broadcast News, the technique could also be applicable to other tasks.
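The abstract does not spell out how entropy regularized perplexity is computed, so the following is only a minimal sketch of one plausible instantiation: combine the frame-level perplexity with the average entropy of the predicted posteriors through a trade-off weight. The weight `lam` and the exact form of the combination are assumptions, not taken from the paper.

```python
import numpy as np

def entropy_regularized_perplexity(posteriors, labels, lam=1.0, eps=1e-12):
    """Model-selection score combining perplexity with the entropy of the
    predicted posteriors. `lam` is an assumed trade-off weight; the paper's
    exact formulation may differ.

    posteriors: (num_frames, num_states) predicted posterior probabilities
    labels:     (num_frames,) integer frame labels
    """
    p = np.clip(posteriors, eps, 1.0)
    # Frame-level cross-entropy of the reference labels -> perplexity.
    nll = -np.mean(np.log(p[np.arange(len(labels)), labels]))
    perplexity = np.exp(nll)
    # Average entropy of the predicted posterior distributions.
    entropy = -np.mean(np.sum(p * np.log(p), axis=1))
    # Lower is better: penalize models whose posteriors are high-entropy.
    return np.log(perplexity) + lam * entropy

# Example usage: pick the model with the lowest score on held-out frames.
# best = min(models, key=lambda m: entropy_regularized_perplexity(m.predict(X_dev), y_dev))
```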
We study the settings for which deep contextual embeddings (e.g., BERT) give large improvements in performance relative to classic pretrained embeddings (e.g., GloVe) and an even simpler baseline, random word embeddings, focusing on the impact of the training set size and the linguistic properties of the task. Surprisingly, we find that both of these simpler baselines can match contextual embeddings on industry-scale data, and often perform within 5 to 10% accuracy (absolute) on benchmark tasks. Furthermore, we identify properties of data for which contextual embeddings give particularly large gains: language containing complex structure, ambiguous word usage, and words unseen in training.
(1) This aligns with recent observations from experiments with classic word embeddings at Apple (Ré et al., 2020). (2) These tasks are proprietary, so we share these results anecdotally as motivation for our study.
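As an illustration of the random-word-embedding baseline mentioned above, here is a minimal sketch assuming a frozen i.i.d. Gaussian lookup table and mean pooling; the dimensionality, scaling, and pooling choice are assumptions rather than the paper's exact setup.

```python
import numpy as np

def random_embedding_table(vocab, dim=300, seed=0):
    """Fixed random word embeddings: each word gets an i.i.d. Gaussian vector.
    The table is generated once and kept frozen, mirroring how pretrained
    GloVe vectors would be used; dim=300 and the 1/sqrt(dim) scale are
    assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    return {w: rng.normal(0.0, 1.0 / np.sqrt(dim), size=dim) for w in vocab}

def embed_sentence(tokens, table, dim=300):
    """Average the word vectors; words missing from the table map to zeros."""
    vecs = [table.get(t, np.zeros(dim)) for t in tokens]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```

A downstream classifier trained on these pooled vectors gives the random-embedding baseline against which GloVe and BERT features can be compared.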
The impact of blogs and microblogging on the consumption of news is dramatic, as every day users rely more on these sources to decide what content to pay attention to. In this work, we empirically and theoretically analyze the dynamics of bloggers serving as intermediaries between the mass media and the general public. Our first contribution is to precisely describe the receiving and posting behaviors of today's social media users. For the first time, we jointly study the volume and popularity of URLs received and shared by users. We show that social media platforms exhibit a natural "content curation" process. Users, and bloggers in particular, obey two filtering laws: (1) a user who receives less content typically receives more popular content, and (2) a blogger who is less active typically posts disproportionately popular items. Our observations are remarkably consistent across 11 social media data sets. We find evidence of a variety of posting strategies, which motivates our second contribution: a theoretical understanding of the consequences of strategic posting on the stability of social media and its ability to satisfy the interests of a diverse audience. We introduce a "blog-positioning game" and show that it can lead to "efficient" equilibria, in which users generally receive the content they are interested in. Interestingly, this model predicts that if users are overly "picky" when choosing whom to follow, no pure-strategy equilibrium exists for the bloggers, and thus the game never converges. However, a bit of leniency by the readers in choosing which bloggers to follow is enough to guarantee convergence.
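As a rough illustration of how filtering law (1) could be checked on a receive log, here is a small sketch; the input format and the Spearman correlation test are illustrative assumptions, not the paper's methodology.

```python
import numpy as np
from collections import defaultdict
from scipy.stats import spearmanr

def check_filtering_law_1(receive_events):
    """receive_events: iterable of (user_id, url_popularity) pairs, one per
    URL a user received. Law (1) predicts a negative association between a
    user's received volume and the typical popularity of that content.
    """
    volume = defaultdict(int)
    popularity = defaultdict(list)
    for user, pop in receive_events:
        volume[user] += 1
        popularity[user].append(pop)
    users = list(volume)
    volumes = [volume[u] for u in users]
    median_pop = [np.median(popularity[u]) for u in users]
    rho, pval = spearmanr(volumes, median_pop)
    return rho, pval  # law (1) would show rho < 0
```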
Compressing word embeddings is important for deploying NLP models in memory-constrained settings. However, understanding what makes compressed embeddings perform well on downstream tasks is challenging: existing measures of compression quality often fail to distinguish between embeddings that perform well and those that do not. We thus propose the eigenspace overlap score as a new measure. We relate the eigenspace overlap score to downstream performance by developing generalization bounds for the compressed embeddings in terms of this score, in the context of linear and logistic regression. We then show that we can lower bound the eigenspace overlap score for a simple uniform quantization compression method, helping to explain the strong empirical performance of this method. Finally, we show that by using the eigenspace overlap score as a selection criterion between embeddings drawn from a representative set we compressed, we can efficiently identify the better performing embedding with up to 2× lower selection error rates than the next best measure of compression quality, and avoid the cost of training a model for each task of interest.
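For intuition, here is a minimal sketch of an eigenspace overlap computation between an embedding matrix and a uniformly quantized copy; the normalization by the larger subspace dimension and the quantizer details are assumptions that may differ from the paper's exact definitions.

```python
import numpy as np

def eigenspace_overlap_score(X, X_tilde):
    """Overlap between the left singular subspaces of the original embedding
    matrix X (n x d) and its compressed counterpart X_tilde (n x d').
    Returns a value in [0, 1]; the max(d, d') normalization is an assumption.
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_t, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
    return np.linalg.norm(U.T @ U_t, ord="fro") ** 2 / max(U.shape[1], U_t.shape[1])

def uniform_quantize(X, num_bits=4):
    """Simple uniform quantization of an embedding matrix to 2**num_bits levels."""
    lo, hi = X.min(), X.max()
    levels = 2 ** num_bits - 1
    q = np.round((X - lo) / (hi - lo) * levels)
    return q / levels * (hi - lo) + lo

# Example usage: score a quantized embedding against the original and use the
# score to choose among candidate compressed embeddings without training models.
# score = eigenspace_overlap_score(X, uniform_quantize(X, num_bits=4))
```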