Despite their ubiquity, word embeddings trained with skip-gram negative sampling (SGNS) remain poorly understood. We find that vector positions are not simply determined by semantic similarity, but rather occupy a narrow cone, diametrically opposed to the context vectors. We show that this geometric concentration depends on the ratio of positive to negative examples, and that it is neither theoretically nor empirically inherent in related embedding algorithms.
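As an illustration of how this concentration could be probed (a minimal sketch, not the paper's exact measurement), the numpy snippet below compares cosine similarities within the word vectors and between word and context vectors. The matrices here are random stand-ins; with gensim, the analogous arrays would be `model.wv.vectors` and `model.syn1neg` from a trained Word2Vec model, for which the first statistic is typically large and positive (the narrow cone) and the second negative (the diametric opposition).

```python
import numpy as np

# Sketch of probing SGNS geometry. The matrices are random stand-ins;
# with gensim, the analogous arrays are model.wv.vectors (word vectors)
# and model.syn1neg (context vectors) of a trained Word2Vec model.
rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 100))   # word (input) vectors
C = rng.normal(size=(1000, 100))   # context (output) vectors

def unit_rows(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

Wn, Cn = unit_rows(W), unit_rows(C)

# Mean pairwise cosine among word vectors: for real SGNS vectors this is
# large and positive, the signature of a narrow cone around a shared direction.
mean_ww = (Wn @ Wn.T).mean()

# Mean cosine between each word vector and its context vector: for real SGNS
# vectors this is negative, i.e. the two sets sit opposite the origin.
mean_wc = (Wn * Cn).sum(axis=1).mean()

print(f"mean word-word cosine:    {mean_ww:.3f}")   # ~0 here; >0 for real SGNS
print(f"mean word-context cosine: {mean_wc:.3f}")   # ~0 here; <0 for real SGNS
```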
Program equivalence is a fundamental problem with practical applications across many areas of computing, including compilation, optimization, software synthesis, and formal verification. Equivalence is undecidable in general, but in certain settings it is possible to develop domain-specific languages that are expressive enough to be practical and yet sufficiently restricted that equivalence remains decidable. In previous work we introduced NetKAT, a domain-specific language for specifying and verifying network packet-processing functions. NetKAT provides familiar constructs such as tests, assignments, union, sequential composition, and iteration, as well as custom primitives for modifying packet headers and encoding network topologies. Semantically, NetKAT is based on Kleene algebra with tests (KAT) and comes equipped with a sound and complete equational theory. Although NetKAT equivalence is decidable, the best known algorithm is hardly practical: it uses Savitch's theorem to determinize a PSPACE algorithm and requires quadratic space. This paper presents a new algorithm for deciding NetKAT equivalence. The algorithm is based on finding bisimulations between finite automata constructed from NetKAT programs. We investigate the coalgebraic theory of NetKAT, generalize the notion of Brzozowski derivatives to NetKAT, develop efficient representations of NetKAT automata in terms of spines and sparse matrices, and discuss the highlights of our prototype implementation.
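The bisimulation idea at the heart of such a procedure can be sketched with a Hopcroft-Karp-style union-find check. The snippet below operates on ordinary DFAs rather than NetKAT automata (whose states, spines, and sparse-matrix transitions are considerably richer), so treat it as a schematic of the technique, not the paper's implementation.

```python
# Schematic Hopcroft-Karp-style bisimulation check between two DFAs that
# share one transition table over disjoint state sets. States are ints,
# `delta` maps (state, symbol) -> state, `accept` is the set of accepting
# states. Two states are equivalent iff no reachable pair disagrees on
# acceptance; union-find merges pairs assumed equivalent so far.
def bisimilar(start1, start2, delta, accept, alphabet):
    parent = {}

    def find(s):
        parent.setdefault(s, s)
        while parent[s] != s:
            parent[s] = parent[parent[s]]   # path halving
            s = parent[s]
        return s

    def union(a, b):
        parent[find(a)] = find(b)

    todo = [(start1, start2)]
    while todo:
        p, q = todo.pop()
        if find(p) == find(q):
            continue                        # already assumed equivalent
        if (p in accept) != (q in accept):
            return False                    # observable difference found
        union(p, q)
        for a in alphabet:                  # explore successor pairs
            todo.append((delta[(p, a)], delta[(q, a)]))
    return True

# Two equivalent DFAs over {0,1}, each accepting strings with an odd number
# of 1s: states {0,1} for the first automaton, {2,3} for the second.
delta = {(0, '0'): 0, (0, '1'): 1, (1, '0'): 1, (1, '1'): 0,
         (2, '0'): 2, (2, '1'): 3, (3, '0'): 3, (3, '1'): 2}
print(bisimilar(0, 2, delta, accept={1, 3}, alphabet='01'))  # True
```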
Phrase representations derived from BERT often do not exhibit complex phrasal compositionality, as the model relies instead on lexical similarity to determine semantic relatedness. In this paper, we propose a contrastive fine-tuning objective that enables BERT to produce more powerful phrase embeddings. Our approach (Phrase-BERT) relies on a dataset of diverse phrasal paraphrases, which is automatically generated using a paraphrase generation model, as well as a large-scale dataset of phrases in context mined from the Books3 corpus. Phrase-BERT outperforms baselines across a variety of phrase-level similarity tasks, while also demonstrating increased lexical diversity among nearest neighbors in the vector space. Finally, as a case study, we show that Phrase-BERT embeddings can be easily integrated with a simple autoencoder to build a phrase-based neural topic model that interprets topics as mixtures of words and phrases by performing a nearest neighbor search in the embedding space. Crowdsourced evaluations demonstrate that this phrase-based topic model produces more coherent and meaningful topics than baseline word and phrase-level topic models, further validating the utility of Phrase-BERT.
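A triplet-style contrastive objective of the general kind described here can be sketched in a few lines of PyTorch. The encoder outputs, margin, and batch below are placeholders, not Phrase-BERT's actual training configuration.

```python
import torch
import torch.nn.functional as F

# Triplet-style contrastive objective over phrase embeddings. The random
# tensors stand in for encoder outputs (e.g. pooled BERT token states); the
# margin and batch are placeholders, not Phrase-BERT's training setup.
def contrastive_loss(anchor, positive, negative, margin=1.0):
    # Pull each anchor toward its paraphrase and away from an unrelated phrase.
    d_pos = 1 - F.cosine_similarity(anchor, positive)
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy batch: 8 phrase embeddings of dimension 768 per role.
anchor, positive, negative = (torch.randn(8, 768) for _ in range(3))
print(contrastive_loss(anchor, positive, negative).item())
```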
Duplicate documents are a pervasive problem in text datasets and can have a strong effect on unsupervised models. Methods to remove duplicate texts are typically heuristic or very expensive, so it is vital to know when and why they are needed. We measure the sensitivity of two latent semantic methods to the presence of different levels of document repetition. By artificially creating different forms of duplicate text we confirm several hypotheses about how repeated text impacts models. While a small amount of duplication is tolerable, substantial over-representation of subsets of the text may overwhelm meaningful topical patterns.
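As a concrete example of the cheap end of the heuristic spectrum the abstract mentions, the sketch below removes exact duplicates by hashing normalized text; near-duplicate detection (e.g. shingling with MinHash) is the more expensive counterpart and is not shown.

```python
import hashlib

# Exact-duplicate removal by hashing whitespace/case-normalized text.
# This is the cheapest heuristic; near-duplicate detection (shingling,
# MinHash) is substantially more expensive.
def deduplicate(docs):
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.md5(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["The cat sat.", "the  CAT sat.", "A different document."]
print(deduplicate(docs))   # case/whitespace variants collapse to one copy
```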