In this work, we present a naïve initialization scheme for word vectors based on a dense, independent co-occurrence model and provide preliminary results that suggest it is competitive and warrants further investigation. Specifically, we demonstrate through information-theoretic minimum description length (MDL) probing that our model, EigenNoise, can approach the performance of empirically trained GloVe despite using no pre-training data. We present these preliminary results to set the stage for further investigation into how this competitive initialization works without pre-training data, and to invite the exploration of more intelligent initialization schemes informed by the theory of harmonic linguistic structure. Our application of this theory likewise contributes a novel (and effective) interpretation of recent discoveries that have elucidated the underlying distributional information that linguistic representations capture from data and contrast distributions.
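The abstract does not spell out how a dense, independent co-occurrence model can yield word vectors without any pre-training data. Below is a minimal, hypothetical sketch in NumPy of one such construction: a Zipfian rank prior combined with an independence assumption, factored by eigendecomposition. The function name dense_cooccurrence_prior and every modeling choice here are illustrative assumptions, not the published EigenNoise recipe.

```python
import numpy as np

def dense_cooccurrence_prior(vocab_size: int, dim: int, seed: int = 0) -> np.ndarray:
    """Hypothetical sketch: build word vectors from a dense, data-free
    co-occurrence model in which every pair of words is assumed to
    co-occur at a rate set only by their Zipfian ranks (an independence
    assumption), then factor that synthetic matrix by eigendecomposition.
    This illustrates the kind of scheme the abstract describes; it is not
    the published EigenNoise construction."""
    rng = np.random.default_rng(seed)
    # Rank-based unigram "frequencies"; no corpus is consulted.
    p = 1.0 / np.arange(1, vocab_size + 1)
    p /= p.sum()
    # Independence: expected co-occurrence is the outer product of unigrams.
    C = np.outer(p, p)
    # A touch of noise breaks the rank-1 degeneracy so `dim` directions exist.
    C += rng.normal(scale=C.mean() * 1e-3, size=C.shape)
    C = (C + C.T) / 2.0  # re-symmetrize before eigendecomposition
    vals, vecs = np.linalg.eigh(C)
    top = np.argsort(vals)[::-1][:dim]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))

vectors = dense_cooccurrence_prior(vocab_size=1000, dim=50)
print(vectors.shape)  # (1000, 50)
```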
The Princeton WordNet is a powerful tool for studying language and developing natural language processing algorithms. Among the significant efforts to develop it further, one line of work considers extending it by aligning its expert-annotated structure with other lexical resources. In contrast, this work explores a completely data-driven approach to network construction, forming a wordnet from the entirety of the open-source, noisy, user-annotated dictionary Wiktionary. Comparing baselines to WordNet, we find compelling evidence that our network induction process constructs a network with useful semantic structure. With thousands of semantically linked examples that demonstrate sense usage, from basic lemmas to multiword expressions (MWEs), we believe this work motivates future research.
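The abstract leaves the network induction process unspecified. As a rough illustration of what data-driven induction from dictionary glosses can look like, here is a hypothetical gloss-overlap sketch over a few toy entries; the entries, the linking rule, and the function induce_network are all assumptions for illustration, not the paper's method.

```python
from collections import defaultdict

# Toy entries in the spirit of Wiktionary glosses; real data is far noisier
# and includes multiword expressions (MWEs) as headwords.
entries = {
    "dog": ["a domesticated canine animal"],
    "canine": ["of or relating to a dog"],
    "kick the bucket": ["to die"],
    "die": ["to stop living"],
}

def induce_network(entries):
    """Hypothetical gloss-overlap induction: link each headword to every
    other headword that appears in one of its sense definitions. This is
    an illustration of the general idea, not the paper's actual method."""
    vocab = set(entries)
    edges = defaultdict(set)
    for head, glosses in entries.items():
        for gloss in glosses:
            text = gloss.lower()
            for other in vocab - {head}:
                # Substring match lets MWE headwords be found inside glosses.
                if other in text.split() or f" {other} " in f" {text} ":
                    edges[head].add(other)
                    edges[other].add(head)  # undirected semantic link
    return edges

print({k: sorted(v) for k, v in induce_network(entries).items()})
```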
The development of state-of-the-art (SOTA) Natural Language Processing (NLP) systems has steadily been establishing new techniques to absorb the statistics of linguistic data. These techniques often trace well-known constructs from traditional theories, and we study these connections to close gaps around key NLP methods as a means to orient future work. For this, we introduce an analytic model of the statistics learned by seminal algorithms (including GloVe and Word2Vec), and derive insights for systems that use these algorithms and the statistics of co-occurrence in general. In this work, we derive, to the best of our knowledge, the first known solution to Word2Vec's softmax-optimized, skip-gram algorithm. This result presents exciting potential for future development as a direct solution to a deep learning (DL) language model's (LM's) matrix factorization. However, in this work we use the solution to demonstrate the seemingly universal existence of a property of word vectors that allows for the prophylactic discernment of biases in data, prior to their absorption by DL models. To qualify our work, we conduct an analysis of independence, i.e., of the density of statistical dependencies in co-occurrence models, which in turn renders insights on the distributional hypothesis' partial fulfillment by co-occurrence statistics.

Motivation

Suppose one wished to randomly optimize a Rube Goldberg machine (RGM) over many dominoes with the intent of accomplishing a small downstream task. Should the RGM be initialized to a random state, with dominoes scattered haphazardly, i.e., with no prior? Or would it help more to constrain the RGM to initializations with all dominoes standing on end? Perhaps less effort could be used to modify the dominoes-on-end state for the goal, but that depends on the goal and on how dominoes can be used to transfer energy over long ranges. Pre-trained models are often used as initializations, eventually applied to downstream NLP tasks like part-of-speech tagging or machine translation. This means model pre-training is a lot like initializing an RGM to a highly potentiated state, while retaining a flexibility/generality to optimize sharply for the diversity of phenomena that can depend on statistical, linguistic information. A challenge partly met by big-data pre-training is the need for models to remain useful on a large diversity of data and tasks. Under the RGM analogy, pre-training over big data simply potentiates more dominoes, in more usefully correlated ways, where 'useful' is defined hands-off by a model's parametric ability to explain language, i.e., which words were where. However, if we knew how many dominoes should be on end at the start and how many should be in configurations that make stairs, etc., it seems plausible to initialize the RGM with distributionally useful tools, given what we know about how humans use dominoes to transfer energy, i.e., the statistics of how humans use vocabularies to communicate. We investigate these questions, replacing 'domino' with 'parameter', an...
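As background for the co-occurrence statistics referenced in the abstract above, here is a minimal NumPy sketch of the raw statistics such analytic models operate on, followed by a standard PMI factorization into word vectors. The closed-form solution to the softmax-optimized skip-gram objective that the paper claims to derive is not reproduced here; the toy corpus, window size, and PMI target are illustrative assumptions only.

```python
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2

# Symmetric window co-occurrence counts: the raw statistics that GloVe,
# Word2Vec, and analytic co-occurrence models are all built on.
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            C[idx[w], idx[corpus[j]]] += 1

# Pointwise mutual information, a standard analytic target for co-occurrence
# models (background illustration only; not the paper's derived solution to
# the softmax-optimized skip-gram objective).
total = C.sum()
pw = C.sum(axis=1, keepdims=True) / total
pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log((C / total) / (pw * pc))
pmi[np.isneginf(pmi)] = 0.0  # unseen pairs contribute nothing

# Low-rank factorization of the PMI matrix yields dense word vectors.
U, S, _ = np.linalg.svd(pmi)
word_vectors = U[:, :2] * np.sqrt(S[:2])
print(dict(zip(vocab, np.round(word_vectors, 2))))
```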