The present paper intends to draw out the conception of language implied in the technique of word embeddings that supported the recent development of deep neural network models in computational linguistics. After a preliminary presentation of the basic functioning of elementary artificial neural networks, we introduce the motivations and capabilities of word embeddings through one of their pioneering models, word2vec. To assess the remarkable results of the latter, we inspect the nature of its underlying mechanisms, which have been characterized as the implicit factorization of a word-context matrix. We then discuss the common association of the "distributional hypothesis" with a "use theory of meaning", often invoked as the theoretical basis of word embeddings, and contrast them with the theory of meaning that stems from those mechanisms, seen through the lens of matrix models (such as VSMs and DSMs). Finally, we trace the principles of their possible consistency back through Harris's original distributionalism to the structuralist conception of language of Saussure and Hjelmslev. In addition to giving non-specialist readers access to the technical literature and the state of the art in the field of Natural Language Processing, the paper seeks to reveal the conceptual and philosophical stakes involved in the recent application of new neural network techniques to the computational treatment of language.
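For reference, the implicit-factorization characterization the abstract alludes to is the standard result due to Levy and Goldberg (2014) for skip-gram with negative sampling (SGNS). The following is a sketch of that standard result, not of the paper's own derivation, assuming k negative samples and word/context vectors w and c.

```latex
% Standard implicit-factorization result for SGNS (Levy & Goldberg, 2014):
% at the optimum of the SGNS objective with k negative samples, the
% word-context dot products factorize a shifted PMI matrix.
\begin{equation}
  \vec{w}^{\top}\vec{c} \;=\; \mathrm{PMI}(w, c) - \log k
  \;=\; \log \frac{P(w, c)}{P(w)\,P(c)} - \log k .
\end{equation}
```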
The recent success of deep neural network techniques in natural language processing relies heavily on the so-called distributional hypothesis. We suggest that the latter can be understood as a simplified version of the classic structuralist hypothesis, at the core of a programme aiming to reconstruct grammatical structures from first principles and corpus analysis. We then propose to reinterpret the structuralist programme with insights from proof theory, in particular by associating paradigmatic relations and units with formal types defined through an appropriate notion of interaction. In this way, we intend to build original conceptual bridges between computational logic and classic structuralism, which can contribute to understanding the recent advances in NLP.
Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a (1/σ(μ⋆))(1 − e^(−σ(μ⋆)))-approximation of an optimal merge sequence, where σ(μ⋆) is the total backward curvature with respect to the optimal merge sequence μ⋆. Empirically, the lower bound of the approximation is ≈ 0.37. We provide a faster implementation of BPE which improves the runtime complexity from O(NM) to O(N log M), where N is the sequence length and M is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
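To illustrate the greedy merge procedure the abstract refers to, here is a minimal sketch of BPE training on a toy corpus. The function name, the toy corpus, and the naive full rescanning are our own illustrative choices; this is not the paper's optimized O(N log M) implementation.

```python
from collections import Counter

def greedy_bpe_merges(corpus_words, num_merges):
    """Minimal greedy BPE sketch: repeatedly merge the most frequent
    adjacent symbol pair. Illustrative only, not an optimized version."""
    # Represent each word as a tuple of symbols (here: characters).
    seqs = [tuple(word) for word in corpus_words]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pair_counts = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pair_counts[(a, b)] += 1
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # greedy choice
        merges.append(best)
        # Apply the chosen merge everywhere in the corpus.
        merged = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(seq[i] + seq[i + 1])
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            merged.append(tuple(out))
        seqs = merged
    return merges, seqs

# Example: learn 3 merges from a toy corpus.
merges, segmented = greedy_bpe_merges(["lower", "lowest", "low"], num_merges=3)
print(merges)     # e.g. [('l', 'o'), ('lo', 'w'), ...]
print(segmented)
```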
Subword tokenization is a key part of many NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to better downstream model performance than others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum possible entropy of the token distribution. Yet an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency tokens and very short codes to high-frequency tokens. Defining efficiency in terms of Rényi entropy, on the other hand, penalizes distributions with either very high-frequency or very low-frequency tokens. In machine translation, we find that across multiple tokenizers, the Rényi entropy with α = 2.5 has a very strong correlation with BLEU: 0.78, in comparison to just −0.32 for compressed length.
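To make the efficiency notion concrete, here is a small sketch that computes the entropy-based efficiency of a token distribution from raw counts. The normalization by the maximum possible entropy (the log of the vocabulary size) follows the definition given in the abstract; the function name and the toy counts are our own illustrative assumptions.

```python
import math
from collections import Counter

def renyi_efficiency(token_counts, alpha=2.5):
    """Rényi efficiency of a token distribution: Rényi entropy of order
    alpha divided by the maximum possible entropy, log(vocab size).
    Illustrative sketch of the quantity described in the abstract."""
    total = sum(token_counts.values())
    probs = [c / total for c in token_counts.values()]
    vocab_size = len(probs)
    if alpha == 1.0:
        # Shannon entropy as the limiting case alpha -> 1.
        h = -sum(p * math.log(p) for p in probs)
    else:
        h = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return h / math.log(vocab_size)

# Toy token distribution with both high- and low-frequency tokens.
counts = Counter({"the": 1000, "of": 600, "token": 50, "entropy": 5, "Rényi": 1})
print(renyi_efficiency(counts, alpha=2.5))  # efficiency in [0, 1]
print(renyi_efficiency(counts, alpha=1.0))  # Shannon-entropy efficiency
```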