String similarity models are vital for record linkage, entity resolution, and search. In this work, we present STANCE, a learned model for computing the similarity of two strings. Our approach encodes the characters of each string, aligns the encodings using Sinkhorn iteration (alignment is posed as an instance of optimal transport), and scores the alignment with a convolutional neural network. We evaluate STANCE's ability to detect whether two strings can refer to the same entity, a task we term alias detection. We construct five new alias detection datasets (and make them publicly available). We show that STANCE (or one of its variants) outperforms both state-of-the-art and classic, parameter-free similarity models on four of the five datasets. We also demonstrate STANCE's ability to improve downstream tasks by applying it to an instance of cross-document coreference and show that it leads to a 2.8-point improvement in B³ F1 over the previous state-of-the-art approach.
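The alignment step described above can be illustrated with a short Sinkhorn-style sketch. This is a minimal, hypothetical example; the function name, matrix shapes, temperature, and iteration count are our own illustrative choices, not taken from the paper:

```python
import numpy as np

def sinkhorn_alignment(sim, n_iters=20, temperature=1.0):
    """Soft-align two encoded strings from a pairwise similarity matrix.

    sim: (m, n) similarities between character encodings of string A
    (length m) and string B (length n). Returns an (approximately)
    doubly normalized matrix whose entries act as soft alignment weights.
    """
    # Exponentiate similarities so all entries are positive.
    K = np.exp(sim / temperature)
    # Alternately normalize rows and columns (Sinkhorn iteration).
    for _ in range(n_iters):
        K = K / K.sum(axis=1, keepdims=True)   # row normalization
        K = K / K.sum(axis=0, keepdims=True)   # column normalization
    return K

# Toy usage with random character encodings for two strings.
rng = np.random.default_rng(0)
enc_a = rng.normal(size=(5, 8))   # 5 characters, 8-dim encodings
enc_b = rng.normal(size=(7, 8))   # 7 characters, 8-dim encodings
alignment = sinkhorn_alignment(enc_a @ enc_b.T)
print(alignment.shape)            # (5, 7); such an alignment would then be scored
```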
Natural language reduplication can pose a challenge to neural models of language and has been argued to require variables (Marcus et al., 1999). Sequence-to-sequence neural networks have been shown to perform well at a number of other morphological tasks (Cotterell et al., 2016) and to produce results that correlate highly with human behavior (Kirov, 2017; Kirov & Cotterell, 2018), yet they include no explicit variables in their architecture. We find that such networks can learn a reduplicative pattern that generalizes to novel segments if they are trained with dropout (Srivastava et al., 2014). We argue that this matches the scope of generalization observed in human reduplication.
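A minimal sketch of the kind of setup this abstract describes, a sequence-to-sequence network trained with dropout on a reduplication task. The symbol inventory, layer sizes, and the specific reduplication pattern are illustrative assumptions, not the paper's actual configuration:

```python
import torch
import torch.nn as nn

# Hypothetical illustration: a small encoder-decoder mapping a syllable
# sequence to its reduplicated form (e.g. an ABB pattern: "wi di" -> "wi di di").
class Seq2Seq(nn.Module):
    def __init__(self, n_symbols, hidden=64, p_dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, hidden)
        self.dropout = nn.Dropout(p_dropout)   # dropout is the key ingredient per the abstract
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_symbols)

    def forward(self, src, tgt):
        _, h = self.encoder(self.dropout(self.embed(src)))
        dec, _ = self.decoder(self.dropout(self.embed(tgt)), h)
        return self.out(dec)                   # logits over output symbols

model = Seq2Seq(n_symbols=30)
logits = model(torch.randint(0, 30, (4, 2)), torch.randint(0, 30, (4, 3)))
print(logits.shape)                            # torch.Size([4, 3, 30])
```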
A current open question in natural language processing is to what extent language models, which are trained with access only to the form of language, are able to capture the meaning of language. In many cases, meaning constrains form in consistent ways. This raises the possibility that some kinds of information about form might reflect meaning more transparently than others. The goal of this study is to investigate under what conditions we can expect meaning and form to covary sufficiently, such that a language model with access only to form might nonetheless succeed in emulating meaning. Focusing on propositional logic, we generate training corpora using a variety of motivated constraints, and measure a distributional language model's ability to differentiate logical symbols (¬, ∧, ∨). Our findings are largely negative: none of our simulated training corpora result in models which definitively differentiate meaningfully different symbols (e.g., ∧ vs. ∨), suggesting a limitation to the types of semantic signals that current models are able to exploit.
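One natural instance of the constraints this abstract alludes to is that speakers tend to assert true sentences. The following is a hypothetical sketch of corpus generation under such a truth constraint; the variable names, formula inventory, and sampling scheme are our own illustration, not the paper's actual procedure:

```python
import random

# Illustrative sketch: sample small propositional formulas over two variables
# and keep only those that are true under a randomly drawn assignment, so that
# meaning (truth) constrains which forms appear in the training corpus.
VARS = ["p", "q"]
OPS = {"∧": lambda a, b: a and b, "∨": lambda a, b: a or b}

def sample_formula(rng):
    a, b = rng.sample(VARS, 2)
    op = rng.choice(sorted(OPS))
    negated = rng.random() < 0.5
    text = f"¬ ( {a} {op} {b} )" if negated else f"{a} {op} {b}"
    def truth(assignment):
        value = OPS[op](assignment[a], assignment[b])
        return (not value) if negated else value
    return text, truth

rng = random.Random(0)
corpus = []
while len(corpus) < 1000:
    assignment = {v: rng.random() < 0.5 for v in VARS}
    text, truth = sample_formula(rng)
    if truth(assignment):          # keep only formulas true in the sampled "context"
        corpus.append(text)
print(corpus[:5])
```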
Reduplicative linguistic patterns have been used as evidence for explicit algebraic variables in models of cognition. Here, we show that a variable-free neural network can model these patterns in a way that predicts observed human behavior. Specifically, we successfully simulate the three experiments presented by Marcus et al. (1999), as well as Endress et al.'s (2007) partial replication of one of those experiments. We then explore the model's ability to generalize reduplicative mappings to different kinds of novel inputs. Using Berent's (2013) scopes of generalization as a metric, we claim that the model matches the scope of generalization that has been observed in humans. We argue that these results challenge past claims about the necessity of symbolic variables in models of cognition.