We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.

Keywords: computational linguistics | grammar induction | language acquisition | machine learning | protein classification

Many types of sequential symbolic data possess structure that is (i) hierarchical and (ii) context-sensitive. Natural-language text and transcribed speech are prime examples of such data: a corpus of language consists of sentences defined over a finite lexicon of symbols such as words. Linguists traditionally analyze the sentences into recursively structured phrasal constituents (1); at the same time, a distributional analysis of partially aligned sentential contexts (2) reveals in the lexicon clusters that are said to correspond to various syntactic categories (such as nouns or verbs). Such structure, however, is not limited to the natural languages; recurring motifs are found, on a level of description that is common to all life on earth, in the base sequences of DNA that constitute the genome.
We introduce an unsupervised algorithm that discovers hierarchical structure in any sequence data, on the basis of the minimal assumption that the corpus at hand contains partially overlapping strings at multiple levels of organization. In the linguistic domain, our algorithm has been successfully tested both on artificial-grammar output and on natural-language corpora such as ATIS (3), CHILDES (4), and the Bible (5). In bioinformatics, the algorithm has been shown to extract from protein sequences syntactic structures that are highly correlated with the functional properties of these proteins.

The ADIOS Algorithm for Grammar-Like Rule Induction

In a machine learning paradigm for grammar induction, a teacher produces a sequence of strings generated by a grammar G0, and a learner uses the resulting corpus to construct a grammar G, aiming to approximate G0 in some sense (6). Recent evidence suggests that natural language acquisition involves both statistical computation (e.g., in speech segmentation) and rule-like algebraic processes (e.g., in structured generalization) (7-11). Modern computatio...
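The distributional analysis of partially aligned sentences mentioned in the ADIOS abstract can be illustrated with a toy sketch. This is not the ADIOS algorithm, whose statistical pattern-extraction criterion is not described in this excerpt; it only shows how words that fill the same slot between identical neighbours can be grouped into candidate classes. The function name `context_classes` and the boundary markers `<s>` and `</s>` are assumptions made for this example.

```python
from collections import defaultdict


def context_classes(sentences):
    """Group words that occur between the same left and right neighbours.

    Illustrative only: real grammar induction needs a statistical
    significance criterion, which this sketch deliberately omits.
    """
    slots = defaultdict(set)
    for sentence in sentences:
        # Hypothetical boundary markers so that first and last words
        # also have a two-sided context.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
            slots[(left, right)].add(word)
    # Keep only contexts shared by at least two distinct words.
    return {ctx: words for ctx, words in slots.items() if len(words) > 1}
```

Applied to the sentences "the cat sleeps" and "the dog sleeps", the context ("the", "sleeps") collects the words "cat" and "dog", a crude analogue of a syntactic category.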
We derive and discuss the finite-energy sum rules, which form consistency conditions imposed by analyticity on the Regge analysis of a scattering amplitude. Their finite form makes them particularly useful in practical applications. We discuss the various applications, emphasizing a new kind of bootstrap predicting the Regge parameters from low-energy data alone. We apply our methods to πN charge exchange and are able to derive many interesting features of the high-energy amplitudes at various t. In particular, we establish the existence of zeros of the amplitudes and of additional ρ poles. On the basis of the finite-energy sum rules and the analysis of the πN amplitudes, we present theoretical and experimental evidence that double counting is involved in the interference model, which adds direct-channel resonances to the exchanged Regge terms.
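The abstract gives no formula, so the following is only a schematic illustration of what a finite-energy sum rule looks like. It assumes, for this sketch alone, that above a cutoff energy N the imaginary part of the amplitude is described by a sum of Regge powers, Im F(ν) ≈ Σ_i β_i ν^{α_i}. The symbols F, ν, N, n, β_i, and α_i and the normalization of the residues are assumptions of this example; conventions differ between papers, and which integer moments n are allowed depends on the crossing symmetry of the amplitude.

```latex
% Schematic finite-energy sum rule under the stated assumptions:
% the low-energy integral of the amplitude (left) is tied to the
% high-energy Regge parameters (right).
\int_{0}^{N} \nu^{n}\, \operatorname{Im} F(\nu)\, d\nu
  \;=\; \sum_{i} \frac{\beta_{i}\, N^{\alpha_{i}+n+1}}{\alpha_{i}+n+1}
```

Read from left to right, a relation of this kind lets low-energy data constrain the Regge parameters, which is the sense of the bootstrap described above.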
We propose a novel clustering method that is based on physical intuition derived from quantum mechanics. Starting with given data points, we construct a scale-space probability function. Viewing the latter as the lowest eigenstate of a Schrödinger equation, we use simple analytic operations to derive a potential function whose minima determine cluster centers. The method has one parameter, determining the scale over which cluster structures are searched. We demonstrate it on data analyzed in two dimensions (chosen from the eigenvectors of the correlation matrix). The method is applicable in higher dimensions by limiting the evaluation of the Schrödinger potential to the locations of data points.
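The construction described above can be sketched in a few lines. The abstract does not state the formulas, so this sketch assumes a Gaussian scale-space function ψ(x) = Σ_i exp(-|x - x_i|² / 2σ²) and solves the Schrödinger equation (-(σ²/2)∇² + V)ψ = Eψ for V, which gives V(x) - E = -d/2 + Σ_i |x - x_i|² exp(-|x - x_i|² / 2σ²) / (2σ² ψ(x)). The function names, the `radius` parameter, the neighbourhood rule for picking candidate centers, and fixing the offset so that the lowest value over the data points is zero are all assumptions of this example, not the authors' implementation.

```python
import math


def _potential_minus_energy(x, points, sigma):
    """Return V(x) - E for the assumed Gaussian wave function psi."""
    two_sigma_sq = 2.0 * sigma ** 2
    psi = 0.0
    weighted = 0.0
    for p in points:
        sq = sum((a - b) ** 2 for a, b in zip(x, p))
        g = math.exp(-sq / two_sigma_sq)
        psi += g
        weighted += sq * g
    # psi >= 1 whenever x is itself a data point, so no division by zero.
    return -len(x) / 2.0 + weighted / (two_sigma_sq * psi)


def potential_at_points(points, sigma):
    """Evaluate the potential at the data points only, shifted so min is 0."""
    raw = [_potential_minus_energy(p, points, sigma) for p in points]
    lowest = min(raw)
    return [v - lowest for v in raw]


def candidate_centers(points, sigma, radius):
    """Indices of points whose potential is lower than all neighbours'.

    Hypothetical selection rule: a point with no neighbours inside
    `radius` is also reported as a candidate.
    """
    v = potential_at_points(points, sigma)
    centers = []
    for i, p in enumerate(points):
        neighbours = [
            j for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        ]
        if all(v[i] < v[j] for j in neighbours):
            centers.append(i)
    return centers
```

Here `sigma` plays the role of the single scale parameter mentioned above. Evaluating the potential only at the data points keeps the cost quadratic in the number of points regardless of dimension.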