The starting point of this article is the question "How to retrieve fingerprints of rhythm in written texts?" We address this problem in the case of Brazilian and European Portuguese. These two dialects of Modern Portuguese share the same lexicon and most of the sentences they produce are superficially identical. Yet they are conjectured, on linguistic grounds, to implement different rhythms. We show that this linguistic question can be formulated as a problem of model selection in the class of variable length Markov chains. To carry on this approach, we compare texts from European and Brazilian Portuguese. These texts are previously encoded according to some basic rhythmic features of the sentences which can be automatically retrieved. This is an entirely new approach from the linguistic point of view. Our statistical contribution is the introduction of the smallest maximizer criterion which is a constant free procedure for model selection. As a by-product, this provides a solution for the problem of optimal choice of the penalty constant when using the BIC to select a variable length Markov chain. Besides proving the consistency of the smallest maximizer criterion when the sample size diverges, we also make a simulation study comparing our approach with both the standard BIC selection and the Peres-Shields order estimation. Applied to the linguistic sample constituted for our case study, the smallest maximizer criterion assigns different context-tree models to the two dialects of Portuguese. The features of the selected models are compatible with current conjectures discussed in the linguistic literature.
Abstract:The Partition Markov Model characterizes the process by a partition L of the state space, where the elements in each part of L share the same transition probability to an arbitrary element in the alphabet. This model aims to answer the following questions: what is the minimal number of parameters needed to specify a Markov chain and how to estimate these parameters. In order to answer these questions, we build a consistent strategy for model selection which consist of: giving a size n realization of the process, finding a model within the Partition Markov class, with a minimal number of parts to represent the process law. From the strategy, we derive a measure that establishes a metric in the state space. In addition, we show that if the law of the process is Markovian, then, eventually, when n goes to infinity, L will be retrieved. We show an application to model internet navigation patterns.
In this paper, we propose a procedure of selecting samples from a set of samples coming from Markovian processes of finite order and finite alphabet. Under the assumption of the existence of a law that prevails in at least q% of the samples of the collection, we show that the procedure allows to identify samples governed by the predominant law. The approach is based on a local metric between samples, which tends to zero when we compare samples of identical law and tends to infinity when comparing samples with different laws. The local metric allows to define a criterion which takes arbitrarily large values when the previous assumption about the existence of a predominant law does not hold. By means of this procedure, we map similarities and dissimilarities of some Brazilian stocks' daily trading volume dynamic.
In this paper, we analyze the model proposed in García and Londoño1 in which a set of p‐independent sequences of discrete time Markov chains is considered, over a finite alphabet A and with finite order o. The model is obtained identifying the states on the state space Ao where two or more sequences share the same transition probabilities (see also García and González‐López2). This identification establishes a partition on {1,…,p}×Ao, the set of sequences, and the state space. We show that by means of the Bayesian information criterion (BIC), the partition can be estimated eventually almost surely. Also, in García and Londoño,1 it is given a notion of divergence, derived from the BIC, which serves to identify the proximity/discrepancy between elements of {1,…,p}×Ao (see also García et al3). In the present article, we prove that this notion is a metric in the space where the model is built and that it is statistically consistent to determine proximity/discrepancy between the elements of the space {1,…,p}×Ao. We apply the notions discussed here for the construction of a parsimonious model that represents the common stochastic structure of 153 complete genomic Zika sequences, coming from tropical and subtropical regions.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.