In this paper, we analyze the model proposed in García and Londoño [1], which considers a set of p independent discrete-time Markov chains over a finite alphabet A and with finite order o. The model is obtained by identifying the states of the state space A^o on which two or more sequences share the same transition probabilities (see also García and González-López [2]). This identification establishes a partition of {1,…,p}×A^o, the Cartesian product of the set of sequence indices and the state space. We show that the partition can be estimated eventually almost surely by means of the Bayesian information criterion (BIC). García and Londoño [1] also give a notion of divergence, derived from the BIC, which serves to identify the proximity/discrepancy between elements of {1,…,p}×A^o (see also García et al. [3]). In the present article, we prove that this notion is a metric on the space where the model is built and that it is statistically consistent for determining proximity/discrepancy between the elements of {1,…,p}×A^o. We apply these notions to construct a parsimonious model that represents the common stochastic structure of 153 complete genomic Zika sequences from tropical and subtropical regions.
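As a rough illustration of how a BIC comparison can decide whether two elements of {1,…,p}×A^o belong to the same part, the sketch below uses the generic form of the criterion (maximized log-likelihood minus half the number of free parameters times log n). It is not the exact criterion or divergence of the cited papers. The function names, the use of next-symbol counts as input, and the choice of n as the total sample size are all assumptions made for the example.

```python
import math
from collections import Counter

def max_log_likelihood(symbol_counts):
    """Maximized multinomial log-likelihood of one part, given its next-symbol counts."""
    total = sum(symbol_counts.values())
    return sum(c * math.log(c / total) for c in symbol_counts.values() if c > 0)

def bic_prefers_merge(counts_a, counts_b, alphabet_size, n):
    """Return True if pooling two parts does not lower the generic BIC.

    counts_a, counts_b: next-symbol counts observed after each (sequence, state) pair.
    alphabet_size: |A|; merging two parts removes |A| - 1 free parameters.
    n: sample size used in the penalty (assumed here to be the total sample size).
    """
    pooled = Counter(counts_a) + Counter(counts_b)
    # Log-likelihood lost by forcing both parts to share one transition law.
    loss = (max_log_likelihood(counts_a)
            + max_log_likelihood(counts_b)
            - max_log_likelihood(pooled))
    # Penalty saved by estimating one transition law instead of two.
    penalty_saved = (alphabet_size - 1) / 2 * math.log(n)
    return loss <= penalty_saved

# Hypothetical counts: two pairs with similar laws, and one with a clearly different law.
similar = bic_prefers_merge({"a": 50, "b": 50}, {"a": 48, "b": 52}, alphabet_size=2, n=200)
different = bic_prefers_merge({"a": 50, "b": 50}, {"a": 90, "b": 10}, alphabet_size=2, n=200)
```

With these hypothetical counts, the first pair is pooled and the second is kept apart, which matches the intuition that parts are merged only when the data do not justify separate transition probabilities.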
In this paper, we classify by representativeness the elements of a set of complete genomic sequences of Dengue Virus Type 1 (DENV-1) from the 2014 outbreak in Japan. The set comes from four regions: Chiba, Hyogo, Shizuoka, and Tokyo. We treat this set as composed of independent samples from Markovian processes of finite order over a finite alphabet. Under the assumption that one law prevails in at least 50% of the samples in the set, we identify the sequences governed by the predominant law (see [1, 2]). The classification rule is based on a local metric between samples, which tends to zero when the compared sequences follow the same law and tends to infinity when they follow different laws. We found that the order of representativeness, from highest to lowest and according to the origin of the sequences, is: Tokyo, Chiba, Hyogo, and Shizuoka. When comparing the Japanese sequences with their contemporaries from Asia, we find that the least representative sequence (from Shizuoka) falls in groups considerably far from the group that includes the sequences from the other regions of Japan. This offers evidence that the outbreak in Japan could have been produced by more than one type of DENV-1.
We build a profile of the Epstein-Barr virus (EBV) by means of genomic sequences obtained from patients with nasopharyngeal carcinoma (NPC). We consider a set of sequences from the freely available NCBI source, and we assume that this set is a collection of independent samples of stochastic processes related by an equivalence relation. Given a collection {(X_t^j)_{t∈ℤ}}_{j=1}^p of p independent discrete-time Markov processes with finite alphabet A and state space S, we say that the elements (i, s) and (j, r) of {1, 2,…, p} × S are equivalent if and only if they share the same transition probabilities for all the elements of the alphabet. The equivalence allows us to reduce the number of parameters to be estimated in the model without deleting states of S to achieve that reduction. Through the equivalence relation, we build the global profile for all the EBV sequences in NPC. This model allows us to represent the underlying common stochastic law of the set of sequences. The equivalence classes define an optimal partition of {1, 2,…, p} × S, and it is in relation to this partition that we define the profile of the set of genomic sequences.
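The equivalence relation on {1, 2,…, p} × S can be illustrated with a small sketch in which the transition probabilities are assumed to be known exactly. In practice they must be estimated from data, and that step is not shown here. The function name, the dictionary layout, and the toy probabilities are all hypothetical.

```python
from collections import defaultdict

def partition_by_transition_law(transition_probs, ndigits=12):
    """Group (sequence index, state) pairs whose transition probabilities coincide.

    transition_probs maps a pair (j, s) to a dict {symbol: P(symbol | s)} for sequence j.
    Probabilities are rounded to ndigits so that floating-point noise does not split a class.
    """
    classes = defaultdict(list)
    for pair, row in transition_probs.items():
        key = tuple(sorted((a, round(prob, ndigits)) for a, prob in row.items()))
        classes[key].append(pair)
    return sorted(sorted(members) for members in classes.values())

# Toy example: p = 2 sequences, alphabet {"a", "b"}, order 1, so S = {"a", "b"}.
probs = {
    (1, "a"): {"a": 0.5, "b": 0.5},
    (1, "b"): {"a": 0.2, "b": 0.8},
    (2, "a"): {"a": 0.5, "b": 0.5},
    (2, "b"): {"a": 0.9, "b": 0.1},
}
parts = partition_by_transition_law(probs)
# parts == [[(1, "a"), (2, "a")], [(1, "b")], [(2, "b")]]
```

In this toy case the four pairs collapse into three classes, so three transition laws are estimated instead of four, and no state of S is removed.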
Keywords: Measure of association. Multivariate test. Nonparametric test of independence. Empirical distribution function. Visible and hidden nonlinear dependence. Random vector.