Identifying exceptional motifs is often used for extracting information from long DNA sequences. The two difficulties of the method are the choice of the model that defines the expected frequencies of words and the approximation of the variance of the difference T(W) between the number of occurrences of a word W and its estimate. We consider here different Markov chain models, with either stationary or periodic transition probabilities. We estimate the variance of the difference T(W) by the conditional variance of the number of occurrences of W given the oligonucleotide counts that define the model. Two applications show how to use asymptotically standard normal statistics associated with the counts to describe a given sequence in terms of its outlying words. Sequences of Escherichia coli and of Bacillus subtilis are compared with respect to their exceptional tri- and tetranucleotides. For both bacteria, exceptional 3-words are mainly found in the coding frame. E. coli palindrome counts are analyzed in different models, showing that many overabundant words are one-letter mutations of avoided palindromes.
SUMMARY
Considering a Markov chain model for deoxyribonucleic acid sequences, this paper proposes two asymptotically normal statistics to test whether the frequency of a given word is concordant with the first-order Markov chain model. The problem is to choose an estimate μ̂(W) of the expectation of the frequency M_W of a word W in the observed sequence such that the asymptotic variance of M_W − μ̂(W) is easily computable. The first estimator is derived from the frequency of W[-1], which is W with its last letter deleted. The second, following an idea of Cowan, is the conditional expectation of M_W given the observed frequencies of all two-letter words. Two examples on phage lambda and phage T7 are shown.
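To make the first estimator concrete, the following Python sketch shows one plausible reading of it: the count of W[-1] multiplied by the estimated first-order transition probability of the last letter of W given the letter before it. The exact form used in the paper, the function names `count_overlapping` and `expected_count_first_order`, and the choice to ignore the final sequence position when counting W[-1] are all assumptions made for illustration. The sketch does not compute the asymptotic variance, so it yields only the estimate μ̂(W), not the normalized test statistic.

```python
def count_overlapping(seq: str, word: str) -> int:
    """Count occurrences of `word` in `seq`, allowing overlaps."""
    if not word or len(word) > len(seq):
        return 0
    last_start = len(seq) - len(word)
    return sum(1 for i in range(last_start + 1) if seq.startswith(word, i))


def expected_count_first_order(seq: str, word: str) -> float:
    """Plug-in estimate of the count of `word` under a first-order Markov model.

    Assumed form (not taken verbatim from the paper):
        mu_hat(W) = N(W[-1]) * N(ab) / N(a)
    where W[-1] is W without its last letter, `a` is the penultimate
    letter of W and `b` is its last letter.
    """
    if len(word) < 2:
        raise ValueError("word must contain at least two letters")
    prefix = word[:-1]        # W[-1]: W with its last letter deleted
    last_pair = word[-2:]     # the two-letter word "ab"
    penultimate = word[-2]    # the letter "a"
    # Only positions that are followed by another letter can be extended,
    # so the last position of the sequence is excluded from these counts.
    body = seq[:-1]
    n_penultimate = count_overlapping(body, penultimate)
    if n_penultimate == 0:
        return 0.0
    n_prefix = count_overlapping(body, prefix)
    n_pair = count_overlapping(seq, last_pair)
    return n_prefix * n_pair / n_penultimate


if __name__ == "__main__":
    sequence = "ACGACGACG"  # toy sequence, not real phage data
    observed = count_overlapping(sequence, "ACG")
    expected = expected_count_first_order(sequence, "ACG")
    print(observed, expected)
```

For a two-letter word this estimate coincides with the observed count, which is expected because two-letter counts are exactly what the first-order model is fitted to; only words of length three or more can appear exceptional under it.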
Science policy is increasingly shifting towards an emphasis on societal problems or grand challenges. As a result, new evaluative tools are needed to help assess not only the knowledge production side of research programmes or organisations, but also the articulation of research agendas with societal needs. In this paper, we present an exploratory investigation of science supply and societal needs on the grand challenge of obesity, an emerging health problem with enormous social costs. We illustrate a potential approach that uses topic modelling to explore: (a) how scientific publications can be used to describe existing priorities in science production; (b) how policy records (in this case, questions posed in the European Parliament) can be used as an instance of mapping the discourse on social needs; (c) how the comparison between the two may show (mis)alignments between societal concerns and scientific outputs. While this is a technical exercise, we propose that this type of mapping method can be useful to domain experts for informing strategic planning and evaluation in funding agencies.