Minimum message length encoding is a technique of inductive inference with theoretical and practical advantages. It allows the posterior odds-ratio of two theories or hypotheses to be calculated. Here it is applied to problems of aligning or relating two strings, in particular two biological macromolecules. We compare the r-theory, that the strings are related, with the null-theory, that they are not related. If they are related, the probabilities of the various alignments can be calculated. This is done for one-, three-, and five-state models of relation or mutation. These correspond to linear and piecewise linear cost functions on runs of insertions and deletions. We describe how to estimate parameters of a model. The validity of a model is itself an hypothesis and can be objectively tested. This is done on real DNA strings and on artificial data. The tests on artificial data indicate limits on what can be inferred in various situations. The tests on real DNA support either the three- or five-state models over the one-state model. Finally, a fast, approximate minimum message length string comparison algorithm is described.
Motivation: Secondary structure underpins the folding pattern and architecture of most proteins. Accurate assignment of the secondary structure elements is therefore an important problem. Although many approximate solutions of the secondary structure assignment problem exist, the statement of the problem has resisted a consistent and mathematically rigorous definition. A variety of comparative studies have highlighted major disagreements in the way the available methods define and assign secondary structure to coordinate data.Results: We report a new method to infer secondary structure based on the Bayesian method of minimum message length inference. It treats assignments of secondary structure as hypotheses that explain the given coordinate data. The method seeks to maximize the joint probability of a hypothesis and the data. There is a natural null hypothesis and any assignment that cannot better it is unacceptable. We developed a program based on this approach and compared it with popular programs, such as DSSP and among others. Our evaluation suggests that gives reliable assignments even on low-resolution structures.Availability:
http://www.csse.monash.edu.au/~karun/sstContact:
arun.konagurthu@monash.edu (or lloyd.allison@monash.edu)
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.