HIV attacks the immune system, gradually destroying CD4+ T cells. Without adequate treatment, carriers develop AIDS, the most severe stage of the infection, in which the patient can be afflicted by opportunistic diseases that inevitably lead to death. Fortunately, with the advent of highly active antiretroviral therapy (HAART), mortality among people with HIV is decreasing. However, mutations can occur in the genotype of the virus, generating drug-resistant phenotypes. Computational methods have been used to predict whether a given strain is drug-resistant, and to which drugs, thereby increasing the chances of success of the prescribed treatment regimen. However, these methods are not always accurate. In this context, by applying Feature Selection methods and estimating Decision Tree models, we investigated patterns in Protease and Reverse Transcriptase enzyme sequences, as well as in patients' clinical data, that can lead to correct or incorrect computational prediction. As a result, we identified 21 features that are highly informative, 11 that tend to lead the methods to error, and eight that exhibit both behaviors: they can predict the patient's response to therapy but may also lead the prediction methods to fail.
Haplotype information plays a central role in understanding and diagnosing certain illnesses, and in evolutionary studies. Because this information is hard to obtain directly, computational methods that infer haplotypes from genotype data have received great attention from the computational biology community. Unfortunately, haplotype inference (HI) is a very hard computational problem, and existing methods can only partially identify correct solutions. I present neural network models that use different properties of the data to predict when a method is more prone to make errors. I construct models for three different HI approaches and show that these models are accurate and statistically relevant. The results of the experiments offer valuable insights into the performance of those methods, opening opportunities for combining strategies or improving individual approaches. I formally demonstrate that Linkage Disequilibrium (LD) and heterozygosity are very strong indicators of Switch Error tendency for four methods studied, and I delineate scenarios, based on LD measures, that reveal a higher or lower propensity of the HI methods to make inference errors, showing that the correlation between LD and the occurrence of errors varies among regions along the genotypes. I present evidence that considering windows of length 10 immediately to the left of a SNP (the upstream region) and eliminating non-informative SNPs through Fisher's Test leads to a more suitable correlation between LD and inference errors. I apply Multiple Linear Regression to explore the relevance of several biologically meaningful properties of the genotype sequences to the accuracy of the haplotype inference results, developing models for two databases (human data only) and using two error metrics. The accuracy of the results and the stability of the proposed models are supported by statistical evidence.
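To make the two central quantities concrete, the following is a minimal sketch of how switch errors and pairwise LD are commonly computed. It assumes biallelic SNPs encoded as '0'/'1' strings and uses the standard r² statistic as the LD measure; the abstract does not state which LD measure or encoding was used, and the function names (`switch_errors`, `ld_r2`, `upstream_ld`) are hypothetical. In particular, averaging r² over the upstream window is only one plausible reading of "windows of length 10".

```python
def switch_errors(true_h1, true_h2, inferred_h1):
    """Count switch errors of an inferred haplotype against the true pair.

    Returns (switches, opportunities), where opportunities is the number
    of consecutive pairs of heterozygous sites.
    """
    # Only heterozygous sites carry phase information.
    het = [i for i, (a, b) in enumerate(zip(true_h1, true_h2)) if a != b]
    # True where the inferred haplotype agrees with true haplotype 1.
    phase = [inferred_h1[i] == true_h1[i] for i in het]
    # A switch is a change of agreement between consecutive het sites.
    switches = sum(1 for p, q in zip(phase, phase[1:]) if p != q)
    return switches, max(len(het) - 1, 0)


def ld_r2(haplotypes, i, j):
    """Pairwise LD (r^2) between sites i and j over a list of haplotypes.

    Returns None when either site is monomorphic (r^2 is undefined).
    """
    n = len(haplotypes)
    p_a = sum(h[i] == "1" for h in haplotypes) / n
    p_b = sum(h[j] == "1" for h in haplotypes) / n
    p_ab = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium D
    return d * d / denom


def upstream_ld(haplotypes, site, window=10):
    """Mean r^2 between `site` and up to `window` SNPs immediately to its left.

    Assumption: averaging is an illustrative choice, not the source's method.
    Returns None when no upstream pair has a defined r^2.
    """
    values = [ld_r2(haplotypes, k, site) for k in range(max(0, site - window), site)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None
```

With per-SNP values such as those returned by `upstream_ld` on one side and switch-error indicators on the other, the correlation between LD and inference errors can then be examined region by region.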