Bayesian networks are probabilistic graphical models with a wide
range of application areas, including gene regulatory network inference, risk
analysis and image processing. Learning the structure of a Bayesian
network (BNSL) from discrete data is known to be an NP-hard task with
a superexponential search space of directed acyclic graphs. In this
work, we propose a new polynomial time algorithm for discovering a
subset of all possible cluster cuts, a greedy algorithm for
approximately solving the resulting linear program, and a generalized
arc consistency algorithm for the acyclicity constraint. We embed
these in the constraint programming-based branch-and-bound solver
CPBayes and show that, despite being suboptimal, they improve
performance by orders of magnitude. The resulting solver also compares
favorably with GOBNILP, a state-of-the-art BNSL solver that solves an
NP-hard problem to discover each cut and solves the linear program
exactly.
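To give a sense of how quickly the search space of directed acyclic graphs grows, the following minimal sketch counts labeled DAGs on n nodes using Robinson's recurrence. This is a standard combinatorial result and is not part of the proposed solver; the function name `count_dags` is chosen here for illustration only.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes (Robinson's recurrence).

    a(0) = 1
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k)
    """
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


if __name__ == "__main__":
    # Illustrative only: print the size of the search space for small n.
    for n in range(1, 6):
        print(n, count_dags(n))
```

For n = 1 to 5 this gives 1, 3, 25, 543 and 29281 candidate structures, which illustrates why exhaustive enumeration is infeasible beyond a handful of variables.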
Motivation
Inferring gene regulatory networks in panels of non-independent, genetically related individuals is a methodological challenge. This hampers evolutionary and biological studies that use heterozygous individuals, such as those in wild sunflower populations or cultivated hybrids.
Results
First, we simulated 100 datasets of gene expressions and polymorphisms, displaying the same gene expression distributions, heterozygosities and heritabilities as in our dataset of 173 genes and 353 genotypes measured in sunflower hybrids. Second, we performed a meta-analysis based on six inference methods (Lasso, Random Forests, Bayesian Networks, Markov Random Fields, Ordinary Least Squares and Findr) and selected the minimal-density networks for better accuracy; these had, on average, 64 edges connecting 79 genes and an AUPR score of 0.35. We found that triangles and mutual edges are prone to errors in the inferred networks. Applied to classical datasets without heterozygotes, our strategy produced an AUPR score of 0.65 for one dataset of the DREAM5 Systems Genetics Challenge. Finally, we applied our method to an experimental dataset from sunflower hybrids. We successfully inferred a network composed of 105 genes connected by 106 putative regulations with a major connected component.
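The AUPR scores reported above summarize how well a ranked list of predicted edges recovers the true regulations. As a minimal sketch, the area under the precision-recall curve can be estimated by average precision, as below. The exact AUPR procedure used in the study is not stated here, so this estimator, the function name `average_precision` and the toy gene names are assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (regulator, target); hypothetical representation


def average_precision(scored_edges: Iterable[Tuple[Edge, float]],
                      true_edges: Set[Edge]) -> float:
    """Estimate AUPR as average precision over a ranked edge list.

    Edges are ranked by decreasing score; precision is recorded at the rank
    of every true edge and averaged over the number of true edges. True edges
    that are never predicted contribute zero.
    """
    if not true_edges:
        return 0.0
    ranked = sorted(scored_edges, key=lambda item: item[1], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (edge, _score) in enumerate(ranked, start=1):
        if edge in true_edges:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(true_edges)


if __name__ == "__main__":
    # Toy example with made-up scores, not data from the study.
    predictions = [
        (("g1", "g2"), 0.9),
        (("g2", "g3"), 0.8),
        (("g1", "g3"), 0.4),
        (("g3", "g4"), 0.1),
    ]
    truth = {("g1", "g2"), ("g1", "g3")}
    print(round(average_precision(predictions, truth), 4))
```

In the toy example the two true edges are ranked first and third, so the estimate is (1/1 + 2/3) / 2, or about 0.8333.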
Availability
Our inference methodology dedicated to genomic and transcriptomic data is available at https://forgemia.inra.fr/sunrise/inference_methods.
Supplementary information
The data are available in Data INRAE at https://doi.org/10.15454/vrgwz2 (simulated datasets and the output of the meta-analysis) and https://doi.org/10.15454/HESVA0 (experimental sunflower dataset). Complete descriptions of the inference methods used by the meta-analysis and of the gene selection procedure related to drought and heterosis are available online.