Abstract. With the tremendous increase in the amount of biological literature, developing automated methods for extracting big data from papers, building models and explaining big mechanisms becomes a necessity. We describe here our approach to translating machine reading outputs, obtained by reading biological signaling literature, to discrete models of cellular networks. We use outputs from three different reading engines, and describe our approach to translating their different features, using examples from reading cancer literature. We also outline several issues that still arise when assembling cellular network models from state-of-the-art reading engines. Finally, we illustrate the details of our approach with a case study in pancreatic cancer.
Introduction
Biological knowledge is voluminous; it is nearly impossible to read all scientific papers on a single topic such as cancer. When building a model of a particular biological system, for example the cancer microenvironment, researchers usually start by searching for existing relevant models and by looking for information about system components and their interactions in published literature. Although there have been attempts to automate the process of model building [1, 2], modelers most often conduct these steps manually, with multiple iterations between (i) information extraction, (ii) model assembly, (iii) model analysis, and (iv) model validation through comparison with the most recently published results. To rapidly model the complexity of diseases like cancer, and to efficiently use the ever-increasing amount of information in published work, we need representation standards and interfaces that allow these tasks to be automated. This, in turn, will allow researchers to ask informed, interesting questions that can improve our understanding of health and disease.
The systems biology community has designed and proposed a standardized language for representing biological models, the Systems Biology Markup Language (SBML), which allows different software tools to be used without recreating models for each tool, and allows models to be shared between research groups [3]. However, the SBML standard is not easily understood by biologists who create mechanistic models. Therefore, software tools have been developed to provide biologists with an interface that allows them to focus on the modeling tasks by hiding the details of the SBML language [4][5][6][7].
To this end, the contributions of the work presented in this paper include:
• A representation format that is straightforward to use by both machines and humans, and allows for efficient synthesis of models from big data in literature.
• An approach to effectively use state-of-the-art machine reading output to create executable discrete models of cellular signaling.
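To make the second contribution concrete, the following minimal Python sketch shows one way extracted interactions could be assembled into an executable discrete (here Boolean) model. It is an illustration under stated assumptions, not the representation format or translation procedure of this paper: the record layout `(regulator, target, kind)`, the element names `A` to `D`, the functions `assemble_model` and `step`, and the update rule (an element is ON when at least one activator is ON and no inhibitor is ON) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical reading-engine output: (regulator, regulated element, interaction type).
# The element names and the record layout are illustrative assumptions.
extracted_events = [
    ("A", "B", "activates"),
    ("C", "B", "inhibits"),
    ("B", "D", "activates"),
]


def assemble_model(events):
    """Group extracted events into per-element lists of positive and negative regulators."""
    model = defaultdict(lambda: {"pos": [], "neg": []})
    for regulator, target, kind in events:
        if kind == "activates":
            model[target]["pos"].append(regulator)
        elif kind == "inhibits":
            model[target]["neg"].append(regulator)
        else:
            raise ValueError(f"unknown interaction type: {kind}")
    return dict(model)


def step(model, state):
    """One synchronous update of all regulated elements.

    Assumed rule: an element is ON if any activator is ON and no inhibitor is ON.
    Elements without regulators keep their current value.
    """
    new_state = dict(state)
    for target, regs in model.items():
        activated = any(state[r] for r in regs["pos"])
        inhibited = any(state[r] for r in regs["neg"])
        new_state[target] = activated and not inhibited
    return new_state


model = assemble_model(extracted_events)
state = {"A": True, "B": False, "C": False, "D": False}
state = step(model, state)  # B switches ON; D follows on the next step
```

A synchronous Boolean update is used here only because it is the simplest executable discrete scheme; multi-valued elements or other update orders would fit the same structure.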