Measures of similarity between diseases have been used for applications from discovering drug-target interactions to identifying disease-gene relationships. It is challenging to quantitatively compare diseases because much of what we know about them is captured in free text descriptions. Here we present an application of Latent Dirichlet Allocation as a way to measure similarity between diseases using textual descriptions. We learn latent topic representations of text from Online Mendelian Inheritance in Man records and use them to compute similarity. We assess the performance of this approach by comparing our results to manually curated relationships from the Disease Ontology. Despite being unsupervised, our model recovers a record's curated Disease Ontology relations with a mean Receiver Operating Characteristic Area Under the Curve of 0.80. With low dimensional models, topics tend to represent higher level information about affected organ systems, while higher dimensional models capture more granular genetic and phenotypic information. We examine topic representations of diseases for mapping concepts between ontologies and for tagging existing text with concepts. We conclude topic modeling on disease text leads to a robust approach to computing similarity that does not depend on keywords or ontology.

Keywords
Latent Dirichlet Allocation; Disease Similarity; Topic Modeling; Ontology; Text Mining; Online Mendelian Inheritance in Man

Introduction
Measures of disease similarity have been used in drug repositioning [1], drug target selection [2], and understanding disease etiology [3]. These measurements seek to make maximal use of existing biomedical research by forming hypotheses based on the similarity between a well-studied or treatable disease and one that is less well-understood. One major limitation in turning this research into insight is that the majority of disease knowledge exists in the form of unstructured free text.
Since it is not easy to parse complex meaning from technical natural language, many ontologies have been constructed to classify and organize diseases. Some researchers have used these ontologies to measure semantic or functional disease similarity [4,5] using resources like Gene Ontology [6], HumanNet [7], and Disease Ontology [8]. These approaches are based on overlap between gene sets [4] or distance within the ontology's hierarchy [5]. However, when using an ontology to calculate similarity directly, there are a number of limitations. First, you are confined to the scope of the ontology. Most ontologies are structured in such a way that they only capture one aspect of a disease. For example, one ontology might organize diseases by affected organ system, another by disease basis, and a third by associated genes. ...

bioRxiv preprint doi: http://dx.doi.org/10.1101/030593. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
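The evaluation described in the abstract (score every other record by topic similarity to a query record, then ask how well those scores recover the curated relations) can be illustrated with a small sketch. This is not the authors' code. The excerpt does not say which similarity measure is used, so cosine similarity is an assumption here; the topic vectors, the record names, and the helper names `cosine_similarity`, `roc_auc`, and `record_auc` are all hypothetical, and the numbers are toy values rather than results from the paper.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two topic-proportion vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def roc_auc(scores, labels):
    """ROC AUC as the chance that a related record outranks an unrelated one.

    Ties between a related and an unrelated record count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one related and one unrelated record")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def record_auc(query, topics, related):
    """AUC for one query record: rank all other records by similarity to it."""
    others = [name for name in topics if name != query]
    scores = [cosine_similarity(topics[query], topics[name]) for name in others]
    labels = [name in related for name in others]
    return roc_auc(scores, labels)


# Toy, made-up topic proportions for four hypothetical records (3 topics each).
toy_topics = {
    "record_A": [0.7, 0.2, 0.1],
    "record_B": [0.6, 0.3, 0.1],
    "record_C": [0.1, 0.1, 0.8],
    "record_D": [0.2, 0.7, 0.1],
}

# Hypothetical curated relation: record_A is related to record_B only.
toy_auc = record_auc("record_A", toy_topics, {"record_B"})
print(f"AUC for record_A: {toy_auc:.2f}")
```

Averaging `record_auc` over all query records would give a mean AUC of the kind the abstract reports. In practice the topic vectors would come from a fitted Latent Dirichlet Allocation model rather than being written by hand.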