Measures of similarity between diseases have been used for applications from discovering drug-target interactions to identifying disease-gene relationships. It is challenging to quantitatively compare diseases because much of what we know about them is captured in free text descriptions. Here we present an application of Latent Dirichlet Allocation as a way to measure similarity between diseases using textual descriptions. We learn latent topic representations of text from Online Mendelian Inheritance in Man records and use them to compute similarity. We assess the performance of this approach by comparing our results to manually curated relationships from the Disease Ontology. Despite being unsupervised, our model recovers a record's curated Disease Ontology relations with a mean Receiver Operating Characteristic Area Under the Curve of 0.80. With low dimensional models, topics tend to represent higher level information about affected organ systems, while higher dimensional models capture more granular genetic and phenotypic information. We examine topic representations of diseases for mapping concepts between ontologies and for tagging existing text with concepts. We conclude topic modeling on disease text leads to a robust approach to computing similarity that does not depend on keywords or ontology.
Keywords
Latent Dirichlet Allocation; Disease Similarity; Topic Modeling; Ontology; Text Mining; Online Mendelian Inheritance in Man
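The evaluation described in the abstract (rank other records by topic-based similarity, then score the ranking against curated relations with ROC AUC) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say which similarity function is used, so cosine similarity is an assumption here, and the record names, topic vectors, and curated relations are all hypothetical toy values.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two topic-proportion vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def roc_auc(scores, labels):
    """ROC AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def auc_for_record(query, vectors, relations):
    """Score how well similarity to `query` recovers its curated relations."""
    others = [r for r in vectors if r != query]
    scores = [cosine_similarity(vectors[query], vectors[r]) for r in others]
    labels = [r in relations.get(query, set()) for r in others]
    return roc_auc(scores, labels)


# Hypothetical per-record topic proportions, as a topic model might produce.
topic_vectors = {
    "record_A": [0.70, 0.20, 0.10],
    "record_B": [0.60, 0.30, 0.10],
    "record_C": [0.10, 0.10, 0.80],
    "record_D": [0.20, 0.70, 0.10],
}

# Hypothetical curated relations: record_A is related to record_B only.
curated_relations = {"record_A": {"record_B"}}

print(auc_for_record("record_A", topic_vectors, curated_relations))
```

A mean AUC of the kind reported in the abstract would then be the average of `auc_for_record` over every record that has at least one curated relation.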
Introduction
Measures of disease similarity have been used in drug repositioning [1], drug target selection [2], and understanding disease etiology [3]. These measurements seek to make maximal use of existing biomedical research by forming hypotheses based on the similarity between a well-studied or treatable disease and one that is less well understood. One major limitation in turning this research into insight is that the majority of disease knowledge exists in the form of unstructured free text. Since it is not easy to parse complex meaning from technical natural language, many ontologies have been constructed to classify and organize diseases.

Some researchers have used these ontologies to measure semantic or functional disease similarity [4,5] using resources like Gene Ontology [6], HumanNet [7], and Disease Ontology [8]. These approaches are based on overlap between gene sets [4] or distance within the ontology's hierarchy [5]. However, using an ontology to calculate similarity directly has a number of limitations. First, the comparison is confined to the scope of the ontology. Most ontologies are structured in such a way that they capture only one aspect of a disease. For example, one ontology might organize diseases by affected organ system, another by disease basis, and a third by associated genes.

bioRxiv doi: http://dx.doi.org/10.1101/030593. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license. ...
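The two families of ontology-based measures mentioned in the Introduction, overlap between gene sets and distance within a hierarchy, can be illustrated with a small sketch. The exact formulations used in the cited works [4,5] are not given in this text, so Jaccard overlap and shortest is-a path length are assumed stand-ins. The gene identifiers and hierarchy terms below are toy placeholders, not taken from any real ontology, and the hierarchy is assumed to give each term a single parent.

```python
def jaccard(genes_a, genes_b):
    """Gene-set overlap: size of the intersection divided by size of the union."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def steps_to_ancestors(term, parent_of):
    """Map each ancestor of `term` (including itself) to its distance in is-a steps."""
    steps = {}
    depth = 0
    while term is not None and term not in steps:
        steps[term] = depth
        term = parent_of.get(term)
        depth += 1
    return steps


def path_distance(term_a, term_b, parent_of):
    """Number of edges on the path between two terms via their closest shared ancestor.

    Returns None when the terms share no ancestor.
    """
    up_a = steps_to_ancestors(term_a, parent_of)
    up_b = steps_to_ancestors(term_b, parent_of)
    shared = set(up_a) & set(up_b)
    if not shared:
        return None
    return min(up_a[t] + up_b[t] for t in shared)


# Toy single-parent hierarchy (hypothetical terms): child -> parent.
parent_of = {
    "root": None,
    "system_1": "root",
    "system_2": "root",
    "disease_x": "system_1",
    "disease_y": "system_1",
    "disease_z": "system_2",
}

print(jaccard({"GENE1", "GENE2", "GENE3"}, {"GENE2", "GENE3", "GENE4"}))
print(path_distance("disease_x", "disease_y", parent_of))
print(path_distance("disease_x", "disease_z", parent_of))
```

Both measures see only what the chosen resource encodes: two diseases with no annotated genes in common, or placed in distant branches of one hierarchy, score as dissimilar regardless of what their text descriptions say.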