Objective
As COVID-19 emerged rapidly and grew into an unprecedented pandemic, a knowledge repository for the disease became crucial. To address this need, a new machine-readable COVID-19 dataset, the COVID-19 Open Research Dataset (CORD-19), was released. Building on this dataset, our objective was to construct computable co-occurrence network embeddings to assist association detection among COVID-19-related biomedical entities.
Materials and Methods
Leveraging a Linked Data version of CORD-19 (i.e., CORD-19-on-FHIR), we first used SPARQL to extract co-occurrences among chemicals, diseases, genes, and mutations and to build a co-occurrence network. We then learned representations of the derived co-occurrence network using node2vec with four edge embedding operations (L1, L2, Average, and Hadamard). Six algorithms (Decision Tree, Logistic Regression, Support Vector Machine, Random Forest, Naive Bayes, and Multi-layer Perceptron) were applied to evaluate performance on link prediction. An unsupervised learning strategy incorporating the t-SNE and DBSCAN algorithms was also developed for case studies.
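As a rough illustration of the two data transformations described above, the sketch below counts entity co-occurrences per document and combines two node vectors into an edge vector. It is a minimal sketch under stated assumptions, not the study's implementation: the function names, the toy documents, and the two-dimensional node vectors are hypothetical, and the L1 and L2 operators are assumed to follow the Weighted-L1 and Weighted-L2 definitions from the node2vec paper. In the study, co-occurrences came from SPARQL queries over CORD-19-on-FHIR and node vectors came from node2vec.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count how often each unordered entity pair appears in the same document."""
    counts = Counter()
    for entities in docs:
        # Deduplicate and sort so that (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts


def edge_embedding(u, v, op):
    """Combine two node vectors element-wise into a single edge vector."""
    if op == "average":
        return [(x + y) / 2 for x, y in zip(u, v)]
    if op == "hadamard":
        return [x * y for x, y in zip(u, v)]
    if op == "l1":  # assumed: Weighted-L1, |u_i - v_i|
        return [abs(x - y) for x, y in zip(u, v)]
    if op == "l2":  # assumed: Weighted-L2, (u_i - v_i)^2
        return [(x - y) ** 2 for x, y in zip(u, v)]
    raise ValueError(f"unknown operator: {op}")


# Hypothetical toy documents, each reduced to the entities it mentions.
docs = [
    ["COVID-19", "ACE2", "remdesivir"],
    ["COVID-19", "ACE2"],
    ["SARS", "ACE2"],
]
edges = cooccurrence_counts(docs)

# Made-up node vectors standing in for learned node2vec embeddings.
node_vectors = {"COVID-19": [1.0, 2.0], "ACE2": [3.0, -2.0]}
features = {
    op: edge_embedding(node_vectors["COVID-19"], node_vectors["ACE2"], op)
    for op in ("average", "hadamard", "l1", "l2")
}
```

Each edge vector in `features` could then serve as the input to a binary classifier that predicts whether a link exists between the two entities.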
Results
The Random Forest classifier showed the best performance on link prediction across the different network embeddings. For edge embeddings generated using the Average operation, Random Forest achieved the best average precision of 0.97 and F1 score of 0.90. For unsupervised learning, 63 clusters were formed with a silhouette score of 0.128. Significant associations were detected for five coronavirus infectious diseases in their corresponding subgroups.
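To help interpret the reported silhouette score, the sketch below computes the mean silhouette coefficient from scratch, assuming Euclidean distance. For each point, a is the mean distance to the other members of its own cluster, b is the lowest mean distance to the members of any other cluster, and the coefficient is (b - a) / max(a, b). The function name and the toy points are hypothetical, and how the study treated DBSCAN noise points is not stated here, so none are modeled.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the given members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = mean_dist(i, own)
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical 2-D points, e.g. coordinates after a t-SNE projection.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = silhouette_score(points, [0, 0, 1, 1])
badly_assigned = silhouette_score(points, [0, 1, 0, 1])
```

Scores range from -1 to 1. Values near 1 indicate compact, well-separated clusters, while values near 0, such as the 0.128 reported here, suggest that neighboring clusters overlap.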
Conclusion
In this study, we constructed COVID-19-centered co-occurrence network embeddings. Results indicated that the generated embeddings were able to extract significant associations for COVID-19 and coronavirus infectious diseases.