Background
Reviewing the scientific literature is a time-consuming but fundamental step in any kind of scientific research. Papers must always be filtered manually and consistently to evaluate their relevance to the topic of interest, because the ranking provided by the most common search engines rarely matches the desired content.
Purpose
The aim of this study was to develop an automated tool for efficiently searching the medical scientific literature, using keywords relevant to the application of specific technologies in the field of cardiology, and to validate it against manual analysis.
Methods
The tool is multiplatform and implemented in Python with the PyQt5 library. The user enters a list of keywords, from which all possible search strings are built by connecting the keywords with logical operators. The algorithm automatically queries the online database PubMed (NCBI) and downloads all the resulting abstracts, with titles and keywords. Results related to the field of cardiology are identified by counting the occurrences of “marker” words collected in a dedicated dictionary, developed on the basis of the Unified Medical Language System (U.S. NLM). Then, a search-specific dictionary is automatically built from the statistical distribution of words in the abstracts, titles and keywords, with each word weighted by its relative frequency (the ratio between its occurrences and the number of papers considered). Finally, for each paper the occurrences of these “marker” words are counted and a matching-probability score is assigned, which yields a ranking of the results by expected match with the topic of interest, together with a threshold-based binary classification.
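As an illustration of the search-string step, the following minimal Python sketch builds one boolean query per keyword combination. The abstract does not specify how keywords are grouped or which operators are used, so the grouping (one term per group, joined with AND), the function name build_search_strings and the example keywords are all assumptions made for illustration only.

```python
from itertools import product


def build_search_strings(keyword_groups):
    """Return one boolean query per combination of one keyword from each group.

    Hypothetical helper: the combination rule (AND across groups) is an
    assumption, not the tool's documented behaviour.
    """
    queries = []
    for combo in product(*keyword_groups):
        # Quote multi-word terms so that they are searched as phrases.
        terms = [f'"{term}"' if " " in term else term for term in combo]
        queries.append(" AND ".join(terms))
    return queries


# Example keyword groups (illustrative only).
groups = [["cardiology", "heart failure"], ["virtual reality", "VR"]]
queries = build_search_strings(groups)
for query in queries:
    print(query)
```

Each resulting string could then be submitted to PubMed; the download step is omitted here because the abstract does not describe how the database is accessed.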
To validate the algorithm, three technologies with potential applications in cardiology were considered: smartphone applications (App), machine learning (ML) and virtual reality (VR). The related dictionaries were developed with the dedicated function embedded in the tool. The results were validated on a dataset of 461 manually classified abstracts, and the algorithm thresholds were iteratively adjusted on the basis of the validation results.
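The dictionary weighting and scoring described in the Methods can be sketched as follows. Only the weight definition (occurrences divided by number of papers) comes from the abstract; the tokenizer, the small stop-word list, the min_weight cut-off, the length-normalized score and the threshold value are assumptions chosen to keep the example self-contained, and the tool's actual scoring formula may differ.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real dictionary would need a fuller one.
STOPWORDS = {"a", "and", "for", "in", "of", "the"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_dictionary(texts, min_weight=0.5):
    """Weight each word by occurrences / number of papers; keep frequent words."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    n_papers = len(texts)
    return {w: c / n_papers for w, c in counts.items() if c / n_papers >= min_weight}


def score(text, dictionary):
    """Sum of marker-word weights divided by token count (assumed formula)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(dictionary.get(t, 0.0) for t in tokens) / len(tokens)


def classify(text, dictionary, threshold):
    """Threshold-based binary classification of a single paper."""
    return score(text, dictionary) >= threshold


# Illustrative texts standing in for downloaded titles and abstracts.
corpus = [
    "Virtual reality training for cardiac surgery",
    "Virtual reality in cardiac rehabilitation",
    "Immersive virtual environments for patients",
]
dictionary = build_dictionary(corpus)

candidates = ["Diabetes drug trial", "Virtual reality headset study"]
# Rank candidates by decreasing score, as in the sorted output of the tool.
ranked = sorted(candidates, key=lambda t: score(t, dictionary), reverse=True)
labels = [classify(t, dictionary, threshold=0.2) for t in ranked]
```

In this toy corpus only "virtual", "reality" and "cardiac" pass the cut-off, so the second candidate ranks first and is the only one classified as matching.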
Results
The algorithm applied to the validation dataset showed an overall accuracy (acc) of 88.5% (sensitivity (se) 85.78%, specificity (sp) 91.27%) in the identification of cardiology papers, while the results for the three inspected technologies were:
App: acc 90.89% (se 92.16%, sp 90.53%)
ML: acc 82.65% (se 94.06%, sp 79.44%)
VR: acc 91.54% (se 96%, sp 90.3%)
The algorithm can process 5000 abstracts in around 2 hours.
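For reference, accuracy, sensitivity and specificity follow the standard confusion-matrix definitions, which a short Python sketch makes explicit. The function name binary_metrics and the five example labels are illustrative only and are not taken from the validation dataset.

```python
def binary_metrics(predicted, actual):
    """Compute accuracy, sensitivity and specificity from boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Toy example: algorithm output versus manual classification.
predicted = [True, True, False, False, True]
actual = [True, False, False, False, True]
metrics = binary_metrics(predicted, actual)
print(metrics)
```

A threshold adjustment like the one described in the Methods could call such a function repeatedly, once per candidate threshold, and compare the resulting values.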
Conclusions
The validation results showed that the proposed approach is highly valuable for speeding up any search of the medical literature focused on a specific technology or application, enabling a quick overview of its diffusion and maturity in a specific scientific domain.
[Figure: Algorithm schema]
Funding Acknowledgement
Type of funding source: None