Background: Unsupervised extraction of knowledge from large, unstructured text corpora remains a challenge. Word embeddings from static language models such as Word2Vec have been used to discover "latent knowledge" within such domain-specific corpora. In these approaches, semantic-similarity measures between representations of concepts or entities were used to predict relationships, which were later verified using domain-specific scientific techniques. Static language models have recently been surpassed on most downstream tasks by pre-trained, contextual language models such as BERT, and it has been postulated that contextualized embeddings may yield word representations superior to static ones for knowledge discovery. To address this question, two biomedically trained BERT models (BioBERT and SciBERT) were used to encode n = 500, 1000 or 5000 sentences containing words of interest extracted from a biomedical corpus. The n contextual representations of each word of interest were then extracted and aggregated to yield static-equivalent word representations for the vocabularies of biomedical intrinsic benchmarking tools for verbs and nouns. These intrinsic benchmarks allow the feasibility of using contextualized word representations for knowledge discovery to be assessed: word representations that better encode the described reality are expected to perform better.
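To make the extraction-and-aggregation step concrete, the following is a minimal sketch assuming the HuggingFace transformers library and the publicly available dmis-lab/biobert-base-cased-v1.1 checkpoint; the exact checkpoints, layer choice, and pooling used in this work may differ. Subword vectors of the target word are averaged within each sentence, and the resulting contextual vectors are mean-pooled across the n sentences into a single static-equivalent embedding.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; SciBERT (allenai/scibert_scivocab_uncased) could be swapped in.
MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def static_equivalent_embedding(word, sentences, layer=-1):
    """Aggregate contextual embeddings of `word` across `sentences` into one vector."""
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    vectors = []
    for sentence in sentences:
        enc = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).hidden_states[layer][0]  # (seq_len, hidden_dim)
        ids = enc["input_ids"][0].tolist()
        # Locate the target word's subword span and average its subword vectors.
        for i in range(len(ids) - len(word_ids) + 1):
            if ids[i:i + len(word_ids)] == word_ids:
                vectors.append(hidden[i:i + len(word_ids)].mean(dim=0))
                break
    # Mean-pool the per-sentence contextual vectors into a static-equivalent embedding.
    return torch.stack(vectors).mean(dim=0) if vectors else None
```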
Results: The number of contextual examples used for aggregation had little effect on performance; however, embeddings aggregated from shorter sequences outperformed those aggregated from longer ones. Performance also varied by model: BioBERT embeddings outperformed static embeddings for verbs, while SciBERT embeddings outperformed static embeddings for nouns; neither model outperformed static models for both nouns and verbs. Moreover, performance varied according to the model layer from which embeddings were extracted and according to whether a word was present in a particular model's vocabulary or required subword decomposition.
Conclusions: These results suggest that static-equivalent embeddings obtained from contextual models may be superior to those obtained from static models. Moreover, because n has little effect on embedding performance, a computationally efficient method is described for sampling a corpus for contextual examples and leveraging BERT's architecture to obtain word embeddings suitable for knowledge discovery tasks.
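As an illustration of the sampling step, the sketch below draws up to n sentences containing a target word from a pre-split corpus; the function and variable names are hypothetical, not those of the described method. Because shorter sequences performed better, candidates can optionally be sorted by length before truncating to n.

```python
import random
import re

def sample_contextual_sentences(corpus_sentences, word, n=500, prefer_short=True, seed=0):
    """Draw up to n sentences containing `word` from an iterable of corpus sentences."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    matches = [s for s in corpus_sentences if pattern.search(s)]
    random.Random(seed).shuffle(matches)
    if prefer_short:
        # Shorter contexts yielded better aggregated embeddings in these experiments.
        matches.sort(key=len)
    return matches[:n]

# Usage, combined with static_equivalent_embedding() from the earlier sketch:
# sentences = sample_contextual_sentences(corpus, "aspirin", n=500)
# embedding = static_equivalent_embedding("aspirin", sentences, layer=-1)
```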