Background: Unsupervised extraction of knowledge from large, unstructured text corpora remains a challenge. Word embeddings from static language models such as Word2Vec have been used to discover "latent knowledge" within such domain-specific corpora. In these approaches, semantic-similarity measures between representations of concepts or entities were used to predict relationships, which were later verified using domain-specific scientific techniques. Static language models have recently been surpassed on most downstream tasks by pre-trained, contextual language models such as BERT, and it has been postulated that contextualized embeddings may yield word representations superior to static ones for knowledge discovery. To address this question, two biomedically trained BERT models (BioBERT and SciBERT) were used to encode n = 500, 1000 or 5000 sentences containing words of interest extracted from a biomedical corpus. The n contextual representations of each word of interest were then extracted and aggregated to yield static-equivalent word representations for the vocabularies of biomedical intrinsic benchmarking tools for verbs and nouns. These intrinsic benchmarks allow the feasibility of using contextualized word representations for knowledge discovery to be assessed: word representations that better encode the described reality are expected to perform better.
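To make the extraction-and-aggregation step concrete, the following is a minimal sketch assuming the HuggingFace transformers library and the publicly available dmis-lab/biobert-base-cased-v1.1 checkpoint; the exact checkpoints, layer choice, and pooling used in this work may differ. Subword vectors of the target word are averaged within each sentence, and the resulting contextual vectors are mean-pooled across the n sentences into a single static-equivalent embedding.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; SciBERT (allenai/scibert_scivocab_uncased) could be swapped in.
MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def static_equivalent_embedding(word, sentences, layer=-1):
    """Aggregate contextual embeddings of `word` across `sentences` into one vector."""
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    vectors = []
    for sentence in sentences:
        enc = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).hidden_states[layer][0]  # (seq_len, hidden_dim)
        ids = enc["input_ids"][0].tolist()
        # Locate the target word's subword span and average its subword vectors.
        for i in range(len(ids) - len(word_ids) + 1):
            if ids[i:i + len(word_ids)] == word_ids:
                vectors.append(hidden[i:i + len(word_ids)].mean(dim=0))
                break
    # Mean-pool the per-sentence contextual vectors into a static-equivalent embedding.
    return torch.stack(vectors).mean(dim=0) if vectors else None
```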
Results: The number of contextual examples used for aggregation had little effect on performance; however, embeddings aggregated from shorter sequences outperformed those aggregated from longer ones. Performance also varied by model: BioBERT embeddings outperformed static embeddings for verbs, while SciBERT embeddings outperformed static embeddings for nouns; neither model outperformed static models for both nouns and verbs. Moreover, performance varied according to the model layer from which embeddings were extracted and according to whether a word was present in a particular model's vocabulary or required subword decomposition.
Conclusions: These results suggest that static-equivalent embeddings obtained from contextual models may be superior to those obtained from static models. Moreover, because n has little effect on embedding performance, a computationally efficient method is described for sampling a corpus for contextual examples and leveraging BERT's architecture to obtain word embeddings suitable for knowledge discovery tasks.
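As an illustration of the sampling step, the sketch below draws up to n sentences containing a target word from a pre-split corpus; the function and variable names are hypothetical, not those of the described method. Because shorter sequences performed better, candidates can optionally be sorted by length before truncating to n.

```python
import random
import re

def sample_contextual_sentences(corpus_sentences, word, n=500, prefer_short=True, seed=0):
    """Draw up to n sentences containing `word` from an iterable of corpus sentences."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    matches = [s for s in corpus_sentences if pattern.search(s)]
    random.Random(seed).shuffle(matches)
    if prefer_short:
        # Shorter contexts yielded better aggregated embeddings in these experiments.
        matches.sort(key=len)
    return matches[:n]

# Usage, combined with static_equivalent_embedding() from the earlier sketch:
# sentences = sample_contextual_sentences(corpus, "aspirin", n=500)
# embedding = static_equivalent_embedding("aspirin", sentences, layer=-1)
```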