2022
DOI: 10.1007/s00521-022-08066-8

Contextual word embeddings for tabular data search and integration

Abstract: This paper presents a new approach to retrieve and further integrate tabular datasets (collections of rows and columns) using union and join operations. In this work, both processes were carried out using a similarity measure based on contextual word embeddings, which allows finding semantically similar tables and overcoming the recall problem of lexical approaches based on string similarity. This work is the first attempt to use contextual word embeddings in the whole pipeline of table search and integration, i…
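The abstract's central mechanism, scoring table columns by the similarity of their contextual embeddings rather than by string overlap, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's actual pipeline: the encoder choice (all-MiniLM-L6-v2 loaded through sentence-transformers), the mean pooling of the header and sampled cell values, and all helper names are hypothetical.

```python
# Minimal sketch (not the paper's exact method) of scoring two table
# columns with contextual embeddings instead of string similarity.
# Assumptions: model name, pooling strategy, and sample size are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any contextual encoder works

def column_embedding(header: str, values: list[str]) -> np.ndarray:
    # Represent a column by its header plus a sample of cell values,
    # then average their contextual embeddings (one simple pooling choice).
    texts = [header] + values[:20]
    vecs = model.encode(texts)
    return vecs.mean(axis=0)

def column_similarity(col_a: tuple, col_b: tuple) -> float:
    # Cosine similarity between the two pooled column embeddings.
    a = column_embedding(*col_a)
    b = column_embedding(*col_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

score = column_similarity(
    ("country", ["France", "Japan", "Brazil"]),
    ("nation",  ["Germany", "Canada", "Chile"]),
)
print(f"cosine similarity: {score:.3f}")
```

Columns such as "country" and "nation" share almost no characters yet land close together in embedding space, which is precisely the recall gap of lexical string matching that the abstract points to.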

Cited by 2 publications (1 citation statement) · References 33 publications
“…Previous work [51] has shown the superiority of contextual word embeddings, such as BERT and RoBERTa [52], over static word embeddings like Word2vec and fastText, as well as traditional information retrieval techniques such as BM25 [53]. For this reason, this evaluation focuses on five different language models featuring diverse architectures that produce contextual word embeddings: General Text Embeddings [56]: GTE models primarily rely on the BERT framework and currently come in three sizes: large, base, and small.…”
Section: Large Language Models (citation type: mentioning, confidence: 99%)
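Since the statement above singles out GTE as one of the evaluated contextual encoders, a hedged sketch of ranking candidate texts against a query with a GTE checkpoint may help make the comparison concrete. The Hugging Face model id thenlper/gte-base, the toy query, and the candidate texts are assumptions introduced here, not details taken from the cited evaluation.

```python
# Hedged sketch: rank candidate texts against a query by cosine similarity
# using a GTE checkpoint. The model id "thenlper/gte-base" is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-base")

query = "tables listing country populations"
candidates = [
    "nation, inhabitants, census year",
    "player, team, goals scored",
    "state, number of residents",
]

# Encode the query and candidates, then score each candidate by cosine similarity.
q_vec = model.encode(query, convert_to_tensor=True)
c_vecs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(q_vec, c_vecs)[0]

for text, score in sorted(zip(candidates, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.3f}  {text}")
```

The same loop would work with any of the encoder families the statement names (BERT, RoBERTa, or a static baseline for comparison); only the model id changes.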