Manual curation is a bottleneck in processing the vast amount of knowledge present in the scientific literature so that it becomes available in computational resources, e.g., structured databases. Furthermore, such extraction is by necessity limited to the pre-defined concepts, features, and relationships that conform to the model inherent in any knowledgebase. These pre-defined elements contrast with the rich knowledge that natural language is capable of conveying. Here we present a novel experiment in what we call "soft curation", supported by a robust, ad hoc-tuned natural language processing pipeline that quantifies semantic similarity across all sentences of a given corpus of literature. This underlying machinery supports novel ways to navigate and read within individual papers, as well as across the papers of a corpus. As a first proof-of-principle experiment, we applied this approach to more than 100 collections of papers, selected from RegulonDB, that support knowledge of the regulation of transcription initiation in E. coli K-12, resulting in L-Regulon (L for "linguistic") version 1.0. Furthermore, we have initiated the mapping of RegulonDB-curated promoters to their evidence sentences in the corresponding publications. We believe this is the first step in a novel approach that will help users and curators increase the accessibility of knowledge in ways yet to be discovered.

L-Regulon version 1.0 comprises 118 interlinked collections: 111 related to GENSOR Units plus 7 more. In detail, that means more than 800 articles, 233,000 sentences, and almost 1.8 million relationships (interlinks). These data also feed the semantic network and extractive summary features. This version also contains links between 508 promoters in RegulonDB and their source sentences.
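As an illustration of the core mechanism described above, the sketch below quantifies semantic similarity across all sentence pairs of a toy corpus, reports pairs above a threshold as candidate interlinks, and ranks sentences by centrality as a crude extractive summary. It is a minimal sketch only: TF-IDF vectors with cosine similarity (via scikit-learn), the example sentences, and the 0.3 threshold are all assumptions made for illustration, not the actual L-Regulon pipeline.

```python
# Minimal sketch (not the L-Regulon pipeline): pairwise sentence
# similarity over a toy corpus, using TF-IDF + cosine similarity as an
# assumed stand-in for whatever representation the authors tuned.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [  # hypothetical evidence-style sentences
    "CRP activates transcription initiation at the lac promoter.",
    "Transcription from the lac promoter is stimulated by CRP.",
    "FNR represses the ndh promoter under anaerobic conditions.",
    "Repression of ndh by FNR occurs in the absence of oxygen.",
]

# Embed every sentence of the corpus in TF-IDF space.
tfidf = TfidfVectorizer().fit_transform(sentences)

# Cosine similarity between every pair of sentences.
sim = cosine_similarity(tfidf)

# Report pairs above an illustrative cutoff as candidate interlinks.
THRESHOLD = 0.3  # assumed value, purely for demonstration
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        if sim[i, j] >= THRESHOLD:
            print(f"interlink: {i} <-> {j} (score {sim[i, j]:.2f})")

# Crude extractive summary: pick the most "central" sentence, i.e. the
# one with the highest total similarity to the rest of the corpus.
centrality = sim.sum(axis=1)
print("summary sentence:", sentences[int(np.argmax(centrality))])
```

In a production setting, neural sentence embeddings would likely replace TF-IDF, and the resulting pairwise scores could be stored as the sentence-to-sentence interlinks that drive navigation within and across papers.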