Introduction
Traditional NLP starts with a hand-engineered layer of representation, the level of tokens or words. A tokenization component first breaks up the text into units using manually designed rules. Tokens are then processed by components such as word segmentation, morphological analysis and multiword recognition. The heterogeneity of these components makes it hard to create integrated models of both structure within tokens (e.g., morphology) and structure across multiple tokens (e.g., multi-word expressions). This approach can perform poorly (i) for morphologically rich languages, (ii) for noisy text, (iii) for languages in which the recognition of words is difficult and (iv) for adaptation to new domains; and (v) it can impede the optimization of preprocessing in end-to-end learning. The workshop provides a forum for discussing recent advances as well as future directions in sub-word and character-level natural language processing and representation learning that address these problems.
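To make the contrast concrete, below is a minimal sketch in Python of the kind of hand-engineered tokenization layer described above, set against the tokenization-free, character-level view. The rule patterns are illustrative assumptions standing in for the rules real pipelines accumulate, not any specific toolkit's behavior.

import re

# A toy hand-engineered tokenizer: each rule is a manually designed
# pattern of the sort real pipelines accumulate (clitics, hyphenation,
# abbreviations, ...). These particular rules are illustrative only.
TOKEN_PATTERN = re.compile(
    r"""
    \w+(?:'\w+)?      # a word, optionally with an apostrophe clitic (don't)
    | [^\w\s]         # any single punctuation mark
    """,
    re.VERBOSE,
)

def rule_based_tokenize(text: str) -> list[str]:
    """Break text into tokens with fixed, language-specific rules."""
    return TOKEN_PATTERN.findall(text)

def character_level(text: str) -> list[str]:
    """The tokenization-free alternative: the model sees raw characters."""
    return list(text)

if __name__ == "__main__":
    s = "Multi-word expressions don't tokenize cleanly."
    print(rule_based_tokenize(s))  # the rules split 'Multi-word' by fiat
    print(character_level(s))     # no rules; structure is left to the model

Whatever the rules decide ("Multi-word" becomes three tokens here) is frozen before learning begins, which is exactly the preprocessing choice that end-to-end, character-level models avoid.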
Topics of Interest:
• tokenization-free models
• character-level machine translation
• character n-gram information retrieval
• transfer learning for character-level models
• models of within-token and cross-token structure
• natural language generation (e.g., of words not seen in training)
• out-of-vocabulary words
• morphology and segmentation
• the relationship between morphology and character-level models
• stemming and lemmatization
• inflection generation
• orthographic productivity
• form-meaning representations
• true end-to-end learning
Abstract
Neural machine translation (NMT) has achieved impressive results in the last few years, but its success has been limited to settings with large amounts of parallel data. One way to improve NMT in lower-resource settings is to initialize a word-based NMT model with pretrained word embeddings. However, rare words still suffer from lower-quality embeddings when trained with standard word-level objectives. We introduce word embeddings that utilize morphological resources and compare them to purely unsupervised alternatives. We work with Arabic, a morphologically rich language with available linguistic resources, and perform Arabic-to-English MT experiments on a small corpus of TED subtitles. We find that word embeddings utilizing subword information consistently outperform standard word embeddings, both on a word similarity task and as initialization of the source-side word embeddings in a low-resource NMT system.
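A minimal sketch of the general idea the abstract describes: composing a word vector from character n-gram vectors (in the style of fastText) so that rare words get usable embeddings, and assembling a matrix from such vectors to initialize an NMT source embedding layer. All names, dimensions, and the n-gram range are illustrative assumptions; this does not reproduce the paper's actual models or resources.

import numpy as np

def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Character n-grams of a word, with boundary markers as in fastText."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def subword_word_vector(word: str,
                        ngram_vectors: dict[str, np.ndarray],
                        dim: int) -> np.ndarray:
    """Word vector as the sum of its known character n-gram vectors.

    fastText-style composition: a rare word still gets a usable vector
    as long as some of its n-grams were seen during embedding training.
    """
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not grams:
        return np.zeros(dim)  # no known n-grams: fall back to a zero vector
    return np.sum([ngram_vectors[g] for g in grams], axis=0)

def build_init_matrix(vocab: list[str],
                      ngram_vectors: dict[str, np.ndarray],
                      dim: int) -> np.ndarray:
    """Matrix (|vocab| x dim) used to initialize NMT source embeddings."""
    return np.stack([subword_word_vector(w, ngram_vectors, dim)
                     for w in vocab])

In an actual system, the matrix returned by build_init_matrix would be copied into the encoder's embedding table before NMT training begins; frequent words could instead take rows from pretrained whole-word vectors, with the subword composition covering the rare ones.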