2018
DOI: 10.2139/ssrn.3259971

Text Similarity in Vector Space Models: A Comparative Study

Abstract: Automatic measurement of semantic text similarity is an important task in natural language processing. In this paper, we evaluate the performance of different vector space models to perform this task. We address the real-world problem of modeling patent-to-patent similarity and compare TFIDF (and related extensions), topic models (e.g., latent semantic indexing), and neural models (e.g., paragraph vectors). Contrary to expectations, the added computational cost of text embedding methods is justified only when:…

Cited by 19 publications (6 citation statements)
References 13 publications
“…Each vector is determined via a term‐frequency inverse‐document‐frequency (TFIDF) approach that up‐weights rare words and down‐weights common words. Although the TFIDF approach is relatively simple, a benchmarking study on patent data shows that TFIDF performs well in situations of long, extended, and highly granular text (Shahmirzadi, Lugowski, and Younge, 2018). The similarity measure is then computed for each pair of patents by determining the angular distance between them using a cosine measure between the patents' two vectors.…”
Section: Results From Testing Assumptions (mentioning)
confidence: 99%
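
The pipeline this citation statement describes (TFIDF weighting followed by a cosine measure between two patent vectors) can be illustrated with a minimal sketch. It is not the cited authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the function names, and the three toy "patent" texts are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TFIDF vector (dict: term -> weight) per document."""
    # Assumption: naive lowercase whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, so rare
        # terms are up-weighted and common terms are down-weighted.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy texts standing in for patent documents.
patents = [
    "rotor blade for a wind turbine",
    "wind turbine rotor blade with a sensor",
    "method for a battery charging circuit",
]
vec_a, vec_b, vec_c = tfidf_vectors(patents)
sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
```

In this toy setup the two wind-turbine texts share several comparatively rare terms, so `sim_ab` comes out larger than `sim_ac`. A real pipeline would likely add stop-word handling and a proper tokenizer.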
“…Consequently, many researchers primarily rely on exercise text modeling to characterize exercises [6][7][8]. For instance, Shahmirzadi et al [9] divide the topic text into sections, assign different weights to the resulting divisions, and combine each division vector to represent a complete topic. Zhang et al [10] exclude images and formula information from the input data, using a prefix tree-based lexicon-free word separation algorithm to extract features from the exercise text and reference answers.…”
Section: Unimodal Characterization Methods (mentioning)
confidence: 99%
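
The section-weighting idea attributed to Shahmirzadi et al. in this statement (split a text into sections, weight each section, and combine the section vectors into one document vector) might look roughly like the sketch below. The section names, the weights, the sparse-dict representation, and the weighted-sum combination rule are all hypothetical choices for illustration; the statement does not specify them.

```python
def combine_sections(section_vectors, section_weights):
    """Combine per-section sparse vectors into one vector by weighted sum.

    Sections missing from section_weights get weight 0.0 (an assumption).
    """
    combined = {}
    for name, vector in section_vectors.items():
        weight = section_weights.get(name, 0.0)
        for term, value in vector.items():
            combined[term] = combined.get(term, 0.0) + weight * value
    return combined


# Hypothetical per-section term weights for a single document.
sections = {
    "title": {"rotor": 1.0, "blade": 1.0},
    "abstract": {"rotor": 2.0, "sensor": 1.0},
    "claims": {"sensor": 3.0, "turbine": 1.0},
}
# Hypothetical section weights.
weights = {"title": 0.5, "abstract": 0.25, "claims": 0.25}

document_vector = combine_sections(sections, weights)
```

The resulting `document_vector` can then be compared with other documents using any vector similarity measure, such as cosine similarity.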
“…The TF-IDF method yields high-quality results in computing the similarity of patent texts [18]. Besides the classical TF-IDF method, some variants of TF-IDF are also used in patent-to-patent similarity analysis [18].…”
Section: ) (unclassified)