Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 2014
DOI: 10.1145/2600428.2609597
Continuous word embeddings for detecting local text reuses at the semantic level

Abstract: Text reuse is a common phenomenon in a variety of user-generated content. With the rapid expansion of social media, local text reuses are occurring far more frequently than ever before. Detecting these local reuses is an essential step for many applications and has attracted extensive attention in recent years. However, most previous work has not considered similarities at the semantic level. In this paper, we introduce a novel method to efficiently detect local reuses at …

Cited by 10 publications (10 citation statements). References: 27 publications.
“…Gupta et al (2014) proposed an auto-encoder approach to mixed-script query expansion, considering transliterated search queries and learning a character-level "topic" joint distribution over features of both scripts. Zhang et al (2014) considered the task of local text reuse, with three annotators labeling whether or not a given passage represents reuse. New variants of DSSM were proposed.…”
Section: A Brief History of Neural IR
confidence: 99%
“…Explicit The goal of Zhang et al (2014) is to efficiently retrieve passages that are semantically similar to a query, making use of hashing methods on word vectors that are learned in advance. Other than the given word vectors, no further deep learning is used.…”
Section: Aggregate
confidence: 99%
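The citation above describes the general pipeline of representing passages with pre-learned word vectors and using hashing to retrieve semantically similar candidates efficiently. Below is a minimal sketch of that idea using random-hyperplane (SimHash-style) hashing over averaged word embeddings; the hash size, toy embedding table, and function names are illustrative assumptions, not the actual method of Zhang et al. (2014).

# Minimal sketch: random-hyperplane hashing over averaged word embeddings,
# illustrating retrieval of semantically similar passages via hashing of
# pre-learned word vectors (illustrative only, not the paper's exact method).
import numpy as np

def passage_vector(tokens, embeddings, dim):
    # Average the pre-learned word vectors of a passage; unknown words are skipped.
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def simhash(vec, hyperplanes):
    # One bit per random hyperplane: the sign of the projection.
    return tuple((hyperplanes @ vec) > 0)

rng = np.random.default_rng(0)
dim, n_bits = 100, 16
hyperplanes = rng.standard_normal((n_bits, dim))

# Toy embedding table; in practice the vectors are learned in advance.
embeddings = {w: rng.standard_normal(dim) for w in ["text", "reuse", "passage", "query"]}

index = {}
for pid, passage in enumerate([["text", "reuse"], ["passage", "query"]]):
    key = simhash(passage_vector(passage, embeddings, dim), hyperplanes)
    index.setdefault(key, []).append(pid)

# Candidate passages for a query are those sharing its hash bucket.
query_key = simhash(passage_vector(["text", "reuse"], embeddings, dim), hyperplanes)
print(index.get(query_key, []))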
“…Use PLSA to compute word-topic distributions, fold those distributions in at the block level, and then choose segmentation points based on the similarity values of adjacent block pairs. (Sun, Li, Luo & Wu, 2008; Zhang, Kang, Qian & Huang, 2014; Rangel, Faria, Lima & Oliveira, 2016) apply LDA to a corpus of segments, compute inter-segment similarities via a Fisher kernel, and optimize the segmentation via dynamic programming. (Misra, Yvon, Jose, & Cappe, 2009; Glavaš, Nanni & Ponzetto, 2016) use a document-level LDA model, treat sections as new documents and infer their LDA representations, and then perform segmentation via dynamic programming with probabilistic scores.…”
Section: Background and Related Work
confidence: 99%
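As a rough illustration of the adjacent-block strategy quoted above (compute block-level topic distributions, then cut where neighbouring blocks diverge), here is a small sketch; the cosine threshold and toy topic distributions are assumptions for illustration only.

# Minimal sketch: place segmentation points where adjacent block-level topic
# distributions become dissimilar (a generic stand-in for the PLSA fold-in step).
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def boundaries(block_topics, threshold=0.5):
    # A boundary is placed after block i when its topic distribution is
    # sufficiently dissimilar from that of block i + 1.
    cuts = []
    for i in range(len(block_topics) - 1):
        if cosine(block_topics[i], block_topics[i + 1]) < threshold:
            cuts.append(i + 1)
    return cuts

# Toy block-level topic distributions (rows sum to 1): a topic shift after block 2.
blocks = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
])
print(boundaries(blocks))  # [2]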
“…(Misra et al, 2009; Glavaš, Nanni & Ponzetto, 2016). Clues from the document's logical structure, scientific criteria, and statistical similarity measures are mainly used to identify thematically coherent, contiguous text blocks in unstructured documents (Sun et al, 2008; Zhang et al, 2014; Rangel et al, 2016). Recent segmentation techniques have taken advantage of advances in generative topic modeling algorithms, which were specifically designed to identify topics within text and compute word-topic distributions (Lee, Han & Whang, 2007; Hung, Peng & Lee, 2015).…”
Section: Background and Related Work
confidence: 99%
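The dynamic-programming step mentioned in both citation statements above can be sketched generically as follows; the coherence score, maximum segment length, and toy data are placeholders rather than any cited paper's actual scoring function.

# Minimal sketch: segmentation via dynamic programming over per-segment scores,
# in the spirit of the LDA / Fisher-kernel pipelines cited above.
import numpy as np

def segment(n_blocks, score, max_len=4):
    # best[j] = best total score of a segmentation of blocks [0, j).
    best = [0.0] + [-np.inf] * n_blocks
    back = [0] * (n_blocks + 1)
    for j in range(1, n_blocks + 1):
        for i in range(max(0, j - max_len), j):
            cand = best[i] + score(i, j)
            if cand > best[j]:
                best[j], back[j] = cand, i
    # Recover segment boundaries by following back-pointers.
    cuts, j = [], n_blocks
    while j > 0:
        cuts.append(j)
        j = back[j]
    return sorted(cuts)

# Toy score: a segment is rewarded for staying within a single "topic" region.
topics = [0, 0, 0, 1, 1, 2]
score = lambda i, j: (j - i) if len(set(topics[i:j])) == 1 else -1.0
print(segment(len(topics), score))  # [3, 5, 6]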
“…Through a look-up table (the word embedding matrix M), the i-th question q_i can be represented by E^w_i = {e^w_ij, 1 ≤ j ≤ N_i}, where e^w_ij is the word embedding of w_ij. Following the Fisher kernel (FK) framework (Clinchant and Perronnin, 2013; Sanchez et al, 2013; Zhang et al, 2014b), questions are modeled by a probability density function. In this work, we use a Gaussian mixture model (GMM) for this purpose.…”
Section: Fisher Vector Generation
confidence: 99%
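To make the quoted Fisher-kernel construction concrete, the sketch below fits a GMM on background word embeddings and computes a Fisher-vector style gradient with respect to the component means only; the dimensionality, component count, and use of scikit-learn are assumptions for illustration, not details from the cited work.

# Minimal sketch: a Fisher-vector style representation of a question, built from
# a GMM fitted on word embeddings (gradient w.r.t. the Gaussian means only).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
dim, n_components = 50, 4

# Background corpus of word embeddings used to fit the GMM.
background = rng.standard_normal((1000, dim))
gmm = GaussianMixture(n_components=n_components, covariance_type="diag", random_state=0)
gmm.fit(background)

def fisher_vector(word_vectors, gmm):
    # Soft posteriors gamma_i(k) of each word vector under each Gaussian component.
    gamma = gmm.predict_proba(word_vectors)              # (N, K)
    diff = word_vectors[:, None, :] - gmm.means_[None]   # (N, K, D)
    diff /= np.sqrt(gmm.covariances_)[None]              # diagonal sigmas
    # Gradient with respect to the means, normalized by N and the component weight.
    grad_mu = (gamma[:, :, None] * diff).sum(axis=0)
    grad_mu /= word_vectors.shape[0] * np.sqrt(gmm.weights_)[:, None]
    return grad_mu.ravel()                               # length K * D

# A "question" is a small bag of word embeddings E^w_i = {e^w_ij}.
question = rng.standard_normal((7, dim))
fv = fisher_vector(question, gmm)
print(fv.shape)  # (200,)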