Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (2017)
DOI: 10.18653/v1/d17-1069

SCDV: Sparse Composite Document Vectors using soft clustering over distributional representations

Abstract: We present a feature vector formation technique for documents, the Sparse Composite Document Vector (SCDV), which overcomes several shortcomings of the current distributional paragraph vector representations that are widely used for text representation. In SCDV, word embeddings are clustered to capture multiple semantic contexts in which words occur. They are then chained together to form document topic-vectors that can express complex, multi-topic documents. Through extensive experiments on multi-class and multi-…
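
The composition step the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the soft-cluster membership probabilities for each word have already been computed (for example by a Gaussian mixture model, as one of the citation statements on this page mentions), and it assumes each word's contribution is additionally scaled by an inverse document frequency weight. The function name `composite_document_vector` and the inputs `word_vecs`, `cluster_probs`, and `idf` are hypothetical names chosen for this sketch.

```python
def composite_document_vector(doc_tokens, word_vecs, cluster_probs, idf):
    """Build one document vector by chaining cluster-weighted word vectors.

    word_vecs:     token -> embedding (list of d floats)
    cluster_probs: token -> soft-cluster memberships (list of K floats);
                   assumed to be precomputed by a soft clustering model
    idf:           token -> idf weight (assumed weighting scheme)
    Returns a list of K * d floats (unnormalized).
    """
    any_word = next(iter(word_vecs))
    dim = len(word_vecs[any_word])
    n_clusters = len(cluster_probs[any_word])
    doc_vec = [0.0] * (dim * n_clusters)
    for token in doc_tokens:
        if token not in word_vecs:
            continue  # skip out-of-vocabulary tokens
        vec = word_vecs[token]
        weight = idf[token]
        for k, prob in enumerate(cluster_probs[token]):
            offset = k * dim  # block k holds the word's share of cluster k
            for j, x in enumerate(vec):
                doc_vec[offset + j] += weight * prob * x
    return doc_vec


if __name__ == "__main__":
    # Toy inputs, invented for illustration only.
    word_vecs = {"bank": [1.0, 0.0], "river": [0.0, 1.0]}
    cluster_probs = {"bank": [0.5, 0.5], "river": [0.0, 1.0]}
    idf = {"bank": 1.0, "river": 2.0}
    print(composite_document_vector(["bank", "river"], word_vecs, cluster_probs, idf))
```

With K clusters and d-dimensional embeddings the result has K × d entries, so a word that belongs to several clusters contributes to several blocks of the document vector.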

Cited by 47 publications (51 citation statements)
References: 22 publications
“…We defined proprietary functions using the packages published on the Comprehensive R Archive Network that can be used in the R language. In addition, we executed an original algorithm for creating a unique gene signature feature vector based on the sparse composite document vectors (SCDV) [9] method from NLP using only R language operations.…”
Section: Results (mentioning)
confidence: 99%
“…It should be noted that the original SCDV method of NLP, which is the basis of this method, can increase the speed and accuracy using the sparse method [9]. However, in gene signature analysis, the number of genes corresponding to the number of vocabularies is overwhelmingly small compared to natural language; thus, this step was excluded because the above procedure neither increased speed nor improved accuracy.…”
Section: Methods (mentioning)
confidence: 99%
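
The sparsification step that this statement refers to can be illustrated with a simple magnitude threshold. This is a sketch under stated assumptions rather than the published procedure: it assumes the threshold is a percentage of the mean of the average per-document minimum and maximum magnitudes, and the function name `sparsify` and its `percent` parameter are hypothetical.

```python
def sparsify(doc_vectors, percent):
    """Zero out entries whose magnitude is below a corpus-level threshold.

    Assumed rule: threshold = percent/100 * (|avg_min| + |avg_max|) / 2,
    where avg_min and avg_max are the per-document minimum and maximum
    entries averaged over all documents.
    """
    if not doc_vectors:
        return []
    n_docs = len(doc_vectors)
    avg_min = sum(min(v) for v in doc_vectors) / n_docs
    avg_max = sum(max(v) for v in doc_vectors) / n_docs
    threshold = (percent / 100.0) * (abs(avg_min) + abs(avg_max)) / 2.0
    return [[x if abs(x) >= threshold else 0.0 for x in v] for v in doc_vectors]


if __name__ == "__main__":
    # Toy vectors, invented for illustration only.
    docs = [[0.5, -0.01, 2.0], [-1.0, 0.02, 0.3]]
    print(sparsify(docs, percent=10))
```

With a small vocabulary, as in the gene-signature setting quoted above, few entries fall below such a threshold, which is consistent with the citing authors' choice to skip the step.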
“…Furthermore, a simple averaging of the TF-IDF weighting scheme of word vectors to produce document vectors has been investigated in [32]. The sparse composite document vector (SCDV), which was proposed in [33], extended the weighted averaging of word vectors from sentences to documents by using soft clustering over word vectors, while in [34], the approach was extended to also capture the multisense nature of words and to solve the problem of high dimensionality. This was realized by utilizing multisense word embeddings and by learning in a lower-dimensional manifold.…”
Section: Feature Selection (mentioning)
confidence: 99%
“…We choose k-medoids (hereafter KM) as our clustering algorithm. For sentence embeddings, we experimented with (i) averaged 300D GloVe embeddings (Pennington et al., 2014), which have been shown to produce surprisingly strong performance in a variety of text classification tasks (Iyyer et al., 2015; Coates and Bollegala, 2018); (ii) skip-thought embeddings (Kiros et al., 2015); and (iii) SCDV (Mekala et al., 2017), a multisense-aware sentence embedding algorithm which builds upon pretrained GloVe embeddings using a Gaussian mixture model. Averaged GloVe embeddings gave the best performance in our experiments; to avoid clutter, we only report those results henceforth.…”
Section: Models (mentioning)
confidence: 99%