2016
DOI: 10.1016/j.neucom.2015.11.028

A novel negative sampling based on TFIDF for learning word representation

Cited by 31 publications (12 citation statements). References 18 publications. Citing publications span 2017 to 2024.

“…The titles of papers published in the SIGMOD, VLDB, SIGIR, CIKM, ICDE and EDBT conferences from 2004 to 2013 are selected. From this dataset, we choose the titles of 8,884 papers to test our algorithm and the comparison methods; these contain 8,572 terms after stop-word removal, and the entries of the term-document matrix are assigned by the TF*IDF model [53, 54].…”
Section: Results (mentioning)
confidence: 99%
“…For each document d_i ∈ D, transform it into the vector form V_i = (v_1, v_2, …, v_M), where v_j is the weight of term t_j as described in Definition 1, computed by counting the number of times that term t_j occurs in document d_i. Precisely, v_j is usually defined by the normalized TF*IDF (term frequency * inverse document frequency) model [53, 54], which is widely used for measuring term weights in a document set [55, 56]. Specifically, the entry is assigned the TF*IDF of term t_j as it occurs in document d_i.…”
Section: Methods (mentioning)
confidence: 99%
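As a concrete illustration of the weighting this excerpt describes, here is a minimal sketch of building a normalized TF*IDF term-document matrix. The tokenized corpus, the L2 normalization, and the plain log IDF are illustrative assumptions, not details taken from the cited paper.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a normalized TF*IDF term-document matrix.

    docs: list of documents, each a list of tokens (stop words removed).
    Returns (vocab, matrix) where matrix[i][j] is the weight of
    vocab[j] in docs[i].
    """
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    matrix = []
    for d in docs:
        tf = Counter(d)
        # Raw TF*IDF weight: term count times log inverse document frequency.
        row = [tf[t] * math.log(n_docs / df[t]) if tf[t] else 0.0
               for t in vocab]
        # L2-normalize so each document vector has unit length.
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        matrix.append([w / norm for w in row])
    return vocab, matrix

# Tiny hypothetical corpus of paper titles, already tokenized.
docs = [["negative", "sampling", "word", "representation"],
        ["word", "embedding", "sampling"],
        ["graph", "clustering"]]
vocab, M = tfidf_matrix(docs)
```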
“…Hierarchical softmax [44], which builds a Huffman tree over the vocabulary, reduces the per-update cost from T to log₂(T). Negative sampling [11], [45] approximates the softmax by sampling several negative examples instead of iterating over the entire vocabulary.…”
Section: A. Word2vec (mentioning)
confidence: 99%
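Since negative sampling is the mechanism the cited paper modifies, a minimal sketch of the standard word2vec-style negative sampler may help. The 0.75 smoothing exponent follows the usual word2vec convention; the table size, corpus, and function names are illustrative assumptions.

```python
import random
from collections import Counter

def build_sampling_table(tokens, table_size=100_000, power=0.75):
    """Unigram table for negative sampling: a word is drawn with
    probability proportional to count(word) ** power."""
    counts = Counter(tokens)
    total = sum(c ** power for c in counts.values())
    table = []
    for word, c in counts.items():
        # Allocate table slots proportional to the smoothed frequency.
        slots = max(1, int(round(table_size * (c ** power) / total)))
        table.extend([word] * slots)
    return table

def sample_negatives(table, target, k=5):
    """Draw k negative words, skipping the positive target word."""
    negatives = []
    while len(negatives) < k:
        w = random.choice(table)
        if w != target:
            negatives.append(w)
    return negatives

tokens = "the cat sat on the mat the dog sat".split()
table = build_sampling_table(tokens, table_size=100)
print(sample_negatives(table, target="cat", k=3))
```

Drawing from this precomputed table is O(1) per negative example, which is what makes the approximation cheap compared with iterating over the full vocabulary.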
“…It exploited the label information, the label relationships, the data distribution, and the correlation among different kinds of features simultaneously. A more extensive review can be found in [12][13][14].…”
Section: Introduction (mentioning)
confidence: 99%
“…As such, there are primarily two classes of sample-reduction methods: sampling and vector quantization [12][13][14][15]. Sampling is generally used to quickly reduce the data in an investigated dataset; it is popular for its simplicity and ease of execution, with random sampling (RS) [16] being the most common example.…”
Section: Introduction (mentioning)
confidence: 99%
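As a concrete instance of the sampling class this excerpt mentions, here is a minimal reservoir-sampling sketch for drawing a fixed-size uniform random subset in a single pass. The sample size and the data stream are illustrative assumptions.

```python
import random

def reservoir_sample(stream, k):
    """Uniform random sample of k items from an iterable of unknown
    length, using a single pass (reservoir sampling)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing element with decreasing probability,
            # so every item is kept with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Reduce a dataset of 10,000 points to a random subset of 100.
subset = reservoir_sample(range(10_000), k=100)
```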