SCDV : Sparse Composite Document Vectors using soft clustering over distributional representations

Mekala, Dheeraj; Gupta, Vivek; Paranjape, Bhargavi; Karnick, Harish

doi:10.18653/v1/d17-1069

Cited by 47 publications

(51 citation statements)

References 22 publications


“…We defined proprietary functions using the packages published on the Comprehensive R Archive Network that can be used in the R language. In addition, we executed an original algorithm for creating a unique gene signature feature vector based on the sparse composite document vectors (SCDV) [9] method from NLP using only R language operations.…”

Section: Results (mentioning)

confidence: 99%

“…It should be noted that the original SCDV method of NLP, which is the basis of this method, can increase the speed and accuracy using the sparse method [9]. However, in gene signature analysis, the number of genes corresponding to the number of vocabularies is overwhelmingly small compared to natural language; thus, this step was excluded because the above procedure neither increased speed nor improved accuracy.…”
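For readers unfamiliar with the sparsification step this statement refers to, the following is a minimal sketch of a threshold-based version. The threshold rule (a percentage of the mean of the averaged per-document minimum and maximum entries), the function name `sparsify`, and the default `percent` value are assumptions made for illustration, not details taken from the citing paper.

```python
def sparsify(doc_vectors, percent=4.0):
    """Zero out near-zero entries of dense document vectors.

    Assumed rule: threshold = (percent / 100) * (|a_min| + |a_max|) / 2,
    where a_min and a_max are the averages, over all documents, of each
    vector's smallest and largest entry.
    """
    n = len(doc_vectors)
    a_min = sum(min(v) for v in doc_vectors) / n
    a_max = sum(max(v) for v in doc_vectors) / n
    threshold = (percent / 100.0) * (abs(a_min) + abs(a_max)) / 2.0
    # Keep an entry only if its magnitude reaches the threshold.
    return [[x if abs(x) >= threshold else 0.0 for x in v] for v in doc_vectors]


if __name__ == "__main__":
    docs = [[1.0, -0.01, 0.5, -1.0], [0.02, 2.0, -2.0, 0.3]]
    print(sparsify(docs, percent=4.0))
```

With a small vocabulary, as in the gene-signature setting described above, few entries fall under such a threshold, which is consistent with the authors' decision to skip the step.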

Section: Methods (mentioning)

confidence: 99%

“…Beginning with Doc2Vec [8], which used a distributed representation of words, innovative techniques related to the distributed expression of a large number of sentences have been proposed in the past several years, and the accuracy of document interpretation has improved [9]. Typical methods of distributed representation of documents include statistical semantic extraction methods [10], methods that combine distributed representations of words [11] into document representations [12], methods that directly compress word and document IDs [8], methods of summing word vectors by multiplying the topics and specificities in the documents [9].…”

Section: Introduction (mentioning)

confidence: 99%


Comprehensive biological interpretation of gene signatures using semantic distributed representation

Okuzono

Hoshino

2019

Preprint


Recent rise of microarray and next-generation sequencing in genome-related fields has simplified obtaining gene expression data at whole gene level, and biological interpretation of gene signatures related to life phenomena and diseases has become very important. However, the conventional method is numerical comparison of gene signature, pathway, and gene ontology (GO) overlap and distribution bias, and it is not possible to compare the specificity and importance of genes contained in gene signatures as humans do. This study proposes the gene signature vector (GsVec), a unique method for interpreting gene signatures that clarifies the semantic relationship between gene signatures by incorporating a method of distributed document representation from natural language processing (NLP). In proposed algorithm, a gene-topic vector is created by multiplying the feature vector based on the gene's distributed representation by the probability of the gene signature topic and the low frequency of occurrence of the corresponding gene in all gene signatures. These vectors are concatenated for genes included in each gene signature to create a signature vector. The degrees of similarity between signature vectors are obtained from the cosine distances, and the levels of relevance between gene signatures are quantified. Using the above algorithm, GsVec learned approximately 5,000 types of canonical pathway and GO biological process gene signatures published in the Molecular Signatures Database (MSigDB). Then, validation of the pathway database BioCarta with known biological significance and validation using actual gene expression data (differentially expressed genes) were performed, and both were able to obtain biologically valid results. In addition, the results compared with the pathway enrichment analysis in Fisher's exact test used in the conventional method resulted in equivalent or more biologically valid signatures. Furthermore, although NLP is generally developed in Python, GsVec can execute the entire process in only the R language, the main language of bioinformatics.

... and completeness of human knowledge. Therefore, interpretation is commonly performed by comparing the gene signature, such as differentially expressed genes and gene modules, against a biological gene signature database (such as pathway and GO) and identifying an objective association from a biological perspective [2]. Numerous methodologies for association with pathways have been proposed. Common examples include Fisher's exact test, which is a classical statistical test for the specific overlap of genes; over-representation analysis and gene set enrichment analysis [3], which statistically process the number of overlapping genes and ranking bias by incorporating randomization; and modular enrichment analysis and EnrichNet with graph-based statistics of biological networks [4, 5]. However, these comparisons are numerical, and it is thus not p...
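The similarity step mentioned in this abstract (comparing signature vectors by cosine) can be illustrated with a short sketch. The function names `cosine_similarity` and `rank_by_similarity` and the toy vectors are hypothetical. GsVec itself is described as running in R, so this Python version is only an illustration of the idea.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, signatures):
    """Return (name, similarity) pairs sorted from most to least similar to `query`."""
    scores = {name: cosine_similarity(query, vec) for name, vec in signatures.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical signature vectors; real ones would come from the trained model.
    known = {"pathway_A": [1.0, 0.0, 0.0], "pathway_B": [0.0, 1.0, 1.0]}
    print(rank_by_similarity([0.9, 0.1, 0.0], known))
```

Cosine distance, the term used in the abstract, is commonly defined as one minus this similarity.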



“…Furthermore, a simple averaging of the TF-IDF weighting scheme of word vectors to produce document vectors has been investigated in [32]. The sparse composite document vector (SCDV), which was proposed in [33], extended the weighted averaging of word vectors from sentences to documents by using soft clustering over word vectors, while in [34], the approach was extended to capture also the multisense nature of words and to solve the problem of high dimensionality. This was realized by utilizing multisense word embeddings and by learning in a lower-dimensional manifold.…”
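The composition this statement describes (word vectors weighted by soft cluster membership and by a TF-IDF-style term, then combined per document) can be sketched as follows. This is an illustration under stated assumptions: the soft assignments `soft_assign` are taken as already computed (for example by a Gaussian mixture model), the smoothed IDF formula is one common variant, and all function names are hypothetical rather than taken from the SCDV authors' code.

```python
import math


def idf(word, documents):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    df = sum(1 for doc in documents if word in doc)
    return math.log((1 + len(documents)) / (1 + df)) + 1.0


def word_topic_vector(vec, cluster_probs, idf_weight):
    """Concatenate P(cluster_k | word) * vec over all clusters, scaled by IDF."""
    return [idf_weight * p * x for p in cluster_probs for x in vec]


def document_vector(doc, embeddings, soft_assign, documents):
    """Sum the word-topic vectors of every in-vocabulary word in `doc`."""
    dim = len(next(iter(embeddings.values())))
    k = len(next(iter(soft_assign.values())))
    total = [0.0] * (dim * k)
    for word in doc:
        if word not in embeddings:
            continue  # skip out-of-vocabulary words
        wtv = word_topic_vector(embeddings[word], soft_assign[word], idf(word, documents))
        total = [t + w for t, w in zip(total, wtv)]
    return total


if __name__ == "__main__":
    embeddings = {"a": [1.0, 2.0], "b": [0.0, 1.0]}      # toy 2-D word vectors
    soft_assign = {"a": [0.5, 0.5], "b": [1.0, 0.0]}     # assumed P(cluster | word)
    corpus = [["a", "b"], ["a"]]
    print(document_vector(["a", "b"], embeddings, soft_assign, corpus))
```

The resulting vector has `dim * k` entries, one copy of the embedding space per cluster, which is the source of the high dimensionality that the follow-up work cited as [34] is said to address.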

Section: Feature Selection (mentioning)

confidence: 99%

An AI-Based Methodology for the Automatic Classification of a Multiclass Ebook Collection Using Information From the Tables of Contents

Giannopoulou

Mitrou

2020

IEEE Access


Book recommendation to support professors and students in the identification of relevant sources is of significant importance for both universities and digital libraries and, hence, motivates the development of a recommendation system. This paper aims at automatically classifying a multiclass corpus that was created from ebooks from the Springer collection, which is available through the Hellenic Academic Libraries' subscription, by utilizing an unsupervised neural network (NN) (self-organizing maps, SOM) and two deep neural network (DNN) architectures, namely, a long short-term memory (LSTM) and a convolutional neural network (CNN) combined with an LSTM (CNN+LSTM) under various configuration scenarios. The vector construction leverages information that was extracted from the table of contents (ToC) of each book using the TF-IDF weighting scheme (for the first case) and the Keras tokenizer (for the second). Extensive experiments were conducted using various configurations of preprocessing steps, NN setup and vector and vocabulary sizes to assess their impact on the classifier's performance. Furthermore, we show that majority voting is more suitable for selecting the dominant label for a specified node. The experimental analysis showed the feasibility of developing a recommendation system for supporting professors and students in the identification of related sources based on a detailed thematic description (e.g., abstract or table of contents of a book) rather than a few keywords. In the conducted experiments, the subsystem that utilized the DNN (LSTM) performed the best, with F1-scores of 67% for the 26 categories and 80% for the 5 general categories, whereas SOM realizes F1-scores of less than 5% in both cases.


“…We choose k-medoids (hereafter KM) as our clustering algorithm. For sentence embeddings, we experimented with (i) averaged 300D GloVe embeddings (Pennington et al, 2014), which have been shown to produce surprisingly strong performance in a variety of text classification tasks (Iyyer et al, 2015; Coates and Bollegala, 2018); (ii) skip-thought embeddings (Kiros et al, 2015); and (iii) SCDV (Mekala et al, 2017), a multisense-aware sentence embedding algorithm which builds upon pretrained GloVe embeddings using a Gaussian mixture model. Averaged GloVe embeddings gave the best performance in our experiments; to avoid clutter, we only report those results henceforth.…”
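The baseline this statement reports as strongest (averaged word embeddings clustered with k-medoids) can be sketched roughly as below. Several choices here are assumptions for illustration and are not specified in the quoted text: Euclidean distance via `math.dist` (Python 3.8+), a simple alternating update rather than full PAM, and a deterministic initialisation from the first `k` points. Function names are hypothetical.

```python
import math


def average_embedding(tokens, embeddings):
    """Mean of the vectors of all tokens found in `embeddings`; None if none are found."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def assign(points, medoids, distance):
    """Label each point with the position (0..k-1) of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda j: distance(p, points[medoids[j]]))
            for p in points]


def k_medoids(points, k, distance=math.dist, max_iter=100):
    """Alternate between assigning points and re-picking each cluster's medoid."""
    medoids = list(range(k))  # assumed deterministic start: the first k points
    for _ in range(max_iter):
        labels = assign(points, medoids, distance)
        updated = []
        for j in range(k):
            members = [i for i, lab in enumerate(labels) if lab == j]
            if not members:
                updated.append(medoids[j])  # keep the old medoid for an empty cluster
                continue
            # New medoid: the member with the smallest total distance to the others.
            updated.append(min(
                members,
                key=lambda c: sum(distance(points[c], points[m]) for m in members),
            ))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(points, medoids, distance)


if __name__ == "__main__":
    toy = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [0.5, 0.5]]
    print(k_medoids(toy, k=2))
```

In practice each point would be the output of `average_embedding` for one sentence. Because medoids are actual data points, the method only needs pairwise distances, so swapping in cosine distance requires changing only the `distance` argument.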

Section: Models (mentioning)

confidence: 99%

Picking Apart Story Salads

Wang

Holgate

Durrett

et al. 2018

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing


During natural disasters and conflicts, information about what happened is often confusing, messy, and distributed across many sources. We would like to be able to automatically identify relevant information and assemble it into coherent narratives of what happened. To make this task accessible to neural models, we introduce Story Salads, mixtures of multiple documents that can be generated at scale. By exploiting the Wikipedia hierarchy, we can generate salads that exhibit challenging inference problems. Story salads give rise to a novel, challenging clustering task, where the objective is to group sentences from the same narratives. We demonstrate that simple bag-of-words similarity clustering falls short on this task and that it is necessary to take into account global context and coherence.

(A) Some of the prisoners were survivors of the Battle of Qala-i-Jangi in Mazar-i-Sharif.
(A) Chechnya came under the influence of warlords.
(B) The U.S. invaded Afghanistan the same year when several Taliban prisoners were shot.
(A) Russian federal troops entered Chechnya and ended its independence.
(A) The Russian casualties included at least two commandos killed and 11 wounded.
(B) The dead were buried in the same grave under the authority of Commander Kamal.

Footnote 2: In particular, while we do not focus on creating mixtures with conflicting information, it can often be found in mixtures created based on Wikipedia categories, as shown in Figure 2.

