2016
DOI: 10.1002/pra2.2016.14505301065

Document representation methods for clustering bilingual documents

Abstract: Globalization places people in a multilingual environment. A growing number of users access and share information in several languages for public or private purposes. To deliver relevant information in different languages, efficient management of multilingual documents is worthy of study. Classification and clustering are two typical methods for document management. However, the lack of training data and the high effort required for corpus annotation increase the cost of classifying multilin…

citations
Cited by 11 publications
(13 citation statements)
references
References 49 publications
“…Experimental results show that the method performs better than tf-idf with/without stops words and word2vec with/without stop words. A major drawback in their work is that, stops words increase the dimensionality of the feature vectors which impacts badly on the classification accuracy and computational burden [8]. Also the classification algorithm used was a linear SVM, other kernels such as string and RBF kernels could produce better results [20].…”
Section: Related Work 21 Web Page Classification (mentioning)
confidence: 99%
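
The dimensionality point in the statement above can be illustrated with a minimal sketch. It is not the cited authors' pipeline: the stop-word list, the two sample documents, and the function names are hypothetical, and the weighting is the plain tf × log(N/df) variant of tf-idf.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"the", "a", "of", "in", "is", "and", "to"}


def tokenize(text, remove_stop_words):
    """Lowercase, split on whitespace, and optionally drop stop words."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def tfidf_vectors(docs, remove_stop_words=False):
    """Return (vocabulary, vectors) using tf * log(N / df) weighting."""
    tokenized = [tokenize(d, remove_stop_words) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Hypothetical sample documents.
docs = [
    "the price of the stock is rising",
    "a guide to hiking in the alps",
]
vocab_all, vectors_all = tfidf_vectors(docs, remove_stop_words=False)
vocab_kept, vectors_kept = tfidf_vectors(docs, remove_stop_words=True)
print(len(vocab_all), len(vocab_kept))
```

Each vector has one component per vocabulary term, so keeping stop words enlarges every feature vector in this toy corpus without adding discriminative terms.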
“…To achieve high classification result of the Web Page Classification (WPC) system, an excellent representation of textual data (Preprocessing/DR) should contain as much information as possible from the original document [8]. Also, the accuracy of most classification algorithms depends on the quality and size of training data which is inherently dependent on the document representation technique [9].…”
Section: Introduction (mentioning)
confidence: 99%
“…In this paper, we therefore treat the topics as clusters, and apply the Silhouette Coefficient instead. This method has been previously used for finding the optimal number of topics (Panichella et al, 2013; Ma et al, 2016), and is suitable for our LDA approach, since LDA is fully unsupervised. Nevertheless, in future work, it may be worth evaluating some probability measures such as loglikelihood and perplexity, and comparing the performance using these methods.…”
Section: LDA Model (mentioning)
confidence: 99%
“…In the silhouette analysis (Ma et al, 2016), silhouette coefficients close to +1 indicate that the samples in the cluster are far away from the neighbouring clusters. In contrast, a negative silhouette coefficient means that the samples might have been assigned to the wrong cluster.…”
Section: LDA Model (mentioning)
confidence: 99%
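
The interpretation of the silhouette coefficient quoted above can be checked with a small sketch. It uses the standard definition s = (b - a) / max(a, b), where a is a sample's mean distance to the other members of its own cluster and b is its smallest mean distance to any other cluster. The Euclidean distance, the sample points, and the function names are assumptions for illustration, not taken from the cited papers.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_samples(points, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for each sample."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention: a sample alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical data: two tight, well-separated groups of points.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_samples(points, [0, 0, 1, 1])  # sensible assignment
bad = silhouette_samples(points, [0, 1, 0, 1])   # each cluster mixes both groups
print(sum(good) / len(good), sum(bad) / len(bad))
```

With the sensible assignment every coefficient is close to +1; with the mixed assignment every coefficient is negative, matching the reading given in the citation statement. Averaging the per-sample values gives a single score that can be compared across candidate numbers of clusters or topics.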