This paper studies the use of Latent Semantic Analysis (LSA) for automatic clustering of Czech news articles. We show that LSA is capable of yielding good results in this task as it allows us to reduce the problem of synonymy. This is a very important factor particularly for Czech, which belongs to a group of highly inflective and morphologicallyrich languages. The experimental evaluation of our clustering scheme and investigation of LSA is performed on query-and category-based test sets. The obtained results demonstrate that the automatic system yields values of the Rand index that are absolutely lower -by 20% -than the accuracy of human cluster annotations. We also show which similarity metric should be used for cluster merging and the effect of dimension reduction on clustering accuracy.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.