Hierarchical Topic modeling (HTM) exploits latent topics and relationships among them as a powerful tool for data analysis and exploration. Despite advantages over traditional topic modeling, HTM poses its own challenges, such as (1) topic incoherence, (2) unreasonable (hierarchical) structure, and(3) issues related to the definition of the "ideal" number of topics and depth of the hierarchy. In this paper, we advance the stateof-the-art on HTM by means of the design and evaluation of CluHTM, a novel nonprobabilistic hierarchical matrix factorization aimed at solving the specific issues of HTM. CluHTM's novel contributions include: (i) the exploration of richer text representation that encapsulates both, global (dataset level) and local semantic information -when combined, these pieces of information help to solve the topic incoherence problem as well as issues related to the unreasonable structure; (ii) the exploitation of a stability analysis metric for defining the number of topics and the "shape" the hierarchical structure. In our evaluation, considering twelve datasets and seven stateof-the-art baselines, CluHTM outperformed the baselines in the vast majority of the cases, with gains of around 500% over the strongest state-of-the-art baselines. We also provide qualitative and quantitative statistical analyses of why our solution works so well.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.