Topic modeling, a method of statistically extracting thematic content from a large collection of texts, is used for a wide variety of tasks within text analysis. Though there are a growing number of tools and techniques for exploring single models, comparisons between models are generally reduced to a small set of numerical metrics. These metrics may or may not reflect a model's performance on the analyst's intended task, and can therefore be insufficient to diagnose what causes differences between models. In this paper, we explore task-centric topic model comparison, considering how we can both provide detail for a more nuanced understanding of differences and address the wealth of tasks for which topic models are used. We derive comparison tasks from single-model uses of topic models, which predominantly fall into the categories of understanding topics, understanding similarity, and understanding change. Finally, we provide several visualization techniques that facilitate these tasks, including buddy plots, which combine color and position encodings to allow analysts to readily view changes in document similarity.
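To make the buddy-plot idea concrete, here is a minimal sketch of such a view. The similarity scores (`sim_a`, `sim_b`), the use of matplotlib, and the two-strip layout are illustrative assumptions rather than the paper's implementation: each document keeps the color assigned from its similarity under the first model, so when documents are repositioned by their similarity under the second model, out-of-order colors reveal changes in document similarity.

```python
# A rough sketch of a buddy-plot-style comparison, assuming we already have each
# document's similarity to a chosen focus document under two topic models.
# sim_a and sim_b are synthetic placeholders, not outputs of any particular model.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_docs = 60
sim_a = rng.random(n_docs)                                    # model A similarities
sim_b = np.clip(sim_a + rng.normal(0.0, 0.15, n_docs), 0, 1)  # model B, correlated

fig, ax = plt.subplots(figsize=(8, 2))
# Top strip: position and color both encode similarity under model A.
ax.scatter(sim_a, np.full(n_docs, 1.0), c=sim_a, cmap="viridis", s=40)
# Bottom strip: position encodes similarity under model B, but color is kept
# from model A, so color discontinuities mark documents whose similarity changed.
ax.scatter(sim_b, np.full(n_docs, 0.0), c=sim_a, cmap="viridis", s=40)
ax.set_yticks([0.0, 1.0])
ax.set_yticklabels(["model B", "model A"])
ax.set_xlabel("similarity to focus document")
ax.set_ylim(-0.5, 1.5)
plt.tight_layout()
plt.show()
```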
Many visualizations, including word clouds, cartographic labels, and word trees, encode data within the sizes of fonts. While font size can be an intuitive dimension for the viewer, using it as an encoding can introduce factors that may bias the perception of the underlying values. Viewers might conflate the size of a word's font with a word's length, the number of letters it contains, or with the larger or smaller heights of particular characters ('o' versus 'p' versus 'b'). We present a collection of empirical studies showing that such factors, which are irrelevant to the encoded values, can indeed influence comparative judgements of font size, though less than conventional wisdom might suggest. We highlight the largest potential biases, and describe a strategy to mitigate them.
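For context, a bare-bones example of the kind of encoding at issue is sketched below: data values mapped to font sizes in a word-cloud-like display. The linear point-size scaling and the toy values are assumptions for illustration; the mitigation strategy described in the paper is not reproduced here.

```python
# A minimal font-size encoding: each word's point size is scaled by its value.
# Note that longer words cover more area at the same font size, which is one of
# the potentially biasing factors the studies examine.
import matplotlib.pyplot as plt

values = {"topic": 0.9, "model": 0.6, "visualization": 0.4, "bias": 0.2}

fig, ax = plt.subplots(figsize=(6, 2.5))
for i, (word, value) in enumerate(values.items()):
    ax.text(0.05, 0.9 - 0.25 * i, word, fontsize=10 + 30 * value, va="top")
ax.axis("off")
plt.show()
```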
A valuable task in text visualization is to have viewers make judgments about text that has been annotated (either by hand or by some algorithm such as text clustering or entity extraction). In this work, we look at the ability of viewers to make judgments about the relative quantities of tags in annotated text (specifically text tagged with one of a set of qualitatively distinct colors), and examine design choices that can improve performance at extracting statistical information from these texts. We find that viewers can efficiently and accurately estimate the proportions of tag levels over a range of situations; however, accuracy can be improved through color choice and area adjustments.
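The sketch below mocks up this kind of stimulus: a passage in which each tagged word is drawn in one of a few qualitatively distinct colors, together with the true tag proportions a viewer would be asked to estimate. The random tag assignment, the palette, and the crude layout are placeholders, not the stimuli used in the studies.

```python
# Mock stimulus: words colored by tag level, plus the ground-truth proportions.
# Tags, palette, and layout are illustrative placeholders.
import random
from collections import Counter

import matplotlib.pyplot as plt

random.seed(1)
words = ("lorem ipsum dolor sit amet consectetur adipiscing elit sed do eiusmod "
         "tempor incididunt ut labore et dolore magna aliqua").split()
tags = [random.choice("ABC") for _ in words]
palette = {"A": "#1b9e77", "B": "#d95f02", "C": "#7570b3"}  # qualitative palette

counts = Counter(tags)
print({tag: round(counts[tag] / len(tags), 2) for tag in palette})

fig, ax = plt.subplots(figsize=(8, 2))
x, y = 0.02, 0.85
for word, tag in zip(words, tags):
    ax.text(x, y, word, color=palette[tag], fontsize=12, transform=ax.transAxes)
    x += 0.02 * (len(word) + 1)          # crude fixed-width advance
    if x > 0.9:                          # wrap to the next line
        x, y = 0.02, y - 0.3
ax.axis("off")
plt.show()
```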
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.