This article describes the many capabilities offered by the TraMineR toolbox for categorical sequence data. It focuses more specifically on the analysis and rendering of state sequences. Addressed features include the description of sets of sequences by means of transversal aggregated views, the computation of longitudinal characteristics of individual sequences and the measure of pairwise dissimilarities. Special emphasis is put on the multiple ways of visualizing sequences. The core element of the package is the state sequence object in which we store the set of sequences together with attributes such as the alphabet, state labels and the color palette. The functions can then easily retrieve this information to ensure presentation homogeneity across all printed and graphical displays. The article also demonstrates how TraMineR's outcomes give access to advanced analyses such as clustering and statistical modeling of sequence data.
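As an illustration of what a transversal aggregated view looks like, the sketch below computes the state distribution at each position across a set of equal-length sequences. It is plain Python and not TraMineR's R interface; the function name `state_distribution`, the toy sequences and the three-state alphabet (S, E, U) are assumptions made for this example only.

```python
from collections import Counter


def state_distribution(sequences, alphabet):
    """Return, for each position, the share of sequences in each state.

    Hypothetical helper: assumes all sequences have the same length.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    n = len(sequences)
    table = []
    for t in range(length):
        # Count the states observed at position t across all sequences.
        counts = Counter(s[t] for s in sequences)
        table.append({state: counts[state] / n for state in alphabet})
    return table


# Toy data: S = school, E = employed, U = unemployed (illustrative labels).
toy_sequences = ["SSE", "SEE", "SEU", "EEE"]
distribution = state_distribution(toy_sequences, alphabet="SEU")
for position, shares in enumerate(distribution, start=1):
    print(position, shares)
```

Passing the alphabet explicitly mirrors the idea of storing it alongside the sequences, so that states absent at a given position still appear with a share of zero.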
Summary. This is a comparative study of the multiple ways of measuring dissimilarities between state sequences. The originality of the study is the focus put on the differences between sequences that are sociologically important when studying life courses such as family life trajectories or professional careers. These differences essentially concern the sequencing (the order in which successive states appear), the timing and the duration of the spells in successive states. The study examines the sensitivity of the measures to these three aspects analytically and empirically by means of simulations. Even if some distance measures underperform, the study shows that there is no universally optimal distance index, and that the choice of a measure depends on which aspect we want to focus on. From the review and simulation results, the paper derives guidelines to help the end user to choose the right dissimilarity measure for her or his research objectives. This study also introduces novel ways of measuring dissimilarities that overcome some flaws in existing measures.
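To make the contrast between measures concrete, here is a minimal sketch comparing two common dissimilarities: a position-wise Hamming distance and an edit distance of the optimal matching type. The cost values (indel 1, substitution 2) and the toy sequences are assumptions chosen for illustration and are not taken from the study.

```python
def hamming(a, b, sub_cost=1.0):
    """Position-wise mismatch count times sub_cost (equal lengths only)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sub_cost * sum(x != y for x, y in zip(a, b))


def optimal_matching(a, b, indel=1.0, sub_cost=2.0):
    """Edit distance with constant indel and substitution costs."""
    n, m = len(a), len(b)
    # prev[j] holds the cost of turning the first i-1 states of a
    # into the first j states of b.
    prev = [j * indel for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel] + [0.0] * m
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            curr[j] = min(
                prev[j] + indel,      # delete a[i-1]
                curr[j - 1] + indel,  # insert b[j-1]
                prev[j - 1] + cost,   # match or substitute
            )
        prev = curr
    return prev[m]


# Same ordering of states, shifted by one position.
seq_a = "ABABAB"
seq_b = "BABABA"
print(hamming(seq_a, seq_b, sub_cost=2.0))  # every position differs
print(optimal_matching(seq_a, seq_b))       # one deletion plus one insertion
```

With these costs the Hamming distance is 12 and the edit distance is 2: the position-wise measure penalizes the shift in timing at every position, whereas the edit distance absorbs it with a single deletion and insertion.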
In this article we define a methodological framework for analyzing the relationship between state sequences and covariates. Inspired by the ANOVA principles, our approach looks at how the covariates explain the discrepancy of the sequences. We use the pairwise dissimilarities between sequences to determine the discrepancy, which then makes it possible to develop a series of statistical-significance-based analysis tools. We introduce generalized simple and multi-factor discrepancy-based methods to test for differences between groups, a pseudo R² for measuring the strength of sequence-covariate associations, a generalized Levene statistic for testing differences in the within-group discrepancies, as well as tools and plots for studying the evolution of the differences along the timeframe and a regression tree method for discovering the most significant discriminant covariates and their interactions. In addition, we extend all methods to account for case weights. The scope of the proposed methodological framework is illustrated using a real-world sequence dataset.
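The discrepancy-based pseudo R² can be sketched as follows. The sketch assumes, as in distance-based generalizations of ANOVA, that a sum of squares is the sum of pairwise dissimilarities within a set divided by the number of cases in that set; this definition, the unweighted setting and the names `sum_of_squares` and `pseudo_r2` are assumptions of this example rather than details stated in the abstract.

```python
from itertools import combinations


def sum_of_squares(dist, indices):
    """Sum of pairwise dissimilarities among `indices`, divided by their count."""
    indices = list(indices)
    if not indices:
        return 0.0
    total = sum(dist[i][j] for i, j in combinations(indices, 2))
    return total / len(indices)


def pseudo_r2(dist, groups):
    """Share of the total discrepancy explained by the grouping (unweighted)."""
    n = len(dist)
    ss_total = sum_of_squares(dist, range(n))
    if ss_total == 0:
        raise ValueError("total discrepancy is zero; pseudo R2 is undefined")
    members = {}
    for i, g in enumerate(groups):
        members.setdefault(g, []).append(i)
    ss_within = sum(sum_of_squares(dist, idx) for idx in members.values())
    return 1.0 - ss_within / ss_total


# Toy dissimilarity matrix: cases 0-1 and 2-3 are identical within pairs.
separated = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
# All cases equally distant from each other: the grouping explains little.
uniform = [[0 if i == j else 1 for j in range(4)] for i in range(4)]

labels = ["a", "a", "b", "b"]
print(pseudo_r2(separated, labels))
print(pseudo_r2(uniform, labels))
```

In the first matrix the grouping accounts for all of the discrepancy; in the second the within-group discrepancy remains large, so the value drops to one third. A significance test would typically compare the observed statistic with its distribution under random permutations of the group labels, which is left out here.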
Abstract. This paper is concerned with the summarization of a set of categorical sequences. More specifically, the problem studied is the determination of the smallest possible number of representative sequences that ensure a given coverage of the whole set, i.e. that together have a given percentage of sequences in their neighborhood. The proposed heuristic for extracting the representative subset requires as main arguments a pairwise distance matrix, a representativeness criterion and a distance threshold under which two sequences are considered as redundant or, equivalently, in the neighborhood of each other. It first builds a list of candidates using a representativeness score and then eliminates redundancy. We also propose a visualization tool for rendering the results and quality measures for evaluating them. The proposed tools have been implemented in our TraMineR R package for mining and visualizing sequence data and we demonstrate their efficiency on a real-world example from social sciences. The methods are nonetheless in no way limited to social science data and should prove useful in many other domains.
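The two-step heuristic described above (rank candidates by a representativeness score, then eliminate redundancy until the coverage target is met) can be sketched as follows. The sketch assumes neighborhood density as the representativeness criterion, and the function name, the parameters `radius` and `coverage`, and the toy data are hypothetical; it is not the package's implementation.

```python
def representatives(dist, radius, coverage=0.25):
    """Greedy selection of non-redundant representatives.

    dist     : square matrix of pairwise distances
    radius   : threshold under which two cases count as neighbors
    coverage : target share of cases lying within `radius` of a representative
    """
    n = len(dist)
    # Representativeness score (assumed here): size of each case's neighborhood.
    density = [sum(1 for j in range(n) if dist[i][j] <= radius) for i in range(n)]
    # Candidate list sorted by decreasing score; ties keep the original order.
    candidates = sorted(range(n), key=lambda i: -density[i])

    selected, covered = [], set()
    for c in candidates:
        if len(covered) >= coverage * n:
            break
        # Redundancy elimination: skip candidates close to a chosen representative.
        if any(dist[c][r] <= radius for r in selected):
            continue
        selected.append(c)
        covered.update(j for j in range(n) if dist[c][j] <= radius)
    return selected, len(covered) / n


# Toy example: cases placed on a line, distance = absolute difference.
positions = [0, 1, 2, 10, 11, 20]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
chosen, reached = representatives(toy_dist, radius=1.5, coverage=0.8)
print(chosen, reached)
```

On this toy data the procedure keeps the cases at indices 1 and 3, which together have five of the six cases in their neighborhoods. The isolated case at position 20 stays uncovered because the 80% target is already met.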