Extracting and Rendering Representative Sequences

Gabadinho, Alexis; Ritschard, Gilbert; Studer, Matthias; Müller, Nicolas S.

doi:10.1007/978-3-642-19032-2_7

Cited by 34 publications

(43 citation statements)

References 9 publications


“…One example is the extraction of typical patterns from sequence databases, an important objective of sequence analysis. This task, which requires nontrivial heuristic procedures when using pairwise dissimilarities (Gabadinho, Ritschard, Studer, and Müller 2011b), can also be achieved with sequence prediction. Another important and promising application is the analysis of the influence of covariates on the patterns.…”

Section: Results (mentioning)

confidence: 99%

Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package

Gabadinho, Ritschard

2016

J. Stat. Soft.

Self Cite


This article presents the PST R package for categorical sequence analysis with probabilistic suffix trees (PSTs), i.e., structures that store variable-length Markov chains (VLMCs). VLMCs make it possible to model high-order dependencies in categorical sequences with parsimonious models based on simple estimation procedures. The package is specifically adapted to the field of social sciences, as it allows for VLMC models to be learned from sets of individual sequences possibly containing missing values; in addition, the package is extended to account for case weights. This article describes how a VLMC model is learned from one or more categorical sequences and stored in a PST. The PST can then be used for sequence prediction, i.e., to assign a probability to whole observed or artificial sequences. This feature supports data mining applications such as the extraction of typical patterns and outliers. This article also introduces original visualization tools for both the model and the outcomes of sequence prediction. Other features such as functions for pattern mining and artificial sequence generation are described as well. The PST package also allows for the computation of probabilistic divergence between two models and the fitting of segmented VLMCs, where sub-models fitted to distinct strata of the learning sample are stored in a single PST.


“…3 If there is less than four countries that means that other countries are presented in less than 15% of cluster affiliations, it here is a + sign it meant that there is more than four countries presented on more than 15% of cluster affiliations. 4 That means that at least 75% of a cluster sequences have a distance to representative sequences which is less than 10% of maximal theoretical distance between sequences within a dataset. Refer to [4] for more details about representative sequences.…”

Section: Clusters (mentioning)

confidence: 99%

Computer scientists from the former USSR

Indukaev, Mogoutov, Lépinay

2014

Proceedings of the 10th Central and Eastern European Software Engineering Conference in Russia


In the present paper, we develop a new method of longitudinal analysis of bibliographic data in order to explore the international mobility of researchers from the former USSR through their publication activity. First, by means of a name recognition algorithm using machine learning, we extracted from Web of Science a dataset of publications by more than three thousand of the most active computer scientists from the former Soviet Union. Then, the information on individuals' scientific production is presented in the form of a sequence of states which summarizes the affiliation location for all articles published by a certain author in a given period. We use the Optimal Matching algorithm to measure the degree of difference (which, in sequence analysis, is called distance) between the sequences of individual researchers' activity. The distance between sequences is analyzed by means of hierarchical clustering, which permits us to group computer scientists from the former USSR into several classes according to publication activity patterns. Not surprisingly, ex-Soviet researchers having a permanent affiliation in their home country are cited less than those who have a permanent foreign affiliation. However, those who switch affiliations from the former USSR to foreign or the other way round and publish in internationalized groups have one of the highest levels of citation per article among newcomers in the discipline. Our research shows that the scientific mobility of successful authors can be not only unidirectional, but can take the form of a complex go-and-return pattern, a claim which relativizes the "brain drain" paradigm in the analysis of migration of highly qualified specialists from the former USSR. On the methodological level, we propose a new method for analyzing scientific activity which takes into account its longitudinal dynamics. This method can be used for research questions going far beyond the scope of migration studies.


“…Although this is not obvious for any kind of complex objects, displaying index-plots like those used in Figure 3 provides a good solution for state sequences. For a somewhat more synthetic view, we could also consider representative plots (Gabadinho, Ritschard, Studer, and Müller 2011b) that show the minimal set of sequences for each node that would be necessary to ensure a given coverage of the sequences at that node.…”

Section: Tree-structured Analysis Of Sequences (mentioning)

confidence: 99%

Discrepancy Analysis of State Sequences

Studer, Ritschard, Gabadinho, et al.

2011

Sociological Methods & Research

Self Cite


In this article we define a methodological framework for analyzing the relationship between state sequences and covariates. Inspired by the ANOVA principles, our approach looks at how the covariates explain the discrepancy of the sequences. We use the pairwise dissimilarities between sequences to determine the discrepancy which makes it then possible to develop a series of statistical-significance-based analysis tools. We introduce generalized simple and multi-factor discrepancy-based methods to test for differences between groups, a pseudo R² for measuring the strength of sequence-covariate associations, a generalized Levene statistic for testing differences in the within-group discrepancies, as well as tools and plots for studying the evolution of the differences along the timeframe and a regression tree method for discovering the most significant discriminant covariates and their interactions. In addition, we extend all methods to account for case weights. The scope of the proposed methodological framework is illustrated using a real-world sequence dataset.
