A Deep Learning Approach for Process Data Visualization Using t-Distributed Stochastic Neighbor Embedding

Zhu, Wuming; Webb, Zachary; Mao, Kaitian; Romagnoli, José A.

doi:10.1021/acs.iecr.9b00975

Cited by 41 publications

(29 citation statements)

References 14 publications

“…Finally, PCA can be obtained via singular value decomposition, and its dimensionality reduction is limited by the linear correlations found in the input space. In contrast, t-SNE dimensionality reduction results from minimizing the Kullback-Leibler (KL) divergence over all data points, and normally a two- or three-dimensional space is selected as output to allow visualization of the embedded data. The use of t-SNE for applications in data-driven modeling has been investigated in very recent years; however, the focus has been limited to visualization and fault identification (Zhu et al., 2019; Zheng and Zhao, 2020). In this work, t-SNE was chosen because the mentioned characteristics of the method fit well with the requirements of the application for process phase identification.…”

Section: Methods (mentioning)

confidence: 99%
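
To make the objective named in the statement above concrete, here is a minimal pure-Python sketch of the KL divergence between high-dimensional Gaussian affinities and low-dimensional Student-t affinities, as in the standard t-SNE formulation. It is an illustration only, not code from Zhu et al. or from the citing paper. The fixed bandwidth `sigma` (standard t-SNE calibrates one bandwidth per point from the perplexity), the toy data, and all function names are assumptions.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def gaussian_affinities(X, sigma=1.0):
    """Symmetrized input-space affinities p_ij.

    Assumption: one fixed sigma for all points, for brevity.
    """
    n = len(X)
    cond = []
    for i in range(n):
        w = [0.0 if i == j else math.exp(-sq_dist(X[i], X[j]) / (2 * sigma ** 2))
             for j in range(n)]
        total = sum(w)
        cond.append([v / total for v in w])  # conditional p_{j|i}
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]

def student_t_affinities(Y):
    """Embedding-space affinities q_ij from a Student-t kernel (1 degree of freedom)."""
    n = len(Y)
    w = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(Y[i], Y[j])) for j in range(n)]
         for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) summed over all pairs; eps guards against log(0)."""
    return sum(p * math.log((p + eps) / (q + eps))
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0)

# Toy data (hypothetical): two well-separated pairs of points in 3-D.
X = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
Y_good = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]  # keeps neighbors together
Y_bad = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]   # splits neighbors apart

P = gaussian_affinities(X)
kl_good = kl_divergence(P, student_t_affinities(Y_good))
kl_bad = kl_divergence(P, student_t_affinities(Y_bad))
print(kl_good, kl_bad)
```

An embedding that keeps input-space neighbors close yields the lower divergence. A full t-SNE implementation would minimize this quantity by gradient descent over the embedding coordinates.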

“…This approach has already been tested in different applications; however, tuning the ANN to reproduce the manifold learning is a rather complex task with many degrees of freedom. Zhu et al. (2019) propose an algorithm to implement this approach in the visualization of process data through parametric t-SNE. In this paper, an alternative approach is implemented based on SVM for regression.…”

Section: T-distributed Stochastic Neighbour Embedding (mentioning)

confidence: 99%

Manifold Learning and Clustering for Automated Phase Identification and Alignment in Data Driven Modeling of Batch Processes

et al. 2020

Processing data that originates from uneven, multi-phase batches is a challenge in data-driven modeling. Training predictive and monitoring models requires the data to be in the right shape to be informative. Only then can a model learn meaningful features that describe the deterministic variability of the process. The presence of multiple phases in the data, which display different correlation patterns and have an uneven duration from batch to batch, significantly reduces the performance of data-driven modeling methods. Therefore, phase identification and alignment is a critical step and can lead to an unsuccessful modeling exercise if not applied correctly. In this paper, a novel approach is proposed to perform unsupervised phase identification and alignment based on the correlation patterns found in the data. Phase identification is performed via manifold learning using t-Distributed Stochastic Neighbor Embedding (t-SNE), which is a state-of-the-art machine learning algorithm for non-linear dimensionality reduction. The application of t-SNE to a reduced cross-correlation matrix of every batch with respect to a reference batch results in data clustering in the embedded space. Models based on support vector machines (SVMs) are trained to (1) reproduce the manifold learning obtained via t-SNE and (2) determine the membership of the data points to a process phase. Compared to previously proposed clustering approaches for phase identification, this is an unsupervised, non-linear method. The perplexity parameter of the t-SNE algorithm can be interpreted as the estimated duration of the shortest phase in the process. The advantages of the proposed method are demonstrated through its application to an in-silico benchmark case study and to real industrial data from two unit operations in the large-scale production of an active pharmaceutical ingredient (API). The efficacy and robustness of the method are evidenced in the successful phase identification and alignment obtained for these three distinct processes, which display smooth, sudden, and repetitive phase changes. Additionally, the low complexity of the method makes its online implementation feasible.
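
For readers unfamiliar with the perplexity parameter mentioned in the abstract: in the standard t-SNE formulation it equals 2 raised to the Shannon entropy of each point's conditional neighbor distribution, and the kernel precision is tuned per point until a target value is met. The pure-Python sketch below illustrates that calibration with a bisection-style search. It is not taken from the cited paper; the function names, the toy distances, and the target value of 3 are assumptions.

```python
import math

def conditional_probs(sq_dists, beta):
    """p_{j|i} for one point, given squared distances to the other points.

    beta = 1 / (2 * sigma^2) is the precision of the Gaussian kernel.
    """
    d_min = min(sq_dists)  # shift distances for numerical stability
    weights = [math.exp(-beta * (d - d_min)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """Perplexity = 2^H, with H the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 ** entropy

def calibrate_beta(sq_dists, target, tol=1e-5, max_iter=100):
    """Search for the beta whose conditional distribution has the target perplexity."""
    lo, hi = 0.0, None
    beta = 1.0
    for _ in range(max_iter):
        perp = perplexity(conditional_probs(sq_dists, beta))
        if abs(perp - target) < tol:
            break
        if perp > target:
            # Distribution too flat: narrow the kernel by raising beta.
            lo = beta
            beta = beta * 2 if hi is None else (beta + hi) / 2
        else:
            # Distribution too peaked: widen the kernel by lowering beta.
            hi = beta
            beta = (lo + beta) / 2
    return beta

# Toy example (hypothetical): squared distances from one point to six others.
sq_dists = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
beta = calibrate_beta(sq_dists, target=3.0)
print(beta, perplexity(conditional_probs(sq_dists, beta)))
```

A perplexity of k can be read as an effective neighborhood of roughly k points, which is consistent with the abstract's interpretation of the parameter as a duration measured in samples.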

“…In the literature [21], it is mentioned that fine-tuning the trained NN with a small group of unrelated samples can lead to a good result. Zhu et al. [26] generated additional dummy data to fine-tune a reformulated-structured pt-SNE for outlier mapping to realize good industrial process data visualization. However, how to generate additional unrelated data for fine-tuning is still a great challenge.…”

Section: t-SNE and Its Out-of-sample Extensions (mentioning)

confidence: 99%

Out-of-sample data visualization using bi-kernel t-SNE

Zhang, Wang, Gao, et al. 2020

Information Visualization

T-distributed stochastic neighbor embedding (t-SNE) is an effective visualization method. However, it is non-parametric and cannot be applied to streaming data or online scenarios. Although kernel t-SNE provides an explicit projection from a high-dimensional data space to a low-dimensional feature space, some outliers are not well projected. In this paper, bi-kernel t-SNE is proposed for out-of-sample data visualization. Gaussian kernel matrices of the input and feature spaces are used to approximate the explicit projection. Then principal component analysis is applied to reduce the dimensionality of the feature kernel matrix. Thus, the difference between inliers and outliers is revealed, and any new sample can be mapped well. The performance of the proposed method for out-of-sample projection is tested on several benchmark datasets by comparing it with other state-of-the-art algorithms.

“…By matching distances between high-dimensional and low-dimensional spaces, t-distributed stochastic neighbor embedding (t-SNE) is a dimensionality reduction algorithm retaining the original clustering [46]. The whole procedure of the t-SNE is given in the following steps.…”

Section: T-distributed Stochastic Neighbor Embedding (mentioning)

confidence: 99%

“…Thirdly, the effects of the over-sampling methods, including random over-sampling (ROS), synthetic minority over-sampling technique (SMOTE) [40], Border-line SMOTE [41], SVM-SMOTE [42], and Adasyn [43], are systematically explored using the top 2 prediction algorithms that achieve the best performance. Finally, to determine the best prediction model, different feature selection methods, including mutual information (MI) [44], autoencoder (AE) [45], and t-distributed stochastic neighbor embedding (t-SNE) [46], are each incorporated into the top 2 models constructed by a combination of the prediction algorithm and over-sampling methods. Compared with existing methods, experimental results demonstrate that the proposed method achieves superior performance in terms of various performance measures.…”

Section: Introduction (mentioning)

confidence: 99%

PCSPred_SC: Prediction of Protein Citrullination Sites Using an Effective Sequence-Based Combined Method

et al. 2020

As one of the post-translational modifications (PTMs), protein citrullination is crucial in a diverse array of cellular processes and is implicated in a range of human pathologies. Therefore, accurate identification of protein citrullination sites (PCSs) is urgently needed to illuminate the reaction details and the complex pathogenesis related to protein citrullination. In view of the limitations of the existing PCS predictors, this study proposes a novel sequence-based combined method named PCSPred_SC to further enhance the prediction performance. Various feature extraction methods are developed to mine sequence-derived biological information. Within this feature space, the predictive capabilities of different prediction algorithms, over-sampling methods, and feature selection methods are each explored. Experimental results indicate that the over-sampling methods are effective in addressing the imbalanced dataset problem and that the feature selection methods are important for removing irrelevant and redundant features. On the same dataset using 10-fold cross validation, PCSPred_SC, constructed by the combination of support vector machine (SVM), Adasyn, and t-distributed stochastic neighbor embedding (t-SNE), achieves markedly better performance than the competing methods while substantially reducing the number of features used for this task. It is anticipated that the proposed method will provide significant information to broaden our knowledge of citrullination-related biological processes.
