Generative pretraining from large-scale transcriptomes: Implications for single-cell deciphering and clinical translation

Shen, Hongru; Shen, Xiaohui; Hu, Jiani; Liu, Jilei; Zhang, Chao; Wu, Dan; Feng, Fengyao; Yang, Meng; Li, Yang; Yang, Yichen; Wang, Wei; Zhang, Qiang; Yang, Jilong; Chen, Kexin; Li, Xiangchun

doi:10.1101/2022.01.31.478596

Cited by 13 publications

(20 citation statements)

References 49 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Despite these results, there have been few attempts to adopt the transformer architecture into single-cell biology and applications thereof. Shen et al (2022) use the transformer decoder setup to learn the gene name sequence of highly expressed genes, without considering the actual sequenced expression abundance. This leads to loss of major biological signal, as the expression values are informative of cell state and gene-gene relationships.…”

Section: Related Workmentioning

confidence: 99%

scFormer: A Universal Representation Learning Approach for Single-Cell Data Using Transformers

Cui

Wang

Maan

et al. 2022

Preprint

View full text Add to dashboard Cite

Single-cell sequencing has emerged as a promising technique to decode cellular heterogeneity and analyze gene functions. With the high throughput of modern techniques and resulting large-scale sequencing data, deep learning has been used extensively to learn representations of individual cells for downstream tasks. However, most existing methods rely on fully connected networks and are unable to model complex relationships between both cell and gene representations. We hereby propose scFormer, a novel transformer-based deep learning framework to jointly optimize cell and gene embeddings for single-cell biology in an unsupervised manner. By drawing parallels between natural language processing and genomics, scFormer applies self-attention to learn salient gene and cell embeddings through masked gene modelling. scFormer provides a unified framework to readily address a variety of downstream tasks such as data integration, analysis of gene function, and perturbation response prediction. Extensive experiments using scFormer show state-of-the-art performance on seven datasets across the relevant tasks. The scFormer model implementation is available at https://github.com/bowang-lab/scFormer.

show abstract

Section: Related Workmentioning

confidence: 99%

scFormer: A Universal Representation Learning Approach for Single-Cell Data Using Transformers

Cui

Wang

Maan

et al. 2022

Preprint

View full text Add to dashboard Cite

show abstract

“…tGPT [34], GeneCompass [35], SCimilarity [36], UCE [37] and CellPLM [38]. Details of these models can be found in Appendices D and E. Furthermore, no studies to date have comprehensively evaluated the utility of these models and provided guidance for model training.…”

Section: Introductionmentioning

confidence: 99%

“…There is a limited number of robust pre-trained models (known as single-cell LLMs) capable of managing multiple tasks in single-cell research. Some single-cell LLMs focus on cell-type annotation or gene function prediction, including scBERT [28], tGPT [29], CellLM [30], and Geneformer [31], while others aim to create a foundation model in this area that can handle multiple tasks, including scGPT [32], scFoundation [33] and SCimilarity [34]. Details of these models can be found in Appendices D and E. Furthermore, no studies to date have comprehensively evaluated the utility of these models and provide guidance for model training.…”

Section: Introductionmentioning

confidence: 99%

Evaluating the Utilities of Foundation Models in Single-cell Data Analysis

Liu,

Li,

Wang

et al. 2023

Preprint

View full text Add to dashboard Cite

Large Language Models (LLMs) have made significant strides in both industrial and scientific domains. In this paper, we evaluate the performance of LLMs in single-cell sequencing data analysis through comprehensive experiments across eight downstream tasks pertinent to single-cell data. By comparing seven different single-cell LLMs with task-specific methods, we found that single-cell LLMs may not consistently excel in all tasks than task-specific methods. However, the emergent abilities and the successful applications of cross-species/cross-modality transfer learning of LLMs are promising. In addition, we present a systematic evaluation of the effects of hyper-parameters, initial settings, and stability for training single-cell LLMs based on a proposedscEvalframework, and provide guidelines for pre-training and fine-tuning. Our work summarizes the current state of single-cell LLMs, and points to their constraints and avenues for future developments.

show abstract

“…These initial explorations have revealed a specific key challenge: the inability to transfer the insights gained by these models across different applications due to their distinct architectures. To overcome this, a wave of research initiatives [11,12,13,14,15] is underway, aiming to develop a foundational model that first deciphers latent information from unlabeled scRNA-seq data and then adapts this knowledge to various tasks. In scBERT [11], introduced by Yang and team in 2022, represents genes as tokens and employs an advanced transformer mechanism [16] to analyze over 16,000 gene tokens per cell.…”

Section: Introductionmentioning

confidence: 99%

Bioformers: A Scalable Framework for Exploring Biostates Using Transformers

Amara-Belgadi,

Li,

Zhang

et al. 2023

Preprint

View full text Add to dashboard Cite

Generative pre-trained models, such as BERT and GPT, have demonstrated remarkable success in natural language processing and computer vision. Leveraging the combination of large-scale, diverse datasets, transformers, and unsupervised learning, these models have emerged as a promising method for understanding complex systems like language. Despite the apparent differences, human language and biological systems share numerous parallels. Biology, like language, is a dynamic, interconnected network where biomolecules interact to create living entities akin to words forming coherent narratives. Inspired by this analogy, we explored the potential of using transformer-based unsupervised model development for analyzing biological systems and proposed a framework that can ingest vast amounts of biological data to create a foundational model of biology using BERT or GPT. This framework focuses on the concept of a ‘biostate,’ defined as a high-dimensional vector encompassing various biological markers such as genomic, proteomic, transcriptomic, physiological, and phenotypical data. We applied this technique to a small dataset of single-cell transcriptomics to demonstrate its ability to capture meaningful biological insights into genes and cells, even without any pre-training. Furthermore, the model can be readily used for gene network inference and genetic perturbation prediction.

show abstract

Generative pretraining from large-scale transcriptomes: Implications for single-cell deciphering and clinical translation

Cited by 13 publications

References 49 publications

scFormer: A Universal Representation Learning Approach for Single-Cell Data Using Transformers

scFormer: A Universal Representation Learning Approach for Single-Cell Data Using Transformers

Evaluating the Utilities of Foundation Models in Single-cell Data Analysis

Bioformers: A Scalable Framework for Exploring Biostates Using Transformers

Contact Info

Product

Resources

About