2023
DOI: 10.1186/s13059-023-03088-4

CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure

Ales Varabyou,
Markus J. Sommer,
Beril Erdogdu
et al.

Abstract: CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites. It significantly improves current genome annotation by integrating the latest reference data and algorithms, machine learning techniques for noise filtering, and new protein structure prediction methods. CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at …

Cited by 14 publications (7 citation statements)
References 40 publications
“…Even for extensively studied species, gene annotation catalogs are often incomplete, missing both potential gene loci and many transcript isoforms (Amaral et al, 2023; Varabyou et al, 2023). This is one reason why, unlike TranSigner and NanoCount, most existing tools for quantifying transcripts with long-read RNA-seq data prioritize identifying novel isoforms first.…”
Section: Results (mentioning)
confidence: 99%
“…TranSigner requires two inputs: a GTF file containing a reference gene annotation of the target transcriptome and a FASTQ file containing long RNA-seq reads. The reference annotation can be obtained from public sources such as RefSeq (O’Leary et al, 2016), GENCODE (Frankish et al, 2019), or CHESS (Varabyou et al, 2023), or it can be derived from transcriptome assemblies produced by programs like StringTie. The latter annotations have the advantage of including novel isoforms while restricting the annotated transcripts to only those found to be expressed in the analyzed sample.…”
Section: Methods (mentioning)
confidence: 99%
“…We collected our experimentally identified splice junctions from a large set of assembled transcripts created as part of the initial construction of the CHESS human annotation (Varabyou, Sommer et al 2023), which is based on assemblies of 9,814 RNA-seq samples collected by the GTEx project (Lonsdale, Thomas et al 2013). Note that most of these splice sites do not appear in the final CHESS catalogue.…”
Section: Methods (mentioning)
confidence: 99%
“…Although the human protein-coding gene count has been converging on just under 20,000 genes in recent years (Amaral, Carbonell-Sala et al 2023, Varabyou, Sommer et al 2023), multiple recent studies have suggested the possible presence of thousands of additional short protein-coding genes (Ji, Song et al 2015, Calviello, Mukherjee et al 2016, Raj, Wang et al 2016, van Heesch, Witte et al 2019, Chen, Brunner et al 2020, Gaertner, Van Heesch et al 2020, Martinez, Chu et al 2020, Mudge, Ruiz-Orera et al 2022). Most of these proposed novel genes take the form of short open reading frames (ORFs) that occur just upstream or downstream of existing protein-coding genes, apparently on the same messenger RNA.…”
Section: Introduction (mentioning)
confidence: 99%
“…The variations among individuals and between loci can confound computational methods attempting to distinguish correct and incorrect spliced alignments. As we will show below, this can result in the inclusion of transcripts with spurious junctions in human gene catalogs such as CHESS [5], RefSeq [6], and GENCODE [7].…”
Section: Introduction (mentioning)
confidence: 99%