When it comes to single-cell gene-expression data, biologists face an embarrassment of riches. There are thousands of data sets to choose from. Unfortunately, those data sets have not all been processed in the same way; they might use different names for similar or identical cells or tissues; and they are scattered across the Internet, or available only on request.
Using any one data set is relatively straightforward. But collecting, curating and integrating the data to draw conclusions across experiments is, in the words of bioinformatician Timothy Triche Jr at the Van Andel Institute in Grand Rapids, Michigan, "a huge pain in the butt".
In one 2023 study (ref. 1), for instance, computational biologist Christina Theodoris at Gladstone Institutes in San Francisco, California, described a deep-learning model called Geneformer. Building on some 30 million single-cell transcriptomic data sets that Theodoris manually aggregated in 2021, Geneformer allows researchers to predict the impact of gene perturbations in cell types or genes it has never seen. But because the data were scattered across 18 public databases and multiple independent laboratories, she says, "it took me two months to collect all that data and process it".
"That's approximately 11-and-a-half million more cells than we would typically run."
The CZ CELLxGENE tool helps researchers to visualize gene-expression data.