Metadata retrieval from sequence databases with ffq

Gálvez-Merchán, Ángel; Min, Kyung Hoi; Pachter, Lior; Booeshaghi, A. Sina

doi:10.1093/bioinformatics/btac667

Cited by 18 publications

(11 citation statements)

References 31 publications

“…Our open-source Python and command-line program gget enables efficient and easy programmatic access to information stored in a diverse collection of large, public genomic reference databases. gget works alongside existing tools that fetch user-generated sequencing data (Gálvez-Merchán et al, 2022) to replace ineffective, error-prone manual web access during genomic data analysis. While the gget modules were motivated by experience with tedious single-cell RNA-seq data analysis tasks (Supplementary Figure 1), we anticipate their utility for a wide range of bioinformatics tasks.…”

Section: Discussion (mentioning)

confidence: 99%

Efficient querying of genomic reference databases with gget

Luebbert

Pachter

2023

Bioinformatics

Self Cite

Motivation: A recurring challenge in interpreting genomic data is the assessment of results in the context of existing reference databases. With the increasing number of command line and Python users, there is a need for tools implementing automated, easy programmatic access to curated reference information stored in a diverse collection of large, public genomic databases. Results: gget is a free and open-source command line tool and Python package that enables efficient querying of genomic reference databases, such as Ensembl. gget consists of a collection of separate but interoperable modules, each designed to facilitate one type of database querying required for genomic data analysis in a single line of code. Availability: The manual and source code are available at https://github.com/pachterlab/gget. Supplementary information: Supplementary data are available at Bioinformatics online.

“…For RNA-seq data, we downloaded metaSRA [73] version 1.8 to identify samples associated with potential age and sex information. We then used ffq [74] to fetch sample accession data from the Sequence Read Archive (SRA) [41] to match the sample identifiers used in metaSRA to the run identifiers used in refine.bio. We manually checked these labels as well by reading sample descriptions obtained from SRA.…”

Section: Methods (mentioning)

confidence: 99%

Human pan-body age- and sex-specific molecular phenomena inferred from public transcriptome data using machine learning

Johnson

Krishnan

2023

Preprint

Age and sex are historically understudied factors in biomedical studies even though many complex traits and diseases vary by these factors in their incidence and presentation. As a result, there are massive gaps in our understanding of genes and molecular mechanisms that underlie sex- and age-associated physiology and disease. Hundreds of thousands of publicly-available human transcriptomes capturing gene expression profiles of tissues across the body and subject to various biomedical and clinical factors present an invaluable, yet untapped, opportunity for bridging these gaps. Here, we present a computational framework that leverages these data to infer genome-wide molecular signatures specific to sex and age groups. As the vast majority of these profiles lack age and sex labels, the core idea of our framework is to use the measured expression data to predict missing age/sex metadata and derive the signatures from the predictive models. We first curated ~30,000 primary samples associated with age and sex information and profiled using microarray and RNA-seq. Then, we used this dataset to infer sex-biased genes within eleven age groups along the human lifespan and then trained machine learning (ML) models to predict these age groups from gene expression values separately within females and males. Specifically, we trained one-vs-rest logistic regression classifiers with elastic-net regularization to classify transcriptomes into age groups. Dataset-level cross validation shows that these ML classifiers are able to discriminate between age groups in a biologically meaningful way in each sex across technologies. Further, these predictive models capture sex-stratified age-group 'gene signatures', i.e., the strength and the direction of importance of genes across the genome for each age group in each sex. 
Enrichment analysis of these gene signatures with prior gene annotations helped in identifying age- and sex-associated multi-tissue and pan-body molecular phenomena (e.g., general immune response, inflammation, metabolism, hormone response). Overall, we have presented a path for effectively leveraging massive public omics data collections to investigate the molecular basis of age- and sex-differences in physiology and disease.

“…The status quo is ad hoc; there are a variety of different distribution mechanisms, and none is particularly machine-friendly. Much genomic metadata is deposited onto data-oriented databases, such as GEO or dbGap, where metadata is notoriously difficult to process, leading to a variety of dedicated tools for that purpose (Davis and Meltzer, 2007; Chen et al, 2019; Gumienny, 2019; Choudhary, 2019; Ewels et al, 2020; Cannizzaro et al, 2021; Gálvez-Merchán et al, 2022; Garcia et al, 2022; Khoroshevskyi et al, 2023). Distribution is sometimes intentionally restricted on the basis of privacy.…”

Section: Challenges To Sharing Genomic Metadata (mentioning)

confidence: 99%