Despite well-documented effects on human health, the action modes of environmental pollutants are incompletely understood. Transcriptome-based approaches are widely used to predict associations between chemicals and disorders. However, the molecular cues regulating gene expression remain unclear. To elucidate the action modes of pollutants, we propose a data-mining approach, termed "DAR-ChIPEA," that combines epigenome (ATAC-Seq) data with large-scale public ChIP-Seq data (human, n = 15,155; mouse, n = 13,156) to identify transcription factors (TFs) that are enriched not only in gene-adjacent domains but also across differentially accessible genomic regions, thereby integratively regulating gene expression upon pollutant exposure. The resultant pollutant–TF matrices are then cross-referenced to a repository of TF–disorder associations to account for pollutant modes of action. For example, TFs that regulate Th1/2 cell homeostasis are integral in the pathophysiology of tributyltin-induced allergic disorders; fine particulates (PM2.5) inhibit the binding of C/EBPs, Rela, and Spi1 to the genome, thereby perturbing normal blood cell differentiation and leading to immune dysfunction; and lead induces fatty liver by altering hepatic circadian rhythms, which disrupts the normal regulation of lipid metabolism. Thus, our approach has the potential to reveal pivotal TFs that mediate adverse effects of pollutants, thereby facilitating the development of strategies to mitigate environmental pollution damage.
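The abstract above does not state which statistic is used to call a TF "enriched" in differentially accessible regions (DARs). As a rough illustration only, the sketch below assumes a one-sided Fisher's exact test on overlap counts: how many DARs versus background regions intersect the ChIP-Seq peaks of one TF. The function names, the tuple layout for regions, and the toy coordinates are all hypothetical and are not taken from the DAR-ChIPEA implementation.

```python
from math import comb

# Regions and peaks are (chromosome, start, end) tuples with half-open coordinates.
# This layout is an assumption made for the sketch, not the authors' data format.

def overlaps_any(region, peaks):
    """Return True if the region intersects at least one peak on the same chromosome."""
    chrom, start, end = region
    return any(c == chrom and s < end and start < e for c, s, e in peaks)

def count_overlaps(regions, peaks):
    """Count how many regions intersect at least one peak."""
    return sum(1 for r in regions if overlaps_any(r, peaks))

def enrichment_p_value(dar_hits, n_dar, bg_hits, n_bg):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Probability of seeing at least `dar_hits` peak-overlapping regions among
    the DARs if overlapping regions were spread at random over DARs and background.
    """
    total = n_dar + n_bg
    hits = dar_hits + bg_hits
    upper = min(hits, n_dar)
    numerator = sum(
        comb(hits, k) * comb(total - hits, n_dar - k)
        for k in range(dar_hits, upper + 1)
    )
    return numerator / comb(total, n_dar)

# Toy example with made-up coordinates for a single hypothetical TF.
tf_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
dars = [("chr1", 150, 250), ("chr1", 550, 650), ("chr2", 100, 120), ("chr3", 0, 10)]
background = [("chr1", 300, 400), ("chr2", 200, 300), ("chr3", 20, 30), ("chr3", 40, 50)]

dar_hits = count_overlaps(dars, tf_peaks)
bg_hits = count_overlaps(background, tf_peaks)
p_value = enrichment_p_value(dar_hits, len(dars), bg_hits, len(background))
print(dar_hits, bg_hits, p_value)
```

Repeating this test for every TF and every pollutant would give one row per pollutant and one column per TF, which is one plausible way to picture the pollutant–TF matrix the abstract mentions. A real analysis would also need multiple-testing correction and an interval index rather than the linear scan used here.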
Motivation: Biological sequence classification is the most fundamental task in bioinformatics analysis. For example, in metagenome analysis, binning is a typical type of DNA sequence classification. In order to classify sequences, it is necessary to define sequence features. The k-mer frequency, base composition, and alignment-based metrics are commonly used. In contrast, in the field of image recognition using machine learning, image classification is broadly divided into approaches based on shape and approaches based on style. A style matrix was introduced as a method of expressing the style of an image (e.g., color usage and texture). Results: We propose a novel sequence feature, called genomic style, inspired by image classification approaches, for classifying and clustering DNA sequences. As with the style of images, the DNA sequence is considered to have a genomic style unique to the bacterial species, and the style matrix concept is applied to the DNA sequence. Our main aim is to introduce the genomic style as yet another basic sequence feature for the metagenome binning problem, in place of the most commonly used sequence feature, k-mer frequency. Performance evaluations show that our method using the style matrix achieves higher accuracy than state-of-the-art binning tools based on k-mer frequency.
Motivation: Biological sequence classification is the most fundamental task in bioinformatics analysis. For example, in metagenome analysis, binning is a typical type of DNA sequence classification. In order to classify sequences, it is necessary to define sequence features. The k-mer frequency, base composition, and alignment-based metrics are commonly used. On the other hand, in the field of image recognition using machine learning, image classification is broadly divided into approaches based on shape and approaches based on style. A style matrix was introduced as a method of expressing the style of an image (e.g., color usage and texture). Results: We propose a novel sequence feature, called genomic style, inspired by image classification approaches, for classifying and clustering DNA sequences. As with the style of images, the DNA sequence is considered to have a genomic style unique to the bacterial species, and the style matrix concept is applied to the DNA sequence. Our main aim is to introduce the genomic style as yet another basic sequence feature for the metagenome binning problem, in place of the most commonly used sequence feature, k-mer frequency. Performance evaluations showed that our method using a style matrix has the potential for accurate binning when compared with state-of-the-art binning tools based on k-mer frequency. Availability and implementation: The source code for the implementation of this genomic style method, along with the dataset for the performance evaluation, is available from https://github.com/friendflower94/binning-style. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
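To make the style matrix idea concrete, here is a minimal sketch, assuming the style matrix is the Gram matrix of convolutional feature maps as in image style transfer. The abstract does not say how the feature maps are obtained for DNA, so the filters below are hand-written placeholders standing in for whatever trained network is actually used; every function name here is hypothetical and unrelated to the code in the linked repository.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq.upper()]

def conv_features(encoded, filters):
    """Valid 1D convolution followed by ReLU.

    `filters` is a list of filters, each a list of `width` rows of 4 weights.
    Returns one feature map (a list over sequence positions) per filter.
    """
    width = len(filters[0])
    n_pos = len(encoded) - width + 1
    feature_maps = []
    for f in filters:
        row = []
        for i in range(n_pos):
            s = sum(f[j][c] * encoded[i + j][c] for j in range(width) for c in range(4))
            row.append(max(0.0, s))
        feature_maps.append(row)
    return feature_maps

def style_matrix(feature_maps):
    """Gram matrix of the feature maps, averaged over sequence positions.

    Entry (i, j) measures how strongly filters i and j fire together along
    the sequence, independent of where along the sequence that happens.
    """
    n_pos = len(feature_maps[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n_pos for fj in feature_maps]
        for fi in feature_maps
    ]

# Placeholder filters of width 1 that each respond to a single base.
# With these, the style matrix reduces to a diagonal of base frequencies.
base_filters = [[[1.0 if c == k else 0.0 for c in range(4)]] for k in range(4)]

gram = style_matrix(conv_features(one_hot("AACG"), base_filters))
for row in gram:
    print(row)
```

With the trivial width-1 filters the result is just base composition, which shows how the style matrix generalizes simpler features; wider, learned filters would capture co-occurrence of longer motifs. In a binning setting, the flattened matrix for each contig could then be passed to a clustering method, though the choice of network and clustering step is not specified in the text above.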