Motivation Genomic region sets summarize functional genomics data and define locations of interest in the genome such as regulatory regions or transcription factor binding sites. The number of publicly available region sets has increased dramatically, leading to challenges in data analysis. Results We propose a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. We compared our approach to two simpler methods based on interval unions or term frequency-inverse document frequency and evaluated the methods in three ways: First, by classifying the cell line, antibody, or tissue type of the region set; second, by assessing whether similarity among embeddings can reflect simulated random perturbations of genomic regions; and third, by testing robustness of the proposed representations to different signal thresholds for calling peaks. Our word2vec-based region set embeddings reduce dimensionality from more than a hundred thousand to 100 without significant loss in classification performance. The vector representation could identify cell line, antibody, and tissue type with over 90% accuracy. We also found that the vectors could quantitatively summarize simulated random perturbations to region sets and are more robust to subsampling the data derived from different peak calling thresholds. Our evaluations demonstrate that the vectors retain useful biological information in relatively lower-dimensional spaces. We propose that vector representation of region sets is a promising approach for efficient analysis of genomic region data. Availability https://github.com/databio/regionset-embedding
Results of functional genomics experiments such as ChIP-Seq or ATAC-Seq produce data summarized as a region set. Many tools have been developed to analyze region sets, including computing similarity metrics to compare them. However, there is no way to objectively evaluate the effectiveness of region set similarity metrics. In this paper we present bedshift, a command-line tool and Python API to generate new BED files by making random perturbations to an original BED file. Perturbed files have known similarity to the original file and are therefore useful to benchmark similarity metrics. To demonstrate, we used bedshift to create an evaluation dataset of 3,600 perturbed files generated by shifting, adding, and dropping regions from a reference BED file. Then, we compared four similarity metrics: Jaccard score, coverage score, Euclidean distance, and cosine similarity. The results show that the Jaccard score is most sensitive to detecting adding and dropping regions, while the coverage score is more sensitive to shifted regions.AvailabilityBSD2-licensed source code and documentation can be found at https://bedshift.databio.org.
Functional genomics experiments, like ChIP-Seq or ATAC-Seq, produce results that are summarized as a region set. There is no way to objectively evaluate the effectiveness of region set similarity metrics. We present Bedshift, a tool for perturbing BED files by randomly shifting, adding, and dropping regions from a reference file. The perturbed files can be used to benchmark similarity metrics, as well as for other applications. We highlight differences in behavior between metrics, such as that the Jaccard score is most sensitive to added or dropped regions, while coverage score is most sensitive to shifted regions.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.