Motif analysis has long been an important method to characterize biological functionality, and the current growth of sequencing-based genomics experiments further extends its potential. These diverse experiments often generate sequence lists ranked by some functional property. There is therefore a growing need for motif analysis methods that can exploit this coupled data structure and be tailored for specific biological questions. Here, we present a motif analysis tool, Regmex (REGular expression Motif EXplorer), which offers several methods to identify overrepresented motifs in a ranked list of sequences. Regmex uses regular expressions to define motifs or families of motifs and embedded Markov models to calculate exact probabilities for motif observations in sequences. Motif enrichment is optionally evaluated using random walks, Brownian bridges, or modified rank-based statistics. These features make Regmex well suited for a range of biological sequence analysis problems related to motif discovery. We demonstrate different usage scenarios, including rank correlation of microRNA binding sites co-occurring with a U-rich motif. The method is available as an R package.
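To make the idea of scoring a regular-expression motif along a ranked sequence list concrete, the following is a minimal Python sketch. It is not Regmex's implementation and does not use its R interface: it replaces the embedded Markov model probabilities with a plain presence/absence indicator, and the function names, the toy sequences, and the pattern `U{4}` are all assumptions made for illustration. The centered running sum returns to zero at the end of the list, which is the property a bridge-type statistic relies on.

```python
import re


def motif_walk(ranked_seqs, pattern):
    """Centered running sum of motif hits along a ranked sequence list.

    Simplification (assumption): each sequence contributes 1 if the
    regular expression matches anywhere, else 0. Regmex itself works
    with exact motif probabilities rather than this indicator.
    """
    motif = re.compile(pattern)
    hits = [1 if motif.search(seq) else 0 for seq in ranked_seqs]
    mean_hit = sum(hits) / len(hits)
    walk, total = [], 0.0
    for hit in hits:
        total += hit - mean_hit  # step up on a hit, down otherwise
        walk.append(total)
    return hits, walk


def max_deviation(walk):
    """Largest absolute excursion of the walk from zero."""
    return max(abs(value) for value in walk)


# Toy ranked list (made-up sequences); a hypothetical U-rich motif.
ranked = ["ACGUUUUA", "UUUUGCAC", "GGUUUUCA", "ACGCGCGA", "CCGAGGCA", "GACGACGA"]
hits, walk = motif_walk(ranked, r"U{4}")
print(hits, max_deviation(walk))
```

A large excursion early in the walk suggests that motif-containing sequences cluster near the top of the ranking. Judging significance would still require a null distribution, for example from permuting the ranks, which this sketch leaves out.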
Introduction

According to models of known mutational processes, site-specific hotspots of even just a few mutations become unlikely in large cancer genomic datasets (four mutations in our case). These hotspots may affect cancer development or be a consequence of localised mutational processes. Here, we identify and characterise protein-coding and non-coding site-specific hotspots.

Material and methods

We use whole genome sequencing data from 2583 cancer patients across 37 cancer types from Pan-Cancer Analysis of Whole Genomes (PCAWG) under ICGC/TCGA. We identify SNV and indel hotspots genome-wide, annotate them with their genomic features, and investigate expression correlation and cancer allele fractions.

Results and discussions

We find 566,760 SNV and 169,839 indel hotspots, which are genomic positions with two or more SNVs/indels across patients. A small fraction of the hotspots are in protein-coding regions (0.7% for both sets; 3.3x enrichment of local mutation rate in genomic region for SNVs; 1.7x for indels) and regulatory elements of protein-coding genes (0.9%/1.3x for SNVs; 1.8%/1.04x for indels). Only a small fraction of the protein-coding hotspots fall in the known drivers from Cancer Gene Census (0.9% for SNVs; 0.8% for indels).

Among the top-20 SNV hotspots are 13 positions in known driver sites in protein-coding genes, a known driver site in the TERT promoter, two positions in the PLEKHS1 promoter and a position in a GPR126 intron now known to likely be caused by APOBEC editing, and four non-coding sites possibly caused by different mutational processes.

In contrast, none of the top-20 indel hotspots overlap protein-coding genes or regulatory elements. All 20 are deletion hotspots, and they are located at least 14 kb away from the transcription start site of the nearest protein-coding gene.

One third of the SNV hotspots are almost exclusive to a single cancer type. Cancers with high mutational burden and cancer-type specific mutational processes top the list. Examples are colorectal cancer hotspots, likely driven by patients with microsatellite instability, and melanoma hotspots, likely caused by UV-induced DNA damage.

Moreover, analyses of cancer allele fractions and expression correlation in stratified promoter sets indicate a weak signal of positive selection on a few hotspots in promoters of oncogenes.

Conclusion

We see no clear driver signal from non-coding hotspots other than two already known positions in the TERT promoter. Mutational processes appear to be the dominating contributor to non-coding hotspots.
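The hotspot definition used in the results (a genomic position with two or more SNVs or indels across patients) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the pipeline used in the study: the record layout `(patient, chromosome, position)`, the function name `find_hotspots`, and the toy records are hypothetical, and counting distinct patients per position is one possible reading of "across patients".

```python
from collections import defaultdict


def find_hotspots(mutations, min_patients=2):
    """Return positions mutated in at least `min_patients` distinct patients.

    `mutations` is an iterable of (patient_id, chromosome, position)
    tuples; this layout is an assumption made for the example.
    """
    patients_at = defaultdict(set)
    for patient, chrom, pos in mutations:
        # A set ensures each patient counts once per position.
        patients_at[(chrom, pos)].add(patient)
    return {
        site: len(patients)
        for site, patients in patients_at.items()
        if len(patients) >= min_patients
    }


# Made-up records; chromosome names and coordinates are placeholders.
toy_mutations = [
    ("P1", "chrA", 100),
    ("P2", "chrA", 100),
    ("P3", "chrA", 100),
    ("P1", "chrA", 250),
    ("P2", "chrB", 50),
    ("P3", "chrB", 50),
    ("P3", "chrB", 50),  # duplicate call in the same patient, counted once
]
hotspots = find_hotspots(toy_mutations)
print(hotspots)
```

Raising `min_patients` gives a stricter threshold, such as the four-mutation level the introduction describes as unlikely under models of known mutational processes.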