2019
DOI: 10.1186/s13062-019-0236-y

A high-performance approach for predicting donor splice sites based on short window size and imbalanced large samples

Abstract: Background: Splice site prediction has been a long-standing problem in bioinformatics. Although many computational approaches to splice site prediction have achieved satisfactory accuracy, further improvement in predictive accuracy is significant because it contributes to more accurate gene structure prediction. Determining a proper window size before prediction is necessary. An overly long window may introduce irrelevant features, which would reduce predictive accuracy, while th…

Citations: cited by 8 publications (4 citation statements)
References: 46 publications
“…The sequence segments upstream and downstream of the SS dinucleotide contain information allowing the discrimination of SS and non-SS, such as the BPS, polypyrimidine tract (PPT) or regulatory cis-elements including exon/intron splicing enhancers or silencers (ESE/ISE or ESS/ISS) [49]. Determining a pertinent sequence length is important because too short genomic regions would prevent the model from using important discriminatory sites, while too large genomic regions may introduce noise-inducing features and loss of accuracy [50]. We then built CNN prediction models for donor and acceptor SS, using these different sequence lengths.…”
Section: Results
Citation type: mentioning
confidence: 99%
“…The sequence segments upstream and downstream of the SS dinucleotide contain information allowing the discrimination of SS and non-SS, such as the BPS, polypyrimidine tract (PPT) or regulatory cis-elements including exon/intron splicing enhancers or silencers (ESE/ISE or ESS/ISS) [49]. Determining a pertinent sequence length is important because too short genomic regions would prevent the model from using important discriminatory sites, while too large genomic regions may introduce noise-inducing features and loss of accuracy [50]. We then built CNN prediction models for donor and acceptor SS, using these different sequence lengths.…”
Section: Results
Citation type: mentioning
confidence: 99%
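
The trade-off described in the statement above (too short a region loses discriminatory sites, too long a region adds noise) is usually explored by cutting windows of several lengths around each candidate splice-site dinucleotide. The following is a minimal Python sketch of that step only. The function name `extract_window`, the parameters `upstream` and `downstream`, and the toy sequence are hypothetical illustrations, not taken from the cited papers.

```python
from typing import Optional


def extract_window(sequence: str, site_index: int,
                   upstream: int, downstream: int) -> Optional[str]:
    """Cut a window around a candidate splice-site dinucleotide.

    site_index is the 0-based position of the first base of the
    dinucleotide (for example the G of a donor GT). The window holds
    `upstream` bases before it, the two dinucleotide bases, and
    `downstream` bases after it. Returns None when the window would
    run past either end of the sequence.
    """
    start = site_index - upstream
    end = site_index + 2 + downstream
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


# Toy sequence (an assumption for illustration); a GT sits at index 7.
toy_sequence = "ACGTACGGTAAGTCCATG"
short_window = extract_window(toy_sequence, 7, upstream=3, downstream=4)
too_long_window = extract_window(toy_sequence, 7, upstream=10, downstream=4)
print(short_window, too_long_window)
```

Running the same extraction with several `upstream`/`downstream` pairs gives one dataset per window size, which can then be compared on held-out accuracy.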
“…For each position in donor site-containing sequences, a 2 × 4 contingency table can be built by counting the frequencies of 4 bases in the positive and negative samples. Following on from ChiMIC, Zeng et al. [26] compressed the 2 × 4 table of each position into a 2 × l (2 ≤ l ≤ 4) table using local chi-square test, and developed a high-performance approach to predict donor splice sites based on this compression strategy.…”
Section: Compression For the 2 × 20 Contingency Table Of Each Position
Citation type: mentioning
confidence: 99%
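
The per-position 2 × 4 table described in the statement above can be sketched in Python as follows. This covers only the counting step and a standard Pearson chi-square statistic on the full table; it is not the local chi-square compression to a 2 × l table, whose details are not given here. The names `contingency_table` and `chi_square` and the toy sequences are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence

BASES = "ACGT"  # column order of the 2 x 4 table


def contingency_table(positives: Sequence[str], negatives: Sequence[str],
                      position: int) -> List[List[int]]:
    """Count each base at `position` in positive and negative samples.

    Row 0 holds counts from the positive samples, row 1 from the
    negative samples; columns follow BASES.
    """
    pos_counts = Counter(seq[position] for seq in positives)
    neg_counts = Counter(seq[position] for seq in negatives)
    return [[pos_counts.get(base, 0) for base in BASES],
            [neg_counts.get(base, 0) for base in BASES]]


def chi_square(table: List[List[int]]) -> float:
    """Pearson chi-square statistic; cells with zero expectation are skipped."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic


# Toy aligned windows (hypothetical data, all of equal length).
toy_positives = ["GTAAGT", "GTGAGT", "GTAAGA"]
toy_negatives = ["GTCCTA", "GTTTCC", "GTACGG"]
table_at_2 = contingency_table(toy_positives, toy_negatives, position=2)
print(table_at_2, chi_square(table_at_2))
```

A position where positive and negative samples share the same base distribution yields a statistic of zero, so it carries no discriminating signal under this measure.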
“…This algorithm could search the approximate optimal split by unequal interval optimizing and can capture a wide range of associations, both linear and nonlinear. In this paper, we employed the improved MIC algorithm, Chi-MIC [18], [19], to find suitable splitting points for the numeric attributes.…”
Section: Introduction
Citation type: mentioning
confidence: 99%