Evaluation of Cross-Validation Strategies in Sequence-Based Binding Prediction Using Deep Learning

Rio, Angela Lopez-del; Nonell-Canals, Alfons; Vidal, David; Perera-Lluna, Alexandre

doi:10.1021/acs.jcim.8b00663

Cited by 30 publications

(44 citation statements)

References 48 publications

“…Explanatory models have found use in the formal description of differences in performance as a function of design factors (Lopez-del Rio et al , 2019; Picart-Armada et al , 2019). Following (Picart-Armada et al , 2019), the trends in AUROC and AUPRC were described through logistic-like quasibinomial models with a logit link function, as a generalisation of logistic models to prevent over and under-dispersion issues.…”

Section: Methods (mentioning)

confidence: 99%
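
For reference, a quasibinomial model with a logit link, as named in the statement above, is commonly written as below. The notation is assumed here and not taken from the cited papers: $y_i \in [0,1]$ is the performance metric (e.g. AUROC) of run $i$, $\mathbf{x}_i$ encodes its design factors, $\boldsymbol{\beta}$ are the coefficients, $p$ is the number of coefficients, and $\phi$ is the dispersion parameter.

```latex
\begin{align}
\operatorname{logit}(\mu_i) &= \log\frac{\mu_i}{1-\mu_i} = \mathbf{x}_i^{\top}\boldsymbol{\beta},
  \qquad \mu_i = \mathbb{E}[y_i], \\
\operatorname{Var}(y_i) &= \phi\,\mu_i\,(1-\mu_i), \\
\hat{\phi} &= \frac{1}{n-p}\sum_{i=1}^{n}
  \frac{(y_i-\hat{\mu}_i)^2}{\hat{\mu}_i\,(1-\hat{\mu}_i)}.
\end{align}
```

Setting $\phi = 1$ recovers the variance function of an ordinary logistic (binomial) model; an estimate $\hat{\phi} > 1$ indicates over-dispersion and $\hat{\phi} < 1$ under-dispersion, which is what the quasibinomial generalisation is meant to absorb.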

The effect of statistical normalisation on network propagation scores

Picart‐Armada

Thompson

Buil

et al. 2020

Preprint

Self Cite

Motivation Network diffusion and label propagation are fundamental tools in computational biology, with applications such as gene-disease association, protein function prediction and module discovery. More recently, several publications have introduced a permutation analysis after the propagation process, due to concerns that network topology can bias diffusion scores. This raises the question of the statistical properties and the presence of bias of such diffusion processes in each of their applications. In this work, we characterised some common null models behind the permutation analysis and the statistical properties of the diffusion scores. We benchmarked seven diffusion scores on three case studies: synthetic signals on a yeast interactome, simulated differential gene expression on a protein-protein interaction network and prospective gene set prediction on another interaction network. For clarity, all the datasets were based on binary labels, but we also present theoretical results for quantitative labels.
Results Diffusion scores starting from binary labels were affected by the label codification, and exhibited a problem-dependent topological bias that could be removed by the statistical normalisation. Parametric and non-parametric normalisation addressed both points by being codification-independent and by equalising the bias. We identified and quantified two sources of bias (mean value and variance) that yielded performance differences when normalising the scores. We provided closed formulae for both and showed how the null covariance is related to the spectral properties of the graph. Although none of the proposed scores systematically outperformed the others, normalisation was preferred when the sought positive labels were not aligned with the bias. We conclude that the decision on bias removal should be problem- and data-driven, i.e. based on a quantitative analysis of the bias and its relation to the positive entities.
Availability The code is publicly available at https://github.com/b2slab/diffuBench Contact sergi.picart@upc.edu

“…CNNs imply translational invariance [10] and can be used to find relevant patterns with biological meaning [8,5,11,12]. For their part, bidirectional RNNs (and the derived Long Short-Term Memory and Gated Recurrent Units) are appropriate for modelling biological sequences since they are suited for data with a sequential but non-causal structure, variable length, and long-range dependencies [13,14,15,16]. Both architectures are usually combined, as in DEEPre [17], where a CNN-RNN model performs a hierarchical classification of enzymes.…”

Section: Introduction (mentioning)

confidence: 99%

“…The analogy between text and proteins, understood as sequences of characters with a meaning, motivates the application of Natural Language Processing (NLP) techniques to amino acid sequences. Along these lines, machine-learning derived embeddings [23,24,25,26] and one-hot encoding [14,9,27,12,17,7] have become very popular. Specifically, the latter method has been widely used in protein-based DL models since neural networks are able to extract features from raw data.…”

Section: Introduction (mentioning)

confidence: 99%

“…The main problem of one-hot encoding is that each protein has a different length, while all the input vectors should be of the same size to be fed into the model. To overcome this issue, sequence padding and truncation are usually applied [7,12,14,9]. This means establishing a common length for all the proteins and then, truncating longer proteins to that length or filling shorter proteins with an "artificial" character up until that length (see Figure 1A).…”

Section: Introduction (mentioning)

confidence: 99%
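
The padding and truncation step described in the statement above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the cited authors' implementation: the function name `one_hot_fixed_length`, the alphabet ordering, the use of all-zero rows as the "artificial" padding symbol, and the choice to leave unknown residues as zero rows are all assumptions made here.

```python
# Assumed alphabet: the 20 standard amino acids; the ordering is arbitrary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_fixed_length(sequence, max_len):
    """Return a max_len x 20 one-hot matrix (list of lists) for one protein.

    Longer sequences are truncated to max_len; shorter ones are filled
    with all-zero rows at the end (assumed padding convention).
    """
    truncated = sequence[:max_len]
    rows = []
    for residue in truncated:
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:  # unknown residues (e.g. 'X') stay all-zero
            row[idx] = 1
        rows.append(row)
    # Pad with zero rows until the common length is reached.
    rows.extend([0] * len(AMINO_ACIDS) for _ in range(max_len - len(truncated)))
    return rows


short = one_hot_fixed_length("ACD", 5)      # padded with 2 zero rows
long_ = one_hot_fixed_length("ACDEFG", 4)   # truncated to "ACDE"
```

Every protein then yields a matrix of the same shape, which is the property the model input requires.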

“…Padding zeros can be added at any point of the sequence, for example at the N- and C-terminals of the sequences [28]. In practice they are usually added at the end [7,14]. However, details on the concrete steps when padding the sequences are often omitted as they are deemed of low importance for the results of the study [12,9,27,17].…”

Section: Introduction (mentioning)

confidence: 99%
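
To make the difference between padding positions concrete, the sketch below pads at the character level either at the end of the sequence or split across both terminals. It is a hypothetical illustration: the `pad_sequence` helper, the `"-"` padding symbol and the mode names `"end"` and `"both"` are assumptions introduced here and do not come from the cited works.

```python
PAD = "-"  # hypothetical padding symbol, chosen for illustration


def pad_sequence(sequence, max_len, mode="end"):
    """Pad (or truncate) a sequence to max_len.

    mode="end":  all padding is appended after the C-terminal.
    mode="both": padding is split between the N- and C-terminals;
                 when the amount is odd, the extra symbol goes at the end.
    """
    seq = sequence[:max_len]
    missing = max_len - len(seq)
    if mode == "end":
        return seq + PAD * missing
    if mode == "both":
        left = missing // 2
        return PAD * left + seq + PAD * (missing - left)
    raise ValueError(f"unknown padding mode: {mode!r}")


print(pad_sequence("MKV", 7, mode="end"))   # MKV----
print(pad_sequence("MKV", 7, mode="both"))  # --MKV--
```

The same residues end up at different input positions depending on the mode, which is why the exact padding procedure is worth reporting.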

Effect of Sequence Padding on the Performance of Protein-Based Deep Learning Models

Rio

Martin

Perera-Lluna

et al. 2020

Preprint

Self Cite

Background The use of raw amino acid sequences as input for protein-based deep learning models has gained popularity in recent years. This scheme requires handling proteins of different lengths, while deep learning models require same-shape input. To accomplish this, zeros are usually added to each sequence up to an established common length in a process called zero-padding. However, the effect of different padding strategies on model performance and data structure is still unknown.
Results We analysed the impact of different ways of padding the amino acid sequences in a hierarchical Enzyme Commission number prediction problem. Our results show that padding has an effect on model performance even when convolutional layers are involved. We propose and implement four novel types of padding for amino acid sequences.
Conclusions The present study highlights the relevance of the step of padding the one-hot encoded amino acid sequences when building deep learning-based models for Enzyme Commission number prediction. The fact that this has an effect on model performance should raise awareness of the need to justify the details of this step in future works. The code of this analysis is available at https://github.com/b2slab/padding_benchmark.
