Proteomic-based analysis is used to identify biomarkers in blood samples and tissues. Data produced by devices such as mass spectrometers require platforms to identify and quantify proteins (or peptides). Clinical information can be related to mass spectrometry data to identify diseases at an early stage. Machine learning techniques can support physicians and biologists in studying and classifying pathologies.
We present the application of machine learning techniques to define a pipeline for studying and classifying proteomics data enriched with clinical information. The pipeline allows users to relate established blood biomarkers to clinical parameters and proteomics data. It comprises three main phases: (i) feature selection, (ii) model training and (iii) model ensembling. We report our experience of applying this pipeline to prostate-related diseases. Models were trained on a dataset resulting from the integration of clinical and mass spectrometry-based data. The pipeline receives as input data from blood analytes, tissue samples, proteomic analysis and urine biomarkers. It then learns different models for feature selection and classification. The pipeline was applied to two datasets obtained from a two-year research project that aimed to extract hidden information from mass spectrometry, serum and urine samples from hundreds of patients. We report results on a prostate dataset with 163 samples, including 79 patients, and a urine dataset with 121 patients. The pipeline identified peptides of interest in both datasets: 6 in the first and 2 in the second. The best models for the prostate (AUC=0.815, F1=0.8, Specificity=0.75, Sensitivity=0.88) and urine (AUC=0.810, F1=0.824, Specificity=0.805, Sensitivity=0.814) datasets showed good predictive performance. We are confident that the pipeline can be successfully adopted in similar clinical setups.
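
The three phases named above (feature selection, model training, model ensembling) and the reported sensitivity, specificity and F1 can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the synthetic data, the standardized mean-difference ranking, the single-feature threshold classifiers and the majority vote are all assumptions chosen for brevity, and every function name is hypothetical. AUC is omitted because the toy ensemble outputs hard labels only.

```python
import random
import statistics


def make_toy_data(n=120, n_features=8, informative=(0, 3), seed=0):
    """Synthetic stand-in for an integrated clinical + peptide table (not real data)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate controls (0) and cases (1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in informative:
            row[j] += 2.0 * label  # shift the informative features for cases
        X.append(row)
        y.append(label)
    return X, y


def column(X, y, j, label):
    """Values of feature j for the samples carrying the given label."""
    return [row[j] for row, target in zip(X, y) if target == label]


def select_features(X, y, k):
    """Phase (i): rank features by standardized class-mean difference, keep the top k."""
    scores = []
    for j in range(len(X[0])):
        pos, neg = column(X, y, j, 1), column(X, y, j, 0)
        spread = statistics.pstdev(pos + neg) or 1.0  # guard against zero spread
        scores.append((abs(statistics.mean(pos) - statistics.mean(neg)) / spread, j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def train_stump(X, y, j):
    """Phase (ii): a threshold classifier on feature j (midpoint of the class means)."""
    pos_mean = statistics.mean(column(X, y, j, 1))
    neg_mean = statistics.mean(column(X, y, j, 0))
    threshold = (pos_mean + neg_mean) / 2.0
    sign = 1.0 if pos_mean >= neg_mean else -1.0
    return lambda row: int(sign * (row[j] - threshold) > 0)


def ensemble_predict(models, row):
    """Phase (iii): majority vote over the individual models."""
    votes = sum(model(row) for model in models)
    return int(2 * votes > len(models))


def evaluate(y_true, y_pred):
    """Sensitivity, specificity and F1 from the confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


# Hold out the last 40 samples for evaluation.
X, y = make_toy_data()
X_train, y_train, X_test, y_test = X[:80], y[:80], X[80:], y[80:]

selected = select_features(X_train, y_train, k=3)
models = [train_stump(X_train, y_train, j) for j in selected]
predictions = [ensemble_predict(models, row) for row in X_test]
metrics = evaluate(y_test, predictions)
print(selected, metrics)
```

In a real setting each phase would likely rely on an established machine learning library and on cross-validation rather than a single hold-out split; the sketch only shows how the phases feed into one another and how the reported metrics are derived from predictions.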