Protein kinases are a protein family that plays an important role in several complex diseases, such as cancer and cardiovascular and immunological diseases. Kinases have conserved binding sites, so an inhibitor that targets them can show similar activity against different kinases. This can be exploited to create multi-target drugs. On the other hand, selectivity (a lack of similar activities) is desirable in order to avoid toxicity issues. There is a vast amount of kinase activity data in the public domain, which can be used in many different ways. Multi-task machine learning models are expected to excel on these kinds of datasets because they can learn from implicit correlations between tasks (in this case, activities against a variety of kinases). However, multi-task modelling of sparse data poses two major challenges: (i) creating a balanced train-test split without data leakage and (ii) handling missing data. In this work, we construct a kinase benchmark set composed of two balanced splits without data leakage, using random and dissimilarity-driven cluster-based mechanisms, respectively. This dataset can be used for benchmarking and developing kinase activity prediction models. Overall, performance on the dissimilarity-driven cluster-based split is lower than on the random split for all models, indicating that the models generalize poorly. Nevertheless, we show that multi-task deep learning models, on this very sparse dataset, outperform single-task deep learning and tree-based models. Finally, we demonstrate that data imputation does not improve the performance of (multi-task) models on this benchmark set.
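To make challenge (ii) concrete, the sketch below shows one common way of handling missing labels in a sparse compound-by-kinase activity matrix without imputation: the loss is computed over observed entries only. This is an illustration under stated assumptions and not necessarily the approach used in this work; the function name `masked_mse`, the use of NaN as the missing-value marker, and the toy activity values are all hypothetical.

```python
import math


def masked_mse(y_true, y_pred):
    """Mean squared error over observed labels only.

    y_true and y_pred are lists of rows (one row per compound, one column
    per kinase task). A NaN in y_true marks a missing measurement
    (an assumption made for this sketch) and is skipped.
    """
    total = 0.0
    count = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if math.isnan(t):
                continue  # no measurement for this compound-kinase pair
            total += (p - t) ** 2
            count += 1
    # Return 0.0 when nothing was observed, to avoid dividing by zero.
    return total / count if count else 0.0


# Toy data (hypothetical values): 2 compounds x 3 kinases, mostly unmeasured.
nan = float("nan")
y_true = [[6.0, nan, 7.5],
          [nan, 5.0, nan]]
y_pred = [[6.5, 4.0, 7.5],
          [5.0, 5.5, 9.0]]

loss = masked_mse(y_true, y_pred)
```

Only the three measured pairs contribute to `loss`, so predictions for unmeasured compound-kinase pairs neither reward nor penalize the model.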