Network slicing is a key feature of 5G and beyond networks, allowing the deployment of separate logical networks (network slices) that share a common underlying physical infrastructure and are characterized by distinct descriptors and behaviors. The dynamic allocation of physical network resources among coexisting slices must address a challenging trade-off: using resources efficiently while assigning each slice enough resources to meet its service level agreement (SLA). We consider the allocation of time-frequency resources from a new perspective: designing a control algorithm capable of learning on the operating network while keeping the SLA violation rate below an acceptable level during the learning process. For this purpose, traditional model-free reinforcement learning (RL) methods present several drawbacks: low sample efficiency, extensive exploration of the policy space, and an inability to discriminate between conflicting objectives, causing inefficient use of resources and/or frequent SLA violations during the learning process. To overcome these limitations, we propose a model-based RL approach built upon a novel modeling strategy that comprises a kernel-based classifier and a self-assessment mechanism. In numerical experiments, our proposal, referred to as kernel-based RL, clearly outperforms state-of-the-art RL algorithms in terms of SLA fulfillment, resource efficiency, and computational overhead.

• We present a new perspective that focuses on the importance of learning online (on the real system in operation) while keeping the SLA violation rate under an acceptable level during the learning process.
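To make the role of the kernel-based classifier concrete, the following is a minimal, self-contained sketch of the general idea, not the paper's implementation: a kernel-based classifier (here an RBF-kernel SVM from scikit-learn) is trained to predict whether a candidate allocation will fulfill the SLA, and a confidence threshold acts as a simple self-assessment-style guard during action selection. The two features (offered load and allocated fraction), the toy SLA rule, the pick_allocation helper, and the 0.95 threshold are all assumptions introduced for illustration.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Synthetic history: (offered load, allocated resources) -> SLA met (1) / violated (0).
    loads = rng.uniform(0.1, 1.0, size=500)    # normalized offered load (assumed feature)
    allocs = rng.uniform(0.0, 1.0, size=500)   # fraction of time-frequency resources granted
    X = np.column_stack([loads, allocs])
    y = (allocs >= 0.8 * loads).astype(int)    # toy SLA rule, purely illustrative

    # Kernel-based classifier predicting SLA fulfillment for a (load, allocation) pair.
    clf = SVC(kernel="rbf", probability=True).fit(X, y)

    def pick_allocation(load, candidates, confidence=0.95):
        """Return the smallest candidate allocation whose predicted probability
        of SLA fulfillment exceeds the confidence threshold; fall back to the
        largest candidate if none qualifies (a conservative guard)."""
        for a in sorted(candidates):
            p_ok = clf.predict_proba([[load, a]])[0, 1]  # prob. of class 1 (SLA met)
            if p_ok >= confidence:
                return a
        return max(candidates)

    print(pick_allocation(0.6, np.linspace(0.1, 1.0, 10)))

In a scheme of this kind, the predictor would be refit online as new (allocation, SLA outcome) observations arrive, so that the controller can explore cheaper allocations only when the model is sufficiently confident they remain SLA-compliant; the sketch shows only the "smallest allocation predicted to comply" selection step.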