The integration of machine learning (ML) into model-driven engineering (MDE) has the potential to improve the efficiency of modelers and the quality of modeling tools. However, no consensus has yet been reached on which MDE tasks can derive substantial benefits from ML or on how progress in these tasks should be measured. This paper introduces ModelXGlue, a dedicated benchmarking framework that supports researchers in constructing benchmarks for evaluating ML approaches to MDE tasks. A benchmark is built by referencing datasets and ML models provided by other researchers, and by selecting an evaluation strategy and a set of metrics. ModelXGlue is designed with automation in mind, and each component runs in an isolated execution environment (a Docker container or a Python environment), which allows the execution of approaches implemented with diverse technologies such as Java, Python, and R. We used ModelXGlue to build reference benchmarks for three distinct MDE tasks: model classification, clustering, and feature name recommendation. To build these benchmarks, we integrated existing third-party approaches into ModelXGlue, which shows that ModelXGlue can accommodate heterogeneous ML models, MDE tasks, and technological requirements. Moreover, we have obtained, for the first time, comparable results for these tasks. Altogether, ModelXGlue emerges as a valuable tool for advancing the understanding and evaluation of ML tools within the context of MDE.