SUMMARY
Empirical systems research is facing a dilemma. Minor aspects of an experimental setup can have a significant impact on its associated performance measurements and potentially invalidate conclusions drawn from them. Examples of such influences, often called hidden factors, include binary link order, process environment size, compiler-generated randomized symbol names, and group scheduler assignments. The growth in complexity and size of modern systems will further aggravate this dilemma, especially given the time pressure to produce results. How can one trust any reported empirical analysis of a new idea or concept in computer science? DataMill is a community-based, service-oriented, open benchmarking infrastructure for rigorous performance evaluation. DataMill facilitates producing robust, reliable, and reproducible results. The infrastructure incorporates the latest results on hidden factors and automates the variation of these factors. DataMill is also of interest for research on performance evaluation. The infrastructure supports quantifying the effect of hidden factors and disseminating research results beyond mere reporting. It provides a platform for investigating interactions and composition of hidden factors. This paper discusses experience gained through creating and using an open benchmarking infrastructure. Multiple research groups participate in and have used DataMill. Furthermore, DataMill has been used for a performance competition at the International Conference on Runtime Verification (RV) 2014 and is currently hosting the RV 2015 competition. This paper includes a summary of our experience hosting the first RV competition.