In recent years, many tools and algorithms for model comparison and differencing have been proposed. Typically, the main focus of this research has been on being able to compute a difference in the first place; only very few papers sufficiently address the quality of the delivered differences. This is a general shortcoming of the state of the art. Currently, there are no established community standards for assessing the quality of differences. Consequently, it is neither possible to compare the quality of different algorithms, nor can developers decide whether an algorithm produces adequate results in a given application scenario. We propose a parallel working session to discuss this general problem and its implications. The goal of the working session is to reach a common understanding of the crucial factors in assessing the quality of differences. Furthermore, we plan to discuss possible solutions that help the research community as a whole, e.g. by drafting the design of an initial benchmark corpus which could later be turned into a standardized, openly available benchmark set.
MOTIVATION

Many different algorithms for model differencing have been proposed in recent years; surveys can be found in [5,6,7]. Typically, the main emphasis of these approaches is on the algorithms used to compute the difference between two revisions of a model. The evaluation of the algorithms, if conducted at all, is usually based only on sets of a few small and sometimes even specially created test models. Obviously, such an evaluation is not enough to assess the quality of the algorithms objectively, nor does it allow the quality of different algorithms to be compared. Hence, developers who must choose a model differencing engine in their day-to-day work cannot make an informed decision about which tool fits their needs best.

The reasons why very few papers proposing model differencing algorithms sufficiently address the quality of the delivered differences are manifold:

• Ultimately, the quality of a difference can only be assessed in the context of the use case in which the difference is used. If the difference between two models is to be displayed to developers, then understandability and compactness [2] are highly relevant; in the context of merging, the avoidance of merge conflicts is important. If a difference is to be used for internal delta storage or for batch patching of models, none of the above properties matter and it is sufficient to simply store the smallest representation of the difference.

• Properties such as quality and understandability are not generally defined and must be refined further for each specific model type and paradigm.

• Test models whose evolution is sufficiently known and documented are not available for many domains. While model generators [8,9] can be used to create realistic test models synthetically, configuring these tools is time-intensive and requires in-depth knowledge of the edit processes of a given domain.

• Some generic algorithms ma...