The field of computational creativity, including musical metacreation, strives to develop artificial systems that are capable of demonstrating creative behavior or producing creative artefacts. But the claim of creativity is often assessed, subjectively only on the part of the researcher, and not objectively at all. This paper provides theoretical motivation for more systematic evaluation of musical metacreation and computationally creative systems, and presents an overview of current methods used to assess human and machine creativity that may be adapted for this purpose. In order to highlight the need for a varied set of evaluation tools, a distinction is drawn between three types of creative systems: those which are purely generative; those which contain internal or external feedback; and those which are capable of reflection and self-reflection. To address the evaluation of each of these aspects, concrete examples of methods and techniques are suggested to help researchers 1) evaluate their systems' creative process and generated artefacts, and test their impact on the perceptual, cognitive, and affective states of the audience, and 2) build mechanisms for reflection into the creative system, including models of human perception and cognition, to endow creative systems with internal evaluative mechanisms to drive self-reflective processes. The first type of evaluation can be considered external to the creative system, and may be employed by the researcher to both better understand the efficacy of their system and its impact, and to incorporate feedback into the system. Here, we take the stance that understanding human creativity can lend insight to computational approaches, and knowledge of how humans perceive creative systems and their output can be incorporated into artificial agents as feedback to provide a sense of how a creation will impact the audience. The second type centers around internal evaluation, in which the system is able to reason about its own behavior and generated output. We argue that creative behavior cannot occur without feedback and reflection by the creative/metacreative system itself. More rigorous empirical testing will allow computational and metacreative systems to become more creative by definition, and can be used to demonstrate the impact and novelty of particular approaches.