Embodied Conversational Agents (ECA) take on different forms, including virtual avatars or physical agents, such as a humanoid robot. ECAs are often designed to produce nonverbal behaviour to complement or enhance its verbal communication. One form of nonverbal behaviour is co-speech gesturing, which involves movements that the agent makes with its arms and hands that is paired with verbal communication. Co-speech gestures for ECAs can be created using different generation methods, such as rule-based and data-driven processes. However, reports on gesture generation methods use a variety of evaluation measures, which hinders comparison. To address this, we conducted a systematic review on co-speech gesture generation methods for iconic, metaphoric, deictic or beat gestures, including their evaluation methods. We reviewed 22 studies that had an ECA with a human-like upper body that used co-speech gesturing in a social human-agent interaction, including a user study to evaluate its performance. We found most studies used a within-subject design and relied on a form of subjective evaluation, but lacked a systematic approach. Overall, methodological quality was low-to-moderate and few systematic conclusions could be drawn. We argue that the field requires rigorous and uniform tools for the evaluation of co-speech gesture systems. We have proposed recommendations for future empirical evaluation, including standardised phrases and test scenarios to test generative models. We have proposed a research checklist that can be used to report relevant information for the evaluation of generative models as well as to evaluate co-speech gesture use.