Because non-factoid question answering shares commonalities with other tasks in which evaluation compares system-generated and human-generated texts, automatic metrics are commonly borrowed from those tasks. However, the degree to which widely used metrics produce valid rankings of question answering systems has yet to be thoroughly investigated, likely because reliable methods of human evaluation, needed to provide data for validating such metrics, have been lacking. In this paper, we first present a new method of human evaluation of non-factoid question answering systems that can be crowd-sourced cheaply and at very large scale. Second, we examine the reliability of this human evaluation approach and show that the system rankings it produces are highly reliable: in a self-replication experiment, the rankings correlate at 0.984. Finally, we employ the resulting human evaluation as a gold standard against which to assess the validity of a range of automatic metrics widely used for evaluating non-factoid question answering, including ROUGE-L, BLEU and Meteor. Results show that ROUGE-L correlates best with human judgments of non-factoid question answering, while metrics such as BLEU correspond poorly with human assessment. We therefore highlight both the feasibility of reporting human evaluation results more widely in the field, rather than metric scores alone, and the unsuitability of metrics such as BLEU for this task.
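
To illustrate the kind of system-level meta-evaluation described above, the minimal sketch below correlates per-system human-evaluation scores with per-system automatic-metric scores. The variable names and score values are hypothetical placeholders, not the paper's data or exact procedure.

```python
# Minimal sketch (hypothetical data): system-level correlation between
# human evaluation and an automatic metric such as ROUGE-L.
from scipy.stats import pearsonr, spearmanr

# One aggregate score per QA system, e.g. the mean over all test questions.
human_scores = [0.62, 0.48, 0.71, 0.55, 0.40]    # from crowd-sourced human evaluation
metric_scores = [0.31, 0.25, 0.36, 0.30, 0.22]   # from the automatic metric under study

# Pearson r measures linear agreement of the aggregate scores;
# Spearman rho compares the system rankings the two evaluations induce.
r, _ = pearsonr(human_scores, metric_scores)
rho, _ = spearmanr(human_scores, metric_scores)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

A metric that is a valid proxy for human evaluation should yield correlations close to 1 under this kind of comparison; the paper's reported 0.984 self-replication figure refers to correlating two independent runs of the human evaluation itself.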