In this paper, we present a procedure for evaluating and comparing multiple NetFlow-based network anomaly detection (NF-NAD) systems in terms of detection accuracy and mean time to detection. Conventionally, NF-NAD systems have been evaluated against different variations of benign (normal) traffic. Here we present a methodology in which the benign traffic is held constant for the entire experiment and the synthetic malicious traffic is varied instead. A two-phase approach is used to measure the accuracy and the learning capability of each NF-NAD system. We illustrate the methodology with a designed experiment defined by factors, levels, and design points.
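As a rough illustration of what a designed experiment with factors, levels, and design points can look like, the following minimal sketch enumerates a full-factorial design and summarizes detection results. Everything in it is an assumption made for illustration: the factor names and levels, the `design_points` and `summarize` helpers, the sample run data, and the simplified reading of accuracy as the fraction of malicious runs that were flagged. The actual factors, levels, and metric definitions used in the experiment may differ.

```python
from itertools import product
from statistics import mean

# Hypothetical factors and levels describing the synthetic malicious traffic.
FACTORS = {
    "attack_type": ["port_scan", "syn_flood"],
    "intensity": ["low", "high"],
    "duration_s": [60, 300],
}

def design_points(factors):
    """Return every combination of factor levels (full-factorial design)."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]

def summarize(runs):
    """Summarize runs given as (detected, delay_in_seconds) pairs.

    Returns (accuracy, mean_delay), where accuracy is the fraction of
    malicious runs that were detected (a simplifying assumption) and
    mean_delay averages the delay over detected runs only.
    """
    delays = [delay for detected, delay in runs if detected]
    accuracy = len(delays) / len(runs)
    mean_delay = mean(delays) if delays else None
    return accuracy, mean_delay

points = design_points(FACTORS)

# Made-up results for one NF-NAD system: three detections and one miss.
example_runs = [(True, 12.0), (True, 18.0), (False, None), (True, 30.0)]
accuracy, mean_delay = summarize(example_runs)
```

With two levels for each of three factors, `design_points` yields eight design points. Running the same design points against each NF-NAD system, with the benign traffic unchanged, would give directly comparable accuracy and delay figures.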