High scalability and low operating cost make black-box protocol fuzzing a vital tool for discovering vulnerabilities in the firmware of IoT smart devices. However, it is still challenging to compare black-box protocol fuzzers due to the lack of unified benchmark firmware images, complete fuzzing mutation seeds, comprehensive performance metrics, and a standardized evaluation framework. In this paper, we design and implement IoTFuzzBench, a scalable, modular, metric-driven automation framework for evaluating black-box protocol fuzzers for IoT smart devices comprehensively and quantitatively. Specifically, IoTFuzzBench has so far included 14 real-world benchmark firmware images, 30 verified real-world benchmark vulnerabilities, complete fuzzing seeds for each vulnerability, 7 popular fuzzers, and 5 categories of complementary performance metrics. We deployed IoTFuzzBench and evaluated 7 popular black-box protocol fuzzers on all benchmark firmware images and benchmark vulnerabilities. The experimental results show that IoTFuzzBench can not only provide fast, reliable, and reproducible experiments, but also effectively evaluate the ability of each fuzzer to find vulnerabilities and the differential performance on different performance metrics. The fuzzers found a total of 13 vulnerabilities out of 30. None of these fuzzers can outperform the others on all metrics. This result demonstrates the importance of comprehensive metrics. We hope our findings ease the burden of fuzzing evaluation in IoT scenarios, advancing more pragmatic and reproducible fuzzer benchmarking efforts.