In the Industry 4.0 era, video applications such as visual surveillance systems, video conferencing, and video broadcasting play a vital role. In these applications, where objects are manipulated and tracked in the decoded video, the quality of the decoded video should be consistent, since quality fluctuations degrade the performance of machine analysis. To address this problem, we propose a novel perceptual video coding (PVC) solution in which a full-reference quality metric, video multimethod assessment fusion (VMAF), is employed together with a deep convolutional neural network (CNN) to obtain consistent quality while still achieving high compression performance. First, to meet the consistent-quality requirement, we propose a CNN model that takes an expected VMAF score as input and adaptively adjusts the quantization parameter (QP) for each coding block. Then, to increase compression performance, the Lagrange multiplier of the rate-distortion optimization (RDO) mechanism is adaptively computed from rate-QP and quality-QP models. Experimental results show that the proposed PVC solution achieves both targets simultaneously: the quality of the video sequence is kept consistent at the expected quality level, and the proposed method outperforms traditional video coding standards and a relevant benchmark in compression, saving around 10% bit rate on average.
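The abstract does not specify the exact rate-QP and quality-QP models, but the adaptive Lagrange multiplier can be illustrated with the standard chain-rule relation lambda = -dD/dR. The sketch below assumes a commonly used exponential rate-QP model and a linear quality-QP model; all parameter values (a, b, d) are hypothetical placeholders, not values from the paper.

```python
import math

# Hypothetical model parameters (illustrative only, not from the paper).
A = 5000.0   # rate-model scale: R(QP) = A * exp(B * QP)
B = -0.12    # rate-model decay: bit rate drops as QP grows
D = -1.2     # quality-model slope: Q(QP) = 100 + D * QP (VMAF-like score)

def rate_model(qp: float) -> float:
    """Exponential rate-QP model: predicted bit rate at a given QP."""
    return A * math.exp(B * qp)

def quality_model(qp: float) -> float:
    """Linear quality-QP model: predicted VMAF-like score at a given QP."""
    return 100.0 + D * qp

def rdo_lambda(qp: float) -> float:
    """Adaptive RDO Lagrange multiplier: lambda = -dD/dR at the given QP,
    where distortion is taken as (100 - quality) so dDist/dQP = -D."""
    dR_dqp = A * B * math.exp(B * qp)   # derivative of the rate model (negative)
    dDist_dqp = -D                       # distortion rises as quality falls
    return -dDist_dqp / dR_dqp
```

Under these assumptions lambda grows with QP, matching the usual RDO behavior of weighting rate more heavily at lower bit rates.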