Increasing the energy efficiency of deep learning systems is critical for improving the cognitive capability of edge devices, which are often battery operated, as well as for data centers, which are constrained by their total power envelope. Specialized architectures accelerated by analog vector-matrix multipliers (VMMs) can reduce the energy per operation by orders of magnitude, since the reduced precision of analog computation does not undermine the classification accuracy of the neural network. We present an analog vector-matrix multiplier fabricated in an industry-standard 0.18 µm CMOS process, exploiting a single-transistor non-volatile analog memory cell and dedicated technology-circuit co-design. The design targets implementation in neural networks trained offline. The VMM performs the analog multiplication of a vector of inputs, encoded in the duration of time pulses, by a matrix of weights, encoded in the programmable currents of the memory cells. A 1.72 µm² memory cell is realized with a single floating-gate transistor, which can be operated as a two-terminal analog memristive device with more than 64 programmable current levels and a high I_high/I_low ratio (> 10³), tuned by the charge injected into the floating gate. A small-area charge amplifier converts the result of the multiply-and-accumulate operation into a voltage. System-level projections based on our measurements and simulations yield a throughput of 333.17 GOps/s and an energy efficiency of 122.3 TOps/J, higher than comparable-precision VMMs reported in the literature, and an equivalent area per cell down to 2.15 µm², lower than any similar state-of-the-art solution. Of critical importance for translation to industry, our proposal uses an industry-standard, low-cost, single-poly CMOS process flow in a new way.
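
To make the multiply-and-accumulate principle concrete, the following is a minimal behavioral sketch, not the authors' implementation: it assumes each input is a pulse of duration t_i, each weight is a programmed cell current I_ij, and a charge amplifier with an assumed feedback capacitance C_F converts the accumulated charge Q_j = Σ_i I_ij · t_i of column j into a voltage V_j = Q_j / C_F. All names and numerical values below are illustrative assumptions.

```python
import numpy as np

# Assumed feedback capacitance of the charge amplifier (hypothetical value).
C_F = 100e-15  # 100 fF

def analog_vmm(t_pulses, I_cells, c_f=C_F):
    """Ideal model of the time-pulse x current-mode VMM.

    t_pulses : (N,) array of input pulse durations in seconds.
    I_cells  : (N, M) matrix of programmed memory-cell currents in amperes.
    Returns the (M,) vector of charge-amplifier output voltages.
    """
    q = I_cells.T @ t_pulses  # charge accumulated per column: Q_j = sum_i I_ij * t_i
    return q / c_f            # charge-to-voltage conversion: V_j = Q_j / C_F

# Example: 4 inputs and a 4x3 weight matrix whose currents span roughly the
# >10^3 I_high/I_low ratio reported for the memory cell.
t = np.array([1e-6, 0.5e-6, 2e-6, 0.0])
I = np.random.uniform(1e-9, 1e-6, size=(4, 3))
print(analog_vmm(t, I))
```

This idealized model ignores non-idealities such as amplifier offset, cell current noise, and programming error, which in the measured system bound the effective precision of the analog computation.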