We discuss work extraction from classical information engines (e.g., Szilárd) with N -particles, q partitions, and initial arbitrary non-equilibrium states. In particular, we focus on their optimal behaviour, which includes the measurement of a set of quantities Φ with a feedback protocol that extracts the maximal average amount of work. We show that the optimal non-equilibrium state to which the engine should be driven before the measurement is given by the normalised maximum-likelihood probability distribution of a statistical model that admits Φ as sufficient statistics. Furthermore, we show that the minimax universal code redundancy R ⋆ associated to this model, provides an upper bound to the work that the demon can extract on average from the cycle, in units of k B T . We also find that, in the limit of N large, the maximum average extracted work cannot exceed H[Φ]/2, i.e. one half times the Shannon entropy of the measurement. Our results establish a connection between optimal work extraction in stochastic thermodynamics and optimal universal data compression, providing design principles for optimal information engines. In particular, they suggest that: (i) optimal coding is thermodynamically efficient, and (ii) it is essential to drive the system into a critical state in order to achieve optimal performance.