Classification of features in a scene typically requires conversion of the incoming photonic field into the electronic domain. Recently, an alternative approach has emerged whereby passive structured materials can perform classification tasks by directly using freespace propagation and diffraction of light. In this manuscript, we present a theoretical and computational study of such systems and establish the basic features that govern their performance. We show that system architecture, material structure, and input light field are intertwined and need to be co-designed to maximize classification accuracy. Our simulations show that a single layer metasurface can achieve classification accuracy better than conventional linear classifiers, with an order of magnitude fewer diffractive features than previously reported. from multiple layer designs. Finally, we show a trade-off between the number of apertures and fabrication noise.