The ability to visually discriminate and recognize materials (e.g., candy or crystal) is crucial for planning actions (e.g., determining edibility) in the environment. Language, in turn, is a powerful channel for communicating the characteristics of materials. However, the connection between these two modalities in material perception has remained elusive. Here, we created a diverse set of perceptually convincing material appearances with deep generative networks that model the statistical structure of real-world photos. Beyond generating familiar materials (e.g., soaps, rocks, and squishy toys), we also synthesized unfamiliar, novel materials by cross-category morphing (e.g., transforming a soap into a rock). With stimuli sampled from this expansive space, we compared the representations of materials derived from two cognitive tasks: visual material similarity judgments and verbal descriptions. Our analysis revealed a moderate but significant correlation between vision and language within individuals. In contrast, the image-based representation derived from the latent code of the generative model correlated only weakly with human visual judgments. Furthermore, we showed that combining image- and semantic-level representations could improve the prediction of human perception. Material perception may actively involve a high-level understanding of materials and objects to resolve the ambiguity of visual information, and it cannot be explained merely by low- to mid-level image features. Together, our findings highlight the need to combine semantic descriptions with visual feature extraction to uncover the critical dimensions of material perception and to improve computer vision models for material-related tasks.
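
To make the representational comparison concrete, the following is a minimal sketch, not the authors' actual analysis pipeline, of how image-level representations (generative-model latent codes), semantic-level representations (embeddings of verbal descriptions), and human similarity judgments could be related. All inputs here are hypothetical random placeholders; the dimensions, distance metric, and regression setup are illustrative assumptions.

```python
# Illustrative sketch (placeholder data, not the paper's code): compare
# image- and semantic-level representations against human dissimilarity
# judgments via RDM correlations, then test a joint linear combination.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_stimuli, d_img, d_sem = 72, 512, 300                  # toy sizes
latent_codes = rng.normal(size=(n_stimuli, d_img))      # placeholder generative latents
semantic_embeds = rng.normal(size=(n_stimuli, d_sem))   # placeholder description embeddings
human_dissim = squareform(rng.random(n_stimuli * (n_stimuli - 1) // 2))  # placeholder judgments

def rdm(features):
    """Condensed representational dissimilarity matrix (correlation distance)."""
    return pdist(features, metric="correlation")

human_vec = squareform(human_dissim)   # upper triangle of the human dissimilarity matrix
img_vec = rdm(latent_codes)
sem_vec = rdm(semantic_embeds)

# Representational similarity analysis: each representation alone vs. human judgments.
print("image-only    rho:", spearmanr(img_vec, human_vec).correlation)
print("semantic-only rho:", spearmanr(sem_vec, human_vec).correlation)

# Joint model: regress human dissimilarities on both RDMs and compare the
# cross-validated fit against the single-representation models.
for name, X in [("image", img_vec[:, None]),
                ("semantic", sem_vec[:, None]),
                ("joint", np.column_stack([img_vec, sem_vec]))]:
    r2 = cross_val_score(LinearRegression(), X, human_vec, cv=5).mean()
    print(f"{name:9s} cross-validated R^2: {r2:.3f}")
```

Spearman correlations on condensed RDMs and a cross-validated linear combination are only one reasonable way to operationalize such a comparison; other distance metrics or regression schemes could be substituted without changing the overall logic.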