We assume that substances in the world are represented by two types of concepts, namely substance concepts and classification concepts, the former instrumental to (visual) perception, the latter to (language-based) classification. Based on this distinction, we introduce a general methodology for building lexico-semantic hierarchies of substance concepts, where nodes are annotated with the media, e.g., videos or photos, from which substance concepts are extracted, and are associated with the corresponding classification concepts. The methodology is based on Ranganathan's original faceted approach, contextualized to the problem of classifying substance concepts. The key novelty is that the hierarchy is built by exploiting the visual properties of substance concepts, while the linguistically defined properties of classification concepts are used only to describe substance concepts. The validity of the approach is illustrated by highlights from an ongoing project whose goal is to build a large-scale multimedia multilingual concept hierarchy.