Semantic information about objects, events, and scenes influences how humans perceive, interact with, and navigate the world. Most evidence for semantic influences on cognition comes from research conducted within a single, isolated modality (e.g., vision or audition). The influence of semantic information in multisensory environments, however, has not been studied extensively, potentially because semantic relatedness is difficult to quantify. Past studies have primarily relied either on a simplified binary classification of semantic relatedness based on category membership or on algorithmic values derived from text corpora rather than from human perceptual experience and judgment. To accelerate research into multisensory semantics, we created a constrained audiovisual stimulus set and derived similarity ratings between items within three categories (animals, instruments, household items). A total of 140 participants provided similarity judgments between sounds and images. Participants either heard a sound (e.g., a meow) and judged which of two pictured objects (e.g., a dog or a duck) it was more similar to, or saw a picture (e.g., of a duck) and selected which of two sounds (e.g., a bark or a meow) it was more similar to. These judgments were then used to calculate a similarity value for any given cross-modal pair. The derived similarity values span a range of semantic relatedness across the three categories and their items, and highlight both commonalities and differences in similarity judgments between modalities. We make the derived similarity values available to the research community in a database format, to be used as a measure of semantic relatedness in cognitive psychology experiments, enabling more robust studies of semantics in audiovisual environments.
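The abstract does not specify how the forced-choice judgments were aggregated into similarity values; as an illustration only, the minimal sketch below shows one common approach, assuming that each cross-modal pair's similarity is taken as the proportion of two-alternative forced-choice trials on which that pair was selected over the alternative. All item labels, field names, and the aggregation rule itself are hypothetical, not the authors' reported procedure.

```python
# Hypothetical sketch: aggregating two-alternative forced-choice (2AFC) trials
# into cross-modal similarity scores. The paper's exact aggregation method is
# not stated in the abstract; this simply computes, for each sound-image pair,
# the proportion of presentations on which that pair was chosen.
from collections import defaultdict

# Each trial: a probe (e.g., a sound), the chosen item, and the rejected item
# (e.g., the two candidate images). Labels here are illustrative only.
trials = [
    {"probe": "meow",  "chosen": "cat_img",  "rejected": "duck_img"},
    {"probe": "meow",  "chosen": "cat_img",  "rejected": "dog_img"},
    {"probe": "quack", "chosen": "duck_img", "rejected": "cat_img"},
]

counts = defaultdict(lambda: {"chosen": 0, "presented": 0})
for t in trials:
    counts[(t["probe"], t["chosen"])]["chosen"] += 1
    counts[(t["probe"], t["chosen"])]["presented"] += 1
    counts[(t["probe"], t["rejected"])]["presented"] += 1

# Similarity of a cross-modal pair = proportion of its presentations that were chosen.
similarity = {pair: c["chosen"] / c["presented"] for pair, c in counts.items()}

for (probe, item), value in sorted(similarity.items()):
    print(f"sim({probe}, {item}) = {value:.2f}")
```

Other aggregation schemes (e.g., fitting a Bradley-Terry or MDS model to the choice data) would also be compatible with this trial structure; the choice-proportion rule above is only the simplest option.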