Vision and audition have complementary affinities, with vision excelling in spatial resolution and audition excelling in temporal resolution. Here, we investigate the relationships among visual and auditory modalities and spatial and temporal short-term memory (STM) using change detection tasks. We created short sequences of visual or auditory items, such that each item within a sequence arose at a unique spatial location at a unique time. On each trial, two successive sequences were presented; subjects attended to either space (the sequence of locations) or time (the sequence of inter-item intervals) and reported whether the patterns of locations or intervals were identical. Each subject completed blocks of unimodal trials (both sequences presented in the same modality) and crossmodal trials (sequence 1 visual and sequence 2 auditory, or vice versa) for both spatial and temporal tasks. We found a strong interaction between modality and task: spatial performance was best on unimodal visual trials, while temporal performance was best on unimodal auditory trials. The order of modalities on crossmodal trials also mattered, suggesting that perceptual fidelity at encoding is critical to STM. Importantly, there was no cost attributable to crossmodal comparison: in both tasks, performance on crossmodal trials was as good as or better than on the weaker unimodal trials. STM representations of space and time can guide change detection in either the visual or the auditory modality, suggesting that temporal or spatial organization of STM may supersede sensory-specific organization.