In this paper we offer a theory of cross-modal objects. To begin, we discuss two kinds of linkages between vision and audition. The first is a duality. The visual system detects and identifies surfaces; the auditory system detects and identifies sources. Surfaces are illuminated by sources of light; sound is reflected off surfaces. However, the visual system discounts sources and the auditory system discounts surfaces. These and similar considerations lead to the Theory of Indispensable Attributes, which states the conditions for the formation of gestalts in the two modalities. The second linkage involves the formation of audiovisual objects, that is, integrated cross-modal experiences. We describe research that reveals the role of cross-modal causality in the formation of such objects. These experiments use the canonical example of a causal link between vision and audition: a visible impact that causes a percussive sound.

[A fire is] a terrestrial event with flames and fuel. It is a source of four kinds of stimulation, since it gives off sound, odor, heat and light . . . . One can hear it, smell it, feel it, and see it, or get any combination of these detections, and thereby perceive a fire . . . . For this event, the four kinds of stimulus information and the four perceptual systems are equivalent.