A human understands objects in an environment by integrating information obtained through the senses of sight, hearing, and touch. In such integration, active movement plays an important role. We propose a method for determining the correspondence of audiovisual events by handling an object. The method uses general grouping rules from Gestalt psychology, i.e., ``simultaneity'' and ``similarity'' among motor commands, sound onsets, and object motion in images.

The system comprises four components: motor, audio, visual, and integration (Fig. 1). In the motor part (Fig. 1 (1)), a computer sends motor commands to the manipulator to handle an unknown object; the object then emits a sound and changes its motion in the images. In the audio part (Fig. 1 (2)), a time series of sound onsets is detected, and onsets with similar spectra are grouped into the same sound source of the object. In the visual part (Fig. 1 (3)), moving-object regions and their movements are extracted; the motion loci of the object regions are calculated, and changes in object movement are recorded. In the integration part (Fig. 1 (4)), the sampling rates of the manipulator commands, audio signals, and camera images are converted to a common rate by re-sampling. Finally, we calculate correlations among the audio, visual, and motor signals, and events with high correlation are grouped together.

In the experiments, we used a microphone, a camera, and a robot with a hand manipulator. The robot grasps an object such as a bell and shakes it, or grasps a stick-like object and beats a drum with it. These motions are either periodic or non-periodic, so the object emits periodic or non-periodic events. To create a more realistic scenario, we placed another event source (a metronome) in the environment. We conducted two trials for each of 40
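As an illustration of the onset-detection step in the audio part, the following sketch flags an onset whenever a frame's short-time energy jumps well above that of the previous frame. The frame length, hop size, and threshold ratio here are our own assumptions for the sketch; the text does not specify a particular detector.

```python
import numpy as np

def detect_onsets(signal, sr, frame_len=512, hop=256, ratio=3.0):
    """Return onset times (in seconds) where short-time energy rises
    above `ratio` times the previous frame's energy.

    frame_len, hop, and ratio are illustrative values, not taken
    from the described system.
    """
    # Short-time energy per frame.
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(float(np.dot(frame, frame)))

    # An onset is a sudden energy jump relative to the previous frame.
    onsets = []
    for i in range(1, len(energies)):
        if energies[i] > ratio * max(energies[i - 1], 1e-12):
            onsets.append(i * hop / sr)
    return onsets
```

In the described system, the spectrum at each detected onset would then be compared across onsets, and onsets with similar spectra grouped into the same sound source.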
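The integration part's re-sampling and correlation steps could be sketched as follows: event times from each modality (motor commands, sound onsets, changes in object movement) are converted to binary series at a common rate, and signals whose normalized correlation is high are grouped together. The common rate and the binary event representation are illustrative assumptions.

```python
import numpy as np

def resample_events(event_times, duration, rate):
    """Convert a list of event times (s) into a binary series at the
    common sampling rate (the re-sampling step of the integration
    part, in simplified form)."""
    n = int(round(duration * rate))
    series = np.zeros(n)
    for t in event_times:
        idx = int(round(t * rate))
        if 0 <= idx < n:
            series[idx] = 1.0
    return series

def correlation(a, b):
    """Normalized correlation coefficient between two event series;
    values near 1 suggest the events belong to the same source."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```

For example, a motor series with shakes at 0.5, 1.0, and 1.5 s correlates strongly with a sound-onset series at the same times, while an independent metronome ticking at unrelated times yields a correlation near zero, so its events are not grouped with the robot's handling.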