Multi-device systems of cameras and various depth sensors are widely used in industry today. Some sensors can operate well in conditions where others cannot (e.g., active laser sensors compared to cameras). Multiple views and sensors of different modalities reveal valuable information about the environment, which is crucial for the robust operation of, e.g., detection or decision-making algorithms. Sensor data fusion, calibration, and multiple-view geometry are the central topics discussed in this dissertation.

Sensors are usually placed in a common frame of reference determined by intrinsic and other parameters describing sensor alignment. During calibration, all of these parameters can be estimated and tuned based on correspondences between sensor views. Mainstream computer vision methods solving geometric tasks take corresponding points across views as input. Based on the correspondences, the underlying geometry of the views (i.e., the epipolar geometry) can be estimated; as a next step, the 3D structure of the scene can also be determined. Corresponding image points are established as the extracted centers of dominant image regions. However, the shape and orientation of the regions also carry useful information, as they are related to the underlying surface. Only a minority of the computer vision community tries to exploit this inter-region relation, mostly constraining their approaches to the basic pinhole camera model. The first-order approximation of a region correspondence is called an Affine Correspondence (AC).

Sensor fusion is also an important part of modern systems that observe and analyze the environment. Sensors with distinct views and different modalities (e.g., depth and color images) complement each other. A low-resolution Time-of-Flight (ToF) depth camera image can be supersampled using a high-resolution color image of the same view as guidance, while the depth data may provide, e.g., an a posteriori option for refocusing the color view.

This thesis provides a thorough investigation of ACs, in both the theoretical and the algorithmic sense, to obtain rapid, more robust, and high-quality geometric model estimation in two- and multiple-view cases. It is demonstrated that ACs are usable in general scenarios where real-world cameras and diverse geometry complicate the task. Next, the data-level sensor fusion of high-resolution color and lower-resolution ToF depth cameras is investigated for single pairs of frames and also for video sequences. Finally, the calibration of multi-sensor systems that include LiDARs and cameras (wide-angle, fisheye optics, etc.) is discussed.

I would like to express my gratitude to my colleagues at the Institute for Computer Science and Control (SZTAKI) and at Eötvös Loránd University (ELTE), whom I had the pleasure to work with; especially to my advisor, Dmitry Chetverikov, for introducing me to image processing and computer vision, for guiding me throughout my Ph.D. studies and research, and for all the support, knowledge, and critical view on research he shared with me. I would like to tha...