A sparse channel state information (CSI) estimation model is proposed for reducing the pilot overhead of orthogonal time frequency space (OTFS) modulation-aided multiple-input multiple-output (MIMO) systems. Explicitly, the pilots are transmitted directly over the time-frequency (TF)-domain grid for estimating the delay-Doppler (DD)-domain CSI, which reduces the pilot overhead, training duration and pre-processing complexity. Furthermore, the model completely avoids placing multiple DD-domain guard intervals, one for each transmit antenna, within the same OTFS frame, while keeping the training duration flexible, hence increasing the bandwidth efficiency. A unique benefit of the proposed CSI estimation model is that it can also efficiently handle fractional Doppler shifts. The resultant DD-domain CSI is simultaneously row and group (RG)-sparse. To exploit this compelling property, an orthogonal matching pursuit (OMP)-based RG-OMP technique is developed, complemented by an enhanced Bayesian learning (BL)-based RG-BL framework, both of which substantially outperform the state-of-the-art methods. Furthermore, low-complexity linear detectors are designed for the ensuing data detection phase, which directly employ the estimated DD-domain sparse CSI, without assuming any further knowledge concerning the number of dominant multipath components. Finally, simulation results are provided to demonstrate the performance improvement of the proposed BL-based schemes over both the OMP-based and the state-of-the-art schemes.
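To make the notion of row sparsity concrete, the following is a minimal sketch of generic simultaneous OMP, which recovers a matrix with only a few non-zero rows from linear measurements. It is not the RG-OMP technique proposed here: it ignores the group structure entirely, and the names `simultaneous_omp`, `Phi`, `Y`, `max_rows` and `tol` are hypothetical, chosen purely for illustration under the assumed model Y = Phi H.

```python
import numpy as np

def simultaneous_omp(Phi, Y, max_rows, tol=1e-6):
    """Generic simultaneous OMP for the assumed model Y = Phi @ H, H row-sparse.

    Illustrative only: exploits row sparsity, not the group sparsity
    that the RG-OMP technique of this paper additionally relies on.
    """
    n_atoms = Phi.shape[1]
    n_cols = Y.shape[1]
    residual = Y.copy()
    support = []                                   # indices of selected rows
    coeffs = np.zeros((0, n_cols), dtype=complex)  # least-squares row values
    for _ in range(max_rows):
        # Correlate every dictionary column with the residual and pool
        # the energy across all measurement columns (row-wise l2 norm).
        scores = np.linalg.norm(Phi.conj().T @ residual, axis=1)
        if support:
            scores[support] = 0.0                  # never reselect a row
        support.append(int(np.argmax(scores)))
        # Jointly refit all selected rows by least squares.
        coeffs, *_ = np.linalg.lstsq(Phi[:, support], Y, rcond=None)
        residual = Y - Phi[:, support] @ coeffs
        if np.linalg.norm(residual) <= tol * np.linalg.norm(Y):
            break
    H_hat = np.zeros((n_atoms, n_cols), dtype=complex)
    H_hat[support, :] = coeffs
    return H_hat, sorted(support)
```

Calling `simultaneous_omp(Phi, Y, max_rows=K)` returns the estimate together with the detected row support. Here `K` is an assumed upper bound on the number of non-zero rows, which a Bayesian learning approach would typically not require.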