Many hand-engineered approaches have been proposed over the years for the hard problem of indoor localization. However, specializing these solutions for edge cases remains challenging. Here we propose a solution with zero hand-engineered features, in which everything is learned directly from data. We use modality-specific neural architectures to extract preliminary features, which are then integrated by cross-modality neural network structures. We show that each modality-specific branch can estimate location with good accuracy on its own, but that fusing the features of these early modality-specific representations in a cross-modality network yields better accuracy. Our multimodal neural network, MM-Loc, is effective because it allows gradients to flow uniformly across modalities during training. Because it is a data-driven approach, it learns complex feature representations rather than relying on hand-engineered features.
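To make the branch-plus-fusion structure concrete, the following is a minimal PyTorch sketch, not the paper's actual architecture: it assumes two modalities with simple fully connected branches, and the input dimensions, layer sizes, and 2-D location output are all illustrative choices. What it shows is the key property described above: each modality-specific branch produces its own location estimate, while a cross-modality fusion head operates on the concatenated branch features, so a single backward pass propagates gradients uniformly through both branches.

```python
import torch
import torch.nn as nn

class MMLoc(nn.Module):
    """Illustrative two-branch multimodal localization network.

    The choice of two modalities, the fully connected branch design,
    and all dimensions are assumptions made for this sketch.
    """

    def __init__(self, dim_a=128, dim_b=256, hidden=64):
        super().__init__()
        # Modality-specific branches extract preliminary features.
        self.branch_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())
        self.branch_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())
        # Per-branch heads: each branch can regress a location on its own.
        self.head_a = nn.Linear(hidden, 2)
        self.head_b = nn.Linear(hidden, 2)
        # Cross-modality fusion over the concatenated branch features;
        # training through this head sends gradients into both branches.
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, x_a, x_b):
        f_a = self.branch_a(x_a)
        f_b = self.branch_b(x_b)
        # Return the fused estimate plus the per-modality estimates
        # (each branch alone gives a usable, if less accurate, location).
        fused = self.fusion(torch.cat([f_a, f_b], dim=-1))
        return fused, self.head_a(f_a), self.head_b(f_b)


# Example usage with random inputs of the assumed dimensions.
model = MMLoc()
loc_fused, loc_a, loc_b = model(torch.randn(8, 128), torch.randn(8, 256))
```

Training a single loss on the fused output (optionally combined with per-branch losses) is one way to realize the uniform gradient flow across modalities that the text attributes to MM-Loc's effectiveness.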