This work addresses the problem of multi-task object detection in an efficient, generic, and yet simple way, building on recent and highly promising studies in computer vision, and more specifically on the Region-based CNN (R-CNN) approach. A flow-enhanced methodology for object detection is proposed, in which a new branch is added to predict an object-level flow field. Following a scheme grounded in neuroscience, a pseudo-temporal motion stream is integrated in parallel with the classification, bounding box regression, and segmentation mask prediction branches of Mask R-CNN. Extensive experiments and a thorough comparative evaluation provide a detailed analysis of the problem at hand and demonstrate the added value of the object-level flow branch. The proposed approach achieves improved performance on the six currently broadest and most challenging publicly available semantic urban scene understanding datasets, surpassing the region-based baseline method.
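To make the described architecture concrete, the sketch below (not the authors' code; all layer sizes, names, and the flow-head design are illustrative assumptions) shows how an object-level flow head could sit in parallel with Mask R-CNN's classification, box regression, and mask heads over shared RoI-aligned features.

```python
# Hedged sketch of the parallel-branch layout described above.
# Assumptions: 256-channel 14x14 RoI features, a 2-channel (dx, dy) flow
# field per object; none of these values come from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowBranch(nn.Module):
    """Hypothetical pseudo-temporal motion branch: maps RoI features
    (C x H x W) to a per-object flow field (2 x 2H x 2W)."""
    def __init__(self, in_channels=256, hidden=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.upsample = nn.ConvTranspose2d(hidden, hidden, 2, stride=2)
        self.flow_pred = nn.Conv2d(hidden, 2, 1)  # (dx, dy) per pixel

    def forward(self, roi_feats):
        x = self.convs(roi_feats)
        x = F.relu(self.upsample(x))
        return self.flow_pred(x)

class MultiTaskHeads(nn.Module):
    """Illustrative parallel heads over shared RoI features, mirroring the
    Mask R-CNN layout plus the added object-level flow branch."""
    def __init__(self, in_channels=256, num_classes=81, roi_size=14):
        super().__init__()
        flat = in_channels * roi_size * roi_size
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(flat, 1024), nn.ReLU(inplace=True))
        self.cls_score = nn.Linear(1024, num_classes)        # classification
        self.bbox_pred = nn.Linear(1024, num_classes * 4)    # box regression
        self.mask_head = nn.Sequential(                      # segmentation masks
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),
        )
        self.flow_branch = FlowBranch(in_channels)           # added flow branch

    def forward(self, roi_feats):
        shared = self.fc(roi_feats)
        return {
            "class_logits": self.cls_score(shared),
            "box_deltas": self.bbox_pred(shared),
            "masks": self.mask_head(roi_feats),
            "flow": self.flow_branch(roi_feats),
        }

if __name__ == "__main__":
    heads = MultiTaskHeads()
    rois = torch.randn(8, 256, 14, 14)  # 8 RoI-aligned feature maps
    out = heads(rois)
    print({k: tuple(v.shape) for k, v in out.items()})
```

The key design point conveyed by this sketch is that the flow branch consumes the same RoI features as the other heads and adds only a lightweight sub-network, so the motion cue is learned jointly with detection and segmentation rather than in a separate pipeline.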