Background: The worldwide adoption of minimally invasive surgery (MIS) is hindered by its indirect observation and manipulation, and monitoring the surgical instruments moving inside the patient's body, as required by surgeons, remains a challenging problem. Vision-based tracking of surgical instruments is particularly attractive because it can be implemented flexibly in software, with no need to modify the instruments or the surgical workflow.

Methods: An MIS instrument is conventionally divided into a shaft portion and an end-effector portion. A 2D/3D tracking-by-detection framework is proposed that tracks the shaft first and the end-effector second. The shaft is described by line features extracted with a RANSAC scheme, while the end-effector is characterized by learned image features produced by a well-trained convolutional neural network.

Results: The method is verified in both its 2D and 3D formulations through experiments on ex-vivo video sequences, and it is validated qualitatively on in-vivo video sequences.

Conclusion: The proposed method provides robust and accurate tracking, as confirmed by the experimental results: its 3D performance on ex-vivo video sequences exceeds that of the available state-of-the-art methods. Moreover, the experiments on in-vivo sequences demonstrate that the method can handle the difficult condition of tracking with unknown camera parameters. Future refinements will address occlusion handling and multi-instrument MIS applications.
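
To illustrate the shaft-tracking step, the sketch below shows a generic RANSAC line fit over candidate edge points, of the kind the abstract alludes to. It is a minimal illustration, not the paper's implementation: the function name fit_line_ransac and the parameters n_iters and inlier_tol are assumptions chosen for the example, and the paper's actual sampling and refinement details may differ.

```python
import numpy as np

def fit_line_ransac(points, n_iters=200, inlier_tol=2.0, rng=None):
    """Fit a 2D line to `points` (N x 2 array) with RANSAC.

    Returns (point_on_line, unit_direction, inlier_mask).
    """
    rng = np.random.default_rng(rng)
    best_inliers = None
    best_count = 0
    for _ in range(n_iters):
        # Hypothesize a line from two distinct sample points.
        i, j = rng.choice(len(points), size=2, replace=False)
        p, q = points[i], points[j]
        d = q - p
        norm = np.linalg.norm(d)
        if norm < 1e-9:
            continue  # degenerate sample, skip
        d = d / norm
        # Perpendicular distance of every point to the hypothesized line
        # (magnitude of the 2D cross product with the unit direction).
        diff = points - p
        dist = np.abs(diff[:, 0] * d[1] - diff[:, 1] * d[0])
        inliers = dist < inlier_tol
        count = inliers.sum()
        if count > best_count:
            best_count, best_inliers = count, inliers
    # Refine: least-squares line through the consensus set via SVD (PCA);
    # the first principal axis of the inliers is the line direction.
    pts = points[best_inliers]
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid)
    return centroid, vt[0], best_inliers
```

In a shaft-tracking pipeline of this kind, `points` would typically come from an edge detector applied to the instrument region, and the recovered line (plus the inlier mask) would constrain the search region for the subsequent end-effector detection.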