Production time of stroke gestures is a fundamental measure of user performance with Graphical User Interfaces. However, production time is an aggregate quantification of the user's gesture articulation process and therefore provides an incomplete picture of that process. Moreover, previous approaches treated stroke gestures as synchronous point sequences, whereas most gesture-driven applications have to deal with asynchronous point sequences. Furthermore, deep generative models of human handwriting ignore temporal information, thereby missing a key component of the user's gesture articulation process. To address these issues, we introduce DITTO, a sequence-to-sequence deep learning model that estimates the velocity profile of any stroke gesture using spatial information only, thus providing a fine-grained estimation of the moment-by-moment behavior of the user's articulation performance. We show that this unique capability makes DITTO remarkably accurate while handling gestures of any type: unistrokes, multistrokes, and multitouch gestures. Our model, code, and associated web application are available as open-source software.
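
To illustrate the kind of mapping described above (a sequence-to-sequence model that predicts a velocity profile from spatial information only), the following is a minimal sketch, not DITTO's actual implementation: the bidirectional GRU, layer sizes, and the `VelocityEstimator` name are illustrative assumptions.

```python
# Minimal sketch (illustrative only): a sequence-to-sequence network that maps
# a stroke gesture, given as (x, y) points, to a per-point velocity estimate.
# The bidirectional GRU and layer sizes are assumptions, not DITTO's design.
import torch
import torch.nn as nn

class VelocityEstimator(nn.Module):
    def __init__(self, hidden_size: int = 64):
        super().__init__()
        # Encode the spatial point sequence in both directions.
        self.rnn = nn.GRU(input_size=2, hidden_size=hidden_size,
                          batch_first=True, bidirectional=True)
        # Map each time step's hidden state to a scalar velocity.
        self.head = nn.Linear(2 * hidden_size, 1)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, sequence_length, 2) gesture coordinates
        hidden, _ = self.rnn(points)
        return self.head(hidden).squeeze(-1)  # (batch, sequence_length)

# Usage: predict a velocity profile for one gesture of 128 points.
model = VelocityEstimator()
gesture = torch.rand(1, 128, 2)       # normalized (x, y) coordinates
velocity_profile = model(gesture)     # one velocity value per point
```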