Through integrated sensors, wearable devices such as fitness trackers and smartwatches provide convenient interfaces for recording multimodal time-series data. Multimodal data collection allows recorded actions, exercises, or performances to be examined with respect to multiple concurrent aspects. This paper explores machine-learning-based classification of a dataset of audio-gestural violin recordings collected with a purpose-built smartwatch application. This interface allowed synchronous gestural and audio data to be recorded, which proved well suited to classification by deep neural networks (DNNs). Recordings were segmented into individual bow strokes, which were then classified across three tasks: Participant Recognition, Articulation Recognition, and Scale Recognition. Higher participant classification accuracies were observed when using gestural data alone, while multi-input deep neural networks (MI-DNNs), which concatenate separate audio and gestural subnetworks, achieved varying increases in accuracy on the latter two tasks. Across tasks and network architectures, test classification accuracies ranged from 63.83% to 99.67%. Articulation Recognition accuracies were consistently high, averaging 99.37%.
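For illustration, a multi-input network of this kind can be sketched as two subnetworks whose outputs are concatenated before a shared classification head. The minimal Keras sketch below is not the architecture reported in this paper; the input shapes, layer widths, and number of output classes are placeholder assumptions.

# Illustrative sketch only: input shapes, layer sizes, and class count are
# assumptions, not the architecture used in this study.
import tensorflow as tf
from tensorflow.keras import layers, Model

# Audio subnetwork: per-stroke audio feature vector (hypothetical size).
audio_in = layers.Input(shape=(128,), name="audio_features")
a = layers.Dense(64, activation="relu")(audio_in)

# Gestural subnetwork: per-stroke motion feature vector (hypothetical size).
gesture_in = layers.Input(shape=(96,), name="gesture_features")
g = layers.Dense(64, activation="relu")(gesture_in)

# Concatenate the two subnetworks and classify (e.g. articulation classes).
merged = layers.concatenate([a, g])
x = layers.Dense(32, activation="relu")(merged)
out = layers.Dense(4, activation="softmax", name="class_probabilities")(x)

model = Model(inputs=[audio_in, gesture_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

The single-input baselines described above correspond to training either subnetwork on its own with a classification head attached directly, so the comparison isolates the effect of concatenating the two modalities.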