We present the first work that investigates the potential of improving the performance of transportation mode recognition by fusing multimodal data from wearable sensors: motion, sound and vision. We first train three independent deep neural network (DNN) classifiers, one for each sensor modality. We then propose two schemes that fuse the classification results from the three mono-modal classifiers. The first scheme makes an ensemble decision with fixed rules, including Sum, Product, Majority Voting and Borda Count. The second scheme is an adaptive fuser, built as another classifier (Naive Bayes, Decision Tree, Random Forest or Neural Network), that learns enhanced predictions by combining the outputs of the three mono-modal classifiers. We verify the advantage of the proposed method on the state-of-the-art Sussex-Huawei Locomotion and Transportation (SHL) dataset, recognizing eight transportation activities: Still, Walk, Run, Bike, Bus, Car, Train and Subway. We achieve F1 scores of 79.4%, 82.1% and 72.8% with the mono-modal motion, sound and vision classifiers, respectively. The F1 score improves markedly to 94.5% and 95.5% with the two data fusion schemes, respectively. The recognition performance can be further improved with a post-processing scheme that exploits the temporal continuity of transportation. When assessing generalization of the model to unseen data, we show that while performance is reduced, as expected, for each individual classifier, the benefits of fusion are retained, with performance improved by 15 percentage points. Beyond the raw performance increase, this work, most importantly, opens up the possibility of dynamically fusing modalities to achieve distinct power-performance trade-offs at run time.
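As a concrete illustration of the first, fixed-rule fusion scheme, the sketch below combines the per-class probability vectors of the three mono-modal classifiers with the four listed rules. It is a minimal, generic implementation; the function name and the equal weighting of the three modalities are illustrative assumptions, not the authors' code.

```python
import numpy as np

def fuse_fixed_rule(probs_motion, probs_sound, probs_vision, rule="sum"):
    """Fuse per-class probability vectors from three mono-modal classifiers.

    Each input is a 1-D array of length n_classes (e.g. the 8 SHL activities).
    Returns the index of the fused prediction. Illustrative sketch only; the
    rules follow their standard definitions, not the authors' exact code.
    """
    probs = np.vstack([probs_motion, probs_sound, probs_vision])
    n_classifiers, n_classes = probs.shape

    if rule == "sum":
        scores = probs.sum(axis=0)
    elif rule == "product":
        scores = probs.prod(axis=0)
    elif rule == "majority":
        # Each classifier casts one vote for its top class.
        scores = np.bincount(probs.argmax(axis=1), minlength=n_classes)
    elif rule == "borda":
        # Each classifier ranks the classes; higher rank earns more points.
        ranks = probs.argsort(axis=1).argsort(axis=1)  # 0 = lowest probability
        scores = ranks.sum(axis=0)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return int(scores.argmax())
```

For example, `fuse_fixed_rule(p_motion, p_sound, p_vision, rule="product")` returns the index of the fused prediction over the eight SHL classes; ties are broken by the lowest class index in this sketch.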
Computer vision techniques applied to images opportunistically captured from body-worn cameras or mobile phones offer tremendous potential for vision-based context awareness. In this paper, we evaluate the potential to recognise the modes of locomotion and transportation of mobile users by analysing single images captured by body-worn cameras. We evaluate this with the publicly available Sussex-Huawei Locomotion and Transportation dataset, which includes 8 transportation and locomotion modes performed over 7 months by 3 users. We present a baseline performance obtained through crowdsourcing with Amazon Mechanical Turk: humans inferred the correct mode of transportation from images with an F1-score of 52%. The performance obtained by five state-of-the-art deep neural networks (VGG16, VGG19, ResNet50, MobileNet and DenseNet169) on the same task was always above an F1-score of 71.3%. We characterise the effect of partitioning the training data to fine-tune different numbers of blocks of the deep networks and provide recommendations for mobile implementations.
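The block-wise fine-tuning characterised above can be sketched as follows for VGG16 with Keras. The helper name, classifier head and hyper-parameters are assumptions for illustration; only the idea of unfreezing a chosen number of top convolutional blocks of a pretrained network comes from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_finetuned_vgg16(n_classes=8, trainable_blocks=1):
    """Fine-tune VGG16 for transportation-mode recognition from single images.

    `trainable_blocks` is the number of convolutional blocks (counted from the
    top of the five VGG16 blocks) that remain trainable; all earlier blocks
    keep their ImageNet weights frozen. Hyper-parameters are illustrative.
    """
    base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                       input_shape=(224, 224, 3))
    # VGG16 layer names start with "block1_" ... "block5_".
    frozen_prefixes = tuple(f"block{i}_" for i in range(1, 6 - trainable_blocks))
    for layer in base.layers:
        layer.trainable = not layer.name.startswith(frozen_prefixes)

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```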
Vision-based human activity recognition can provide rich contextual information but has traditionally been computationally prohibitive. We present a characterisation of five convolutional neural networks (DenseNet169, MobileNet, ResNet50, VGG16, VGG19) implemented with TensorFlow Lite running on three state-of-the-art Android mobile phones. The networks have been trained to recognise 8 modes of transportation from camera images using the SHL Locomotion and Transportation dataset. We analyse the effect of thread count and back-end (CPU, GPU, Android Neural Networks API) when classifying the images provided by the rear camera of the phones, and report processing time and classification accuracy. CCS Concepts: • Computing methodologies → Activity recognition and understanding; • Theory of computation → Discrete optimization; • Software and its engineering → Designing software.
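A minimal sketch of the on-device inference path being benchmarked: loading a TFLite-converted classifier, setting the interpreter thread count and classifying one preprocessed camera frame. The Python API and the function name are illustrative assumptions; on the phones themselves, the GPU and NNAPI back-ends would be enabled through delegates in the Android runtime, which is not shown here.

```python
import numpy as np
import tensorflow as tf

def classify_image_tflite(model_path, image, num_threads=4):
    """Run one image through a TFLite-converted transportation-mode classifier.

    `image` is expected as a float32 array matching the model's input shape,
    e.g. (1, 224, 224, 3), already resized and normalised. Thread count is set
    on the interpreter; delegate-based back-ends are configured separately.
    """
    interpreter = tf.lite.Interpreter(model_path=model_path,
                                      num_threads=num_threads)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()[0]
    output_details = interpreter.get_output_details()[0]

    interpreter.set_tensor(input_details["index"], image.astype(np.float32))
    interpreter.invoke()
    probs = interpreter.get_tensor(output_details["index"])[0]
    return int(np.argmax(probs)), probs
```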