The Internet of Things (IoT) and deep learning (DL) are merging into one domain, enabling outstanding technologies for various classification tasks. Such technologies are based on complex networks that mainly target powerful platforms with rich computing resources, such as servers. Therefore, for resource-constrained embedded systems, new challenges of network size, performance (i.e., latency, throughput, and accuracy), and power consumption must be addressed, particularly when edge devices handle multimodal data (i.e., different types of real-time sensing data). In this case study, we focus on DeepSense, a time-series multimodal DL framework that combines convolutional and recurrent neural networks (NNs) to process accelerometer and gyroscope data for human activity recognition. We present a field-programmable gate array (FPGA)-based accelerator for DeepSense, incorporated into a hardware/software co-design approach, to achieve better latency and energy efficiency using the Xilinx Vitis AI framework. The architecture of DeepSense has drawbacks that cannot be easily alleviated by Vitis AI; therefore, we introduce a new methodology for adjusting the framework and its components (i.e., the deep learning processing unit (DPU)) to achieve a custom design suitable for such time-series multimodal NNs. We implemented the accelerator on two FPGA boards and performed a quantitative evaluation, varying the DPU parameter settings to support our design approach. We demonstrate the effectiveness of our implementation against the original software implementation on mobile devices, achieving up to 2.5× and 5.2× improvements in latency and energy consumption, respectively. Through this case study, we provide crucial insights into the FPGA-based accelerator design of multimodal NNs and essential aspects to consider for further improvements and adaptation in other application domains.