Deep learning (DL) has gained increasing prominence in latency-critical artificial intelligence (AI) applications. Because these applications are computationally intensive, cloud-centric approaches have been adopted to offload DL execution, but they incur intolerable latency, network congestion, and privacy concerns. An alternative concept, edge intelligence, combines AI and edge computing to perform DL execution collaboratively across multiple resource-constrained devices (RCDs) at the edge. This paper proposes a relay-assisted, distributed, and collaborative on-device convolutional neural network (CNN) execution scheme for latency-critical applications. The scheme employs hybrid parallelism, combining data and model parallelism, to optimize collaborative CNN execution on RCDs. A relay-assisted communication technique reduces the input data size per RCD and avoids excessive point-to-point communication between the data-owner RCD and the collaborating RCDs. The proposed approach further reduces communication overhead through two strategies, layer block formation and optimal filter assignment, which are applied across multiple collaborating RCDs with heterogeneous computing capabilities and network conditions. Finally, a convex optimization problem is formulated to minimize overall energy consumption by jointly optimizing the per-layer workload of each RCD together with the communication and computation parameters.
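The filter-assignment idea above can be illustrated with a minimal sketch. The snippet below is not the paper's algorithm; it only shows, under simplifying assumptions (a 1x1 convolution modeled as a channel-mixing matrix product, and capability-proportional filter shares in place of the paper's optimal assignment), how each RCD can compute a disjoint slice of a layer's output filters so that concatenating the partial outputs reproduces the full layer output. The function names `conv1x1` and `partition_filters` are hypothetical.

```python
import numpy as np

def conv1x1(x, filters):
    # x: (C_in, H, W); filters: (C_out, C_in)
    # A 1x1 convolution is a channel-mixing matrix product.
    c_in, h, w = x.shape
    return (filters @ x.reshape(c_in, h * w)).reshape(-1, h, w)

def partition_filters(filters, capabilities):
    # Assign output filters roughly in proportion to each RCD's
    # computing capability (a stand-in for the optimal assignment).
    shares = np.asarray(capabilities, dtype=float)
    counts = np.floor(shares / shares.sum() * len(filters)).astype(int)
    counts[-1] = len(filters) - counts[:-1].sum()  # remainder to last RCD
    bounds = np.cumsum(np.concatenate([[0], counts]))
    return [filters[bounds[i]:bounds[i + 1]] for i in range(len(capabilities))]

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))        # input feature map at the data owner
filters = rng.standard_normal((16, 8))    # 16 output filters for this layer

# Model parallelism: three RCDs each compute their filter slice;
# the concatenated partial outputs equal the full-layer output.
parts = partition_filters(filters, capabilities=[1.0, 2.0, 1.0])
partial_outputs = [conv1x1(x, f) for f in parts]
y_distributed = np.concatenate(partial_outputs, axis=0)

y_full = conv1x1(x, filters)
assert np.allclose(y_distributed, y_full)
```

In this toy setting, the RCD with twice the capability receives twice as many filters, so per-device computation is balanced; the paper's scheme additionally accounts for network conditions and per-layer workload when deciding the assignment.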