Multibody system simulations are becoming increasingly complex owing to structural complexity, the number of bodies and joints, and the many phenomena modeled with specialized formulations. This paper pursues an efficient implementation of the Hamiltonian-based divide-and-conquer algorithm (HDCA), a highly parallel algorithm for simulating multi-rigid-body dynamics formulated in canonical coordinates. The algorithm is implemented and executed on a system-on-chip platform that integrates a general-purpose CPU and an FPGA. The LDUP factorization, which is used in the HDCA and accounts for a significant share of the computational load, is presented in detail. Simple planar multibody systems with open- and closed-loop topologies are analyzed to demonstrate the correctness of the implementation. Hardware implementation details are provided, with emphasis on the parallelism inherent in the HDCA and in the linear algebra procedures it employs. The computational performance of the implementation is investigated. The results show that FPGA-based multibody system simulations can be executed significantly faster than the analogous calculations performed on a general-purpose CPU. This finding is a promising basis for various model-based applications, including real-time multibody simulation and control.