The mesh-based Monte Carlo (MMC) algorithm is increasingly used as the gold-standard for developing new biophotonics modeling techniques in 3-D complex tissues, including both diffusion-based and various Monte Carlo (MC) based methods. Compared to multi-layered and voxel-based MCs, MMC can utilize tetrahedral meshes to gain improved anatomical accuracy, but also results in higher computational and memory demands. Previous attempts of accelerating MMC using graphics processing units (GPUs) have yielded limited performance improvement and are not publicly available. Here we report a highly efficient MMC -MMCL -using the OpenCL heterogeneous computing framework, and demonstrate a speedup ratio up to 420x compared to state-of-the-art singlethreaded CPU simulations. The MMCL simulator supports almost all advanced features found in our widely disseminated MMC software, such as support for a dozen of complex source forms, wide-field detectors, boundary reflection, photon replay and storing a rich set of detected photon information. Furthermore, this tool supports a wide range of GPUs/CPUs across vendors and is freely available with full source codes and benchmark suites at http://mcx.space/#mmc. c 2019.