A central building block of many quantum algorithms is the diagonalization of Pauli operators. Although it is always possible to construct a quantum circuit that simultaneously diagonalizes a given set of commuting Pauli operators, only resource-efficient circuits can be executed reliably on near-term quantum computers. Generic diagonalization circuits can lead to an unaffordable SWAP-gate overhead on quantum devices with limited hardware connectivity. A common alternative is to exclude two-qubit gates; however, this comes at the cost of restricting the class of diagonalizable sets of Pauli operators to tensor product bases (TPBs). In this letter, we introduce a theoretical framework for constructing hardware-tailored (HT) diagonalization circuits. We apply our framework to group the Pauli operators occurring in the decomposition of a given Hamiltonian into jointly-HT-diagonalizable sets. We investigate several classes of popular Hamiltonians and observe that our approach requires fewer measurements than conventional TPB approaches. Finally, we experimentally demonstrate the practical applicability of our technique, which showcases the great potential of our circuits for near-term quantum computing.
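To make the trade-off between the two conventional extremes concrete, the following minimal Python sketch groups a small set of Pauli strings in two ways: by qubit-wise commutation, which corresponds to TPB measurements without two-qubit gates, and by general commutation, which allows arbitrary diagonalization circuits. It is not the HT construction introduced in this letter; the function names, the greedy grouping heuristic, and the example operators are illustrative assumptions only.

```python
def commute(p, q):
    """Return True if the Pauli strings p and q commute as operators.

    Two Pauli strings commute iff they differ on an even number of
    positions where both act non-trivially.
    """
    anti = sum(1 for a, b in zip(p, q) if a != "I" and b != "I" and a != b)
    return anti % 2 == 0


def qubitwise_commute(p, q):
    """Return True if p and q commute on every qubit individually (TPB-compatible)."""
    return all(a == "I" or b == "I" or a == b for a, b in zip(p, q))


def greedy_group(paulis, compatible):
    """Greedily place each Pauli string in the first group it is compatible with.

    This is a simple heuristic (an assumption of this sketch), not an
    optimal grouping.
    """
    groups = []
    for p in paulis:
        for g in groups:
            if all(compatible(p, q) for q in g):
                g.append(p)
                break
        else:
            groups.append([p])
    return groups


# Hypothetical two-qubit example, chosen only for illustration.
paulis = ["XX", "YY", "ZZ", "XI", "IX"]
tpb_groups = greedy_group(paulis, qubitwise_commute)
gc_groups = greedy_group(paulis, commute)

print("TPB groups:", tpb_groups)
print("General commuting groups:", gc_groups)
```

In this toy example the TPB grouping needs three sets, whereas general commutation needs only two, at the price of a diagonalization circuit that contains two-qubit gates. HT circuits, as described above, aim at the regime between these extremes by admitting only two-qubit gates that respect the hardware connectivity.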