Sorted QR decomposition (SQRD) has been extensively adopted in various multiple-input multiple-output (MIMO) detectors, but its sorting process incurs severe latency in larger-scale MIMO systems. This paper proposes a group-SQRD (GSQRD) algorithm to alleviate the latency problem of general SQRD architectures for larger-scale MIMO systems. By predictively sorting a group of 4 columns at one stage, the GSQRD reduces the processing latency by 41% for decomposing 16×16 complex-valued matrices, and this reduction rises to 68% for 128×128 matrices. To analyse the side effects, the GSQRD is applied to various MIMO detectors in a simulated link, which shows negligible degradation in detection performance. Moreover, GSQRD is hardware-friendly because its division and square-root operations are converted to multiplications, which simplifies the hardware implementation. Based on this algorithm, two hardware architectures, which contain 2 and 4 columns respectively in a sorting group, are implemented in 65-nm CMOS technology. These architectures operate at 513 MHz to decompose 16×16 complex-valued matrices. Their processing latencies are 0.32 and 0.26 μs, respectively, superior to state-of-the-art designs.

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
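
For orientation, the conventional SQRD baseline referred to above can be sketched as follows. This is a minimal illustration of the widely used modified Gram-Schmidt formulation with minimum-norm column sorting, not the GSQRD proposed here. The function name `sorted_qr`, the list-of-columns layout for Q, and the full-column-rank assumption are choices made for this sketch only. Note that one sort is performed per column, and each sort depends on the norm updates of the previous step; this serial dependency is the kind of latency that grouping columns aims to reduce.

```python
import math


def sorted_qr(H):
    """Conventional sorted QR (illustrative sketch, not GSQRD).

    H is an m x n matrix given as a list of rows of (complex) numbers and
    is assumed to have full column rank. Returns (Q, R, perm) where Q is a
    list of n orthonormal columns, R is n x n upper triangular, and
    H[:, perm] = Q R.
    """
    m, n = len(H), len(H[0])
    # Store the working matrix column by column.
    Q = [[complex(H[r][c]) for r in range(m)] for c in range(n)]
    R = [[0j] * n for _ in range(n)]
    perm = list(range(n))
    # Squared norms of the remaining (not yet orthogonalised) columns.
    norms = [sum(abs(x) ** 2 for x in col) for col in Q]

    for i in range(n):
        # Sorting step: choose the remaining column with the smallest norm.
        k = min(range(i, n), key=lambda j: norms[j])
        Q[i], Q[k] = Q[k], Q[i]
        norms[i], norms[k] = norms[k], norms[i]
        perm[i], perm[k] = perm[k], perm[i]
        for row in R:
            row[i], row[k] = row[k], row[i]

        # Normalise the chosen column (square root and division).
        R[i][i] = math.sqrt(sum(abs(x) ** 2 for x in Q[i]))
        Q[i] = [x / R[i][i] for x in Q[i]]

        # Orthogonalise the remaining columns and update their norms.
        for j in range(i + 1, n):
            R[i][j] = sum(q.conjugate() * x for q, x in zip(Q[i], Q[j]))
            Q[j] = [x - R[i][j] * q for q, x in zip(Q[i], Q[j])]
            norms[j] -= abs(R[i][j]) ** 2

    return Q, R, perm
```

The returned permutation records the detection order; a detector using this decomposition would apply the same column ordering to the transmitted symbol vector.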