Arbitrary shape detection, compared with analytic shape detection, plays a more significant role in machine vision for industrial automation. As industrial automation develops, the requirements for low-delay detection and high-precision operation are steadily increasing. However, existing work on arbitrary shape detection focuses mainly on detection accuracy, and few researchers attempt to achieve ultra-low detection delay because of the limited bandwidth between memory and the CPU. This paper proposes clustered-relative-vector-based parallelization and temporal-constraint-based compression of the generalized Hough transform (GHT) algorithm to achieve an ultra-low-delay processing system implemented on an FPGA. By clustering the relative vectors of neighboring edge pixels into a single clustered vector and defining a regularized R-Table structure, the parallelism of the GHT is increased. Moreover, fully utilizing the temporal information in high-frame-rate video compresses the accumulator memory consumption, by confining the search window and restricting the rotation range according to the detection result from the previous frame. The evaluation shows that the proposed architecture completes detection on a VGA-sized sequence with an ultra-low processing delay of 1.851 ms per frame.
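The temporal constraint can be illustrated with a minimal software sketch. It is not the FPGA architecture: it omits the clustering of relative vectors and the gradient-indexed, regularized R-Table, and it assumes an integer previous center. The names `build_r_table`, `constrained_ght`, `window`, `angle_range`, and `angle_step` are hypothetical and chosen only for illustration. The point it shows is that the accumulator covers only a small window around the previous center and a narrow band of rotation angles, rather than the whole frame and the full rotation range.

```python
import math


def build_r_table(template_points, reference):
    """Relative vectors from each template edge pixel to the reference point."""
    rx, ry = reference
    return [(rx - x, ry - y) for x, y in template_points]


def constrained_ght(edge_points, r_table, prev_center, prev_angle_deg,
                    window=4, angle_range=5, angle_step=1):
    """Vote only inside a window and angle band around the previous result.

    Assumption: prev_center holds integer pixel coordinates.
    Returns ((cx, cy, angle_deg), votes) for the strongest accumulator cell.
    """
    cx0, cy0 = prev_center
    side = 2 * window + 1
    angles = [prev_angle_deg + k
              for k in range(-angle_range, angle_range + 1, angle_step)]
    # Compressed accumulator: one (side x side) plane per candidate angle.
    acc = [[[0] * side for _ in range(side)] for _ in angles]

    for a, angle in enumerate(angles):
        theta = math.radians(angle)
        c, s = math.cos(theta), math.sin(theta)
        for x, y in edge_points:
            for dx, dy in r_table:
                # Rotate the relative vector, then map the candidate
                # center into window coordinates.
                ix = round(x + c * dx - s * dy) - cx0 + window
                iy = round(y + s * dx + c * dy) - cy0 + window
                if 0 <= ix < side and 0 <= iy < side:
                    acc[a][iy][ix] += 1

    votes, a, iy, ix = max(
        (acc[a][iy][ix], a, iy, ix)
        for a in range(len(angles))
        for iy in range(side)
        for ix in range(side)
    )
    return (ix + cx0 - window, iy + cy0 - window, angles[a]), votes


# Usage: an L-shaped template, shifted so its reference point lies at (20, 15).
template = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (0, 1), (0, 2)]
r_table = build_r_table(template, reference=(0, 0))
frame_edges = [(x + 20, y + 15) for x, y in template]
result = constrained_ght(frame_edges, r_table, prev_center=(19, 16),
                         prev_angle_deg=0, angle_range=0)
```

With `window=4` and a single candidate angle, the accumulator in this sketch holds 81 cells, independent of the frame size.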