High computational demand is known to be a technical hurdle for real-time implementation of advanced methods like synthetic aperture imaging (SAI) and plane wave imaging (PWI) that work with the pre-beamform data of each array element. In this paper, we present the development of a software beamformer for SAI and PWI with real-time parallel processing capacity. Our beamformer design comprises a pipelined group of graphics processing units (GPU) that are hosted within the same computer workstation. During operation, each available GPU is assigned to perform demodulation and beamforming for one frame of prebeamform data acquired from one transmit firing (e.g. point firing for SAI). To facilitate parallel computation, the GPUs have been programmed to treat the calculation of depth pixels from the same image scanline as a block of processing threads that can be executed concurrently, and it would repeat this process for all scanlines to obtain the entire frame of image data -i.e. lowresolution image (LRI). To reduce processing latency due to repeated access of each GPU's global memory, we have made use of each thread block's fast-shared memory (to store an entire line of pre-beamform data during demodulation), created texture memory pointers, and utilized global memory caches (to stream repeatedly used data samples during beamforming). Based on this beamformer architecture, a prototype platform has been implemented for SAI and PWI, and its LRI processing throughput has been measured for test datasets with 40 MHz sampling rate, 32 receive channels, and imaging depths between 5-15 cm. When using two Fermi-class GPUs (GTX-470), our beamformer can compute LRIs of 512-by-255 pixels at over 3200 fps and 1300 fps respectively for imaging depths of 5 cm and 15 cm. This processing throughput is roughly 3.2 times higher than a Tesla-class GPU (GTX-275).