In medical diagnosis, ultrasound (US) imaging is one of the most common, safe, and powerful techniques. Volumetric (3D) US is potentially very attractive compared to 2D US because it might enable telesonography: decoupling local image acquisition, performed by an untrained person, from diagnosis, performed by a trained sonographer who can be remote. Unfortunately, current 3D systems are hospital-oriented, bulky, and expensive, so they are unavailable in emergency operations or rural areas. This motivates us to develop a portable US platform with cheap, battery-operated, more efficient electronics.
The core of any US imaging system is beamforming (BF), which is the most computationally challenging and costly step. BF consists of delay calculation and apodization. For each volume location, to identify whether it contains fully reflective (white voxel) or non-reflective (black voxel) tissue, it is first necessary to compute the two-way travel delay of the sound wave from the sound origin to this location and back to each piezoelectric element on the transducer. Apodization is a weighting used to suppress the side lobes that arise from the transducer's directivity function. Typically, apodization can be performed with a Hann (Hanning) function, whose bell-shaped profile smoothly attenuates sensitivity towards the transducer edges. The width of the apodization profile can also expand with imaging depth, optimizing resolution and minimizing clutter at all depths. Different systems, both commercial and research-based [1], have dealt with the processing demands of 3D BF by reducing the number of receive channels, which simplifies computation but sacrifices resolution.
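The two BF steps described above can be illustrated with a minimal delay-and-sum sketch in plain Python. It is not the proposed hardware architecture: it computes every delay with an exact square root, which is precisely the cost a dedicated delay generator tries to avoid. The function names, the speed of sound of 1540 m/s, and the choice to centre the Hann profile on the sound origin are all assumptions made for illustration; only the 32 MHz sampling frequency comes from the text.

```python
import math

SPEED_OF_SOUND = 1540.0  # m/s, typical soft-tissue value (assumption)
SAMPLING_FREQ = 32e6     # Hz, sampling frequency quoted in the text


def two_way_delay_samples(origin, voxel, element,
                          c=SPEED_OF_SOUND, fs=SAMPLING_FREQ):
    """Two-way delay in samples: origin -> voxel -> receiving element.

    All positions are (x, y, z) tuples in metres.
    """
    path = math.dist(origin, voxel) + math.dist(voxel, element)
    return path / c * fs


def hann_weight(offset, half_width):
    """Hann apodization weight for an element `offset` metres from the
    aperture centre; zero at and beyond `half_width`."""
    if half_width <= 0 or abs(offset) >= half_width:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * offset / half_width))


def beamform_voxel(rf, elements, origin, voxel, half_width):
    """Delay-and-sum value of one voxel.

    rf[i] is the sampled echo trace of elements[i]. The apodization is
    centred on `origin` (an assumption); passing a `half_width` that
    grows with voxel depth gives a depth-expanding aperture.
    """
    total = 0.0
    for trace, elem in zip(rf, elements):
        idx = round(two_way_delay_samples(origin, voxel, elem))
        if 0 <= idx < len(trace):
            offset = math.hypot(elem[0] - origin[0], elem[1] - origin[1])
            total += hann_weight(offset, half_width) * trace[idx]
    return total
```

Calling `beamform_voxel` once per voxel over all element traces makes the scaling visible: the number of delay evaluations is the number of voxels times the number of elements, per reconstructed volume.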
To date, there is no satisfactory answer for a portable, low-power, low-cost 3D US imaging system that can still process high-channel-count, or even full-resolution, probe readouts for better resolution and contrast.
We have previously proposed an approach [2] to calculate delays more efficiently. Instead of attempting to compute trillions of square roots per second, this method calculates only a small reference set of delays (a few square roots produced by a Xilinx CORDIC IP) and then, leveraging geometric considerations, applies two additions per delay sample. In this paper we show a scalable beamformer architecture capable of supporting over 1024 transducer elements in a single, latest-generation FPGA. Fig. 1 shows the whole FPGA system, including our custom beamformer block, which communicates via an AXI interface. The overall system includes a MicroBlaze processor subsystem and an Ethernet interface that is presently used for all I/O. The proposed architecture of the beamformer is shown in Fig. 2. Table I shows results for the proposed beamformer architecture, including the resource utilization for reconstructing a 2.5M-voxel volume with a 4 MHz center frequency and a 32 MHz sampling frequency, supporting a 32×32-element probe. The results show that a theoretical reconstruction rate of 50 volumes/s can be...