Underwater acoustic target detection with multiple autonomous underwater vehicles (AUVs) offers a scalable aperture, but it depends on accurate estimation of the shape and structure of the swarm. In this study, a method based on an active sound source is proposed for estimating the geometry of distributed nodes. Pulse or chirp signals transmitted by the source arrive at the AUV nodes with different time delays and are collected by the hydrophones, and the formation geometry can be estimated from these delays. Firstly, the frequency-sliding generalised cross-correlation (FS-GCC) feature matrix between the signal received by each node and that received by the reference node is extracted. Secondly, an autoencoder network is employed to enhance the time delay line implied in the FS-GCC matrix, from which the time delay between the two nodes is obtained. For each source orientation, a time delay vector is constructed by concatenating the delays between the reference node and all other nodes. Lastly, the time delay matrix, composed of the time delay vectors collected within a time window while an AUV moves circularly around the detection nodes, is decomposed by singular value decomposition to obtain the formation geometry. The feasibility and effectiveness of the method are verified on a simulation dataset, in an anechoic pool experiment, and in a lake trial. Compared with ref. [1], the root mean square error of the proposed geometry estimation method drops by 9% and 54% for two typical geometries on the simulation dataset, and by 45% and 37% in the anechoic pool experiment. Although the estimation error increases in the lake trial, the proposed method still achieves relatively better precision. The proposed method is found to perform better when the formation geometry of the AUVs approaches a regular polygon and when the source works in a stop-and-go mode.
Furthermore, when the source distance meets the far-field assumption, the azimuthal span of the source movement relative to the AUVs should be larger than 30°.
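
As an illustration of the final decomposition step, the following minimal sketch assumes a two-dimensional formation, a far-field source, noise-free delays, and a nominal sound speed of 1500 m/s. Under these assumptions the time delay matrix has rank 2, so a truncated singular value decomposition recovers the node positions up to a 2×2 linear ambiguity. Here that ambiguity is resolved by requiring every source direction to be a unit vector, which is one common choice and not necessarily the procedure used in this study. The function names `simulate_delays` and `estimate_geometry` are hypothetical, and the recovered geometry remains undetermined up to a rotation or reflection.

```python
import numpy as np

SOUND_SPEED = 1500.0  # m/s, assumed nominal value (not taken from the study)


def simulate_delays(nodes, azimuths, c=SOUND_SPEED):
    """Far-field delays of each non-reference node relative to node 0.

    nodes:    (N, 2) node positions in metres; node 0 is the reference.
    azimuths: (K,) source bearings in radians.
    Returns an (N-1, K) time delay matrix in seconds.
    """
    rel = nodes[1:] - nodes[0]
    dirs = np.column_stack([np.cos(azimuths), np.sin(azimuths)])
    # A node displaced towards the source receives the wavefront earlier.
    return -(rel @ dirs.T) / c


def estimate_geometry(delays, c=SOUND_SPEED):
    """Recover relative node positions (N-1, 2) up to rotation/reflection."""
    u, s, vt = np.linalg.svd(delays, full_matrices=False)
    left = u[:, :2] * s[:2]   # (N-1, 2) scaled left singular vectors
    right = vt[:2, :]         # (2, K) right singular vectors

    # Unit-norm constraint on source directions: r_k^T M r_k = 1 for all k,
    # with M symmetric 2x2. This is linear in (m11, m12, m22).
    x, y = right
    design = np.column_stack([x**2, 2.0 * x * y, y**2])
    m11, m12, m22 = np.linalg.lstsq(design, np.ones(x.size), rcond=None)[0]
    m = np.array([[m11, m12], [m12, m22]])

    # M = C C^T; raises LinAlgError if noise makes M non-positive-definite.
    chol = np.linalg.cholesky(m)
    return -c * left @ np.linalg.inv(chol).T


def pairwise_distances(points):
    """Distance matrix, invariant to rotation and reflection."""
    return np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
```

Because a rotation or reflection is left unresolved, the estimate is best compared with the true formation through inter-node distances, for example by prepending the reference node at the origin and calling `pairwise_distances` on both geometries. With noisy delays or a narrow azimuthal span, the least-squares system becomes poorly conditioned, which is consistent with the requirement on the azimuthal span stated above.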