• Combining deep speaker embedding (window-level d-vector) extraction systems using 2D self-attentive, gated additive, and bilinear pooling methods. The best-performing combination structure is obtained by stacking a 2D self-attentive structure and a bilinear pooling structure (a sketch of such a stack is given after this list).
• A complete single-pass neural network-based diarisation pipeline is introduced, which includes neural voice activity detection, neural change point detection, a deep speaker embedding extraction system, and spectral clustering (see the clustering sketch after this list).
• Experiments on both the AMI and NIST RT05 evaluation sets showed that the proposed methods produce state-of-the-art results on the very challenging multi-speaker (4-10 speakers) meeting diarisation task.
• We use the AMI dataset based on the official speech recognition partition, with audio recorded by multiple distant microphones (MDM), since this is a more realistic setup for meeting transcription than the many different setups used in previous AMI-based studies. To the best of the authors' knowledge, this is the first paper to use this setup. Although this makes our results not directly comparable to those of previous papers, our system still shows superior performance: the results in Table 11 are the lowest diarisation error rates obtained with the same training data, and the realistic setup increases the difficulty of the task.
• This paper builds on our previous conference paper (Sun et al., 2019). Compared to that paper:
  - Almost all of the combination structures are newly proposed and were not included in Sun et al. (2019), apart from the first type of 2D self-attentive method defined in Eqn. (4).
  - Sun et al. (2019) only performed experiments on the AMI data with manual segmentation, while this paper explores both manual and automatic segmentation. The neural diarisation pipeline is newly introduced in this paper, although the neural VAD structure was proposed in Wang et al. (2016).
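To make the combination idea concrete, the following is a minimal PyTorch sketch of stacking a self-attentive combination layer with a low-rank bilinear pooling layer over the d-vectors of two systems. All module names, dimensions, and the way the two structures are wired together are illustrative assumptions rather than the paper's exact architecture; in particular, the attention here acts only over the system axis, whereas the paper's 2D self-attentive structure also attends over a second axis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentiveCombiner(nn.Module):
    """Multi-head additive self-attention over the system axis (illustrative).

    Input:  (batch, S, D) -- one D-dim window-level d-vector per system.
    Output: (batch, H * D) -- concatenated attention-weighted summaries.
    """
    def __init__(self, dim, hidden=128, heads=4):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, heads, bias=False)

    def forward(self, x):                         # x: (B, S, D)
        scores = self.w2(torch.tanh(self.w1(x)))  # (B, S, H)
        attn = F.softmax(scores, dim=1)           # normalise over systems
        heads = torch.einsum('bsh,bsd->bhd', attn, x)
        return heads.flatten(1)                   # (B, H * D)

class BilinearCombiner(nn.Module):
    """Low-rank (Hadamard-product) bilinear pooling of two embeddings."""
    def __init__(self, dim_a, dim_b, rank=256, out=128):
        super().__init__()
        self.u = nn.Linear(dim_a, rank, bias=False)
        self.v = nn.Linear(dim_b, rank, bias=False)
        self.p = nn.Linear(rank, out)

    def forward(self, a, b):
        return self.p(torch.tanh(self.u(a)) * torch.tanh(self.v(b)))

# Hypothetical stacking: attend over the two systems, then bilinearly pool
# the attentive summary with the plain mean of the system embeddings.
x = torch.randn(8, 2, 128)                  # 8 windows, 2 systems, 128-dim d-vectors
att = SelfAttentiveCombiner(dim=128, heads=4)
bil = BilinearCombiner(dim_a=4 * 128, dim_b=128)
combined = bil(att(x), x.mean(dim=1))       # (8, 128) combined embedding
```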
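For the final stage of the pipeline, the sketch below illustrates a plain spectral clustering of segment-level embeddings with NumPy and scikit-learn. It assumes L2-normalised d-vectors and a known speaker count; the affinity refinement and any speaker-count estimation used in the actual pipeline are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(embeddings, num_speakers):
    """Cluster (N, D) L2-normalised segment embeddings into speakers (illustrative)."""
    # Cosine affinity between every pair of segments; drop self-similarities
    # and clip negatives so the graph weights stay non-negative.
    affinity = embeddings @ embeddings.T
    np.fill_diagonal(affinity, 0.0)
    affinity = np.maximum(affinity, 0.0)

    # Symmetric normalised graph Laplacian: L = I - D^{-1/2} A D^{-1/2}.
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-10))
    lap = np.eye(len(degree)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]

    # Represent each segment by the eigenvectors of the smallest eigenvalues
    # (np.linalg.eigh returns them in ascending order), then run k-means.
    _, vecs = np.linalg.eigh(lap)
    spectral = vecs[:, :num_speakers]
    spectral /= np.linalg.norm(spectral, axis=1, keepdims=True) + 1e-10
    return KMeans(n_clusters=num_speakers, n_init=10).fit_predict(spectral)

# Hypothetical usage with random vectors standing in for real d-vectors.
dvecs = np.random.randn(50, 128)
dvecs /= np.linalg.norm(dvecs, axis=1, keepdims=True)
labels = spectral_cluster(dvecs, num_speakers=4)
```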