The key enhancement in the medium access control (MAC) layer is frame aggregation, introduced by the IEEE 802.11n/ac standards to accommodate growing traffic demand in the WLAN by allowing multiple packets to be aggregated per transmission. Frame aggregation reduces control overhead in the MAC layer, such as the MAC header, and thus helps to improve the transmission efficiency and throughput of the WLAN. However, heterogeneous traffic demand among streams in the WLAN downlink MU-MIMO channel makes it challenging to fully exploit the benefits of frame aggregation. Choosing the frame size also involves a trade-off: a larger frame lowers the relative impact of overhead, but it is also more vulnerable to transmission errors. This trade-off should be addressed by an adaptive frame aggregation technique that derives the frame size that maximizes throughput in the WLAN downlink MU-MIMO channel. Moreover, when frame aggregation is employed, more frames must wait in a buffer before transmission, which increases delay in the WLAN. Thus, the trade-off between maximizing throughput and minimizing delay is a critical issue that should also be addressed to enhance WLAN performance. However, the majority of existing adaptive aggregation algorithms for the WLAN downlink MU-MIMO channel focus on either maximizing throughput or minimizing delay. The main contribution of this paper is a machine learning-based frame size optimization algorithm that extends our earlier approach by accounting for the cost of delay while maximizing the system throughput of the WLAN. The effectiveness of the proposed scheme is evaluated against the FIFO Baseline Approach and earlier conventional approaches under various traffic patterns, channel conditions, and numbers of stations (STAs).
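The frame size trade-off described above can be illustrated with a small numerical sketch. This is a deliberately simplified toy model and not the algorithm proposed in this paper: it assumes independent bit errors, a fixed per-transmission overhead, and that a single bit error loses the whole aggregate (real A-MPDU aggregation acknowledges subframes individually). The function names, the 400-bit overhead, and the bit error rates are hypothetical values chosen only for illustration.

```python
def efficiency(payload_bits: int, overhead_bits: int, ber: float) -> float:
    """Expected fraction of transmitted bits that are successfully delivered payload.

    Toy assumption: the whole frame is lost if any single bit is in error.
    """
    total_bits = payload_bits + overhead_bits
    p_success = (1.0 - ber) ** total_bits      # probability of an error-free frame
    return (payload_bits / total_bits) * p_success


def best_frame_size(overhead_bits: int, ber: float, candidates) -> int:
    """Exhaustively pick the candidate payload size with the highest efficiency."""
    return max(candidates, key=lambda n: efficiency(n, overhead_bits, ber))


# Hypothetical search grid: payload sizes from 1,000 to 500,000 bits.
candidates = range(1_000, 500_001, 1_000)

if __name__ == "__main__":
    for ber in (1e-6, 1e-5, 1e-4):
        n = best_frame_size(400, ber, candidates)
        print(f"BER={ber:.0e}: best payload = {n} bits, "
              f"efficiency = {efficiency(n, 400, ber):.3f}")
```

Under these assumptions the best payload size shrinks as the channel gets noisier, which is why a fixed aggregation size cannot suit all channel conditions. The sketch ignores queueing delay entirely; it only shows the overhead-versus-error side of the problem.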