With recent developments in underwater wireless optical communication (UWOC) technology, UWOC can be used in conjunction with autonomous underwater vehicles (AUVs) for high-speed data sharing within a vehicle formation during underwater exploration. A beam alignment problem arises during communication due to the limited transmission range, external disturbances and noise, and uncertainties in the AUV dynamic model. We propose an acoustic navigation method to guide the alignment process without requiring the beam directors, light intensity sensors, and/or scanning algorithms used in previous research. The AUVs need to stably maintain a specific relative position and orientation to establish an optical link. We model the alignment problem as a partially observable Markov decision process (POMDP) that takes the manipulation, navigation, and energy consumption of the underwater vehicles into account. However, finding an efficient policy for the POMDP under high partial observability and environmental variability is challenging. Therefore, for successful policy optimization, we utilize the soft actor-critic (SAC) reinforcement learning algorithm together with AUV-specific belief updates and reward-shaping-based curriculum learning. Our approach outperformed baseline approaches in a simulation environment and successfully performed the beam alignment process from one AUV to another on the real AUV Tri-TON 2.
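To make the idea of reward-shaping-based curriculum learning concrete, the sketch below shows one possible shaped reward for the alignment task, in which position and heading tolerances tighten as the curriculum stage increases. The reward terms, tolerance schedules, and bonus value are illustrative assumptions and not the paper's actual formulation.

```python
import numpy as np

def shaped_alignment_reward(rel_pos, rel_yaw, target_pos, stage,
                            pos_tol_schedule=(2.0, 1.0, 0.5),   # assumed tolerances [m]
                            yaw_tol_schedule=(0.5, 0.25, 0.1)): # assumed tolerances [rad]
    """Curriculum-shaped reward: tolerances tighten as the stage index grows.

    This is a hypothetical illustration of the general technique, not the
    reward function used in the paper.
    """
    pos_err = np.linalg.norm(np.asarray(rel_pos) - np.asarray(target_pos))
    yaw_err = abs(rel_yaw)
    pos_tol = pos_tol_schedule[min(stage, len(pos_tol_schedule) - 1)]
    yaw_tol = yaw_tol_schedule[min(stage, len(yaw_tol_schedule) - 1)]
    # Dense shaping term: negative distance to the desired alignment pose
    reward = -pos_err - yaw_err
    # Sparse bonus once both errors fall within the current stage's tolerances
    if pos_err < pos_tol and yaw_err < yaw_tol:
        reward += 10.0
    return reward

# Example: the same pose earns the bonus at an early stage but not at a late one.
print(shaped_alignment_reward([1.2, 0.3, 0.0], 0.2, [0.0, 0.0, 0.0], stage=0))
print(shaped_alignment_reward([1.2, 0.3, 0.0], 0.2, [0.0, 0.0, 0.0], stage=2))
```

Such a staged reward lets the SAC agent first learn coarse approach behavior and only later be penalized for failing the tight tolerances required to hold an optical link.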