Auto-generated captions on YouTube have proven useful in helping viewers better understand the words being spoken. However, at times they fail to contain accurate captions. In these cases, they lead to confusion. The aim of this paper is to identify and analyze errors in the auto-generated captions of 20 commencement speeches on YouTube. These speeches were presented over a period of 12 years by speakers from different walks of life. The researchers selected ten male and ten female icons. Only the first 10 minutes of the speeches were utilized for this investigation. All the captioned errors were collected and analyzed. Upon completion of the analysis, it was discovered that the frequency of errors in each speech ranged between 10 and 46 cases, with an average of one error occurring about every 26 seconds. Among the different error categories, nouns record the highest number with 144 cases (31.3%). The second is verbs with 93 cases (20.2%), then prepositions with 37 cases (8.1%). Among the four subcategories, namely omission, addition, substitution, and word order, substitution recorded the highest amount of errors with 357 cases (77.6%). Furthermore, the errors were classified into two major groups. The first, involving function words, appeared in 169 cases (36.7%). The second, involving content words, appeared in 291 cases (63.3%). The results of this research suggest that a continuous development of the voice recognition software that automatically generates captions is necessary for more efficient and accurate data that will help viewers and listeners better comprehend the video contents.