Although lexical diversity is often used as a measure of productive proficiency (e.g., as an aspect of lexical complexity) in SLA studies involving oral tasks, relatively little research has examined the reliability or validity of lexical diversity indices in spoken contexts. Furthermore, SLA researchers commonly use lexical diversity indices, such as Root TTR (Guiraud’s index) and D (vocd-D and HD-D), that have been preliminarily shown to lack reliability in spoken L2 contexts and/or consistently shown to lack reliability in written L2 contexts. In this study, we empirically evaluate lexical diversity indices with respect to two aspects of reliability (text-length independence and across-task stability) and one aspect of validity (relationship with proficiency scores). The results indicate that neither Root TTR nor D is reliable across different text lengths. In contrast, the results support the reliability and validity of optimized versions of MATTR and MTLD.
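To make the notion of text-length independence concrete, the following is a minimal sketch of two of the indices named above, using their commonly cited textbook definitions: Root TTR as types divided by the square root of tokens, and MATTR as the mean type-token ratio over a moving window. The function names, the toy token list, the window size, and the fallback for texts shorter than the window are all assumptions made for illustration. This is not the study's implementation, and it does not reproduce the optimized versions of MATTR and MTLD evaluated here.

```python
import math


def root_ttr(tokens):
    """Root TTR (Guiraud's index): number of types / sqrt(number of tokens)."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / math.sqrt(len(tokens))


def mattr(tokens, window=50):
    """Moving-average TTR: mean TTR over every run of `window` consecutive tokens.

    The window size is a free parameter; 50 is only an illustrative default.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    if len(tokens) < window:
        # Assumption for this sketch: fall back to plain TTR for short texts.
        return len(set(tokens)) / len(tokens)
    ttrs = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ttrs) / len(ttrs)


# Toy data (hypothetical): the same six tokens, once and then repeated twice.
short = "the cat sat on the mat".split()
doubled = short * 2

# Root TTR changes when the text is lengthened, although the vocabulary is the same.
print(root_ttr(short), root_ttr(doubled))

# MATTR with a six-token window gives the same value for both texts.
print(mattr(short, window=6), mattr(doubled, window=6))
```

In this toy case the lengthened text has the same five types, so Root TTR falls as the token count grows, while every six-token window still contains five types and MATTR is unchanged. Real learner speech will not behave this neatly; the example only shows what it means for an index to vary, or not vary, with text length.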