Users interact with many electronic devices via menus such as auditory or visual menus. Auditory menus can either complement or replace visual menus. We investigated how advanced auditory cues enhance auditory menus on a smartphone, with tapping, wheeling, and flicking input gestures. The study evaluated a spindex (speech index), in which audio cues inform users where they are in a menu; 122 undergraduates navigated through a menu of 150 songs. Study variables included auditory cue type (text-to-speech alone or TTS plus spindex), visual display mode (on or off), and input gesture (tapping, wheeling, or flicking). Target search time and subjective workload were lower with spindex than without for all input gestures regardless of visual display mode. The spindex condition was rated subjectively higher than plain speech. The effects of input method and display mode on navigation behaviors were analyzed with the two-stage navigation strategy model. Results are discussed in relation to attention theories and in terms of practical applications.