Spoken language variation is increasingly analyzed in multimodal settings that combine knowledge from the computer, human and social sciences. This work focuses on second language (L2) acquisition, studying linguistic variation in combination with eye-tracking measures. Its goal is to model L2 pronunciation and to use AI techniques to understand and predict the related metacognitive information concerning reading strategies, text comprehension and L2 level. We present an experimental protocol involving a reading-aloud setup, as well as a first data collection that gathers L2 speech with associated eye-tracking measures.