Spreadsheets are a vital tool for end-user data management. Using large language
models for formula authoring assistance in these environments can be difficult,
as these models are expensive to train and challenging to deploy due to their
size (up to billions of parameters). We present FLAME, a transformer-based model
trained exclusively on Excel formulas that leverages domain insights to achieve
competitive performance while being substantially smaller (60M parameters) and
training on two orders of magnitude less data. We curate a training dataset
using sketch deduplication, introduce an Excel-specific formula tokenizer, and
use domain-specific versions of masked span prediction and noisy auto-encoding
as pre-training objectives. We evaluate FLAME on formula repair, formula
completion, and similarity-based formula retrieval. FLAME can outperform much
larger models, such as CodeT5 (220M) and the Davinci (175B) and Cushman (12B)
variants of Codex, in 10 of 14 evaluation settings for the repair and completion
tasks. For formula retrieval, FLAME outperforms CodeT5, CodeBERT, and
GraphCodeBERT.
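
To make the idea of sketch deduplication concrete, the following is a minimal
Python sketch of one plausible reading of it. It assumes that a formula's
"sketch" is obtained by replacing string literals, cell references, and numeric
constants with placeholders, and that formulas sharing a sketch count as
duplicates. The placeholder names, the regular expression, and the helper names
`sketch` and `deduplicate` are illustrative assumptions, not FLAME's actual
implementation, and the pattern covers only a small subset of Excel syntax.

```python
import re

# Assumed token classes, tried in this order at each position. Strings come
# first so that text inside quotes is never mistaken for a reference or number.
_TOKEN = re.compile(
    r'(?P<STR>"(?:[^"]|"")*")'                       # "text", with "" as escape
    r'|(?P<REF>(?<![A-Za-z0-9_.])'                   # not inside an identifier
    r'\$?[A-Za-z]{1,3}\$?[0-9]+'                     # A1, $B$2, ...
    r'(?::\$?[A-Za-z]{1,3}\$?[0-9]+)?'               # optional range end, A1:B5
    r'(?![A-Za-z0-9_(]))'                            # not a function like LOG10(
    r'|(?P<NUM>(?<![A-Za-z0-9_])[0-9]+(?:\.[0-9]+)?)'  # 5, 3.14
)


def sketch(formula: str) -> str:
    """Replace literals and cell references with their (assumed) token type."""
    return _TOKEN.sub(lambda m: f"<{m.lastgroup}>", formula)


def deduplicate(formulas):
    """Keep the first formula seen for each distinct sketch."""
    first_by_sketch = {}
    for formula in formulas:
        first_by_sketch.setdefault(sketch(formula), formula)
    return list(first_by_sketch.values())
```

Under these assumptions, `=SUM(A1:A10)+5` and `=SUM(B2:B20)+7` share the sketch
`=SUM(<REF>)+<NUM>`, so only the first is kept, while `=LOG10(C3)` keeps its
function name intact and is treated as a distinct formula.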