Document-level contextual information has been shown to benefit text-based machine translation, but whether and how context helps end-to-end (E2E) speech translation (ST) is still under-studied. We fill this gap through extensive experiments using a simple concatenation-based context-aware ST model, paired with adaptive feature selection on speech encodings for computational efficiency. We investigate several decoding approaches, and introduce in-model ensemble decoding, which jointly performs document- and sentence-level translation using the same model. Our results on the MuST-C benchmark with Transformer demonstrate the effectiveness of context for E2E ST. Compared to sentence-level ST, context-aware ST obtains better translation quality (+0.18–2.61 BLEU), improves pronoun and homophone translation, shows better robustness to (artificial) audio segmentation errors, and reduces latency and flicker, delivering higher quality for simultaneous translation.
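
The abstract only names in-model ensemble decoding; as a rough illustration, one step of such decoding might interpolate the next-token distributions the same model produces with and without document context. The minimal sketch below assumes a log-linear interpolation with weight `lam`; the function and argument names are hypothetical and not the paper's exact formulation.

```python
import torch


def in_model_ensemble_step(model, doc_inputs, sent_inputs, lam=0.5):
    """One decoding step mixing document- and sentence-level predictions.

    Both passes use the *same* context-aware model; only the inputs
    differ. `lam` is an assumed interpolation weight for illustration.
    """
    # Document-level pass: current speech segment plus preceding context.
    logp_doc = torch.log_softmax(model(**doc_inputs), dim=-1)
    # Sentence-level pass: the current segment alone, no context.
    logp_sent = torch.log_softmax(model(**sent_inputs), dim=-1)
    # Log-linear interpolation of the two next-token distributions;
    # the argmax (or beam expansion) would then be taken over this mix.
    return lam * logp_doc + (1.0 - lam) * logp_sent
```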