Human-to-human communication is no longer merely mediated by computers; it is increasingly generated by them, including on popular communication platforms such as Gmail, Facebook Messenger, LinkedIn, and others. Yet little is known about the differences between human- and machine-generated responses in complex social settings. Here, we present EnronSR, a novel benchmark dataset based on the Enron email corpus that contains both naturally occurring human replies and AI-generated replies for the same set of messages. This resource enables the benchmarking of novel language-generation models in a public and reproducible manner, and facilitates comparison against the strong, production-level baseline of Google's Smart Reply, which is used by millions of people. Moreover, we show that machine-generated responses could align more closely with human replies in terms of whether a response should be offered at all, as well as its length, sentiment, and semantic meaning. We further demonstrate the utility of this benchmark in a case study of GPT-3, showing significantly better alignment with human responses than Smart Reply, albeit with no guarantees of quality or safety.