Numerous benchmark sets exist for combinatorial optimization problems. However, in healthcare scheduling, only a few benchmark sets are known, mainly focused on nurse rostering. One of the most studied topics in the healthcare scheduling literature is surgery scheduling, for which there is no widely used benchmark set. An effective benchmark set should be diverse, reflect the real world, contain large instances, and be extendable. This paper proposes a benchmark set for surgery scheduling algorithms, which satisfies these four criteria. Surgery scheduling instances are characterized by an underlying case mix, which describes the volume and properties of the surgery types. Given a case mix, unlimited random instances can be generated. A complete surgery scheduling benchmark set should encompass the diversity of prevalent case mixes. We therefore propose a case mix classification scheme, which we use to typify both real-life and theoretical case mixes that span the breadth of possible case mix types. Our full benchmark set contains 20,880 instances, with a small benchmark subset of 146 instances. The instances are generated based on real-life case mixes (11 surgical specialties), as well as theoretical instances. The instances were generated using a novel instance generation procedure, which introduces the concept of "instance proximity" to measure the similarity between two instances, and which uses this concept to generate sets of instances that are as diverse as possible.