As more organizations are leveraging third-party cloud and edge data centers to process data efficiently, the issue of preserving data confidentiality becomes increasingly important. In response, numerous security mechanisms have been introduced and promoted in recent years including software-based ones such as homomorphic encryption, as well as hardware-based ones such as Intel SGX and AMD SEV. However these mechanisms vary in their security properties, performance characteristics, availability, and application modalities, making it hard for programmers to judiciously choose and correctly employ the right one for a given data query.
This paper presents a mechanism-independent approach to distributed confidentiality-preserving data analytics. Our approach hinges on a core programming language which abstracts the intricacies of individual security mechanisms. Data is labeled using custom confidentiality levels arranged along a lattice in order to capture its exact confidentiality constraints. High-level mappings between available mechanisms and these labels are captured through a novel expressive form of security policy. Confidentiality is guaranteed through a type system based on a novel formulation of noninterference, generalized to support our security policy definition. Queries written in a largely security-agnostic subset of our language are transformed to the full language to automatically use mechanisms in an efficient, possibly combined manner, while provably preserving confidentiality in data queries end-to-end. We prototype our approach as an extension to the popular Apache Spark analytics engine, demonstrating the significant versatility and performance benefits of our approach over single hardwired mechanisms --- including in existing systems --- without compromising on confidentiality.