The self-join finds all objects in a dataset that are within a search distance, ϵ, of each other; therefore, the self-join is a building block of many algorithms. We advance a GPU-accelerated self-join algorithm targeted towards high dimensional data. The massive parallelism afforded by the GPU and its high aggregate memory bandwidth make the architecture well-suited for data-intensive workloads. We leverage a grid-based, GPU-tailored index to perform range queries. We propose the following optimizations: (i) a trade-off between candidate set filtering and index search overhead that exploits properties of the index; (ii) reordering the data based on the variance in each dimension to improve the filtering power of the index; and (iii) a pruning method for reducing the number of expensive distance calculations. Across most scenarios on real-world and synthetic datasets, our algorithm outperforms the parallel state-of-the-art approach. Exascale systems are converging on heterogeneous distributed-memory architectures. We show that an entity partitioning method can be utilized to achieve a balanced workload, and thus good scalability, for multi-GPU or distributed-memory self-joins.

Thus, many large-scale data analytics applications will rely on GPU-efficient algorithms, including the distance similarity self-join for high dimensional data, the subject of this work. This paper makes the following novel contributions:

• Leveraging an efficient indexing scheme for the GPU, we exploit the trade-off between index filtering power and search cost to improve the overall performance of searching high dimensional feature spaces (illustrated in the first sketch after the outline below).
• We improve the filtering power of the index by reordering the data in each dimension using statistical properties of the data distribution. We show that this is particularly important when exploiting the trade-off outlined above (second sketch below).
• We mitigate the performance cost of reducing index filtering power by proposing a technique that prunes the candidate set by comparing points based on an un-indexed dimension (third sketch below).
• We show that, on the worst-case data distribution for our approach, we achieve significantly better performance than the state-of-the-art on the same scenario. This suggests that the performance of the GPU-accelerated self-join is resilient to the data distribution, making the approach well-suited for many application scenarios.
• We evaluate our approach on 5 real-world and 3 synthetic datasets and show that our GPU-accelerated self-join outperforms the state-of-the-art parallel algorithm in the literature.
• The self-join is an expensive operation. We show initial insights into the scalability of the self-join on multi-GPU and distributed-memory systems, and demonstrate that an entity partitioning strategy can be used to achieve good load balancing (fourth sketch below).

The paper is outlined as follows: Section 2 provides background material, Section 3 formalizes the problem and discusses previous work that we employ, Section 4 presents the novel methods we use to improve high dimensional self-join performance, Section 5 illustrates our performance results, and Section 6 dis...
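To make the trade-off between index filtering power and search cost concrete, the following is a minimal, sequential C++ sketch of a grid index that indexes only the first k of D dimensions. Visiting the 3^k cells adjacent to a query point's cell grows cheaper as k shrinks, while the candidate set grows larger and less selective. All names (GridIndex, cellOf, candidates) and the map-based cell storage are illustrative assumptions, not the paper's GPU implementation.

```cpp
// Minimal CPU sketch of a grid index over the first k of D dimensions.
// Hypothetical names; an illustrative stand-in for the GPU-tailored index.
#include <cmath>
#include <cstdio>
#include <map>
#include <vector>

using Point = std::vector<double>;  // a D-dimensional point

struct GridIndex {
    double eps;  // cell width = search distance eps
    int k;       // number of indexed dimensions (k <= D)
    std::map<std::vector<int>, std::vector<int>> cells;  // cell coords -> point ids

    std::vector<int> cellOf(const Point& p) const {
        std::vector<int> c(k);
        for (int d = 0; d < k; ++d) c[d] = (int)std::floor(p[d] / eps);
        return c;
    }
    void build(const std::vector<Point>& pts) {
        for (int i = 0; i < (int)pts.size(); ++i)
            cells[cellOf(pts[i])].push_back(i);
    }
    // Gather candidates from the 3^k cells adjacent to p's cell.
    // Smaller k: fewer cells to visit (cheaper search) but a larger,
    // less selective candidate set (more distance calculations later).
    std::vector<int> candidates(const Point& p) const {
        std::vector<int> out, base = cellOf(p), off(k, -1);
        while (true) {
            std::vector<int> c(k);
            for (int d = 0; d < k; ++d) c[d] = base[d] + off[d];
            auto it = cells.find(c);
            if (it != cells.end())
                out.insert(out.end(), it->second.begin(), it->second.end());
            int d = 0;  // advance the offset odometer over {-1,0,1}^k
            while (d < k && ++off[d] > 1) off[d++] = -1;
            if (d == k) break;
        }
        return out;
    }
};

int main() {
    std::vector<Point> pts = {{0.1, 0.2, 0.9}, {0.15, 0.25, 0.1}, {2.0, 2.0, 2.0}};
    GridIndex idx{0.3, 2, {}};  // eps = 0.3, index first 2 of 3 dimensions
    idx.build(pts);
    // Candidates for point 0 include point 1 (close in the indexed dims even
    // though far in the un-indexed third one) but not point 2.
    printf("candidates for point 0: %zu\n", idx.candidates(pts[0]).size());
}
```

Every candidate returned here must still be refined with a full D-dimensional distance test, which is exactly where the filtering-power side of the trade-off is paid.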
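The second contribution, reordering the data using statistical properties of the distribution, can be sketched as follows: compute the variance of each dimension and permute the dimensions in descending order of variance, so that the k indexed dimensions are the most discriminating ones. This is a minimal sketch under the assumption that variance is the statistic used; the function and variable names are hypothetical.

```cpp
// Minimal sketch: permute dimensions so the highest-variance ones come
// first and are therefore the ones the grid index covers. Hypothetical names.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

using Point = std::vector<double>;

std::vector<Point> reorderByVariance(const std::vector<Point>& pts) {
    const size_t n = pts.size(), D = pts[0].size();
    std::vector<double> var(D, 0.0);
    for (size_t d = 0; d < D; ++d) {
        double mean = 0.0;
        for (const auto& p : pts) mean += p[d];
        mean /= (double)n;
        for (const auto& p : pts) var[d] += (p[d] - mean) * (p[d] - mean);
        var[d] /= (double)n;
    }
    std::vector<size_t> perm(D);  // perm[j] = original index of new dimension j
    std::iota(perm.begin(), perm.end(), 0);
    std::sort(perm.begin(), perm.end(),
              [&](size_t a, size_t b) { return var[a] > var[b]; });
    std::vector<Point> out(n, Point(D));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < D; ++j) out[i][j] = pts[i][perm[j]];
    return out;
}

int main() {
    std::vector<Point> pts = {{0.0, 5.0}, {0.1, -5.0}, {0.05, 0.0}};
    auto r = reorderByVariance(pts);  // the high-variance dimension moves first
    printf("first point after reordering: (%g, %g)\n", r[0][0], r[0][1]);
}
```

The intuition is that a high-variance dimension spreads points across many grid cells, so indexing it first keeps candidate sets small even when only a few dimensions are indexed.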
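The pruning technique from the third contribution can be illustrated with a one-coordinate check: for the Euclidean distance assumed in this sketch, the full distance is never smaller than the absolute difference in any single coordinate, so a candidate whose un-indexed coordinate differs from the query's by more than ϵ can be discarded without computing the full distance. The names survivesPrune and withinEps are hypothetical.

```cpp
// Minimal sketch of candidate pruning on an un-indexed dimension, plus a
// short-circuited refinement step. Euclidean distance assumed.
#include <cmath>
#include <vector>

using Point = std::vector<double>;

// Safe to prune: the Euclidean distance is at least the absolute difference
// in any single coordinate, so exceeding eps in the un-indexed dimension d
// guarantees the pair is not within eps. No true neighbor is ever discarded.
bool survivesPrune(const Point& a, const Point& b, double eps, int d) {
    return std::fabs(a[d] - b[d]) <= eps;
}

// Full refinement, abandoning the sum as soon as it exceeds eps^2.
bool withinEps(const Point& a, const Point& b, double eps) {
    const double eps2 = eps * eps;
    double sum = 0.0;
    for (size_t d = 0; d < a.size(); ++d) {
        const double t = a[d] - b[d];
        sum += t * t;
        if (sum > eps2) return false;  // short-circuit the distance calculation
    }
    return true;
}
```

In a candidate loop, survivesPrune would run before withinEps, so most distant candidates are rejected with a single subtraction and comparison rather than a D-term distance computation.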
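Finally, one plausible reading of the entity partitioning strategy, offered here only as a heavily hedged sketch since the paper's exact method is described later, is to split the query points (the entities) into equally sized chunks, one per GPU or node, each joined against the full dataset.

```cpp
// Heavily hedged sketch of one plausible entity partitioning: equal-sized
// chunks of query points per worker. The paper's exact strategy may differ.
#include <cstdio>
#include <utility>
#include <vector>

// Returns [begin, end) index ranges of query points for each of g workers,
// sized as evenly as possible so per-worker query counts are balanced.
std::vector<std::pair<size_t, size_t>> partitionEntities(size_t n, size_t g) {
    std::vector<std::pair<size_t, size_t>> ranges;
    const size_t base = n / g, extra = n % g;
    size_t begin = 0;
    for (size_t w = 0; w < g; ++w) {
        const size_t len = base + (w < extra ? 1 : 0);
        ranges.push_back({begin, begin + len});
        begin += len;
    }
    return ranges;
}

int main() {
    for (auto [b, e] : partitionEntities(10, 3))  // 10 points over 3 workers
        printf("[%zu, %zu)\n", b, e);             // prints [0,4) [4,7) [7,10)
}
```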