Similarity joins have become important for processing and analyzing semi-structured and unstructured data. Although similarity joins have been studied extensively, little attention has been paid to the semi-stream similarity join, which is a similarity join between stream data and a large disk-based relation. In this study, we propose DSim-Join, the first distributed solution to the semi-stream similarity join problem. DSim-Join minimizes data transmission, reduces database accesses by using a cache in a distributed stream processing engine, parallelizes join processing, and balances the load across parallel join threads. Experimental results obtained using real-world datasets show that DSim-Join yields significantly higher throughput than state-of-the-art methods, especially for large datasets. The results also show that DSim-Join is scalable and stable: its performance is not very sensitive to parameters such as the micro-batch interval, checkpoint interval, and similarity threshold.

INDEX TERMS Semi-stream join, similarity join, distributed stream processing engine, database system, big data, distributed join processing, cache management, Spark Streaming.
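
To make the problem setting concrete, the following is a minimal single-process sketch of a semi-stream similarity join with a cache in front of the relation. It is not the DSim-Join algorithm: the Jaccard similarity function, the token-level inverted index, the LRU eviction policy, and all names (`CachedRelation`, `join_micro_batch`, `db_accesses`) are assumptions chosen for illustration, and distribution, parallel join threads, and load balancing are omitted entirely.

```python
from collections import OrderedDict


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed similarity function)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class CachedRelation:
    """Simulated disk-based relation behind a small LRU cache.

    `index` maps a token to the ids of the relation records containing it.
    Every index lookup that misses the cache counts as one database access.
    """

    def __init__(self, records, capacity=2):
        self.records = {rid: frozenset(toks) for rid, toks in records.items()}
        self.index = {}
        for rid, toks in self.records.items():
            for tok in toks:
                self.index.setdefault(tok, []).append(rid)
        self.capacity = capacity
        self.cache = OrderedDict()  # token -> list of record ids
        self.db_accesses = 0

    def candidates(self, token):
        """Return ids of relation records that contain `token`."""
        if token in self.cache:
            self.cache.move_to_end(token)  # mark as most recently used
            return self.cache[token]
        self.db_accesses += 1  # cache miss: simulated disk lookup
        rids = self.index.get(token, [])
        self.cache[token] = rids
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return rids


def join_micro_batch(batch, relation, threshold):
    """Join one micro-batch of (id, tokens) stream records with the relation.

    Returns (stream_id, relation_id, similarity) for every pair whose
    Jaccard similarity is at least `threshold`.
    """
    results = []
    for sid, toks in batch:
        toks = frozenset(toks)
        seen = set()
        for tok in sorted(toks):
            for rid in relation.candidates(tok):
                if rid in seen:
                    continue  # each candidate is verified once per stream record
                seen.add(rid)
                sim = jaccard(toks, relation.records[rid])
                if sim >= threshold:
                    results.append((sid, rid, sim))
    return results
```

Calling `join_micro_batch` once per micro-batch on the same `CachedRelation` lets later batches reuse cached lookups, so `db_accesses` grows more slowly than the number of tokens probed. This illustrates only the general idea of reducing database accesses with a cache, under the assumptions stated above.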