Most standard parallel join algorithms try to overcome data skews with a relatively static approach. The way they distribute data (and then computation) over nodes depends on a data re-distribution algorithm (hashing or range partitioning) that is determined before the actual join begins. On the contrary, our approach consists in pre-scanning data in order to choose an efficient join method for each given value of the join attribute. This approach has already proved to be efficient both theoretically and practically in our previous papers. In this paper we introduce a new pipelined version of our frequency adaptive join algorithm. The use of pipelining offers flexible strategies for resource allocation while avoiding unnecessary disk input/output of intermediate join results when computing multi-join queries. We present a detailed version of the algorithm and a cost analysis based on the BSP (Bulk Synchronous Parallel) model, showing that our pipelined algorithm achieves noticeable improvements compared to the sequential parallel version for multi-join queries while guaranteeing perfect balancing properties.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.