High-dimensional similarity join (HDSJ) is critical for many novel applications in the domain of mobile data management. Performing HDSJs efficiently now faces two challenges. First, the scale of datasets is increasing rapidly, making parallel computing on a scalable platform a must. Second, the dimensionality of the data can reach hundreds or even thousands, which brings about the curse of dimensionality. In this paper, we address these challenges and study how to perform parallel HDSJs efficiently in the MapReduce paradigm. In particular, we propose a cost model to demonstrate that it is important to take both communication and computation costs into account as dimensionality and data volume increase. To this end, we propose DAA (Dimension Aggregation Approximation), an efficient compression approach that can significantly reduce both of these costs when performing parallel HDSJs. Moreover, we design DAA-based parallel HDSJ algorithms that can scale up to massive data sizes and very high dimensionality. We perform extensive experiments using both synthetic and real datasets to evaluate the speedup and the scaleup of our algorithms.
As a leading cause of severe disability and death, stroke places an enormous burden on Chinese society. A nationwide stroke screening platform called CSDC (China Stroke Data Center) has been built to support the national stroke prevention program and stroke clinical research since 2011. This platform is composed of a data integration system and a big data analysis system. The data integration system collects information on risk factors, diagnosis history, treatment, and sociodemographic characteristics, as well as stroke patients' EMR. The big data analysis system supports decision making for stroke control and prevention, clinical evaluation, and research. In this paper, the design and implementation of CSDC are illustrated, and some application results are presented. This platform is expected to provide rich data and powerful tool support for stroke control and prevention in China.