Encouraged by the success of the first EGEE biomedical data challenge against malaria (WISDOM), the second data challenge, targeting avian flu, was launched in April 2006 to identify new drugs for potential variants of the influenza A virus. Mobilizing thousands of CPUs on the Grid, the six-week high-throughput screening activity consumed over 100 CPU years of computing power and produced around 600 gigabytes of results for further biological analysis and testing. In this paper, we demonstrate the impact of a worldwide Grid infrastructure in efficiently deploying large-scale virtual screening to speed up the drug design process. Lessons learned from the data challenge are also discussed.
Large-scale grids for in silico drug discovery open opportunities of particular interest for neglected and emerging diseases. In 2005 and 2006, we were able to deploy large-scale virtual docking within the framework of the WISDOM initiative against malaria and avian influenza, requiring about 100 years of CPU time on the EGEE, Auvergrid and TWGrid infrastructures. These achievements demonstrated the relevance of large-scale grids for virtual screening by molecular docking. They also made it possible to evaluate the performance of the grid infrastructures and to identify specific issues raised by large-scale deployment.
MapReduce is a widely used data-parallel programming model for large-scale data analysis. The framework has been shown to scale to thousands of computing nodes and to run reliably on commodity clusters. However, research has shown that there is room for improving its performance. One of the main bottlenecks is the all-to-all communication between mappers and reducers, which may saturate the top-of-rack switch and inflate job execution time. Reducing cross-rack communication will therefore improve job performance. In the current MapReduce implementation, task assignment is based on a pull model, in which cross-rack traffic is difficult to control. In contrast, the MapReduce framework allows more flexibility in assigning reducers to computing nodes. In this paper, we investigate the reducer placement problem (RPP), which considers the placement of reducers to minimize cross-rack traffic. We devise two optimal algorithms to solve RPP and implement them in the Hadoop system. We also propose an analytical solution for this problem. Our experimental results with a set of MapReduce applications show that our optimization achieves a 9% to 32% performance improvement compared with unoptimized Hadoop.
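To make the reducer placement objective concrete, the following is a minimal sketch under stated assumptions, not the optimal algorithms or the analytical solution from the paper. It assumes a hypothetical matrix `data[r][j]` giving the bytes of map output stored on rack `r` that are destined for reducer `j`, and a hypothetical list `slots[r]` of free reducer slots per rack. The function names are illustrative only. The placement routine is a simple greedy heuristic and may not find the minimum-traffic placement.

```python
def cross_rack_traffic(data, placement):
    """Bytes that must cross racks for a given placement.

    data[r][j]   -- assumed: bytes of map output on rack r for reducer j
    placement[j] -- rack on which reducer j runs
    """
    total = 0
    for j, rack in enumerate(placement):
        # Every byte for reducer j that is not already on its rack must move.
        total += sum(data[r][j] for r in range(len(data)) if r != rack)
    return total


def greedy_placement(data, slots):
    """Greedy heuristic (not guaranteed optimal): consider (rack, reducer)
    pairs in order of decreasing local data and place each reducer on the
    first rack that still has a free slot."""
    num_racks = len(data)
    num_reducers = len(data[0])
    free = list(slots)
    placement = [None] * num_reducers
    pairs = sorted(
        ((data[r][j], r, j) for r in range(num_racks) for j in range(num_reducers)),
        key=lambda t: -t[0],
    )
    for _, r, j in pairs:
        if placement[j] is None and free[r] > 0:
            placement[j] = r
            free[r] -= 1
    if any(p is None for p in placement):
        raise ValueError("not enough reducer slots for all reducers")
    return placement
```

For example, with two racks, three reducers, `data = [[10, 1, 5], [2, 8, 5]]` and `slots = [2, 1]`, the heuristic places reducers 0 and 2 on rack 0 and reducer 1 on rack 1, so only the bytes held on the "wrong" rack for each reducer cross the top-of-rack switch.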