Archived web data is a valuable resource for scientific research but poses serious challenges in data processing and management. We demonstrate the Web Lab Collaboration Server, a platform and service for large-scale collaborative web data analysis in a distributed computing environment, and show how it seamlessly supports non-technical users during search, data extraction, and analysis.
We survey three examples of large-scale scientific workflows that we are working with at Cornell: the Arecibo sky survey, the CLEO high-energy particle physics experiment, and the Web Lab project for enabling social science studies of the Internet. All three projects face the same general challenges: massive amounts of raw data, expensive processing steps, and the requirement to make raw data or data products available to users worldwide. However, several differences among them prevent a one-size-fits-all approach to handling their data flows. Instead, current implementations are heavily tuned by domain and data management experts. We describe the three projects and outline research issues and opportunities for integrating Grid technology into these workflows.