Docker containers are the de facto standard for packaging, distributing, and running applications on cloud-based infrastructures. Commercial providers and private clouds are expanding their offerings with container orchestration engines, tightly integrating the management of resources and containerized applications. The Storage Group of CERN IT leverages container technologies to provide ScienceBox, an integrated software bundle with storage and computing services for general-purpose and scientific use. ScienceBox features distributed scalable storage, sync and share functionality, and a web-based data analysis service, and can be deployed on a single machine or scaled out across multiple servers. ScienceBox has proven helpful in different contexts, from High Energy Physics analysis to high school education, and has been successfully deployed on different cloud infrastructures and heterogeneous hardware.
Interest in scalable data processing solutions based on the
Apache Hadoop ecosystem is constantly growing in the High Energy Physics
(HEP) community. This drives the need for increased reliability and availability
of the central Hadoop service and underlying infrastructure provided to the
community by the CERN IT department. This paper reports on the overall status
of the Hadoop platform and the related Hadoop and Spark services at CERN,
detailing recent enhancements and features introduced in many areas including
the service configuration, availability, alerting, monitoring and data protection,
in order to meet the new requirements posed by the users’ community.
This talk shares our recent experience in providing a data analytics platform based on Apache Spark for High Energy Physics, the CERN accelerator logging system, and infrastructure monitoring. The Hadoop service has started to expand its user base to researchers who want to perform analysis with big data technologies. Among the many frameworks available, Apache Spark is currently gaining the most traction across user communities, and new ways to deploy Spark, such as on Apache Mesos or on Kubernetes, are evolving rapidly. Meanwhile, notebook web applications such as Jupyter offer the ability to perform interactive data analytics and visualization without the need to install additional software. CERN already provides a web platform, called SWAN (Service for Web-based ANalysis), where users can write and run their analyses in the form of notebooks, seamlessly accessing the data and software they need. The first part of the presentation covers several recent integrations and optimizations of the Apache Spark computing platform that enable HEP data processing and CERN accelerator logging system analytics. These include, but are not limited to, access to kerberized resources, an XRootD connector enabling remote access to EOS storage, and integration with SWAN for interactive data analysis, thus forming a truly Unified Analytics Platform. The second part of the talk addresses the evolution of the Apache Spark data analytics platform, in particular the recent work done to run Spark on Kubernetes on the virtualized and container-based infrastructure in OpenStack. This deployment model allows elastic scaling of data analytics workloads, enabling efficient, on-demand utilization of resources in private or public clouds.
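The elastic deployment model described above can be illustrated with a minimal sketch of how a Spark-on-Kubernetes job might be configured. It is not the configuration used at CERN. It assumes a Spark 3.x release, where dynamic allocation on Kubernetes relies on shuffle tracking. The API server URL, container image, namespace, and application path are hypothetical placeholders, and the helper functions `spark_on_k8s_conf` and `to_submit_args` are introduced only for illustration.

```python
from typing import Dict, List


def spark_on_k8s_conf(api_server: str, image: str, namespace: str,
                      min_executors: int, max_executors: int) -> Dict[str, str]:
    """Build Spark properties for an elastically scaled job on Kubernetes."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        # Kubernetes masters are addressed as k8s://<API server URL>.
        "spark.master": f"k8s://{api_server}",
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.container.image": image,
        # Let Spark add and remove executor pods as the workload changes.
        "spark.dynamicAllocation.enabled": "true",
        # Assumption: Spark 3.x, which tracks shuffle data instead of
        # requiring an external shuffle service on Kubernetes.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def to_submit_args(conf: Dict[str, str], app: str) -> List[str]:
    """Translate the properties into a spark-submit argument list."""
    args = ["spark-submit", "--master", conf["spark.master"],
            "--deploy-mode", "cluster"]
    for key, value in conf.items():
        if key != "spark.master":
            args += ["--conf", f"{key}={value}"]
    return args + [app]


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    conf = spark_on_k8s_conf("https://k8s.example.org:6443",
                             "example/spark-py:latest", "analytics", 1, 8)
    print(" ".join(to_submit_args(conf, "local:///opt/app/job.py")))
```

Keeping the properties in a plain dictionary makes the executor bounds easy to adjust per workload. The same keys could equally be passed to a `SparkSession` builder from a notebook.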