HEPCloud: computing facility evolution for high-energy physics

Panagiotis Spentzouris

Panagiotis Spentzouris, head of the Scientific Computing Division, wrote this column.

Every stage of a modern high-energy physics (HEP) experiment requires massive computing resources, and the investment to deploy and operate them is significant. For example, worldwide, the CMS experiment uses 100,000 cores. The United States deploys 15,000 of these at the Fermilab Tier-1 site and another 25,000 cores at Tier-2 and Tier-3 sites. Fermilab also operates 12,000 cores for the muon and neutrino experiments, as well as significant storage resources: about 30 petabytes of disk and 65 petabytes of tape served by seven tape robots, all connected by fast and reliable networking.

And the needs will only grow from there. During the next decade, the intensity frontier program and the LHC will be operating at full strength, while two new programs will come online around 2025: DUNE and the High-Luminosity LHC. The increased event rates and complexity of the HL-LHC alone will push computing needs to roughly 100 times what current HEP capabilities can handle, generating exabytes (one exabyte is 1,000 petabytes) of data!

HEP must plan now for how to process and analyze these vast amounts of new data efficiently and cost-effectively. The industry trend is to use cloud services to reduce provisioning and operating costs, provide redundancy and fault tolerance, rapidly expand and contract resources (elasticity), and pay only for the resources used. By adopting this approach, U.S. HEP facilities can benefit from incorporating and managing “rental” resources, achieving the elasticity to satisfy demand peaks without overprovisioning local resources.
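As a rough illustration of the elasticity argument, the sketch below estimates how many “rental” cores a demand peak would require and compares a pay-per-use bill with the cost of owning that capacity year-round. Every number and name in it is a hypothetical placeholder, not a Fermilab or cloud-provider figure.

```python
# Hypothetical back-of-the-envelope elasticity estimate; all numbers are
# illustrative assumptions, not actual facility or cloud prices.

LOCAL_CORES = 15000                 # owned capacity (assumed)
PEAK_DEMAND_CORES = 40000           # demand during a processing campaign (assumed)
PEAK_DURATION_HOURS = 30 * 24       # a one-month peak (assumed)
RENTAL_RATE_PER_CORE_HOUR = 0.02    # assumed rental price, $/core-hour
OWNED_COST_PER_CORE_YEAR = 50.0     # assumed amortized cost of an owned core, $/year

# Cores that must come from "rental" resources to cover the peak.
burst_cores = max(0, PEAK_DEMAND_CORES - LOCAL_CORES)

# Pay only for the resources used during the peak...
rental_cost = burst_cores * PEAK_DURATION_HOURS * RENTAL_RATE_PER_CORE_HOUR

# ...versus overprovisioning owned resources to cover the same peak all year.
overprovision_cost = burst_cores * OWNED_COST_PER_CORE_YEAR

print(f"Burst capacity needed: {burst_cores} cores")
print(f"Rental cost for the peak:        ${rental_cost:,.0f}")
print(f"Owning that capacity year-round: ${overprovision_cost:,.0f}")
```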

The HEPCloud facility concept is a proposed path for this evolution, envisioned as a portal to an ecosystem of computing resources, both commercial and academic. It will transparently provide “complete solutions” to all users with agreed-upon levels of service, routing user workflows to local (owned) or remote (rental) resources based on efficiency, cost, workflow requirements and the policies of the target compute engines. This concept takes current practices implemented “manually” (for example, balancing the load between intensity frontier and CMS local computing at Fermilab and using sites of the Open Science Grid) to the next level. HEPCloud could provide the means to share resources across the ecosystem, potentially linking all U.S. HEP computing.
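To make the routing idea concrete, here is a minimal sketch of the kind of decision HEPCloud would automate. The workflow fields, policy checks and cost model are hypothetical placeholders for illustration, not the project's actual design or middleware.

```python
# Minimal sketch of workflow routing between owned and rental resources.
# Field names, policies and the cost model are illustrative assumptions only.

from dataclasses import dataclass

@dataclass
class Workflow:
    name: str
    cores: int             # cores requested
    hours: float            # estimated wall time in hours
    data_onsite: bool       # input data already at the local facility?
    export_allowed: bool    # policy allows running on external resources?

LOCAL_FREE_CORES = 5000     # assumed snapshot of idle owned capacity
RENTAL_RATE = 0.02          # assumed $/core-hour for rental resources
COST_CEILING = 10000        # assumed budget cap for a single workflow, $

def route(wf: Workflow) -> str:
    """Pick a target for the workflow based on requirements, policy and cost."""
    # Policy constraints come first: some workflows must stay on owned resources.
    if not wf.export_allowed:
        return "local"
    # Prefer owned capacity when it is free and the data are already on site.
    if wf.data_onsite and wf.cores <= LOCAL_FREE_CORES:
        return "local"
    # Otherwise burst to rental resources if the estimated cost is acceptable.
    estimated_cost = wf.cores * wf.hours * RENTAL_RATE
    return "rental" if estimated_cost < COST_CEILING else "queue locally"

print(route(Workflow("cms_simulation", cores=20000, hours=12,
                     data_onsite=False, export_allowed=True)))
```

In a real facility this decision would also weigh provisioning latency, data-transfer costs and site availability; the point of the sketch is only that the choice can be expressed as an automated policy rather than a manual one.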

To demonstrate the value of this approach and better understand the effort required, we recently started the Fermilab HEPCloud project in consultation with the DOE Office of High Energy Physics. The goal is to integrate “rental” resources into the current Fermilab facility in a manner transparent to the user. The project aims to develop a seamless user environment, architecture, middleware and policies for efficient and cost-effective use of different resources, as well as information security policies, procedures and monitoring.

The first type of external resource being implemented is the Amazon commercial cloud. Working with the experiments, we have identified use cases from CMS, DES and NOvA to demonstrate the key aspects of the concept. The project, led by project managers Robert Kennedy and Gabriele Garzoglio and technical lead Anthony Tiradani, is making excellent progress and, working closely with CMS and the Open Science Grid, is preparing to deploy the first use case: CMS simulation. Expected to go into production in the next couple of months, it will provide information on scalability, availability and cost-effectiveness.

High-performance computing facilities are another appealing candidate for external HEPCloud resources. The capacity of the current supercomputers operated by the DOE Office of Advanced Scientific Computing Research is more than 10 times larger than HEP's total computing needs, and plans for new supercomputers aim to continue this trend. HEP experiments are working to identify use cases that fit within the allocation, security and access-policy constraints of high-performance computing facilities, and we are actively approaching experts from these facilities to collaborate on this effort.

HEP computing has been very successful in providing the means for great physics discoveries. As we move into the future, we will work to develop the facilities to enable continued discovery — efficiently and cost-effectively. We believe the HEPCloud concept is a good candidate!