Mu2e’s opportunistic run on the Open Science Grid

To conduct full event simulations, Mu2e requires time on more than one computing grid. This graphic shows, by the size of each area, the fraction of the recent Mu2e simulation production through FermiGrid, the University of Nebraska, CMS computing at Caltech, MIT and Fermilab, the ATLAS Midwest and Michigan Tier-2s (MWT2 and AGLT2), Syracuse University (SU-OSG) and other sites — all accessed through the Open Science Grid.

Scientists in Fermilab’s Mu2e collaboration are facing a challenging task: In order to get DOE approval to build their experiment and take data, they must scale up the simulations used to design their detector.

Their aim is to complete this simulation campaign, as they call it, in time for the next DOE critical-decision review, which Mu2e hopes will give the green light to proceed with experiment construction and data taking. The team estimated that they would need the computing capacity of about 4,000 CPUs for four to five months (followed by a much smaller need for the rest of the year). Because of the large size of the campaign and the limited computing resources at Fermilab, which are shared among all the lab’s experiments, the Mu2e team adapted their workflow and data management systems to run a majority of the simulations at sites other than Fermilab. They then ran simulations across the Open Science Grid using distributed high-throughput computing facilities.

Mu2e scientist Andrei Gaponenko explained that last year, Mu2e used more than their allocation of computing by using any and all available CPU cycles not used by other experiments locally on FermiGrid. The experiment decided to continue this concept on the Open Science Grid, or OSG, by running “opportunistically” on as many available remote computing resources as possible.

“There were some technical hurdles to overcome,” Gaponenko said. Not only did the scripts have to be able to see the Mu2e software, but all of the remote sites — more than 25 — had to be able to run this software, which was originally installed at Fermilab. Further, the local operating system software needed to be compatible.

“A lot of people worked very hard to make this possible,” he said. Members of the OSG Production Support team helped support the endeavor — getting Mu2e authorized to run at the remote sites and helping debug problems with the job processing and data handling. Members of the Scientific Computing Division supported the experiment’s underlying scripts, software and data management tools.

The move to use OSG proved valuable, even with the inevitable hurdles of starting something new.

“As Mu2e experimenters, we are pilot users on OSG, and we are grabbing cycles opportunistically whenever we can. We had issues, but we solved them,” said Rob Kutschke, Mu2e analysis coordinator. “While we did not expect things to work perfectly the first time, very quickly we were able to get many hundreds of thousands of CPU hours per day.”

Ray Culbertson, Mu2e production coordinator, agreed.

“We exceeded our baseline goals, met the stretch goals and will continue to maintain schedule,” Culbertson said.

Ken Herner, a member of the support team in the Scientific Computing Division that helped the experimenters port their applications to OSG, hopes that Mu2e will serve as an example for other experiments that currently conduct their event processing locally at Fermilab.

“The important thing is demonstrating to other experiments here that it can work and it can work really well,” Herner said. “Ideally, this sort of running should become the norm. What you really want is to just submit the job, and if it runs on site, great. And if it runs off site, great — just give me as many resources as possible.”

Hanah Chang