An unsung hero of particle physics: GlideinWMS

GlideinWMS manages computing resources for researchers on the Open Science Grid. Image: Amanda Solliday

There’s a computing program that works under the radar for many particle physics experiments, one that makes life easier for researchers dealing with large amounts of data.

First developed at Fermilab by Igor Sfiligoi, the Glidein Workload Management System works much like an air traffic controller, organizing the data that land at computing centers around the globe.

“Our goal is to make the process as simple as possible for scientists,” said Burt Holzman, head of CMS computing facilities at Fermilab and GlideinWMS project manager.

Scientists from all over the world send information about jobs — large packets of scientific analyses — through network connections to one of the four Glidein “factories.” These factories are located at Fermilab, CERN, UC San Diego and Indiana University.

The system processes on the order of hundreds of millions of jobs each year. On a given day, upwards of 200,000 computing jobs from the CMS experiment at CERN pass through the Glidein system.

A GlideinWMS factory tracks down open space at grid sites, which are computing centers that can easily process and store big data. The Open Science Grid ties these centers together and gives scientists access to the computing resources they need. Hundreds of computing centers are connected globally through the system. The Department of Energy and the National Science Foundation jointly fund the Open Science Grid.

The Glidein system can ship large packets of data into the grid in seconds if slots at the computing centers are open. A job typically takes hours to run from start to finish. With its high-performance computing capability, the system can distribute resources among different jobs so that they run more efficiently. That kind of efficiency is invaluable for experiments that process vast stores of data.

Without the factories, the system was “an unstable beast” and riddled with user or equipment errors, Holzman said. The factory system insulates researchers from system failures, which means experimenters don’t need to worry about jobs crashing. If there is a problem, the factory quickly reroutes the job to a different open slot.
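For readers who think in code, the rerouting idea can be pictured with a toy sketch. This is not GlideinWMS code and does not reflect how the factories are actually implemented. The `Slot` class, the `run_on` and `submit_with_reroute` functions, and the site and job names are all hypothetical, invented only to illustrate a job moving on to the next open slot when one fails.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """A hypothetical open slot at a grid site (illustrative only)."""
    site: str
    healthy: bool = True  # False simulates a site or equipment failure


def run_on(slot: Slot, job: str) -> str:
    """Pretend to run a job; raise an error if the slot has failed."""
    if not slot.healthy:
        raise RuntimeError(f"{slot.site} failed while running {job}")
    return f"{job} finished at {slot.site}"


def submit_with_reroute(job: str, open_slots: list[Slot]) -> str:
    """Try each open slot in turn, moving on whenever one fails."""
    for slot in open_slots:
        try:
            return run_on(slot, job)
        except RuntimeError:
            continue  # reroute the job to the next open slot
    raise RuntimeError(f"no working slot found for {job}")


# The first slot is broken, so the job ends up at the second one.
slots = [Slot("site-a", healthy=False), Slot("site-b")]
result = submit_with_reroute("analysis-001", slots)
print(result)
```

In this sketch the researcher calls only `submit_with_reroute` and never sees the failure at the first site, which is the kind of insulation Holzman describes.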

The Glidein software works with HTCondor, a resource management system developed at the University of Wisconsin-Madison. The system uses HTCondor to make nodes (resources at large computing clusters) appear local to the user. Related programs share bird-related names such as Parrot and Chirp, and Glidein follows this naming tradition. The user “glides in” like a bird to the Open Science Grid. Recently, the reach of the system expanded to include clouds such as the Amazon Elastic Compute Cloud and OpenStack.

Anyone who is on the Open Science Grid may use Glidein. Use of the software extends beyond particle physics to other disciplines such as structural biology and neurology.

The latest goal for the system is to build on its ability to send information to the cloud and take advantage of an even larger network of computing power.