In April 2021, a team of four developers from Fermilab participated in a four-day GPU Hackathon hosted by the Argonne Leadership Computing Facility, or ALCF. Their goal: to use the computational speed of the latest generation of graphics processing units to optimize the reconstruction of particle tracks.
In computer science, Moore’s law is the observation that the number of transistors in a dense integrated circuit, and with it the speed and performance of computer chips, doubles about every two years. Denser chips reduce transmission delays and hence translate to faster central processing unit, or CPU, clock speeds. Increases in density brought faster and faster CPUs to the market for about 30 years, from 1975 until around 2005. By then, transistors had become so small that supplying sufficient power to a chip without overheating it became challenging.
To further improve performance, researchers began developing new types of chips featuring multiple processing units — or cores — that operate in parallel. Graphics processing units, or GPUs, are one such example and feature thousands of cores. Nvidia, a partner in the GPU Hackathon at Argonne, has been developing GPUs since the 1990s.
In fact, long before CPUs’ clock speed plateaued, the demand for better and faster computer graphics in video games and other applications had led to the development of powerful GPUs. Graphics rendering, which requires computing what color to display for every single pixel on a monitor with, say, 3 million pixels, is an inherently massive computing task that can be parallelized. It benefits from dedicated hardware that executes the same operation (code) on different input (data) in parallel.
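The "same operation on different data" pattern can be sketched in a few lines of C++. This is a minimal illustration, not rendering code from any real graphics stack: the `Pixel` type, the `to_gray` function and its integer weights are all assumptions made for the example. The point is that each output value depends only on its own input pixel, so every pixel could in principle be handled by a different core.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical pixel type for illustration: one byte per color channel.
struct Pixel {
    std::uint8_t r, g, b;
};

// The "operation": convert one pixel to a gray level.
// The weights are an assumed integer approximation (they sum to 256).
// The result depends only on the pixel passed in, never on its neighbors.
std::uint8_t to_gray(const Pixel& p) {
    return static_cast<std::uint8_t>((77 * p.r + 150 * p.g + 29 * p.b) / 256);
}

// Apply the same operation to every pixel of a frame.
// This version runs sequentially; because the per-pixel calls are
// independent, the work could be split across many cores unchanged.
void shade_frame(const std::vector<Pixel>& frame, std::vector<std::uint8_t>& out) {
    assert(out.size() == frame.size());  // caller provides the output buffer
    std::transform(frame.begin(), frame.end(), out.begin(), to_gray);
}
```

A caller would fill a `std::vector<Pixel>` with one entry per screen pixel, allocate an output vector of the same size, and call `shade_frame(frame, out)`.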
For many years, researchers also have been pursuing the use of GPUs to speed up scientific computations. Fermilab, for example, installed a cluster of GPUs in its Grid Computing Center in 2011 to speed up lattice QCD calculations.
The CMS collaboration has been exploring how to take advantage of GPUs to meet its future computing challenges, particularly in the High-Luminosity Large Hadron Collider, or HL-LHC, era. In 2018, the final year of LHC Run 2, the CMS experiment recorded about 16 petabytes of raw data from the detector. When the HL-LHC is operational, it will produce several hundred petabytes of detector data from particle collisions per year.
Adapting scientific applications to use GPUs is not trivial. It often requires radically redesigning algorithms to take advantage of the thousands of computing cores in the GPUs and optimizing them through detailed profiling.
The GPU Hackathon programs, launched in 2015 and led by partnerships among national labs, universities and industry, help researchers and developers accelerate and optimize their applications on GPUs. Thirteen teams of researchers from various scientific domains, including aerodynamics, climate modeling, physics, molecular dynamics and neuroscience, participated in the event hosted by the ALCF. Each team was given access to the newly improved ThetaGPU system at the ALCF, including exclusive use of state-of-the-art Nvidia A100 GPUs.
Despite being converted to a fully virtual event over Zoom, the four-day hackathon provided valuable interactions with expert mentors from the ALCF as well as software engineers from Nvidia. The teams spent most working hours with their assigned mentors, bouncing around ideas for improvement and implementing them. At the end of each day, every team gave a five-minute progress report and shared any findings that could be useful to the other teams.
The Fermilab team comprised researchers from the Scientific Computing Division and the CMS Department: Martin Kwok, Matti Kortelainen, Giuseppe Cerati and Alexei Strelchenko. Their objective? To optimize a prototype application for CMS particle track reconstruction using GPUs.
Determining the trajectories of charged particles from their energy deposits in the CMS detector is the most computationally complex and time-consuming part of the CMS data processing chain. Over the last five years, the so-called mkFit project has developed a new track reconstruction code for CMS that is six times faster than the current standard. The new code, which is being integrated into CMS software targeting first use in LHC Run 3, successfully uses vectorized and parallelized Kalman filter algorithms to speed up the track reconstruction.
To push the speed-up further, physicists and computer scientists are exploring the possibility of generalizing mkFit to work on GPUs. To experiment with GPU programming models available on the market, the Fermilab team extracted the track propagation part of the code into a lightweight mini-app named p2r, which propagates tracks in the radial direction.
During the hackathon, our team had in-depth discussions with our assigned mentor from Nvidia, who helped profile the initial implementation of p2r and identify the bottlenecks and sources of inefficiencies.
By the end of the hackathon event, our team managed to achieve a factor of five speed-up compared to the current standard, almost as good as the vectorized CPU implementation of the mkFit code. Breaking even with that vectorized version is no easy feat for GPU programming, since the extra time needed to transfer input data to the GPU and output data back from it adds significant drag on the overall speed. Making the computation on the GPU efficient enough that the additional data-transfer time is less than the time gained by computing on many cores is a challenge many GPU applications confront.
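The break-even condition described here can be written compactly. The symbols are introduced only for illustration and are not taken from the p2r measurements: T_CPU is the run time of the CPU implementation, T_kernel is the compute time on the GPU, and T_in and T_out are the times to copy input data to the GPU and results back.

```latex
\[
T_{\mathrm{in}} + T_{\mathrm{kernel}} + T_{\mathrm{out}} < T_{\mathrm{CPU}}
\quad\Longleftrightarrow\quad
T_{\mathrm{in}} + T_{\mathrm{out}} < T_{\mathrm{CPU}} - T_{\mathrm{kernel}}
\]
```

This simple form assumes that transfers and computation happen one after the other. When they can be overlapped, the condition becomes easier to meet.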
During the event, our team also implemented the first version of p2r using parallel C++ standard library algorithms. Parallel STL is one of the options being explored at the DOE-funded High-Energy Physics Center for Computational Excellence to provide portable programming models across different architectures.
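To give a flavor of what parallel standard library code looks like, here is a minimal sketch. It is not the p2r code: the `Track` layout, the function names and the straight-line propagation are assumptions chosen to keep the example short, and real track propagation has to account for the curved paths of charged particles in the detector's magnetic field. What the sketch does show is the programming model, where an ordinary standard algorithm is given an execution policy and the toolchain decides how to spread the work across the hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <execution>
#include <vector>

// Toy track state in the transverse plane (hypothetical layout).
struct Track {
    double x, y;    // current position
    double dx, dy;  // direction of travel
};

// Move one track along a straight line until it reaches radius r.
// Assumes the track starts inside r and has a nonzero direction.
Track propagate_to_radius(const Track& t, double r) {
    const double a = t.dx * t.dx + t.dy * t.dy;
    const double b = t.x * t.dx + t.y * t.dy;
    const double c = t.x * t.x + t.y * t.y - r * r;
    // Positive root of a*s^2 + 2*b*s + c = 0, i.e. |position + s*direction| = r.
    const double s = (-b + std::sqrt(b * b - a * c)) / a;
    return {t.x + s * t.dx, t.y + s * t.dy, t.dx, t.dy};
}

// Propagate every track in place. The execution policy tells the library
// that the per-track calls are independent and may run in parallel.
void propagate_all(std::vector<Track>& tracks, double r) {
    assert(r > 0.0);
    std::transform(std::execution::par_unseq, tracks.begin(), tracks.end(),
                   tracks.begin(),
                   [r](const Track& t) { return propagate_to_radius(t, r); });
}
```

Whether this runs on CPU threads or is offloaded to a GPU depends on the compiler and its options, not on the source code, which is the portability argument for this model. As a practical note, GCC's implementation of the parallel policies typically relies on Intel TBB, so the program may need to be linked with `-ltbb`.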
The multiday hackathon event provided an exciting opportunity to work with experts from the ALCF and industry in a collaborative environment, accelerating progress in our project in a very short time. Building on that success, we are now exploring the implementation of the p2r code using several other portable programming models. This will allow us to compare them and choose a suitable model when developing a new GPU application.
Martin Kwok is a Fermilab research associate in the CMS department and the Scientific Computing Division.
CMS Department communications are coordinated by Fermilab scientist Pushpa Bhat.