CMS explores new technologies for big-data analysis

Fermilab operates the world's largest CMS Tier-1 facility. It provides 115 petabytes of data storage, grid-enabled CPU resources and high-capacity network to other centers. Photo: Reidar Hahn

Data science is one of the world’s fastest-growing industries, and as a consequence a large ecosystem of software tools for data mining at ever-increasing scales has emerged. Most of the large-scale data analysis techniques that high-energy physics (HEP) researchers pioneered have since been rediscovered or reimagined as commercial data processing applications reach (and often surpass) the scale of LHC data.

Luckily, most popular data science software platforms are open-source, and their development is driven by a wide range of community needs — including those of other fields of science. A growing community of HEP researchers, including many working at Fermilab, is leveraging these open-source tools to improve productivity and simplify HEP analysis workflows.

To date, the CMS experiment has generated more than 100 petabytes of raw data. Large-scale processing campaigns distill the raw data into analysis data sets with typical sizes around 10 terabytes. Even this reduced data is still unwieldy for HEP researchers to analyze, taking from hours to days to process with traditional systems. Analyzers must iteratively explore the data, designing many different queries on a daily basis. Hence, any mechanism that reduces this time-to-insight will provide significant gains in productivity. The High-Luminosity LHC, which will produce 60 times as much data, will make this issue even more pressing. Now is the time to find innovative solutions to this big-data challenge, which will be the largest such challenge in HEP for decades to come.

One of the main bottlenecks for analysis workflows is efficient data delivery to the processors. Computing researchers at Fermilab have designed a data delivery system based on the same high-performance database and caching software tools widely used in industry to, for example, serve Twitter feeds. This system, known as Striped, can process terabytes of data in minutes. Researchers are also evaluating Apache Spark to deliver data to real-world analysis workflows, finding similarly impressive performance.

One noticeable trend in data science software libraries is the preference for a more declarative model for user code: Data scientists write code that expresses what operations they want done, but not necessarily how those operations are carried out, and therefore leave the how to the software libraries. This typically results in faster code, as software engineers can aggressively optimize the underlying libraries. This also typically results in more readable code, as the flow of operations is less clouded by their detailed implementation. An upcoming workshop hosted at Fermilab is dedicated to exploring the possibility of constructing a purely declarative HEP analysis description language.
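The contrast between the two styles can be sketched with a small, hypothetical example using NumPy (the data here is invented for illustration):

```python
import numpy as np

# Hypothetical per-event data: leading-muon transverse momentum (GeV).
muon_pt = np.array([18.2, 47.5, 33.1, 9.8, 61.0])

# Imperative style: spell out *how* to compute the mean pt of selected events.
total, count = 0.0, 0
for pt in muon_pt:
    if pt > 30.0:
        total += pt
        count += 1
mean_imperative = total / count

# Declarative style: state *what* is wanted; the library decides how.
mean_declarative = muon_pt[muon_pt > 30.0].mean()

assert abs(mean_imperative - mean_declarative) < 1e-9
```

The declarative version is shorter and easier to read, and the library is free to optimize the selection and reduction however it likes — the gains the article describes.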

It is not yet clear whether a formal language can completely describe any HEP analysis. An alternative syntax, known as array programming, may offer more flexibility. Traditional HEP analysis code is built around a programming construct known as an event loop: the variables of interest for each collision event are loaded in turn, and several manipulations are performed within the loop body. In contrast, a novel columnar approach loads arrays of variables (columns) spanning many events (rows) and applies array programming expressions to these columns, eschewing a single outer loop in favor of multiple implicit inner loops. Such columnar expressions lead to a more declarative programming model. This approach also has performance advantages on modern computers. A good analogy is a vehicle assembly line, where performing simple, repetitive operations in turn improves overall productivity.
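The two styles can be compared directly in a toy example (invented data; NumPy stands in for the columnar machinery):

```python
import numpy as np

# Hypothetical columns: one value per collision event (the rows).
# Transverse momenta (GeV) of the two leading muons in five events.
pt1 = np.array([52.0, 31.5, 44.2, 25.0, 60.3])
pt2 = np.array([20.1, 28.7, 15.9, 22.4, 35.0])

# Event-loop style: an explicit outer loop over events.
passed_loop = []
for i in range(len(pt1)):
    passed_loop.append(pt1[i] > 30.0 and pt2[i] > 20.0)

# Columnar style: one array expression per cut; the per-event loops
# are implicit and run in optimized compiled code.
passed_columnar = (pt1 > 30.0) & (pt2 > 20.0)

assert passed_loop == list(passed_columnar)
```

Each columnar expression sweeps one simple operation across all events before the next begins — the assembly-line pattern described above.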

An effort led by Fermilab researchers is currently underway to use these novel approaches to complete two full CMS analyses. The tools developed in the course of this project form the Coffea library, which makes extensive use of the scientific Python ecosystem centered around the concepts of efficient data delivery and columnar analysis. The Coffea project will validate whether these industry tools for data delivery are applicable to HEP use cases and whether columnar analysis is a viable paradigm.

This work was presented in the context of a broader Python in HEP initiative, which has received enthusiastic support from the community. These open-source projects are very collaborative, and we welcome any interested reader to join!

Nick Smith is a postdoctoral research associate in the Fermilab CMS Department.

CMS Department communications are coordinated by Fermilab scientist Pushpa Bhat.