With big data, Fermilab plays a big role

Fermilab has been dealing with tremendous amounts of data for years using systems like these at the Grid Computing Center. Photo: Reidar Hahn

With computing power and capabilities expanding and evolving at a rapid pace, the world has been producing vast amounts of data. Annual worldwide computer use produces zettabytes of information – one zettabyte is a trillion gigabytes – and that figure is growing rapidly. The U.S. Office of Science and Technology Policy recently announced the Big Data Research and Development Initiative, aimed at making such enormous quantities of data more useful to researchers, businesses and policymakers.

“Data in and of itself isn’t very useful unless you have the tools to find it, read it, understand it, and make good use of it,” said Rob Roser, head of the Scientific Computing Division at Fermilab.

Since its founding, Fermilab has been working with large amounts of data. Over 30 petabytes of information, or 30 million gigabytes, are stored on robotic tape, and the laboratory is capable of transferring up to 100 gigabits per second over its network, said Ruth Pordes, associate head of the Computing Sector.

Dealing with that data is not an easy task.

“The one thing about big data is that it will kill you if you don’t handle it in a very automated fashion,” said Roser.

Most data is immediately useful to scientists in identifying trends and patterns, but the exceptions to those trends are often the most important pieces. Those outliers have to be looked at on a case-by-case basis, which is difficult to achieve with the vast amounts of data involved, Roser said.

Managing big data is nearly impossible without something known in computing circles as triggering, a process that helps determine what data is sufficiently valuable to keep. At the Tevatron, scientists developed sophisticated algorithms to decide in real time which few hundred of the two million proton-antiproton collisions each second were worth keeping, so that the rest could safely be discarded without missing a discovery.
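
As a rough illustration of the idea, and not a description of the Tevatron's actual trigger systems, a trigger can be thought of as a filter that scores each collision as it arrives and keeps only the few that pass a cut. In the sketch below the event fields, the "energy" score, the simulated data and the threshold are all hypothetical.

```python
import random


def trigger(events, threshold):
    """Yield only events whose (hypothetical) energy score passes the cut.

    Events are examined one at a time as they stream in, so nothing
    that fails the cut is ever stored.
    """
    for event in events:
        if event["energy"] >= threshold:
            yield event


def simulate_events(n, seed=0):
    """Generate n toy events with made-up, exponentially distributed scores."""
    rng = random.Random(seed)
    for i in range(n):
        yield {"id": i, "energy": rng.expovariate(1.0)}


# Stream 100,000 toy events through the trigger and keep the rare ones.
kept = list(trigger(simulate_events(100_000), threshold=8.0))
print(f"kept {len(kept)} of 100000 events")
```

Because the filter works on a stream, its memory use depends on how many events are kept rather than on how many are produced, which is the point of deciding in real time.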

“One of our goals is to democratize the availability of the data,” said Lothar Bauerdick, U.S. Compact Muon Solenoid software and computing manager. “The data could be studied by everyone, not just scientists.” Roser added that one issue is to preserve data and analysis capability for the future.

Fermilab plans to expand its data capabilities by connecting this summer to an advanced 100-gigabit network.

“When it comes to Big Data, Fermilab is already in, and will be continuing to grow, in this space,” said Fermilab CIO Vicky White. “We are definitely looking forward to continuing our contributions as part of this national priority.”

—Joseph Piergrossi