Turing Award Special: A Conversation with Jack Dongarra - Software Engineering Daily

Software Engineering Daily: Turing Award Special – A Conversation with Jack Dongarra

Released on March 18, 2025

Introduction

In this special Turing Award episode of Software Engineering Daily, host Shawn Falconer engages in an in-depth conversation with Jack Dongarra, a celebrated computer scientist renowned for his groundbreaking work in numerical algorithms and high-performance computing (HPC). Dongarra, a recipient of the 2021 Turing Award, shares insights from his illustrious career, discusses the evolving landscape of supercomputing, and explores future directions in the field.

Defining High Performance Computing

[01:12] Shawn Falconer: “What defines high performance computing and is that a moving target as mainstream computers that we use every day become more powerful over time?”

[01:32] Jack Dongarra: “High performance computing, or supercomputers, are usually specified as the fastest computers at any time. ... Supercomputers are fast in terms of floating point operations... They are characterized by being quite expensive as well. For example, the fastest computer today is located at Lawrence Livermore National Laboratory, costing about $600 million.”

Dongarra explains that the definition of a supercomputer is a moving target: machines that led the field three to five years ago are displaced by the next generation and no longer qualify. Institutions that need this capability therefore have to reinvest on roughly that cycle.

Jack Dongarra’s Career Path

Shawn Falconer delves into Dongarra’s journey into HPC.

[02:52] Jack Dongarra: “I initially wanted to be a high school science teacher... An internship at Argonne National Laboratory transformed my outlook, leading me to pursue computer science instead.”

Dongarra recounts moving from teaching toward research: he earned a master’s degree in computer science at Illinois Institute of Technology while working one day a week at Argonne, then completed a PhD at the University of New Mexico in 1980. During his PhD he worked at Los Alamos National Laboratory, where he experimented with the newly acquired Cray-1 vector computer. He returned to Argonne as a researcher and later moved to the University of Tennessee and Oak Ridge National Laboratory, where he works today.

Motivation in a Rapidly Evolving Field

[07:09] Shawn Falconer: “Is the moving target nature of supercomputing something that’s helped keep you motivated to focus on this field throughout your career?”

[07:37] Jack Dongarra: “It’s exciting to see new architectures and try to understand how they can effectively be used to solve problems. ... Each architectural change requires us to rethink algorithms, software, and numerical libraries.”

The constant evolution in supercomputing architectures, from scalar to vector computers, and then to parallel and multicore processors, keeps the field vibrant and challenging. Dongarra emphasizes the necessity of adapting software to leverage new hardware advancements continually.

Data Movement: The Bottleneck in Supercomputing

[15:57] Shawn Falconer: “What are some of the approaches to reduce the amount of communication that's happening at the hardware level?”

[16:16] Jack Dongarra: “Data movement is the biggest bottleneck... We need ways to overcome the memory bottleneck, such as embedding processors in memory itself and organizing computations around directed acyclic graphs to maximize parallelism.”

Dongarra highlights that while floating-point operations have become highly efficient, the primary limitation in supercomputing is the movement of data. Overprovisioning of floating-point units without corresponding improvements in data transfer rates leads to inefficiencies, with applications typically achieving only about 10% of a supercomputer's peak performance.
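
The directed-acyclic-graph idea mentioned in the quote above can be illustrated with a small sketch. The task names and dependencies below are hypothetical (loosely modeled on a tiled matrix factorization) and are not taken from the episode. The point is only that once dependencies are written down as a DAG, every task whose prerequisites are finished can run at the same time.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key is a task, each value is the set of
# tasks that must finish before it can start. Names are illustrative only.
deps = {
    "factor_A11": set(),
    "solve_A12": {"factor_A11"},
    "solve_A21": {"factor_A11"},
    "update_A22": {"solve_A12", "solve_A21"},
    "factor_A22": {"update_A22"},
}

def parallel_waves(deps):
    """Group tasks into waves; tasks in one wave do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)                 # mark the whole wave as finished
    return waves

waves = parallel_waves(deps)
for i, wave in enumerate(waves):
    print(f"wave {i}: {wave}")
```

In this toy graph the two `solve_*` tasks land in the same wave, so they could run concurrently. A real runtime would also weigh where each task's data lives, which this sketch ignores.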

Benchmarking: Beyond Linpack

[22:54] Shawn Falconer: “Is the solve for that, that essentially there should be more than one measurement or KPI that is used to benchmark supercomputers?”

[23:18] Jack Dongarra: “The best benchmark is the application you intend to run. Linpack was developed when floating-point operations were very expensive, but it no longer reflects modern applications. We developed the HPCG benchmark to better represent current scientific computations.”

Dongarra discusses the limitations of the Linpack benchmark, which primarily measures dense matrix operations, and introduces the High Performance Conjugate Gradients (HPCG) benchmark. HPCG focuses on iterative methods used in solving sparse systems of linear equations, providing a more accurate assessment of a supercomputer's real-world performance. He advocates for a diverse set of benchmarks to capture the multifaceted nature of HPC applications.
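
To make "iterative methods for sparse systems" concrete, here is a minimal sketch of the conjugate gradient method. It is plain, unpreconditioned CG on a one-dimensional Laplacian stencil, chosen as an assumption for brevity. It is not the HPCG benchmark code, which works on a three-dimensional problem. The function names are illustrative.

```python
import math

def laplacian_matvec(x):
    """y = A x for the 1D Laplacian stencil (-1, 2, -1); only nonzeros are touched."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A given only as a matvec."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0 initially
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for k in range(max_iter):
        if math.sqrt(rs_old) < tol:
            return x, k
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter

n = 50
b = [1.0] * n
x, iters = conjugate_gradient(laplacian_matvec, b)
residual = max(abs(bi - yi) for bi, yi in zip(b, laplacian_matvec(x)))
print(f"iterations: {iters}, max residual: {residual:.2e}")
```

Each iteration performs one sparse matrix-vector product and a few vector updates, so the work per iteration is proportional to the amount of data touched. That is the access pattern a benchmark of this kind is meant to stress.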

Impact on the Top 500 List

[24:50] Jack Dongarra: “I want to augment the Top 500 with other benchmarks like HPCG. The Top 500 gives us a handle on peak performance, but additional benchmarks provide a more realistic view of application performance.”

While the Top 500 list remains a valuable tool for tracking the fastest supercomputers, Dongarra suggests complementing it with additional benchmarks to better assess performance across different types of applications. This approach ensures a more comprehensive evaluation of supercomputing capabilities.

AI and Mixed Precision in HPC

[26:16] Shawn Falconer: “There's a shift to AI-driven workloads with 16-bit versus 64-bit floating point arithmetic. How does that change how we measure HPC performance?”

[26:48] Jack Dongarra: “AI is driving the adoption of lower-precision computations. We’re moving from 64-bit to 32-bit, and now to 16-bit and even 8-bit floating point operations. Mixed precision leverages lower precision for speed while using higher precision to maintain accuracy.”

Dongarra explains that AI workloads benefit from lower-precision arithmetic, enabling faster computations and reduced memory traffic. This shift necessitates new algorithm designs that can effectively utilize mixed precision to balance speed and accuracy, fundamentally altering how HPC performance is measured and optimized.
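
One common way to combine precisions is iterative refinement: solve cheaply in low precision, then correct the answer using residuals computed in high precision. The sketch below is an illustration under stated assumptions, not code discussed in the episode. It simulates 32-bit arithmetic by rounding through the `struct` module, uses a small made-up 3x3 system, and for brevity repeats the elimination at each step, where a production solver would factor once and reuse the factors.

```python
import struct

def f32(x):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def solve_low_precision(A, b):
    """Gaussian elimination with partial pivoting, rounding each result to 32 bits."""
    n = len(b)
    M = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = f32(M[r][col] / M[col][col])
            for c in range(col, n + 1):
                M[r][c] = f32(M[r][c] - f32(factor * M[col][c]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x

def residual(A, x, b):
    """r = b - A x, computed in full 64-bit precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]

def refine(A, b, steps=5):
    x = solve_low_precision(A, b)
    history = [max(abs(v) for v in residual(A, x, b))]
    for _ in range(steps):
        r = residual(A, x, b)                  # high-precision residual
        d = solve_low_precision(A, r)          # cheap low-precision correction
        x = [xi + di for xi, di in zip(x, d)]  # accumulate in 64-bit
        history.append(max(abs(v) for v in residual(A, x, b)))
    return x, history

# Illustrative, well-conditioned system (values are made up for the example).
A = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
b = [1.0, 2.0, 3.0]
x, history = refine(A, b)
print("max residual per step:", [f"{h:.1e}" for h in history])
```

The residual should start near 32-bit accuracy and fall toward 64-bit accuracy within a few steps, even though every solve runs in the lower precision. This approach relies on the matrix being reasonably well conditioned.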

Exascale Computing: The Next Frontier

[38:57] Jack Dongarra: “Exascale computers perform 10^18 floating point operations per second. For example, a machine at Livermore National Lab has a peak performance of 2.7 exaflops with 11,000 nodes, each containing multiple CPUs and GPUs. These systems consume enormous power, about 34 megawatts.”

Dongarra provides a detailed overview of exascale supercomputers, emphasizing their immense computational power and energy consumption. He underscores the critical role of GPUs in achieving peak performance and the importance of application parallelism to fully utilize these massive systems.
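
The figures quoted above imply some useful per-node and per-watt numbers. The short calculation below uses only the values in the quote (2.7 exaflops peak, 11,000 nodes, 34 megawatts). The 10% efficiency figure is the rough application efficiency Dongarra cites elsewhere in the conversation and is applied here only as an illustration.

```python
EXA = 10**18                      # one exaflop = 10^18 floating point operations per second

peak_flops = 2.7 * EXA            # quoted peak performance
nodes = 11_000                    # quoted node count
power_watts = 34e6                # quoted power draw, 34 megawatts

flops_per_node = peak_flops / nodes
flops_per_watt = peak_flops / power_watts
sustained_flops = 0.10 * peak_flops   # assumption: roughly 10% of peak on real applications

print(f"{flops_per_node / 1e12:.0f} teraflops per node")    # 245 teraflops per node
print(f"{flops_per_watt / 1e9:.0f} gigaflops per watt")     # 79 gigaflops per watt
print(f"{sustained_flops / EXA:.2f} exaflops sustained at 10% of peak")
```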

Future Directions: Beyond Traditional HPC

[42:39] Shawn Falconer: “What are your thoughts on the potential of quantum computing? Is it overhyped?”

[42:57] Jack Dongarra: “Quantum computers won’t replace conventional computers but will augment them. They hold great potential, but currently, only a few algorithms can effectively utilize quantum computing. It’s an exciting research area, though somewhat overhyped.”

Dongarra acknowledges the promise of quantum computing as a complementary technology to traditional HPC. While optimistic about its future applications, he cautions against expecting it to supplant existing computing paradigms in the near term. He also mentions other emerging technologies like neuromorphic and optical computing, which could further diversify HPC architectures.

Key Contributions and Legacy

Reflecting on his career, Dongarra highlights three major contributions:

Numerical Libraries for Linear Algebra:
- Developing software that adapts to evolving hardware architectures, ensuring efficiency and performance in scientific computations.
Message Passing Interface (MPI):
- Establishing a community-driven standard for message passing in parallel computing, fostering interoperability and collaboration across research groups.
Performance Benchmarks:
- Creating and promoting benchmarks like Linpack and HPCG to evaluate and guide the development of supercomputers, enhancing their alignment with real-world applications.

[46:06] Jack Dongarra: “I’ve contributed to numerical libraries, the MPI standard, and performance evaluation through benchmarks like Linpack and HPCG.”

These contributions have significantly shaped the landscape of HPC, providing the tools and standards that underpin modern supercomputing efforts.

Conclusion

[49:14] Shawn Falconer: “Jack, thanks so much for being here. It’s been a real honor.”

[49:23] Jack Dongarra: “Great. Very good, Shawn. Thanks for the opportunity.”

Jack Dongarra’s insights offer a comprehensive view of the challenges and advancements in high-performance computing. His work continues to drive the field forward, bridging the gap between evolving hardware and the demanding needs of scientific research.

Notable Quotes

On Supercomputing Evolution:
- Jack Dongarra ([01:32]): “Supercomputers are fast in terms of floating point operations... They are characterized by being quite expensive as well.”
On Data Movement Bottleneck:
- Jack Dongarra ([16:16]): “Data movement is the biggest bottleneck... We have machines today which are really over provisioned for floating point operations.”
On Mixed Precision:
- Jack Dongarra ([26:48]): “We’re moving from 64-bit to 32-bit, and now to 16-bit and even 8-bit floating point operations.”
On Quantum Computing:
- Jack Dongarra ([42:57]): “Quantum computers won’t replace conventional computers but will augment them.”
On His Contributions:
- Jack Dongarra ([46:06]): “I’ve contributed to numerical libraries, the MPI standard, and performance evaluation through benchmarks like Linpack and HPCG.”

This conversation provides a valuable exploration of high-performance computing’s past, present, and future, guided by one of its most influential figures. Jack Dongarra’s expertise offers listeners a deep understanding of the intricate balance between hardware advancements and software optimization, underscoring the continuous innovation required to push the boundaries of computational science.

Transcript

[00:00]
Narrator
Jack Dongarra is an American computer scientist who is celebrated for his pioneering contributions to numerical algorithms and high performance computing. He developed essential software libraries like Linpack and LAPACK, which are widely used for solving linear algebra problems on advanced computing systems. Dongarra is also a co-creator of the Top 500 list, which ranks the world's most powerful supercomputers. His work has profoundly impacted computational science, enabling advancements across numerous research domains. Jack received the 2021 Turing Award for pioneering contributions to numerical algorithms and libraries that enabled high performance computational software to keep pace with exponential hardware improvements for over four decades. He joins the podcast with Shawn Falconer to talk about his life and career. This episode is hosted by Shawn Falconer. Check the show notes for more information on Shawn's work and where to find.
[01:08]
Shawn Falconer
Jack. Welcome to the show.
[01:09]
Jack Dongarra
Yeah, thanks very much. It's a pleasure to be here with you.
[01:12]
Shawn Falconer
Yeah, thanks so much for being here. So you've spent a lot of your career working on high performance computing. So first of all, and maybe it seems like a basic question, but I think it's probably a good spot to start, is what defines high performance computing and is that a moving target as mainstream computers that we use every day become more powerful over time?
[01:32]
Jack Dongarra
Well, that's exactly right. So high performance computing, or I would call supercomputers, are usually specified as the fastest computers at any time. So there's the marker. As time, as time goes on, these computers change, of course, and they get faster. And things that were supercomputers, let's say three or five years ago, are no longer considered supercomputers and they're replaced by the next generation of machines. Supercomputers are fast in terms of floating point operations, adds and multiplies, and they're characterized by being quite expensive as well. So the fastest computer that we have today is a machine that's located at Lawrence Livermore National Laboratory. So it's the fastest by some metric. And the cost of that computer is about $600 million. And that computer is a supercomputer today, but I would say five years from now it's going to fall off and not even be considered one of the fastest computers. So there's an investment that has to be made if you are interested in having a supercomputer or need a supercomputer that has to be replaced with that frequency.
[02:47]
Shawn Falconer
Yeah. And how did you first get interested in this research area?
[02:52]
Jack Dongarra
Well, let's see. I guess I wanted to be a high school science teacher and I went to college to do that. And in my last year as an undergraduate, I was encouraged by my physics professor to apply for an internship at Argonne National Laboratory. So Argonne National Laboratory is a Department of Energy laboratory located just outside of Chicago. And that's where I was going to school. So I applied for this position and I received word that I got the appointment. And the appointment was to spend one semester with a scientist. So I worked at Argonne National Lab as my last semester. And that was transformational. It changed everything in terms of my outlook. I no longer wanted to be a science teacher. I felt I had ambition to go into research and to take on this challenge, let's call it, of working at a national laboratory. So I worked there for the last semester, and then I decided to switch and become a computer scientist. So I applied and was accepted at Illinois Institute of Technology in Chicago in the computer science program as a master's student. So I worked for my master's degree and Argonne National Laboratory offered me a position of one day a week working at Argonne. So I lived in Chicago, went to school at IIT, and then worked at Argonne National Lab one day a week. After receiving my master's degree in computer science, Argonne made an offer to me. I decided I didn't want to go on and get a PhD. So Argonne offered me a position at the lab, a full time position working alongside of the researchers there at Argonne in what was called the Applied Math division. And that was again, a wonderful experience, working with experts, designing software for solving certain mathematical problems. And that was something which drove me. And we had frequent visitors to Argonne National Lab from outside. Visitors from various universities would come and spend a day or a week or a month working at Argonne alongside of the other researchers that I was part of.
And they encouraged me to go back to school and get a PhD. So I went back to school with one of the people who gave me encouragement as his student, and that was at the University of New Mexico. So I went there and worked on my PhD. And while I was there, I had the opportunity of working at Los Alamos National Lab. So Los Alamos is another Department of Energy laboratory located maybe a couple of hours north of Albuquerque where the University of New Mexico is located. And I worked there. And that was, you know, again, a wonderful experience. So both Argonne and Los Alamos had supercomputers or machines that were at the top rank at that time. Los Alamos had just acquired a computer called the Cray-1 computer. And that computer was different in terms of its architecture. It had vector instructions. So I had an opportunity to experiment and basically to play with this computer that was there that was going to be used for scientific computations. But by the opportunity I was in a position to use it and to develop ideas and methods that would work well on that computer. And that was again a wonderful experience. And then I received my PhD in 1980 and went back to Argonne National Lab as a researcher and worked there for a number of years until transitioning to Tennessee where I am today. So I've had just a few jobs in my life. I'd like to say I had three jobs. One job was at Argonne National Lab. That was followed by the job at the University of Tennessee and Oak Ridge National Lab where I am today. But before the job at Argonne, I made pizzas. So those are the three positions I claim to hold. Pizza maker and then researcher at Argonne National Lab and then finally professor at the University of Tennessee and working as a researcher at Oak Ridge National Laboratory.
[07:09]
Shawn Falconer
Well, it's wonderful that you're able to early on in your life find something that you fell in love with and were able to do a long career in without having to move around a lot. But in terms of motivation since is the fact that what defines sort of a high performance computer or supercomputer such a moving target where state of art today is yesterday's news three years from now. Is that something that's helped keep you motivated to focus on this field throughout your career?
[07:37]
Jack Dongarra
Oh yeah. It's exciting to see new architectures and then try to understand how they can effectively be used to solve problems. So the way I look at it is I've helped design numerical libraries. So these are software components which are used by other applications. And these software components are sort of basic in terms of the operations that they do. And those libraries basically have to be reorganized or rewritten, refactored every 10 years. And that refactoring is caused by the architecture changes. So you know, we have computers, if I go back in terms of, you know, where we came from, we had scalar computers, that is computers that had the ability to execute a single stream of instructions one at a time. And those machines were replaced with machines that were vector computers. So instead of just operating on one number, adding one number to another number, they operated by taking two vectors and adding them together, let's say, to produce a product. So you issue one instruction and that has an effect across this array of data. And that allows things to run much, much faster in terms of the flow of the data through the system. So vector computers caused this revolutionary idea to change the way the software was written to adapt to it. After vector computers, those computers were special purpose for scientific computations. So they were very expensive and only a limited number would be manufactured. And what happened essentially to the computing area was that we had this incredible improvement in performance of microprocessors. So we had this thing that I'll refer to as the attack of the killer micros. So those microprocessors became faster, more powerful, and were able to basically do the same function as those special purpose scientific computers that were characterized by vector computing. So microprocessors became the basic commodity component which was used in our supercomputers. 
So microprocessors then took on characteristics of being put together in a parallel context. So scientific computers evolved to using these microprocessors that were aggregated together in a parallel computer to help solve problems. So we have computers which had maybe 10 or 20 or 100 of these microprocessors together, communicating, passing messages back and forth over a high speed network to allow them to effectively solve problems. And that had a major change in terms of the software that we dealt with. Using that large number of processors together causes us to reorganize the algorithm that can effectively do that. And those microprocessors then were aggregated together and we reached the point where we had thousands of processors being used to help solve our problem. Multicore came in. And that of course added to the complexity, I'll say, but also to the layering of how these computers were effectively used. So in the end, the supercomputers that we have today are based on, I'll say commodity processors in terms of the basic organization is around the x86 instruction set. So Intel and AMD have licensing rights for that instruction set. Our commodity processors use that. So our supercomputers basically have that as the core of their processing. And today they're augmented by GPUs. So graphical processing units have been added to the mix to help boost the ability to do floating point operations. So we have machines today which are hybrid having commodity components plus GPUs which are effectively used. And again, with each of these changes, it causes us to rethink the algorithms, rethink the software, rethink the numerical libraries so they can effectively be used on this architecture.
[11:56]
Shawn Falconer
In terms of solving these kinds of hard science problems that motivate a lot of the work around supercomputers, is there a particular advantage that you get trying to solve that with something like a supercomputer versus being able to use some sort of distributed network on the cloud over multiple computers and split that work in parallel? Or are there certain things that you just can't split in that fashion?
[12:23]
Jack Dongarra
Right. So in general, the computations that are done on these supercomputers are very data intensive. That is to say, they do a lot of arithmetic and then they do a lot of communication, transferring information from one part of the machine to another part of the machine. So there's that kind of thing going on continuously in these computations. And if we think about a cloud based system, the cloud based system incurs a certain overhead associated with the latency of moving data over large distances and the bandwidth associated with that. So if we think about a truly distributed machine, let's say a machine that puts together components in various locations to create this virtual computer, that would not work very well in terms of a architecture for a machine that would be used for large scale scientific computing. But if we think about a cloud based thing, so if we think about, take Amazon as an example, using Amazon to do the computation on a single Amazon site, which had high performance processors, plus perhaps graphical processing units that could be used to do these computations. So the question then is we have to move the data from where it's located over to the cloud based service, do the computation there, and then drag the results back to the home base. So moving data is a very expensive operation and you don't want to do that very often. So you basically, using a cloud based system, you may get locked into that cloud based system to do the computations over some number of years if you're going to be doing it rather than transporting the data back and forth, because that's a very expensive operation itself. So the cost, if we think about buying a computer and think about using a cloud based system, so a computer that would be on premises as opposed to using a cloud based system. There's been many studies that look at the cost of doing that. And those studies usually come out with on premises computing is a better financial arrangement. 
And by better, I mean perhaps by a factor of two over using a cloud based system. So there could be a factor of two in terms of the cost that you would pay if you add everything up in terms of doing the computation. So again, you know, the cloud providers are providing the service, they set the pricing so that's based on them setting a price that's not competitive in some sense with the on premises computations that might go on for large scale, large scale computations.
[15:57]
Shawn Falconer
So you mentioned data movement there and communication. And even at the hardware level, I think that's probably one of the biggest challenges to essentially scaling up the amount of flops that these machines are capable of. What are some of the approaches to reduce the amount of communication that's happening at the hardware level?
[16:16]
Jack Dongarra
Right, so you've hit on really the bottleneck or the point of contention on these computers. So the thing which is most expensive on our computers is data movement. So it's not the floating point operations, so it's not the computation that we do, it's the movement of data to the place where the computation is going to take place. So that turns out to be the biggest bottleneck. And our machines today are really over provisioned for floating point operations. So they have too much capacity and we can't get the data to them. So we need, in future designs, I'll say we need ways of trying to overcome that memory bottleneck. So that really is the biggest challenge that we have. So there's a lot of ideas that are floating around on how we could overcome that, ideas like processor in memory. So we take processors and embed them in the memory itself so that the processors are very close to the point at which the data is located and data then can flow into those processors. We also have to come up with other techniques to avoid the latency of moving data. So some of the techniques that are used are organizing the computations around a directed acyclic graph. So a directed acyclic graph has the ability to uncover the maximum amount of parallelism in a computation. So we can exploit that parallelism and hopefully get a much more optimal solution. But this issue of moving data is really critical in terms of the efficiencies that we see on our computers. And I'll let you in on a dirty little secret that we have, and that is if we take a look at the peak performance of our supercomputers and we look at the applications that are running on those supercomputers and look at the performance that we actually see from those applications, we're getting roughly 10% of the peak performance on these supercomputers. And that's a result of moving data. We can't move the data to the processors effectively, and that's causing this low efficiency in terms of our applications.
[18:30]
Shawn Falconer
So why do we end up with this case where things are over provisioned for the floating point operations?
[18:35]
Jack Dongarra
Right. So it comes about from a number of reasons. You know, we have certain ways of measuring performance of machines. So I created a benchmark a number of years ago, and that benchmark is called the Linpack benchmark. Linpack benchmark solves a problem. It's a system of linear equations. And that system of linear equations is used in many applications. This benchmark was created back in the late 70s when floating point operations were very expensive. And this benchmark, as a result of floating point operations being very expensive, mimicked the way real applications would perform. But as the hardware changed over time, the benchmark itself has really less reflected how real applications are evolving. And architectures are trying to do very well on this benchmark. So they put instructions and design things to effectively do this benchmark very well. So the benchmark has at its core a matrix multiply, two matrices being multiplied together. So that's an operation which can be highly optimized. And the hardware on our machines today, hardware on our machines and on our GPUs can do that operation very efficiently. So that operation has been optimized to a level which gets very close to the peak performance. So if our applications did that operation, we would match the peak performance. Unfortunately, matrix multiply is not the way in which we tackle problems today. So we use other techniques which don't have that ability. So matrix multiply has the ability of moving N squared pieces of data and doing N cubed operations on it. So that's a surface of data and a cube of floating point operations. So that gives rise to a situation where you could move little data and do a lot of operations on the data that you move N squared data movement n cubed operations on the data. So that is something which would be ideal for an algorithm if it just did that. Today's algorithms, unfortunately don't just do that operation. So what would we do on our supercomputers? 
Our supercomputers are used to solve scientific problems. And those problems span the range of looking at weather forecasting, climate modeling, looking at applications that are trying to optimize combustion engines, looking at nuclear reactors. Some are used for nuclear weapon design. So all of those things usually center on solving a three dimensional partial differential equation. So that three dimensional PDE that's being solved is going to solve a system of linear equations. And that system of linear equations is not dense, it's sparse. And by sparse I mean it has very few elements in this matrix. And we're going to try to solve a system of equations where we have a lot of zeros in the matrix itself. In order to solve that system, we use what's called an iterative method. And that iterative method has the property of doing really n operations on n pieces of data. So we have to move n pieces of data and then do n floating point operations on it. That's the basic pattern in which these algorithms operate. And that's very unlike what was done for a dense matrix where we had n squared pieces of data and n cubed operations on it. So we move a little bit of data and do a lot of operations. With the PDEs, unfortunately, we're in a situation where we just move n pieces of data and do n operations on it. And that causes this only less than 10% of peak performance because of the movement of data, which is very poor compared to the rate at which we can do the floating point operations. So we're basically starving the floating point potential of these machines. The floating point units of these machines are being starved for data, waiting for the information to flow to it.
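
Editor's illustration: the contrast Dongarra draws here, n squared data with n cubed operations versus n data with n operations, can be put in numbers as a ratio of floating point operations to values moved. The sketch below is a rough model under stated assumptions: it counts one multiply and one add per inner product term, assumes ideal data reuse for the dense case, and assumes 7 nonzeros per row for the sparse case (typical of a simple 3D stencil, but not a figure from the conversation).

```python
def dense_matmul_intensity(n):
    """Flops per value moved for C = A * B with n-by-n matrices."""
    flops = 2 * n**3        # n^3 multiplies plus n^3 adds
    data = 3 * n**2         # read A and B, write C (ideal reuse assumed)
    return flops / data

def sparse_matvec_intensity(n, nnz_per_row=7):
    """Flops per value moved for y = A x with a sparse n-by-n matrix."""
    flops = 2 * nnz_per_row * n      # one multiply and one add per nonzero
    data = nnz_per_row * n + 2 * n   # matrix nonzeros, input vector, output vector
    return flops / data

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: dense {dense_matmul_intensity(n):10.1f}   "
          f"sparse {sparse_matvec_intensity(n):.2f}")
```

In this model the dense ratio grows linearly with n while the sparse ratio stays fixed at about 1.6 regardless of problem size, which is why the sparse case is limited by how fast data arrives, not by how fast the arithmetic units run.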
[22:54]
Shawn Falconer
Yeah. And back to Linpack and where that started. There's a saying in business that you optimize the things that you measure, so be careful about what you measure, because you end up creating these biases, essentially. And it sounds like that's kind of what's happened there. So is the solve for that, that essentially there should be more than one measurement or KPI that is used to benchmark supercomputers?
[23:18]
Jack Dongarra
Right. So of course, the best benchmark is the application that you have. So if you have an application that you're intending to run, that would be the thing to really benchmark. So the ideal benchmark would be multiple benchmarks that sort of span a space of applications. So today we have this. In the past we developed Linpack and we thought that was a good measure. Today, it's not a very good measure of how our computers really operate. So we've developed other benchmarks. I mentioned solving PDEs and this iterative process. So we've developed another benchmark called the HPCG benchmark, High Performance Conjugate Gradients benchmark. And it uses an iterative algorithm so it matches or tries to imitate what applications do in solving that three dimensional partial differential equation. So it's trying to get a handle on what the performance is and what the bottlenecks are on machines today on problems that are important to solve. So that's another component or another benchmark that we have that can be used to effectively deal with our situation today or our understanding of how these machines can be used and where the bottlenecks are in the machines themselves. So we need more of those things. That would be the ideal situation.
[24:40]
Shawn Falconer
Right. And do you think that will essentially change, or lead to an overhaul of, the Top 500 list that people use to measure supercomputers today?
[24:50]
Jack Dongarra
Right. So we have this Top 500 list which measures the 500 fastest computers. In retrospect, 500 may be too many, but okay, that's what we have. We have all that data. I view the Top 500 as giving us a handle on trends, and it sort of plots things in a nice way so we can see where we are today, at least in terms of what we think of as the theoretical peak performance for these machines. So if you don't do well on the Top 500, you're probably not going to do well on other applications. That's one way to look at it. So we have all this data and it presents a good point to keep. So I don't want to lose that Top 500 data. I want to augment it with other benchmarks. So again, we have this thing called HPCG which measures another aspect of the computers. It really is trying to get a handle on this data movement to see how well we do with that. And that sort of augments the performance. So there's another list. It doesn't have quite 500 machines on it. It has only a few hundred machines on it. And that sort of shows the difference between the peak performance or the Linpack benchmark and this thing, which is more realistic in terms of what our applications can do today. And we need to develop others. So I shouldn't say we just have two; we should develop other benchmarks. Indeed. Yeah.
[26:16]
Shawn Falconer
Related to that, I read your paper "Reinventing High Performance Computing: Challenges and Opportunities," which I encourage anybody listening to check out as well. But one of the things that you talk about in there is how there's a shift essentially to AI driven workloads, a lot of that happening by these cloud vendors. And there typically, with AI driven workloads, you're talking about 16 bit arithmetic versus 64 bit floating point arithmetic. How does that change the way that you need to think about measuring the performance of high performance computing?
[26:49]
Jack Dongarra
So AI is a tremendous force in terms of computing and in terms of our understanding of how things are happening. So it's a great tool and we use it all the time and it's being used by all of the applications to really strengthen their ability. So going back to the situation that we have, I'll say in the old days we had computers which had 32 bit and 64 bit floating point operations. Those were sort of the basic thing that we had to work with. With GPUs, we had another level coming in, so we had 16 bit floating point operations. So Nvidia provides us with 16 bit. IEEE has a standard for doing 16 bit floating point operations and that is implemented in the hardware. Google came out with something called BF16, which is another representation of 16 bit floating point numbers and has a slightly different configuration of how many bits are in the exponent and how many bits are in the fraction. And Google's format is probably in a better position for doing computations. It gives up a little accuracy, but it has a wider dynamic range than what the IEEE format provides. And then Nvidia, in its more recent generations of hardware, has 8 bit floating point arithmetic. So we went from 64 bit to 32 bit to 16 bit to 8 bit floating point arithmetic. And that lower precision is being driven by AI. It's being driven by how the neural networks work. So in the neural networks, there's a forward propagation and a back propagation. The forward propagation through the neural network requires slightly higher precision for the weights and the activations, and the gradients in the back propagation require a high degree of dynamic range. So that says the exponent needs to be changed, so Nvidia has two formats for 8 bit floating point arithmetic as a result of that. And those operations in 8 bit and 16 bit run very fast compared to what we can do in 64 bit floating point operations.
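The trade-off between exponent bits and fraction bits can be worked out directly. The sketch below assumes the standard bit layouts (IEEE 754 for FP64, FP32 and FP16; 8 exponent bits and 7 fraction bits for BF16) and an IEEE-style encoding in which the top exponent value is reserved. The 8 bit formats are left out because their encodings may not follow that pattern exactly. Function names are illustrative.

```python
# (exponent bits, fraction bits) for each format.
# BF16 is Google's bfloat16; the other rows follow IEEE 754.
FORMATS = {
    "FP64": (11, 52),
    "FP32": (8, 23),
    "FP16": (5, 10),
    "BF16": (8, 7),
}


def largest_finite(exp_bits: int, frac_bits: int) -> float:
    """Largest finite value, assuming IEEE-style bias and a reserved top exponent."""
    max_exponent = 2 ** (exp_bits - 1) - 1
    return (2.0 - 2.0 ** -frac_bits) * 2.0 ** max_exponent


def unit_spacing(frac_bits: int) -> float:
    """Gap between 1.0 and the next representable value (machine epsilon)."""
    return 2.0 ** -frac_bits


for name, (e, f) in FORMATS.items():
    print(f"{name}: max ~ {largest_finite(e, f):.3e}, epsilon = {unit_spacing(f):.3e}")
```

With these layouts BF16 reaches about the same range as FP32 but resolves fewer digits than FP16, which is the "gives up a little accuracy, wider dynamic range" trade described above.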
So that's caused us to rethink how we do our algorithms and whether we can leverage those lower precision computations in our algorithms themselves. So again, our algorithms have traditionally been written with 64 bit and 32 bit, and now we're looking at using 16 bit and maybe even 8 bit floating point operations to help our computations work through their process. And the way to think of this is we're trying to leverage the lower precision to get an approximation for the solution and then use the higher precision to reinforce or to increase the accuracy of the solution that we obtain. So we do something very fast in lower precision and then do something slower in higher precision to refine the solution to get up to a point where it's acceptable for our computations. So there's a whole area of research going on today in this mixed precision, trying to leverage lower precision computations to get a higher speed and then pass it off to a second stage which refines the solution to get this higher precision.
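One common way to realize this two stage idea is iterative refinement: solve in low precision, compute the residual in high precision, and solve again for a correction. The sketch below is an illustration under stated assumptions, not the method of any particular library. It simulates the low precision stage by rounding every intermediate result to 32 bit with the struct module (real mixed precision codes would use 16 bit hardware), uses plain Gaussian elimination without pivoting, and runs on a small hand-picked 3 by 3 system. All names are hypothetical.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (64 bit) to the nearest 32 bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_low_precision(A, b):
    """Gaussian elimination (no pivoting) with every result rounded to 32 bit."""
    n = len(b)
    M = [[to_fp32(v) for v in row] + [to_fp32(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        for i in range(k + 1, n):
            factor = to_fp32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = to_fp32(M[i][j] - to_fp32(factor * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = to_fp32(s - to_fp32(M[i][j] * x[j]))
        x[i] = to_fp32(s / M[i][i])
    return x


def residual(A, x, b):
    """r = b - A x, computed in full 64 bit precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, steps=5):
    """Low precision solve followed by 64 bit residual corrections."""
    x = solve_low_precision(A, b)
    for _ in range(steps):
        r = residual(A, x, b)              # slow, accurate stage
        d = solve_low_precision(A, r)      # fast, approximate stage
        x = [xi + di for xi, di in zip(x, d)]
    return x


def max_abs(v):
    return max(abs(t) for t in v)


A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]

x_low = solve_low_precision(A, b)
x_refined = refine(A, b)
print("residual after low precision solve:", max_abs(residual(A, x_low, b)))
print("residual after refinement:        ", max_abs(residual(A, x_refined, b)))
```

The expensive elimination runs only in the low precision stage; the high precision work is limited to the residual and the update, which cost far fewer operations.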
[31:09]
Shawn Falconer
Do you feel like that is the sort of most, I don't know, like practical path currently to reaching the next generation of supercomputers?
[31:19]
Jack Dongarra
I would say that it's a viable path that's being explored and being used today because the hardware is there, and the hardware is there not because of the scientific computations that we do in HPC, it's there because of AI. So we have to try to use that to get better accuracy and to also get better speed. So understand, with lower precision there's less communication, so we have less memory traffic because we're communicating, let's say, 16 bit words instead of 64 bit words. So that helps in terms of the data movement. So we have a smaller memory footprint as well if we're storing data in 16 bits. So we store fewer bits and the arithmetic operations go faster. So we usually think of going from 64 bit to 32 bit as a factor of two speed increase, and if you go to 16 bit, there's another factor of two. So we get much, much faster speeds. In terms of our floating point operations, we reduce memory traffic, so our data movement has been reduced. We don't have so much of a memory bottleneck and we can carry on our computations. I would say there's a concern though, and the concern is that because AI is so important, hardware manufacturers are providing those 16 bit, 8 bit. They're doing 32 bit. But the newer products, in fact, the Nvidia products have gotten to the point where the 64 bit floating point operations on the new hardware are really not very efficient. So it doesn't run very well. It runs slower than what we were seeing with the previous generation. So Nvidia has a number of products. The current thing is called Hopper. The new one is called Blackwell. It's just about to be released. And just to put this in perspective, on Hopper, the current generation that we have, the 64 bit operations run faster than on the newer processor called Blackwell. So run faster in terms of operations per second.
But in terms of 32 bit and 16 bit and 8 bit floating point operations, Blackwell really exceeds the performance from the Hopper. So less effort is being put into the 64 bit floating point operations and more effort, more efficiency is being gained by using those operations which are really there for AI purposes. So the scientific community needs to focus on the 32 bit and 16 bit or mixed precision.
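The memory footprint argument is simple arithmetic. A minimal sketch follows, taking the byte sizes of 64, 32 and 16 bit floats from Python's struct module; the problem size of one billion values is a hypothetical figure chosen only for illustration.

```python
import struct

# Bytes per value for 64, 32 and 16 bit IEEE floats, as reported by struct.
BYTES_PER_VALUE = {
    64: struct.calcsize("<d"),
    32: struct.calcsize("<f"),
    16: struct.calcsize("<e"),
}


def footprint_bytes(n_values: int, bits: int) -> int:
    """Memory needed to hold n_values numbers at the given precision."""
    return n_values * BYTES_PER_VALUE[bits]


n = 1_000_000_000  # hypothetical problem size: one billion values
for bits in (64, 32, 16):
    print(f"{bits} bit: {footprint_bytes(n, bits) / 1e9:.0f} GB")
```

Each halving of the word size halves both the storage and the bytes that have to move through memory for the same number of values.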
[34:03]
Shawn Falconer
Right. What do we give up by, you know, sort of over focusing on the AI specific workloads versus, you know, being able to continue to improve the performance of the 64 bit operations?
[34:16]
Jack Dongarra
So I would say that it's leading us to a point where the scientific users who need the accuracy are not going to see improvements in the next generation of machines. If they can get by with less accuracy, or if they can get by with a mixed precision algorithm, then they would be able to see the advantages of that newer architecture. So it's going to cause, you know, a shift again where people have to rethink how they implement things. So there's a refactoring that goes on of redesigning stuff around the architectures that we have. As I mentioned earlier, you know, we have to redesign our software as the architectures change, and the architectures change in a radical way, I'll say, every 10 years. So there's that reinventing or rediscovery that has to go on to refactor the software to effectively embrace the hardware that's being presented. And from a pessimistic standpoint, you know, the way I sometimes characterize it is we ask for a supercomputer and we ask for it to have a certain peak performance and we have a certain amount of money that we're willing to pay for that. So we bid a computer based on those things, the peak performance and the money that we have. And the result is that manufacturers produce a machine which sort of matches that peak performance and comes in at the budget that we have. But being able to use that machine requires a tremendous effort in terms of putting together algorithms that can fit onto the architecture that we have and trying to really embrace what's there. A better way of designing a machine would be using what's called co-design. So co-design says we get the architects together with the application people, with the software people and with the mathematicians, and we design a machine that can be effectively used by that group, a special purpose machine for the scientific applications. But unfortunately that'd be too expensive to manufacture.
Very few of those machines would be produced as a result of that. And it would go the way of the old vector computers, which are considered dinosaurs today.
[36:39]
Shawn Falconer
Are there some people that are, you know, using this kind of end to end co-design approach?
[36:46]
Jack Dongarra
Oh sure, yeah. So co-design was tried with the big supercomputers that we have working at the national laboratories. So the Department of Energy is highly engaged in using high performance computing for solving their most challenging problems. And they invest up to $600 million per computer. There are three large exascale computers. Exascale computers do 10 to the 18 floating point operations per second, and those are 64 bit floating point operations. Those machines are at Oak Ridge National Laboratory, Argonne National Laboratory and Lawrence Livermore National Laboratory. So those were machines that were recently manufactured and put in place. They have co-design in the architecture, trying to match the architecture with the applications. But it's hard to redesign a machine while trying to get a machine that comes in at the appropriate budget. So again, all the machines are based on commodity processors with the x86 architecture. Intel and AMD provide the basic instruction sets that are on those machines. And each of those machines has GPUs. The Argonne machine is using Intel as their processor, and Intel has a GPU which is being used with it. Oak Ridge and Livermore have an AMD processor with an AMD GPU associated with it. And those machines are being used to drive our large scale scientific computations today.
[38:27]
Shawn Falconer
Yeah, I wanted to dive in a little bit on this topic of exascale. So as you mentioned, it's 10 to the 18 flops. I still remember when the first 1 GHz processor came out. You're talking about 10 to the 9. And now with modern chips, like, you know, on the latest iPhone, you're starting to hit 10 to the 12, in the teraflop range. We still have petaflop and then exaflop. I mean, what's it take to kind of reach that raw computational power? Is it all about the architecture or are there other things that have to go on?
[38:58]
Jack Dongarra
Well, of course it's based on architecture. The architecture has to be able to support that. So, you know, we think about our computers today and, you know, those computers are potentially incredible in terms of their capability. So we think about the machine at, let's say, Livermore National Lab. So it has a peak performance of 2.7 exaflops. So that's the theoretical peak performance for 64 bit floating point operations. That machine has a large number of processors and a large number of nodes. And each node of the machine is composed of three AMD CPUs. So think about commodity AMD CPUs. Each one of those has eight cores. So three eight core CPUs plus three AMD GPUs. And those things are put together and that's considered a node of a machine. And the full machine has over 11,000 of those nodes. And those nodes are connected by a high speed interconnect that allows them to pass information from one node to another. So we have to connect 11,000 nodes in this machine to effectively put everything together. The machine consumes 34 megawatts of power. So what's a megawatt? So at my home here in Tennessee, if I use 1 megawatt over the course of a year, I'll get a bill from the electric company for $1 million. So a megawatt year is about a million dollars. So this machine at 34 megawatts, that translates into $34 million just to turn the thing on. So you can see how these machines are enormous. The floor space is about two tennis courts. So think of two tennis courts with these 11,000 nodes, with an interconnect trying to make things work correctly. So I mentioned that it has a peak performance of 2.7 exaflops, but that's for doing 64 bit floating point operations. If your application can use 16 bit, and again 16 bit is in the GPUs that are there really to help with AI computations and machine learning things and data analysis, you can reach 17 exaflops.
So that's sort of the difference between 64 bit and 16 bit floating point operations. 2.7 compared to 17. There's a big potential there if you could use that power. Now the machine has all of these components in it, but the GPUs are really where the performance comes in. 99% of the performance of this system is based on the GPU performance. So it's critical that the applications run in parallel. There's 11,000 nodes. It's critical that the applications use the GPUs because 99% of the performance is coming from the GPUs. And it would be critical that the applications, if they can use mixed precision, benefit from some of those 16 bit floating point operations that are the potential from this architecture. So these are incredible systems. I think of these machines as similar to the Webb Telescope or the Hubble Telescope. You know, it's a tremendous instrument for doing scientific discovery and we need to use them effectively to get to the potential that they really have.
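The figures quoted in this answer combine into two quick calculations: the yearly electricity bill and the ratio between 16 bit and 64 bit peak performance. The sketch below uses only the numbers stated in the conversation, including the rule of thumb of about one million dollars per megawatt year; the variable names are illustrative.

```python
POWER_MEGAWATTS = 34                      # power draw quoted in the conversation
DOLLARS_PER_MEGAWATT_YEAR = 1_000_000     # rule of thumb quoted in the conversation
PEAK_FP64_EXAFLOPS = 2.7                  # 64 bit theoretical peak
PEAK_FP16_EXAFLOPS = 17.0                 # 16 bit theoretical peak

annual_power_cost = POWER_MEGAWATTS * DOLLARS_PER_MEGAWATT_YEAR
fp16_over_fp64 = PEAK_FP16_EXAFLOPS / PEAK_FP64_EXAFLOPS

print(f"Annual electricity: ${annual_power_cost:,}")
print(f"16 bit peak is about {fp16_over_fp64:.1f}x the 64 bit peak")
```

The ratio comes out a little above six, which is the "big potential" available to applications that can move some of their work to 16 bit.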
[42:39]
Shawn Falconer
And thinking about the future of like HPC and moving beyond some of the types of architectures you're talking about, there's things like quantum computing, which there's a lot of hype around and Google, IBM have big investments there. What are your thoughts on potential of quantum? Is it overhyped?
[42:58]
Jack Dongarra
Of course all new things are overhyped. So we have to give them some leeway here in doing the hyping. So quantum computers have great potential for the future. They're not going to replace the conventional computers that we have, they're going to augment them. So it's a great area for research, is the way I look at it. And all of the hype, I have to assume that that's going to happen. Most radical changes that we've seen have gone through this hype that we're experiencing with quantum. And there'll be some disappointments because the hype will not be reached, of course, but there's great potential. Today we really have only a handful of algorithms, just a few algorithms that can be expressed in terms of the use of quantum computing. And we need to develop more algorithms and we need to understand, you know, where that fits in in terms of the way in which we are doing our computations. So I think about computers today, the physical computers, and we have CPUs which are commodity processors, augmented by GPUs which have this ability to do AI computations and lower precision very, very efficiently. And, you know, we potentially have the ability to augment that with a quantum based thing that we can add onto it. So I see the supercomputers of the future having again commodity processors, GPUs and other things attached, one of which could be a quantum computer. Another thing we might think about is a neuromorphic computer. We might think about an optical computer, we might think about some DNA based computational device, all being used in some way where it fits the applications that we intend to run on that machine. Putting a quantum computer today on a supercomputer, you know, is a good experiment. It's good for research, it's good for those applications that could benefit from that quantum thing. But the quantum computer is not going to replace the computers that we have today.
It's not going to replace my laptop in my lifetime anyway. I think it's a great research endeavor. It's overhyped, but we have to give them some leeway in terms of the hype and it's a great area for new ideas emerging.
[45:19]
Shawn Falconer
Yeah, I remember hearing about DNA based computing like probably when I was in graduate school, which now is 15 plus years ago, but I don't think I've heard anything since. What is the state of the art there?
[45:31]
Jack Dongarra
Yeah, so I mentioned it. I don't really know what the basis is for DNA computing today or where they sit in terms of the potential of doing anything really reasonable. I'm not an expert in DNA computing.
[45:45]
Shawn Falconer
Okay.
[45:45]
Jack Dongarra
Yeah.
[45:45]
Shawn Falconer
I was just curious. So looking back on your career, you've made some really big contributions to science, computer science, research world that have had long lasting impact. Like, you know, looking back at your career, like what do you think is the project that you're, you know, either most proud of or you feel like had the most impact?
[46:06]
Jack Dongarra
Well, so I like to think I've contributed to three areas. I've contributed to the design of numerical libraries for dealing with linear algebra. And those libraries have undergone changes; they've tried to follow the architectures that were coming out. So again there's these new architectures that emerge every 10 years, and the software and the libraries have to change or adapt to the architectures that are there. So one of the things I feel I've contributed is this refactoring of software that has followed the evolution of the hardware, trying to match hardware in some way. The second thing I've contributed to is helping to design and come up with a community based way of doing message passing. So in the early days, we had computers that were parallel and that had to pass information from one point of the computer to another point. And each manufacturer had their own way of doing that, and each research group had their own way of implementing that. So we had what I think of as the wild west, where we had many, many ideas and many things that were tried. And I was able to help put the community together to come up with a standard for doing message passing for scientific computations. That standard is called MPI, the Message Passing Interface. So a small group of talented researchers got together and within a year and a half put together something that became the community standard that is in use today and has been used for the last 20, 25 years in doing our computations. So that's a major thing that has had an impact on the fabric of how our scientific computing is done. And the third thing relates to performance evaluation. So again, I was instrumental in putting in place this thing called the Linpack benchmark, which is the basis for the Top 500 that we put together, and that has evolved to other testing mechanisms. I mentioned the HPCG benchmark, which does something more relevant to our computation today.
And that has actually evolved into another benchmark which uses mixed precision, trying to have a benchmark which leverages the 64 bit, 32 bit, 16 bit computations in a way that tries to expose what the potential is if applications can effectively use that level of performance. So I would say those are the three things. The numerical libraries that have followed the architecture trends, putting in place a standard which is used by the community for doing message passing, and then an effort for doing performance evaluation through benchmarks.
[49:15]
Shawn Falconer
Awesome. Well, Jack, thanks so much for being here. I feel like I could pick your brain all day, but I'm sure you have other things that you need to do besides talk to me, but it's been a real honor.
[49:23]
Jack Dongarra
Great. Very good, Shawn. Thanks for the opportunity.
[49:26]
Shawn Falconer
Cheers.