Transcript
A (0:03)
You are walking down an alleyway late one night when suddenly a mysterious dark figure appears behind you and says: sir, I promise you a computer up to 20 times faster than what is on the market. You think, well, what sort of unholy, insane 2-nanometer silicon will this computer be using? No, the dark figure replies, ordinary hardware, same as what everyone else is using, maybe even a bit slower, clock-speed-wise. So what is the catch? If you accept this computer, then you must also accept some of the most brilliant but maniacal compiler software ever created. Oh. In today's video, we trace a radical idea to a computer 10 to 30 times faster than anything thought possible. They said it was impossible. An intrepid platoon of geniuses proved them wrong. This video is brought to you by the Asianometry Patreon.

I want to start with a video game that I liked to play during the pandemic called Overcooked. In this game, you play a chef in a relatively bizarre kitchen, and you must cook and serve finished plates of food to fulfill incoming customer orders. The recipes are largely simple. Two that I remember, and that are illustrative of where we are going, are the salad and the hamburger. Let's begin with the salad first. The basic salad is easy. Your chef character gets the lettuce from the bin, brings it to the chop station and chops it, sticks it onto a plate, and then delivers it to the customer. The burger is a bit more complex. You take a bun, lettuce, and beefsteak out of the bin. You must chop the lettuce and beefsteak. Do not chop the bun. However, the beefsteak also has to be cooked in the oven. Do not cook the steak until you have chopped it first.

The world of the CPU is a little like Overcooked. A CPU is a kitchen that takes in raw data inputs, the ingredients, and transforms them to get finished outputs. The various steps to produce a salad or burger dish are instructions. An instruction is essentially just a string of ones and zeros that tells the CPU what specific action to take and on what data. Get the lettuce, chop the meat, plate the burger. The CPU's life is to fetch instructions out of memory, decode them, and then execute them. There are more steps in this life cycle, but fetch, decode, and execute are the basics.

When a programmer writes software, or vibe codes it using Claude Code, they are most often writing that program in a higher-level language like C or Fortran that they can read. But the CPU hardware can't read that. So the vibe coder's program must be translated, or compiled, into instructions that the CPU hardware can execute: chop the tomato, chop the lettuce, plate the salad. This is done by a special piece of software called the compiler. The compiler can also do a whole bunch of other stuff, but let me get to that later. Put together, all the instructions that a CPU can handle make up what we call the instruction set. In a way, we can consider the CPU itself the literal physical manifestation of its instruction set. After all, what sits inside the kitchen tells you what that kitchen is capable of.

Now, this all makes sense so far, right? So how might we get our food faster? A simple way is just to make the chefs move around faster. By that, I mean raising the CPU's clock speed. Straight Moore's Law: we shrink the transistors so signals travel faster from source to drain. The fabs were already doing that. But is there anything else? Early in computer history, various scientists proposed concepts to speed up operations with a little cleverness.
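Before we get to those ideas, here is what that fetch-decode-execute cycle looks like as a toy interpreter in C. This is a minimal sketch of my own: the three-opcode instruction set and its four-field encoding are invented for illustration, not any real CPU's.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy encoding: each 32-bit word is [8-bit opcode | 8-bit dest | 8-bit src1 | 8-bit src2]. */
enum { OP_HALT = 0, OP_ADD = 1, OP_MUL = 2 };

int main(void) {
    uint32_t program[] = {
        (OP_ADD << 24) | (2 << 16) | (0 << 8) | 1,  /* r2 = r0 + r1 */
        (OP_MUL << 24) | (3 << 16) | (2 << 8) | 2,  /* r3 = r2 * r2 */
        (OP_HALT << 24),
    };
    int32_t reg[8] = {3, 4};  /* r0 = 3, r1 = 4 */
    uint32_t pc = 0;          /* program counter */

    for (;;) {
        uint32_t word = program[pc++];            /* fetch   */
        uint8_t op = word >> 24, d = word >> 16,  /* decode  */
                a = word >> 8,  b = word;
        if (op == OP_HALT) break;                 /* execute */
        if (op == OP_ADD) reg[d] = reg[a] + reg[b];
        if (op == OP_MUL) reg[d] = reg[a] * reg[b];
    }
    printf("r3 = %d\n", reg[3]);  /* (3 + 4) squared = 49 */
    return 0;
}
```

One instruction per loop trip, strictly in order. Everything in this story is about breaking out of that one-at-a-time straitjacket.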
Let us consider the burger. You might decide to do all the steps, bun, lettuce, meat, one after the other. But if you think about it, we have time spent doing nothing while waiting for the meat to cook. Why not use that time to chop the lettuce? So let us do the meat first, then chop the lettuce while it cooks. This is parallelism, executing various instructions out of order for faster overall processing. Specifically, this is a category of techniques that we call instruction-level parallelism, or ILP. I hope this made sense. I am now leaving the Overcooked metaphor behind before it overcooks.

The trouble with parallelism, however, is that many instructions depend on the outputs of their priors. You cannot grill the burger meat unless you have chopped it first. You can't decode something until you have fetched it first. You cannot add A and B together until we know what A and B actually are. We generally break up a program's code into blocks. Each block is usually only a few instructions long, just about six. Each block tends to end with a conditional branch, like an if-else statement, function call, or loop. The name tells it all: the conditional branches create branching paths in the code. The presence of such branches means that we don't know what instructions will run or what data they will run on. Because of this, we cannot run that many instructions in parallel. A famous 1970 paper looked into this and, with a few assumptions, concluded that program code on average contained only enough independent work to do about two operations at the same time. So that seemed to settle it. Parallelism can't do that much for us, right? But what if we decided to throw out those assumptions, free ourselves from all the legacy stuff? What then is possible?

In the late 1970s, a graduate student at the Courant Institute of Mathematical Sciences at NYU named Josh Fisher joined a project to build an emulator of the supercomputer CDC 6600. They called it PUMA. The CDC 6600 is an iconic computer, but PUMA would try to shrink it using modern integrated circuits. Fisher's role at the start involved making tools for computer-aided chip design: programs to lay out and route wires, as well as simulation. He then started working on the 64-bit microcode for the emulator. This microcode sought to break down pieces of program code originally meant to run linearly into smaller, simpler blocks that could be run in parallel. As he did this, it occurred to him that these two tasks, chip layout and code scheduling, were conceptually similar. Both ingest one-dimensional lists and put out two-dimensional maps or grids. A program producing a chip layout takes in a netlist, which is a simple one-dimensional list of an IC's transistors, and produces a layout, a two-dimensional map of each of those transistors' placements. With code scheduling, it was similar: take a list of operations normally performed linearly and turn it into a two-dimensional grid where those instructions can be run in parallel. This led him to an optimization technique known as trace scheduling.

Gosh, how am I going to do this one? Let me revisit the concept of a program being broken up into many blocks. Each block is maybe just a few instructions long and, as I said before, ends with a conditional jump like an if-else statement or loop. Such things create branching paths in the code. Since it had been assumed that we cannot predict the future state of the program, we are left to only search for parallelism opportunities within these tiny blocks. And there are not many opportunities there.
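To make the dependency idea concrete, here's a toy example in C. The kitchen functions are my own stand-ins, invented purely for illustration.

```c
#include <stdio.h>

/* Stand-ins for kitchen steps; the names and math are invented. */
static int chop(int x)            { return x + 1; }
static int grill(int x)           { return x * 2; }
static int assemble(int a, int b) { return a + b; }

int main(void) {
    int meat = 10, lettuce_raw = 20;

    /* Independent operations: neither needs the other's result,
     * so parallel hardware could run both in the same cycle. */
    int patty   = chop(meat);
    int lettuce = chop(lettuce_raw);

    /* A dependent chain: grill() needs chop()'s output, and
     * assemble() needs grill()'s, so these must run in order. */
    int cooked = grill(patty);
    int burger = assemble(cooked, lettuce);

    /* A conditional branch ends the basic block: until we know
     * which way it goes, we don't know what instructions run next. */
    if (burger > 40)
        printf("serve burger: %d\n", burger);
    else
        printf("remake order: %d\n", burger);
    return 0;
}
```

A wide machine could do the two chops simultaneously, but the grill-then-assemble chain and the branch at the end put a hard ceiling on how much it can overlap. That ceiling is what the 1970 paper measured.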
Trace scheduling busts through this assumption. The compiler traces through the entire program as if it were a single block, and predicts a likely code execution path using either heuristics or actual performance information. In essence, it assumes that the program's conditional jumps do not exist. This magical compiler then aggressively schedules all the instructions in that trace, moving up anything that can be parallelized. These instructions are bundled into a very long instruction word. Sometimes the trace gets it right. For example, with a loop, it is most likely that the program just jumps back to the start of the loop again, so the compiler can presume that is the case. If the trace gets it right, then we blast through the program like the Millennium Falcon, achieving parallelism speedups of 10 to 30 times, far beyond what was previously thought possible.

So what happens if the trace doesn't get it right? With existing CPUs, there are extra circuits or hardware to fix such compiler mistakes on the fly. It is easier for them because they know the variables at runtime. Trace scheduling eschews that, because it means more complex hardware, which goes against the philosophy of shifting complexity to the software. So we need something else. Imagine we're cave diving, and I will never go cave diving, and we come across a divergence. We might attach a string or line so we can backtrack. Compensating code is the compiler's cave-dive string. If the compiler decides to move up an instruction, it must add code so that if the trace goes wrong, the computer can backtrack or redo. The compiler might also continue on tracing a different path, hoping that it eventually returns to the fold. There is a real risk of code bloat, where the compiler adds so much compensating code that the thing cannot achieve the promised performance goals. Hopefully you can see why this can get a little tricky. We must guess the program's future before we run it. The compiler practically has to be a time-traveling Mary Sue.

After finishing his PhD, Fisher found a tenure-track job at Yale, moving there in 1979. Soon thereafter, he gathered a team of smart and talented students and tried to implement in software the trace scheduling technique that he had written about before. In 1980, Fisher consulted for General Electric, working to adapt trace scheduling for a computer made by a company called Floating Point Systems. But it ended up failing, because the hardware's fussiness and complexity made it difficult to achieve much parallelism. Embarrassed by the failure, Fisher dug up manuals for other computers like the CDC Cyberplus, and found that those, too, were too complicated to get the degree of parallelism he wanted. It always seemed like the highest possible parallelism speedup was about two to three times. Fisher got frustrated, and eventually he came to believe that the only way to get the gains he sought was to implement both the compiler and hardware simultaneously. And he started thinking about what kind of architecture such a computer would need to have.

In 1982, Fisher submitted a paper to the International Symposium on Computer Architecture about his work. It is titled "Very Long Instruction Word Architectures and the ELI-512," and details his new trace-scheduling-centric computer, the ELI-512. The ELI stands for Enormously Long Instructions, and also serves as an inside joke for Yale people. It was a simplified RISC device, but modified to run a pack of multiple instructions.
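Since compensating code is the heart of the trick, here's a quick sketch of the idea in C. A toy example of my own, not actual Bulldog or Multiflow compiler output: the "compiler" hoists a computation above a branch it bets will be taken, then patches up the off-trace path.

```c
#include <stdio.h>

/* Original source: y is computed only on the hot path. */
int original(int x) {
    int y = 0;
    if (x > 0)          /* profile says: true the vast majority of the time */
        y = x * 7;      /* expensive op stuck behind the branch */
    return y + 1;
}

/* What trace scheduling effectively does: hoist the multiply above
 * the branch so it can overlap with earlier work. If the guess was
 * wrong, compensating code on the off-trace path undoes the damage. */
int scheduled(int x) {
    int y = x * 7;      /* hoisted: speculatively computed early */
    if (x <= 0)         /* off-trace: the bet did not pay off    */
        y = 0;          /* compensating code: restore the state  */
    return y + 1;
}

int main(void) {
    printf("%d %d\n", original(5),  scheduled(5));   /* 36 36 */
    printf("%d %d\n", original(-3), scheduled(-3));  /*  1  1 */
    return 0;
}
```

Same answers either way; the scheduled version just pays a small cleanup cost when the trace guesses wrong. Multiply that by an entire program and you can see both the promise and the code-bloat risk. Okay, back to the ELI-512 paper.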
Fisher writes: "Everyone wants to use cheap hardware in parallel to speed up computation. One obvious approach would be to take your favorite reduced instruction set computer, let it be capable of executing 10 to 30 RISC-level operations per cycle, controlled by a very long instruction word. In fact, call it a VLIW. A VLIW looks like very parallel horizontal microcode." The paper's real audacity was less its hardware than its trace-scheduling-enabled compiler, named Bulldog. Today this paper is acknowledged as one of the greats, but when it was released, people greeted it with polite incredulity. A graduate student at Carnegie Mellon named Bob Colwell recalls reading the VLIW paper and thinking: he wants to do what with a compiler? This guy is nuts. He wants to move code all over the place and then make up for that intentional code misbehavior with yet more code. You'll never get away with it, it seemed to me. And even if you can patch things back up, the new overhead will kill performance, he thought. The complexity would eat them all alive. Thusly, it was inevitable that Colwell would, later on down the line, work with Josh Fisher.

As the months passed, Fisher discovered a curious effect. Whenever he came to talk about the VLIW computer, people filled the room, mostly to tell him that he was wrong and that his computer was impossible. The energy was electric. By contrast, whenever he came to talk about the trace scheduling compiler technique, it was crickets, with gentle general agreement that, why yes, this can work. He later reminisced: you can get more people, a lot more people, to come to your talk if you promise them bizarre-sounding hardware instead of a compiler technique. End quote. To Fisher, this made little sense, because to him, the hardware was the easy part. The compiler does all the hard work of arranging and scheduling the instructions across blocks for max parallelism. The hardware just does whatever the compiler tells it. Get the compiler right, and the rest falls into place. And why does this approach feel so alien to people? Is this not what John Cocke, of RISC fame, also advocated?

Fisher started writing papers evangelizing the VLIW approach with provocative titles that got the people going. It brought him notoriety. But young and restless, he soon realized that he wanted more. People kept telling him that his computer was impossible. He wanted to prove them wrong. But building such a computer from scratch took more resources than could be found within academia. And the big computer companies like IBM or DEC seemed to have no interest in funding something so unproven as the VLIW technique.

The early 1980s were an interesting time for hardware startups. In 1979, new US laws allowed pension funds to invest in venture capital funds, greatly expanding their assets under management. Total venture capital funds would grow tenfold between 1980 and 1989. A lot of this VC funding went into computer startups seeking to challenge entrenched players or just try new things. Apollo Computer, Silicon Graphics, Compaq, Thinking Machines: they were all founded around this time. As the VC boom grew, computer science academics began leaving to found their own computer hardware startups, much like how many university AI researchers today are leaving to found their own labs. After five months of pondering and consulting with family and colleagues, Fisher too decided to leave Yale and do a startup.
He was joined by his graduate student John Ruttenberg and systems manager John O'Donnell. Since this was the VC boom, they naturally wanted to take VC money. But in late 1983, they met with Apollo Computer, the famed workstation maker. Apollo offered to fund the VLIW computer's development. Once it was done, Apollo would market it. To get started, the company offered a $500,000 loan.

Now for a name. The obvious one was VLIW Technology, but there was a company out there called VLSI Technology, too. And Josh Fisher wanted something warm and fuzzy, because the term VLIW by then had already gained a little notoriety. They pondered Mercury, because it was in a similar line as Apollo, and Elm City Supercomputer, because that was their city. In the end, Ruttenberg coined the name Multiflow, which seemed to convey the visual of logic flowing through the computer. Despite concerns that people might think their company drilled for oil or produced high-tech toilets, they chose it. And thus, in April 1984, Multiflow got started. Just six months later, the Apollo deal collapsed after its CEO was replaced. With money running low, Fisher and the other co-founders went to the VCs, personally borrowing money to keep paying employees. In February 1985, they closed a $7 million round. Another $26 million would be raised over two rounds in the next two years. By the way, before I continue, I want to highly recommend the book Multiflow: A Startup Odyssey by Elizabeth Fisher. Elizabeth is Josh's wife, and thus had a front-row view of the whole saga. It is fantastic. Read it for an in-depth look into life at a startup.

Multiflow sought to develop and sell high-performance computers for the scientific and engineering markets. These silicon monsters do the most complex, time-consuming calculations, often with many decimal points of accuracy. They also cost upwards of tens of millions of dollars, which restricted use to the biggest government labs on a timeshare basis. What if we could produce a computer with a significant percentage of a supercomputer's performance at a fraction of the price and size? This would make more compute power available to companies who needed it to run increasingly specialized calculations. Now, there always had been small supercomputers, but those truly were supercomputers. Made by traditional supercomputer makers like Cray or Fujitsu, they were still quite hefty. Then, in 1985, a small startup in Texas called Convex released the C1 computer. They marketed it as a mini supercomputer, or mini-super, or super minicomputer, and called it a new category of computer. Some people dubbed them Crayettes, which is funny to me. Enabled by advancing VLSI semiconductor technology, the C1 was less a small supercomputer than a souped-up minicomputer. That made them more of a threat to the famed minicomputer maker DEC than to Cray. But anyhow, the category took off with a wave of new entrants. There were a lot of these guys. There was Alliant Computer Systems, founded the same year as Convex, and Scientific Computer Systems, which booted up in 1983. And then after that, a half dozen new companies like Cydrome, Gould, and of course, Multiflow. Multiflow would arrive in the second wave of the mini-super boom, meaning that they had to take on several established players. They had to turn the ELI-512 paper's concepts into a working, manufacturable product, then convince actual customers that the impossible computer was indeed real and worth adopting.
Over the span of two years, the team frantically worked days, nights, and weekends until 2 a.m. or later to put together their first computer, the TRACE 7/200. The VLIW philosophy says to keep the hardware simple, so that it can be manufactured fast and at scale. The TRACE computer had multiple execution units to do arithmetic and logic, floating point math, and loading from and storing to memory, plus a conditional branch unit. If the computer was to be highly parallel, if you wanted to give the compiler the greatest freedom to do whatever, whenever, then the device also had to be unusually interconnected. Fisher's original paper showed a rough sketch of his vision. The global interconnection had 16 clusters, which contain the various execution units together with their memories. They are interconnected both to their sides and across, with buses or wires. The whole thing looks vaguely abyssal, as if we are trying to summon the VLIW demon from the architectural depths.

Manufacturing such a complicated structure conjured similar terrors. The TRACE computers targeted scientific use cases, which demanded larger 64-bit double-precision floating point. That is a lot of data. So, very wide buses: 64 separate copper pins plus the control signals on a connector, and dozens of buses. Remember to count the buses to the sides as well as across. That adds up to thousands of pins. The persistent proliferation of petite pins occasionally made it difficult to plug them all into the computer's backplane. The aforementioned Bob Colwell writes that the hardware lab had a big, heavy, world-weary rubber mallet to coax these very expensive pins into their place. They called it the Persuader.

Elizabeth Fisher relates a story that happened as the hardware design approached its ship date. It was the night before Christmas, I mean, the deadline, and the hardware team was trying to fit all the memory registers into the computer, which was a struggle because the hardware had to be so interconnected and space was so limited. If you recall, the register refers to the high-speed memory holding the chip's runtime variables as it does stuff. The spec called for the registers and arithmetic units to be fully interconnected. But the hardware team couldn't fit and connect all that together. There simply was no space. Desperate to make the deadline, and with no one on the compiler team around, the hardware people saw no choice but to split the pair of register chips, attaching one each to the outsides of the two arithmetic units. The compiler team was infuriated, because this in certain cases can create a split-brain problem where the two arithmetic units see different things and disagree. But it was too late. The silicon was locked in. Fortunately, it wasn't catastrophic.

But the core of the TRACE series was not the hardware. It was the software: the compiler. As I mentioned, the hardware is so simple because the compiler takes on all the complexity. Anything that could be shifted over, was. Each cycle during operations, the CPU fetches its very long instruction word from memory, with the multiple instructions that the compiler bundled together. After the word unbundles, its instructions go straight to the units. For this first 7/200 computer, each word had up to seven instructions bundled together and is about 256 bits large.
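To visualize such a word, here's a rough sketch of it as a C struct. The slot mix and field widths are my own invention for illustration; this is not Multiflow's actual 256-bit encoding.

```c
#include <stdint.h>

/* One operation slot per execution unit. Field widths invented. */
typedef struct {
    uint8_t opcode;  /* what to do            */
    uint8_t dest;    /* destination register  */
    uint8_t src1;    /* first source operand  */
    uint8_t src2;    /* second source operand */
} Slot;              /* 32 bits per slot      */

/* A very long instruction word: seven slots issued together each
 * cycle, one per unit. If the compiler could not find work for a
 * slot, it fills it with a no-op. */
typedef struct {
    Slot int_alu[2];  /* integer/logic units          */
    Slot fp_alu[2];   /* floating point units         */
    Slot mem[2];      /* load/store units             */
    Slot branch;      /* the conditional branch unit  */
    uint32_t pad;     /* fill the word out            */
} VLIW;               /* 7 x 32 bits + padding        */

_Static_assert(sizeof(VLIW) == 32, "one word should be 256 bits");
```

The point of the shape: the hardware has no scheduler of its own. Each cycle it fetches one of these words and fires every slot at its unit simultaneously, trusting that the compiler already resolved every dependency.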
Multiflow later released the 14/200, which packed together 14 instructions, as well as the 28/200, with 28 instructions in 1,024 bits. Without hardware circuits to direct the flow of data through the buses or resolve memory conflicts at runtime, like other CPUs have, it is all on the compiler to coordinate that. It's got to do everything. Modeled on the Yale Bulldog compiler, the Multiflow TRACE compiler turns programs written in Fortran or C into high-performance code for the computer. It does this over three phases.

In phase one, the compiler takes the Fortran or C code and turns it into an intermediate representation called IL1. The idea is to capture language-specific rules and programmer intent for later phases. In phase two, the compiler takes the IL1 representation and reinterprets it again at a lower level for the machine. It runs an optimization step to reduce the amount of computation and increase the amount of parallelism. For example, the compiler addresses loops by unrolling them, copying the loop's body for some number of iterations as determined by some heuristic for the scheduler. After cleaning up some variable names, the unrolled loop is ready to be exploited for max parallelism. I'll show a small sketch of what that unrolling looks like in a moment. The output of phase two is another intermediate representation called IL2. We are now finally ready for phase three. This is where the actual trace scheduling algorithm is run and instructions are scheduled. As I said earlier, the algorithm runs through the program code and guesses a likely path using heuristics or profile data given to the compiler by the user. After scheduling, the algorithm will insert compensation code to cover any potential off-trace branches.

The result is a compiler that enables the TRACE computers to outperform a RISC-based MIPS computer in well-known benchmarks like LINPACK, anywhere from 2 to 10 times. Real-world performance, however, did depend on the individual program, which infuriated salespeople on both sides to no end. "Your mileage will vary" was a common refrain. It's definitely not perfect. One flagged issue was that the compiler runs very slowly, four times slower than one of DEC's RISC-based workstations, in part because the Multiflow compiler creates six representations of the program throughout its three phases. Compilation sometimes took up to three days. Nevertheless, it was a marvel. The trace scheduling algorithm worked. The Multiflow compiler team were wizards, and produced a software program that was surprisingly reliable for something that had to literally predict the future.

Multiflow debuted the TRACE series in April 1987 at a glitzy event at the World Trade Center. Multiflow lined up three beta customers, including the Supercomputing Research Center, a division of the US NSA. They all gave glowing endorsements. Grumman Data Systems said that the computer was running their software two hours after being uncrated. They then took the TRACE to a 1988 supercomputing conference held in Santa Clara. For years, people had told Josh Fisher that VLIW was impossible. But now here it was, running Unix and working like a real computer. The CAD chief at Sikorsky Aircraft said: to many of us, what the Multiflow people told us it could do seemed like black magic. But now not only do you have a reasonably priced supercomputer, but you don't have to rewrite software significantly. To convince people that they were not taking a risk on some radical architecture, Multiflow launched a massive PR and marketing program.
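Here's that unrolling sketch I promised: a minimal C example of what the phase-two transformation does conceptually. To keep it short, it assumes n divides evenly by four; a real compiler emits cleanup code for the leftovers.

```c
/* Before: one multiply per iteration, each fenced off by the
 * loop's back-edge branch, so there is little to overlap. */
void scale(float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] * 2.0f;
}

/* After unrolling by 4: four independent multiplies now sit in
 * one block between branches, which is exactly the kind of thing
 * the trace scheduler can pack into a single long instruction
 * word across the machine's parallel units. */
void scale_unrolled(float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        a[i]     = b[i]     * 2.0f;
        a[i + 1] = b[i + 1] * 2.0f;
        a[i + 2] = b[i + 2] * 2.0f;
        a[i + 3] = b[i + 3] * 2.0f;
    }
}
```

Same work, but the branches now arrive a quarter as often, and the scheduler gets four independent operations per block to play with. Anyway, back to that PR program.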
The campaign was masterminded by Brian Cohen, PR machine and future angel investor. He threw himself into the task, and so completely believed in it that he even named his first son Trace. It helped that the computer was blazing fast, too. The TRACE 7/200 did 53 million instructions per second and 30 million floating point operations per second. The follow-up 28/200 boasted specs four times higher than that. The story also sold well. Fisher was a willing subject with a compelling personal story, and the VLIW technology itself was intriguing. The notion of this compiler correctly guessing some 90% of the branches in a program is eye-catching. The computer's debut got covered by a wide variety of press outlets, including a full page in Business Week. All in all, it was a triumphant moment. The impossible computer was real.

Multiflow originally targeted scientific customers: university and government labs. Such labs wanted supercomputer-like performance at a fraction of the price. A Cray would cost maybe $5 million, as compared to a TRACE's $300,000. Customers could load in their Fortran-coded programs and let the compiler optimize them without additional modifications. Moreover, scientific application programs were thought to have lots of opportunities for parallelism. But the computer turned out to be very useful for commercial users, too. By the end of 1989, they sold about 100 machines to 75 customers, and about half of those were commercial, like P&G, Hewlett-Packard, Motorola, and more. In 1989, Multiflow released an upgraded line of machines, the /300 series, four times faster than their predecessors. Impressively, those gains were almost entirely achieved with an improved compiler. Just software.

Unfortunately, it was also too late. By then, the company was already in a financial tailspin that it would not escape from. The mini-super boom had attracted a rogues' gallery of players like Convex, Alliant, Cydrome, and DEC. You can count up to 20 vendors in the market, but in 1987, analysts had estimated the whole market size to be just about $350 million. Considering it might take $20 to $30 million to develop a mini-super, you don't need to think a long time to conclude that 20 vendors is too many. The presumption had been that the Crayettes would go after the real deal. But supercomputer vendors like Cray upped their game. At the high end, they added the Cray Y-MP, with way more power than the mini-supers could offer. Cray then shored up their low-end flanks with an extension of the older Cray X-MP. Though pricier than a mini-super at $14 million, it offered compelling price-performance for labs and weather stations that could not afford the Y-MP. Moreover, Cray defanged many of the mini-super vendors' top selling points by adopting a flavor of Unix called UNICOS in 1985. This grew their platform and gave national labs confidence that their applications would work on Cray hardware.

But it was on the low end where the most serious competition was: the killer micros. It is a phrase that emerged in the 1980s that refers to powerful Unix workstations equipped with highly integrated, single-chip CMOS CPUs. RISC chips like Sun's SPARC, IBM's RS/6000, MIPS, and even Intel's i860 were getting more powerful each year, subsuming computer categories that once existed, like minicomputers and the mini-supers. Put another way: convergence, driven by the CPU. Multiflow gained some benefits from having its CPU set up as clusters of discrete compute modules.
But this separation also meant that they could not benefit from the exponential scaling of Moore's Law, which granted both size and power advantages. And while Multiflow might have had the best software in the business, their hardware was painfully lacking. George Weiss at Gartner Group would say about it: the technology was good, they put tremendous effort into software, but they needed to duplicate that on the hardware side. The low-end i860 ran at 25 MHz and used just a few watts. The TRACE, on the other hand, ran at 8 MHz and needed multiple kilowatts, plus big copper buses to distribute that power. In the end, ever faster cycle times let the killer micros make up for any architectural disadvantage. And at $100,000, these workstations' price points could not be beaten. Frankly, big iron just fundamentally could not keep up with the greatest cost-scaling items in human history.

Analysts had once estimated that the mini supercomputer market would almost quadruple to over a billion dollars in 1991. That never happened. Instead, starting in the summer of 1988, the whole category began to implode. Vendors resorted to steep price cuts, and when that didn't work, exited stage left. Celerity Computing, a San Diego-based Unix vendor who tried to enter the space, fell apart and was acquired in 1988 by Floating Point Systems. But Floating Point itself was also struggling. Alliant, too, reported quarterly losses. Cydrome, the only other mini-super company pursuing VLIW, folded without commercially shipping a product. The losses came fast and hard.

When the US market started to crash in 1989, Multiflow found itself on the wrong course. The company lost money from the start, always one step behind Convex and Alliant. They took too long to enter international markets like Japan and Europe, not expanding there until it was way too late. In 1989, Weiss, the Gartner analyst, added that great but not game-changing performance gave the company little chance to overcome its ecosystem disadvantages. Multiflow never really delivered a dramatic performance leap, certainly not enough to grab market attention. Convex had a more aggressive sales force, a more sophisticated hardware platform, and had more software ported earlier in the game.

As 1989 came to a close, management increasingly focused on an acquisition by DEC as their last chance for survival. DEC deeply evaluated the VLIW technology as a potential platform for future work. And while the technology passed many of their evaluations, powerful voices both in and out of the company shot it down, particularly those behind a competing high-speed computer project called the VAX 9000. Around Christmas 1989, DEC senior management started to backpedal, saying that it could not do a deal right then, because adding Multiflow's expenses to the income statement would cause them to report their first financial loss. And then in March 1990, DEC told Multiflow that the acquisition deal was truly dead. Two of the company's venture capitalists tried to salvage it, but failed. There was no plan B. They were out of money. With that, the board decided that Multiflow should voluntarily liquidate. At the time, the company had about 160 employees. They gathered for a meeting the next day to hear the news, and then got to work disassembling the company. In the end, Multiflow the company did not find the economic success that it desired. But judging by what they were able to do and their influence on the computing world, they succeeded beyond their wildest dreams.
And ironically, going out of business perhaps helped spread its ideas farther and wider. The sheer amount of talent that they had gathered was shocking, considering how small they were. Fisher is a winner of the Eckert-Mauchly Award, widely acknowledged as the most prestigious citation for computer architecture. Another winner was the aforementioned Robert Colwell, who joined Intel and became the chief architect for iconic chips like the Pentium Pro, Pentium II, Pentium III, and Pentium 4 CPUs. The man is a legend. To many of these employees, or Multifloyds, as they called themselves, the company's failure had less to do with the architecture than the business environment. This thing worked, and was capable of incredible performance. So when the business wound down, the talent joined other companies like Hewlett-Packard, Intel, DEC, and others, and evangelized their ideas. So VLIW lived on, most famously, or infamously, with Hewlett-Packard and then, of course, Intel. But that is a story for another day. All right, everyone. That's it for tonight. Thanks for watching. Subscribe to the channel. Sign up for the Patreon, and I'll see you guys next time.
