Startup Says It Can Make a 100x Faster CPU
In an era of fast-evolving AI accelerators, general-purpose CPUs don’t get a lot of love. “If you look at the CPU generation by generation, you see incremental improvements,” says Timo Valtonen, CEO and co-founder of Finland-based Flow Computing.
Valtonen’s goal is to put CPUs back in their rightful, ‘central’ role. To do that, he and his team are proposing a new paradigm: instead of trying to speed up computation by putting 16 identical CPU cores into, say, a laptop, a manufacturer could put 4 standard CPU cores and 64 of Flow Computing’s so-called parallel processing unit (PPU) cores into the same footprint and achieve up to 100 times better performance. Valtonen and his collaborators laid out their case at the IEEE Hot Chips conference in August.
The PPU provides a speed-up in cases where the computing task is parallelizable, but a traditional CPU isn’t well equipped to take advantage of that parallelism, and offloading to something like a GPU would be too costly.
“Typically, we say, ‘okay, parallelization is only worthwhile if we have a large workload,’ because otherwise the overhead kills a lot of our gains,” says Jörg Keller, professor and chair of parallelism and VLSI at FernUniversität in Hagen, Germany, who is not affiliated with Flow Computing. “And this now changes towards smaller workloads, which means that there are more places in the code where you can apply this parallelization.”
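Keller’s point can be made concrete with some illustrative, purely hypothetical numbers. Suppose a loop takes 10 microseconds to run on one core, and handing it off to 8 cores costs a fixed 5 microseconds of coordination overhead. Even with perfect scaling, the parallel version takes 10/8 + 5 ≈ 6.3 microseconds, a modest win; if the overhead were 20 microseconds, parallelizing would be a net loss. Cut the overhead far enough, and even small loops become worth parallelizing.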
Computing tasks can roughly be broken up into two categories: sequential tasks, where each step depends on the outcome of a previous step, and parallel tasks, which can be done independently. Flow Computing CTO and co-founder Martti Forsell says a single architecture cannot be optimized for both types of tasks. So, the idea is to have separate units that are optimized for each type of task.
“When we have a sequential workload as part of the code, then the CPU part will execute it. And when it comes to parallel parts, then the CPU will assign that part to the PPU. Then we have the best of both worlds,” Forsell says.
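As a rough illustration of the distinction, consider the two loops below (generic C++, not Flow Computing’s code): in the first, every iteration needs the previous result, so the work is inherently sequential; in the second, the iterations are independent and could be spread across many cores.

    #include <vector>

    // Sequential: each iteration depends on the result of the one
    // before it, so the steps cannot run at the same time.
    double smooth(const std::vector<double>& x) {
        double s = 0.0;
        for (double v : x)
            s = 0.9 * s + 0.1 * v;  // s carries a chain of dependencies
        return s;
    }

    // Parallel: no iteration reads another's result, so the work
    // can be split across any number of cores.
    void scale(std::vector<double>& x, double k) {
        for (double& v : x)
            v *= k;  // each element is updated independently
    }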
According to Forsell, there are four main requirements for a computer architecture that’s optimized for parallelism: tolerating memory latency, which means finding ways to not just sit idle while the next piece of data is being loaded from memory; sufficient bandwidth for communication between so-called threads, chains of processor instructions that are running in parallel; efficient synchronization, which means making sure the parallel parts of the code execute in the correct order; and low-level parallelism, or the ability to simultaneously use the multiple functional units that actually perform mathematical and logical operations. For Flow Computing’s new approach, “we have redesigned, or started designing an architecture from scratch, from the beginning, for parallel computation,” Forsell says.
Any CPU can potentially be upgraded
To hide the latency of memory access, the PPU implements multithreading: when one thread issues a memory request, another thread can start running while the first waits for a response. To optimize bandwidth, the PPU is equipped with a flexible communication network, such that any functional unit can talk to any other one as needed, which also allows for low-level parallelism. To deal with synchronization delays, it utilizes a proprietary algorithm called wave synchronization that is claimed to be up to 10,000 times more efficient than traditional synchronization protocols.
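The latency-hiding idea can be sketched with a toy simulation (a generic model with made-up constants, not Flow Computing’s microarchitecture). A core round-robins among hardware threads, each cycle running whichever thread has its data back from memory:

    #include <cstdio>
    #include <vector>

    constexpr int MEM_LATENCY = 4;  // cycles before a load's data comes back
    constexpr int WORK_ITEMS  = 8;  // compute steps each thread performs

    struct HwThread {
        int remaining = WORK_ITEMS;  // compute steps still to do
        int ready_at  = 0;           // cycle at which its pending load returns
    };

    // Round-robin core: each cycle, run one thread whose data is ready.
    // Every compute step issues a new load, so a lone thread must stall
    // for MEM_LATENCY cycles between steps; extra threads fill those gaps.
    int run(int num_threads) {
        std::vector<HwThread> threads(num_threads);
        int cycle = 0, done = 0, next = 0;
        while (done < num_threads) {
            for (int i = 0; i < num_threads; ++i) {
                HwThread& t = threads[(next + i) % num_threads];
                if (t.remaining > 0 && t.ready_at <= cycle) {
                    --t.remaining;                     // one cycle of useful work
                    t.ready_at = cycle + MEM_LATENCY;  // then wait on memory
                    if (t.remaining == 0) ++done;
                    next = (next + i + 1) % num_threads;
                    break;                             // one thread runs per cycle
                }
            }
            ++cycle;  // if no thread was ready, this was a stall cycle
        }
        return cycle;
    }

    int main() {
        std::printf("1 thread:  %d cycles\n", run(1));  // stalls dominate
        std::printf("5 threads: %d cycles\n", run(5));  // latency fully hidden
    }

Run with the numbers above, the single-threaded core spends most of its cycles stalled waiting on loads, while five threads keep the core doing useful work on every cycle.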
To demonstrate the power of the PPU, Forsell and his collaborators built a proof-of-concept FPGA implementation of their design. The team says that the FPGA performed identically to their simulator, demonstrating that the PPU functions as expected. The team performed several comparison studies between their PPU design and existing CPUs. “Up to 100x [improvement] was reached in our preliminary performance comparisons assuming that there would be a silicon implementation of a Flow PPU running at the same speed as one of the compared commercial processors and using our microarchitecture,” Forsell says.
Now, the team is working on a compiler for their PPU, as well as looking for partners in the CPU production space. They are hoping that a large CPU manufacturer will be interested in the product, so that the two companies can pursue a co-design. The PPU can be implemented with any instruction-set architecture, so any CPU can potentially be upgraded.
“Now is really the time for this technology to go to market,” says Keller. “Because now we have the necessity of energy-efficient computing in mobile devices, and at the same time, we have the need for high computational performance.”