Newest Google and Nvidia Chips Speed AI Training



Nvidia, Oracle, Google, Dell, and 13 other companies reported how long it takes their computers to train the key neural networks in use today. Among those results was the first glimpse of Nvidia’s next-generation GPU, the B200, and Google’s upcoming accelerator, called Trillium. The B200 posted a doubling of performance on some tests versus today’s workhorse Nvidia chip, the H100. And Trillium delivered nearly a four-fold boost over the chip Google tested in 2023.

The benchmark tests, called MLPerf v4.1, consist of six tasks: recommendation, the pre-training of the large language models (LLM) GPT-3 and BERT-large, the fine-tuning of the Llama 2 70B large language model, object detection, graph node classification, and image generation.

Training GPT-3 is such a mammoth task that it’d be impractical to do the whole thing just to deliver a benchmark. Instead, the test is to train it to a point that experts have determined means it is likely to reach the goal if you kept going. For Llama 2 70B, the goal is not to train the LLM from scratch, but to take an already trained model and fine-tune it so it’s specialized in a particular expertise—in this case, government documents. Graph node classification is a type of machine learning used in fraud detection and drug discovery.

As what’s important in AI has evolved, mostly toward using generative AI, the set of tests has changed. This latest version of MLPerf marks a complete changeover in what’s being tested since the benchmark effort began. “At this point all of the original benchmarks have been phased out,” says David Kanter, who leads the benchmark effort at MLCommons. In the previous round it was taking mere seconds to perform some of the benchmarks.

Performance of the best machine learning systems on various benchmarks has outpaced what would be expected if gains were solely from Moore’s Law [blue line]. Solid lines represent current benchmarks. Dashed lines represent benchmarks that have now been retired because they are no longer industrially relevant. Credit: MLCommons

According to MLPerf’s calculations, AI training on the new suite of benchmarks is improving at about twice the rate one would expect from Moore’s Law. As the years have gone on, results have plateaued more quickly than they did at the start of MLPerf’s reign. Kanter attributes this mostly to the fact that companies have figured out how to do the benchmark tests on very large systems. Over time, Nvidia, Google, and others have developed software and network technology that allows for near linear scaling—doubling the processors cuts training time roughly in half.

First Nvidia Blackwell training results

This round marked the first training tests for Nvidia’s next GPU architecture, called Blackwell. For the GPT-3 training and LLM fine-tuning, the Blackwell (B200) roughly doubled the performance of the H100 on a per-GPU basis. The gains were a little less robust but still substantial for recommender systems and image generation—64 percent and 62 percent, respectively.

The Blackwell architecture, embodied in the Nvidia B200 GPU, continues an ongoing trend toward using less and less precise numbers to speed up AI. For certain parts of transformer neural networks such as ChatGPT, Llama2, and Stable Diffusion, the Nvidia H100 and H200 use 8-bit floating point numbers. The B200 brings that down to just 4 bits.
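To give a feel for how coarse 4-bit floating point is, here is a minimal sketch that rounds a few values to the nearest number a 4-bit format can represent. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit), which the article does not specify. The function name `quantize_fp4` and the sample values are hypothetical, and real hardware also applies scaling factors that this sketch leaves out.

```python
# Magnitudes representable in a 4-bit E2M1 float (assumed format, not stated in the article).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Full grid: negative values, then zero and the positive values.
FP4_VALUES = [-m for m in reversed(FP4_MAGNITUDES[1:])] + FP4_MAGNITUDES

def quantize_fp4(x: float) -> float:
    """Round x to the nearest grid value, saturating at +/-6.

    Ties go to the lower value here; hardware rounding modes may differ.
    """
    return min(FP4_VALUES, key=lambda v: abs(v - x))

# Hypothetical values standing in for weights or activations.
weights = [0.07, 0.8, 1.3, 2.4, 5.1, 9.0]
quantized = [quantize_fp4(w) for w in weights]
print(quantized)
```

With only 15 distinct values available, small numbers collapse to zero and large ones saturate, which is why such formats are used only for the parts of a network that tolerate the loss.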

Google debuts 6th gen hardware

Google showed the first results for its 6th generation of TPU, called Trillium—which it unveiled only last month—and a second round of results for its 5th generation variant, the Cloud TPU v5p. In the 2023 edition, the search giant entered a different variant of the 5th generation TPU, v5e, designed more for efficiency than performance. Versus the latter, Trillium delivers as much as a 3.8-fold performance boost on the GPT-3 training task.

But versus everyone’s arch-rival Nvidia, things weren’t as rosy. A system made up of 6,144 TPU v5ps reached the GPT-3 training checkpoint in 11.77 minutes, placing a distant second to an 11,616-Nvidia H100 system, which accomplished the task in about 3.44 minutes. That top TPU system was only about 25 seconds faster than an H100 computer half its size.
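The two H100 results mentioned here allow a rough check of how close to linear the scaling is. The sketch below assumes the "half its size" H100 system had 3,072 GPUs and finished in about 11.77 minutes plus 25 seconds. Both figures are inferred from the wording above rather than taken from the published results table, and the helper name `scaling_efficiency` is hypothetical.

```python
def scaling_efficiency(n_small, t_small, n_large, t_large):
    """Ratio of accelerator-minutes used by the small and large systems.

    1.0 means perfectly linear scaling; lower means the larger system
    spends more total accelerator time to reach the same checkpoint.
    """
    return (n_small * t_small) / (n_large * t_large)

# Inferred from the text: an H100 system half the size of the 6,144-TPU entry,
# about 25 seconds slower than that entry's 11.77 minutes.
h100_small_gpus = 6144 // 2
h100_small_minutes = 11.77 + 25 / 60

# Stated in the text: 11,616 H100s reached the checkpoint in about 3.44 minutes.
efficiency = scaling_efficiency(h100_small_gpus, h100_small_minutes, 11616, 3.44)
print(f"Approximate scaling efficiency: {efficiency:.2f}")
```

Under these assumptions the efficiency comes out at about 0.94: the larger system has nearly four times the GPUs and finishes in a bit more than a quarter of the time.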

A Dell Technologies computer fine-tuned the Llama 2 70B large language model using about 75 cents worth of electricity.

In the closest head-to-head comparison between v5p and Trillium, with each system made up of 2048 TPUs, the upcoming Trillium shaved a solid 2 minutes off of the GPT-3 training time, nearly an 8 percent improvement on v5p’s 29.6 minutes. Another difference between the Trillium and v5p entries is that Trillium is paired with AMD Epyc CPUs instead of the v5p’s Intel Xeons.

Google also trained the image generator, Stable Diffusion, with the Cloud TPU v5p. At 2.6 billion parameters, Stable Diffusion is a light enough lift that MLPerf contestants are asked to train it to convergence instead of just to a checkpoint, as with GPT-3. A 1024 TPU system ranked second, finishing the job in 2 minutes 26 seconds, about a minute behind the same size system made up of Nvidia H100s.

Training power is still opaque

The steep energy cost of training neural networks has long been a source of concern. MLPerf is only beginning to measure this. Dell Technologies was the sole entrant in the energy category, with an eight-server system containing 64 Nvidia H100 GPUs and 16 Intel Xeon Platinum CPUs. The only measurement made was in the LLM fine-tuning task (Llama2 70B). The system consumed 16.4 megajoules during its 5-minute run, for an average power of about 54.7 kilowatts. That means about 75 cents of electricity at the average cost in the United States.
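The energy figures can be sanity-checked with a few lines of arithmetic. This sketch assumes the run lasted exactly 300 seconds and uses a hypothetical electricity price of 16.5 cents per kilowatt-hour; neither number comes from the MLPerf submission.

```python
ENERGY_JOULES = 16.4e6          # reported energy for the fine-tuning run
RUN_SECONDS = 5 * 60            # assumed: exactly 5 minutes
PRICE_PER_KWH_USD = 0.165       # assumed average U.S. price, for illustration

# Power is energy divided by time (watts), converted to kilowatts.
average_power_kw = ENERGY_JOULES / RUN_SECONDS / 1000
energy_kwh = ENERGY_JOULES / 3.6e6      # 1 kWh = 3.6 megajoules
cost_usd = energy_kwh * PRICE_PER_KWH_USD

print(f"Average power: {average_power_kw:.1f} kW")
print(f"Energy: {energy_kwh:.2f} kWh, cost: ${cost_usd:.2f}")
```

With these assumptions the cost comes to roughly 75 cents and the average power to roughly 55 kilowatts, which is plausible for 64 GPUs plus their host servers.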

While it doesn’t say much on its own, the result does potentially provide a ballpark for the power consumption of similar systems. Oracle, for example, reported a close performance result—4 minutes 45 seconds—using the same number and types of CPUs and GPUs.

Machine Learning Might Save Time on Chip Testing



Finished chips coming in from the foundry are subject to a battery of tests. For those destined for critical systems in cars, those tests are particularly extensive and can add 5 to 10 percent to the cost of a chip. But do you really need to do every single test?

Engineers at NXP have developed a machine-learning algorithm that learns the patterns of test results and figures out the subset of tests that are really needed and those that they could safely do without. The NXP engineers described the process at the IEEE International Test Conference in San Diego last week.

NXP makes a wide variety of chips with complex circuitry and advanced chip-making technology, including inverters for EV motors, audio chips for consumer electronics, and key-fob transponders to secure your car. These chips are tested with different signals at different voltages and at different temperatures in a test process called continue-on-fail. In that process, chips are tested in groups and are all subjected to the complete battery, even if some parts fail some of the tests along the way.

“We have to ensure stringent quality requirements in the field, so we have to do a lot of testing,” says Mehul Shroff, an NXP Fellow who led the research. But with much of the actual production and packaging of chips outsourced to other companies, testing is one of the few knobs most chip companies can turn to control costs. “What we were trying to do here is come up with a way to reduce test cost in a way that was statistically rigorous and gave us good results without compromising field quality.”

A Test Recommender System

Shroff says the problem has certain similarities to the machine learning-based recommender systems used in e-commerce. “We took the concept from the retail world, where a data analyst can look at receipts and see what items people are buying together,” he says. “Instead of a transaction receipt, we have a unique part identifier and instead of the items that a consumer would purchase, we have a list of failing tests.”

The NXP algorithm then discovered which tests fail together. Of course, what’s at stake for whether a purchaser of bread will want to buy butter is quite different from whether a test of an automotive part at a particular temperature means other tests don’t need to be done. “We need to have 100 percent or near 100 percent certainty,” Shroff says. “We operate in a different space with respect to statistical rigor compared to the retail world, but it’s borrowing the same concept.”
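The market-basket idea can be sketched as follows. This is only an illustration of the general concept, not NXP's algorithm: the part identifiers, test names, and the `redundant_tests` helper are all hypothetical. The sketch flags a test as a removal candidate only when every part that failed it also failed some other single test.

```python
from collections import defaultdict

def redundant_tests(failures_by_part, min_confidence=1.0):
    """Map each test to the other tests whose failure reliably accompanies it.

    failures_by_part: dict of part ID -> set of failing test names.
    A test A is "covered" by B when P(B fails | A fails) >= min_confidence.
    """
    # Invert the "receipts": for each test, which parts failed it?
    parts_failing = defaultdict(set)
    for part, tests in failures_by_part.items():
        for test in tests:
            parts_failing[test].add(part)

    covered = {}
    for a, parts_a in parts_failing.items():
        covers = [
            b for b, parts_b in parts_failing.items()
            if b != a and len(parts_a & parts_b) / len(parts_a) >= min_confidence
        ]
        if covers:
            covered[a] = sorted(covers)
    return covered

# Hypothetical continue-on-fail data: every part runs every test.
failures = {
    "part-001": {"leakage_hot", "leakage_cold"},
    "part-002": {"leakage_hot", "leakage_cold", "timing_1v0"},
    "part-003": {"leakage_hot"},
    "part-004": {"timing_1v0"},
}
candidates = redundant_tests(failures)
print(candidates)
```

In this toy data, `leakage_cold` is a candidate because every part that failed it also failed `leakage_hot`, while the reverse does not hold. A real analysis would need far more parts to support near-100-percent confidence.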

As rigorous as the results are, Shroff says that they shouldn’t be relied upon on their own. You have to “make sure it makes sense from engineering perspective and that you can understand it in technical terms,” he says. “Only then, remove the test.”

Shroff and his colleagues analyzed data obtained from testing seven microcontrollers and applications processors built using advanced chipmaking processes. Depending on which chip was involved, they were subject to between 41 and 164 tests, and the algorithm was able to recommend removing 42 to 74 percent of those tests. Extending the analysis to data from other types of chips led to an even wider range of opportunities to trim testing.

The algorithm is a pilot project for now, and the NXP team is looking to expand it to a broader set of parts, reduce the computational overhead, and make it easier to use.

“Any novel solution that helps in test-time savings without any quality hit is valuable,” says Sriharsha Vinjamury, a principal engineer at Arm. “Reducing test time is essential, as it reduces costs.” He suggests that the NXP algorithm could be integrated with a system that adjusts the order of tests, so that failures could be spotted earlier.

This post was updated on 13 November 2024.

U.S. Chip Revival Plan Chooses Sites



Last week the organization tasked with running the biggest chunk of the U.S. CHIPS Act’s US $13 billion R&D program made some significant strides: The National Semiconductor Technology Center (NSTC) released a strategic plan and selected the sites of two of three planned facilities. The locations of the two sites (a “design and collaboration” center in Sunnyvale, Calif., and a lab devoted to advancing the leading edge of chipmaking, in Albany, N.Y.) build on an existing ecosystem at each location, experts say. The location of the third planned center, a chip prototyping and packaging site that could be especially critical for speeding semiconductor startups, is still a matter of speculation.

“The NSTC represents a once-in-a-generation opportunity for the U.S. to accelerate the pace of innovation in semiconductor technology,” Deirdre Hanford, CEO of Natcast, the nonprofit that runs the NSTC centers, said in a statement. According to the strategic plan, which covers 2025 to 2027, the NSTC is meant to accomplish three goals: extend U.S. technology leadership, reduce the time and cost to prototype, and build and sustain a semiconductor workforce development ecosystem. The three centers are meant to do a mix of all three.

New York gets extreme ultraviolet lithography

NSTC plans to direct $825 million into the Albany project. The site will be dedicated to extreme ultraviolet lithography, a technology that’s essential to making the most advanced logic chips. The Albany Nanotech Complex, which has already seen more than $25 billion in investments from the state and industry partners over two decades, will form the heart of the future NSTC center. It already has an EUV lithography machine on site and has begun an expansion to install a next-generation version, called high-NA EUV, which promises to produce even finer chip features. Working with a tool recently installed in Europe, IBM, a long-time tenant of the Albany research facility, reported record yields of copper interconnects built every 21 nanometers, a pitch several nanometers tighter than possible with ordinary EUV.

“It’s fulfilling to see that this ecosystem can be taken to the national and global level through CHIPS Act funding,” said Mukesh Khare, general manager of IBM’s semiconductors division, speaking from the future site of the NSTC EUV center. “It’s the right time, and we have all the ingredients.”

While only a few companies are capable of manufacturing cutting edge logic using EUV, the impact of the NSTC center will be much broader, Khare argues. It will extend down as far as early-stage startups with ideas or materials for improving the chipmaking process. “An EUV R&D center doesn’t mean just one machine,” says Khare. “It needs so many machines around it… It’s a very large ecosystem.”

Silicon Valley lands the design center

The design center is tasked with conducting advanced research in chip design, electronic design automation (EDA), chip and system architectures, and hardware security. It will also host the NSTC’s design enablement gateway, a program that provides NSTC members with secure, cloud-based access to design tools, reference processes and designs, and shared data sets, with the goal of reducing the time and cost of design. Additionally, it will house workforce development, member convening, and administration functions.

Situating the design center in Silicon Valley, with its concentration of research universities, venture capital, and workforce, seems like the obvious choice to many experts. “I can’t think of a better place,” says Patrick Soheili, co-founder of interconnect technology startup Eliyan, which is based in Santa Clara, Calif.

Abhijeet Chakraborty, vice president of engineering in the technology and product group at Silicon Valley-based Synopsys, a leading maker of EDA software, sees Silicon Valley’s expansive tech ecosystem as one of its main advantages in landing the NSTC’s design center. The region concentrates companies and researchers involved in the whole spectrum of the industry from semiconductor process technology to cloud software.

Access to such a broad range of industries is increasingly important for chip design startups, he says. “To design a chip or component these days you need to go from concept to design to validation in an environment that takes care of the entire stack,” he says. It’s prohibitively expensive for a startup to do that alone, so one of Chakraborty’s hopes for the design center is that it will help startups access the design kits and other data needed to operate in this new environment.

Packaging and prototyping still to come

A third promised center for prototyping and packaging is still to come. “The big question is where does the packaging and prototyping go?” says Mark Granahan, cofounder and CEO of Pennsylvania-based power semiconductor startup Ideal Semiconductor. “To me that’s a great opportunity.” He points out that because there is so little packaging technology infrastructure in the United States, any ambitious state or region should have a shot at hosting such a center. One of the original intentions of the act, after all, was to expand the number of regions of the country that are involved in the semiconductor industry.

But that hasn’t stopped some already tech-heavy regions from wanting it. “Oregon offers the strongest ecosystem for such a facility,” said a spokesperson for Intel, whose technology development is done there. “The state is uniquely positioned to contribute to the success of the NSTC and help drive technological advancements in the U.S. semiconductor industry.”

As NSTC makes progress, Granahan’s concern is that bureaucracy will expand with it and slow efforts to boost the U.S. chip industry. Already the layers of control are multiplying. The Chips Office at the National Institute of Standards and Technology executes the Act. The NSTC is administered by the nonprofit Natcast, which directs the EUV center, which is in a facility run by another nonprofit, NY CREATES. “We want these things to be agile and make local decisions.”

Amazon's Secret Weapon in Chip Design Is Amazon



Big-name makers of processors, especially those geared toward cloud-based AI, such as AMD and Nvidia, have been showing signs of wanting to own more of the business of computing, purchasing makers of software, interconnects, and servers. The hope is that control of the “full stack” will give them an edge in designing what their customers want.

Amazon Web Services (AWS) got there ahead of most of the competition when it purchased chip designer Annapurna Labs in 2015 and proceeded to design CPUs, AI accelerators, servers, and data centers as a vertically integrated operation. Ali Saidi, the technical lead for the Graviton series of CPUs, and Rami Sinno, director of engineering at Annapurna Labs, explained the advantages of vertically integrated design and Amazon scale and showed IEEE Spectrum around the company’s hardware testing labs in Austin, Tex., on 27 August.

What brought you to Amazon Web Services, Rami?

Rami Sinno. Credit: AWS

Rami Sinno: Amazon is my first vertically integrated company. And that was on purpose. I was working at Arm, and I was looking for the next adventure, looking at where the industry is heading and what I want my legacy to be. I looked at two things:

One is vertically integrated companies, because this is where most of the innovation is—the interesting stuff is happening when you control the full hardware and software stack and deliver directly to customers.

And the second thing is, I realized that machine learning, AI in general, is going to be very, very big. I didn’t know exactly which direction it was going to take, but I knew that there is something that is going to be generational, and I wanted to be part of that. I already had that experience prior when I was part of the group that was building the chips that go into the Blackberries; that was a fundamental shift in the industry. That feeling was incredible, to be part of something so big, so fundamental. And I thought, “Okay, I have another chance to be part of something fundamental.”

Does working at a vertically-integrated company require a different kind of chip design engineer?

Sinno: Absolutely. When I hire people, the interview process is going after people that have that mindset. Let me give you a specific example: Say I need a signal integrity engineer. (Signal integrity makes sure a signal going from point A to point B, wherever it is in the system, makes it there correctly.) Typically, you hire signal integrity engineers that have a lot of experience in analysis for signal integrity, that understand layout impacts, can do measurements in the lab. Well, this is not sufficient for our group, because we want our signal integrity engineers also to be coders. We want them to be able to take a workload or a test that will run at the system level and be able to modify it or build a new one from scratch in order to look at the signal integrity impact at the system level under workload. This is where being trained to be flexible, to think outside of the little box has paid off huge dividends in the way that we do development and the way we serve our customers.

At the end of the day, our responsibility is to deliver complete servers in the data center directly for our customers. And if you think from that perspective, you’ll be able to optimize and innovate across the full stack. A design engineer or a test engineer should be able to look at the full picture because that’s his or her job, deliver the complete server to the data center and look where best to do optimization. It might not be at the transistor level or at the substrate level or at the board level. It could be something completely different. It could be purely software. And having that knowledge, having that visibility, will allow the engineers to be significantly more productive and deliver to the customer significantly faster. We’re not going to bang our head against the wall to optimize the transistor where three lines of code downstream will solve these problems, right?

Do you feel like people are trained in that way these days?

Sinno: We’ve had very good luck with recent college grads. Recent college grads, especially the past couple of years, have been absolutely phenomenal. I’m very, very pleased with the way that the education system is graduating the engineers and the computer scientists that are interested in the type of jobs that we have for them.

The other place that we have been super successful in finding the right people is at startups. They know what it takes, because at a startup, by definition, you have to do so many different things. People who’ve done startups before completely understand the culture and the mindset that we have at Amazon.

What brought you to AWS, Ali?

Ali Saidi. Credit: AWS

Ali Saidi: I’ve been here about seven and a half years. When I joined AWS, I joined a secret project at the time. I was told: “We’re going to build some Arm servers. Tell no one.”

We started with Graviton 1. Graviton 1 was really the vehicle for us to prove that we could offer the same experience in AWS with a different architecture.

The cloud gave us an ability for a customer to try it in a very low-cost, low barrier of entry way and say, “Does it work for my workload?” So Graviton 1 was really just the vehicle to demonstrate that we could do this, and to start signaling to the world that we want software around Arm servers to grow and that they’re going to be more relevant.

Graviton 2—announced in 2019—was kind of our first… what we think is a market-leading device that’s targeting general-purpose workloads, web servers, and those types of things.

It’s done very well. We have people running databases, web servers, key-value stores, lots of applications... When customers adopt Graviton, they bring one workload, and they see the benefits of bringing that one workload. And then the next question they ask is, “Well, I want to bring some more workloads. What should I bring?” There were some where it wasn’t powerful enough effectively, particularly around things like media encoding, taking videos and encoding them or re-encoding them or encoding them to multiple streams. It’s a very math-heavy operation and required more [single-instruction multiple data] bandwidth. We need cores that could do more math.

We also wanted to enable the [high-performance computing] market. So we have an instance type called HPC 7G where we’ve got customers like Formula One. They do computational fluid dynamics of how this car is going to disturb the air and how that affects following cars. It’s really just expanding the portfolio of applications. We did the same thing when we went to Graviton 4, which has 96 cores versus Graviton 3’s 64.

How do you know what to improve from one generation to the next?

Saidi: Far and wide, most customers find great success when they adopt Graviton. Occasionally, they see performance that isn’t the same level as their other migrations. They might say “I moved these three apps, and I got 20 percent higher performance; that’s great. But I moved this app over here, and I didn’t get any performance improvement. Why?” It’s really great to see the 20 percent. But for me, in the kind of weird way I am, the 0 percent is actually more interesting, because it gives us something to go and explore with them.

Most of our customers are very open to those kinds of engagements. So we can understand what their application is and build some kind of proxy for it. Or if it’s an internal workload, then we could just use the original software. And then we can use that to kind of close the loop and work on what the next generation of Graviton will have and how we’re going to enable better performance there.

What’s different about designing chips at AWS?

Saidi: In chip design, there are many different competing optimization points. You have all of these conflicting requirements, you have cost, you have scheduling, you’ve got power consumption, you’ve got size, what DRAM technologies are available and when you’re going to intersect them… It ends up being this fun, multifaceted optimization problem to figure out what’s the best thing that you can build in a timeframe. And you need to get it right.

One thing that we’ve done very well is taken our initial silicon to production.

How?

Saidi: This might sound weird, but I’ve seen other places where the software and the hardware people effectively don’t talk. The hardware and software people in Annapurna and AWS work together from day one. The software people are writing the software that will ultimately be the production software and firmware while the hardware is being developed in cooperation with the hardware engineers. By working together, we’re closing that iteration loop. When you are carrying the piece of hardware over to the software engineer’s desk your iteration loop is years and years. Here, we are iterating constantly. We’re running virtual machines in our emulators before we have the silicon ready. We are taking an emulation of [a complete system] and running most of the software we’re going to run.

So by the time that we get the silicon back [from the foundry], the software’s done. And we’ve seen most of the software work at this point. So we have very high confidence that it’s going to work.

The other piece of it, I think, is just being absolutely laser-focused on what we are going to deliver. You get a lot of ideas, but your design resources are approximately fixed. No matter how many ideas I put in the bucket, I’m not going to be able to hire that many more people, and my budget’s probably fixed. So every idea I throw in the bucket is going to use some resources. And if that feature isn’t really important to the success of the project, I’m risking the rest of the project. And I think that’s a mistake that people frequently make.

Are those decisions easier in a vertically integrated situation?

Saidi: Certainly. We know we’re going to build a motherboard and a server and put it in a rack, and we know what that looks like… So we know the features we need. We’re not trying to build a superset product that could allow us to go into multiple markets. We’re laser-focused into one.

What else is unique about the AWS chip design environment?

Saidi: One thing that’s very interesting for AWS is that we’re the cloud and we’re also developing these chips in the cloud. We were the first company to really push on running [electronic design automation (EDA)] in the cloud. We changed the model from “I’ve got 80 servers and this is what I use for EDA” to “Today, I have 80 servers. If I want, tomorrow I can have 300. The next day, I can have 1,000.”

We can compress some of the time by varying the resources that we use. At the beginning of the project, we don’t need as many resources. We can turn a lot of stuff off and not pay for it effectively. As we get to the end of the project, now we need many more resources. And instead of saying, “Well, I can’t iterate this fast, because I’ve got this one machine, and it’s busy.” I can change that and instead say, “Well, I don’t want one machine; I’ll have 10 machines today.”

Instead of my iteration cycle being two days for a big design like this, instead of being even one day, with these 10 machines I can bring it down to three or four hours. That’s huge.

How important is Amazon.com as a customer?

Saidi: They have a wealth of workloads, and we obviously are the same company, so we have access to some of those workloads in ways that with third parties, we don’t. But we also have very close relationships with other external customers.

So last Prime Day, we said that 2,600 Amazon.com services were running on Graviton processors. This Prime Day, that number more than doubled to 5,800 services running on Graviton. And the retail side of Amazon used over 250,000 Graviton CPUs in support of the retail website and the services around that for Prime Day.

The AI accelerator team is colocated with the labs that test everything from chips through racks of servers. Why?

Sinno: So Annapurna Labs has multiple labs in multiple locations as well. This location here is in Austin… is one of the smaller labs. But what’s so interesting about the lab here in Austin is that you have all of the hardware and many software development engineers for machine learning servers and for Trainium and Inferentia [AWS’s AI chips] effectively co-located on this floor. For hardware developers, engineers, having the labs co-located on the same floor has been very, very effective. It speeds execution and iteration for delivery to the customers. This lab is set up to be self-sufficient with anything that we need to do, at the chip level, at the server level, at the board level. Because again, as I convey to our teams, our job is not the chip; our job is not the board; our job is the full server to the customer.

How does vertical integration help you design and test chips for data-center-scale deployment?

Sinno: It’s relatively easy to create a bar-raising server. Something that’s very high-performance, very low-power. If we create 10 of them, 100 of them, maybe 1,000 of them, it’s easy. You can cherry pick this, you can fix this, you can fix that. But the scale that AWS is at is significantly higher. We need to train models that require 100,000 of these chips. 100,000! And for training, it’s not run in five minutes. It’s run in hours or days or weeks even. Those 100,000 chips have to be up for the duration. Everything that we do here is to get to that point.

We start from a “what are all the things that can go wrong?” mindset. And we implement all the things that we know. But when you were talking about cloud scale, there are always things that you have not thought of that come up. These are the 0.001-percent type issues.

In this case, we do the debug first in the fleet. And in certain cases, we have to do debugs in the lab to find the root cause. And if we can fix it immediately, we fix it immediately. Being vertically integrated, in many cases we can do a software fix for it. We use our agility to rush a fix while at the same time making sure that the next generation has it already figured out from the get go.
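A quick calculation shows why rare issues surface at this scale. The sketch below reads "0.001 percent" as an independent per-chip probability of hitting a given issue during a run, which is an assumption made for illustration rather than anything AWS has stated. The helper name `prob_at_least_one` is hypothetical.

```python
def prob_at_least_one(per_chip_probability: float, chips: int) -> float:
    """Chance that at least one of `chips` independent chips hits the issue."""
    return 1.0 - (1.0 - per_chip_probability) ** chips

p = 0.001 / 100   # 0.001 percent, assumed per chip per run
for fleet in (10, 1_000, 100_000):
    print(f"{fleet:>7} chips: {prob_at_least_one(p, fleet):.4f}")
```

Under that assumption, an issue that is practically invisible on 10 or 1,000 chips has roughly a 63 percent chance of appearing somewhere in a 100,000-chip training run.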

How India Is Starting a Chip Industry From Scratch



In March, India announced a major investment to establish a semiconductor-manufacturing industry. With US $15 billion in investments from companies, state governments, and the central government, India now has plans for several chip-packaging plants and the country’s first modern chip fab as part of a larger effort to grow its electronics industry.

But turning India into a chipmaking powerhouse will also require a substantial investment in R&D. And so the Indian government turned to IEEE Fellow and retired Georgia Tech professor Rao Tummala, a pioneer of some of the chip-packaging technologies that have become critical to modern computers. Tummala spoke with IEEE Spectrum during the IEEE Electronic Components and Technology Conference in Denver, Colo., in May.

Rao Tummala is a pioneer of semiconductor packaging and a longtime research leader at Georgia Tech.

What are you helping the government of India to develop?

Rao Tummala: I’m helping to develop the R&D side of India’s semiconductor efforts. We picked 12 strategic research areas. If you explore research in those areas, you can make almost any electronic system. For each of those 12 areas, there’ll be one primary center of excellence. And that’ll be typically at an IIT (Indian Institute of Technology) campus. Then there’ll be satellite centers attached to those throughout India. So when we’re done with it, in about five years, I expect to see probably almost all the institutions involved.

Why did you decide to spend your retirement doing this?

Tummala: It’s my giving back. India gave me the best education possible at the right time.

I’ve been going to India and wanting to help for 20 years. But I wasn’t successful until the current government decided they’re going to make manufacturing and semiconductors important for the country. They asked themselves: What would be the need for semiconductors, in 10 years, 20 years, 30 years? And they quickly concluded that if you have 1.4 billion people, each consuming, say, $5,000 worth of electronics each year, it requires billions and billions of dollars’ worth of semiconductors.

What advantages does India have in the global semiconductor space?

Tummala: India has the best educational system in the world for the masses. It produces the very best students in science and engineering at the undergrad level and lots of them. India is already a success in design and software. All the major U.S. tech companies have facilities in India. And they go to India for two reasons. It has a lot of people with a lot of knowledge in the design and software areas, and those people are cheaper [to employ].

What are India’s weaknesses, and is the government response adequate to overcoming them?

Tummala: India is clearly behind in semiconductor manufacturing. It’s behind in knowledge and behind in infrastructure. Government doesn’t solve these problems. All that the government does is set the policies and give the money. This has given companies incentives to come to India, and therefore the semiconductor industry is beginning to flourish.

Will India ever have leading-edge chip fabs?

Tummala: Absolutely. Not only will it have leading-edge fabs, but in about 20 years, it will have the most comprehensive system-level approach of any country, including the United States. In about 10 years, the size of the electronics industry in India will probably have grown about 10 times.

This article appears in the August 2024 print issue as “5 Questions for Rao Tummala.”

Hybrid Bonding Plays Starring Role in 3D Chips



Chipmakers continue to claw for every spare nanometer to continue scaling down circuits, but a technology involving things that are much bigger—hundreds or thousands of nanometers across—could be just as significant over the next five years.

Called hybrid bonding, that technology stacks two or more chips atop one another in the same package. That allows chipmakers to increase the number of transistors in their processors and memories despite a general slowdown in the shrinking of transistors, which once drove Moore’s Law. At the IEEE Electronic Components and Technology Conference (ECTC) this past May in Denver, research groups from around the world unveiled a variety of hard-fought improvements to the technology, with a few showing results that could lead to a record density of connections between 3D stacked chips: some 7 million links per square millimeter of silicon.

All those connections are needed because of the new nature of progress in semiconductors, Intel’s Yi Shi told engineers at ECTC. Moore’s Law is now governed by a concept called system technology co-optimization, or STCO, whereby a chip’s functions, such as cache memory, input/output, and logic, are fabricated separately using the best manufacturing technology for each. Hybrid bonding and other advanced packaging tech can then be used to assemble these subsystems so that they work every bit as well as a single piece of silicon. But that can happen only when there’s a high density of connections that can shuttle bits between the separate pieces of silicon with little delay or energy consumption.

Out of all the advanced-packaging technologies, hybrid bonding provides the highest density of vertical connections. Consequently, it is the fastest growing segment of the advanced-packaging industry, says Gabriela Pereira, technology and market analyst at Yole Group. The overall market is set to more than triple to US $38 billion by 2029, according to Yole, which projects that hybrid bonding will make up about half the market by then, although today it’s just a small portion.

In hybrid bonding, copper pads are built on the top face of each chip. The copper is surrounded by insulation, usually silicon oxide, and the pads themselves are slightly recessed from the surface of the insulation. After the oxide is chemically modified, the two chips are then pressed together face-to-face, so that the recessed pads on each align. This sandwich is then slowly heated, causing the copper to expand across the gap and fuse, connecting the two chips.

Making Hybrid Bonding Better


An illustration showing how to make hybrid bonding better
  1. Hybrid bonding starts with two wafers or a chip and a wafer facing each other. The mating surfaces are covered in oxide insulation and slightly recessed copper pads connected to the chips’ interconnect layers.
  2. The wafers are pressed together to form an initial bond between the oxides.
  3. The stacked wafers are then heated slowly, strongly linking the oxides and expanding the copper to form an electrical connection.

  1. To form more secure bonds, engineers are flattening the last few nanometers of oxide. Even slight bulges or warping can break dense connections.
  2. The copper must be recessed from the surface of the oxide just the right amount. Too much and it will fail to form a connection. Too little and it will push the wafers apart. Researchers are working on ways to control the level of copper down to single atomic layers.
  3. The initial links between the wafers are weak hydrogen bonds. After annealing, the links are strong covalent bonds. Researchers expect that using different types of surfaces, such as silicon carbonitride, which has more locations to form chemical bonds, will lead to stronger links between the wafers.
  4. The final step in hybrid bonding can take hours and require high temperatures. Researchers hope to lower the temperature and shorten the process time.
  5. Although the copper from both wafers presses together to form an electrical connection, the metal’s grain boundaries generally do not cross from one side to the other. Researchers are trying to cause large single grains of copper to form across the boundary to improve conductance and stability.

Hybrid bonding can either attach individual chips of one size to a wafer full of chips of a larger size or bond two full wafers of chips of the same size. Thanks in part to its use in camera chips, the latter process is more mature than the former, Pereira says. For example, engineers at the European microelectronics-research institute Imec have created some of the most dense wafer-on-wafer bonds ever, with a bond-to-bond distance (or pitch) of just 400 nanometers. But Imec managed only a 2-micrometer pitch for chip-on-wafer bonding.

The latter is a huge improvement over the advanced 3D chips in production today, which have connections about 9 μm apart. And it’s an even bigger leap over the predecessor technology: “microbumps” of solder, which have pitches in the tens of micrometers.
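The link between pitch and connection density can be made concrete with a short calculation. The sketch below assumes the bond pads sit on a simple square grid, so that density is one connection per pitch-by-pitch cell; real layouts may differ, and the function name is only illustrative.

```python
def links_per_mm2(pitch_nm: float) -> float:
    """Connections per square millimeter for a square grid with the given pitch."""
    pitch_mm = pitch_nm * 1e-6  # 1 nm = 1e-6 mm
    return 1.0 / pitch_mm ** 2


# Pitches quoted in the article; the square-grid layout is an assumption.
PITCHES_NM = {
    "wafer-on-wafer (Imec)": 400,
    "chip-on-wafer (Imec)": 2_000,
    "3D chips in production today": 9_000,
}

if __name__ == "__main__":
    for label, pitch in PITCHES_NM.items():
        print(f"{label:32s} {pitch:>6} nm pitch -> {links_per_mm2(pitch):>12,.0f} links/mm^2")
```

On this model, a 400-nm pitch works out to 6.25 million connections per square millimeter, and a pitch of roughly 380 nm gives the 7 million figure cited earlier. Halving the pitch quadruples the density.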

“With the equipment available, it’s easier to align wafer to wafer than chip to wafer. Most processes for microelectronics are made for [full] wafers,” says Jean-Charles Souriau, scientific leader in integration and packaging at the French research organization CEA Leti. But it’s chip-on-wafer (or die-to-wafer) that’s making a splash in high-end processors such as those from AMD, where the technique is used to assemble compute cores and cache memory in its advanced CPUs and AI accelerators.

In pushing for tighter and tighter pitches for both scenarios, researchers are focused on making surfaces flatter, getting bound wafers to stick together better, and cutting the time and complexity of the whole process. Getting it right could revolutionize how chips are designed.

WoW, Those Are Some Tight Pitches

The recent wafer-on-wafer (WoW) research that achieved the tightest pitches—from 360 nm to 500 nm—involved a lot of effort on one thing: flatness. To bond two wafers together with 100-nm-level accuracy, the whole wafer has to be nearly perfectly flat. If it’s bowed or warped to the slightest degree, whole sections won’t connect.

Flattening wafers is the job of a process called chemical mechanical planarization, or CMP. It’s essential to chipmaking generally, especially for producing the layers of interconnects above the transistors.

“CMP is a key parameter we have to control for hybrid bonding,” says Souriau. The results presented at ECTC show CMP being taken to another level, not just flattening across the wafer but reducing mere nanometers of roundness on the insulation between the copper pads to ensure better connections.

Other researchers focused on ensuring those flattened parts stick together strongly enough. They did so by experimenting with different surface materials such as silicon carbonitride instead of silicon oxide and by using different schemes to chemically activate the surface. Initially, when wafers or dies are pressed together, they are held in place with relatively weak hydrogen bonds, and the concern is whether everything will stay in place during further processing steps. After attachment, wafers and chips are then heated slowly, in a process called annealing, to form stronger chemical bonds. Just how strong these bonds are—and even how to figure that out—was the subject of much of the research presented at ECTC.

Part of that final bond strength comes from the copper connections. The annealing step expands the copper across the gap to form a conductive bridge. Controlling the size of that gap is key, explains Samsung’s Seung Ho Hahn. Too little expansion, and the copper won’t fuse. Too much, and the wafers will be pushed apart. It’s a matter of nanometers, and Hahn reported research on a new chemical process that he hopes to use to get it just right by etching away the copper a single atomic layer at a time.
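A rough back-of-the-envelope estimate shows why this is a matter of nanometers. The sketch below uses the linear thermal-expansion formula with textbook coefficients for copper (about 17 parts per million per kelvin) and silicon oxide (about 0.5). The 1-micrometer pad height and 25 °C starting temperature are hypothetical values chosen for illustration, not figures from the research, and the model ignores mechanical constraint and plastic deformation, which matter in real pads.

```python
# Approximate linear thermal-expansion coefficients, per kelvin (textbook values).
ALPHA_CU = 17e-6
ALPHA_OXIDE = 0.5e-6


def free_expansion_nm(pad_height_nm: float, t_start_c: float, t_anneal_c: float,
                      alpha_pad: float = ALPHA_CU,
                      alpha_insulator: float = ALPHA_OXIDE) -> float:
    """How far a pad grows relative to the surrounding insulator, in nanometers.

    Uses delta_L = (alpha_pad - alpha_insulator) * height * delta_T, i.e.
    unconstrained linear expansion. This is a simplification.
    """
    delta_t = t_anneal_c - t_start_c
    return (alpha_pad - alpha_insulator) * pad_height_nm * delta_t


def gap_closes(recess_per_side_nm: float, pad_height_nm: float,
               t_start_c: float = 25.0, t_anneal_c: float = 300.0) -> bool:
    """True if each pad expands enough to close its own recess (symmetric pads assumed)."""
    growth = free_expansion_nm(pad_height_nm, t_start_c, t_anneal_c)
    return growth >= recess_per_side_nm


if __name__ == "__main__":
    growth = free_expansion_nm(pad_height_nm=1_000, t_start_c=25.0, t_anneal_c=300.0)
    print(f"Relative copper growth per pad: {growth:.2f} nm")
    for recess in (3.0, 10.0):
        print(f"Recess of {recess:.0f} nm per side closes: {gap_closes(recess, 1_000)}")
```

With these assumed numbers, each pad gains only about 4.5 nm relative to the oxide, so a recess that is a few nanometers too deep would leave the copper short of contact.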

The quality of the connection counts, too. The metals in chip interconnects are not a single crystal; instead they’re made up of many grains, crystals oriented in different directions. Even after the copper expands, the metal’s grain boundaries often don’t cross from one side to another. Such a crossing should reduce a connection’s electrical resistance and boost its reliability. Researchers at Tohoku University in Japan reported a new metallurgical scheme that could finally generate large, single grains of copper that cross the boundary. “This is a drastic change,” says Takafumi Fukushima, an associate professor at Tohoku. “We are now analyzing what underlies it.”

Other experiments discussed at ECTC focused on streamlining the bonding process. Several sought to reduce the annealing temperature needed to form bonds (typically around 300 °C) so as to minimize any risk of damage to the chips from the prolonged heating. Researchers from Applied Materials presented progress on a method to radically reduce the time needed for annealing, from hours to just 5 minutes.

CoWs That Are Outstanding in the Field

A series of gray-scale images of the corner of an object at increasing magnification. Imec used plasma etching to dice up chips and give them chamfered corners. The technique relieves mechanical stress that could interfere with bonding. Image: Imec

Chip-on-wafer (CoW) hybrid bonding is more useful to makers of advanced CPUs and GPUs at the moment: It allows chipmakers to stack chiplets of different sizes and to test each chip before it’s bound to another, ensuring that they aren’t dooming an expensive CPU with a single flawed part.

But CoW comes with all of the difficulties of WoW and fewer of the options to alleviate them. For example, CMP is designed to flatten wafers, not individual dies. Once dies have been cut from their source wafer and tested, there’s less that can be done to improve their readiness for bonding.

Nevertheless, researchers at Intel reported CoW hybrid bonds with a 3-μm pitch, and, as mentioned, a team at Imec managed 2 μm, largely by making the transferred dies very flat while they were still attached to the wafer and keeping them extra clean throughout the process. Both groups used plasma etching to dice up the dies instead of the usual method, which uses a specialized blade. Unlike a blade, plasma etching doesn’t lead to chipping at the edges, which creates debris that could interfere with connections. It also allowed the Imec group to shape the die, making chamfered corners that relieve mechanical stress that could break connections.

CoW hybrid bonding is going to be critical to the future of high-bandwidth memory (HBM), according to several researchers at ECTC. HBM is a stack of DRAM dies—currently 8 to 12 dies high—atop a control-logic chip. Often placed within the same package as high-end GPUs, HBM is crucial to handling the tsunami of data needed to run large language models like ChatGPT. Today, HBM dies are stacked using microbump technology, so there are tiny balls of solder surrounded by an organic filler between each layer.

But with AI pushing memory demand even higher, DRAM makers want to stack 20 layers or more in HBM chips. The volume that microbumps take up means that these stacks will soon be too tall to fit properly in the package with GPUs. Hybrid bonding would shrink the height of HBMs and also make it easier to remove excess heat from the package, because there would be less thermal resistance between its layers.

At ECTC, Samsung engineers showed that hybrid bonding could yield a 16-layer HBM stack. “I think it’s possible to make a more-than-20-layer stack using this technology,” says Hyeonmin Lee, a senior engineer at Samsung. Other new CoW technology could also help bring hybrid bonding to high-bandwidth memory. Researchers at CEA Leti are exploring what’s known as self-alignment technology, says Souriau. That would help ensure good CoW connections using just chemical processes. Some parts of each surface would be made hydrophobic and some hydrophilic, resulting in surfaces that would slide into place automatically.

At ECTC, researchers from Tohoku University and Yamaha Robotics reported work on a similar scheme, using the surface tension of water to align 5-μm pads on experimental DRAM chips with better than 50-nm accuracy.

The Bounds of Hybrid Bonding

Researchers will almost certainly keep reducing the pitch of hybrid-bonding connections. A 200-nm WoW pitch is not just possible but desirable, Han-Jong Chia, a project manager for pathfinding systems at Taiwan Semiconductor Manufacturing Co., told engineers at ECTC. Within two years, TSMC plans to introduce a technology called backside power delivery. (Intel plans the same for the end of this year.) That’s a technology that puts the chip’s chunky power-delivery interconnects below the surface of the silicon instead of above it. With those power conduits out of the way, the uppermost levels can connect better to smaller hybrid-bonding bond pads, TSMC researchers calculate. Backside power delivery with 200-nm bond pads would cut down the capacitance of 3D connections so much that a measure of energy efficiency and signal speed would be as much as eight times better than what can be achieved with 400-nm bond pads.

Black squares dot most of the top of an orange metallic disc. Chip-on-wafer hybrid bonding is more useful than wafer-on-wafer bonding, in that it can place dies of one size onto a wafer of larger dies. However, the density of connections that can be achieved is lower than for wafer-on-wafer bonding. Image: Imec

At some point in the future, if bond pitches narrow even further, Chia suggests, it might become practical to “fold” blocks of circuitry so they are built across two wafers. That way some of what are now long connections within the block might be able to take a vertical shortcut, potentially speeding computations and lowering power consumption.

And hybrid bonding may not be limited to silicon. “Today there is a lot of development in silicon-to-silicon wafers, but we are also looking to do hybrid bonding between gallium nitride and silicon wafers and glass wafers…everything on everything,” says CEA Leti’s Souriau. His organization even presented research on hybrid bonding for quantum-computing chips, which involves aligning and bonding superconducting niobium instead of copper.

“It’s difficult to say what the limit will be,” Souriau says. “Things are moving very fast.”

This article was updated on 11 August 2024.

This article appears in the September 2024 print issue as “The Copper Connection.”
