IEEE Spectrum Recent Content full text
Amazon's Secret Weapon in Chip Design Is Amazon
15 September 2024 at 15:00

By: Samuel K. Moore

Big-name makers of processors, especially those geared toward cloud-based AI, such as AMD and Nvidia, have been showing signs of wanting to own more of the business of computing, purchasing makers of software, interconnects, and servers. The hope is that control of the “full stack” will give them an edge in designing what their customers want.

Amazon Web Services (AWS) got there ahead of most of the competition, when they purchased chip designer Annapurna Labs in 2015 and proceeded to design CPUs, AI accelerators, servers, and data centers as a vertically-integrated operation. Ali Saidi, the technical lead for the Graviton series of CPUs, and Rami Sinno, director of engineering at Annapurna Labs, explained the advantage of vertically-integrated design and Amazon-scale and showed IEEE Spectrum around the company’s hardware testing labs in Austin, Tex., on 27 August.

What brought you to Amazon Web Services, Rami?

Rami Sinno (Photo: AWS)

Rami Sinno: Amazon is my first vertically integrated company. And that was on purpose. I was working at Arm, and I was looking for the next adventure, looking at where the industry is heading and what I want my legacy to be. I looked at two things:

One is vertically integrated companies, because this is where most of the innovation is—the interesting stuff is happening when you control the full hardware and software stack and deliver directly to customers.

And the second thing is, I realized that machine learning, AI in general, is going to be very, very big. I didn’t know exactly which direction it was going to take, but I knew that there is something that is going to be generational, and I wanted to be part of that. I already had that experience prior when I was part of the group that was building the chips that go into the Blackberries; that was a fundamental shift in the industry. That feeling was incredible, to be part of something so big, so fundamental. And I thought, “Okay, I have another chance to be part of something fundamental.”

Does working at a vertically-integrated company require a different kind of chip design engineer?

Sinno: Absolutely. When I hire people, the interview process is going after people that have that mindset. Let me give you a specific example: Say I need a signal integrity engineer. (Signal integrity makes sure a signal going from point A to point B, wherever it is in the system, makes it there correctly.) Typically, you hire signal integrity engineers that have a lot of experience in analysis for signal integrity, that understand layout impacts, can do measurements in the lab. Well, this is not sufficient for our group, because we want our signal integrity engineers also to be coders. We want them to be able to take a workload or a test that will run at the system level and be able to modify it or build a new one from scratch in order to look at the signal integrity impact at the system level under workload. This is where being trained to be flexible, to think outside of the little box has paid off huge dividends in the way that we do development and the way we serve our customers.

“By the time that we get the silicon back, the software’s done” —Ali Saidi, Annapurna Labs

At the end of the day, our responsibility is to deliver complete servers in the data center directly for our customers. And if you think from that perspective, you’ll be able to optimize and innovate across the full stack. A design engineer or a test engineer should be able to look at the full picture because that’s his or her job: deliver the complete server to the data center and look where best to do optimization. It might not be at the transistor level or at the substrate level or at the board level. It could be something completely different. It could be purely software. And having that knowledge, having that visibility, will allow the engineers to be significantly more productive and deliver to the customer significantly faster. We’re not going to bang our head against the wall to optimize the transistor where three lines of code downstream will solve these problems, right?

Do you feel like people are trained in that way these days?

Sinno: We’ve had very good luck with recent college grads. Recent college grads, especially the past couple of years, have been absolutely phenomenal. I’m very, very pleased with the way that the education system is graduating the engineers and the computer scientists that are interested in the type of jobs that we have for them.

The other place that we have been super successful in finding the right people is at startups. They know what it takes, because at a startup, by definition, you have to do so many different things. People who’ve done startups before completely understand the culture and the mindset that we have at Amazon.

What brought you to AWS, Ali?

Ali Saidi (Photo: AWS)

Ali Saidi: I’ve been here about seven and a half years. When I joined AWS, I joined a secret project at the time. I was told: “We’re going to build some Arm servers. Tell no one.”

We started with Graviton 1. Graviton 1 was really the vehicle for us to prove that we could offer the same experience in AWS with a different architecture.

The cloud gave us an ability for a customer to try it in a very low-cost, low barrier of entry way and say, “Does it work for my workload?” So Graviton 1 was really just the vehicle to demonstrate that we could do this, and to start signaling to the world that we want software around Arm servers to grow and that they’re going to be more relevant.

Graviton 2—announced in 2019—was kind of our first… what we think is a market-leading device that’s targeting general-purpose workloads, web servers, and those types of things.

It’s done very well. We have people running databases, web servers, key-value stores, lots of applications... When customers adopt Graviton, they bring one workload, and they see the benefits of bringing that one workload. And then the next question they ask is, “Well, I want to bring some more workloads. What should I bring?” There were some where it wasn’t powerful enough effectively, particularly around things like media encoding, taking videos and encoding them or re-encoding them or encoding them to multiple streams. It’s a very math-heavy operation and required more [single-instruction multiple data] bandwidth. We needed cores that could do more math.

We also wanted to enable the [high-performance computing] market. So we have an instance type called HPC 7G where we’ve got customers like Formula One. They do computational fluid dynamics of how this car is going to disturb the air and how that affects following cars. It’s really just expanding the portfolio of applications. We did the same thing when we went to Graviton 4, which has 96 cores versus Graviton 3’s 64.

How do you know what to improve from one generation to the next?

Saidi: Far and wide, most customers find great success when they adopt Graviton. Occasionally, they see performance that isn’t the same level as their other migrations. They might say “I moved these three apps, and I got 20 percent higher performance; that’s great. But I moved this app over here, and I didn’t get any performance improvement. Why?” It’s really great to see the 20 percent. But for me, in the kind of weird way I am, the 0 percent is actually more interesting, because it gives us something to go and explore with them.

Most of our customers are very open to those kinds of engagements. So we can understand what their application is and build some kind of proxy for it. Or if it’s an internal workload, then we could just use the original software. And then we can use that to kind of close the loop and work on what the next generation of Graviton will have and how we’re going to enable better performance there.

What’s different about designing chips at AWS?

Saidi: In chip design, there are many different competing optimization points. You have all of these conflicting requirements, you have cost, you have scheduling, you’ve got power consumption, you’ve got size, what DRAM technologies are available and when you’re going to intersect them… It ends up being this fun, multifaceted optimization problem to figure out what’s the best thing that you can build in a timeframe. And you need to get it right.

One thing that we’ve done very well is taken our initial silicon to production.

How?

Saidi: This might sound weird, but I’ve seen other places where the software and the hardware people effectively don’t talk. The hardware and software people in Annapurna and AWS work together from day one. The software people are writing the software that will ultimately be the production software and firmware while the hardware is being developed in cooperation with the hardware engineers. By working together, we’re closing that iteration loop. When you are carrying the piece of hardware over to the software engineer’s desk your iteration loop is years and years. Here, we are iterating constantly. We’re running virtual machines in our emulators before we have the silicon ready. We are taking an emulation of [a complete system] and running most of the software we’re going to run.

So by the time that we get the silicon back [from the foundry], the software’s done. And we’ve seen most of the software work at this point. So we have very high confidence that it’s going to work.

The other piece of it, I think, is just being absolutely laser-focused on what we are going to deliver. You get a lot of ideas, but your design resources are approximately fixed. No matter how many ideas I put in the bucket, I’m not going to be able to hire that many more people, and my budget’s probably fixed. So every idea I throw in the bucket is going to use some resources. And if that feature isn’t really important to the success of the project, I’m risking the rest of the project. And I think that’s a mistake that people frequently make.

Are those decisions easier in a vertically integrated situation?

Saidi: Certainly. We know we’re going to build a motherboard and a server and put it in a rack, and we know what that looks like… So we know the features we need. We’re not trying to build a superset product that could allow us to go into multiple markets. We’re laser-focused into one.

What else is unique about the AWS chip design environment?

Saidi: One thing that’s very interesting for AWS is that we’re the cloud and we’re also developing these chips in the cloud. We were the first company to really push on running [electronic design automation (EDA)] in the cloud. We changed the model from “I’ve got 80 servers and this is what I use for EDA” to “Today, I have 80 servers. If I want, tomorrow I can have 300. The next day, I can have 1,000.”

We can compress some of the time by varying the resources that we use. At the beginning of the project, we don’t need as many resources. We can turn a lot of stuff off and not pay for it effectively. As we get to the end of the project, now we need many more resources. And instead of saying, “Well, I can’t iterate this fast, because I’ve got this one machine, and it’s busy.” I can change that and instead say, “Well, I don’t want one machine; I’ll have 10 machines today.”

Instead of my iteration cycle being two days for a big design like this, instead of being even one day, with these 10 machines I can bring it down to three or four hours. That’s huge.

How important is Amazon.com as a customer?

Saidi: They have a wealth of workloads, and we obviously are the same company, so we have access to some of those workloads in ways that with third parties, we don’t. But we also have very close relationships with other external customers.

So last Prime Day, we said that 2,600 Amazon.com services were running on Graviton processors. This Prime Day, that number more than doubled to 5,800 services running on Graviton. And the retail side of Amazon used over 250,000 Graviton CPUs in support of the retail website and the services around that for Prime Day.

The AI accelerator team is colocated with the labs that test everything from chips through racks of servers. Why?

Sinno: So Annapurna Labs has multiple labs in multiple locations as well. This location here is in Austin… is one of the smaller labs. But what’s so interesting about the lab here in Austin is that you have all of the hardware and many software development engineers for machine learning servers and for Trainium and Inferentia [AWS’s AI chips] effectively co-located on this floor. For hardware developers, engineers, having the labs co-located on the same floor has been very, very effective. It speeds execution and iteration for delivery to the customers. This lab is set up to be self-sufficient with anything that we need to do, at the chip level, at the server level, at the board level. Because again, as I convey to our teams, our job is not the chip; our job is not the board; our job is the full server to the customer.

How does vertical integration help you design and test chips for data-center-scale deployment?

Sinno: It’s relatively easy to create a bar-raising server. Something that’s very high-performance, very low-power. If we create 10 of them, 100 of them, maybe 1,000 of them, it’s easy. You can cherry pick this, you can fix this, you can fix that. But the scale that AWS is at is significantly higher. We need to train models that require 100,000 of these chips. 100,000! And for training, it’s not run in five minutes. It’s run in hours or days or weeks even. Those 100,000 chips have to be up for the duration. Everything that we do here is to get to that point.

We start from a “what are all the things that can go wrong?” mindset. And we implement all the things that we know. But when you are talking about cloud scale, there are always things that you have not thought of that come up. These are the 0.001-percent type issues.
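A rough calculation shows why such rare issues dominate at this scale. The sketch below assumes, purely for illustration, that a "0.001-percent type issue" means each chip independently has a 1-in-100,000 chance of hitting the problem during a training run. The interview does not define the figure that precisely, and the function name is hypothetical.

```python
def prob_any_failure(p_per_chip: float, n_chips: int) -> float:
    """Probability that at least one of n independent chips hits the issue."""
    return 1.0 - (1.0 - p_per_chip) ** n_chips

# Assumption: "0.001 percent" read as a per-chip, per-run probability.
p = 0.001 / 100
# Chip count for one training job, as quoted in the interview.
n = 100_000

expected_affected = p * n
print(f"expected chips affected per run: {expected_affected:.1f}")
print(f"chance at least one chip is affected: {prob_any_failure(p, n):.1%}")
print(f"same issue on a 1,000-chip system: {prob_any_failure(p, 1_000):.1%}")
```

Under these assumptions, an issue that shows up in about 1 percent of runs on a 1,000-chip system would be expected in roughly 63 percent of runs at 100,000 chips.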

In this case, we do the debug first in the fleet. And in certain cases, we have to do debugs in the lab to find the root cause. And if we can fix it immediately, we fix it immediately. Being vertically integrated, in many cases we can do a software fix for it. We use our agility to rush a fix while at the same time making sure that the next generation has it already figured out from the get go.

How India Is Starting a Chip Industry From Scratch
28 July 2024 at 17:01

By: Samuel K. Moore

In March, India announced a major investment to establish a semiconductor-manufacturing industry. With US $15 billion in investments from companies, state governments, and the central government, India now has plans for several chip-packaging plants and the country’s first modern chip fab as part of a larger effort to grow its electronics industry.

But turning India into a chipmaking powerhouse will also require a substantial investment in R&D. And so the Indian government turned to IEEE Fellow and retired Georgia Tech professor Rao Tummala, a pioneer of some of the chip-packaging technologies that have become critical to modern computers. Tummala spoke with IEEE Spectrum during the IEEE Electronic Components and Technology Conference in Denver, Colo., in May.

Rao Tummala

Rao Tummala is a pioneer of semiconductor packaging and a longtime research leader at Georgia Tech.

What are you helping the government of India to develop?

Rao Tummala: I’m helping to develop the R&D side of India’s semiconductor efforts. We picked 12 strategic research areas. If you explore research in those areas, you can make almost any electronic system. For each of those 12 areas, there’ll be one primary center of excellence. And that’ll be typically at an IIT (Indian Institute of Technology) campus. Then there’ll be satellite centers attached to those throughout India. So when we’re done with it, in about five years, I expect to see probably almost all the institutions involved.

Why did you decide to spend your retirement doing this?

Tummala: It’s my giving back. India gave me the best education possible at the right time.

I’ve been going to India and wanting to help for 20 years. But I wasn’t successful until the current government decided they’re going to make manufacturing and semiconductors important for the country. They asked themselves: What would be the need for semiconductors, in 10 years, 20 years, 30 years? And they quickly concluded that if you have 1.4 billion people, each consuming, say, $5,000 worth of electronics each year, it requires billions and billions of dollars’ worth of semiconductors.

“It’s my giving back. India gave me the best education possible at the right time.” —Rao Tummala, advisor to the government of India

What advantages does India have in the global semiconductor space?

Tummala: India has the best educational system in the world for the masses. It produces the very best students in science and engineering at the undergrad level and lots of them. India is already a success in design and software. All the major U.S. tech companies have facilities in India. And they go to India for two reasons. It has a lot of people with a lot of knowledge in the design and software areas, and those people are cheaper [to employ].

What are India’s weaknesses, and is the government response adequate to overcoming them?

Tummala: India is clearly behind in semiconductor manufacturing. It’s behind in knowledge and behind in infrastructure. Government doesn’t solve these problems. All that the government does is set the policies and give the money. This has given companies incentives to come to India, and therefore the semiconductor industry is beginning to flourish.

Will India ever have leading-edge chip fabs?

Tummala: Absolutely. Not only will it have leading-edge fabs, but in about 20 years, it will have the most comprehensive system-level approach of any country, including the United States. In about 10 years, the size of the electronics industry in India will probably have grown about 10 times.

This article appears in the August 2024 print issue as “5 Questions for Rao Tummala.”

Intel’s Latest FinFET Is Key to Its Foundry Plans
25 June 2024 at 19:34

By: Samuel K. Moore

Last week at VLSI Symposium, Intel detailed the manufacturing process that will form the foundation of its foundry service for high-performance data center customers. For the same power consumption, the Intel 3 process results in an 18 percent performance gain over the previous process, Intel 4. On the company’s roadmap, Intel 3 is the last to use the fin field-effect transistor (FinFET) structure, which the company pioneered in 2011. But it also includes Intel’s first use of a technology that is essential to its plans long after the FinFET is no longer cutting edge. What’s more, the technology is crucial to the company’s plans to become a foundry and make high-performance chips for other companies.

Called dipole work-function metal, it allows a chip designer to select transistors of several different threshold voltages. Threshold voltage is the level at which a device switches on or off. With the Intel 3 process, a single chip can include devices having any of four tightly controlled threshold voltages. That’s important because different functions operate best with different threshold voltages. Cache memory, for example, typically demands devices with a high threshold voltage to prevent current leakage that wastes power. Other circuits might need the fastest-switching devices, with the lowest threshold voltage.

Threshold voltage is set by the transistor’s gate stack, the layer of metal and insulation that controls the flow of current through the transistor. Historically, “the thickness of the metals determines the threshold voltage,” explains Walid Hafez, vice president of foundry technology development at Intel. “The thicker that work function metal is, the lower the threshold voltage is.” But this dependence on transistor geometry comes with some drawbacks as devices and circuits scale down.

Small deviations in the manufacturing process can alter the volume of the metal in the gate, leading to a somewhat broad range of threshold voltages. And that’s where the Intel 3 process exemplifies the change from Intel making chips only for itself to running as a foundry.

“The way an external foundry operates is very different” from an integrated device manufacturer like Intel was until recently, says Hafez. Foundry customers “need different things… One of those things they need is very tight variation of threshold voltage.”

Intel is different; even without the tight threshold voltage tolerances, it can sell all its parts by steering the best-performing ones toward its data center business and the lower-performing ones toward other market segments.

“A lot of external customers don’t do that,” he says. If a chip doesn’t meet their constraints, they may have to chuck it. “So for Intel 3 to be successful in the foundry space, it has to have those very tight variations.”

Dipoles ever after

Dipole work-function materials guarantee the needed control over threshold voltage, with no need to worry about how much room there is in the gate. The material is a proprietary mix of metals and other materials that, despite being only angstroms thick, has a powerful effect on a transistor’s silicon channel.

Intel’s use of dipole work-function materials means the gate surrounding each fin in a FinFET is thinner. (Image: Intel)

Like the old, thick metal gate, the new mix of materials electrostatically alters the silicon’s band structure to shift the threshold voltage. But it does so by inducing a dipole—a separation of charge—in the thin insulation between it and the silicon.

Because foundry customers were demanding tight control from Intel, it’s likely that competitors TSMC and Samsung already use dipoles in their latest FinFET processes. What exactly such structures are made of is a trade secret, but lanthanum was a component in earlier research, and it was the key ingredient in other research presented by the Belgium-based microelectronics research center Imec. That research was concerned with how best to build the material around stacks of horizontal silicon ribbons instead of one or two vertical fins.

In these devices, called nanosheets or gate all-around transistors, there are mere nanometers between each ribbon of silicon, so dipoles are a necessity. Samsung has already introduced a nanosheet process, and Intel’s, called 20A, is scheduled for later this year. Introducing dipole work function at Intel 3 helps get 20A and its successor 18A into a more mature state, says Hafez.

Flavors of Intel 3

Dipole work-function metal was not the only technology behind the 18 percent boost Intel 3 delivers over its predecessor. Among the others are more perfectly formed fins, more sharply defined contacts to the transistor, and lower resistance and capacitance in the interconnects. (Hafez details all that here.)

Intel is using the process to build its Xeon 6 CPUs. And the company plans to offer customers three variations on the technology, including one, 3-PT, with 9-micrometer through-silicon-vias for use in 3D stacking. “We expect Intel 3-PT to be the backbone of our foundry processes for some time to come,” says Hafez.

Nvidia Conquers Latest AI Tests

By: Samuel K. Moore

12 June 2024 at 17:00

For years, Nvidia has dominated many machine learning benchmarks, and now there are two more notches in its belt.

MLPerf, the AI benchmarking suite sometimes called “the Olympics of machine learning,” has released a new set of training tests to help make more and better apples-to-apples comparisons between competing computer systems. One of MLPerf’s new tests concerns fine-tuning of large language models, a process that takes an existing trained model and trains it a bit more with specialized knowledge to make it fit for a particular purpose. The other is for graph neural networks, a type of machine learning behind some literature databases, fraud detection in financial systems, and social networks.

Even with the additions and the participation of computers using Google’s and Intel’s AI accelerators, systems powered by Nvidia’s Hopper architecture dominated the results once again. One system that included 11,616 Nvidia H100 GPUs—the largest collection yet—topped each of the nine benchmarks, setting records in five of them (including the two new benchmarks).

“If you just throw hardware at the problem, it’s not a given that you’re going to improve.” —Dave Salvator, Nvidia

The 11,616-H100 system is “the biggest we’ve ever done,” says Dave Salvator, director of accelerated computing products at Nvidia. It smashed through the GPT-3 training trial in less than 3.5 minutes. A 512-GPU system, for comparison, took about 51 minutes. (Note that the GPT-3 task is not a full training, which could take weeks and cost millions of dollars. Instead, the computers train on a representative portion of the data, at an agreed-upon point well before completion.)

Compared to Nvidia’s largest entrant on GPT-3 last year, a 3,584-H100 computer, the 3.5-minute result represents a 3.2-fold improvement. You might expect that just from the difference in the size of these systems, but in AI computing that isn’t always the case, explains Salvator. “If you just throw hardware at the problem, it’s not a given that you’re going to improve,” he says.

“We are getting essentially linear scaling,” says Salvator. By that he means that twice as many GPUs lead to a halved training time. “[That] represents a great achievement from our engineering teams,” he adds.

Competitors are also getting closer to linear scaling. This round Intel deployed a system using 1,024 GPUs that performed the GPT-3 task in 67 minutes versus a computer one-fourth the size that took 224 minutes six months ago. Google’s largest GPT-3 entry used 12 times the number of TPU v5p accelerators as its smallest entry and performed its task nine times as fast.
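How close each of these results comes to linear scaling can be checked with simple arithmetic: divide the measured speedup by the increase in system size. The Python sketch below uses only the figures quoted above; the function name is illustrative. Note that the Nvidia and Intel comparisons span different benchmark rounds, so software improvements are folded into those ratios along with hardware scaling.

```python
def scaling_efficiency(speedup: float, size_ratio: float) -> float:
    """Measured speedup as a fraction of the ideal (linear) speedup."""
    return speedup / size_ratio

results = {
    # 3.2-fold faster with 11,616 H100s instead of 3,584
    "Nvidia (3,584 -> 11,616 H100s)": scaling_efficiency(3.2, 11_616 / 3_584),
    # 224 minutes down to 67 minutes on a system four times the size
    "Intel (4x larger system)": scaling_efficiency(224 / 67, 4),
    # nine times as fast with 12 times the accelerators
    "Google (12x the TPU v5p count)": scaling_efficiency(9, 12),
}

for name, efficiency in results.items():
    print(f"{name}: {efficiency:.0%} of linear")
```

By this measure the quoted figures work out to roughly 99 percent of linear for Nvidia, 84 percent for Intel, and 75 percent for Google.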

Linear scaling is going to be particularly important for upcoming “AI factories” housing 100,000 GPUs or more, Salvator says. He says to expect one such data center to come online this year, and another, using Nvidia’s next architecture, Blackwell, to start up in 2025.

Nvidia’s streak continues

Nvidia continued to cut training times despite using the same architecture, Hopper, as it did in last year’s training results. That’s all down to software improvements, says Salvator. “Typically, we’ll get a 2-2.5x [boost] from software after a new architecture is released,” he says.

For GPT-3 training, Nvidia logged a 27 percent improvement from the June 2023 MLPerf benchmarks. Salvator says there were several software changes behind the boost. For example, Nvidia engineers tuned up Hopper’s use of less accurate, 8-bit floating point operations by trimming unnecessary conversions between 8-bit and 16-bit numbers and better targeting of which layers of a neural network could use the lower precision number format. They also found a more intelligent way to adjust the power budget of each chip’s compute engines, and sped communication among GPUs in a way that Salvator likened to “buttering your toast while it’s still in the toaster.”

Additionally, the company implemented a scheme called flash attention. Invented in the Stanford University laboratory of SambaNova founder Chris Ré, flash attention is an algorithm that speeds transformer networks by minimizing writes to memory. When it first showed up in MLPerf benchmarks, flash attention shaved as much as 10 percent from training times. (Intel, too, used a version of flash attention but not for GPT-3. It instead used the algorithm for one of the new benchmarks, fine-tuning.)
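The idea of avoiding memory writes can be illustrated with a small Python sketch. It is not Nvidia's or the Stanford lab's implementation, which runs as a GPU kernel working in fast on-chip memory. It only shows the underlying trick for a single query vector: process keys and values one block at a time and keep a running maximum, a running sum, and a running output, so the full list of attention scores is never stored. The function names, the block size, and the sample data are illustrative, and the usual scaling of scores by the square root of the vector dimension is omitted for brevity.

```python
import math

def attention_naive(q, keys, values):
    """Reference version: store every score, then softmax, then weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def attention_tiled(q, keys, values, block=2):
    """Blockwise version: only running statistics survive between blocks."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])      # unnormalized output so far
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Made-up sample data: one query, five key/value pairs.
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

print(attention_naive(q, keys, values))
print(attention_tiled(q, keys, values, block=2))
```

Both functions return the same result up to floating-point rounding; the blockwise one simply never holds more than one block of scores at a time.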

Using other software and network tricks, Nvidia delivered an 80 percent speedup in the text-to-image test, Stable Diffusion, versus its submission in November 2023.

New benchmarks

MLPerf adds new benchmarks and upgrades old ones to stay relevant to what’s happening in the AI industry. This year saw the addition of fine-tuning and graph neural networks.

Fine tuning takes an already trained LLM and specializes it for use in a particular field. Nvidia, for example, took a trained 43-billion-parameter model and trained it on the GPU-maker’s design files and documentation to create ChipNeMo, an AI intended to boost the productivity of its chip designers. At the time, the company’s chief technology officer Bill Dally said that training an LLM was like giving it a liberal arts education, and fine tuning was like sending it to graduate school.

The MLPerf benchmark takes a pretrained Llama-2-70B model and asks the system to fine tune it using a dataset of government documents with the goal of generating more accurate document summaries.

There are several ways to do fine-tuning. MLPerf chose one called low-rank adaptation (LoRA). The method winds up training only a small portion of the LLM’s parameters, leading to a 3-fold lower burden on hardware and reduced use of memory and storage versus other methods, according to the organization.
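In the standard formulation of LoRA, a large weight matrix W is frozen and only two small matrices, B and A, are trained; their product is added to W's output. The Python sketch below counts the parameters involved and runs a toy forward pass. The layer size, the rank, and the function names are hypothetical, chosen only for illustration, and do not reflect the configuration MLPerf uses.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (frozen weights in W, trainable weights in B and A)."""
    return d_out * d_in, rank * (d_in + d_out)

def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]

def lora_forward(W, A, B, x):
    """y = W x + B (A x): frozen weight plus a low-rank trainable correction."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]

# Hypothetical layer size and rank.
frozen, trainable = lora_param_counts(d_out=8192, d_in=8192, rank=16)
print(f"frozen: {frozen:,}  trainable: {trainable:,}  "
      f"share: {trainable / frozen:.2%}")

# Toy example: 2x2 frozen weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]            # 1x2 factor
B = [[0.0], [0.0]]          # 2x1 factor, zero at the start so the update is zero
print(lora_forward(W, A, B, [3.0, 4.0]))
```

For this hypothetical layer, the trainable factors hold well under 1 percent as many values as the frozen matrix, which is the source of the memory and storage savings.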

The other new benchmark involved a graph neural network (GNN). These are for problems that can be represented by a very large set of interconnected nodes, such as a social network or a recommender system. Compared to other AI tasks, GNNs require a lot of communication between nodes in a computer.

The benchmark trained a GNN on a database that shows relationships among academic authors, papers, and institutes, forming a graph with 547 million nodes and 5.8 billion edges. The neural network was then trained to predict the right label for each node in the graph.
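The communication-heavy character of GNNs comes from their basic operation, in which every node repeatedly combines its own features with those of its neighbors. The Python sketch below shows one such round using a plain average on a made-up three-node graph. Real GNNs also apply learned weights and nonlinear functions at each round, and nothing here reflects the specific model in the MLPerf benchmark; node names and feature values are invented.

```python
def aggregate_neighbors(features, edges):
    """One message-passing round: each node averages itself with its neighbors."""
    neighbors = {node: [node] for node in features}   # include the node itself
    for a, b in edges:                                # edges treated as undirected
        neighbors[a].append(b)
        neighbors[b].append(a)
    updated = {}
    for node, nbrs in neighbors.items():
        dim = len(features[node])
        updated[node] = [sum(features[n][j] for n in nbrs) / len(nbrs)
                         for j in range(dim)]
    return updated

# Invented toy graph: one author linked to two papers.
features = {
    "paper_1": [1.0, 0.0],
    "paper_2": [0.0, 1.0],
    "author_a": [0.0, 0.0],
}
edges = [("author_a", "paper_1"), ("author_a", "paper_2")]

for node, vector in aggregate_neighbors(features, edges).items():
    print(node, vector)
```

After one round the author node carries a blend of both papers' features. On a graph with billions of edges, each round requires moving that kind of information between whichever machines hold the connected nodes.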

Future fights

Training rounds in 2025 may see head-to-head contests comparing new accelerators from AMD, Intel, and Nvidia. AMD’s MI300 series was launched about six months ago, and a memory-boosted upgrade, the MI325x, is planned for the end of 2024, with the next-generation MI350 slated for 2025. Intel says its Gaudi 3, generally available to computer makers later this year, will appear in MLPerf’s upcoming inferencing benchmarks. Intel executives have said the new chip has the capacity to beat H100 at training LLMs. But the victory may be short-lived, as Nvidia has unveiled a new architecture, Blackwell, which is planned for late this year.

Hybrid Bonding Plays Starring Role in 3D Chips
11 August 2024 at 15:00

By: Samuel K. Moore

Chipmakers continue to claw for every spare nanometer to continue scaling down circuits, but a technology involving things that are much bigger—hundreds or thousands of nanometers across—could be just as significant over the next five years.

Called hybrid bonding, that technology stacks two or more chips atop one another in the same package. That allows chipmakers to increase the number of transistors in their processors and memories despite a general slowdown in the shrinking of transistors, which once drove Moore’s Law. At the IEEE Electronic Components and Technology Conference (ECTC) this past May in Denver, research groups from around the world unveiled a variety of hard-fought improvements to the technology, with a few showing results that could lead to a record density of connections between 3D stacked chips: some 7 million links per square millimeter of silicon.
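To get a feel for what 7 million links per square millimeter means physically, the density can be converted into a spacing between connections. The Python sketch below assumes the links sit on a uniform square grid, which is a simplification; the function name is illustrative.

```python
import math

def pitch_nm(links_per_mm2: float) -> float:
    """Center-to-center spacing, in nanometers, for a square grid of links."""
    links_per_side = math.sqrt(links_per_mm2)   # links along one 1-mm edge
    return 1_000_000 / links_per_side           # 1 mm = 1,000,000 nm

print(f"{pitch_nm(7e6):.0f} nm between link centers")
```

Under that assumption the record density corresponds to a spacing of a little under 400 nanometers, in line with the "hundreds of nanometers" scale of hybrid-bonding features.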

All those connections are needed because of the new nature of progress in semiconductors, Intel’s Yi Shi told engineers at ECTC. Moore’s Law is now governed by a concept called system technology co-optimization, or STCO, whereby a chip’s functions, such as cache memory, input/output, and logic, are fabricated separately using the best manufacturing technology for each. Hybrid bonding and other advanced packaging tech can then be used to assemble these subsystems so that they work every bit as well as a single piece of silicon. But that can happen only when there’s a high density of connections that can shuttle bits between the separate pieces of silicon with little delay or energy consumption.

Out of all the advanced-packaging technologies, hybrid bonding provides the highest density of vertical connections. Consequently, it is the fastest growing segment of the advanced-packaging industry, says Gabriela Pereira, technology and market analyst at Yole Group. The overall market is set to more than triple to US $38 billion by 2029, according to Yole, which projects that hybrid bonding will make up about half the market by then, although today it’s just a small portion.

In hybrid bonding, copper pads are built on the top face of each chip. The copper is surrounded by insulation, usually silicon oxide, and the pads themselves are slightly recessed from the surface of the insulation. After the oxide is chemically modified, the two chips are then pressed together face-to-face, so that the recessed pads on each align. This sandwich is then slowly heated, causing the copper to expand across the gap and fuse, connecting the two chips.

Making Hybrid Bonding Better

Hybrid bonding starts with two wafers or a chip and a wafer facing each other. The mating surfaces are covered in oxide insulation and slightly recessed copper pads connected to the chips’ interconnect layers.
The wafers are pressed together to form an initial bond between the oxides.
The stacked wafers are then heated slowly, strongly linking the oxides and expanding the copper to form an electrical connection.

To form more secure bonds, engineers are flattening the last few nanometers of oxide. Even slight bulges or warping can break dense connections.
The copper must be recessed from the surface of the oxide just the right amount. Too much and it will fail to form a connection. Too little and it will push the wafers apart. Researchers are working on ways to control the level of copper down to single atomic layers.
The initial links between the wafers are weak hydrogen bonds. After annealing, the links are strong covalent bonds. Researchers expect that using different types of surfaces, such as silicon carbonitride, which has more locations to form chemical bonds, will lead to stronger links between the wafers.
The final step in hybrid bonding can take hours and require high temperatures. Researchers hope to lower the temperature and shorten the process time.
Although the copper from both wafers presses together to form an electrical connection, the metal’s grain boundaries generally do not cross from one side to the other. Researchers are trying to cause large single grains of copper to form across the boundary to improve conductance and stability.

Hybrid bonding can either attach individual chips of one size to a wafer full of chips of a larger size or bond two full wafers of chips of the same size. Thanks in part to its use in camera chips, the latter process is more mature than the former, Pereira says. For example, engineers at the European microelectronics-research institute Imec have created some of the most dense wafer-on-wafer bonds ever, with a bond-to-bond distance (or pitch) of just 400 nanometers. But Imec managed only a 2-micrometer pitch for chip-on-wafer bonding.

Even that 2-μm chip-on-wafer pitch is a huge improvement over the advanced 3D chips in production today, which have connections about 9 μm apart. And it’s an even bigger leap over the predecessor technology: “microbumps” of solder, which have pitches in the tens of micrometers.
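To put those pitches in terms of connection density, here is a rough sketch that assumes bond pads sit on a simple square grid, so that density is one connection per pitch squared. The square-grid layout and the function name are assumptions made for illustration; real pad layouts may differ.

```python
def links_per_mm2(pitch_nm: float) -> float:
    """Connections per square millimeter, assuming a square grid of pads."""
    pitch_mm = pitch_nm * 1e-6  # 1 nm = 1e-6 mm
    return 1.0 / pitch_mm ** 2


# Pitches quoted in the article, in nanometers.
pitches_nm = {
    "wafer-on-wafer (Imec)": 400,
    "chip-on-wafer (Imec)": 2_000,
    "3D chips in production today": 9_000,
}

for label, pitch in pitches_nm.items():
    density = links_per_mm2(pitch)
    print(f"{label}: {pitch} nm pitch -> about {density:,.0f} links per mm^2")
```

Under this assumption, a 400-nm pitch works out to roughly 6 million connections per square millimeter, and the 7 million figure cited at ECTC would correspond to a pitch a little under 400 nm.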

“With the equipment available, it’s easier to align wafer to wafer than chip to wafer. Most processes for microelectronics are made for [full] wafers,” says Jean-Charles Souriau, scientific leader in integration and packaging at the French research organization CEA Leti. But it’s chip-on-wafer (or die-to-wafer) that’s making a splash in high-end processors such as those from AMD, where the technique is used to assemble compute cores and cache memory in its advanced CPUs and AI accelerators.

In pushing for tighter and tighter pitches for both scenarios, researchers are focused on making surfaces flatter, getting bound wafers to stick together better, and cutting the time and complexity of the whole process. Getting it right could revolutionize how chips are designed.

WoW, Those Are Some Tight Pitches

The recent wafer-on-wafer (WoW) research that achieved the tightest pitches—from 360 nm to 500 nm—involved a lot of effort on one thing: flatness. To bond two wafers together with 100-nm-level accuracy, the whole wafer has to be nearly perfectly flat. If it’s bowed or warped to the slightest degree, whole sections won’t connect.

Flattening wafers is the job of a process called chemical mechanical planarization, or CMP. It’s essential to chipmaking generally, especially for producing the layers of interconnects above the transistors.

“CMP is a key parameter we have to control for hybrid bonding,” says Souriau. The results presented at ECTC show CMP being taken to another level, not just flattening across the wafer but reducing mere nanometers of roundness on the insulation between the copper pads to ensure better connections.

Other researchers focused on ensuring those flattened parts stick together strongly enough. They did so by experimenting with different surface materials such as silicon carbonitride instead of silicon oxide and by using different schemes to chemically activate the surface. Initially, when wafers or dies are pressed together, they are held in place with relatively weak hydrogen bonds, and the concern is whether everything will stay in place during further processing steps. After attachment, wafers and chips are then heated slowly, in a process called annealing, to form stronger chemical bonds. Just how strong these bonds are—and even how to figure that out—was the subject of much of the research presented at ECTC.

Part of that final bond strength comes from the copper connections. The annealing step expands the copper across the gap to form a conductive bridge. Controlling the size of that gap is key, explains Samsung’s Seung Ho Hahn. Too little expansion, and the copper won’t fuse. Too much, and the wafers will be pushed apart. It’s a matter of nanometers, and Hahn reported research on a new chemical process that he hopes to use to get it just right by etching away the copper a single atomic layer at a time.
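A back-of-the-envelope sketch suggests why the recess is a matter of nanometers. It treats each copper pad as free to expand vertically relative to the surrounding oxide, which is a simplification: real pads are constrained by the insulation around them and can deform in other ways. The expansion coefficients are approximate textbook values, and the 500-nm pad thickness and 25 °C starting temperature are hypothetical numbers chosen only for illustration. The 300 °C anneal temperature is the typical figure reported at ECTC.

```python
# Approximate linear thermal-expansion coefficients (assumed textbook values).
CTE_COPPER_PER_K = 17e-6
CTE_OXIDE_PER_K = 0.5e-6


def pad_growth_nm(pad_thickness_nm: float, anneal_c: float, room_c: float = 25.0) -> float:
    """Vertical growth of a copper pad relative to the oxide, free-expansion estimate."""
    delta_t = anneal_c - room_c
    return (CTE_COPPER_PER_K - CTE_OXIDE_PER_K) * pad_thickness_nm * delta_t


def closes_gap(recess_per_pad_nm: float, pad_thickness_nm: float, anneal_c: float) -> bool:
    """True if each pad grows at least as far as it was recessed."""
    return pad_growth_nm(pad_thickness_nm, anneal_c) >= recess_per_pad_nm


# Hypothetical 500-nm-thick pad annealed at 300 degrees C.
growth = pad_growth_nm(pad_thickness_nm=500.0, anneal_c=300.0)
print(f"Estimated growth per pad: {growth:.2f} nm")
print("2-nm recess closes:", closes_gap(2.0, 500.0, 300.0))
print("5-nm recess closes:", closes_gap(5.0, 500.0, 300.0))
```

With these assumed numbers each pad grows by only a couple of nanometers, so a recess that is off by a few nanometers in either direction could mean the difference between a fused connection and a failed one.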

The quality of the connection counts, too. The metals in chip interconnects are not a single crystal; instead they’re made up of many grains, crystals oriented in different directions. Even after the copper expands, the metal’s grain boundaries often don’t cross from one side to another. Such a crossing should reduce a connection’s electrical resistance and boost its reliability. Researchers at Tohoku University in Japan reported a new metallurgical scheme that could finally generate large, single grains of copper that cross the boundary. “This is a drastic change,” says Takafumi Fukushima, an associate professor at Tohoku. “We are now analyzing what underlies it.”

Other experiments discussed at ECTC focused on streamlining the bonding process. Several sought to reduce the annealing temperature needed to form bonds, typically around 300 °C, so as to minimize any risk of damage to the chips from the prolonged heating. Researchers from Applied Materials presented progress on a method to radically reduce the time needed for annealing, from hours to just 5 minutes.

CoWs That Are Outstanding in the Field

Figure: Imec used plasma etching to dice up chips and give them chamfered corners. The technique relieves mechanical stress that could interfere with bonding. (Image: Imec)

Chip-on-wafer (CoW) hybrid bonding is more useful to makers of advanced CPUs and GPUs at the moment: It allows chipmakers to stack chiplets of different sizes and to test each chip before it’s bound to another, ensuring that they aren’t dooming an expensive CPU with a single flawed part.

But CoW comes with all of the difficulties of WoW and fewer of the options to alleviate them. For example, CMP is designed to flatten wafers, not individual dies. Once dies have been cut from their source wafer and tested, there’s less that can be done to improve their readiness for bonding.

Nevertheless, researchers at Intel reported CoW hybrid bonds with a 3-μm pitch, and, as mentioned, a team at Imec managed 2 μm, largely by making the transferred dies very flat while they were still attached to the wafer and keeping them extra clean throughout the process. Both groups used plasma etching to dice up the dies instead of the usual method, which uses a specialized blade. Unlike a blade, plasma etching doesn’t lead to chipping at the edges, which creates debris that could interfere with connections. It also allowed the Imec group to shape the die, making chamfered corners that relieve mechanical stress that could break connections.

CoW hybrid bonding is going to be critical to the future of high-bandwidth memory (HBM), according to several researchers at ECTC. HBM is a stack of DRAM dies—currently 8 to 12 dies high—atop a control-logic chip. Often placed within the same package as high-end GPUs, HBM is crucial to handling the tsunami of data needed to run large language models like ChatGPT. Today, HBM dies are stacked using microbump technology, so there are tiny balls of solder surrounded by an organic filler between each layer.

But with AI pushing memory demand even higher, DRAM makers want to stack 20 layers or more in HBM chips. The volume that microbumps take up means that these stacks will soon be too tall to fit properly in the package with GPUs. Hybrid bonding would shrink the height of HBMs and also make it easier to remove excess heat from the package, because there would be less thermal resistance between its layers.
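The height argument can be sketched with simple arithmetic. The die thickness and the microbump gap below are hypothetical placeholder values, not figures from the researchers, and the model ignores the control-logic chip and any packaging overhead. It only shows how a per-layer gap adds up as stacks grow.

```python
def stack_height_um(layers: int, die_thickness_um: float, gap_um: float) -> float:
    """Height of a die stack: each layer contributes one die plus one bonding gap."""
    return layers * (die_thickness_um + gap_um)


# Hypothetical values for illustration only.
DIE_THICKNESS_UM = 50.0   # assumed thickness of a thinned DRAM die
MICROBUMP_GAP_UM = 20.0   # assumed solder-and-filler gap between layers
HYBRID_GAP_UM = 0.0       # hybrid bonding joins the surfaces directly

for layers in (8, 12, 16, 20):
    bumped = stack_height_um(layers, DIE_THICKNESS_UM, MICROBUMP_GAP_UM)
    hybrid = stack_height_um(layers, DIE_THICKNESS_UM, HYBRID_GAP_UM)
    print(f"{layers} layers: microbumps {bumped:.0f} um, hybrid bonding {hybrid:.0f} um")
```

Whatever the real dimensions are, the gap term grows in step with the layer count, which is why removing it matters more as stacks get taller.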

At ECTC, Samsung engineers showed that hybrid bonding could yield a 16-layer HBM stack. “I think it’s possible to make a more-than-20-layer stack using this technology,” says Hyeonmin Lee, a senior engineer at Samsung. Other new CoW technology could also help bring hybrid bonding to high-bandwidth memory. Researchers at CEA Leti are exploring what’s known as self-alignment technology, says Souriau. That would help ensure good CoW connections using just chemical processes. Some parts of each surface would be made hydrophobic and some hydrophilic, resulting in surfaces that would slide into place automatically.

At ECTC, researchers from Tohoku University and Yamaha Robotics reported work on a similar scheme, using the surface tension of water to align 5-μm pads on experimental DRAM chips with better than 50-nm accuracy.

The Bounds of Hybrid Bonding

Researchers will almost certainly keep reducing the pitch of hybrid-bonding connections. A 200-nm WoW pitch is not just possible but desirable, Han-Jong Chia, a project manager for pathfinding systems at Taiwan Semiconductor Manufacturing Co., told engineers at ECTC. Within two years, TSMC plans to introduce a technology called backside power delivery. (Intel plans the same for the end of this year.) That technology puts the chip’s chunky power-delivery interconnects below the surface of the silicon instead of above it. With those power conduits out of the way, the uppermost levels can connect better to smaller hybrid-bonding bond pads, TSMC researchers calculate. Backside power delivery with 200-nm bond pads would cut down the capacitance of 3D connections so much that a measure of energy efficiency and signal speed would be as much as eight times better than what can be achieved with 400-nm bond pads.

Figure: Chip-on-wafer hybrid bonding is more useful than wafer-on-wafer bonding, in that it can place dies of one size onto a wafer of larger dies. However, the density of connections that can be achieved is lower than for wafer-on-wafer bonding. (Image: Imec)

At some point in the future, if bond pitches narrow even further, Chia suggests, it might become practical to “fold” blocks of circuitry so they are built across two wafers. That way some of what are now long connections within the block might be able to take a vertical shortcut, potentially speeding computations and lowering power consumption.

And hybrid bonding may not be limited to silicon. “Today there is a lot of development in silicon-to-silicon wafers, but we are also looking to do hybrid bonding between gallium nitride and silicon wafers and glass wafers…everything on everything,” says CEA Leti’s Souriau. His organization even presented research on hybrid bonding for quantum-computing chips, which involves aligning and bonding superconducting niobium instead of copper.

“It’s difficult to say what the limit will be,” Souriau says. “Things are moving very fast.”

This article was updated on 11 August 2024.

This article appears in the September 2024 print issue as “The Copper Connection.”