“What are the differences between trail shoes and running shoes?”
“What are the best dinosaur toys for a five year old?”
These are some of the open-ended questions customers might ask a helpful sales associate in a brick-and-mortar store. But how can customers get answers to similar questions while shopping online?
Amazon’s answer is Rufus, a shopping assistant powered by generative AI. Rufus helps Amazon customers make more informed shopping decisions by answering a wide range of questions within the Amazon app. Users can get product details, compare options, and receive product recommendations.
I lead the team of scientists and engineers that built the large language model (LLM) that powers Rufus. To build a helpful conversational shopping assistant, we used innovative techniques across multiple aspects of generative AI. We built a custom LLM specialized for shopping; employed retrieval-augmented generation with a variety of novel evidence sources; leveraged reinforcement learning to improve responses; made advances in high-performance computing to improve inference efficiency and reduce latency; and implemented a new streaming architecture to get shoppers their answers faster.
How Rufus Gets Answers
Most LLMs are first trained on a broad dataset that informs the model’s overall knowledge and capabilities, and then are customized for a particular domain. That wouldn’t work for Rufus, since our aim was to train it on shopping data from the very beginning—the entire Amazon catalog, for starters, as well as customer reviews and information from community Q&A posts. So our scientists built a custom LLM that was trained on these data sources along with public information on the web.
But to answer the vast span of questions that could possibly be asked, Rufus must be able to go beyond its initial training data and bring in fresh information. For example, to answer the question, “Is this pan dishwasher-safe?” the LLM first parses the question, then figures out which retrieval sources will help it generate the answer.
Our LLM uses retrieval-augmented generation (RAG) to pull in information from sources known to be reliable, such as the product catalog, customer reviews, and community Q&A posts; it can also call relevant Amazon Stores APIs. Our RAG system is enormously complex, both because of the variety of data sources used and the differing relevance of each one, depending on the question.
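Amazon hasn’t published Rufus’s internals, but the retrieve-then-generate pattern described above can be sketched in a few lines. The source names and the routing heuristic below are purely illustrative assumptions, not Amazon’s actual system:

```python
# Minimal sketch of retrieval-augmented generation (RAG) with source routing.
# Source names and keyword heuristics are illustrative, not Amazon's system.

def route_question(question: str) -> list[str]:
    """Pick which retrieval sources look relevant (hypothetical heuristic)."""
    sources = []
    q = question.lower()
    if any(w in q for w in ("safe", "dishwasher", "material", "dimensions")):
        sources.append("product_catalog")
    if any(w in q for w in ("best", "recommend", "compare", "vs")):
        sources.append("customer_reviews")
    if not sources:  # fall back to broad sources when nothing matches
        sources = ["product_catalog", "community_qa"]
    return sources

def answer(question: str, retrieve, generate) -> str:
    """Retrieve evidence from the chosen sources, then condition the LLM on it."""
    evidence = []
    for source in route_question(question):
        evidence.extend(retrieve(source, question))
    prompt = "Evidence:\n" + "\n".join(evidence) + f"\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)
```

In a real system the router would itself be a learned model and `retrieve` would hit search indexes and store APIs; the point here is only the shape of the parse-route-retrieve-generate loop.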
Every LLM, and every use of generative AI, is a work in progress. For Rufus to get better over time, it needs to learn which responses are helpful and which can be improved. Customers are the best source of that information. Amazon encourages customers to give Rufus feedback, letting the model know if they liked or disliked the answer, and those responses are used in a reinforcement learning process. Over time, Rufus learns from customer feedback and improves its responses.
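As a rough illustration of how thumbs-up/thumbs-down feedback can become a training signal for reinforcement learning, here is a minimal sketch; the aggregation rule is an assumption for illustration, not Amazon’s published method:

```python
# Illustrative only: turning like/dislike feedback into a scalar reward
# that a reinforcement-learning loop could consume downstream.

from collections import defaultdict

class FeedbackRewards:
    def __init__(self):
        # response_id -> [likes, dislikes]
        self.votes = defaultdict(lambda: [0, 0])

    def record(self, response_id: str, liked: bool) -> None:
        self.votes[response_id][0 if liked else 1] += 1

    def reward(self, response_id: str) -> float:
        """Map feedback to a reward in [-1, 1]; 0 when there is no signal yet."""
        likes, dislikes = self.votes[response_id]
        total = likes + dislikes
        return 0.0 if total == 0 else (likes - dislikes) / total
```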
Special Chips and Handling Techniques for Rufus
Rufus needs to be able to engage with millions of customers simultaneously without any noticeable delay. This is particularly challenging since generative AI applications are very compute-intensive, especially at Amazon’s scale.
To minimize delay in generating responses while also maximizing the number of responses that our system could handle, we turned to Amazon’s specialized AI chips, Trainium and Inferentia, which are integrated with core Amazon Web Services (AWS). We collaborated with AWS on optimizations that improve model inference efficiency, which were then made available to all AWS customers.
But standard methods of processing user requests in batches cause latency and throughput problems because it’s difficult to predict how many tokens (in this case, units of text) an LLM will generate as it composes each response. Our scientists worked with AWS to enable Rufus to use continuous batching, an LLM-serving technique that lets the model start serving a new request as soon as any request in the batch finishes, rather than waiting for every request in the batch to finish. This technique improves the computational efficiency of AI chips and allows shoppers to get their answers quickly.
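A toy simulation makes the benefit concrete. The scheduler and token counts below are illustrative only; real serving stacks track prefill, KV-cache memory, and much more:

```python
# Toy comparison of static vs. continuous batching. Each request will
# generate a known number of tokens; one loop iteration = one decode step
# for the whole batch. Numbers are illustrative.

from collections import deque

def static_batching(lengths, batch_size):
    """Baseline: a batch must fully drain before new requests start."""
    steps, i = 0, 0
    while i < len(lengths):
        batch = lengths[i:i + batch_size]
        steps += max(batch)  # the batch runs until its longest request ends
        i += batch_size
    return steps

def continuous_batching(lengths, batch_size):
    """Refill a slot from the queue the moment any sequence finishes."""
    queue = deque(lengths)
    active = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
    steps = 0
    while active:
        steps += 1
        active = [n - 1 for n in active if n > 1]   # drop finished sequences
        while queue and len(active) < batch_size:   # refill freed slots now
            active.append(queue.popleft())
    return steps
```

With one long request and several short ones, continuous batching finishes the same workload in fewer decode steps because short requests no longer wait behind the long one.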
We want Rufus to provide the most relevant and helpful answer to any given question. Sometimes that means a long-form text answer, but sometimes it’s short-form text, or a clickable link to navigate the store. And we had to make sure the presented information follows a logical flow. If we don’t group and format things correctly, we could end up with a confusing response that’s not very helpful to the customer.
That’s why Rufus uses an advanced streaming architecture for delivering responses. Customers don’t need to wait for a long answer to be fully generated—instead, they get the first part of the answer while the rest is being generated. Rufus populates the streaming response with the right data (a process called hydration) by making queries to internal systems. In addition to generating the content for the response, it also generates formatting instructions that specify how various answer elements should be displayed.
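One way to picture the streaming-plus-hydration flow is below; the chunk schema, field names, and lookup are hypothetical, chosen only to show how content, hydration, and formatting instructions can travel together:

```python
# Sketch of a streaming response with hydration: chunks are yielded as soon
# as they exist, each carrying a formatting hint, and {placeholders} are
# filled from internal lookups. All names are hypothetical.

def hydrate(chunk: dict, lookup) -> dict:
    """Replace {placeholders} in the chunk text via an internal data lookup."""
    text = chunk["text"]
    for key in chunk.get("needs", []):
        text = text.replace("{" + key + "}", lookup(key))
    return {"format": chunk["format"], "text": text}

def stream_response(chunks, lookup):
    """Yield display-ready chunks one at a time, without waiting for the rest."""
    for chunk in chunks:
        yield hydrate(chunk, lookup)
```

A client rendering this stream can show the first paragraph immediately and append a formatted link when its chunk arrives, which is the user-visible effect the article describes.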
Even though Amazon has been using AI for more than 25 years to improve the customer experience, generative AI represents something new and transformative. We’re proud of Rufus, and the new capabilities it provides to our customers.
Chipmakers continue to claw for every spare nanometer to continue scaling down circuits, but a technology involving things that are much bigger—hundreds or thousands of nanometers across—could be just as significant over the next five years.
Called hybrid bonding, that technology stacks two or more chips atop one another in the same package. That allows chipmakers to increase the number of transistors in their processors and memories despite a general slowdown in the shrinking of transistors, which once drove Moore’s Law. At the IEEE Electronic Components and Technology Conference (ECTC) this past May in Denver, research groups from around the world unveiled a variety of hard-fought improvements to the technology, with a few showing results that could lead to a record density of connections between 3D stacked chips: some 7 million links per square millimeter of silicon.
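A quick back-of-the-envelope calculation shows how pitch translates into connection density, assuming an idealized square grid of bond pads:

```python
# Connection density for a square grid of bond pads at a given pitch.
# 1 mm = 1,000,000 nm, so pads-per-side = 1e6 / pitch, squared for area.

def links_per_mm2(pitch_nm: float) -> float:
    per_side = 1_000_000 / pitch_nm  # pads along one millimeter
    return per_side ** 2

# A pitch of roughly 380 nm gives about 6.9 million links per mm^2,
# in line with the ~7 million figure reported at ECTC. Today's ~9 um
# production pitch works out to only about 12,000 links per mm^2.
```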
All those connections are needed because of the new nature of progress in semiconductors, Intel’s Yi Shi told engineers at ECTC. Moore’s Law is now governed by a concept called system technology co-optimization, or STCO, whereby a chip’s functions, such as cache memory, input/output, and logic, are fabricated separately using the best manufacturing technology for each. Hybrid bonding and other advanced packaging tech can then be used to assemble these subsystems so that they work every bit as well as a single piece of silicon. But that can happen only when there’s a high density of connections that can shuttle bits between the separate pieces of silicon with little delay or energy consumption.
Out of all the advanced-packaging technologies, hybrid bonding provides the highest density of vertical connections. Consequently, it is the fastest-growing segment of the advanced-packaging industry, says Gabriela Pereira, technology and market analyst at Yole Group. The overall market is set to more than triple to US $38 billion by 2029, according to Yole, which projects that hybrid bonding will make up about half the market by then, although today it’s just a small portion.
In hybrid bonding, copper pads are built on the top face of each chip. The copper is surrounded by insulation, usually silicon oxide, and the pads themselves are slightly recessed from the surface of the insulation. After the oxide is chemically modified, the two chips are then pressed together face-to-face, so that the recessed pads on each align. This sandwich is then slowly heated, causing the copper to expand across the gap and fuse, connecting the two chips.
Making Hybrid Bonding Better
Hybrid bonding starts with two wafers or a chip and a wafer facing each other. The mating surfaces are covered in oxide insulation and slightly recessed copper pads connected to the chips’ interconnect layers.
The wafers are pressed together to form an initial bond between the oxides.
The stacked wafers are then heated slowly, strongly linking the oxides and expanding the copper to form an electrical connection.
To form more secure bonds, engineers are flattening the last few nanometers of oxide. Even slight bulges or warping can break dense connections.
The copper must be recessed from the surface of the oxide just the right amount. Too much and it will fail to form a connection. Too little and it will push the wafers apart. Researchers are working on ways to control the level of copper down to single atomic layers.
The initial links between the wafers are weak hydrogen bonds. After annealing, the links are strong covalent bonds [below]. Researchers expect that using different types of surfaces, such as silicon carbonitride, which has more locations to form chemical bonds, will lead to stronger links between the wafers.
The final step in hybrid bonding can take hours and require high temperatures. Researchers hope to lower the temperature and shorten the process time.
Although the copper from both wafers presses together to form an electrical connection, the metal’s grain boundaries generally do not cross from one side to the other. Researchers are trying to cause large single grains of copper to form across the boundary to improve conductance and stability.
Hybrid bonding can either attach individual chips of one size to a wafer full of chips of a larger size or bond two full wafers of chips of the same size. Thanks in part to its use in camera chips, the latter process is more mature than the former, Pereira says. For example, engineers at the European microelectronics-research institute Imec have created some of the densest wafer-on-wafer bonds ever, with a bond-to-bond distance (or pitch) of just 400 nanometers. But Imec managed only a 2-micrometer pitch for chip-on-wafer bonding.
The latter is a huge improvement over the advanced 3D chips in production today, which have connections about 9 μm apart. And it’s an even bigger leap over the predecessor technology: “microbumps” of solder, which have pitches in the tens of micrometers.
“With the equipment available, it’s easier to align wafer to wafer than chip to wafer. Most processes for microelectronics are made for [full] wafers,” says Jean-Charles Souriau, scientific leader in integration and packaging at the French research organization CEA Leti. But it’s chip-on-wafer (or die-to-wafer) that’s making a splash in high-end processors such as those from AMD, where the technique is used to assemble compute cores and cache memory in its advanced CPUs and AI accelerators.
In pushing for tighter and tighter pitches for both scenarios, researchers are focused on making surfaces flatter, getting bound wafers to stick together better, and cutting the time and complexity of the whole process. Getting it right could revolutionize how chips are designed.
WoW, Those Are Some Tight Pitches
The recent wafer-on-wafer (WoW) research that achieved the tightest pitches—from 360 nm to 500 nm—involved a lot of effort on one thing: flatness. To bond two wafers together with 100-nm-level accuracy, the whole wafer has to be nearly perfectly flat. If it’s bowed or warped to the slightest degree, whole sections won’t connect.
Flattening wafers is the job of a process called chemical mechanical planarization, or CMP. It’s essential to chipmaking generally, especially for producing the layers of interconnects above the transistors.
“CMP is a key parameter we have to control for hybrid bonding,” says Souriau. The results presented at ECTC show CMP being taken to another level, not just flattening across the wafer but reducing mere nanometers of roundness on the insulation between the copper pads to ensure better connections.
“It’s difficult to say what the limit will be. Things are moving very fast.” —Jean-Charles Souriau, CEA Leti
Other researchers focused on ensuring those flattened parts stick together strongly enough. They did so by experimenting with different surface materials such as silicon carbonitride instead of silicon oxide and by using different schemes to chemically activate the surface. Initially, when wafers or dies are pressed together, they are held in place with relatively weak hydrogen bonds, and the concern is whether everything will stay in place during further processing steps. After attachment, wafers and chips are then heated slowly, in a process called annealing, to form stronger chemical bonds. Just how strong these bonds are—and even how to figure that out—was the subject of much of the research presented at ECTC.
Part of that final bond strength comes from the copper connections. The annealing step expands the copper across the gap to form a conductive bridge. Controlling the size of that gap is key, explains Samsung’s Seung Ho Hahn. Too little expansion, and the copper won’t fuse. Too much, and the wafers will be pushed apart. It’s a matter of nanometers, and Hahn reported research on a new chemical process that he hopes to use to get it just right by etching away the copper a single atomic layer at a time.
The quality of the connection counts, too. The metals in chip interconnects are not a single crystal; instead they’re made up of many grains, crystals oriented in different directions. Even after the copper expands, the metal’s grain boundaries often don’t cross from one side to another. Such a crossing should reduce a connection’s electrical resistance and boost its reliability. Researchers at Tohoku University in Japan reported a new metallurgical scheme that could finally generate large, single grains of copper that cross the boundary. “This is a drastic change,” says Takafumi Fukushima, an associate professor at Tohoku. “We are now analyzing what underlies it.”
Other experiments discussed at ECTC focused on streamlining the bonding process. Several sought to reduce the annealing temperature needed to form bonds—typically around 300 °C—so as to minimize any risk of damage to the chips from the prolonged heating. Researchers from Applied Materials presented progress on a method to radically reduce the time needed for annealing—from hours to just 5 minutes.
CoWs That Are Outstanding in the Field
Imec used plasma etching to dice up chips and give them chamfered corners. The technique relieves mechanical stress that could interfere with bonding. [Image: Imec]
Chip-on-wafer (CoW) hybrid bonding is more useful to makers of advanced CPUs and GPUs at the moment: It allows chipmakers to stack chiplets of different sizes and to test each chip before it’s bound to another, ensuring that they aren’t dooming an expensive CPU with a single flawed part.
But CoW comes with all of the difficulties of WoW and fewer of the options to alleviate them. For example, CMP is designed to flatten wafers, not individual dies. Once dies have been cut from their source wafer and tested, there’s less that can be done to improve their readiness for bonding.
Nevertheless, researchers at Intel reported CoW hybrid bonds with a 3-μm pitch, and, as mentioned, a team at Imec managed 2 μm, largely by making the transferred dies very flat while they were still attached to the wafer and keeping them extra clean throughout the process. Both groups used plasma etching to dice up the dies instead of the usual method, which uses a specialized blade. Unlike a blade, plasma etching doesn’t lead to chipping at the edges, which creates debris that could interfere with connections. It also allowed the Imec group to shape the die, making chamfered corners that relieve mechanical stress that could break connections.
CoW hybrid bonding is going to be critical to the future of high-bandwidth memory (HBM), according to several researchers at ECTC. HBM is a stack of DRAM dies—currently 8 to 12 dies high—atop a control-logic chip. Often placed within the same package as high-end GPUs, HBM is crucial to handling the tsunami of data needed to run large language models like ChatGPT. Today, HBM dies are stacked using microbump technology, so there are tiny balls of solder surrounded by an organic filler between each layer.
But with AI pushing memory demand even higher, DRAM makers want to stack 20 layers or more in HBM chips. The volume that microbumps take up means that these stacks will soon be too tall to fit properly in the package with GPUs. Hybrid bonding would shrink the height of HBMs and also make it easier to remove excess heat from the package, because there would be less thermal resistance between its layers.
“I think it’s possible to make a more-than-20-layer stack using this technology.” —Hyeonmin Lee, Samsung
At ECTC, Samsung engineers showed that hybrid bonding could yield a 16-layer HBM stack. “I think it’s possible to make a more-than-20-layer stack using this technology,” says Hyeonmin Lee, a senior engineer at Samsung. Other new CoW technology could also help bring hybrid bonding to high-bandwidth memory. Researchers at CEA Leti are exploring what’s known as self-alignment technology, says Souriau. That would help ensure good CoW connections using just chemical processes. Some parts of each surface would be made hydrophobic and some hydrophilic, resulting in surfaces that would slide into place automatically.
At ECTC, researchers from Tohoku University and Yamaha Robotics reported work on a similar scheme, using the surface tension of water to align 5-μm pads on experimental DRAM chips with better than 50-nm accuracy.
The Bounds of Hybrid Bonding
Researchers will almost certainly keep reducing the pitch of hybrid-bonding connections. A 200-nm WoW pitch is not just possible but desirable, Han-Jong Chia, a project manager for pathfinding systems at Taiwan Semiconductor Manufacturing Co., told engineers at ECTC. Within two years, TSMC plans to introduce a technology called backside power delivery. (Intel plans the same for the end of this year.) That’s a technology that puts the chip’s chunky power-delivery interconnects below the surface of the silicon instead of above it. With those power conduits out of the way, the uppermost levels can connect better to smaller hybrid-bonding bond pads, TSMC researchers calculate. Backside power delivery with 200-nm bond pads would cut down the capacitance of 3D connections so much that a measure of energy efficiency and signal speed would be as much as eight times better than what can be achieved with 400-nm bond pads.
Chip-on-wafer hybrid bonding is more useful than wafer-on-wafer bonding, in that it can place dies of one size onto a wafer of larger dies. However, the density of connections that can be achieved is lower than for wafer-on-wafer bonding. [Image: Imec]
At some point in the future, if bond pitches narrow even further, Chia suggests, it might become practical to “fold” blocks of circuitry so they are built across two wafers. That way some of what are now long connections within the block might be able to take a vertical shortcut, potentially speeding computations and lowering power consumption.
And hybrid bonding may not be limited to silicon. “Today there is a lot of development in silicon-to-silicon wafers, but we are also looking to do hybrid bonding between gallium nitride and silicon wafers and glass wafers…everything on everything,” says CEA Leti’s Souriau. His organization even presented research on hybrid bonding for quantum-computing chips, which involves aligning and bonding superconducting niobium instead of copper.
“It’s difficult to say what the limit will be,” Souriau says. “Things are moving very fast.”
This article was updated on 11 August 2024.
This article appears in the September 2024 print issue as “The Copper Connection.”