Reading view

There are new articles available, click to refresh the page.

Amazon's Secret Weapon in Chip Design Is Amazon

15 September 2024 at 15:00

Big-name makers of processors, especially those geared toward cloud-based AI, such as AMD and Nvidia, have been showing signs of wanting to own more of the business of computing, purchasing makers of software, interconnects, and servers. The hope is that control of the “full stack” will give them an edge in designing what their customers want.

Amazon Web Services (AWS) got there ahead of most of the competition, when they purchased chip designer Annapurna Labs in 2015 and proceeded to design CPUs, AI accelerators, servers, and data centers as a vertically-integrated operation. Ali Saidi, the technical lead for the Graviton series of CPUs, and Rami Sinno, director of engineering at Annapurna Labs, explained the advantage of vertically-integrated design and Amazon-scale and showed IEEE Spectrum around the company’s hardware testing labs in Austin, Tex., on 27 August.

Saidi and Sinno on:

What brought you to Amazon Web Services, Rami?

an older man in an eggplant colored polo shirt posing for a portrait Rami SinnoAWS

Rami Sinno: Amazon is my first vertically integrated company. And that was on purpose. I was working at Arm, and I was looking for the next adventure, looking at where the industry is heading and what I want my legacy to be. I looked at two things:

One is vertically integrated companies, because this is where most of the innovation is—the interesting stuff is happening when you control the full hardware and software stack and deliver directly to customers.

And the second thing is, I realized that machine learning, AI in general, is going to be very, very big. I didn’t know exactly which direction it was going to take, but I knew that there is something that is going to be generational, and I wanted to be part of that. I already had that experience prior when I was part of the group that was building the chips that go into the Blackberries; that was a fundamental shift in the industry. That feeling was incredible, to be part of something so big, so fundamental. And I thought, “Okay, I have another chance to be part of something fundamental.”

Does working at a vertically-integrated company require a different kind of chip design engineer?

Sinno: Absolutely. When I hire people, the interview process is going after people that have that mindset. Let me give you a specific example: Say I need a signal integrity engineer. (Signal integrity makes sure a signal going from point A to point B, wherever it is in the system, makes it there correctly.) Typically, you hire signal integrity engineers that have a lot of experience in analysis for signal integrity, that understand layout impacts, can do measurements in the lab. Well, this is not sufficient for our group, because we want our signal integrity engineers also to be coders. We want them to be able to take a workload or a test that will run at the system level and be able to modify it or build a new one from scratch in order to look at the signal integrity impact at the system level under workload. This is where being trained to be flexible, to think outside of the little box has paid off huge dividends in the way that we do development and the way we serve our customers.

“By the time that we get the silicon back, the software’s done” —Ali Saidi, Annapurna Labs

At the end of the day, our responsibility is to deliver complete servers in the data center directly for our customers. And if you think from that perspective, you’ll be able to optimize and innovate across the full stack. A design engineer or a test engineer should be able to look at the full picture because that’s his or her job, deliver the complete server to the data center and look where best to do optimization. It might not be at the transistor level or at the substrate level or at the board level. It could be something completely different. It could be purely software. And having that knowledge, having that visibility, will allow the engineers to be significantly more productive and delivery to the customer significantly faster. We’re not going to bang our head against the wall to optimize the transistor where three lines of code downstream will solve these problems, right?

Do you feel like people are trained in that way these days?

Sinno: We’ve had very good luck with recent college grads. Recent college grads, especially the past couple of years, have been absolutely phenomenal. I’m very, very pleased with the way that the education system is graduating the engineers and the computer scientists that are interested in the type of jobs that we have for them.

The other place that we have been super successful in finding the right people is at startups. They know what it takes, because at a startup, by definition, you have to do so many different things. People who’ve done startups before completely understand the culture and the mindset that we have at Amazon.

[back to top]

What brought you to AWS, Ali?

Ali SaidiAWS

Ali Saidi: I’ve been here about seven and a half years. When I joined AWS, I joined a secret project at the time. I was told: “We’re going to build some Arm servers. Tell no one.”

We started with Graviton 1. Graviton 1 was really the vehicle for us to prove that we could offer the same experience in AWS with a different architecture.

The cloud gave us an ability for a customer to try it in a very low-cost, low barrier of entry way and say, “Does it work for my workload?” So Graviton 1 was really just the vehicle demonstrate that we could do this, and to start signaling to the world that we want software around ARM servers to grow and that they’re going to be more relevant.

Graviton 2—announced in 2019—was kind of our first… what we think is a market-leading device that’s targeting general-purpose workloads, web servers, and those types of things.

It’s done very well. We have people running databases, web servers, key-value stores, lots of applications... When customers adopt Graviton, they bring one workload, and they see the benefits of bringing that one workload. And then the next question they ask is, “Well, I want to bring some more workloads. What should I bring?” There were some where it wasn’t powerful enough effectively, particularly around things like media encoding, taking videos and encoding them or re-encoding them or encoding them to multiple streams. It’s a very math-heavy operation and required more [single-instruction multiple data] bandwidth. We need cores that could do more math.

We also wanted to enable the [high-performance computing] market. So we have an instance type called HPC 7G where we’ve got customers like Formula One. They do computational fluid dynamics of how this car is going to disturb the air and how that affects following cars. It’s really just expanding the portfolio of applications. We did the same thing when we went to Graviton 4, which has 96 cores versus Graviton 3’s 64.

[back to top]

How do you know what to improve from one generation to the next?

Saidi: Far and wide, most customers find great success when they adopt Graviton. Occasionally, they see performance that isn’t the same level as their other migrations. They might say “I moved these three apps, and I got 20 percent higher performance; that’s great. But I moved this app over here, and I didn’t get any performance improvement. Why?” It’s really great to see the 20 percent. But for me, in the kind of weird way I am, the 0 percent is actually more interesting, because it gives us something to go and explore with them.

Most of our customers are very open to those kinds of engagements. So we can understand what their application is and build some kind of proxy for it. Or if it’s an internal workload, then we could just use the original software. And then we can use that to kind of close the loop and work on what the next generation of Graviton will have and how we’re going to enable better performance there.

What’s different about designing chips at AWS?

Saidi: In chip design, there are many different competing optimization points. You have all of these conflicting requirements, you have cost, you have scheduling, you’ve got power consumption, you’ve got size, what DRAM technologies are available and when you’re going to intersect them… It ends up being this fun, multifaceted optimization problem to figure out what’s the best thing that you can build in a timeframe. And you need to get it right.

One thing that we’ve done very well is taken our initial silicon to production.

How?

Saidi: This might sound weird, but I’ve seen other places where the software and the hardware people effectively don’t talk. The hardware and software people in Annapurna and AWS work together from day one. The software people are writing the software that will ultimately be the production software and firmware while the hardware is being developed in cooperation with the hardware engineers. By working together, we’re closing that iteration loop. When you are carrying the piece of hardware over to the software engineer’s desk your iteration loop is years and years. Here, we are iterating constantly. We’re running virtual machines in our emulators before we have the silicon ready. We are taking an emulation of [a complete system] and running most of the software we’re going to run.

So by the time that we get to the silicon back [from the foundry], the software’s done. And we’ve seen most of the software work at this point. So we have very high confidence that it’s going to work.

The other piece of it, I think, is just being absolutely laser-focused on what we are going to deliver. You get a lot of ideas, but your design resources are approximately fixed. No matter how many ideas I put in the bucket, I’m not going to be able to hire that many more people, and my budget’s probably fixed. So every idea I throw in the bucket is going to use some resources. And if that feature isn’t really important to the success of the project, I’m risking the rest of the project. And I think that’s a mistake that people frequently make.

Are those decisions easier in a vertically integrated situation?

Saidi: Certainly. We know we’re going to build a motherboard and a server and put it in a rack, and we know what that looks like… So we know the features we need. We’re not trying to build a superset product that could allow us to go into multiple markets. We’re laser-focused into one.

What else is unique about the AWS chip design environment?

Saidi: One thing that’s very interesting for AWS is that we’re the cloud and we’re also developing these chips in the cloud. We were the first company to really push on running [electronic design automation (EDA)] in the cloud. We changed the model from “I’ve got 80 servers and this is what I use for EDA” to “Today, I have 80 servers. If I want, tomorrow I can have 300. The next day, I can have 1,000.”

We can compress some of the time by varying the resources that we use. At the beginning of the project, we don’t need as many resources. We can turn a lot of stuff off and not pay for it effectively. As we get to the end of the project, now we need many more resources. And instead of saying, “Well, I can’t iterate this fast, because I’ve got this one machine, and it’s busy.” I can change that and instead say, “Well, I don’t want one machine; I’ll have 10 machines today.”

Instead of my iteration cycle being two days for a big design like this, instead of being even one day, with these 10 machines I can bring it down to three or four hours. That’s huge.

How important is Amazon.com as a customer?

Saidi: They have a wealth of workloads, and we obviously are the same company, so we have access to some of those workloads in ways that with third parties, we don’t. But we also have very close relationships with other external customers.

So last Prime Day, we said that 2,600 Amazon.com services were running on Graviton processors. This Prime Day, that number more than doubled to 5,800 services running on Graviton. And the retail side of Amazon used over 250,000 Graviton CPUs in support of the retail website and the services around that for Prime Day.

[back to top]

The AI accelerator team is colocated with the labs that test everything from chips through racks of servers. Why?

Sinno: So Annapurna Labs has multiple labs in multiple locations as well. This location here is in Austin… is one of the smaller labs. But what’s so interesting about the lab here in Austin is that you have all of the hardware and many software development engineers for machine learning servers and for Trainium and Inferentia [AWS’s AI chips] effectively co-located on this floor. For hardware developers, engineers, having the labs co-located on the same floor has been very, very effective. It speeds execution and iteration for delivery to the customers. This lab is set up to be self-sufficient with anything that we need to do, at the chip level, at the server level, at the board level. Because again, as I convey to our teams, our job is not the chip; our job is not the board; our job is the full server to the customer.

How does vertical integration help you design and test chips for data-center-scale deployment?

Sinno: It’s relatively easy to create a bar-raising server. Something that’s very high-performance, very low-power. If we create 10 of them, 100 of them, maybe 1,000 of them, it’s easy. You can cherry pick this, you can fix this, you can fix that. But the scale that the AWS is at is significantly higher. We need to train models that require 100,000 of these chips. 100,000! And for training, it’s not run in five minutes. It’s run in hours or days or weeks even. Those 100,000 chips have to be up for the duration. Everything that we do here is to get to that point.

We start from a “what are all the things that can go wrong?” mindset. And we implement all the things that we know. But when you were talking about cloud scale, there are always things that you have not thought of that come up. These are the 0.001-percent type issues.

In this case, we do the debug first in the fleet. And in certain cases, we have to do debugs in the lab to find the root cause. And if we can fix it immediately, we fix it immediately. Being vertically integrated, in many cases we can do a software fix for it. We use our agility to rush a fix while at the same time making sure that the next generation has it already figured out from the get go.

[back to top]

Amazon Vies for Nuclear-Powered Data Center

IEEE Spectrum Recent Content full text

Andrew Moseman

12 August 2024 at 20:36

When Amazon Web Services paid US $650 million in March for another data center to add to its armada, the tech giant thought it was buying a steady supply of nuclear energy to power it, too. The Susquehanna Steam Electric Station outside of Berick, Pennsylvania, which generates 2.5 gigawatts of nuclear power, sits adjacent to the humming data center and had been directly powering it since the center opened in 2023.

After striking the deal, Amazon wanted to change the terms of its original agreement to buy 180 megawatts of additional power directly from the nuclear plant. Susquehanna agreed to sell it. But third parties weren’t happy about that, and their deal has become bogged down in a regulatory battle that will likely set a precedent for data centers, cryptocurrency mining operations, and other computing facilities with voracious appetites for clean electricity.

Putting a data center right next to a power plant so that it can draw electricity from it directly, rather than from the grid, is becoming more common as data centers seek out cheap, steady, carbon-free power. Proposals for co-locating data centers next to nuclear power have popped up in New Jersey, Texas, Ohio, and elsewhere. Sweden is considering using small modular reactors to power future data centers.

However, co-location raises questions about equity and energy security, because directly-connected data centers can avoid paying fees that would otherwise help maintain grids. They also hog hundreds of megawatts that could be going elsewhere.

“They’re effectively going behind the meter and taking that capacity off of the grid that would otherwise serve all customers,” says Tony Clark, a senior advisor at the law firm Wilkinson Barker Knauer and a former commissioner at the Federal Energy Regulatory Commission (FERC), who has testified to a U.S. House subcommittee on the subject.

Amazon’s nuclear power deal meets hurdles

The dust-up over the Amazon-Susquehanna agreement started in June, after Amazon subsidiary Amazon Web Services filed a notice to change its interconnection service agreement (ISA) in order to buy more nuclear power from Susquehanna’s parent company, Talen Energy. Amazon wanted to increase the amount of behind-the-meter power it buys from the plant from 300 MW to 480 MW. Shortly after it requested the change, utility giants Exelon and American Electric Power (AEP), filed a protest against the agreement and asked FERC to hold a hearing on the matter.

Their complaint: the deal between Amazon and the nuclear plant would hurt a third party, namely all the customers who buy power from AEP or Exelon utilities. The protest document argues that the arrangement would shift up to $140 million in extra costs onto the people of Pennsylvania, New Jersey, and other states served by PJM, a regional transmission organization that oversees the grid in those areas. “Multiplied by the many similar projects on the drawing board, it is apparent that this unsupported filing has huge financial consequences that should not be imposed on ratepayers without sufficient process to determine and evaluate what is really going on,” their complaint says.

Susquehanna dismissed the argument, effectively saying that its deal with Amazon is none of AEP and Exelon’s business. “It is an unlawful attempt to hijack this limited [ISA] amendment proceeding that they have no stake in and turn it into an ad hoc national referendum on the future of data center load,” Susquehanna’s statement said. (AEP, Exelon, Talen/Susquehanna, and Amazon all declined to comment for this story.)

More disputes like this will likely follow as more data centers co-locate with clean energy. Kevin Schneider, a power system expert at Pacific Northwest National Laboratory and research professor at Washington State University, says it’s only natural that data center operators want the constant, consistent nature of nuclear power. “If you look at the base load nature of nuclear, you basically run it up to a power level and leave it there. It can be well aligned with a server farm.”

Data center operators are also exploring energy options from solar and wind, but these energy sources would have a difficult time matching the constancy of nuclear, even with grid storage to help even out their supply. So giant tech firms look to nuclear to keep their servers running without burning fossil fuels, and use that to trumpet their carbon-free achievements, as Amazon did when it bought the data center in Pennsylvania. “Whether you’re talking about Google or Apple or Microsoft or any of those companies, they tend to have corporate sustainability goals. Being served by a nuclear unit looks great on their corporate carbon balance sheet,” Clark says.

Costs of data centers seeking nuclear energy

Yet such arrangements could have major consequences for other energy customers, Clark argues. For one, directing all the energy from a nuclear plant to a data center is, fundamentally, no different than retiring that plant and taking it offline. “It’s just a huge chunk of capacity leaving the system,” he says, resulting in higher prices and less energy supply for everyone else.

Another issue is the “behind-the-meter” aspect of these kinds of deals. A data center could just connect to the grid and draw from the same supply as everyone else, Clark says. But by connecting directly to the power plant, the center’s owner avoids paying the administrative fees that are used to maintain the grid and grow its infrastructure. Those costs could then get passed on to businesses and residents who have to buy power from the grid. “There’s just a whole list of charges that get assessed through the network service that if you don’t connect through the network, you don’t have to pay,” Clark says. “And those charges are the part of the bill that will go up” for everyone else.

Even the “carbon-free” public relations talking points that come with co-location may be suspect in some cases. In Washington State, where Schneider works, new data centers are being planted next to the region’s abundant hydropower stations, and they’re using so much of that energy that parts of the state are considering adding more fossil fuel capacity to make ends meet. This results in a “zero-emissions shell game,” Clark wrote in a white paper on the subject.

These early cases are likely only the beginning. A report posted in May from the Electric Power Research Institute predicts energy demand from data centers will double by 2030, a leap driven by the fact that AI queries need ten times more energy than traditional internet searches. The International Energy Agency puts the timeline for doubling sooner–in 2026. Data centers, AI, and the cryptocurrency sector consumed an estimated 460 terawatt-hours (TWh) in 2022, and could reach more than 1000 TWh in 2026, the agency predicts.

Data centers face energy supply challenges

New data centers can be built in a matter of months, but it takes years to build utility-scale power projects, says Poorvi Patel, manager of strategic insights at Electric Power Research Institute and contributor to the report. The potential for unsustainable growth in electricity needs has put grid operators on alert, and in some cases has sent them sounding the alarm. Eirgrid, a state-owned transmission operator in Ireland, last week warned of a “mass exodus” of data centers in Ireland if it can’t connect new sources of energy.

There’s only so much existing nuclear power to go around, and enormous logistical and regulatory roadblocks to building more. So data center operators and tech giants are looking for creative solutions. Some are considering small modular reactors (SMRs)–which are advanced nuclear reactors with much smaller operating capacities than conventional reactors. Nano Nuclear Energy, which is developing microreactors–a particularly small type of SMR–last month announced an agreement with Blockfusion to explore the possibility of powering a currently defunct cryptomining facility in Niagara Falls, New York.

“To me, it does seem like a space where, if big tech has a voracious electric power needs and they really want that 24/7, carbon-free power, nuclear does seem to be the answer,” Clark says. “They also have the balance sheets to be able to do some of the risk mitigation that might make it attractive to get an SMR up and running.”