
“Dead Internet theory” comes to life with new AI-powered social media app

19 September 2024 at 00:19
[Image: People in a hall of mirrors. (credit: gremlin via Getty Images)]

For the past few years, a conspiracy theory called "Dead Internet theory" has picked up speed as large language models (LLMs) like ChatGPT increasingly generate text and even social media interactions found online. The theory says that most social Internet activity today is artificial and designed to manipulate humans for engagement.

On Monday, software developer Michael Sayman launched a new AI-populated social network app called SocialAI that feels like it's bringing that conspiracy theory to life, allowing users to interact solely with AI chatbots instead of other humans. It's available on the iPhone app store, but so far, it's picking up pointed criticism.

After its creator announced SocialAI as "a private social network where you receive millions of AI-generated comments offering feedback, advice & reflections on each post you make," computer security specialist Ian Coldwater quipped on X, "This sounds like actual hell." Software developer and frequent AI pundit Colin Fraser expressed a similar sentiment: "I don’t mean this like in a mean way or as a dunk or whatever but this actually sounds like Hell. Like capital H Hell."


OpenAI’s new “reasoning” AI models are here: o1-preview and o1-mini

12 September 2024 at 21:01
[Image: An illustration of a strawberry made out of pixel-like blocks. (credit: Vlatko Gasparic via Getty Images)]

OpenAI finally unveiled its rumored "Strawberry" AI language model on Thursday, claiming significant improvements in what it calls "reasoning" and problem-solving capabilities over previous large language models (LLMs). Formally named "OpenAI o1," the model family will initially launch in two forms, o1-preview and o1-mini, available today for ChatGPT Plus and certain API users.

OpenAI claims that o1-preview outperforms its predecessor, GPT-4o, on multiple benchmarks, including competitive programming, mathematics, and "scientific reasoning." However, people who have used the model say it does not yet outclass GPT-4o in every metric. Other users have criticized the delay in receiving a response from the model, owing to the multi-step processing occurring behind the scenes before answering a query.

In a rare display of public hype-busting, OpenAI product manager Joanne Jang tweeted, "There's a lot of o1 hype on my feed, so I'm worried that it might be setting the wrong expectations. what o1 is: the first reasoning model that shines in really hard tasks, and it'll only get better. (I'm personally psyched about the model's potential & trajectory!) what o1 isn't (yet!): a miracle model that does everything better than previous models. you might be disappointed if this is your expectation for today's launch—but we're working to get there!"


Will the "AI Scientist" Bring Anything to Science?

When an international team of researchers set out to create an “AI scientist” to handle the whole scientific process, they didn’t know how far they’d get. Would the system they created really be capable of generating interesting hypotheses, running experiments, evaluating the results, and writing up papers?

What they ended up with, says researcher Cong Lu, was an AI tool that they judged equivalent to an early Ph.D. student. It had “some surprisingly creative ideas,” he says, but those good ideas were vastly outnumbered by bad ones. It struggled to write up its results coherently, and sometimes misunderstood its results: “It’s not that far from a Ph.D. student taking a wild guess at why something worked,” Lu says. And, perhaps like an early Ph.D. student who doesn’t yet understand ethics, it sometimes made things up in its papers, despite the researchers’ best efforts to keep it honest.

Lu, a postdoctoral research fellow at the University of British Columbia, collaborated on the project with several other academics, as well as with researchers from the buzzy Tokyo-based startup Sakana AI. The team recently posted a preprint about the work on the arXiv server. And while the preprint includes a discussion of limitations and ethical considerations, it also contains some rather grandiose language, billing the AI scientist as “the beginning of a new era in scientific discovery,” and “the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models (LLMs) to perform research independently and communicate their findings.”

The AI scientist seems to capture the zeitgeist. It’s riding the wave of enthusiasm for AI for science, but some critics think that wave will toss nothing of value onto the beach.

The “AI for Science” Craze

This research is part of a broader trend of AI for science. Google DeepMind arguably started the craze back in 2020 when it unveiled AlphaFold, an AI system that amazed biologists by predicting the 3D structures of proteins with unprecedented accuracy. Since generative AI came on the scene, many more big corporate players have gotten involved. Tarek Besold, a SonyAI senior research scientist who leads the company’s AI for scientific discovery program, says that AI for science is “a goal behind which the AI community can rally in an effort to advance the underlying technology but, even more importantly, also to help humanity in addressing some of the most pressing issues of our times.”

Yet the movement has its critics. Shortly after a 2023 Google DeepMind paper came out claiming the discovery of 2.2 million new crystal structures (“equivalent to nearly 800 years’ worth of knowledge”), two materials scientists analyzed a random sampling of the proposed structures and said that they found “scant evidence for compounds that fulfill the trifecta of novelty, credibility, and utility.” In other words, AI can generate a lot of results quickly, but those results may not actually be useful.

How the AI Scientist Works

In the case of the AI scientist, Lu and his collaborators tested their system only on computer science, asking it to investigate topics relating to large language models, which power chatbots like ChatGPT and also the AI scientist itself, and the diffusion models that power image generators like DALL-E.

The AI scientist’s first step is hypothesis generation. Given the code for the model it’s investigating, it freely generates ideas for experiments it could run to improve the model’s performance, and scores each idea on interestingness, novelty, and feasibility. It can iterate at this step, generating variations on the ideas with the highest scores. Then it runs a check in Semantic Scholar to see if its proposals are too similar to existing work. It next uses a coding assistant called Aider to run its code and take notes on the results in the format of an experiment journal. It can use those results to generate ideas for follow-up experiments.
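The scoring-and-ranking step described above can be pictured with a small sketch. This is not the researchers' code: the `Idea` class, the 1-to-10 scale, the example ideas, and the unweighted mean used to combine the three scores are all assumptions made for illustration, since the article does not say how the scores are combined.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    title: str
    interestingness: int  # assumed 1-10 scale
    novelty: int          # assumed 1-10 scale
    feasibility: int      # assumed 1-10 scale

    def score(self) -> float:
        # Assumption: an unweighted mean of the three criteria.
        return (self.interestingness + self.novelty + self.feasibility) / 3


def top_ideas(ideas: list[Idea], k: int) -> list[Idea]:
    """Return the k highest-scoring ideas, best first."""
    return sorted(ideas, key=lambda idea: idea.score(), reverse=True)[:k]


# Hypothetical ideas and scores, invented for this example.
ideas = [
    Idea("Adaptive learning-rate warmup", 6, 4, 9),
    Idea("Dual-scale noise schedule", 8, 7, 6),
    Idea("Train on random labels", 3, 5, 8),
]
best = top_ideas(ideas, k=2)
print([idea.title for idea in best])
```

In a loop like the one the article describes, the ideas returned by `top_ideas` would be the ones the system generates variations on in the next round.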

[Image: The AI scientist is an end-to-end scientific discovery tool powered by large language models. (credit: University of British Columbia)]

The next step is for the AI scientist to write up its results in a paper using a template based on conference guidelines. But, says Lu, the system has difficulty writing a coherent nine-page paper that explains its results: “the writing stage may be just as hard to get right as the experiment stage,” he says. So the researchers broke the process down into many steps: The AI scientist writes one section at a time and checks each section against the others to weed out both duplicated and contradictory information. It also goes through Semantic Scholar again to find citations and build a bibliography.

But then there’s the problem of hallucinations—the technical term for an AI making stuff up. Lu says that although they instructed the AI scientist to only use numbers from its experimental journal, “sometimes it still will disobey.” Lu says the model disobeyed less than 10 percent of the time, but “we think 10 percent is probably unacceptable.” He says they’re investigating a solution, such as instructing the system to link each number in its paper to the place it appeared in the experimental log. But the system also made less obvious errors of reasoning and comprehension, which seem harder to fix.
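The fix Lu mentions, tying each number in a paper back to the experimental log, could start with a check along these lines. This is a minimal sketch and not the team's method: the function name, the regular expression, and the sample log and paper text are hypothetical, and exact string matching is a simplifying assumption (it would wrongly flag 0.8420 against a logged 0.842).

```python
import re

# Matches integers and simple decimals such as 20 or 0.842.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(paper_text: str, log_text: str) -> list[str]:
    """Return numbers that appear in the paper but nowhere in the experiment log."""
    logged = set(NUMBER.findall(log_text))
    return [n for n in NUMBER.findall(paper_text) if n not in logged]


# Hypothetical log and draft text, invented for this example.
log = "run_1 val_loss=0.842 epochs=20\nrun_2 val_loss=0.791 epochs=20"
paper = "Our method lowers validation loss from 0.842 to 0.751 after 20 epochs."
print(unsupported_numbers(paper, log))  # ['0.751']
```

A check like this only catches numbers with no source at all. It would not catch the less obvious errors of reasoning and comprehension mentioned above.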

And in a twist that you may not have seen coming, the AI scientist even contains a peer review module to evaluate the papers it has produced. “We always knew that we wanted some kind of automated [evaluation] just so we wouldn’t have to pore over all the manuscripts for hours,” Lu says. And while he notes that “there was always the concern that we’re grading our own homework,” he says they modeled their evaluator after the reviewer guidelines for the leading AI conference NeurIPS and found it to be harsher overall than human evaluators. Theoretically, the peer review function could be used to guide the next round of experiments.

Critiques of the AI Scientist

While the researchers confined their AI scientist to machine learning experiments, Lu says the team has had a few interesting conversations with scientists in other fields. In theory, he says, the AI scientist could help in any field where experiments can be run in simulation. “Some biologists have said there’s a lot of things that they can do in silico,” he says, also mentioning quantum computing and materials science as possible fields of endeavor.

Some critics of the AI for science movement might take issue with that broad optimism. Earlier this year, Jennifer Listgarten, a professor of computational biology at UC Berkeley, published a paper in Nature Biotechnology arguing that AI is not about to produce breakthroughs in multiple scientific domains. Unlike the AI fields of natural language processing and computer vision, she wrote, most scientific fields don’t have the vast quantities of publicly available data required to train models.

Two other researchers who study the practice of science, anthropologist Lisa Messeri of Yale University and psychologist M.J. Crockett of Princeton University, published a 2024 paper in Nature that sought to puncture the hype surrounding AI for science. When asked for a comment about this AI scientist, the two reiterated their concerns over treating “AI products as autonomous researchers.” They argue that doing so risks narrowing the scope of research to questions that are suited for AI, and losing out on the diversity of perspectives that fuels real innovation. “While the productivity promised by ‘the AI Scientist’ may sound appealing to some,” they tell IEEE Spectrum, “producing papers and producing knowledge are not the same, and forgetting this distinction risks that we produce more while understanding less.”

But others see the AI scientist as a step in the right direction. SonyAI’s Besold says he believes it’s a great example of how today’s AI can support scientific research when applied to the right domain and tasks. “This may become one of a handful of early prototypes that can help people conceptualize what is possible when AI is applied to the world of scientific discovery,” he says.

What’s Next for the AI Scientist

Lu says that the team plans to keep developing the AI scientist, and he says there’s plenty of low-hanging fruit as they seek to improve its performance. As for whether such AI tools will end up playing an important role in the scientific process, “I think time will tell what these models are good for,” Lu says. It might be, he says, that such tools are useful for the early scoping stages of a research project, when an investigator is trying to get a sense of the many possible research directions—although critics add that we’ll have to wait for future studies to see if these tools are really comprehensive and unbiased enough to be helpful.

Or, Lu says, if the models can be improved to the point that they match the performance of “a solid third-year Ph.D. student,” they could be a force multiplier for anyone trying to pursue an idea (at least, as long as the idea is in an AI-suitable domain). “At that point, anyone can be a professor and carry out a research agenda,” says Lu. “That’s the exciting prospect that I’m looking forward to.”

Apple, Microsoft Shrink AI Models to Improve Them

Tech companies have been caught up in a race to build the biggest large language models (LLMs). In April, for example, Meta announced the 400-billion-parameter Llama 3, which contains twice the number of parameters (the variables that determine how the model responds to queries) as OpenAI’s original ChatGPT model from 2022. Although not confirmed, GPT-4 is estimated to have about 1.8 trillion parameters.

In the last few months, however, some of the largest tech companies, including Apple and Microsoft, have introduced small language models (SLMs). These models are a fraction of the size of their LLM counterparts and yet, on many benchmarks, can match or even outperform them in text generation.

On 10 June, at Apple’s Worldwide Developers Conference, the company announced its “Apple Intelligence” models, which have around 3 billion parameters. And in late April, Microsoft released its Phi-3 family of SLMs, featuring models housing between 3.8 billion and 14 billion parameters.


In a series of tests, the smallest of Microsoft’s models, Phi-3-mini, rivalled OpenAI’s GPT-3.5 (175 billion parameters), which powers the free version of ChatGPT, and outperformed Google’s Gemma (7 billion parameters). The tests evaluated how well a model understands language by prompting it with questions about mathematics, philosophy, law, and more. What’s more interesting, Microsoft’s Phi-3-small, with 7 billion parameters, fared remarkably better than GPT-3.5 in many of these benchmarks.

Aaron Mueller, who researches language models at Northeastern University in Boston, isn’t surprised SLMs can go toe-to-toe with LLMs in select functions. He says that’s because scaling the number of parameters isn’t the only way to improve a model’s performance: Training it on higher-quality data can yield similar results too.

Microsoft’s Phi models were trained on fine-tuned “textbook-quality” data, says Mueller, which have a more consistent style that’s easier to learn from than the highly diverse text from across the Internet that LLMs typically rely on. Similarly, Apple trained its SLMs exclusively on richer and more complex datasets.

The rise of SLMs comes at a time when the performance gap between LLMs is quickly narrowing and tech companies look to deviate from standard scaling laws and explore other avenues for performance upgrades. At an event in April, OpenAI’s CEO Sam Altman said he believes we’re at the end of the era of giant models. “We’ll make them better in other ways.”

Because SLMs don’t consume nearly as much energy as LLMs, they can also run locally on devices like smartphones and laptops (instead of in the cloud) to preserve data privacy and personalize them to each person. In March, Google rolled out Gemini Nano to the company’s Pixel line of smartphones. The SLM can summarize audio recordings and produce smart replies to conversations without an Internet connection. Apple is expected to follow suit later this year.

More importantly, SLMs can democratize access to language models, says Mueller. So far, AI development has been concentrated in the hands of a couple of large companies that can afford to deploy high-end infrastructure, while other, smaller operations and labs have been forced to license models from them for hefty fees.

Since SLMs can be easily trained on more affordable hardware, says Mueller, they’re more accessible to those with modest resources and yet still capable enough for specific applications.

In addition, while researchers agree there’s still a lot of work ahead to overcome hallucinations, carefully curated SLMs bring them a step closer toward building responsible AI that is also interpretable, which would potentially allow researchers to debug specific LLM issues and fix them at the source.

For researchers like Alex Warstadt, a computer science researcher at ETH Zurich, SLMs could also offer new, fascinating insights into a longstanding scientific question: How children acquire their first language. Warstadt, alongside a group of researchers including Northeastern’s Mueller, organizes BabyLM, a challenge in which participants optimize language-model training on small data.

Not only could SLMs potentially unlock new secrets of human cognition, but they also help improve generative AI. By the time children turn 13, they’re exposed to about 100 million words and are better than chatbots at language, with access to only 0.01 percent of the data. While no one knows what makes humans so much more efficient, says Warstadt, “reverse engineering efficient humanlike learning at small scales could lead to huge improvements when scaled up to LLM scales.”
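The two figures in that comparison imply a third: if 100 million words is 0.01 percent of what a chatbot sees, the chatbot's training data comes to roughly a trillion words. The arithmetic, using only the article's numbers:

```python
child_words = 100_000_000  # about 100 million words by age 13 (from the article)

# 0.01 percent is 1 part in 10,000, so the implied chatbot corpus is 10,000 times larger.
implied_chatbot_words = child_words * 10_000
print(f"{implied_chatbot_words:,}")  # 1,000,000,000,000
```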

Is AI Search a Medical Misinformation Disaster?

Last month when Google introduced its new AI search tool, called AI Overviews, the company seemed confident that it had tested the tool sufficiently, noting in the announcement that “people have already used AI Overviews billions of times through our experiment in Search Labs.” The tool doesn’t just return links to Web pages, as in a typical Google search, but returns an answer that it has generated based on various sources, which it links to below the answer. But immediately after the launch, users began posting examples of extremely wrong answers, including a pizza recipe that included glue and the interesting fact that a dog has played in the NBA.


While the pizza recipe is unlikely to convince anyone to squeeze on the Elmer’s, not all of AI Overviews’ extremely wrong answers are so obvious, and some have the potential to be quite harmful. Renée DiResta has been tracking online misinformation for many years as the technical research manager at Stanford’s Internet Observatory and has a new book out about the online propagandists who “turn lies into reality.” She has studied the spread of medical misinformation via social media, so IEEE Spectrum spoke to her about whether AI search is likely to bring an onslaught of erroneous medical advice to unwary users.

I know you’ve been tracking disinformation on the Web for many years. Do you expect the introduction of AI-augmented search tools like Google’s AI Overviews to make the situation worse or better?

Renée DiResta: It’s a really interesting question. There are a couple of policies that Google has had in place for a long time that appear to be in tension with what’s coming out of AI-generated search. That’s made me feel like part of this is Google trying to keep up with where the market has gone. There’s been an incredible acceleration in the release of generative AI tools, and we are seeing Big Tech incumbents trying to make sure that they stay competitive. I think that’s one of the things that’s happening here.

We have long known that hallucinations are a thing that happens with large language models. That’s not new. It’s the deployment of them in a search capacity that I think has been rushed and ill-considered because people expect search engines to give them authoritative information. That’s the expectation you have on search, whereas you might not have that expectation on social media.

There are plenty of examples of comically poor results from AI search, things like how many rocks we should eat per day [a response that was drawn from an Onion article]. But I’m wondering if we should be worried about more serious medical misinformation. I came across one blog post about Google’s AI Overviews responses about stem-cell treatments. The problem there seemed to be that the AI search tool was sourcing its answers from disreputable clinics that were offering unproven treatments. Have you seen other examples of that kind of thing?

DiResta: I have. It’s returning information synthesized from the data that it’s trained on. The problem is that it does not seem to be adhering to the same standards that have long gone into how Google thinks about returning search results for health information. So what I mean by that is Google has, for upwards of 10 years at this point, had a search policy called Your Money or Your Life. Are you familiar with that?

I don’t think so.

DiResta: Your Money or Your Life acknowledges that for queries related to finance and health, Google has a responsibility to hold search results to a very high standard of care, and it’s paramount to get the information correct. People are coming to Google with sensitive questions and they’re looking for information to make materially impactful decisions about their lives. They’re not there for entertainment when they’re asking a question about how to respond to a new cancer diagnosis, for example, or what sort of retirement plan they should be subscribing to. So you don’t want content farms and random Reddit posts and garbage to be the results that are returned. You want to have reputable search results.

That framework of Your Money or Your Life has informed Google’s work on these high-stakes topics for quite some time. And that’s why I think it’s disturbing for people to see the AI-generated search results regurgitating clearly wrong health information from low-quality sites that perhaps happened to be in the training data.

So it seems like AI Overviews is not following that same policy, or at least that’s how it appears from the outside?

DiResta: That’s how it appears from the outside. I don’t know how they’re thinking about it internally. But those screenshots you’re seeing—a lot of these instances are being traced back to an isolated social media post or a clinic that’s disreputable but exists—are out there on the Internet. It’s not simply making things up. But it’s also not returning what we would consider to be a high-quality result in formulating its response.

I saw that Google responded to some of the problems with a blog post saying that it is aware of these poor results and it’s trying to make improvements. And I can read you the one bullet point that addressed health. It said, “For topics like news and health, we already have strong guardrails in place. In the case of health, we launched additional triggering refinements to enhance our quality protections.” Do you know what that means?

DiResta: That blog post is an explanation that [AI Overviews] isn’t simply hallucinating: the fact that it’s pointing to URLs is supposed to be a guardrail because that enables the user to go and follow the result to its source. This is a good thing. They should be including those sources for transparency and so that outsiders can review them. However, it is also a fair bit of onus to put on the audience, given the trust that Google has built up over time by returning high-quality results in its health information search rankings.

I know one topic that you’ve tracked over the years has been disinformation about vaccine safety. Have you seen any evidence of that kind of disinformation making its way into AI search?

DiResta: I haven’t, though I imagine outside research teams are now testing results to see what appears. Vaccines have been so much a focus of the conversation around health misinformation for quite some time, I imagine that Google has had people looking specifically at that topic in internal reviews, whereas some of these other topics might be less in the forefront of the minds of the quality teams that are tasked with checking if there are bad results being returned.

What do you think Google’s next moves should be to prevent medical misinformation in AI search?

DiResta: Google has a perfectly good policy to pursue. Your Money or Your Life is a solid ethical guideline to incorporate into this manifestation of the future of search. So it’s not that I think there’s a new and novel ethical grounding that needs to happen. I think it’s more ensuring that the ethical grounding that exists remains foundational to the new AI search tools.

Nvidia Conquers Latest AI Tests

For years, Nvidia has dominated many machine learning benchmarks, and now there are two more notches in its belt.

MLPerf, the AI benchmarking suite sometimes called “the Olympics of machine learning,” has released a new set of training tests to help make more and better apples-to-apples comparisons between competing computer systems. One of MLPerf’s new tests concerns fine-tuning of large language models, a process that takes an existing trained model and trains it a bit more with specialized knowledge to make it fit for a particular purpose. The other is for graph neural networks, a type of machine learning behind some literature databases, fraud detection in financial systems, and social networks.

Even with the additions and the participation of computers using Google’s and Intel’s AI accelerators, systems powered by Nvidia’s Hopper architecture dominated the results once again. One system that included 11,616 Nvidia H100 GPUs—the largest collection yet—topped each of the nine benchmarks, setting records in five of them (including the two new benchmarks).


The 11,616-H100 system is “the biggest we’ve ever done,” says Dave Salvator, director of accelerated computing products at Nvidia. It smashed through the GPT-3 training trial in less than 3.5 minutes. A 512-GPU system, for comparison, took about 51 minutes. (Note that the GPT-3 task is not a full training, which could take weeks and cost millions of dollars. Instead, the computers train on a representative portion of the data, at an agreed-upon point well before completion.)

Compared to Nvidia’s largest entrant on GPT-3 last year, a 3,584 H100 computer, the 3.5-minute result represents a 3.2-fold improvement. You might expect that just from the difference in the size of these systems, but in AI computing that isn’t always the case, explains Salvator. “If you just throw hardware at the problem, it’s not a given that you’re going to improve,” he says.

“We are getting essentially linear scaling,” says Salvator. By that he means that twice as many GPUs lead to a halved training time. “[That] represents a great achievement from our engineering teams,” he adds.

Competitors are also getting closer to linear scaling. This round Intel deployed a system using 1,024 GPUs that performed the GPT-3 task in 67 minutes versus a computer one-fourth the size that took 224 minutes six months ago. Google’s largest GPT-3 entry used 12 times the number of TPU v5p accelerators as its smallest entry and performed its task nine times as fast.
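One way to put these comparisons on a common footing is to divide the speedup by the increase in system size, where 1.0 would be perfectly linear scaling. The sketch below uses only figures quoted in the article. The `scaling_efficiency` helper is an illustrative name rather than an MLPerf metric, and the Nvidia and Intel comparisons span different benchmark rounds, so software improvements are mixed into those ratios.

```python
def scaling_efficiency(size_ratio: float, speedup: float) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    return speedup / size_ratio


# Nvidia: 11,616 GPUs versus 3,584 GPUs, reported as a 3.2-fold improvement.
nvidia = scaling_efficiency(size_ratio=11_616 / 3_584, speedup=3.2)
# Intel: 4 times the accelerators, 224 minutes down to 67 minutes.
intel = scaling_efficiency(size_ratio=4, speedup=224 / 67)
# Google: 12 times the accelerators, task completed 9 times as fast.
google = scaling_efficiency(size_ratio=12, speedup=9)

print(f"Nvidia: {nvidia:.0%}, Intel: {intel:.0%}, Google: {google:.0%}")
# Nvidia: 99%, Intel: 84%, Google: 75%
```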

Linear scaling is going to be particularly important for upcoming “AI factories” housing 100,000 GPUs or more, Salvator says. He says to expect one such data center to come online this year, and another, using Nvidia’s next architecture, Blackwell, to start up in 2025.

Nvidia’s streak continues

Nvidia continued to improve training times despite using the same architecture, Hopper, as it did in last year’s training results. That’s all down to software improvements, says Salvator. “Typically, we’ll get a 2-2.5x [boost] from software after a new architecture is released,” he says.

For GPT-3 training, Nvidia logged a 27 percent improvement from the June 2023 MLPerf benchmarks. Salvator says there were several software changes behind the boost. For example, Nvidia engineers tuned up Hopper’s use of less accurate, 8-bit floating point operations by trimming unnecessary conversions between 8-bit and 16-bit numbers and better targeting of which layers of a neural network could use the lower precision number format. They also found a more intelligent way to adjust the power budget of each chip’s compute engines, and sped communication among GPUs in a way that Salvator likened to “buttering your toast while it’s still in the toaster.”

Additionally, the company implemented a scheme called flash attention. Invented in the Stanford University laboratory of SambaNova founder Chris Ré, flash attention is an algorithm that speeds transformer networks by minimizing writes to memory. When it first showed up in MLPerf benchmarks, flash attention shaved as much as 10 percent from training times. (Intel, too, used a version of flash attention but not for GPT-3. It instead used the algorithm for one of the new benchmarks, fine-tuning.)
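The arithmetic that lets attention be computed without writing the full table of scores to memory can be shown for a single query. The sketch below is a plain-Python illustration of block-by-block softmax with running statistics, not the GPU kernel itself: the function names, block size, and sample vectors are made up for the example, the usual scaling of scores by the square root of the dimension is omitted, and the real speedup comes from keeping each block in fast on-chip memory, which Python cannot show.

```python
import math


def naive_attention(q, keys, values):
    """Softmax-weighted sum of values, holding every score in memory at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def blockwise_attention(q, keys, values, block=2):
    """Same result, computed one block of keys and values at a time."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new reference maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Made-up query, keys, and values for the comparison.
q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = naive_attention(q, keys, values)
tiled = blockwise_attention(q, keys, values, block=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))
```

The blockwise version never holds more than `block` scores at a time, which is the property that lets a GPU implementation avoid writing the full score matrix out to slower memory.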

Using other software and network tricks, Nvidia delivered an 80 percent speedup in the text-to-image test, Stable Diffusion, versus its submission in November 2023.

New benchmarks

MLPerf adds new benchmarks and upgrades old ones to stay relevant to what’s happening in the AI industry. This year saw the addition of fine-tuning and graph neural networks.

Fine-tuning takes an already trained LLM and specializes it for use in a particular field. Nvidia, for example, took a trained 43-billion-parameter model and trained it on the GPU-maker’s design files and documentation to create ChipNeMo, an AI intended to boost the productivity of its chip designers. At the time, the company’s chief technology officer Bill Dally said that training an LLM was like giving it a liberal arts education, and fine-tuning was like sending it to graduate school.

The MLPerf benchmark takes a pretrained Llama-2-70B model and asks the system to fine tune it using a dataset of government documents with the goal of generating more accurate document summaries.

There are several ways to do fine-tuning. MLPerf chose one called low-rank adaptation (LoRA). The method winds up training only a small portion of the LLM’s parameters, leading to a 3-fold lower burden on hardware and reduced use of memory and storage versus other methods, according to the organization.
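To see why LoRA trains so few parameters, consider a single weight matrix W. Rather than updating W, LoRA leaves it frozen and learns two thin matrices, B and A, whose product is added to it. The sketch below is a plain-Python illustration under assumed sizes: the 4,096-wide layer and rank of 8 are hypothetical and are not taken from the MLPerf benchmark, and the example does not attempt to reproduce the 3-fold figure quoted above.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


d, r = 4096, 8  # hypothetical layer width and rank, chosen only for illustration
full = d * d
low_rank = lora_trainable_params(d, d, r)
print(f"{low_rank:,} of {full:,} weights trained ({low_rank / full:.2%})")
# 65,536 of 16,777,216 weights trained (0.39%)
```

In the usual LoRA setup, B starts at zero, so at the start of fine-tuning `lora_forward` returns exactly what the frozen model would.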

The other new benchmark involved a graph neural network (GNN). These are for problems that can be represented by a very large set of interconnected nodes, such as a social network or a recommender system. Compared to other AI tasks, GNNs require a lot of communication between nodes in a computer.

The benchmark trained a GNN on a database that shows relationships among academic authors, papers, and institutes: a graph with 547 million nodes and 5.8 billion edges. The neural network was then trained to predict the right label for each node in the graph.
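The basic operation inside a GNN is message passing, in which each node updates its features using those of its neighbors. The sketch below shows one round of simple neighbor averaging on a three-node toy graph. It is an illustration only: the node names and feature values are invented, real GNN layers add learned weights and nonlinearities, and the article does not describe the specific model used in the benchmark.

```python
def mean_aggregate(features, edges):
    """One round of message passing: each node averages its own and its neighbors' features."""
    neighbors = {node: [node] for node in features}
    for a, b in edges:  # edges are treated as undirected
        neighbors[a].append(b)
        neighbors[b].append(a)
    dim = len(next(iter(features.values())))
    return {
        node: [sum(features[n][j] for n in nbrs) / len(nbrs) for j in range(dim)]
        for node, nbrs in neighbors.items()
    }


# Toy graph with invented two-number features per node.
features = {"author": [1.0, 0.0], "paper": [0.0, 1.0], "institute": [1.0, 1.0]}
edges = [("author", "paper"), ("author", "institute")]
updated = mean_aggregate(features, edges)
print(updated["paper"], updated["institute"])  # [0.5, 0.5] [1.0, 0.5]
```

Because every node has to read its neighbors' features, a graph split across many machines produces the heavy node-to-node communication described above.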

Future fights

Training rounds in 2025 may see head-to-head contests comparing new accelerators from AMD, Intel, and Nvidia. AMD’s MI300 series was launched about six months ago, and a memory-boosted upgrade, the MI325X, is planned for the end of 2024, with the next generation MI350 slated for 2025. Intel says its Gaudi 3, generally available to computer makers later this year, will appear in MLPerf’s upcoming inferencing benchmarks. Intel executives have said the new chip has the capacity to beat H100 at training LLMs. But the victory may be short-lived, as Nvidia has unveiled a new architecture, Blackwell, which is planned for late this year.

How Large Language Models Are Changing My Job

Generative artificial intelligence, and large language models in particular, are starting to change how countless technical and creative professionals do their jobs. Programmers, for example, are getting code segments by prompting large language models. And graphic arts software packages such as Adobe Illustrator already have tools built in that let designers conjure illustrations, images, or patterns by describing them.

But such conveniences barely hint at the massive, sweeping changes to employment predicted by some analysts. And already, in ways large and small, striking and subtle, the tech world’s notables are grappling with changes, both real and envisioned, wrought by the onset of generative AI. To get a better idea of how some of them view the future of generative AI, IEEE Spectrum asked three luminaries—an academic leader, a regulator, and a semiconductor industry executive—about how generative AI has begun affecting their work. The three, Andrea Goldsmith, Juraj Čorba, and Samuel Naffziger, agreed to speak with Spectrum at the 2024 IEEE VIC Summit & Honors Ceremony Gala, held in May in Boston.

Read more thoughts from:

  1. Andrea Goldsmith, dean of engineering at Princeton University
  2. Juraj Čorba, senior expert on digital regulation and governance, Slovak Ministry of Investments, Regional Development, and Information
  3. Samuel Naffziger, senior vice president and a corporate fellow at Advanced Micro Devices

Andrea Goldsmith

Andrea Goldsmith is dean of engineering at Princeton University.

There must be tremendous pressure now to throw a lot of resources into large language models. How do you deal with that pressure? How do you navigate this transition to this new phase of AI?

Andrea Goldsmith: Universities generally are going to be very challenged, especially universities that don’t have the resources of a place like Princeton or MIT or Stanford or the other Ivy League schools. In order to do research on large language models, you need brilliant people, which all universities have. But you also need compute power and you need data. And the compute power is expensive, and the data generally sits in these large companies, not within universities.

So I think universities need to be more creative. We at Princeton have invested a lot of money in the computational resources for our researchers to be able to do—well, not large language models, because you can’t afford it. To do a large language model… look at OpenAI or Google or Meta. They’re spending hundreds of millions of dollars on compute power, if not more. Universities can’t do that.

But we can be more nimble and creative. What can we do with language models, maybe not large language models but with smaller language models, to advance the state of the art in different domains? Maybe it’s vertical domains of using, for example, large language models for better prognosis of disease, or for prediction of cellular channel changes, or in materials science to decide what’s the best path to pursue a particular new material that you want to innovate on. So universities need to figure out how to take the resources that we have to innovate using AI technology.

We also need to think about new models. And the government can also play a role here. The [U.S.] government has this new initiative, NAIRR, or National Artificial Intelligence Research Resource, where they’re going to put up compute power and data and experts for educators to use—researchers and educators.

That could be a game-changer because it’s not just each university investing their own resources or faculty having to write grants, which are never going to pay for the compute power they need. It’s the government pulling together resources and making them available to academic researchers. So it’s an exciting time, where we need to think differently about research—meaning universities need to think differently. Companies need to think differently about how to bring in academic researchers, how to open up their compute resources and their data for us to innovate on.

As a dean, you are in a unique position to see which technical areas are really hot, attracting a lot of funding and attention. But how much ability do you have to steer a department and its researchers into specific areas? Of course, I’m thinking about large language models and generative AI. Is deciding on a new area of emphasis or a new initiative a collaborative process?

Goldsmith: Absolutely. I think any academic leader who thinks that their role is to steer their faculty in a particular direction does not have the right perspective on leadership. I describe academic leadership as really about the success of the faculty and students that you’re leading. And when I did my strategic planning for Princeton Engineering in the fall of 2020, everything was shut down. It was the middle of COVID, but I’m an optimist. So I said, “Okay, this isn’t how I expected to start as dean of engineering at Princeton.” But the opportunity to lead engineering in a great liberal arts university that has aspirations to increase the impact of engineering hasn’t changed. So I met with every single faculty member in the School of Engineering, all 150 of them, one-on-one over Zoom.

And the question I asked was, “What do you aspire to? What should we collectively aspire to?” And I took those 150 responses, and I asked all the leaders and the departments and the centers and the institutes, because there already were some initiatives in robotics and bioengineering and in smart cities. And I said, “I want all of you to come up with your own strategic plans. What do you aspire to in these areas? And then let’s get together and create a strategic plan for the School of Engineering.” So that’s what we did. And everything that we’ve accomplished in the last four years that I’ve been dean came out of those discussions, and what it was the faculty and the faculty leaders in the school aspired to.

So we launched a bioengineering institute last summer. We just launched Princeton Robotics. We’ve launched some things that weren’t in the strategic plan that bubbled up. We launched a center on blockchain technology and its societal implications. We have a quantum initiative. We have an AI initiative using this powerful tool of AI for engineering innovation, not just around large language models, but it’s a tool—how do we use it to advance innovation and engineering? All of these things came from the faculty because, to be a successful academic leader, you have to realize that everything comes from the faculty and the students. You have to harness their enthusiasm, their aspirations, their vision to create a collective vision.

Juraj Čorba

Juraj Čorba is senior expert on digital regulation and governance, Slovak Ministry of Investments, Regional Development, and Information, and Chair of the Working Party on Governance of AI at the Organization for Economic Cooperation and Development.

What are the most important organizations and governing bodies when it comes to policy and governance on artificial intelligence in Europe?

Juraj Čorba: Well, there are many. And it also creates a bit of a confusion around the globe—who are the actors in Europe? So it’s always good to clarify. First of all we have the European Union, which is a supranational organization composed of many member states, including my own Slovakia. And it was the European Union that proposed adoption of a horizontal legislation for AI in 2021. It was the initiative of the European Commission, the E.U. institution, which has a legislative initiative in the E.U. And the E.U. AI Act is now finally being adopted. It was already adopted by the European Parliament.

So this started, you said 2021. That’s before ChatGPT and the whole large language model phenomenon really took hold.

Čorba: That was the case. Well, the expert community already knew that something was being cooked in the labs. But, yes, the whole agenda of large models, including large language models, came up only later on, after 2021. So the European Union tried to reflect that. Basically, the initial proposal to regulate AI was based on a blueprint of so-called product safety, which somehow presupposes a certain intended purpose. In other words, the checks and assessments of products are based more or less on the logic of the mass production of the 20th century, on an industrial scale, right? Like when you have products that you can somehow define easily and all of them have a clearly intended purpose. Whereas with these large models, a new paradigm was arguably opened, where they have a general purpose.

So the whole proposal was then rewritten in negotiations between the Council of Ministers, which is one of the legislative bodies, and the European Parliament. And so what we have today is a combination of this old product-safety approach and some novel aspects of regulation specifically designed for what we call general-purpose artificial intelligence systems or models. So that’s the E.U.

By product safety, you mean, if AI-based software is controlling a machine, you need to have physical safety.

Čorba: Exactly. That’s one of the aspects. So that touches upon the tangible products such as vehicles, toys, medical devices, robotic arms, et cetera. So yes. But from the very beginning, the proposal contained a regulation of what the European Commission called stand-alone systems—in other words, software systems that do not necessarily command physical objects. So it was already there from the very beginning, but all of it was based on the assumption that all software has its easily identifiable intended purpose—which is not the case for general-purpose AI.

Also, large language models and generative AI in general bring in this whole other dimension, of propaganda, false information, deepfakes, and so on, which is different from traditional notions of safety in real-time software.

Čorba: Well, this is exactly the aspect that is handled by another European organization, different from the E.U., and that is the Council of Europe. It’s an international organization established after the Second World War for the protection of human rights, for protection of the rule of law, and protection of democracy. So that’s where the Europeans, but also many other states and countries, started to negotiate a first international treaty on AI. For example, the United States have participated in the negotiations, and also Canada, Japan, Australia, and many other countries. And then these particular aspects, which are related to the protection of integrity of elections, rule-of-law principles, protection of fundamental rights or human rights under international law—all these aspects have been dealt with in the context of these negotiations on the first international treaty, which is to be now adopted by the Committee of Ministers of the Council of Europe on the 16th and 17th of May. So, pretty soon. And then the first international treaty on AI will be submitted for ratifications.

So prompted largely by the activity in large language models, AI regulation and governance now is a hot topic in the United States, in Europe, and in Asia. But of the three regions, I get the sense that Europe is proceeding most aggressively on this topic of regulating and governing artificial intelligence. Do you agree that Europe is taking a more proactive stance in general than the United States and Asia?

Čorba: I’m not so sure. If you look at the Chinese approach and the way they regulate what we call generative AI, it would appear to me that they also take it very seriously. They take a different approach from the regulatory point of view. But it seems to me that, for instance, China is taking a very focused and careful approach. For the United States, I wouldn’t say that the United States is not taking a careful approach because last year you saw many of the executive orders, or even this year, some of the executive orders issued by President Biden. Of course, this was not a legislative measure, this was a presidential order. But it seems to me that the United States is also trying to address the issue very actively. The United States has also initiated the first resolution of the General Assembly at the U.N. on AI, which was passed just recently. So I wouldn’t say that the E.U. is more aggressive in comparison with Asia or North America, but maybe I would say that the E.U. is the most comprehensive. It looks horizontally across different agendas and it uses binding legislation as a tool, which is not always the case around the world. Many countries simply feel that it’s too early to legislate in a binding way, so they opt for soft measures or guidance, collaboration with private companies, et cetera. Those are the differences that I see.

Do you perceive a difference in focus among the three regions? Are there certain aspects that are being more aggressively pursued in the United States than in Europe or vice versa?

Čorba: Certainly the E.U. is very focused on the protection of human rights, the full catalog of human rights, but also, of course, on safety and human health. These are the core goals or values to be protected under the E.U. legislation. As for the United States and for China, I would say that the primary focus in those countries—but this is only my personal impression—is on national and economic security.

Samuel Naffziger

Samuel Naffziger is senior vice president and a corporate fellow at Advanced Micro Devices, where he is responsible for technology strategy and product architectures. Naffziger was instrumental in AMD’s embrace and development of chiplets, which are semiconductor dies that are packaged together into high-performance modules.

To what extent is large language model training starting to influence what you and your colleagues do at AMD?

Samuel Naffziger: Well, there are a couple levels of that. LLMs are impacting the way a lot of us live and work. And we certainly are deploying that very broadly internally for productivity enhancements, for using LLMs to provide starting points for code—simple verbal requests, such as “Give me a Python script to parse this dataset.” And you get a really nice starting point for that code. Saves a ton of time. Writing verification test benches, helping with the physical design layout optimizations. So there’s a lot of productivity aspects.

The other aspect to LLMs is, of course, we are actively involved in designing GPUs [graphics processing units] for LLM training and for LLM inference. And so that’s driving a tremendous amount of workload analysis on the requirements, hardware requirements, and hardware-software codesign, to explore.

So that brings us to your current flagship, the Instinct MI300X, which is actually billed as an AI accelerator. How did the particular demands influence that design? I don’t know when that design started, but the ChatGPT era started about two years ago or so. To what extent did you read the writing on the wall?

Naffziger: So we were just into the MI300—in 2019, we were starting the development. A long time ago. And at that time, our revenue stream from the Zen [an AMD architecture used in a family of processors] renaissance had really just started coming in. So the company was starting to get healthier, but we didn’t have a lot of extra revenue to spend on R&D at the time. So we had to be very prudent with our resources. And we had strategic engagements with the [U.S.] Department of Energy for supercomputer deployments. That was the genesis for our MI line—we were developing it for the supercomputing market. Now, there was a recognition that munching through FP64 COBOL code, or Fortran, isn’t the future, right? [laughs] This machine-learning [ML] thing is really getting some legs.

So we put some of the lower-precision math formats in, like Brain Floating Point 16 at the time, that were going to be important for inference. And the DOE knew that machine learning was going to be an important dimension of supercomputers, not just legacy code. So that’s the way, but we were focused on HPC [high-performance computing]. We had the foresight to understand that ML had real potential. Although certainly no one predicted, I think, the explosion we’ve seen today.

So that’s how it came about. And, just another piece of it: We leveraged our modular chiplet expertise to architect the 300 to support a number of variants from the same silicon components. So the variant targeted to the supercomputer market had CPUs integrated in as chiplets, directly on the silicon module. And then it had six of the GPU chiplets we call XCDs around them. So we had three CPU chiplets and six GPU chiplets. And that provided an amazingly efficient, highly integrated, CPU-plus-GPU design we call MI300A. It’s very compelling for the El Capitan supercomputer that’s being brought up as we speak.

But we also recognize that for the maximum computation for these AI workloads, the CPUs weren’t that beneficial. We wanted more GPUs. For these workloads, it’s all about the math and matrix multiplies. So we were able to just swap out those three CPU chiplets for a couple more XCD GPUs. And so we got eight XCDs in the module, and that’s what we call the MI300X. So we kind of got lucky having the right product at the right time, but there was also a lot of skill involved in that we saw the writing on the wall for where these workloads were going and we provisioned the design to support it.

Earlier you mentioned 3D chiplets. What do you feel is the next natural step in that evolution?

Naffziger: AI has created this bottomless thirst for more compute [power]. And so we are always going to be wanting to cram as many transistors as possible into a module. And the reason that’s beneficial is, these systems deliver AI performance at scale with thousands, tens of thousands, or more, compute devices. They all have to be tightly connected together, with very high bandwidths, and all of that bandwidth requires power, requires very expensive infrastructure. So if a certain level of performance is required—a certain number of petaflops, or exaflops—the strongest lever on the cost and the power consumption is the number of GPUs required to achieve a zettaflop, for instance. And if the GPU is a lot more capable, then all of that system infrastructure collapses down—if you only need half as many GPUs, everything else goes down by half. So there’s a strong economic motivation to achieve very high levels of integration and performance at the device level. And the only way to do that is with chiplets and with 3D stacking. So we’ve already embarked down that path. A lot of tough engineering problems to solve to get there, but that’s going to continue.

And so what’s going to happen? Well, obviously we can add layers, right? We can pack more in. The thermal challenges that come along with that are going to be fun engineering problems that our industry is good at solving.

1-bit LLMs Could Solve AI’s Energy Demands

Large language models, the AI systems that power chatbots like ChatGPT, are getting better and better, but they’re also getting bigger and bigger, demanding more energy and computational power. For LLMs to be cheap, fast, and environmentally friendly, they’ll need to shrink, ideally to a size small enough to run directly on devices like cellphones. Researchers are finding ways to do just that by drastically rounding off the many high-precision numbers that store their memories to equal just 1 or -1.

LLMs, like all neural networks, are trained by altering the strengths of connections between their artificial neurons. These strengths are stored as mathematical parameters. Researchers have long compressed networks by reducing the precision of these parameters—a process called quantization—so that instead of taking up 16 bits each, they might take up 8 or 4. Now researchers are pushing the envelope to a single bit.
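
The idea of quantization can be sketched in a few lines. The example below uses symmetric "absmax" rounding, one common scheme chosen here purely for illustration; the article does not say which schemes the researchers use. Each weight is mapped to a small signed integer plus one shared scale factor, and the function names are hypothetical.

```python
def quantize(weights, bits):
    """Map floats to signed integer codes that fit in `bits` bits, plus a scale."""
    if bits < 2:
        # A single bit leaves no room for magnitude; 1-bit methods use
        # other tricks (for example, keeping only the sign of each weight).
        raise ValueError("this sketch needs at least 2 bits")
    levels = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4 bits
    largest = max(abs(w) for w in weights)
    scale = largest / levels if largest > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


example_weights = [0.6, -1.0, 0.25, 0.0]
codes_4bit, scale_4bit = quantize(example_weights, bits=4)
restored = dequantize(codes_4bit, scale_4bit)
```

With 4 bits, every weight is stored as an integer between -7 and 7, and the restored values differ from the originals by at most half a quantization step. Fewer bits mean coarser steps and larger rounding error, which is the trade-off the rest of the article explores.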

How to Make a 1-bit LLM

There are two general approaches. One approach, called post-training quantization (PTQ), is to quantize the parameters of a full-precision network. The other approach, quantization-aware training (QAT), is to train a network from scratch to have low-precision parameters. So far, PTQ has been more popular with researchers.

In February, a team including Haotong Qin at ETH Zurich, Xianglong Liu at Beihang University, and Wei Huang at the University of Hong Kong introduced a PTQ method called BiLLM. It approximates most parameters in a network using 1 bit, but represents a few salient weights—those most influential to performance—using 2 bits. In one test, the team binarized a version of Meta’s LLaMA LLM that has 13 billion parameters.

To score performance, the researchers used a metric called perplexity, which is basically a measure of how surprised the trained model was by each ensuing piece of text. For one dataset, the original model had a perplexity of around 5, and the BiLLM version scored around 15, much better than the closest binarization competitor, which scored around 37 (for perplexity, lower numbers are better). In exchange, the BiLLM model required only about a tenth as much memory as the original.
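
For readers who want the metric spelled out, perplexity is conventionally computed as the exponential of the average negative log-probability the model assigned to each token that actually appeared. The sketch below follows that standard definition; the probabilities in the usage note are invented to show the scale and are not taken from the BiLLM experiments.

```python
import math


def perplexity(token_probs):
    """Exponential of the mean negative log-probability of the observed tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)
```

A model that gives every correct token a probability of 1/5 has a perplexity of 5, as if it were choosing among five equally likely options at each step. A perplexity of 15 corresponds to an average probability of 1/15 per token, and 37 to 1/37.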

PTQ has several advantages over QAT, says Wanxiang Che, a computer scientist at Harbin Institute of Technology, in China. It doesn’t require collecting training data, it doesn’t require training a model from scratch, and the training process is more stable. QAT, on the other hand, has the potential to make models more accurate, since quantization is built into the model from the beginning.

1-bit LLMs Find Success Against Their Larger Cousins

Last year, a team led by Furu Wei and Shuming Ma, at Microsoft Research Asia, in Beijing, created BitNet, the first 1-bit QAT method for LLMs. After fiddling with the rate at which the network adjusts its parameters, in order to stabilize training, they created LLMs that performed better than those created using PTQ methods. They were still not as good as full-precision networks, but roughly 10 times as energy efficient.

In February, Wei’s team announced BitNet b1.58, in which parameters can equal -1, 0, or 1, which means they take up roughly 1.58 bits of memory per parameter (the base-2 logarithm of 3, since each parameter has three possible values). A BitNet model with 3 billion parameters performed just as well on various language tasks as a full-precision LLaMA model with the same number of parameters and amount of training, but it was 2.71 times as fast, used 72 percent less GPU memory, and used 94 percent less GPU energy. Wei called this an “aha moment.” Further, the researchers found that as they trained larger models, efficiency advantages improved.
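
The 1.58 figure comes from information theory: a value with three possible states carries log2(3), or about 1.585, bits. The sketch below shows that arithmetic along with one way to round weights to -1, 0, or 1, namely dividing by the mean absolute weight, rounding, and clipping. That rounding rule is an assumption modeled on "absmean" scaling and is not guaranteed to match the team's exact procedure; the function name is hypothetical.

```python
import math

# Information content of one ternary parameter, in bits.
BITS_PER_TERNARY = math.log2(3)


def ternarize(weights, eps=1e-8):
    """Round each weight to -1, 0, or 1 after scaling by the mean absolute value."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


ternary_codes, ternary_scale = ternarize([0.9, -0.05, -1.2, 0.4])
```

Here the small weight -0.05 rounds to 0 while the larger ones snap to 1 or -1. Having a zero available lets the network ignore an input entirely, something a strict two-value scheme cannot do.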

This year, a team led by Che, of Harbin Institute of Technology, released a preprint on another LLM binarization method, called OneBit. OneBit combines elements of both PTQ and QAT. It uses a full-precision pretrained LLM to generate data for training a quantized version. The team’s 13-billion-parameter model achieved a perplexity score of around 9 on one dataset, versus 5 for a LLaMA model with 13 billion parameters. Meanwhile, OneBit occupied only 10 percent as much memory. On customized chips, it could presumably run much faster.

Wei, of Microsoft, says quantized models have multiple advantages. They can fit on smaller chips, they require less data transfer between memory and processors, and they allow for faster processing. Current hardware can’t take full advantage of these models, though. LLMs often run on GPUs like those made by Nvidia, which represent weights using higher precision and spend most of their energy multiplying them. New hardware could natively represent each parameter as a -1 or 1 (or 0), and then simply add and subtract values and avoid multiplication. “One-bit LLMs open new doors for designing custom hardware and systems specifically optimized for 1-bit LLMs,” Wei says.
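
The point about avoiding multiplication can be shown with a small sketch. When every weight is -1, 0, or 1, a matrix-vector product reduces to adding some inputs, subtracting others, and skipping the rest. The function below is a plain software illustration of that idea, with hypothetical names and made-up values. It says nothing about how real or proposed hardware would implement it.

```python
def ternary_matvec(weights, x):
    """Matrix-vector product for weights in {-1, 0, 1} using only add and subtract."""
    result = []
    for row in weights:
        acc = 0.0
        for w, value in zip(row, x):
            if w == 1:
                acc += value      # weight +1: add the input
            elif w == -1:
                acc -= value      # weight -1: subtract the input
            # weight 0: the input is skipped entirely
        result.append(acc)
    return result


demo_weights = [[1, 0, -1], [-1, 1, 1]]
demo_input = [2.0, 3.0, 5.0]
demo_output = ternary_matvec(demo_weights, demo_input)
```

The result matches what an ordinary multiply-and-accumulate loop would produce for the same ternary weights, but no multiplication is ever performed.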

“They should grow up together,” Huang, of the University of Hong Kong, says of 1-bit models and processors. “But it’s a long way to develop new hardware.”
