Reading view

There are new articles available, click to refresh the page.

Google Is Now Watermarking Its AI-Generated Text

23 October 2024 at 17:00

The chatbot revolution has left our world awash in AI-generated text: It has infiltrated our news feeds, term papers, and inboxes. It’s so absurdly abundant that industries have sprung up to provide moves and countermoves. Some companies offer services to identify AI-generated text by analyzing the material, while others say their tools will “humanize“ your AI-generated text and make it undetectable. Both types of tools have questionable performance, and as chatbots get better and better, it will only get more difficult to tell whether words were strung together by a human or an algorithm.

Here’s another approach: Adding some sort of watermark or content credential to text from the start, which lets people easily check whether the text was AI-generated. New research from Google DeepMind, described today in the journal Nature, offers a way to do just that. The system, called SynthID-Text, doesn’t compromise “the quality, accuracy, creativity, or speed of the text generation,” says Pushmeet Kohli, vice president of research at Google DeepMind and a coauthor of the paper. But the researchers acknowledge that their system is far from foolproof, and isn’t yet available to everyone—it’s more of a demonstration than a scalable solution.

Google has already integrated this new watermarking system into its Gemini chatbot, the company announced today. It has also open-sourced the tool and made it available to developers and businesses, allowing them to use the tool to determine whether text outputs have come from their own large language models (LLMs), the AI systems that power chatbots. However, only Google and those developers currently have access to the detector that checks for the watermark. As Kohli says: “While SynthID isn’t a silver bullet for identifying AI-generated content, it is an important building block for developing more reliable AI identification tools.”

The Rise of Content Credentials

Content credentials have been a hot topic for images and video, and have been viewed as one way to combat the rise of deepfakes. Tech companies and major media outlets have joined together in an initiative called C2PA, which has worked out a system for attaching encrypted metadata to image and video files indicating if they’re real or AI-generated. But text is a much harder problem, since text can so easily be altered to obscure or eliminate a watermark. While SynthID-Text isn’t the first attempt at creating a watermarking system for text, it is the first one to be tested on 20 million prompts.

Outside experts working on content credentials see the DeepMind research as a good step. It “holds promise for improving the use of durable content credentials from C2PA for documents and raw text,” says Andrew Jenks, Microsoft’s director of media provenance and executive chair of the C2PA. “This is a tough problem to solve, and it is nice to see some progress being made,” says Bruce MacCormack, a member of the C2PA steering committee.

How Google’s Text Watermarks Work

SynthID-Text works by discreetly interfering in the generation process: It alters some of the words that a chatbot outputs to the user in a way that’s invisible to humans but clear to a SynthID detector. “Such modifications introduce a statistical signature into the generated text,” the researchers write in the paper. “During the watermark detection phase, the signature can be measured to determine whether the text was indeed generated by the watermarked LLM.”

The LLMs that power chatbots work by generating sentences word by word, looking at the context of what has come before to choose a likely next word. Essentially, SynthID-Text interferes by randomly assigning number scores to candidate words and having the LLM output words with higher scores. Later, a detector can take in a piece of text and calculate its overall score; watermarked text will have a higher score than non-watermarked text. The DeepMind team checked their system’s performance against other text watermarking tools that alter the generation process, and found that it did a better job of detecting watermarked text.

However, the researchers acknowledge in their paper that it’s still easy to alter a Gemini-generated text and fool the detector. Even though users wouldn’t know which words to change, if they edit the text significantly or even ask another chatbot to summarize the text, the watermark would likely be obscured.

Testing Text Watermarks at Scale

To be sure that SynthID-Text truly didn’t make chatbots produce worse responses, the team tested it on 20 million prompts given to Gemini. Half of those prompts were routed to the SynthID-Text system and got a watermarked response, while the other half got the standard Gemini response. Judging by the “thumbs up” and “thumbs down” feedback from users, the watermarked responses were just as satisfactory to users as the standard ones.

Which is great for Google and the developers building on Gemini. But tackling the full problem of identifying AI-generated text (which some call AI slop) will require many more AI companies to implement watermarking technologies—ideally, in an interoperable manner so that one detector could identify text from many different LLMs. And even in the unlikely event that all the major AI companies signed on to some agreement, there would still be the problem of open-source LLMs, which can easily be altered to remove any watermarking functionality.

MacCormack of C2PA notes that detection is a particular problem when you start to think practically about implementation. “There are challenges with the review of text in the wild,” he says, “where you would have to know which watermarking model has been applied to know how and where to look for the signal.” Overall, he says, the researchers still have their work cut out for them. This effort “is not a dead end,” says MacCormack, “but it’s the first step on a long road.”

How and Why Gary Marcus Became AI's Leading Critic

IEEE Spectrum Recent Content full text

Eliza Strickland

17 September 2024 at 14:00

Maybe you’ve read about Gary Marcus’s testimony before the Senate in May of 2023, when he sat next to Sam Altman and called for strict regulation of Altman’s company, OpenAI, as well as the other tech companies that were suddenly all-in on generative AI. Maybe you’ve caught some of his arguments on Twitter with Geoffrey Hinton and Yann LeCun, two of the so-called “godfathers of AI.” One way or another, most people who are paying attention to artificial intelligence today know Gary Marcus’s name, and know that he is not happy with the current state of AI.

He lays out his concerns in full in his new book, Taming Silicon Valley: How We Can Ensure That AI Works for Us, which was published today by MIT Press. Marcus goes through the immediate dangers posed by generative AI, which include things like mass-produced disinformation, the easy creation of deepfake pornography, and the theft of creative intellectual property to train new models (he doesn’t include an AI apocalypse as a danger, he’s not a doomer). He also takes issue with how Silicon Valley has manipulated public opinion and government policy, and explains his ideas for regulating AI companies.

Marcus studied cognitive science under the legendary Steven Pinker, was a professor at New York University for many years, and co-founded two AI companies, Geometric Intelligence and Robust.AI. He spoke with IEEE Spectrum about his path to this point.

What was your first introduction to AI?

portrait of a man wearing a red checkered shirt and a black jacket with glasses Gary MarcusBen Wong

Gary Marcus: Well, I started coding when I was eight years old. One of the reasons I was able to skip the last two years of high school was because I wrote a Latin-to-English translator in the programming language Logo on my Commodore 64. So I was already, by the time I was 16, in college and working on AI and cognitive science.

So you were already interested in AI, but you studied cognitive science both in undergrad and for your Ph.D. at MIT.

Marcus: Part of why I went into cognitive science is I thought maybe if I understood how people think, it might lead to new approaches to AI. I suspect we need to take a broad view of how the human mind works if we’re to build really advanced AI. As a scientist and a philosopher, I would say it’s still unknown how we will build artificial general intelligence or even just trustworthy general AI. But we have not been able to do that with these big statistical models, and we have given them a huge chance. There’s basically been $75 billion spent on generative AI, another $100 billion on driverless cars. And neither of them has really yielded stable AI that we can trust. We don’t know for sure what we need to do, but we have very good reason to think that merely scaling things up will not work. The current approach keeps coming up against the same problems over and over again.

What do you see as the main problems it keeps coming up against?

Marcus: Number one is hallucinations. These systems smear together a lot of words, and they come up with things that are true sometimes and not others. Like saying that I have a pet chicken named Henrietta is just not true. And they do this a lot. We’ve seen this play out, for example, in lawyers writing briefs with made-up cases.

Second, their reasoning is very poor. My favorite examples lately are these river-crossing word problems where you have a man and a cabbage and a wolf and a goat that have to get across. The system has a lot of memorized examples, but it doesn’t really understand what’s going on. If you give it a simpler problem, like one Doug Hofstadter sent to me, like: “A man and a woman have a boat and want to get across the river. What do they do?” It comes up with this crazy solution where the man goes across the river, leaves the boat there, swims back, something or other happens.

Sometimes he brings a cabbage along, just for fun.

Marcus: So those are boneheaded errors of reasoning where there’s something obviously amiss. Every time we point these errors out somebody says, “Yeah, but we’ll get more data. We’ll get it fixed.” Well, I’ve been hearing that for almost 30 years. And although there is some progress, the core problems have not changed.

Let’s go back to 2014 when you founded your first AI company, Geometric Intelligence. At that time, I imagine you were feeling more bullish on AI?

Marcus: Yeah, I was a lot more bullish. I was not only more bullish on the technical side. I was also more bullish about people using AI for good. AI used to feel like a small research community of people that really wanted to help the world.

So when did the disillusionment and doubt creep in?

Marcus: In 2018 I already thought deep learning was getting overhyped. That year I wrote this piece called “Deep Learning, a Critical Appraisal,” which Yann LeCun really hated at the time. I already wasn’t happy with this approach and I didn’t think it was likely to succeed. But that’s not the same as being disillusioned, right?

Then when large language models became popular [around 2019], I immediately thought they were a bad idea. I just thought this is the wrong way to pursue AI from a philosophical and technical perspective. And it became clear that the media and some people in machine learning were getting seduced by hype. That bothered me. So I was writing pieces about GPT-3 [an early version of OpenAI's large language model] being a bullshit artist in 2020. As a scientist, I was pretty disappointed in the field at that point. And then things got much worse when ChatGPT came out in 2022, and most of the world lost all perspective. I began to get more and more concerned about misinformation and how large language models were going to potentiate that.

You’ve been concerned not just about the startups, but also the big entrenched tech companies that jumped on the generative AI bandwagon, right? Like Microsoft, which has partnered with OpenAI?

Marcus: The last straw that made me move from doing research in AI to working on policy was when it became clear that Microsoft was going to race ahead no matter what. That was very different from 2016 when they released [an early chatbot named] Tay. It was bad, they took it off the market 12 hours later, and then Brad Smith wrote a book about responsible AI and what they had learned. But by the end of the month of February 2023, it was clear that Microsoft had really changed how they were thinking about this. And then they had this ridiculous “Sparks of AGI” paper, which I think was the ultimate in hype. And they didn’t take down Sydney after the crazy Kevin Roose conversation where [the chatbot] Sydney told him to get a divorce and all this stuff. It just became clear to me that the mood and the values of Silicon Valley had really changed, and not in a good way.

I also became disillusioned with the U.S. government. I think the Biden administration did a good job with its executive order. But it became clear that the Senate was not going to take the action that it needed. I spoke at the Senate in May 2023. At the time, I felt like both parties recognized that we can’t just leave all this to self-regulation. And then I became disillusioned [with Congress] over the course of the last year, and that’s what led to writing this book.

You talk a lot about the risks inherent in today’s generative AI technology. But then you also say, “It doesn’t work very well.” Are those two views coherent?

Marcus: There was a headline: “Gary Marcus Used to Call AI Stupid, Now He Calls It Dangerous.” The implication was that those two things can’t coexist. But in fact, they do coexist. I still think gen AI is stupid, and certainly cannot be trusted or counted on. And yet it is dangerous. And some of the danger actually stems from its stupidity. So for example, it’s not well-grounded in the world, so it’s easy for a bad actor to manipulate it into saying all kinds of garbage. Now, there might be a future AI that might be dangerous for a different reason, because it’s so smart and wily that it outfoxes the humans. But that’s not the current state of affairs.

You’ve said that generative AI is a bubble that will soon burst. Why do you think that?

Marcus: Let’s clarify: I don’t think generative AI is going to disappear. For some purposes, it is a fine method. You want to build autocomplete, it is the best method ever invented. But there’s a financial bubble because people are valuing AI companies as if they’re going to solve artificial general intelligence. In my view, it’s not realistic. I don’t think we’re anywhere near AGI. So then you’re left with, “Okay, what can you do with generative AI?”

Last year, because Sam Altman was such a good salesman, everybody fantasized that we were about to have AGI and that you could use this tool in every aspect of every corporation. And a whole bunch of companies spent a bunch of money testing generative AI out on all kinds of different things. So they spent 2023 doing that. And then what you’ve seen in 2024 are reports where researchers go to the users of Microsoft’s Copilot—not the coding tool, but the more general AI tool—and they’re like, “Yeah, it doesn’t really work that well.” There’s been a lot of reviews like that this last year.

The reality is, right now, the gen AI companies are actually losing money. OpenAI had an operating loss of something like $5 billion last year. Maybe you can sell $2 billion worth of gen AI to people who are experimenting. But unless they adopt it on a permanent basis and pay you a lot more money, it’s not going to work. I started calling OpenAI the possible WeWork of AI after it was valued at $86 billion. The math just didn’t make sense to me.

What would it take to convince you that you’re wrong? What would be the head-spinning moment?

Marcus: Well, I’ve made a lot of different claims, and all of them could be wrong. On the technical side, if someone could get a pure large language model to not hallucinate and to reason reliably all the time, I would be wrong about that very core claim that I have made about how these things work. So that would be one way of refuting me. It hasn’t happened yet, but it’s at least logically possible.

On the financial side, I could easily be wrong. But the thing about bubbles is that they’re mostly a function of psychology. Do I think the market is rational? No. So even if the stuff doesn’t make money for the next five years, people could keep pouring money into it.

The place that I’d like to prove me wrong is the U.S. Senate. They could get their act together, right? I’m running around saying, “They’re not moving fast enough,” but I would love to be proven wrong on that. In the book, I have a list of the 12 biggest risks of generative AI. If the Senate passed something that actually addressed all 12, then my cynicism would have been mislaid. I would feel like I’d wasted a year writing the book, and I would be very, very happy.

Will the "AI Scientist" Bring Anything to Science?

IEEE Spectrum Recent Content full text

Eliza Strickland

9 September 2024 at 16:41

When an international team of researchers set out to create an “AI scientist” to handle the whole scientific process, they didn’t know how far they’d get. Would the system they created really be capable of generating interesting hypotheses, running experiments, evaluating the results, and writing up papers?

What they ended up with, says researcher Cong Lu, was an AI tool that they judged equivalent to an early Ph.D. student. It had “some surprisingly creative ideas,” he says, but those good ideas were vastly outnumbered by bad ones. It struggled to write up its results coherently, and sometimes misunderstood its results: “It’s not that far from a Ph.D. student taking a wild guess at why something worked,” Lu says. And, perhaps like an early Ph.D. student who doesn’t yet understand ethics, it sometimes made things up in its papers, despite the researchers’ best efforts to keep it honest.

Lu, a postdoctoral research fellow at the University of British Columbia, collaborated on the project with several other academics, as well as with researchers from the buzzy Tokyo-based startup Sakana AI. The team recently posted a preprint about the work on the ArXiv server. And while the preprint includes a discussion of limitations and ethical considerations, it also contains some rather grandiose language, billing the AI scientist as “the beginning of a new era in scientific discovery,” and “the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models (LLMs) to perform research independently and communicate their findings.”

The AI scientist seems to capture the zeitgeist. It’s riding the wave of enthusiasm for AI for science, but some critics think that wave will toss nothing of value onto the beach.

The “AI for Science” Craze

This research is part of a broader trend of AI for science. Google DeepMind arguably started the craze back in 2020 when it unveiled AlphaFold, an AI system that amazed biologists by predicting the 3D structures of proteins with unprecedented accuracy. Since generative AI came on the scene, many more big corporate players have gotten involved. Tarek Besold, a SonyAI senior research scientist who leads the company’s AI for scientific discovery program, says that AI for science is “a goal behind which the AI community can rally in an effort to advance the underlying technology but—even more importantly—also to help humanity in addressing some of the most pressing issues of our times.”

Yet the movement has its critics. Shortly after a 2023 Google DeepMind paper came out claiming the discovery of 2.2 million new crystal structures (“equivalent to nearly 800 years’ worth of knowledge”), two materials scientists analyzed a random sampling of the proposed structures and said that they found “scant evidence for compounds that fulfill the trifecta of novelty, credibility, and utility.” In other words, AI can generate a lot of results quickly, but those results may not actually be useful.

How the AI Scientist Works

In the case of the AI scientist, Lu and his collaborators tested their system only on computer science, asking it to investigate topics relating to large language models, which power chatbots like ChatGPT and also the AI scientist itself, and the diffusion models that power image generators like DALL-E.

The AI scientist’s first step is hypothesis generation. Given the code for the model it’s investigating, it freely generates ideas for experiments it could run to improve the model’s performance, and scores each idea on interestingness, novelty, and feasibility. It can iterate at this step, generating variations on the ideas with the highest scores. Then it runs a check in Semantic Scholar to see if its proposals are too similar to existing work. It next uses a coding assistant called Aider to run its code and take notes on the results in the format of an experiment journal. It can use those results to generate ideas for follow-up experiments.

different colored boxes with arrows and black text against a white background The AI scientist is an end-to-end scientific discovery tool powered by large language models. University of British Columbia

The next step is for the AI scientist to write up its results in a paper using a template based on conference guidelines. But, says Lu, the system has difficulty writing a coherent nine-page paper that explains its results—”the writing stage may be just as hard to get right as the experiment stage,” he says. So the researchers broke the process down into many steps: The AI scientist wrote one section at a time, and checked each section against the others to weed out both duplicated and contradictory information. It also goes through Semantic Scholar again to find citations and build a bibliography.

But then there’s the problem of hallucinations—the technical term for an AI making stuff up. Lu says that although they instructed the AI scientist to only use numbers from its experimental journal, “sometimes it still will disobey.” Lu says the model disobeyed less than 10 percent of the time, but “we think 10 percent is probably unacceptable.” He says they’re investigating a solution, such as instructing the system to link each number in its paper to the place it appeared in the experimental log. But the system also made less obvious errors of reasoning and comprehension, which seem harder to fix.

And in a twist that you may not have seen coming, the AI scientist even contains a peer review module to evaluate the papers it has produced. “We always knew that we wanted some kind of automated [evaluation] just so we wouldn’t have to pour over all the manuscripts for hours,” Lu says. And while he notes that “there was always the concern that we’re grading our own homework,” he says they modeled their evaluator after the reviewer guidelines for the leading AI conference NeurIPS and found it to be harsher overall than human evaluators. Theoretically, the peer review function could be used to guide the next round of experiments.

Critiques of the AI Scientist

While the researchers confined their AI scientist to machine learning experiments, Lu says the team has had a few interesting conversations with scientists in other fields. In theory, he says, the AI scientist could help in any field where experiments can be run in simulation. “Some biologists have said there’s a lot of things that they can do in silico,” he says, also mentioning quantum computing and materials science as possible fields of endeavor.

Some critics of the AI for science movement might take issue with that broad optimism. Earlier this year, Jennifer Listgarten, a professor of computational biology at UC Berkeley, published a paper in Nature Biotechnology arguing that AI is not about to produce breakthroughs in multiple scientific domains. Unlike the AI fields of natural language processing and computer vision, she wrote, most scientific fields don’t have the vast quantities of publicly available data required to train models.

Two other researchers who study the practice of science, anthropologist Lisa Messeri of Yale University and psychologist M.J. Crockett of Princeton University, published a 2024 paper in Nature that sought to puncture the hype surrounding AI for science. When asked for a comment about this AI scientist, the two reiterated their concerns over treating “AI products as autonomous researchers.” They argue that doing so risks narrowing the scope of research to questions that are suited for AI, and losing out on the diversity of perspectives that fuels real innovation. “While the productivity promised by ‘the AI Scientist’ may sound appealing to some,” they tell IEEE Spectrum, “producing papers and producing knowledge are not the same, and forgetting this distinction risks that we produce more while understanding less.”

But others see the AI scientist as a step in the right direction. SonyAI’s Besold says he believes it’s a great example of how today’s AI can support scientific research when applied to the right domain and tasks. “This may become one of a handful of early prototypes that can help people conceptualize what is possible when AI is applied to the world of scientific discovery,” he says.

What’s Next for the AI Scientist

Lu says that the team plans to keep developing the AI scientist, and he says there’s plenty of low-hanging fruit as they seek to improve its performance. As for whether such AI tools will end up playing an important role in the scientific process, “I think time will tell what these models are good for,” Lu says. It might be, he says, that such tools are useful for the early scoping stages of a research project, when an investigator is trying to get a sense of the many possible research directions—although critics add that we’ll have to wait for future studies to see if these tools are really comprehensive and unbiased enough to be helpful.

Or, Lu says, if the models can be improved to the point that they match the performance of “a solid third-year Ph.D. student,” they could be a force multiplier for anyone trying to pursue an idea (at least, as long as the idea is in an AI-suitable domain). “At that point, anyone can be a professor and carry out a research agenda,” says Lu. “That’s the exciting prospect that I’m looking forward to.”

Andrew Ng: Unbiggen AI

IEEE Spectrum Recent Content full text

Eliza Strickland

9 February 2022 at 16:31

Andrew Ng has serious street cred in artificial intelligence. He pioneered the use of graphics processing units (GPUs) to train deep learning models in the late 2000s with his students at Stanford University, cofounded Google Brain in 2011, and then served for three years as chief scientist for Baidu, where he helped build the Chinese tech giant’s AI group. So when he says he has identified the next big shift in artificial intelligence, people listen. And that’s what he told IEEE Spectrum in an exclusive Q&A.

Ng’s current efforts are focused on his company Landing AI, which built a platform called LandingLens to help manufacturers improve visual inspection with computer vision. He has also become something of an evangelist for what he calls the data-centric AI movement, which he says can yield “small data” solutions to big issues in AI, including model efficiency, accuracy, and bias.

Andrew Ng on...

The great advances in deep learning over the past decade or so have been powered by ever-bigger models crunching ever-bigger amounts of data. Some people argue that that’s an unsustainable trajectory. Do you agree that it can’t go on that way?

Andrew Ng: This is a big question. We’ve seen foundation models in NLP [natural language processing]. I’m excited about NLP models getting even bigger, and also about the potential of building foundation models in computer vision. I think there’s lots of signal to still be exploited in video: We have not been able to build foundation models yet for video because of compute bandwidth and the cost of processing video, as opposed to tokenized text. So I think that this engine of scaling up deep learning algorithms, which has been running for something like 15 years now, still has steam in it. Having said that, it only applies to certain problems, and there’s a set of other problems that need small data solutions.

When you say you want a foundation model for computer vision, what do you mean by that?

Ng: This is a term coined by Percy Liang and some of my friends at Stanford to refer to very large models, trained on very large data sets, that can be tuned for specific applications. For example, GPT-3 is an example of a foundation model [for NLP]. Foundation models offer a lot of promise as a new paradigm in developing machine learning applications, but also challenges in terms of making sure that they’re reasonably fair and free from bias, especially if many of us will be building on top of them.

What needs to happen for someone to build a foundation model for video?

Ng: I think there is a scalability problem. The compute power needed to process the large volume of images for video is significant, and I think that’s why foundation models have arisen first in NLP. Many researchers are working on this, and I think we’re seeing early signs of such models being developed in computer vision. But I’m confident that if a semiconductor maker gave us 10 times more processor power, we could easily find 10 times more video to build such models for vision.

Having said that, a lot of what’s happened over the past decade is that deep learning has happened in consumer-facing companies that have large user bases, sometimes billions of users, and therefore very large data sets. While that paradigm of machine learning has driven a lot of economic value in consumer software, I find that that recipe of scale doesn’t work for other industries.

It’s funny to hear you say that, because your early work was at a consumer-facing company with millions of users.

Ng: Over a decade ago, when I proposed starting the Google Brain project to use Google’s compute infrastructure to build very large neural networks, it was a controversial step. One very senior person pulled me aside and warned me that starting Google Brain would be bad for my career. I think he felt that the action couldn’t just be in scaling up, and that I should instead focus on architecture innovation.

“In many industries where giant data sets simply don’t exist, I think the focus has to shift from big data to good data. Having 50 thoughtfully engineered examples can be sufficient to explain to the neural network what you want it to learn.”
—Andrew Ng, CEO & Founder, Landing AI

I remember when my students and I published the first NeurIPS workshop paper advocating using CUDA, a platform for processing on GPUs, for deep learning—a different senior person in AI sat me down and said, “CUDA is really complicated to program. As a programming paradigm, this seems like too much work.” I did manage to convince him; the other person I did not convince.

I expect they’re both convinced now.

Ng: I think so, yes.

Over the past year as I’ve been speaking to people about the data-centric AI movement, I’ve been getting flashbacks to when I was speaking to people about deep learning and scalability 10 or 15 years ago. In the past year, I’ve been getting the same mix of “there’s nothing new here” and “this seems like the wrong direction.”

How do you define data-centric AI, and why do you consider it a movement?

Ng: Data-centric AI is the discipline of systematically engineering the data needed to successfully build an AI system. For an AI system, you have to implement some algorithm, say a neural network, in code and then train it on your data set. The dominant paradigm over the last decade was to download the data set while you focus on improving the code. Thanks to that paradigm, over the last decade deep learning networks have improved significantly, to the point where for a lot of applications the code—the neural network architecture—is basically a solved problem. So for many practical applications, it’s now more productive to hold the neural network architecture fixed, and instead find ways to improve the data.

When I started speaking about this, there were many practitioners who, completely appropriately, raised their hands and said, “Yes, we’ve been doing this for 20 years.” This is the time to take the things that some individuals have been doing intuitively and make it a systematic engineering discipline.

The data-centric AI movement is much bigger than one company or group of researchers. My collaborators and I organized a data-centric AI workshop at NeurIPS, and I was really delighted at the number of authors and presenters that showed up.

You often talk about companies or institutions that have only a small amount of data to work with. How can data-centric AI help them?

Ng: You hear a lot about vision systems built with millions of images—I once built a face recognition system using 350 million images. Architectures built for hundreds of millions of images don’t work with only 50 images. But it turns out, if you have 50 really good examples, you can build something valuable, like a defect-inspection system. In many industries where giant data sets simply don’t exist, I think the focus has to shift from big data to good data. Having 50 thoughtfully engineered examples can be sufficient to explain to the neural network what you want it to learn.

When you talk about training a model with just 50 images, does that really mean you’re taking an existing model that was trained on a very large data set and fine-tuning it? Or do you mean a brand new model that’s designed to learn only from that small data set?

Ng: Let me describe what Landing AI does. When doing visual inspection for manufacturers, we often use our own flavor of RetinaNet. It is a pretrained model. Having said that, the pretraining is a small piece of the puzzle. What’s a bigger piece of the puzzle is providing tools that enable the manufacturer to pick the right set of images [to use for fine-tuning] and label them in a consistent way. There’s a very practical problem we’ve seen spanning vision, NLP, and speech, where even human annotators don’t agree on the appropriate label. For big data applications, the common response has been: If the data is noisy, let’s just get a lot of data and the algorithm will average over it. But if you can develop tools that flag where the data’s inconsistent and give you a very targeted way to improve the consistency of the data, that turns out to be a more efficient way to get a high-performing system.

“Collecting more data often helps, but if you try to collect more data for everything, that can be a very expensive activity.”
—Andrew Ng

For example, if you have 10,000 images where 30 images are of one class, and those 30 images are labeled inconsistently, one of the things we do is build tools to draw your attention to the subset of data that’s inconsistent. So you can very quickly relabel those images to be more consistent, and this leads to improvement in performance.

Could this focus on high-quality data help with bias in data sets? If you’re able to curate the data more before training?

Ng: Very much so. Many researchers have pointed out that biased data is one factor among many leading to biased systems. There have been many thoughtful efforts to engineer the data. At the NeurIPS workshop, Olga Russakovsky gave a really nice talk on this. At the main NeurIPS conference, I also really enjoyed Mary Gray’s presentation, which touched on how data-centric AI is one piece of the solution, but not the entire solution. New tools like Datasheets for Datasets also seem like an important piece of the puzzle.

One of the powerful tools that data-centric AI gives us is the ability to engineer a subset of the data. Imagine training a machine-learning system and finding that its performance is okay for most of the data set, but its performance is biased for just a subset of the data. If you try to change the whole neural network architecture to improve the performance on just that subset, it’s quite difficult. But if you can engineer a subset of the data you can address the problem in a much more targeted way.

When you talk about engineering the data, what do you mean exactly?

Ng: In AI, data cleaning is important, but the way the data has been cleaned has often been in very manual ways. In computer vision, someone may visualize images through a Jupyter notebook and maybe spot the problem, and maybe fix it. But I’m excited about tools that allow you to have a very large data set, tools that draw your attention quickly and efficiently to the subset of data where, say, the labels are noisy. Or to quickly bring your attention to the one class among 100 classes where it would benefit you to collect more data. Collecting more data often helps, but if you try to collect more data for everything, that can be a very expensive activity.

For example, I once figured out that a speech-recognition system was performing poorly when there was car noise in the background. Knowing that allowed me to collect more data with car noise in the background, rather than trying to collect more data for everything, which would have been expensive and slow.

What about using synthetic data, is that often a good solution?

Ng: I think synthetic data is an important tool in the tool chest of data-centric AI. At the NeurIPS workshop, Anima Anandkumar gave a great talk that touched on synthetic data. I think there are important uses of synthetic data that go beyond just being a preprocessing step for increasing the data set for a learning algorithm. I’d love to see more tools to let developers use synthetic data generation as part of the closed loop of iterative machine learning development.

Do you mean that synthetic data would allow you to try the model on more data sets?

Ng: Not really. Here’s an example. Let’s say you’re trying to detect defects in a smartphone casing. There are many different types of defects on smartphones. It could be a scratch, a dent, pit marks, discoloration of the material, other types of blemishes. If you train the model and then find through error analysis that it’s doing well overall but it’s performing poorly on pit marks, then synthetic data generation allows you to address the problem in a more targeted way. You could generate more data just for the pit-mark category.

“In the consumer software Internet, we could train a handful of machine-learning models to serve a billion users. In manufacturing, you might have 10,000 manufacturers building 10,000 custom AI models.”
—Andrew Ng

Synthetic data generation is a very powerful tool, but there are many simpler tools that I will often try first. Such as data augmentation, improving labeling consistency, or just asking a factory to collect more data.

To make these issues more concrete, can you walk me through an example? When a company approaches Landing AI and says it has a problem with visual inspection, how do you onboard them and work toward deployment?

Ng: When a customer approaches us we usually have a conversation about their inspection problem and look at a few images to verify that the problem is feasible with computer vision. Assuming it is, we ask them to upload the data to the LandingLens platform. We often advise them on the methodology of data-centric AI and help them label the data.

One of the foci of Landing AI is to empower manufacturing companies to do the machine learning work themselves. A lot of our work is making sure the software is fast and easy to use. Through the iterative process of machine learning development, we advise customers on things like how to train models on the platform, when and how to improve the labeling of data so the performance of the model improves. Our training and software supports them all the way through deploying the trained model to an edge device in the factory.

How do you deal with changing needs? If products change or lighting conditions change in the factory, can the model keep up?

Ng: It varies by manufacturer. There is data drift in many contexts. But there are some manufacturers that have been running the same manufacturing line for 20 years now with few changes, so they don’t expect changes in the next five years. Those stable environments make things easier. For other manufacturers, we provide tools to flag when there’s a significant data-drift issue. I find it really important to empower manufacturing customers to correct data, retrain, and update the model. Because if something changes and it’s 3 a.m. in the United States, I want them to be able to adapt their learning algorithm right away to maintain operations.

In the consumer software Internet, we could train a handful of machine-learning models to serve a billion users. In manufacturing, you might have 10,000 manufacturers building 10,000 custom AI models. The challenge is, how do you do that without Landing AI having to hire 10,000 machine learning specialists?

So you’re saying that to make it scale, you have to empower customers to do a lot of the training and other work.

Ng: Yes, exactly! This is an industry-wide problem in AI, not just in manufacturing. Look at health care. Every hospital has its own slightly different format for electronic health records. How can every hospital train its own custom AI model? Expecting every hospital’s IT personnel to invent new neural-network architectures is unrealistic. The only way out of this dilemma is to build tools that empower the customers to build their own models by giving them tools to engineer the data and express their domain knowledge. That’s what Landing AI is executing in computer vision, and the field of AI needs other teams to execute this in other domains.

Is there anything else you think it’s important for people to understand about the work you’re doing or the data-centric AI movement?

Ng: In the last decade, the biggest shift in AI was a shift to deep learning. I think it’s quite possible that in this decade the biggest shift will be to data-centric AI. With the maturity of today’s neural network architectures, I think for a lot of the practical applications the bottleneck will be whether we can efficiently get the data we need to develop systems that work well. The data-centric AI movement has tremendous energy and momentum across the whole community. I hope more researchers and developers will jump in and work on it.

This article appears in the April 2022 print issue as “Andrew Ng, AI Minimalist.”