
Why we need an AI safety hotline

16 September 2024 at 11:00

In the past couple of years, regulators have been caught off guard again and again as tech companies compete to launch ever more advanced AI models. It’s only a matter of time before labs release another round of models that pose new regulatory challenges. We’re likely just weeks away, for example, from OpenAI’s release of GPT-5, which promises to push AI capabilities further than ever before. As it stands, it seems there’s little anyone can do to delay or prevent the release of a model that poses excessive risks.

Testing AI models before they’re released is a common approach to mitigating certain risks, and it may help regulators weigh up the costs and benefits—and potentially block models from being released if they’re deemed too dangerous. But the accuracy and comprehensiveness of these tests leave a lot to be desired. AI models may “sandbag” the evaluation—hiding some of their capabilities to avoid raising any safety concerns. The evaluations also suffer from limited scope: current tests are unlikely to reliably uncover the full set of risks that warrant further investigation. There’s also the question of who conducts the evaluations and how their biases may influence testing efforts. For those reasons, evaluations need to be used alongside other governance tools.

One such tool could be internal reporting mechanisms within the labs. Ideally, employees should feel empowered to regularly and fully share their AI safety concerns with their colleagues, and they should feel those colleagues can then be counted on to act on the concerns. However, there’s growing evidence that, far from being promoted, open criticism is becoming rarer in AI labs. Just three months ago, 13 former and current workers from OpenAI and other labs penned an open letter expressing fear of retaliation if they attempt to disclose questionable corporate behaviors that fall short of breaking the law. 

How to sound the alarm

In theory, external whistleblower protections could play a valuable role in the detection of AI risks. These could protect employees fired for disclosing unsafe or illegal corporate actions, and they could help make up for inadequate internal reporting mechanisms. Nearly every state has a public policy exception to at-will employment termination—in other words, terminated employees can seek recourse against their employers if they were retaliated against for calling out unsafe or illegal corporate practices. However, in practice this exception offers employees few assurances. Judges tend to favor employers in whistleblower cases. The likelihood of AI labs’ surviving such suits seems particularly high given that society has yet to reach any sort of consensus as to what qualifies as unsafe AI development and deployment.

These and other shortcomings explain why the aforementioned 13 AI workers, including ex-OpenAI employee William Saunders, called for a novel “right to warn.” Companies would have to offer employees an anonymous process for disclosing risk-related concerns to the lab’s board, a regulatory authority, and an independent third body made up of subject-matter experts. The ins and outs of this process have yet to be figured out, but it would presumably be a formal, bureaucratic mechanism. The board, regulator, and third party would all need to make a record of the disclosure. It’s likely that each body would then initiate some sort of investigation. Subsequent meetings and hearings also seem like a necessary part of the process. Yet if Saunders is to be taken at his word, what AI workers really want is something different. 

When Saunders went on the Big Technology Podcast to outline his ideal process for sharing safety concerns, his focus was not on formal avenues for reporting established risks. Instead, he indicated a desire for some intermediate, informal step. He wants a chance to receive neutral, expert feedback on whether a safety concern is substantial enough to go through a “high stakes” process such as a right-to-warn system. Current government regulators, as Saunders says, could not serve that role. 

For one thing, they likely lack the expertise to help an AI worker think through safety concerns. What’s more, few workers will pick up the phone if they know it’s a government official on the other end—that sort of call may be “very intimidating,” as Saunders himself said on the podcast. Instead, he envisages being able to call an expert to discuss his concerns. In an ideal scenario, he’d be told that the risk in question does not seem that severe or likely to materialize, freeing him up to return to whatever he was doing with more peace of mind. 

Lowering the stakes

What Saunders is asking for in this podcast isn’t a right to warn, then, as that suggests the employee is already convinced there’s unsafe or illegal activity afoot. What he’s really calling for is a gut check—an opportunity to verify whether a suspicion of unsafe or illegal behavior seems warranted. The stakes would be much lower, so the regulatory response could be lighter. The third party responsible for weighing up these gut checks could be a much more informal one. For example, AI PhD students, retired AI industry workers, and other individuals with AI expertise could volunteer for an AI safety hotline. They could be tasked with quickly and expertly discussing safety matters with employees via a confidential and anonymous phone conversation. Hotline volunteers would have familiarity with leading safety practices, as well as extensive knowledge of what options, such as right-to-warn mechanisms, may be available to the employee. 

As Saunders indicated, few employees will likely want to go from 0 to 100 with their safety concerns—straight from colleagues to the board or even a government body. They are much more likely to raise their issues if an intermediate, informal step is available.

Studying examples elsewhere

The details of how precisely an AI safety hotline would work deserve more debate among AI community members, regulators, and civil society. For the hotline to realize its full potential, for instance, it may need some way to escalate the most urgent, verified reports to the appropriate authorities. How to ensure the confidentiality of hotline conversations is another matter that needs thorough investigation. How to recruit and retain volunteers is another key question. Given leading experts’ broad concern about AI risk, some may be willing to participate simply out of a desire to lend a hand. Should too few folks step forward, other incentives may be necessary. The essential first step, though, is acknowledging this missing piece in the puzzle of AI safety regulation. The next step is looking for models to emulate in building out the first AI hotline. 

One place to start is with ombudspersons. Other industries have recognized the value of identifying these neutral, independent individuals as resources for evaluating the seriousness of employee concerns. Ombudspersons exist in academia, nonprofits, and the private sector. The distinguishing attribute of these individuals and their staffers is neutrality—they have no incentive to favor one side or the other, and thus they’re more likely to be trusted by all. A glance at the use of ombudspersons in the federal government shows that when they are available, issues may be raised and resolved sooner than they would be otherwise.

This concept is relatively new. The US Department of Commerce established the first federal ombudsman in 1971. The office was tasked with helping citizens resolve disputes with the agency and investigating agency actions. Other agencies, including the Social Security Administration and the Internal Revenue Service, soon followed suit. A retrospective review of these early efforts concluded that effective ombudspersons can meaningfully improve citizen-government relations. On the whole, ombudspersons were associated with an uptick in voluntary compliance with regulations and cooperation with the government.

An AI ombudsperson or safety hotline would surely have different tasks and staff from an ombudsperson in a federal agency. Nevertheless, the general concept is worthy of study by those advocating safeguards in the AI industry. 

A right to warn may play a role in getting AI safety concerns aired, but we need to set up more intermediate, informal steps as well. An AI safety hotline is low-hanging regulatory fruit. A pilot made up of volunteers could be organized in relatively short order and provide an immediate outlet for those, like Saunders, who merely want a sounding board.

Kevin Frazier is an assistant professor at St. Thomas University College of Law and senior research fellow in the Constitutional Studies Program at the University of Texas at Austin.


Amazon's Secret Weapon in Chip Design Is Amazon



Big-name makers of processors, especially those geared toward cloud-based AI, such as AMD and Nvidia, have been showing signs of wanting to own more of the business of computing, purchasing makers of software, interconnects, and servers. The hope is that control of the “full stack” will give them an edge in designing what their customers want.

Amazon Web Services (AWS) got there ahead of most of the competition when it purchased chip designer Annapurna Labs in 2015 and proceeded to design CPUs, AI accelerators, servers, and data centers as a vertically integrated operation. Ali Saidi, the technical lead for the Graviton series of CPUs, and Rami Sinno, director of engineering at Annapurna Labs, explained the advantages of vertically integrated design and Amazon scale, and showed IEEE Spectrum around the company’s hardware testing labs in Austin, Tex., on 27 August.


What brought you to Amazon Web Services, Rami?


Rami Sinno: Amazon is my first vertically integrated company. And that was on purpose. I was working at Arm, and I was looking for the next adventure, looking at where the industry is heading and what I want my legacy to be. I looked at two things:

One is vertically integrated companies, because this is where most of the innovation is—the interesting stuff is happening when you control the full hardware and software stack and deliver directly to customers.

And the second thing is, I realized that machine learning, AI in general, is going to be very, very big. I didn’t know exactly which direction it was going to take, but I knew that there is something that is going to be generational, and I wanted to be part of that. I already had that experience prior when I was part of the group that was building the chips that go into the Blackberries; that was a fundamental shift in the industry. That feeling was incredible, to be part of something so big, so fundamental. And I thought, “Okay, I have another chance to be part of something fundamental.”

Does working at a vertically-integrated company require a different kind of chip design engineer?

Sinno: Absolutely. When I hire people, the interview process is going after people that have that mindset. Let me give you a specific example: Say I need a signal integrity engineer. (Signal integrity makes sure a signal going from point A to point B, wherever it is in the system, makes it there correctly.) Typically, you hire signal integrity engineers that have a lot of experience in analysis for signal integrity, that understand layout impacts, can do measurements in the lab. Well, this is not sufficient for our group, because we want our signal integrity engineers also to be coders. We want them to be able to take a workload or a test that will run at the system level and be able to modify it or build a new one from scratch in order to look at the signal integrity impact at the system level under workload. This is where being trained to be flexible, to think outside of the little box has paid off huge dividends in the way that we do development and the way we serve our customers.

“By the time that we get the silicon back, the software’s done” —Ali Saidi, Annapurna Labs

At the end of the day, our responsibility is to deliver complete servers in the data center directly for our customers. And if you think from that perspective, you’ll be able to optimize and innovate across the full stack. A design engineer or a test engineer should be able to look at the full picture because that’s his or her job: deliver the complete server to the data center and look where best to do optimization. It might not be at the transistor level or at the substrate level or at the board level. It could be something completely different. It could be purely software. And having that knowledge, having that visibility, will allow the engineers to be significantly more productive and deliver to the customer significantly faster. We’re not going to bang our head against the wall to optimize the transistor where three lines of code downstream will solve these problems, right?

Do you feel like people are trained in that way these days?

Sinno: We’ve had very good luck with recent college grads. Recent college grads, especially the past couple of years, have been absolutely phenomenal. I’m very, very pleased with the way that the education system is graduating the engineers and the computer scientists that are interested in the type of jobs that we have for them.

The other place that we have been super successful in finding the right people is at startups. They know what it takes, because at a startup, by definition, you have to do so many different things. People who’ve done startups before completely understand the culture and the mindset that we have at Amazon.


What brought you to AWS, Ali?


Ali Saidi: I’ve been here about seven and a half years. When I joined AWS, I joined a secret project at the time. I was told: “We’re going to build some Arm servers. Tell no one.”

We started with Graviton 1. Graviton 1 was really the vehicle for us to prove that we could offer the same experience in AWS with a different architecture.

The cloud gave us the ability for a customer to try it in a very low-cost, low-barrier-of-entry way and say, “Does it work for my workload?” So Graviton 1 was really just the vehicle to demonstrate that we could do this, and to start signaling to the world that we want software around Arm servers to grow and that they’re going to be more relevant.

Graviton 2—announced in 2019—was kind of our first… what we think is a market-leading device that’s targeting general-purpose workloads, web servers, and those types of things.

It’s done very well. We have people running databases, web servers, key-value stores, lots of applications... When customers adopt Graviton, they bring one workload, and they see the benefits of bringing that one workload. And then the next question they ask is, “Well, I want to bring some more workloads. What should I bring?” There were some where it wasn’t powerful enough effectively, particularly around things like media encoding, taking videos and encoding them or re-encoding them or encoding them to multiple streams. It’s a very math-heavy operation and required more [single-instruction multiple data] bandwidth. We needed cores that could do more math.

We also wanted to enable the [high-performance computing] market. So we have an instance type called HPC 7G where we’ve got customers like Formula One. They do computational fluid dynamics of how this car is going to disturb the air and how that affects following cars. It’s really just expanding the portfolio of applications. We did the same thing when we went to Graviton 4, which has 96 cores versus Graviton 3’s 64.


How do you know what to improve from one generation to the next?

Saidi: Far and wide, most customers find great success when they adopt Graviton. Occasionally, they see performance that isn’t the same level as their other migrations. They might say “I moved these three apps, and I got 20 percent higher performance; that’s great. But I moved this app over here, and I didn’t get any performance improvement. Why?” It’s really great to see the 20 percent. But for me, in the kind of weird way I am, the 0 percent is actually more interesting, because it gives us something to go and explore with them.

Most of our customers are very open to those kinds of engagements. So we can understand what their application is and build some kind of proxy for it. Or if it’s an internal workload, then we could just use the original software. And then we can use that to kind of close the loop and work on what the next generation of Graviton will have and how we’re going to enable better performance there.

What’s different about designing chips at AWS?

Saidi: In chip design, there are many different competing optimization points. You have all of these conflicting requirements, you have cost, you have scheduling, you’ve got power consumption, you’ve got size, what DRAM technologies are available and when you’re going to intersect them… It ends up being this fun, multifaceted optimization problem to figure out what’s the best thing that you can build in a timeframe. And you need to get it right.

One thing that we’ve done very well is taking our initial silicon to production.

How?

Saidi: This might sound weird, but I’ve seen other places where the software and the hardware people effectively don’t talk. The hardware and software people in Annapurna and AWS work together from day one. The software people are writing the software that will ultimately be the production software and firmware while the hardware is being developed in cooperation with the hardware engineers. By working together, we’re closing that iteration loop. When you are carrying the piece of hardware over to the software engineer’s desk, your iteration loop is years and years. Here, we are iterating constantly. We’re running virtual machines in our emulators before we have the silicon ready. We are taking an emulation of [a complete system] and running most of the software we’re going to run.

So by the time that we get the silicon back [from the foundry], the software’s done. And we’ve seen most of the software work at this point. So we have very high confidence that it’s going to work.

The other piece of it, I think, is just being absolutely laser-focused on what we are going to deliver. You get a lot of ideas, but your design resources are approximately fixed. No matter how many ideas I put in the bucket, I’m not going to be able to hire that many more people, and my budget’s probably fixed. So every idea I throw in the bucket is going to use some resources. And if that feature isn’t really important to the success of the project, I’m risking the rest of the project. And I think that’s a mistake that people frequently make.

Are those decisions easier in a vertically integrated situation?

Saidi: Certainly. We know we’re going to build a motherboard and a server and put it in a rack, and we know what that looks like… So we know the features we need. We’re not trying to build a superset product that could allow us to go into multiple markets. We’re laser-focused into one.

What else is unique about the AWS chip design environment?

Saidi: One thing that’s very interesting for AWS is that we’re the cloud and we’re also developing these chips in the cloud. We were the first company to really push on running [electronic design automation (EDA)] in the cloud. We changed the model from “I’ve got 80 servers and this is what I use for EDA” to “Today, I have 80 servers. If I want, tomorrow I can have 300. The next day, I can have 1,000.”

We can compress some of the time by varying the resources that we use. At the beginning of the project, we don’t need as many resources. We can turn a lot of stuff off and not pay for it effectively. As we get to the end of the project, now we need many more resources. And instead of saying, “Well, I can’t iterate this fast, because I’ve got this one machine, and it’s busy,” I can change that and instead say, “Well, I don’t want one machine; I’ll have 10 machines today.”

Instead of my iteration cycle for a big design like this being two days, or even one day, with these 10 machines I can bring it down to three or four hours. That’s huge.

How important is Amazon.com as a customer?

Saidi: They have a wealth of workloads, and we obviously are the same company, so we have access to some of those workloads in ways that with third parties, we don’t. But we also have very close relationships with other external customers.

So last Prime Day, we said that 2,600 Amazon.com services were running on Graviton processors. This Prime Day, that number more than doubled to 5,800 services running on Graviton. And the retail side of Amazon used over 250,000 Graviton CPUs in support of the retail website and the services around that for Prime Day.


The AI accelerator team is colocated with the labs that test everything from chips through racks of servers. Why?

Sinno: So Annapurna Labs has multiple labs in multiple locations as well. This location here in Austin is one of the smaller labs. But what’s so interesting about the lab here in Austin is that you have all of the hardware and many software development engineers for machine learning servers and for Trainium and Inferentia [AWS’s AI chips] effectively co-located on this floor. For hardware developers, engineers, having the labs co-located on the same floor has been very, very effective. It speeds execution and iteration for delivery to the customers. This lab is set up to be self-sufficient with anything that we need to do, at the chip level, at the server level, at the board level. Because again, as I convey to our teams, our job is not the chip; our job is not the board; our job is the full server to the customer.

How does vertical integration help you design and test chips for data-center-scale deployment?

Sinno: It’s relatively easy to create a bar-raising server. Something that’s very high-performance, very low-power. If we create 10 of them, 100 of them, maybe 1,000 of them, it’s easy. You can cherry pick this, you can fix this, you can fix that. But the scale that AWS is at is significantly higher. We need to train models that require 100,000 of these chips. 100,000! And for training, it’s not run in five minutes. It’s run in hours or days or weeks even. Those 100,000 chips have to be up for the duration. Everything that we do here is to get to that point.

We start from a “what are all the things that can go wrong?” mindset. And we implement all the things that we know. But when you’re talking about cloud scale, there are always things that you have not thought of that come up. These are the 0.001-percent type issues.

In this case, we do the debug first in the fleet. And in certain cases, we have to do debugs in the lab to find the root cause. And if we can fix it immediately, we fix it immediately. Being vertically integrated, in many cases we can do a software fix for it. We use our agility to rush a fix while at the same time making sure that the next generation has it already figured out from the get go.


How Is Axim Collaborative Spending $800 Million From the Sale of EdX?

13 September 2024 at 12:00

One of the country’s richest nonprofits focused on online education has been giving out grants for more than a year. But so far, the group, known as Axim Collaborative, has done so slowly — and pretty quietly.

“There has been little buzz about them in digital learning circles,” says Russ Poulin, executive director of WCET, a nonprofit focused on digital learning in higher education. “They are not absent from the conversation, but their name is not raised very often.”

Late last month, an article in the online course review site Class Central put it more starkly, calling the promise of the nonprofit “hollow.” The op-ed, by longtime online education watcher Dhawal Shah, noted that according to the group’s most recent tax return, Axim is sitting on $735 million and had expenses of just $9 million in tax year 2023, with $15 million in revenue from investment income. “Instead of being an innovator, Axim Collaborative seems to be a non-entity in the edtech space, its promises of innovation and equity advancement largely unfulfilled,” Shah wrote.

The group was formed with the money made when Harvard University and MIT sold their edX online platform to for-profit company 2U in 2021 for about $800 million. At the time many online learning leaders criticized the move, since edX had long touted its nonprofit status as differentiating it from competitors like Coursera. The purchase did not end up working out as planned for 2U, which this summer filed for bankruptcy.

So what is Axim investing in? And what are its future plans?

EdSurge reached out to Axim’s CEO, Stephanie Khurana, to get an update.

Not surprisingly, she pushed back on the idea that the group is not doing much.

“We’ve launched 18 partnerships over the past year,” she says, noting that many grants Axim has awarded were issued since its most recent tax return was filed. “It’s a start, and it’s seeding a lot of innovations. And that to me is very powerful.”

One of the projects she says she is most proud of is Axim’s work with HBCUv, a collaboration by several historically Black colleges to create a shared technology platform and framework to share online courses across their campuses. While money was part of that, Khurana says she is also proud of the work her group did helping set up a course-sharing framework. Axim also plans to help with “incorporating student success metrics in the platform itself,” she says, “so people can see where they might be able to support students with different kinds of advising and different kinds of student supports.”

The example embodies the group’s philosophy of trying to provide expertise and convening power, rather than just cash, to help promising ideas scale to support underserved learners in higher education.

Listening Tour

When EdSurge talked with Khurana last year, she stressed that her first step would be to listen and learn across the online learning community to see where the group could best make a difference.

One thing that struck her as she did that, she says, is “hearing what barriers students are facing, and what's keeping them from persisting through their programs and finding jobs that match with their skills and being able to actually realize better outcomes.”

Grant amounts the group has given out so far range from around $500,000 for what she called “demonstration projects” to as much as $3 million.

Artificial intelligence has emerged as a key focus of Axim’s work, though Khurana says the group is treading gingerly.

“We are looking very carefully at how and where AI is beneficial, and where it might be problematic, especially for these underserved learners,” she says. “And so trying to be clear-eyed about what those possibilities are, and then bring to bear the most promising opportunities for the students and institutions that we're supporting.”

One specific AI project the group has supported is a collaboration between Axim, Campus Evolve, University of Central Florida and Indiana Tech to explore research-based approaches to using AI to improve student advising. “They're developing an AI tool to have a student-facing approach to understanding, ‘What are my academic resources? What are career-based resources?,’” she says. “A lot of times those are hard to discern.”

Another key part of Axim’s work involves maintaining an existing system rather than starting new ones. The Axim Collaborative manages the Open edX platform, the open-source system that hosts edX courses and can also be used by any institution with the tech know-how and the computer servers to run it. The platform is used by thousands of colleges and organizations around the world, including a growing number of governments, which use it to offer online courses.

Anant Agarwal, who helped found edX and now works at 2U to coordinate its use, is also on a technical committee for Open edX.

He says the structure of supporting Open edX through Axim is modeled on the way the Linux open-source operating system is managed.

While edX continues to rely on the platform, the software is community-run. “There has to be somebody that maintains the repositories, maintains the release schedule and provides funding for certain projects,” Agarwal says. And that group is now Axim.

When the war in Ukraine broke out, Agarwal says, the country “turned on a dime and the universities and schools started offering courses on Open edX.”

Poulin, of WCET, says that it’s too early to say whether Axim’s model is working.

“While their profile and impact may not be great to this point, I am willing to give startups some runway time to determine if they will take off,” he says, noting that “Axim is, essentially, still a startup.”

His advice: “A creative, philanthropic organization should take some risks if they are working in the ‘innovation’ sphere. We learn as much from failures as successes.”

For Khurana, Axim’s CEO, the goal is not to find a magic answer to deep-seated problems facing higher education.

“I know some people want something that will be a silver bullet,” she says. “And I think it's just hard to come by in a space where there's a lot of different ways to solve problems. Starting with people on the ground who are doing the work — [with] humility — is probably one of the best ways to seed innovations and to start.”


Chatbots can persuade people to stop believing in conspiracy theories

12 September 2024 at 20:00

The internet has made it easier than ever before to encounter and spread conspiracy theories. And while some are harmless, others can be deeply damaging, sowing discord and even leading to unnecessary deaths.

Now, researchers believe they’ve uncovered a new tool for combating false conspiracy theories: AI chatbots. Researchers from MIT Sloan and Cornell University found that chatting about a conspiracy theory with a large language model (LLM) reduced people’s belief in it by about 20%—even among participants who claimed that their beliefs were important to their identity. The research is published today in the journal Science.

The findings could represent an important step forward in how we engage with and educate people who espouse such baseless theories, says Yunhao (Jerry) Zhang, a postdoctoral fellow affiliated with the Psychology of Technology Institute who studies AI’s impacts on society.

“They show that with the help of large language models, we can—I wouldn’t say solve it, but we can at least mitigate this problem,” he says. “It points out a way to make society better.” 

Few interventions have been proven to change conspiracy theorists’ minds, says Thomas Costello, a research affiliate at MIT Sloan and the lead author of the study. Part of what makes it so hard is that different people tend to latch on to different parts of a theory. This means that while presenting certain bits of factual evidence may work on one believer, there’s no guarantee that it’ll prove effective on another.

That’s where AI models come in, he says. “They have access to a ton of information across diverse topics, and they’ve been trained on the internet. Because of that, they have the ability to tailor factual counterarguments to particular conspiracy theories that people believe.”

The team tested its method by asking 2,190 crowdsourced workers to participate in text conversations with GPT-4 Turbo, OpenAI’s latest large language model.

Participants were asked to share details about a conspiracy theory they found credible, why they found it compelling, and any evidence they felt supported it. These answers were used to tailor responses from the chatbot, which the researchers had prompted to be as persuasive as possible.

Participants were also asked to indicate how confident they were that their conspiracy theory was true, on a scale from 0 (definitely false) to 100 (definitely true), and then rate how important the theory was to their understanding of the world. Afterwards, they entered into three rounds of conversation with the AI bot. The researchers chose three to make sure they could collect enough substantive dialogue.

After each conversation, participants were asked the same rating questions. The researchers followed up with all the participants 10 days after the experiment, and then two months later, to assess whether their views had changed following the conversation with the AI bot. On average, participants reported a 20% reduction in belief in their chosen conspiracy theory, suggesting that talking to the bot had fundamentally changed some people’s minds.
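
To make the headline figure concrete, here is a minimal sketch of how a roughly 20% relative reduction on that 0-to-100 scale might be computed. The ratings below are invented for illustration, and the paper’s exact analysis may differ.

```python
# Hypothetical before/after belief ratings on the study's 0 (definitely false)
# to 100 (definitely true) scale; these numbers are made up for illustration.
before = [80, 95, 70, 100, 55]
after = [64, 76, 56, 80, 44]

mean_before = sum(before) / len(before)  # 80.0
mean_after = sum(after) / len(after)     # 64.0

relative_reduction = (mean_before - mean_after) / mean_before * 100
print(f"Mean belief fell from {mean_before:.0f} to {mean_after:.0f} "
      f"(a {relative_reduction:.0f}% relative reduction)")
```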

“Even in a lab setting, 20% is a large effect on changing people’s beliefs,” says Zhang. “It might be weaker in the real world, but even 10% or 5% would still be very substantial.”

The authors sought to safeguard against AI models’ tendency to make up information—known as hallucinating—by employing a professional fact-checker to evaluate the accuracy of 128 claims the AI had made. Of these, 99.2% were found to be true, while 0.8% were deemed misleading. None were found to be completely false. 

One explanation for this high degree of accuracy is that a lot has been written about conspiracy theories on the internet, making them very well represented in the model’s training data, says David G. Rand, a professor at MIT Sloan who also worked on the project. The adaptable nature of GPT-4 Turbo means it could easily be connected to different platforms for users to interact with in the future, he adds.

“You could imagine just going to conspiracy forums and inviting people to do their own research by debating the chatbot,” he says. “Similarly, social media could be hooked up to LLMs to post corrective responses to people sharing conspiracy theories, or we could buy Google search ads against conspiracy-related search terms like ‘Deep State.’”

The research upended the authors’ preconceived notions about how receptive people were to solid evidence debunking not only conspiracy theories, but also other beliefs that are not rooted in good-quality information, says Gordon Pennycook, an associate professor at Cornell University who also worked on the project. 

“People were remarkably responsive to evidence. And that’s really important,” he says. “Evidence does matter.”

AI Conversations Help Conspiracy Theorists Change Their Views

12 September 2024 at 15:09
AI-powered conversations can reduce belief in conspiracy theories by 20%. Researchers found that AI provided tailored, fact-based rebuttals to participants' conspiracy claims, leading to a lasting change in their beliefs. In one out of four cases, participants disavowed the conspiracy entirely. The study suggests that AI has the potential to combat misinformation by engaging people directly and personally.

Google’s new tool lets large language models fact-check their responses

12 September 2024 at 15:00

As long as chatbots have been around, they have made things up. Such “hallucinations” are an inherent part of how AI models work. However, they’re a big problem for companies betting big on AI, like Google, because they make the responses these models generate unreliable.

Google is releasing a tool today to address the issue. Called DataGemma, it uses two methods to help large language models fact-check their responses against reliable data and cite their sources more transparently to users. 

The first of the two methods is called Retrieval-Interleaved Generation (RIG), which acts as a sort of fact-checker. If a user prompts the model with a question—like “Has the use of renewable energy sources increased in the world?”—the model will come up with a “first draft” answer. Then RIG identifies what portions of the draft answer could be checked against Google’s Data Commons, a massive repository of data and statistics from reliable sources like the United Nations or the Centers for Disease Control and Prevention. Next, it runs those checks and replaces any incorrect original guesses with correct facts. It also cites its sources to the user.
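
As described, RIG is essentially a draft-then-verify loop. The sketch below is a minimal Python illustration of that flow, not Google’s actual DataGemma code; the model call and the Data Commons lookup are stand-in stubs, and all figures are placeholders.

```python
import re

def generate_draft(prompt):
    # Stand-in for the LLM's "first draft" answer; a real system would call the model here.
    return "Renewable sources supplied about 30% of global electricity last year."

def lookup_data_commons(stat):
    # Stand-in for a Data Commons query; returns (verified_value, source) or None.
    # The value and source here are placeholders, not real statistics.
    return ("30.3%", "https://datacommons.org")

def retrieval_interleaved_generation(prompt):
    draft = generate_draft(prompt)
    checked, sources = draft, []
    # Find statistics in the draft that could be checked against Data Commons.
    for stat in re.findall(r"\d+(?:\.\d+)?%", draft):
        result = lookup_data_commons(stat)
        if result:
            verified_value, source = result
            # Replace the model's guess with the verified figure and record the source.
            checked = checked.replace(stat, verified_value, 1)
            sources.append(source)
    return checked, sources

answer, cited = retrieval_interleaved_generation(
    "Has the use of renewable energy sources increased in the world?")
print(answer)
print("Sources:", cited)
```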

The second method, which is commonly used in other large language models, is called Retrieval-Augmented Generation (RAG). Consider a prompt like “What progress has Pakistan made against global health goals?” In response, the model examines which data in the Data Commons could help it answer the question, such as information about access to safe drinking water, hepatitis B immunizations, and life expectancies. With those figures in hand, the model then builds its answer on top of the data and cites its sources.
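
RAG, as described here, reverses the order: fetch the relevant statistics first, then build the answer on top of them. Another rough Python sketch, again with the Data Commons query and the model call stubbed out and purely illustrative placeholder values.

```python
def fetch_statistics(question):
    # Stand-in for querying Data Commons for tables relevant to the question.
    # Values are illustrative placeholders, not real statistics.
    return {
        "access to safe drinking water (% of population)": 90.0,
        "hepatitis B immunization (% of one-year-olds)": 83.0,
        "life expectancy at birth (years)": 66.0,
    }

def generate_answer(question, statistics):
    # Stand-in for the LLM composing its answer grounded in the retrieved figures.
    lines = [f"- {name}: {value}" for name, value in statistics.items()]
    return f"{question}\nRelevant Data Commons figures:\n" + "\n".join(lines)

def retrieval_augmented_generation(question):
    stats = fetch_statistics(question)       # retrieve first
    return generate_answer(question, stats)  # then generate, citing the data

print(retrieval_augmented_generation(
    "What progress has Pakistan made against global health goals?"))
```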

“Our goal here was to use Data Commons to enhance the reasoning of LLMs by grounding them in real-world statistical data that you could source back to where you got it from,” says Prem Ramaswami, head of Data Commons at Google. Doing so, he says, will “create more trustable, reliable AI.”

It is only available to researchers for now, but Ramaswami says access could widen further after more testing. If it works as hoped, it could be a real boon for Google’s plan to embed AI deeper into its search engine.  

However, it comes with a host of caveats. First, the usefulness of the methods is limited by whether the relevant data is in the Data Commons, which is more of a data repository than an encyclopedia. It can tell you the GDP of Iran, but it’s unable to confirm the date of the First Battle of Fallujah or when Taylor Swift released her most recent single. In fact, Google’s researchers found that with about 75% of the test questions, the RIG method was unable to obtain any usable data from the Data Commons. And even if helpful data is indeed housed in the Data Commons, the model doesn’t always formulate the right questions to find it. 

Second, there is the question of accuracy. When testing the RAG method, researchers found that the model gave incorrect answers 6% to 20% of the time. Meanwhile, the RIG method pulled the correct stat from Data Commons only about 58% of the time (though that’s a big improvement over the 5% to 17% accuracy rate of Google’s large language models when they’re not pinging Data Commons). 

Ramaswami says DataGemma’s accuracy will improve as it gets trained on more and more data. The initial version has been trained on only about 700 questions, and fine-tuning the model required his team to manually check each individual fact it generated. To further improve the model, the team plans to increase that data set from hundreds of questions to millions.

To be more useful, robots need to become lazier

9 September 2024 at 17:12

Robots perceive the world around them very differently from the way humans do. 

When we walk down the street, we know what we need to pay attention to—passing cars, potential dangers, obstacles in our way—and what we don’t, like pedestrians walking in the distance. Robots, on the other hand, treat all the information they receive about their surroundings with equal importance. Driverless cars, for example, have to continuously analyze data about things around them whether or not they are relevant. This keeps drivers and pedestrians safe, but it draws on a lot of energy and computing power. What if there’s a way to cut that down by teaching robots what they should prioritize and what they can safely ignore?

That’s the principle underpinning “lazy robotics,” a field of study championed by René van de Molengraft, a professor at Eindhoven University of Technology in the Netherlands. He believes that teaching all kinds of robots to be “lazier” with their data could help pave the way for machines that are better at interacting with things in their real-world environments, including humans. Essentially, the more efficient a robot can be with information, the better.

Van de Molengraft’s lazy robotics is just one approach researchers and robotics companies are now taking as they train their robots to complete actions successfully, flexibly, and in the most efficient manner possible.

Teaching them to sift more intelligently through the data they gather, and to de-prioritize anything that’s safe to overlook, will help make them safer and more reliable—a long-standing goal of the robotics community.

Simplifying tasks in this way is necessary if robots are to become more widely adopted, says Van de Molengraft, because their current energy usage won’t scale—it would be prohibitively expensive and harmful to the environment. “I think that the best robot is a lazy robot,” he says. “They should be lazy by default, just like we are.”

Learning to be lazier

Van de Molengraft has hit upon a fun way to test these efforts out: teaching robots to play soccer. He recently led his university’s autonomous robot soccer team, Tech United, to victory at RoboCup, an annual international robotics and AI competition that tests robots’ skills on the soccer field. Soccer is a tough challenge for robots, because both scoring and blocking goals require quick, controlled movements, strategic decision-making, and coordination. 

Learning to focus and tune out distractions around them, much as the best human soccer players do, will make them not only more energy efficient (especially for robots powered by batteries) but more likely to make smarter decisions in dynamic, fast-moving situations.

Tech United’s robots used several “lazy” tactics to give them an edge over their opponents during the RoboCup. One approach involved creating a “world model” of a soccer pitch that identifies and maps out its layout and line markings—things that remain the same throughout the game. This frees the battery-powered robots from constantly scanning their surroundings, which would waste precious power. Each robot also shares what its camera is capturing with its four teammates, creating a broader view of the pitch to help keep track of the fast-moving ball. 
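
A minimal sketch of what such a shared world model might look like in code follows; this is a hypothetical structure, not Tech United’s actual software. The static pitch layout is mapped once, while sightings of the fast-moving ball are shared by the five robots and fused into a single estimate.

```python
from dataclasses import dataclass, field

@dataclass
class BallObservation:
    robot_id: int
    x: float           # pitch coordinates, in meters
    y: float
    confidence: float  # 0..1, how sure this robot is about the sighting

@dataclass
class SharedWorldModel:
    # Static layout, mapped once at the start; dimensions here are illustrative.
    pitch_length: float = 22.0
    pitch_width: float = 14.0
    ball_observations: list = field(default_factory=list)

    def share(self, obs: BallObservation):
        # Each robot broadcasts its own sighting to its teammates.
        self.ball_observations.append(obs)

    def fused_ball_estimate(self):
        # Confidence-weighted average of all teammates' sightings.
        total = sum(o.confidence for o in self.ball_observations)
        if total == 0:
            return None
        x = sum(o.x * o.confidence for o in self.ball_observations) / total
        y = sum(o.y * o.confidence for o in self.ball_observations) / total
        return (x, y)

world = SharedWorldModel()
world.share(BallObservation(robot_id=1, x=3.2, y=-1.0, confidence=0.9))
world.share(BallObservation(robot_id=2, x=3.5, y=-0.8, confidence=0.6))
print(world.fused_ball_estimate())
```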

Previously, the robots needed a precise, pre-coded trajectory to move around the pitch. Now Van de Molengraft and his team are experimenting with having them choose their own paths to a specified destination. This saves the energy needed to track a specific journey and helps the robots cope with obstacles they may encounter along the way.

The group also successfully taught the squad to execute “penetrating passes”—where a robot shoots toward an open region in the field and communicates to the best-positioned member of its team to receive it—and skills such as receiving or passing the ball within configurations such as triangles. Giving the robots access to world models built using data from the surrounding environment allows them to execute their skills anywhere on the pitch, instead of just in specific spots.

Beyond the soccer pitch

While soccer is a fun way to test how successful these robotics methods are, other researchers are also working on the problem of efficiency—and dealing with much higher stakes.

Making robots that work in warehouses better at prioritizing different data inputs is essential to ensuring that they can operate safely around humans and be relied upon to complete tasks, for example. If the machines can’t manage this, companies could end up with a delayed shipment, damaged goods, an injured human worker—or worse, says Chris Walti, the former head of Tesla’s robotics division. 

Walti left the company to set up his own firm after witnessing how challenging it was to get robots to simply move materials around. His startup, Mytra, designs fully autonomous machines that use computer vision and an AI reinforcement-learning system to give them awareness of other robots closest to them, and to help them reason and collaborate to complete tasks (like moving a broken pallet) in much more computationally efficient ways. 

The majority of mobile robots in warehouses today are controlled by a single central “brain” that dictates the paths they follow, meaning a robot has to wait for instructions before it can do anything. Not only is this approach difficult to scale, but it consumes a lot of central computing power and requires very dependable communication links.

Mytra believes it’s hit upon a significantly more efficient approach, which acknowledges that individual robots don’t really need to know what hundreds of other robots are doing on the other side of the warehouse. Its machine-learning system cuts down on this unnecessary data, and the computing power it would take to process it, by simulating the optimal route each robot can take through the warehouse to perform its task. This enables them to act much more autonomously. 
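
A minimal sketch of that idea, planning against only the robots nearby rather than the whole fleet’s state, might look like this; it is hypothetical illustration code, not Mytra’s system.

```python
import math

def nearby_robots(self_pos, all_robot_positions, radius=10.0):
    # Keep only robots within a local radius; distant robots are irrelevant to this plan.
    return [p for p in all_robot_positions if 0 < math.dist(self_pos, p) <= radius]

def plan_next_step(self_pos, goal, all_robot_positions):
    # Plan against the local neighborhood only, instead of the whole fleet's state.
    obstacles = nearby_robots(self_pos, all_robot_positions)
    # A real planner would route toward `goal` around these obstacles;
    # here we just show how much of the fleet can be ignored.
    return obstacles

fleet = [(1.0, 2.0), (3.0, 4.0), (80.0, 95.0), (120.0, 40.0)]
considered = plan_next_step((0.0, 0.0), (50.0, 50.0), fleet)
print(f"{len(considered)} of {len(fleet)} robots considered")
```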

“In the context of soccer, being efficient allows you to score more goals. In the context of manufacturing, being efficient is even more important because it means a system operates more reliably,” he says. “By providing robots with the ability to act and think autonomously and efficiently, you’re also optimizing the efficiency and the reliability of the broader operation.”

While simplifying the types of information that robots need to process is a major challenge, inroads are being made, says Daniel Polani, a professor from the University of Hertfordshire in the UK who specializes in replicating biological processes in artificial systems. He’s also a fan of the RoboCup challenge—in fact, he leads his university’s Bold Hearts robot soccer team, which made it to the second round of this year’s RoboCup’s humanoid league.

“Organisms try not to process information that they don’t need to because that processing is very expensive, in terms of metabolic energy,” he says. Polani is interested in applying these lessons from biology to the vast networks that power robots to make them more efficient with their information. Reducing the amount of information a robot is allowed to process will just make it weaker, depending on the nature of the task it’s been given, he says. Instead, robots should learn to use the data they have in more intelligent ways.

Simplifying software

Amazon, which has more than 750,000 robots, the largest such fleet in the world, is also interested in using AI to help them make smarter, safer, and more efficient decisions. Amazon’s robots mostly fall into two categories: mobile robots that move stock, and robotic arms designed to handle objects. The AI systems that power these machines collect millions of data points every day to help train them to complete their tasks. For example, they must learn which item to grasp and move from a pile, or how to safely avoid human warehouse workers. These processes require a lot of computing power, which the new techniques can help minimize.

Generally, robotic arms and similar “manipulation” robots use machine learning to figure out how to identify objects, for example. Then they follow hard-coded rules or algorithms to decide how to act. With generative AI, these same robots can predict the outcome of an action before even attempting it, so they can choose the action most likely to succeed or determine the best possible approach to grasping an object that needs to be moved. 
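
Here is a minimal sketch of that predict-before-acting pattern; the grasp names and scores are invented, and the scoring function is a stand-in for the learned outcome model described above.

```python
def predict_success(grasp):
    # Stand-in for a learned model that predicts the outcome of a grasp
    # before it is attempted; these scores are illustrative placeholders.
    illustrative_scores = {
        "top-down pinch": 0.62,
        "side grasp": 0.48,
        "suction from above": 0.91,
    }
    return illustrative_scores.get(grasp, 0.0)

def choose_grasp(candidates):
    # Select the action with the highest predicted chance of success,
    # rather than attempting grasps and reacting to failures.
    return max(candidates, key=predict_success)

candidates = ["top-down pinch", "side grasp", "suction from above"]
print("Chosen grasp:", choose_grasp(candidates))
```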

These learning systems are much more scalable than traditional methods of training robots, and the combination of generative AI and massive data sets helps streamline the sequencing of a task and cut out layers of unnecessary analysis. That’s where the savings in computing power come in. “We can simplify the software by asking the models to do more,” says Michael Wolf, a principal scientist at Amazon Robotics. “We are entering a phase where we’re fundamentally rethinking how we build autonomy for our robotic systems.”

Achieving more by doing less

This year’s RoboCup competition may be over, but Van de Molengraft isn’t resting on his laurels after his team’s resounding success. “There’s still a lot of computational activities going on in each of the robots that are not per se necessary at each moment in time,” he says. He’s already starting work on new ways to make his robotic team even lazier to gain an edge on its rivals next year.  

Although current robots are still nowhere near able to match the energy efficiency of humans, he’s optimistic that researchers will continue to make headway and that we’ll start to see a lot more lazy robots that are better at their jobs. But it won’t happen overnight. “Increasing our robots’ awareness and understanding so that they can better perform their tasks, be it football or any other task in basically any domain in human-built environments—that’s a continuous work in progress,” he says.

Will the "AI Scientist" Bring Anything to Science?



When an international team of researchers set out to create an “AI scientist” to handle the whole scientific process, they didn’t know how far they’d get. Would the system they created really be capable of generating interesting hypotheses, running experiments, evaluating the results, and writing up papers?

What they ended up with, says researcher Cong Lu, was an AI tool that they judged equivalent to an early Ph.D. student. It had “some surprisingly creative ideas,” he says, but those good ideas were vastly outnumbered by bad ones. It struggled to write up its results coherently, and sometimes misunderstood its results: “It’s not that far from a Ph.D. student taking a wild guess at why something worked,” Lu says. And, perhaps like an early Ph.D. student who doesn’t yet understand ethics, it sometimes made things up in its papers, despite the researchers’ best efforts to keep it honest.

Lu, a postdoctoral research fellow at the University of British Columbia, collaborated on the project with several other academics, as well as with researchers from the buzzy Tokyo-based startup Sakana AI. The team recently posted a preprint about the work on the arXiv server. And while the preprint includes a discussion of limitations and ethical considerations, it also contains some rather grandiose language, billing the AI scientist as “the beginning of a new era in scientific discovery,” and “the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models (LLMs) to perform research independently and communicate their findings.”

The AI scientist seems to capture the zeitgeist. It’s riding the wave of enthusiasm for AI for science, but some critics think that wave will toss nothing of value onto the beach.

The “AI for Science” Craze

This research is part of a broader trend of AI for science. Google DeepMind arguably started the craze back in 2020 when it unveiled AlphaFold, an AI system that amazed biologists by predicting the 3D structures of proteins with unprecedented accuracy. Since generative AI came on the scene, many more big corporate players have gotten involved. Tarek Besold, a SonyAI senior research scientist who leads the company’s AI for scientific discovery program, says that AI for science is “a goal behind which the AI community can rally in an effort to advance the underlying technology but—even more importantly—also to help humanity in addressing some of the most pressing issues of our times.”

Yet the movement has its critics. Shortly after a 2023 Google DeepMind paper came out claiming the discovery of 2.2 million new crystal structures (“equivalent to nearly 800 years’ worth of knowledge”), two materials scientists analyzed a random sampling of the proposed structures and said that they found “scant evidence for compounds that fulfill the trifecta of novelty, credibility, and utility.” In other words, AI can generate a lot of results quickly, but those results may not actually be useful.

How the AI Scientist Works

In the case of the AI scientist, Lu and his collaborators tested their system only on computer science, asking it to investigate topics relating to large language models, which power chatbots like ChatGPT and also the AI scientist itself, and the diffusion models that power image generators like DALL-E.

The AI scientist’s first step is hypothesis generation. Given the code for the model it’s investigating, it freely generates ideas for experiments it could run to improve the model’s performance, and scores each idea on interestingness, novelty, and feasibility. It can iterate at this step, generating variations on the ideas with the highest scores. Then it runs a check in Semantic Scholar to see if its proposals are too similar to existing work. It next uses a coding assistant called Aider to run its code and take notes on the results in the format of an experiment journal. It can use those results to generate ideas for follow-up experiments.
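
In outline, the loop described above might look like the following simplified sketch; every function body here is a stand-in for a step the real system performs by prompting an LLM or driving the Aider coding assistant.

```python
# Hypothetical sketch of the AI scientist's loop; all steps are stubs for illustration.

def propose_ideas(model_code, n):
    # The real system asks an LLM to generate experiment ideas for the given code.
    return [f"idea {i}: tweak training configuration {i}" for i in range(n)]

def score_idea(idea):
    # The real system asks the LLM to rate interestingness, novelty, and feasibility.
    return {"interestingness": 8, "novelty": 7, "feasibility": 9}

def too_similar_to_prior_work(idea):
    # Stand-in for the Semantic Scholar similarity check.
    return False

def run_experiment(idea, model_code):
    # Stand-in for code written and executed via the coding assistant.
    return {"metric": 0.42}

def write_paper(journal):
    # The real system drafts the paper one section at a time from the journal.
    return f"Paper draft covering {len(journal)} experiments."

def run_ai_scientist(model_code, n_ideas=3, score_threshold=7):
    journal = []
    for idea in propose_ideas(model_code, n_ideas):    # hypothesis generation
        scores = score_idea(idea)
        if min(scores.values()) < score_threshold:     # keep only promising ideas
            continue
        if too_similar_to_prior_work(idea):            # novelty check
            continue
        results = run_experiment(idea, model_code)     # run and log the experiment
        journal.append({"idea": idea, "results": results})
    return write_paper(journal)                        # write-up from the journal

print(run_ai_scientist("model.py"))
```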

[Diagram] The AI scientist is an end-to-end scientific discovery tool powered by large language models. University of British Columbia

The next step is for the AI scientist to write up its results in a paper using a template based on conference guidelines. But, says Lu, the system has difficulty writing a coherent nine-page paper that explains its results—“the writing stage may be just as hard to get right as the experiment stage,” he says. So the researchers broke the process down into many steps: The AI scientist wrote one section at a time, and checked each section against the others to weed out both duplicated and contradictory information. It also goes through Semantic Scholar again to find citations and build a bibliography.

But then there’s the problem of hallucinations—the technical term for an AI making stuff up. Lu says that although they instructed the AI scientist to only use numbers from its experimental journal, “sometimes it still will disobey.” Lu says the model disobeyed less than 10 percent of the time, but “we think 10 percent is probably unacceptable.” He says they’re investigating a solution, such as instructing the system to link each number in its paper to the place it appeared in the experimental log. But the system also made less obvious errors of reasoning and comprehension, which seem harder to fix.
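
One way to implement the kind of check Lu describes, tying every number in the paper back to the experiment log, could look roughly like the following; it is a hypothetical sketch, not the team’s actual fix.

```python
import re

def unsupported_numbers(paper_text, experiment_log):
    # Flag any number in the paper that never appears in the experiment journal.
    logged = set(re.findall(r"-?\d+(?:\.\d+)?", experiment_log))
    return [n for n in re.findall(r"-?\d+(?:\.\d+)?", paper_text) if n not in logged]

log = "run 1: accuracy 0.71\nrun 2: accuracy 0.74"
paper = "Our method improves accuracy from 0.71 to 0.78."
print(unsupported_numbers(paper, log))  # ['0.78'] -> a figure the log cannot support
```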

And in a twist that you may not have seen coming, the AI scientist even contains a peer review module to evaluate the papers it has produced. “We always knew that we wanted some kind of automated [evaluation] just so we wouldn’t have to pore over all the manuscripts for hours,” Lu says. And while he notes that “there was always the concern that we’re grading our own homework,” he says they modeled their evaluator after the reviewer guidelines for the leading AI conference NeurIPS and found it to be harsher overall than human evaluators. Theoretically, the peer review function could be used to guide the next round of experiments.

Critiques of the AI Scientist

While the researchers confined their AI scientist to machine learning experiments, Lu says the team has had a few interesting conversations with scientists in other fields. In theory, he says, the AI scientist could help in any field where experiments can be run in simulation. “Some biologists have said there’s a lot of things that they can do in silico,” he says, also mentioning quantum computing and materials science as possible fields of endeavor.

Some critics of the AI for science movement might take issue with that broad optimism. Earlier this year, Jennifer Listgarten, a professor of computational biology at UC Berkeley, published a paper in Nature Biotechnology arguing that AI is not about to produce breakthroughs in multiple scientific domains. Unlike the AI fields of natural language processing and computer vision, she wrote, most scientific fields don’t have the vast quantities of publicly available data required to train models.

Two other researchers who study the practice of science, anthropologist Lisa Messeri of Yale University and psychologist M.J. Crockett of Princeton University, published a 2024 paper in Nature that sought to puncture the hype surrounding AI for science. When asked for a comment about this AI scientist, the two reiterated their concerns over treating “AI products as autonomous researchers.” They argue that doing so risks narrowing the scope of research to questions that are suited for AI, and losing out on the diversity of perspectives that fuels real innovation. “While the productivity promised by ‘the AI Scientist’ may sound appealing to some,” they tell IEEE Spectrum, “producing papers and producing knowledge are not the same, and forgetting this distinction risks that we produce more while understanding less.”

But others see the AI scientist as a step in the right direction. SonyAI’s Besold says he believes it’s a great example of how today’s AI can support scientific research when applied to the right domain and tasks. “This may become one of a handful of early prototypes that can help people conceptualize what is possible when AI is applied to the world of scientific discovery,” he says.

What’s Next for the AI Scientist

Lu says that the team plans to keep developing the AI scientist, and he says there’s plenty of low-hanging fruit as they seek to improve its performance. As for whether such AI tools will end up playing an important role in the scientific process, “I think time will tell what these models are good for,” Lu says. It might be, he says, that such tools are useful for the early scoping stages of a research project, when an investigator is trying to get a sense of the many possible research directions—although critics add that we’ll have to wait for future studies to see if these tools are really comprehensive and unbiased enough to be helpful.

Or, Lu says, if the models can be improved to the point that they match the performance of “a solid third-year Ph.D. student,” they could be a force multiplier for anyone trying to pursue an idea (at least, as long as the idea is in an AI-suitable domain). “At that point, anyone can be a professor and carry out a research agenda,” says Lu. “That’s the exciting prospect that I’m looking forward to.”

Unlocking the Brain’s “Neural Code” Could Lead to Superhuman AI

9 September 2024 at 15:11
Researchers believe that cracking the brain's "neural code" could lead to AI surpassing human intelligence in capacity and speed. This neural code refers to how the brain processes sensory information and performs cognitive tasks like learning and problem-solving.

Robot Deception: Some Lies Accepted, Others Rejected

6 September 2024 at 22:54
A new study examined how humans perceive different types of deception by robots, revealing that people accept some lies more than others. Researchers presented nearly 500 participants with scenarios where robots engaged in external, hidden, and superficial deceptions in medical, cleaning, and retail settings. Participants disapproved most of hidden deceptions, such as a cleaning robot secretly filming, while external lies, like sparing a patient from emotional pain, were viewed more favorably.

Cops lure pedophiles with AI pics of teen girl. Ethical triumph or new disaster?

6 September 2024 at 22:08

Cops are now using AI to generate images of fake kids, which are helping them catch child predators online, a lawsuit filed by the state of New Mexico against Snapchat revealed this week.

According to the complaint, the New Mexico Department of Justice launched an undercover investigation in recent months to prove that Snapchat "is a primary social media platform for sharing child sexual abuse material (CSAM)" and sextortion of minors, because its "algorithm serves up children to adult predators."

As part of their probe, an investigator "set up a decoy account for a 14-year-old girl, Sexy14Heather."

