Palmer Luckey on the Pentagon’s future of mixed reality

Palmer Luckey has, in some ways, come full circle. 

His first experience with virtual-reality headsets was as a teenage lab technician at a defense research center in Southern California, studying their potential to curb PTSD symptoms in veterans. He then built Oculus, sold it to Facebook for $2 billion, left Facebook after a highly public ousting, and founded Anduril, which focuses on drones, cruise missiles, and other AI-enhanced technologies for the US Department of Defense. The company is now valued at $14 billion.

Now Luckey is redirecting his energy again, to headsets for the military. In September, Anduril announced it would partner with Microsoft on the US Army’s Integrated Visual Augmentation System (IVAS), arguably the military’s largest effort to develop a headset for use on the battlefield. Luckey says the IVAS project is his top priority at Anduril.

“There is going to be a heads-up display on every soldier within a pretty short period of time,” he told MIT Technology Review in an interview last week on his work with the IVAS goggles. “The stuff that we’re building—it’s going to be a big part of that.”

Though few would bet against Luckey’s expertise in the realm of mixed reality, few observers share his optimism for the IVAS program. They view it, thus far, as an avalanche of failures. 

IVAS was first approved in 2018 as an effort to build state-of-the-art mixed-reality headsets for soldiers. In March 2021, Microsoft was awarded nearly $22 billion over 10 years to lead the project, but it quickly became mired in delays. Just a year later, a Pentagon audit criticized the program for not properly testing the goggles, saying its choices “could result in wasting up to $21.88 billion in taxpayer funds to field a system that soldiers may not want to use or use as intended.” The first two variants of the goggles—of which the army purchased 10,000 units—gave soldiers nausea, neck pain, and eye strain, according to internal documents obtained by Bloomberg. 

Such reports have left IVAS on a short leash with members of the Senate Armed Services Committee, which helps determine how much money should be spent on the program. In a subcommittee meeting in May, Senator Tom Cotton, an Arkansas Republican and ranking member, expressed frustration at IVAS’s slow pace and high costs, and in July the committee suggested a $200 million cut to the program. 

Meanwhile, Microsoft has for years been cutting investment in its HoloLens headset—the hardware on which the IVAS program is based—for lack of adoption. In June, Microsoft announced layoffs in its HoloLens teams, suggesting the project is now focused solely on serving the Department of Defense. The company received a serious blow in August, when reports revealed that the Army is considering reopening bidding for the contract to oust Microsoft entirely. 

This is the catastrophe that Luckey’s stepped into. Anduril’s contribution to the project will be Lattice, an AI-powered system that connects everything from drones to radar jammers to surveil, detect objects, and aid in decision-making. Lattice is increasingly becoming Anduril’s flagship offering. It’s a tool that allows soldiers to receive instantaneous information not only from Anduril’s hardware, but also from radars, vehicles, sensors, and other equipment not made by Anduril. Now it will be built into the IVAS goggles. “It’s not quite a hive mind, but it’s certainly a hive eye” is how Luckey described it to me. 

Anvil, seen here held by Luckey at Anduril’s Costa Mesa headquarters, integrates with the Lattice OS and can navigate autonomously to intercept hostile drones. (Photo: Philip Cheung)

Boosted by Lattice, the IVAS program aims to produce a headset that can help soldiers “rapidly identify potential threats and take decisive action” on the battlefield, according to the Army. If designed well, the device will automatically sort through countless pieces of information—drone locations, vehicles, intelligence—and flag the most important ones to the wearer in real time. 
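Anduril has not said how Lattice decides what a wearer actually sees, so the following is purely an illustrative sketch of the kind of real-time filtering the Army describes: score incoming detections, then surface only the top few in the heads-up display. Every name, field, and weighting below is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """A hypothetical fused detection from drones, radars, or other sensors."""
    label: str            # e.g. "small UAS", "vehicle"
    distance_m: float     # distance from the wearer, in meters
    hostile_score: float  # 0..1 confidence that the object is a threat

def top_threats(detections: list[Detection], limit: int = 3) -> list[Detection]:
    """Rank detections so only the most urgent reach the heads-up display.
    Closer, higher-confidence objects sort first (an illustrative heuristic only)."""
    ranked = sorted(
        detections,
        key=lambda d: d.hostile_score / max(d.distance_m, 1.0),
        reverse=True,
    )
    return ranked[:limit]

if __name__ == "__main__":
    feed = [
        Detection("small UAS", distance_m=400, hostile_score=0.9),
        Detection("vehicle", distance_m=1200, hostile_score=0.4),
        Detection("civilian drone", distance_m=900, hostile_score=0.1),
    ]
    for d in top_threats(feed):
        print(f"{d.label}: score {d.hostile_score:.1f} at {d.distance_m:.0f} m")
```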

Luckey defends the IVAS program’s bumps in the road as exactly what one should expect when developing mixed reality for defense. “None of these problems are anything that you would consider insurmountable,” he says. “It’s just a matter of if it’s going to be this year or a few years from now.” He adds that delaying a product is far better than releasing an inferior product, quoting Shigeru Miyamoto, the game director of Nintendo: “A delayed game is delayed only once, but a bad game is bad forever.”

He’s increasingly convinced that the military, not consumers, will be the most important testing ground for mixed-reality hardware: “You’re going to see an AR headset on every soldier, long before you see it on every civilian,” he says. In the consumer world, any headset company is competing with the ubiquity and ease of the smartphone, but he sees entirely different trade-offs in defense.

“The gains are so different when we talk about life-or-death scenarios. You don’t have to worry about things like ‘Oh, this is kind of dorky looking,’ or ‘Oh, you know, this is slightly heavier than I would prefer,’” he says. “Because the alternatives of, you know, getting killed or failing your mission are a lot less desirable.”

Those in charge of the IVAS program remain steadfast in the expectation that it will pay off with huge gains for those on the battlefield. “If it works,” James Rainey, commanding general of the Army Futures Command, told the Armed Services Committee in May, “it is a legitimate 10x upgrade to our most important formations.” That’s a big “if,” and one that currently depends on Microsoft’s ability to deliver. Luckey didn’t get specific when I asked if Anduril was positioning itself to bid to become IVAS’s primary contractor should the opportunity arise. 

If that happens, US troops may, willingly or not, become the most important test subjects for augmented- and virtual-reality technology as it is developed in the coming decades. The commercial sector doesn’t have thousands of individuals within a single institution who can test hardware in physically and mentally demanding situations and provide their feedback on how to improve it. 

That’s one of the ways selling to the defense sector is very different from selling to consumers, Luckey says: “You don’t actually have to convince every single soldier that they personally want to use it. You need to convince the people in charge of him, his commanding officer, and the people in charge of him that this is a thing that is worth wearing.” The iterations that eventually come from IVAS—if it keeps its funding—could signal what’s coming next for the commercial market. 

When I asked Luckey if there were lessons from Oculus he had to unlearn when working with the Department of Defense, he said there’s one: worrying about budgets. “I prided myself for years, you know—I’m the guy who’s figured out how to make VR accessible to the masses by being absolutely brutal at every part of the design process, trying to get costs down. That isn’t what the DOD wants,” he says. “They don’t want the cheapest headset in a vacuum. They want to save money, and generally, spending a bit more money on a headset that is more durable or that has better vision—and therefore allows you to complete a mission faster—is definitely worth the extra few hundred dollars.”

I asked if he’s impressed by the progress that’s been made during his eight-year hiatus from mixed reality. Since he left Facebook in 2017, Apple, Magic Leap, Meta, Snap, and a cascade of startups have been racing to move the technology from the fringe to the mainstream. Everything in mixed reality is about trade-offs, he says. Would you like more computing power, or a lighter and more comfortable headset? 

With more time at Meta, “I would have made different trade-offs in a way that I think would have led to greater adoption,” he says. “But of course, everyone thinks that.” While he’s impressed with the gains, “having been on the inside, I also feel like things could be moving faster.”

Years after leaving, Luckey remains noticeably annoyed by one specific decision he thinks Meta got wrong: not offloading the battery. Dwelling on technical details is unsurprising from someone who spent his formative years living in a trailer in his parents’ driveway posting in obscure forums and obsessing over goggle prototypes. He pontificated on the benefits of packing the heavy batteries and chips in removable pucks that the user could put in a pocket, rather than in the headset itself. Doing so makes the headset lighter and more comfortable. He says he was pushing Facebook to go that route before he was ousted, but when he left, it abandoned the idea. Apple chose to have an external battery for its Vision Pro, which Luckey praised. 

“Anyway,” he told me. “I’m still sore about it eight years later.”

Speaking of soreness, Luckey’s most public professional wound, his ouster from Facebook in 2017, was partially healed last month. The story—involving countless Twitter threads, doxxing, retractions and corrections to news articles, suppressed statements, and a significant segment in Blake Harris’s 2019 book The History of the Future—is difficult to boil down. But here’s the short version: A donation by Luckey to a pro-Trump group called Nimble America in late 2016 led to turmoil within Facebook after it was reported by the Daily Beast. That turmoil grew, especially after Ars Technica wrote that his donation was funding racist memes (the founders of Nimble America were involved in the subreddit r/The_Donald, but the organization itself was focused on creating pro-Trump billboards). Luckey left in March 2017, but Meta has never disclosed why. 

This April, Oculus’s former CTO John Carmack posted on X that he regretted not supporting Luckey more. Meta’s CTO, Andrew Bosworth, argued with Carmack, largely siding with Meta. In response, Luckey said, “You publicly told everyone my departure had nothing to do with politics, which is absolutely insane and obviously contradicted by reams of internal communications.” As the argument continued, Bosworth cautioned that there are “limits on what can be said here,” to which Luckey responded, “I am down to throw it all out there. We can make everything public and let people judge for themselves. Just say the word.” 

Six months later, Bosworth apologized to Luckey for the comments. Luckey responded, writing that although he is “infamously good at holding grudges,” neither Bosworth nor current leadership at Meta was involved in the incident. 

By now Luckey has spent years mulling over how much of his remaining anger is irrational or misplaced, but one thing is clear. He has a grudge left, but it’s against people behind the scenes—PR agents, lawyers, reporters—who, from his perspective, created a situation that forced him to accept and react to an account he found totally flawed. He’s angry about the steps Facebook took to keep him from communicating his side (Luckey has said he wrote versions of a statement at the time but that Facebook threatened further escalation if he posted it).

“What am I actually angry at? Am I angry that my life went in that direction? Absolutely,” he says.

“I have a lot more anger for the people who lied in a way that ruined my entire life and that saw my own company ripped out from under me that I’d spent my entire adult life building,” he says. “I’ve got plenty of anger left, but it’s not at Meta, the corporate entity. It’s not at Zuck. It’s not at Boz. Those are not the people who wronged me.”

While various subcommittees within the Senate and House deliberate how many millions to spend on IVAS each year, what is not in question is that the Pentagon is investing to prepare for a potential conflict in the Pacific between China and Taiwan. The Pentagon requested nearly $10 billion for the Pacific Deterrence Initiative in its latest budget. The prospect of such a conflict is something Luckey considers often. 

He told the authors of Unit X: How the Pentagon and Silicon Valley Are Transforming the Future of War that Anduril’s “entire internal road map” has been organized around the question “How do you deter China? Not just in Taiwan, but Taiwan and beyond?”

At this point, nothing about IVAS is geared specifically toward use in the South Pacific as opposed to Ukraine or anywhere else. The design is in early stages. According to transcripts of a Senate Armed Services Subcommittee meeting in May, the military was scheduled to receive the third iteration of IVAS goggles earlier this summer. If they were on schedule, they’re currently in testing. That version is likely to change dramatically before it approaches Luckey’s vision for the future of mixed-reality warfare, in which “you have a little bit of an AI guardian angel on your shoulder, helping you out and doing all the stuff that is easy to miss in the midst of battle.”

Designs for IVAS will have to adapt amid a shifting landscape of global conflict. (Photo: Philip Cheung)

But will soldiers ever trust such a “guardian angel”? If the goggles of the future rely on AI-powered software like Lattice to identify threats—say, an enemy drone ahead or an autonomous vehicle racing toward you—Anduril is making the promise that it can sort through the false positives, recognize threats with impeccable accuracy, and surface critical information when it counts most. 

Luckey says the real test is how the technology compares with the current abilities of humans. “In a lot of cases, it’s already better,” he says, referring to Lattice, as measured by Anduril’s internal tests (it has not released these, and they have not been assessed by any independent external experts). “People are fallible in ways that machines aren’t necessarily,” he adds.

Still, Luckey admits he does worry about the threats Lattice will miss.

“One of the things that really worries me is there’s going to be people who die because Lattice misunderstood something, or missed a threat to a soldier that it should have seen,” he says. “At the same time, I can recognize that it’s still doing far better than people are doing today.”

When Lattice makes a significant mistake, it’s unlikely the public will know. Asked about the balance between transparency and national security in disclosing these errors, Luckey said that Anduril’s customer, the Pentagon, will receive complete information about what went wrong. That’s in line with the Pentagon’s policies on responsible AI adoption, which require that AI-driven systems be “developed with methodologies, data sources, design procedures, and documentation that are transparent to and auditable by their relevant defense personnel.” 

However, the policies promise nothing about disclosure to the public, a fact that’s led some progressive think tanks, like the Brennan Center for Justice, to call on federal agencies to modernize public transparency efforts for the age of AI. 

“It’s easy to say, Well, shouldn’t you be honest about this failure of your system to detect something?” Luckey says, regarding Anduril’s obligations. “Well, what if the failure was because the Chinese figured out a hole in the system and leveraged that to speed past our defenses of some military base? I’d say there’s not very much public good served in saying, ‘Attention, everyone—there is a way to get past all of the security on every US military base around the world.’ I would say that transparency would be the worst thing you could do.”

OpenAI released its advanced voice mode to more people. Here’s how to get it.

OpenAI is broadening access to Advanced Voice Mode, a feature of ChatGPT that lets you speak more naturally with the AI model. You can interrupt its responses midsentence, and it can sense and interpret your emotions from your tone of voice and adjust its responses accordingly. 

These features were teased back in May when OpenAI unveiled GPT-4o, but they were not released until July—and then just to an invite-only group. (At least initially, there seem to have been some safety issues with the model; OpenAI gave several Wired reporters access to the voice mode back in May, but the magazine reported that the company “pulled it the next morning, citing safety concerns.”)

Users who’ve been able to try it have largely described the model as an impressively fast, dynamic, and realistic voice assistant—which has made its limited availability particularly frustrating to some other OpenAI users. 

Today is the first time OpenAI has promised to bring the new voice mode to a wide range of users. Here’s what you need to know.

What can it do? 

Though ChatGPT currently offers a standard voice mode to paid users, its interactions can be clunky. In the mobile app, for example, you can’t interrupt the model’s often long-winded responses with your voice, only with a tap on the screen. The new version fixes that, and also promises to modify its responses on the basis of the emotion it’s sensing from your voice. As with other versions of ChatGPT, users can personalize the voice mode by asking the model to remember facts about themselves. The new mode also has improved its pronunciation of words in non-English languages.

AI investor Allie Miller posted a demo of the tool in August, which highlighted a lot of the same strengths of OpenAI’s own release videos: The model is fast and adept at changing its accent, tone, and content to match your needs.

“I’m testing the new @OpenAI Advanced Voice Mode and I just snorted with laughter. In a good way. Watch the whole thing ⬇ pic.twitter.com/vSOMzXdwZo” — Allie K. Miller (@alliekmiller), August 2, 2024

The update also adds new voices. Shortly after the launch of GPT-4o, OpenAI was criticized for the similarity between the female voice in its demo videos, named Sky, and that of Scarlett Johansson, who played an AI love interest in the movie Her. OpenAI then removed the voice.

Now it has launched five new voices, named Arbor, Maple, Sol, Spruce, and Vale, which will be available in both the standard and advanced voice modes. MIT Technology Review has not heard them yet, but OpenAI says they were made using professional voice actors from around the world. “We interviewed dozens of actors to find those with the qualities of voices we feel people will enjoy talking to for hours—warm, approachable, inquisitive, with some rich texture and tone,” a company spokesperson says. 

Who can access it and when?

For now, OpenAI is rolling out access to Advanced Voice Mode to Plus users, who pay $20 per month for a premium version, and Team users, who pay $30 per month and have higher message limits. The next group to receive access will be those in the Enterprise and Edu tiers. The exact timing, though, is vague; an OpenAI spokesperson says the company will “gradually roll out access to all Plus and Team users and will roll out to Enterprise and Edu tiers starting next week.” The company hasn’t committed to a firm deadline for when all users in these categories will have access. A message in the ChatGPT app indicates that all Plus users will have access by “the end of fall.”

There are geographic limitations. The new feature is not yet available in the EU, the UK, Switzerland, Iceland, Norway, or Liechtenstein.

There is no immediate plan to release Advanced Voice Mode to free users. (The standard mode remains available to all paid users.)

What steps have been taken to make sure it’s safe?

As the company noted upon the initial release in July and again emphasized this week, Advanced Voice Mode has been safety-tested by external experts “who collectively speak a total of 45 different languages, and represent 29 different geographies.” The GPT-4o system card details how the underlying model handles issues like generating violent or erotic speech, imitating voices without their consent, or generating copyrighted content. 

Still, OpenAI’s models are not open source. Compared with open-source models, which are more transparent about their training data and the “model weights” that govern how the AI produces responses, OpenAI’s closed models are harder for independent researchers to evaluate for safety, bias, and harm. 

An AI script editor could help decide what films get made in Hollywood

Every day across Hollywood, scores of film school graduates and production assistants work as script readers. Their job is to find the diamonds in the rough from the 50,000 or so screenplays pitched each year and flag any worth pursuing further. Each script runs anywhere from 100 to 150 pages, and it can take half a day to read one and write up a “coverage,” or summary of the strengths and weaknesses. With only about 50 of these scripts selling in a given year, readers are trained to be ruthless. 

Now the film-focused tech company Cinelytic, which works with major studios like Warner Bros. and Sony Pictures to analyze film budgets and box office potential, aims to offer script feedback with generative AI. 

Today it launched a new tool called Callaia, which amateur writers and professional script readers alike can use to analyze scripts at $79 each. Using AI, it takes Callaia less than a minute to write its own coverage, which includes a synopsis, a list of comparable films, grades for areas like dialogue and originality, and actor recommendations. It also makes a recommendation on whether or not the film should be financed, giving it a rating of “pass,” “consider,” “recommend,” or “strongly recommend.” Though the foundation of the tool is built with ChatGPT’s API, the team had to coach the model on script-specific tasks like evaluating genres and writing a movie’s logline, which summarizes the story in a sentence. 
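Cinelytic has not published how Callaia is built, but since the article says it sits on ChatGPT’s API and was coached on script-reader tasks, a minimal sketch of that general pattern might look like the code below. The prompt wording, the rating handling, and the model name are assumptions for illustration, not Cinelytic’s actual design.

```python
# Hypothetical sketch of LLM-generated script coverage; not Cinelytic's code.
from openai import OpenAI  # assumes the openai Python SDK and an OPENAI_API_KEY env var

RATINGS = ["pass", "consider", "recommend", "strongly recommend"]

def draft_coverage(script_text: str, model: str = "gpt-4o") -> str:
    """Ask a general-purpose chat model to act as a studio script reader.
    A production system would add script-specific coaching, structured output,
    comparable-film lookups, and actor recommendations on top of this."""
    client = OpenAI()
    prompt = (
        "You are a studio script reader writing coverage for the screenplay below. "
        "Include a one-sentence logline, a short synopsis, comparable films, grades "
        "for dialogue and originality, and a final rating chosen from "
        f"{', '.join(RATINGS)}. Be candid about weaknesses as well as strengths.\n\n"
        + script_text
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```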

“It helps people understand the script very quickly,” says Tobias Queisser, Cinelytic’s cofounder and CEO, who also had a career as a film producer. “You can look at more stories and more scripts, and not eliminate them based on factors that are detrimental to the business of finding great content.”

The idea is that Callaia will give studios a more analytical way to predict how a script may perform on the screen before spending on marketing or production. But, the company says, it’s also meant to ease the bottleneck that script readers create in the filmmaking process. With such a deluge to sort through, many scripts can make it to decision-makers only if they have a recognizable name attached. An AI-driven tool would democratize the script selection process and allow better scripts and writers to be discovered, Queisser says.

The tool’s introduction may further fuel the ongoing Hollywood debate about whether AI will help or harm its creatives. Since the public launch of ChatGPT in late 2022, the technology has drawn concern everywhere from writers’ rooms to special effects departments, where people worry that it will cheapen, augment, or replace human talent.  

In this case, Callaia’s success will depend on whether it can provide critical feedback as well as a human script reader can. 

That’s a challenge because of what GPT and other AI models are built to do, according to Tuhin Chakrabarty, a researcher who studied how well AI can analyze creative works during his PhD in computer science at Columbia University. In one of his studies, Chakrabarty and his coauthors had various AI models and a group of human experts—including professors of creative writing and a screenwriter—analyze the quality of 48 stories: 12 that had appeared in the New Yorker and 36 that were AI-generated. His team found that the two groups virtually never agreed on the quality of the works. 

“Whenever you ask an AI model about the creativity of your work, it is never going to say bad things,” Chakrabarty says. “It is always going to say good things, because it’s trained to be a helpful, polite assistant.”

Cinelytic CTO Dev Sen says this trait did present a hurdle in the design of Callaia, and that the initial output of the model was overly positive. That improved with time and tweaking. “We don’t necessarily want to be overly critical, but aim for a more balanced analysis that points out both strengths and weaknesses in the script,” he says. 

Vir Srinivas, an independent filmmaker whose film Orders from Above won Best Historical Film at Cannes in 2021, agreed to look at an example of Callaia’s output to see how well the AI model can analyze a script. I showed him what the model made of a 100-page script about a jazz trumpeter on a journey of self-discovery in San Francisco, which Cinelytic provided. Srinivas says that the coverage generated by the model didn’t go deep enough to present genuinely helpful feedback to a screenwriter.

“It’s approaching the script in too literal a sense and not a metaphorical one—something which human audiences do intuitively and unconsciously,” he says. “It’s as if it’s being forced to be diplomatic and not make any waves.”

There were other flaws, too. For example, Callaia predicted that the film would need a budget of just $5 to $10 million but also suggested that expensive A-listers like Paul Rudd would have been well suited for the lead role.

Cinelytic says it’s currently at work improving the actor recommendation component, and though the company did not provide data on how well its model analyzes a given script, Sen says feedback from 100 script readers who beta-tested the model was overwhelmingly positive. “Most of them were pretty much blown away, because they said that the coverages were on the order of, if not better than, the coverages they’re used to,” he says. 

Overall, Cinelytic is pitching Callaia as a tool meant to quickly provide feedback on lots of scripts, not to replace human script readers, who will still read and adjust the tool’s findings. Queisser, who is cognizant that whether AI can effectively write or edit creatively is hotly contested in Hollywood, is hopeful the tool will allow script readers to more quickly identify standout scripts while also providing an efficient source of feedback for writers.

“Writers that embrace our tool will have something that can help them refine their scripts and find more opportunities,” he says. “It’s positive for both sides.”

Why OpenAI’s new model is such a big deal

This story is from The Algorithm, our weekly newsletter on AI. To get it in your inbox first, sign up here.

Last weekend, I got married at a summer camp, and during the day our guests competed in a series of games inspired by the show Survivor that my now-wife and I orchestrated. When we were planning the games in August, we wanted one station to be a memory challenge, where our friends and family would have to memorize part of a poem and then relay it to their teammates so they could re-create it with a set of wooden tiles. 

I thought OpenAI’s GPT-4o, its leading model at the time, would be perfectly suited to help. I asked it to create a short wedding-themed poem, with the constraint that each letter could only appear a certain number of times so we could make sure teams would be able to reproduce it with the provided set of tiles. GPT-4o failed miserably. The model repeatedly insisted that its poem worked within the constraints, even though it didn’t. It would correctly count the letters only after the fact, while continuing to deliver poems that didn’t fit the prompt. Without the time to meticulously craft the verses by hand, we ditched the poem idea and instead challenged guests to memorize a series of shapes made from colored tiles. (That ended up being a total hit with our friends and family, who also competed in dodgeball, egg tosses, and capture the flag.)    
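The constraint that tripped up GPT-4o is, at least, easy to verify mechanically: count the letters in a candidate poem and check that none exceeds the available supply of tiles. A short sketch of that check (the tile counts below are made up, not the actual wedding set):

```python
from collections import Counter

def fits_tile_budget(poem: str, tiles: dict[str, int]) -> bool:
    """Return True if the poem can be spelled with the available letter tiles.
    Punctuation and spaces are ignored; comparison is case-insensitive."""
    letters = Counter(ch for ch in poem.lower() if ch.isalpha())
    return all(count <= tiles.get(letter, 0) for letter, count in letters.items())

# Hypothetical tile supply, for illustration only.
tiles = {"a": 5, "d": 3, "e": 8, "h": 2, "i": 3, "l": 4, "n": 4,
         "o": 5, "r": 4, "s": 3, "t": 4, "v": 2, "w": 2}

print(fits_tile_budget("love and wine", tiles))            # True
print(fits_tile_budget("a villanelle in velvet", tiles))   # False: needs more l's and v's
```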

However, last week OpenAI released a new model called o1 (previously referred to under the code name “Strawberry” and, before that, Q*) that blows GPT-4o out of the water for this type of task.

Unlike previous models that are well suited for language tasks like writing and editing, OpenAI o1 is focused on multistep “reasoning,” the type of process required for advanced mathematics, coding, or other STEM-based questions. It uses a “chain of thought” technique, according to OpenAI. “It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working,” the company wrote in a blog post on its website.
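For developers, that reasoning sits behind the same chat interface as earlier models. A minimal sketch, assuming the openai Python SDK and the “o1-preview” model name used at launch (the preview reportedly did not accept system messages or temperature settings at first, so the request is kept bare):

```python
# Minimal sketch of calling the new reasoning model next to GPT-4o, assuming
# the openai Python SDK (>=1.x) and the "o1-preview" model name used at launch.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "A train leaves at 3:40 pm averaging 72 km/h. When has it gone 180 km?"

# o1 spends extra "reasoning" tokens before answering; the request itself
# looks like any other chat completion (no system message used here).
reasoned = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": question}],
)

baseline = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)

print("o1:     ", reasoned.choices[0].message.content)
print("gpt-4o: ", baseline.choices[0].message.content)
```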

OpenAI’s tests point to resounding success. The model ranks in the 89th percentile on questions from the competitive coding organization Codeforces and would be among the top 500 high school students in the USA Math Olympiad, which covers geometry, number theory, and other math topics. The model is also trained to answer PhD-level questions in subjects ranging from astrophysics to organic chemistry. 

In math olympiad questions, the new model is 83.3% accurate, versus 13.4% for GPT-4o. In the PhD-level questions, it averaged 78% accuracy, compared with 69.7% from human experts and 56.1% from GPT-4o. (In light of these accomplishments, it’s unsurprising the new model was pretty good at writing a poem for our nuptial games, though still not perfect; it used more Ts and Ss than instructed to.)

So why does this matter? The bulk of LLM progress until now has been language-driven, resulting in chatbots or voice assistants that can interpret, analyze, and generate words. But in addition to getting lots of facts wrong, such LLMs have failed to demonstrate the types of skills required to solve important problems in fields like drug discovery, materials science, coding, or physics. OpenAI’s o1 is one of the first signs that LLMs might soon become genuinely helpful companions to human researchers in these fields. 

It’s a big deal because it brings “chain-of-thought” reasoning in an AI model to a mass audience, says Matt Welsh, an AI researcher and founder of the LLM startup Fixie. 

“The reasoning abilities are directly in the model, rather than one having to use separate tools to achieve similar results. My expectation is that it will raise the bar for what people expect AI models to be able to do,” Welsh says.

That said, it’s best to take OpenAI’s comparisons to “human-level skills” with a grain of salt, says Yves-Alexandre de Montjoye, an associate professor in math and computer science at Imperial College London. It’s very hard to meaningfully compare how LLMs and people go about tasks such as solving math problems from scratch.

Also, AI researchers say that measuring how well a model like o1 can “reason” is harder than it sounds. If it answers a given question correctly, is that because it successfully reasoned its way to the logical answer? Or was it aided by a sufficient starting point of knowledge built into the model? The model “still falls short when it comes to open-ended reasoning,” Google AI researcher François Chollet wrote on X.

Finally, there’s the price. This reasoning-heavy model doesn’t come cheap. Though access to some versions of the model is included in premium OpenAI subscriptions, developers using o1 through the API will pay three times as much as they pay for GPT-4o—$15 per 1 million input tokens in o1, versus $5 for GPT-4o. The new model also won’t be most users’ first pick for more language-heavy tasks, where GPT-4o continues to be the better option, according to OpenAI’s user surveys. 
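Those rates make the trade-off easy to put in rough numbers. The back-of-the-envelope comparison below uses only the input-token prices quoted above; output tokens, which o1 also bills at a premium, are left out.

```python
# Rough input-cost comparison at the rates cited above (dollars per 1M input tokens).
O1_INPUT_RATE = 15.00      # o1
GPT4O_INPUT_RATE = 5.00    # GPT-4o

def input_cost(tokens: int, rate_per_million: float) -> float:
    """Cost of sending `tokens` input tokens at the given per-million rate."""
    return tokens / 1_000_000 * rate_per_million

workload = 20_000_000  # e.g. a month of prompts totaling 20M input tokens
print(f"o1:     ${input_cost(workload, O1_INPUT_RATE):.2f}")     # $300.00
print(f"GPT-4o: ${input_cost(workload, GPT4O_INPUT_RATE):.2f}")  # $100.00
```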

What will it unlock? We won’t know until researchers and labs have the access, time, and budget to tinker with the new model and find its limits. But it’s surely a sign that the race for models that can outreason humans has begun. 

Now read the rest of The Algorithm


Deeper learning

Chatbots can persuade people to stop believing in conspiracy theories

Researchers believe they’ve uncovered a new tool for combating false conspiracy theories: AI chatbots. Researchers from MIT Sloan and Cornell University found that chatting about a conspiracy theory with a large language model (LLM) reduced people’s belief in it by about 20%—even among participants who claimed that their beliefs were important to their identity. 

Why this matters: The findings could represent an important step forward in how we engage with and educate people who espouse such baseless theories, says Yunhao (Jerry) Zhang, a postdoctoral fellow affiliated with the Psychology of Technology Institute who studies AI’s impacts on society. “They show that with the help of large language models, we can—I wouldn’t say solve it, but we can at least mitigate this problem,” he says. “It points out a way to make society better.” Read more from Rhiannon Williams here.

Bits and bytes

Google’s new tool lets large language models fact-check their responses

Called DataGemma, it uses two methods to help LLMs check their responses against reliable data and cite their sources more transparently to users. (MIT Technology Review)

Meet the radio-obsessed civilian shaping Ukraine’s drone defense 

Since Russia’s invasion, Serhii “Flash” Beskrestnov has become an influential, if sometimes controversial, force—sharing expert advice and intel on the ever-evolving technology that’s taken over the skies. His work may determine the future of Ukraine, and wars far beyond it. (MIT Technology Review)

Tech companies have joined a White House commitment to prevent AI-generated sexual abuse imagery

The pledges, signed by firms like OpenAI, Anthropic, and Microsoft, aim to “curb the creation of image-based sexual abuse.” The companies promise to set limits on what models will generate and to remove nude images from training data sets where possible.  (Fortune)

OpenAI is now valued at $150 billion

The valuation arose out of talks it’s currently engaged in to raise $6.5 billion. Given that OpenAI is becoming increasingly costly to operate, and could lose as much as $5 billion this year, it’s tricky to see how it all adds up. (The Information)

Google’s new tool lets large language models fact-check their responses

As long as chatbots have been around, they have made things up. Such “hallucinations” are an inherent part of how AI models work. However, they’re a big problem for companies betting big on AI, like Google, because they make the responses it generates unreliable. 

Google is releasing a tool today to address the issue. Called DataGemma, it uses two methods to help large language models fact-check their responses against reliable data and cite their sources more transparently to users. 

The first of the two methods is called Retrieval-Interleaved Generation (RIG), which acts as a sort of fact-checker. If a user prompts the model with a question—like “Has the use of renewable energy sources increased in the world?”—the model will come up with a “first draft” answer. Then RIG identifies what portions of the draft answer could be checked against Google’s Data Commons, a massive repository of data and statistics from reliable sources like the United Nations or the Centers for Disease Control and Prevention. Next, it runs those checks and replaces any incorrect original guesses with correct facts. It also cites its sources to the user.
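Google has not released DataGemma’s internals beyond this description, but the RIG loop it outlines can be sketched with toy data standing in for Data Commons. Everything below (the claim objects, the lookup table, the figures) is a hypothetical stand-in, not Google’s code or the real Data Commons API.

```python
# Illustrative sketch of the Retrieval-Interleaved Generation (RIG) loop
# described above, with toy data standing in for Data Commons.
from dataclasses import dataclass
from typing import Optional

@dataclass
class StatClaim:
    text: str    # the checkable fragment in the draft, e.g. "rose to 35% in 2022"
    query: str   # what to look up, e.g. "share of renewable energy, world, 2022"
    value: float

@dataclass
class Fact:
    text: str
    value: float
    source: str

TOY_DATA_COMMONS = {  # stand-in for the real repository (figures are illustrative)
    "share of renewable energy, world, 2022": Fact("rose to about 29% in 2022", 29.0, "UN energy statistics"),
}

def query_data_commons(query: str) -> Optional[Fact]:
    return TOY_DATA_COMMONS.get(query)

def rig_check(draft: str, claims: list[StatClaim]) -> str:
    """Replace incorrect guesses in the draft with retrieved figures and cite sources.
    A real system would extract the claims from the draft automatically; here they
    are supplied by hand to keep the sketch self-contained."""
    answer, sources = draft, []
    for claim in claims:
        fact = query_data_commons(claim.query)
        if fact is None:
            continue  # no usable data, the common case in Google's own tests
        if claim.value != fact.value:
            answer = answer.replace(claim.text, fact.text)  # swap in the correct figure
        sources.append(fact.source)
    return answer + ("\n\nSources: " + "; ".join(sources) if sources else "")

draft = "Yes. Renewable energy rose to 35% in 2022."
claims = [StatClaim("rose to 35% in 2022", "share of renewable energy, world, 2022", 35.0)]
print(rig_check(draft, claims))
```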

The second method, which is commonly used in other large language models, is called Retrieval-Augmented Generation (RAG). Consider a prompt like “What progress has Pakistan made against global health goals?” In response, the model examines which data in the Data Commons could help it answer the question, such as information about access to safe drinking water, hepatitis B immunizations, and life expectancies. With those figures in hand, the model then builds its answer on top of the data and cites its sources.
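The RAG variant reverses the order: fetch the relevant statistics first, then have the model write its answer on top of them. Another toy sketch, with placeholder retrieval and generation steps rather than the real pipeline:

```python
# Toy sketch of the RAG flow described above; retrieval and generation are
# placeholders, not the real DataGemma pipeline or Data Commons API.

TOY_TABLES = {  # stand-in for statistics retrieved from Data Commons
    "Pakistan safe drinking water": "share of population with safe drinking water (illustrative table)",
    "Pakistan hepatitis B immunization": "infant hepatitis B immunization rate (illustrative table)",
}

def retrieve_tables(question: str) -> dict[str, str]:
    """Pick which tables could help answer the question (keyword toy version)."""
    return {name: desc for name, desc in TOY_TABLES.items() if "Pakistan" in question}

def generate_with_context(question: str, tables: dict[str, str]) -> str:
    """Stand-in for the LLM call: a real system would pass the tables into the
    prompt, ground the answer in them, and cite each table as a source."""
    context = "\n".join(f"- {name}: {desc}" for name, desc in tables.items())
    return f"Answer to '{question}' grounded in:\n{context}"

question = "What progress has Pakistan made against global health goals?"
print(generate_with_context(question, retrieve_tables(question)))
```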

“Our goal here was to use Data Commons to enhance the reasoning of LLMs by grounding them in real-world statistical data that you could source back to where you got it from,” says Prem Ramaswami, head of Data Commons at Google. Doing so, he says, will “create more trustable, reliable AI.”

It is only available to researchers for now, but Ramaswami says access could widen further after more testing. If it works as hoped, it could be a real boon for Google’s plan to embed AI deeper into its search engine.  

However, it comes with a host of caveats. First, the usefulness of the methods is limited by whether the relevant data is in the Data Commons, which is more of a data repository than an encyclopedia. It can tell you the GDP of Iran, but it’s unable to confirm the date of the First Battle of Fallujah or when Taylor Swift released her most recent single. In fact, Google’s researchers found that with about 75% of the test questions, the RIG method was unable to obtain any usable data from the Data Commons. And even if helpful data is indeed housed in the Data Commons, the model doesn’t always formulate the right questions to find it. 

Second, there is the question of accuracy. When testing the RAG method, researchers found that the model gave incorrect answers 6% to 20% of the time. Meanwhile, the RIG method pulled the correct stat from Data Commons only about 58% of the time (though that’s a big improvement over the 5% to 17% accuracy rate of Google’s large language models when they’re not pinging Data Commons). 

Ramaswami says DataGemma’s accuracy will improve as it gets trained on more and more data. The initial version has been trained on only about 700 questions, and fine-tuning the model required his team to manually check each individual fact it generated. To further improve the model, the team plans to increase that data set from hundreds of questions to millions.
