Ars Technica
New secret math benchmark stumps AI models and PhDs alike
12 November 2024 at 23:49

New secret math benchmark stumps AI models and PhDs alike

By: Benj Edwards

12 November 2024 at 23:49

On Friday, research organization Epoch AI released FrontierMath, a new mathematics benchmark that has been turning heads in the AI world because it contains hundreds of expert-level problems that leading AI models solve less than 2 percent of the time, according to Epoch AI. The benchmark tests AI language models (such as GPT-4o, which powers ChatGPT) against original mathematics problems that typically require hours or days for specialist mathematicians to complete.

FrontierMath's performance results, revealed in a preprint research paper, paint a stark picture of current AI model limitations. Even with access to Python environments for testing and verification, top models like Claude 3.5 Sonnet, GPT-4o, o1-preview, and Gemini 1.5 Pro scored extremely poorly. This contrasts with their high performance on simpler math benchmarks—many models now score above 90 percent on tests like GSM8K and MATH.

The design of FrontierMath differs from many existing AI benchmarks because the problem set remains private and unpublished to prevent data contamination. Many existing AI models are trained on other test problem datasets, allowing the AI models to easily solve the problems and appear more generally capable than they actually are. Many experts cite this as evidence that current large language models (LLMs) are poor generalist learners.

Read full article

Comments

Ars Technica
Ars Live: Our first encounter with manipulative AI
12 November 2024 at 14:17

Ars Live: Our first encounter with manipulative AI

Ars Technica

By: Benj Edwards

12 November 2024 at 14:17

In the short-term, the most dangerous thing about AI language models may be their ability to emotionally manipulate humans if not carefully conditioned. The world saw its first taste of that potential danger in February 2023 with the launch of Bing Chat, now called Microsoft Copilot.

During its early testing period, the temperamental chatbot gave the world a preview of an "unhinged" version of OpenAI's GPT-4 prior to its official release. Sydney's sometimes uncensored and "emotional" nature (including use of emojis) arguably gave the world its first large-scale encounter with a truly manipulative AI system. The launch set off alarm bells in the AI alignment community and served as fuel for prominent warning letters about AI dangers.

On November 19 at 4 pm Eastern (1 pm Pacific), Ars Technica Senior AI Reporter Benj Edwards will host a livestream conversation on YouTube with independent AI researcher Simon Willison that will explore the impact and fallout of the 2023 fiasco. We're calling it "Bing Chat: Our First Encounter with Manipulative AI."

Read full article

Comments

Ars Technica
Is “AI welfare” the new frontier in ethics?
11 November 2024 at 16:51

Is “AI welfare” the new frontier in ethics?

Ars Technica

By: Benj Edwards

11 November 2024 at 16:51

A few months ago, Anthropic quietly hired its first dedicated "AI welfare" researcher, Kyle Fish, to explore whether future AI models might deserve moral consideration and protection, reports AI newsletter Transformer. While sentience in AI models is an extremely controversial and contentious topic, the hire could signal a shift toward AI companies examining ethical questions about the consciousness and rights of AI systems.

Fish joined Anthropic's alignment science team in September to develop guidelines for how Anthropic and other companies should approach the issue. The news follows a major report co-authored by Fish before he landed his Anthropic role. Titled "Taking AI Welfare Seriously," the paper warns that AI models could soon develop consciousness or agency—traits that some might consider requirements for moral consideration. But the authors do not say that AI consciousness is a guaranteed future development.

"To be clear, our argument in this report is not that AI systems definitely are—or will be—conscious, robustly agentic, or otherwise morally significant," the paper reads. "Instead, our argument is that there is substantial uncertainty about these possibilities, and so we need to improve our understanding of AI welfare and our ability to make wise decisions about this issue. Otherwise there is a significant risk that we will mishandle decisions about AI welfare, mistakenly harming AI systems that matter morally and/or mistakenly caring for AI systems that do not."

Read full article

Comments

Ars Technica
Claude AI to process secret government data through new Palantir deal
8 November 2024 at 23:08

Claude AI to process secret government data through new Palantir deal

Ars Technica

By: Benj Edwards

8 November 2024 at 23:08

Anthropic has announced a partnership with Palantir and Amazon Web Services to bring its Claude AI models to unspecified US intelligence and defense agencies. Claude, a family of AI language models similar to those that power ChatGPT, will work within Palantir's platform using AWS hosting to process and analyze data. But some critics have called out the deal as contradictory to Anthropic's widely-publicized "AI safety" aims.

On X, former Google co-head of AI ethics Timnit Gebru wrote of Anthropic's new deal with Palantir, "Look at how they care so much about 'existential risks to humanity.'"

The partnership makes Claude available within Palantir's Impact Level 6 environment (IL6), a defense-accredited system that handles data critical to national security up to the "secret" classification level. This move follows a broader trend of AI companies seeking defense contracts, with Meta offering its Llama models to defense partners and OpenAI pursuing closer ties with the Defense Department.

Read full article

Comments

Ars Technica
The voice of America Online’s “You’ve got mail” has died at age 74
8 November 2024 at 17:13

The voice of America Online’s “You’ve got mail” has died at age 74

Ars Technica

By: Benj Edwards

8 November 2024 at 17:13

On Tuesday, Elwood Edwards, the voice behind the online service America Online's iconic "You've got mail" greeting, died at age 74, one day before his 75th birthday, according to Cleveland's WKYC Studios, where he worked for many years. The greeting became a cultural touchstone in the 1990s and early 2000s in the early Internet era; it was heard by hundreds of millions of users when they logged in to the service and new email was waiting for them.

The story of Edwards' famous recording began in 1989 when Steve Case, CEO of Quantum Computer Services (which later became America Online—or AOL for short), wanted to add a human voice to the company's Quantum Link online service. Karen Edwards, who worked as a customer service representative, heard Case discussing the plan and suggested her husband Elwood, a professional broadcaster.

Edwards recorded the famous phrase (and several others) into a cassette recorder in his living room in 1989 and was paid $200 for the service. His voice recordings of "Welcome," "You've got mail," "File's done," and "Goodbye" went on to reach millions of users during AOL's rise to dominance in the 1990s online landscape.

Read full article

Comments

Ars Technica
ChatGPT has a new vanity domain name, and it may have cost $15 million
7 November 2024 at 16:32

ChatGPT has a new vanity domain name, and it may have cost $15 million

Ars Technica

By: Benj Edwards

7 November 2024 at 16:32

On Wednesday, OpenAI CEO Sam Altman merely tweeted "chat.com," announcing that the company had acquired the short domain name, which now points to the company's ChatGPT AI assistant when visited in a web browser. As of Thursday morning, "chatgpt.com" still hosts the chatbot, with the new domain serving as a redirect.

The new domain name comes with an interesting backstory that reveals a multimillion-dollar transaction. HubSpot founder and CTO Dharmesh Shah purchased chat.com for $15.5 million in early 2023, The Verge reports. Shah sold the domain to OpenAI for an undisclosed amount, though he confirmed on X that he "doesn't like profiting off of people he considers friends" and that he received payment in company shares by revealing he is "now an investor in OpenAI."

As The Verge's Kylie Robison points out, Shah originally bought the domain to promote conversational interfaces. "The reason I bought chat.com is simple: I think Chat-based UX (#ChatUX) is the next big thing in software. Communicating with computers/software through a natural language interface is much more intuitive. This is made possible by Generative A.I.," Shah wrote in a LinkedIn post during his brief ownership.

Read full article

Comments

Ars Technica
Trump plans to dismantle Biden AI safeguards after victory
6 November 2024 at 22:18

Trump plans to dismantle Biden AI safeguards after victory

Ars Technica

By: Benj Edwards

6 November 2024 at 22:18

Early Wednesday morning, Donald Trump became the presumptive winner of the 2024 US presidential election, setting the stage for dramatic changes to federal AI policy when he takes office early next year. Among them, Trump has stated he plans to dismantle President Biden's AI Executive Order from October 2023 immediately upon taking office.

Biden's order established wide-ranging oversight of AI development. Among its core provisions, the order established the US AI Safety Institute (AISI) and lays out requirements for companies to submit reports about AI training methodologies and security measures, including vulnerability testing data. The order also directed the Commerce Department's National Institute of Standards and Technology (NIST) to develop guidance to help companies identify and fix flaws in their AI models.

Trump supporters in the US government have criticized the measures, as TechCrunch points out. In March, Representative Nancy Mace (R-S.C.) warned that reporting requirements could discourage innovation and prevent developments like ChatGPT. And Senator Ted Cruz (R-Texas) characterized NIST's AI safety standards as an attempt to control speech through "woke" safety requirements.

Read full article

Comments

Ars Technica
Anthropic’s Haiku 3.5 surprises experts with an “intelligence” price increase
5 November 2024 at 23:50

Anthropic’s Haiku 3.5 surprises experts with an “intelligence” price increase

Ars Technica

By: Benj Edwards

5 November 2024 at 23:50

On Monday, Anthropic launched the latest version of its smallest AI model, Claude 3.5 Haiku, in a way that marks a departure from typical AI model pricing trends—the new model costs four times more to run than its predecessor. The reason for the price increase is causing some pushback in the AI community: more smarts, according to Anthropic.

"During final testing, Haiku surpassed Claude 3 Opus, our previous flagship model, on many benchmarks—at a fraction of the cost," Anthropic wrote in a post on X. "As a result, we've increased pricing for Claude 3.5 Haiku to reflect its increase in intelligence."

"It's your budget model that's competing against other budget models, why would you make it less competitive," wrote one X user. "People wanting a 'too cheap to meter' solution will now look elsewhere."

Read full article

Comments

Ars Technica
Downey Jr. plans to fight AI re-creations from beyond the grave
30 October 2024 at 19:53

Downey Jr. plans to fight AI re-creations from beyond the grave

Ars Technica

By: Benj Edwards

30 October 2024 at 19:53

Robert Downey Jr. has declared that he will sue any future Hollywood executives who try to re-create his likeness using AI digital replicas, as reported by Variety. His comments came during an appearance on the "On With Kara Swisher" podcast, where he discussed AI's growing role in entertainment.

"I intend to sue all future executives just on spec," Downey told Swisher when discussing the possibility of studios using AI or deepfakes to re-create his performances after his death. When Swisher pointed out he would be deceased at the time, Downey responded that his law firm "will still be very active."

The Oscar winner expressed confidence that Marvel Studios would not use AI to re-create his Tony Stark character, citing his trust in decision-makers there. "I am not worried about them hijacking my character's soul because there's like three or four guys and gals who make all the decisions there anyway and they would never do that to me," he said.

Read full article

Comments

Ars Technica
Google CEO says over 25% of new Google code is generated by AI
30 October 2024 at 16:50

Google CEO says over 25% of new Google code is generated by AI

Ars Technica

By: Benj Edwards

30 October 2024 at 16:50

On Tuesday, Google's CEO revealed that AI systems now generate more than a quarter of new code for its products, with human programmers overseeing the computer-generated contributions. The statement, made during Google's Q3 2024 earnings call, shows how AI tools are already having a sizable impact on software development.

"We're also using AI internally to improve our coding processes, which is boosting productivity and efficiency," Pichai said during the call. "Today, more than a quarter of all new code at Google is generated by AI, then reviewed and accepted by engineers. This helps our engineers do more and move faster."

Google developers aren't the only programmers using AI to assist with coding tasks. It's difficult to get hard numbers, but according to Stack Overflow's 2024 Developer Survey, over 76 percent of all respondents "are using or are planning to use AI tools in their development process this year," with 62 percent actively using them. A 2023 GitHub survey found that 92 percent of US-based software developers are "already using AI coding tools both in and outside of work."

Read full article

Comments

Ars Technica
Hospitals adopt error-prone AI transcription tools despite warnings
28 October 2024 at 19:23

Hospitals adopt error-prone AI transcription tools despite warnings

Ars Technica

By: Benj Edwards

28 October 2024 at 19:23

On Saturday, an Associated Press investigation revealed that OpenAI's Whisper transcription tool creates fabricated text in medical and business settings despite warnings against such use. The AP interviewed more than 12 software engineers, developers, and researchers who found the model regularly invents text that speakers never said, a phenomenon often called a "confabulation" or "hallucination" in the AI field.

Upon its release in 2022, OpenAI claimed that Whisper approached "human level robustness" in audio transcription accuracy. However, a University of Michigan researcher told the AP that Whisper created false text in 80 percent of public meeting transcripts examined. Another developer, unnamed in the AP report, claimed to have found invented content in almost all of his 26,000 test transcriptions.

The fabrications pose particular risks in health care settings. Despite OpenAI's warnings against using Whisper for "high-risk domains," over 30,000 medical workers now use Whisper-based tools to transcribe patient visits, according to the AP report. The Mankato Clinic in Minnesota and Children's Hospital Los Angeles count among 40 health systems using a Whisper-powered AI copilot service from medical tech company Nabla that is fine-tuned on medical terminology.

Read full article

Comments

Ars Technica
At TED AI 2024, experts grapple with AI’s growing pains
24 October 2024 at 00:32

At TED AI 2024, experts grapple with AI’s growing pains

Ars Technica

By: Benj Edwards

24 October 2024 at 00:32

SAN FRANCISCO—On Tuesday, TED AI 2024 kicked off its first day at San Francisco's Herbst Theater with a lineup of speakers that tackled AI's impact on science, art, and society. The two-day event brought a mix of researchers, entrepreneurs, lawyers, and other experts who painted a complex picture of AI with fairly minimal hype.

The second annual conference, organized by Walter and Sam De Brouwer, marked a notable shift from last year's broad existential debates and proclamations of AI as being "the new electricity." Rather than sweeping predictions about, say, looming artificial general intelligence (although there was still some of that, too), speakers mostly focused on immediate challenges: battles over training data rights, proposals for hardware-based regulation, debates about human-AI relationships, and the complex dynamics of workplace adoption.

The day's sessions covered a wide breadth of AI topics: physicist Carlo Rovelli explored consciousness and time, Project CETI researcher Patricia Sharma demonstrated attempts to use AI to decode whale communication, Recording Academy CEO Harvey Mason Jr. outlined music industry adaptation strategies, and even a few robots made appearances.

Read full article

Comments

Normal view