Humanity's Last Exam: Challenging AI with Unanswerable Questions

AI · LLM · Testing · AGI · Benchmarks

Published on 2/4/2025

Introduction

Two prominent players in San Francisco's artificial intelligence scene are pushing the boundaries of AI testing. Scale AI, a company that prepares training data for large language models (LLMs), and the Center for AI Safety (CAIS) have launched an initiative called "Humanity's Last Exam." The project challenges the public to devise questions that can genuinely test the capabilities of advanced LLMs such as Google Gemini and OpenAI's o1.

The initiative offers prizes of $5,000 for each of the top 50 questions selected for the test. By engaging a broad range of experts, Scale AI and CAIS aim to gauge how close we are to achieving "expert-level AI systems."

The Need for New Tests

Leading LLMs are increasingly proficient at established tests in areas like intelligence, mathematics, and law. However, the significance of these achievements is questionable. Due to the massive datasets used for training, LLMs may be pre-learning answers, essentially memorizing solutions rather than demonstrating genuine understanding.

Data is crucial to the shift from traditional computing to AI. Instead of explicitly programming instructions, AI systems learn from data. This requires high-quality training datasets and robust testing methodologies. Developers typically use "test datasets," which consist of data not used during the training phase, to evaluate the performance of AI models.
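
To make the idea concrete, here is a minimal sketch of a held-out test set using scikit-learn. The dataset and model are generic placeholders for illustration, not anything LLM-specific.

```python
# Minimal sketch of a held-out test set, using scikit-learn.
# The dataset and model are illustrative placeholders, not any
# specific LLM pipeline.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Hold back 20% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)

# Evaluation on unseen data is what a benchmark score is meant to reflect.
print("accuracy on held-out test set:", model.score(X_test, y_test))
```

The held-out score only means something if the test data genuinely stayed out of training, which is exactly the guarantee that becomes hard to make once models have read most of the public internet.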

The research group Epoch AI estimates that by 2028, AIs may have effectively processed all human-written text. This raises a critical question: how do we keep assessing AIs once they have access to virtually all existing information? If LLMs aren't already pre-learning answers to standardized tests, they likely will be soon.

The Challenge of Model Collapse

The continuous expansion of the internet, with millions of new items added daily, might seem like a solution to the pre-learning problem. However, this leads to another challenge known as "model collapse." As AI-generated content increasingly populates the internet and is subsequently used in future AI training sets, the performance of AIs may degrade. This is because the AI is essentially learning from its own outputs, which can amplify errors and biases.
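
A toy simulation conveys the feedback loop, though it is a deliberate caricature: each "generation" fits a simple Gaussian model to samples drawn from the previous generation's fit, with no fresh human data, so every generation inherits and compounds the previous one's estimation error.

```python
# Toy caricature of "model collapse": each generation fits a simple
# model (a Gaussian) to data sampled from the previous generation's
# model, instead of to the original human data. Sampling error is
# inherited and compounded, so the fitted distribution drifts away
# from the original one. Real LLM training is far more complex; this
# only illustrates the feedback-loop mechanism.
import numpy as np

rng = np.random.default_rng(42)

human_data = rng.normal(loc=0.0, scale=1.0, size=200)
data = human_data

for generation in range(51):
    mu, sigma = data.mean(), data.std()
    if generation % 10 == 0:
        print(f"gen {generation:2d}: fitted mean={mu:+.3f}, std={sigma:.3f}")
    # The next generation sees only synthetic samples, never the original data.
    data = rng.normal(loc=mu, scale=sigma, size=200)
```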

To mitigate model collapse, many developers are collecting data from human interactions with AIs, adding fresh, real-world data for training and testing. This approach helps to keep the training data relevant and diverse.

Some experts suggest that AIs need to become embodied, interacting with the real world and gaining experiences similar to humans. This concept might seem futuristic, but companies like Tesla are already implementing it with their autonomous vehicles. Another avenue involves human wearables, such as Meta's Ray-Ban smart glasses, which can collect vast amounts of human-centric video and audio data.

The Limitations of Narrow Tests

Even if sufficient training data is available, defining and measuring intelligence, particularly artificial general intelligence (AGI), remains a significant challenge. AGI refers to AI that matches or surpasses human intelligence across a wide range of tasks.

Traditional human IQ tests have been criticized for failing to capture the multifaceted nature of intelligence, which includes language, mathematics, empathy, and spatial reasoning.

A similar issue exists with AI tests. Many established tests evaluate specific tasks such as text summarization, comprehension, inference, human pose recognition, and machine vision. While these tests are valuable, they often provide a narrow view of overall intelligence.

Some tests are being retired because AIs perform too well on them. However, these tests are often task-specific and don't reflect broader intelligence. For example, the chess-playing AI Stockfish significantly outperforms Magnus Carlsen, the highest-rated human player, on the Elo rating system. Yet, Stockfish cannot perform other tasks, such as understanding language. Therefore, equating its chess capabilities with general intelligence would be misleading.
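
For readers unfamiliar with Elo, the size of that gap can be made concrete with the standard expected-score formula. The ratings below are rounded, illustrative figures, not official ones.

```python
# Expected score under the standard Elo formula:
#   E_A = 1 / (1 + 10 ** ((R_B - R_A) / 400))
# The ratings below are rounded illustrative numbers, not official figures.
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Expected score (roughly, win probability) of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# Illustrative example: an engine rated ~3500 vs a human rated ~2830.
print(f"{elo_expected_score(3500, 2830):.4f}")  # ~0.98: near-certain wins
```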

As AIs demonstrate increasingly broad intelligent behavior, the challenge lies in developing new benchmarks for comparing and measuring their progress. One notable approach comes from Google engineer François Chollet, who argues that true intelligence lies in the ability to adapt and generalize learning to new, unseen situations.

In 2019, Chollet introduced the Abstraction and Reasoning Corpus (ARC), a collection of visual grid puzzles designed to test an AI's ability to infer and apply abstract rules. Unlike traditional benchmarks that train AIs on millions of labeled images, ARC provides minimal examples, forcing the AI to deduce the underlying logic.

The Abstraction and Reasoning Corpus (ARC)

The ARC dataset presents AI systems with a series of visual puzzles. Each puzzle consists of a set of input-output pairs. The AI must analyze these pairs, identify the underlying pattern or rule, and then apply that rule to a new input to generate the correct output.

For example, a puzzle might show a series of grids where a specific shape is consistently transformed in a particular way. The AI must learn this transformation and apply it to a new grid with a different initial shape.
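
A minimal sketch of that format is below, using an invented puzzle rather than a real ARC task: grids are nested lists of small integers (colours), and the hidden rule in this toy example is a left-right mirror.

```python
# Minimal sketch of the ARC task format. Grids are lists of lists of
# small integers (colours). The puzzle below is invented for
# illustration; it is not taken from the real ARC dataset.
# Hidden rule in this toy task: mirror the grid left-to-right.

train_pairs = [
    ([[1, 0, 0],
      [2, 0, 0]],
     [[0, 0, 1],
      [0, 0, 2]]),
    ([[0, 3, 0],
      [4, 0, 0]],
     [[0, 3, 0],
      [0, 0, 4]]),
]

test_input = [[5, 0, 0],
              [0, 6, 0]]

def mirror_left_right(grid):
    """Candidate rule: reverse each row of the grid."""
    return [list(reversed(row)) for row in grid]

# A solver must infer the rule from the training pairs alone...
assert all(mirror_left_right(x) == y for x, y in train_pairs)

# ...and then apply it to the unseen test input.
print(mirror_left_right(test_input))  # [[0, 0, 5], [0, 6, 0]]
```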

The key challenge of ARC is that the AI cannot simply memorize solutions. It must understand the abstract principles governing the transformations and apply them in novel situations. This requires a level of reasoning and generalization that is beyond the capabilities of many current AI systems.

Though humans generally find ARC puzzles relatively easy, achieving high scores remains a significant challenge for AI. There is a $600,000 prize for the first AI system to reach 85 percent on the ARC benchmark. Currently, leading LLMs such as OpenAI's o1-preview and Anthropic's Claude 3.5 Sonnet score around 21 percent on the ARC-AGI-Pub leaderboard. A more recent attempt using OpenAI's GPT-4o scored 50 percent, but it involved generating thousands of candidate solutions before selecting the best one, a method that raises questions about the validity of the result.
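
The "generate many candidates, then select" pattern behind that GPT-4o result can be sketched roughly as follows. The candidate-proposal step here is a stand-in for sampling candidate programs from a model; this is not the actual method used in the reported attempt.

```python
# Rough sketch of the "generate many candidates, then select" pattern.
# `propose_candidate_program` stands in for sampling a candidate
# transformation from an LLM; here it just draws from a few
# hand-written rules. This is not the actual method used in the
# reported GPT-4o attempt.
import random

def identity(grid):
    return grid

def mirror(grid):
    return [list(reversed(row)) for row in grid]

def transpose(grid):
    return [list(col) for col in zip(*grid)]

CANDIDATE_RULES = [identity, mirror, transpose]

def propose_candidate_program(rng):
    # Placeholder for "ask the model for another candidate solution".
    return rng.choice(CANDIDATE_RULES)

def solve(train_pairs, test_input, n_samples=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(n_samples):
        rule = propose_candidate_program(rng)
        # Keep only candidates consistent with every training pair.
        if all(rule(x) == y for x, y in train_pairs):
            return rule(test_input)
    return None  # no candidate explained the training examples

train_pairs = [([[1, 0]], [[0, 1]]), ([[2, 3]], [[3, 2]])]
print(solve(train_pairs, [[4, 5]]))  # [[5, 4]] via the mirror rule
```

The catch, as noted above, is that brute-force sampling over thousands of candidates tells us more about search than about the model's own reasoning.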

The Future of AI Testing

While ARC represents a significant step forward in testing AI intelligence, the Scale/CAIS initiative highlights the ongoing search for even more effective methods. The fact that the prize-winning questions from "Humanity's Last Exam" will not be published online underscores the importance of preventing AIs from simply learning the test questions.

It is crucial to accurately assess when machines are approaching human-level reasoning, given the significant safety, ethical, and moral implications. Eventually, we will face the even more daunting challenge of testing for superintelligence, a task that requires profound conceptual breakthroughs.

Humanity's Last Exam https://arxiv.org/abs/2501.14249
