Artificial General Intelligence (AGI) is the next frontier of AI. Commonly defined as technology that can match human abilities at most tasks, it raises two big questions: when will it be possible, and how will we be able to evaluate it?
As AI grows ever more sophisticated, thanks to faster computers, better algorithms and more data, timelines have compressed. Leaders at major AI labs, including OpenAI, Anthropic, and Google DeepMind, expect AGI within a few years.
How to measure Artificial General Intelligence
A computer system that thinks the way we do would allow much closer collaboration with humans. The immediate and long-term impacts of AGI, if it is achieved, are unclear, but changes are expected across the board, from economics to scientific discovery to geopolitics.
And if AGI eventually leads to superintelligence, it could even affect humanity’s position at the top of the food chain. It is therefore imperative to closely monitor the technology’s progress in order to prepare for such disruption. Evaluating the capabilities of Artificial General Intelligence would help define legal regulations, engineering objectives, social norms and business models, as well as deepen our understanding of intelligence more broadly.
While assessing any intellectual ability is difficult, doing so for general AI presents special challenges. This is due, in part, to strong disagreements over its definition: some define AGI by its performance on benchmarks, others by its internal workings, its economic impact or simply its vibes. The first step in measuring AI intelligence, therefore, is to reach agreement on the general concept.
Another problem is that AI systems have different strengths and weaknesses than humans, so even if we define AGI as “AI that can match humans at most tasks”, we can still debate which tasks really matter and which humans set the bar. Direct comparisons are very difficult, as Geoffrey Hinton, winner of a Nobel Prize for his work on AI, put it: “We are building alien beings”.
Some researchers are busy designing and proposing tests that could shed light on our future, but one question remains: can these tests tell us whether we have achieved the coveted goal of AGI?
Why is it so difficult to assess intelligence?
There are countless kinds of intelligence, even among humans. IQ tests provide a kind of statistical summary by including a range of semi-related tasks involving memory, logic, spatial processing, mathematics and vocabulary. Viewed another way, performance on each task draws on a combination of what is called fluid intelligence (reasoning on the fly) and crystallized intelligence (applying learned knowledge or skills).
For citizens of first-world countries, IQ tests often predict key outcomes, such as academic and career success. However, we cannot make the same assumptions about AI, whose abilities are not grouped in the same way. An IQ test designed for humans might not say the same thing about a machine as it does about a person.
There are other types of intelligence that are not typically assessed by IQ tests and that fall even further outside the scope of most AI benchmarks. These include social intelligence, such as the ability to make psychological inferences, and physical intelligence, such as understanding causal relationships between objects and forces, or the ability to coordinate a body in an environment. Both are crucial for humans navigating complex situations.
Assessing intelligence is difficult, whether in people, animals or machines, and one has to be careful about both false positives and false negatives. It is also difficult because notions of intelligence vary by place and time, shifting as societies change along with their understanding of what is truly important.
AI testing
Over the years, many people have presented machines with grand challenges that purported to require intelligence on par with our own. In 1950, Alan Turing, considered the “father” of computer science and a forerunner of modern computing, proposed a game that evaluated a machine’s ability to exhibit intelligent behavior similar to, or indistinguishable from, that of a human being. For decades, passing what is now known as the Turing test was considered an almost impossible challenge and a strong indicator of AGI.
As early as the 1960s, researchers described chess as the intellectual game par excellence and thought that designing a successful chess machine would be a great starting point. Some of that vision came to fruition in 1997, when the Deep Blue machine defeated Garry Kasparov, the world chess champion at the time. Yet the IBM machine lacked the general intelligence even to play a simple game of checkers.
Another breakthrough for AI testing came in 2019, when François Chollet, then a software engineer at Google, published an article titled «On the Measure of Intelligence». Alongside it, he created a new benchmark called ARC (Abstraction and Reasoning Corpus) to try to measure Artificial General Intelligence. It included hundreds of visual exercises, each with several demonstrations and a test. A demonstration consists of an input grid and an output grid, both made of colored squares. The test has only an input grid. The challenge is to infer a rule from the demonstrations and apply it to the test, producing a new output grid.
So that it is a test not of stored knowledge but of how knowledge is recombined, the training puzzles must supply all the necessary background knowledge. This includes concepts such as object cohesion, symmetry and counting, the common sense of a young child. Humans can solve most of the puzzles with ease, but AI struggled, at least at first.
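To make the format concrete, here is a minimal sketch in Python of how an ARC-style task might be represented and solved. The grids, the toy “mirror each row” rule and the helper names are illustrative assumptions for this article, not code from the official benchmark; the real tasks are published in a similar JSON format with “train” and “test” pairs.

```python
# Minimal sketch of an ARC-style task: demonstration input/output grids plus a
# test input. The rule here ("mirror each row") is a toy example for illustration.
from typing import Dict, List

Grid = List[List[int]]  # each cell holds a colour index

task: Dict[str, list] = {
    "train": [  # demonstration pairs: the rule must be inferred from these
        {"input": [[1, 0, 0],
                   [0, 2, 0]],
         "output": [[0, 0, 1],
                    [0, 2, 0]]},
        {"input": [[3, 3, 0],
                   [0, 0, 4]],
         "output": [[0, 3, 3],
                    [4, 0, 0]]},
    ],
    "test": [  # only the input is given; the solver must produce the output grid
        {"input": [[5, 0, 0],
                   [0, 0, 6]]},
    ],
}

def mirror_rows(grid: Grid) -> Grid:
    """Candidate rule: flip every row left to right."""
    return [list(reversed(row)) for row in grid]

def rule_fits_demos(rule, demos) -> bool:
    """A candidate rule is plausible only if it reproduces every demonstration exactly."""
    return all(rule(d["input"]) == d["output"] for d in demos)

if rule_fits_demos(mirror_rows, task["train"]):
    prediction = mirror_rows(task["test"][0]["input"])
    print(prediction)  # [[0, 0, 5], [6, 0, 0]]
```

The difficulty for machines lies not in applying such a rule but in discovering it from only a couple of examples, which is exactly the recombination of basic knowledge the benchmark is meant to probe.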
Last March, Chollet presented a more difficult version, called ARC-AGI-2. The average human score is 60 percent, while the best AI score is around 16 percent. ARC is considered a valuable theoretical reference point that can shed light on how algorithms work, but it does not capture the real complexity of AI applications, such as social reasoning tasks. Hence, some researchers prefer, instead of benchmarks, to track the scientific discoveries AI is capable of making and the jobs it automates.
General-Bench is another reference benchmark. It uses five input modalities (text, images, video, audio and 3D) to test AI systems on hundreds of demanding tasks involving recognition, reasoning, creativity, ethical judgment and other abilities to understand and generate material. Ideally, a general AI would exhibit synergy, leveraging capabilities across tasks to outperform the best specialist AIs. At present, however, no AI can even handle all five modalities.
In short, it is extremely difficult to evaluate these capabilities, and harder still to know when we will have achieved Artificial General Intelligence, the ability to match human abilities at most tasks, along with everything that would entail across so many fields.