Measuring AI with math, coding, science, logic, or other tests, even tests it has never seen before, is like gauging the rising temperature of water by hand. The water will be found to be hot, but saying how hot requires a standard.
The only standard for intelligence is human intelligence. Measuring AI by anything other than human intelligence is already off the mark. AI is intelligent. It can already do many things humans can do. Even for problems it has not seen before, several solutions are within its reach. If a human were to pass a test, which would require thought and time, how would the mind solve it?
This question assumes that the mind has components. The components are in relays when trying to answer the question. How do they relay? If those relays are labeled, how does AI compare? This is the ultimate benchmark for AI.
LLMs are excellent at prediction. If something similar to prediction occurs in the human mind, it could be expressed as a level of relay. If time is taken to answer a question, which means sifting through different aspects of the mind, this too can be expressed as a form of relay.
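To make the relay framing concrete, a minimal sketch follows, purely illustrative: it only labels the steps of a pretend answering process as relays. The Relay class, the labels, and the answer_with_relays function are hypothetical names for this sketch, not a model of the brain or of any LLM.

```python
# A toy sketch: express a (pretend) answering process as a labeled list of
# "relays", where prediction and sifting each become a relay step.

from dataclasses import dataclass

@dataclass
class Relay:
    label: str      # e.g. "prediction" or "sifting"
    payload: str    # what this relay carries forward

def answer_with_relays(question: str) -> list:
    """Record each step of a pretend answering process as a labeled relay."""
    relays = [Relay("prediction", f"first guess for: {question}")]
    # "Sifting" through different aspects of mind before settling on an answer
    for aspect in ("memory", "feeling", "language"):
        relays.append(Relay("sifting", f"checked {aspect}"))
    relays.append(Relay("prediction", "final answer"))
    return relays

for r in answer_with_relays("What is 7 x 8?"):
    print(r.label, "->", r.payload)
```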
The components of the human mind can be said to have characteristics. These characteristics have their strengths and weaknesses. There are commonalities in how they work across most humans, but there are also specificities particular to individuals, which may determine how good they are at certain subjects, interests, or goals.
The components of the mind can be assumed to be electrical and chemical signals because, conceptually, they are the most dynamic elements in the brain, involved in every functional purpose. They outdo neurons because neurons have to fire to work, and firing means electrical signals traveling from and toward chemical signals.
Characteristics of [sets of] electrical signals include sequences, the paths of travel, which could be old or new. Conceptually, this means that some electrical signals in a set may depart from a certain side of [a set of] chemical signals. If it is a side that is regularly used, the sequence is old; if not, it is new.
Old sequences work for routines and procedures, but may sometimes lead to boredom, while new sequences are useful for exploration and adventure. LLMs may sometimes use new sequences, especially for questions they have not seen before.
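A toy sketch, under the assumption that a sequence is a path between sets of signals and that a frequently traveled path counts as old, might look like the following; the paths, the threshold, and the travel function are illustrative inventions, not measurements of anything.

```python
# A toy sketch of "old" versus "new" sequences: a path used often is treated
# as an old sequence (routine), a rarely used path as a new one (exploration).

from collections import defaultdict

use_counts = defaultdict(int)   # how often each path between sets has been traveled
OLD_THRESHOLD = 3               # assumed cutoff for calling a path "old"

def travel(path):
    """Travel a path and report whether it counts as an old or a new sequence."""
    kind = "old sequence (routine)" if use_counts[path] >= OLD_THRESHOLD \
        else "new sequence (exploration)"
    use_counts[path] += 1
    return kind

# A path repeated often becomes an old sequence; a fresh detour stays new.
for _ in range(4):
    travel(("set_A", "set_B"))
print(travel(("set_A", "set_B")))   # old sequence (routine)
print(travel(("set_A", "set_C")))   # new sequence (exploration)
```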
There is also a split of electrical signals, which, conceptually, means some go ahead of others to interact with chemical signals as before, such that if there is a fit or match, processing continues, and if not, the incoming signal goes to another set [or corrects the error]. This explains predictive coding, processing, and prediction errors.
Before the reasoning models, it can be assumed that LLMs had only the first, leading signals, without a split or follow-up, but the newer models are able to return somewhat, correcting some errors.
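As a rough illustration of the split-and-follow idea, the following toy loop lets a leading prediction run ahead, compares it with what actually arrives, and corrects on a mismatch; the update rule and the rate value are assumptions made only for this sketch, not a claim about how either the brain or reasoning models work.

```python
# A toy predictive-coding loop: one signal goes ahead as a prediction, the
# follower compares it with what actually arrives, and a mismatch (prediction
# error) triggers a correction before processing continues.

def predictive_loop(observations, initial_guess=0.0, rate=0.5):
    prediction = initial_guess
    for observed in observations:
        error = observed - prediction          # prediction error
        if abs(error) < 1e-6:
            continue                           # fit/match: processing continues
        prediction += rate * error             # the follower corrects the lead
        print(f"observed={observed:.2f} error={error:+.2f} "
              f"new prediction={prediction:.2f}")
    return prediction

predictive_loop([1.0, 1.0, 2.0, 2.0])
```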
There are also distributions, the relays of electrical signals in multiple directions, both in prioritization and pre-prioritization. This means that when humans are in a situation, there are realizations of what to do or not do, what is in place or not, and so forth, by distributions.
AI does not have this capability for multiple distributions yet, since it mostly runs as one prioritization [like the human mind] and only a few pre-prioritizations, unlike the human mind, which has many. AI's ability to switch prioritizations does not compare to that of humans.
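The contrast can be sketched as below, with one prioritized aspect at a time and a queue of pre-prioritized ones ready to be switched in; the capacities given to the "human-like" and "AI-like" agents are illustrative assumptions, not measurements.

```python
# A toy sketch of prioritization versus pre-prioritization: one aspect of a
# situation is prioritized, others wait pre-prioritized, and a switch swaps
# the prioritized aspect for the next one in the queue.

class Distributor:
    def __init__(self, name, pre_prioritized_capacity):
        self.name = name
        self.capacity = pre_prioritized_capacity
        self.prioritized = None
        self.pre_prioritized = []

    def notice(self, aspect):
        """Distribute attention: one aspect is prioritized, others queue up."""
        if self.prioritized is None:
            self.prioritized = aspect
        elif len(self.pre_prioritized) < self.capacity:
            self.pre_prioritized.append(aspect)
        # Aspects beyond capacity are simply missed in this toy model.

    def switch(self):
        """Swap the prioritized aspect for the next pre-prioritized one."""
        if self.pre_prioritized:
            self.prioritized, self.pre_prioritized[0] = (
                self.pre_prioritized[0], self.prioritized)

human_like = Distributor("human-like", pre_prioritized_capacity=6)
ai_like = Distributor("AI-like", pre_prioritized_capacity=2)

for agent in (human_like, ai_like):
    for aspect in ("danger", "exit", "faces", "sounds", "smell", "footing"):
        agent.notice(aspect)
    agent.switch()
    print(agent.name, "prioritized:", agent.prioritized,
          "| pre-prioritized:", agent.pre_prioritized)
```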
Chemical signals, in sets, also have their characteristics, which can be used to measure how good LLMs are getting.
Artificial intelligence can only be evaluated in comparison to human intelligence, not with tests like the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), the Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (MMMU), MLE-bench, or the FrontierMath test.
There is a recent news feature in Nature, How should we test AI for human-level intelligence? OpenAI’s o3 electrifies quest, stating that, “Although the term AGI is often used to describe a computing system that meets or surpasses human cognitive abilities across a broad range of tasks, no technical definition for it exists. As a result, there is no consensus on when AI tools might achieve AGI. Some say the moment has already arrived; others say it is still far away. Many tests are being developed to track progress towards AGI. Some, including Rein’s 2023 Google-Proof Q&A, are intended to assess an AI system’s performance on PhD-level science problems. OpenAI’s 2024 MLE-bench pits an AI system against 75 challenges hosted on Kaggle, an online data-science competition platform. The challenges include real-world problems such as translating ancient scrolls and developing vaccines.”
There is a recent Future Perfect piece on Vox, It’s getting harder to measure just how good AI is getting, stating that, “The problem is that AIs have been improving so fast that they keep making benchmarks worthless. Once an AI performs well enough on a benchmark we say the benchmark is “saturated,” meaning it’s no longer usefully distinguishing how capable the AIs are, because all of them get near-perfect scores. 2024 was the year in which benchmark after benchmark for AI capabilities became as saturated as the Pacific Ocean. We used to test AIs against a physics, biology, and chemistry benchmark called GPQA that was so difficult that even PhD students in the corresponding fields would generally score less than 70 percent. But the AIs now perform better than humans with relevant PhDs, so it’s not a good way to measure further progress.”