Qwen3-Max-Thinking Rivals Google's Gemini 3 Pro More Than Ever. The Key Is In What Is Not Being Told

There are days when it feels like we pick up the phone and the board has changed again. Since ChatGPT broke out in November 2022, the AI race has kept accelerating, and every few weeks a new model appears that promises to raise the bar a little further. Sometimes it is an update, other times it is a “flagship” with a different surname, but the pattern repeats itself: more power, more ambition and an increasingly global story. In this context, China is becoming ever more visible, and the name now entering the conversation is Qwen3-Max-Thinking, the model with which Alibaba wants to play in the same league as the leading references of the moment.

At first glance, Qwen3-Max-Thinking might seem like just another name in the endless list of models. But there is a relevant nuance here: Alibaba presents it as its flagship model for reasoning tasks, and explicitly places it in the same conversation as Gemini 3 Pro. The company claims that it has scaled up parameters and invested computing resources in reinforcement learning to improve several dimensions at once, from factual knowledge and complex reasoning to instruction following, alignment with human preferences and agent capabilities. In other words: it is not just selling raw power, but a way to “think” better.

What benchmarks teach

To ground that promise, the most useful thing is to look at the comparison table we have in hand, with 19 benchmarks and a direct count: Gemini 3 Pro leads in 11 and Qwen3-Max-Thinking in 8. This figure does not, by itself, decide “who is better”, but it does help to understand the kind of fight Alibaba is picking with Google. Here it is worth being very literal about what we are measuring: each benchmark focuses on a specific skill, from general knowledge to programming, tool use, instruction following or long-context analysis.

Model Performance Table

If we look for the point where Qwen3-Max-Thinking really hits home, there is one that stands out above the rest: following instructions and aligning with what humans prefer in a conversation. In Arena-Hard v2, Qwen wins with 90.2 compared to Gemini’s 81.7, which is the largest difference in its favor in the entire table (8.5 points). It is not a minor nuance, because this type of benchmark does not reward only technical “success”, but rather the final result that a person considers most useful when blindly comparing answers. Added to that is IFBench, where Qwen wins by the slimmest of margins (70.9 vs. 70.4). Translated into real life: when the user does not formulate a perfect instruction, when the task is ambiguous or requires interpreting intent, Qwen seems more oriented to nailing what is asked of it and doing so in a way that feels natural.

The other area where Qwen supports its “thinking model” narrative is mathematical reasoning and logical problem solving. On HMMT, in both the November 2025 and February 2025 editions, Qwen is ahead (94.7 vs. 93.3 and 98.0 vs. 97.5, respectively). And in IMOAnswerBench it also wins, although by a minimal margin: 83.9 versus 83.3. These numbers do not suggest a rout, but they do suggest a consistent pattern: when the problem requires several steps of logic and cannot be solved with memory or a polished answer alone, Qwen tends to come out ahead.

To these improvements Alibaba adds a component that is already becoming the new standard: the model does not stay confined to text, but can act. In its presentation, the company talks about adaptive tool use that allows information to be retrieved on demand and a code interpreter to be invoked. This orientation also appears in the benchmarks: in HLE (w/ tools), Qwen wins with 49.8 compared to 45.8 for Gemini, which suggests a better ability to perform when the model can rely on external tools. The underlying shift matters here: it is no longer just “what it answers”, but how it investigates, how it decides which tool to use and how it synthesizes what it finds.
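To put the Qwen wins cited so far side by side, the margins can be tallied in a few lines. This is only an illustrative sketch: the scores are the ones quoted in the text, while the dictionary layout and the `margins` helper are our own naming, not anything published by Alibaba or Google.

```python
# Scores quoted in the article, as (Qwen3-Max-Thinking, Gemini 3 Pro).
# Only the benchmarks where Qwen leads and both numbers are given appear here.
scores = {
    "Arena-Hard v2": (90.2, 81.7),
    "IFBench": (70.9, 70.4),
    "HMMT (Nov 2025)": (94.7, 93.3),
    "HMMT (Feb 2025)": (98.0, 97.5),
    "IMOAnswerBench": (83.9, 83.3),
    "HLE (w/ tools)": (49.8, 45.8),
}


def margins(table):
    """Return Qwen's lead over Gemini in points, rounded to one decimal."""
    return {name: round(qwen - gemini, 1) for name, (qwen, gemini) in table.items()}


qwen_margins = margins(scores)
largest = max(qwen_margins, key=qwen_margins.get)

# Print the benchmarks from the widest lead to the narrowest.
for name, lead in sorted(qwen_margins.items(), key=lambda item: -item[1]):
    print(f"{name}: +{lead}")
print("Largest lead:", largest)
```

Read this way, two of the six leads are substantial (Arena-Hard v2 and HLE with tools), while the rest sit between half a point and a point and a half.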

There is a part of this comparison where Gemini 3 Pro feels more “engineer” than “conversational,” and it is precisely where many professional users put the focus. The Google model wins in MMLU-Pro and MMLU-Redux, two tests closely associated with general knowledge, and also in GPQA and HLE, which appear in this table as demanding benchmarks built on complex questions. In code, Gemini wins in LiveCodeBench v6 and also in SWE Verified, which reinforces the idea that, for programming tasks, it is still a very solid bet. Added to this is AA-LCR, where it leads in the analysis of long documents.

The fine print goes beyond the price

At this point, there is a question that weighs as much as any benchmark: how much it costs to use these models seriously. In standard prices per 1M tokens, the contrast is clear. For Gemini 3 Pro, input ranges between $2 and $4 depending on the input-token tier, while for Qwen3-Max input is listed at $1.2. But the most important difference appears at the output, which is where the model’s “thinking” is paid for: Gemini charges $12 to $18 compared to $6 for Qwen. Translated into proportions, in standard use Gemini is approximately 1.67 times more expensive on input and 2 times more expensive on output in the usual tier. If the request exceeds 200,000 input tokens, the gap widens to 3.33 times on input and 3 times on output.
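The ratios above follow directly from the listed prices, and the same arithmetic can be used to estimate a concrete bill. The sketch below uses only the per-million-token prices quoted in the text; the `PRICES` table, the helper names and the 100,000/20,000-token job are hypothetical, and it assumes, as the article's own ratios do, that the Qwen price stays flat across tiers.

```python
# Prices in USD per 1M tokens, as quoted in the article.
# Assumption: Qwen's listed price does not change with the input tier.
PRICES = {
    "gemini_standard": {"input": 2.0, "output": 12.0},  # up to 200,000 input tokens
    "gemini_long": {"input": 4.0, "output": 18.0},      # above 200,000 input tokens
    "qwen": {"input": 1.2, "output": 6.0},
}


def price_ratio(model_a, model_b, kind):
    """How many times more expensive model_a is than model_b for 'input' or 'output'."""
    return round(PRICES[model_a][kind] / PRICES[model_b][kind], 2)


def job_cost(model, input_tokens, output_tokens):
    """Estimated cost in USD of a single job at the listed prices."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


for tier in ("gemini_standard", "gemini_long"):
    print(tier, price_ratio(tier, "qwen", "input"), price_ratio(tier, "qwen", "output"))

# Hypothetical job: 100,000 input tokens and 20,000 output tokens.
print(round(job_cost("gemini_standard", 100_000, 20_000), 2))
print(round(job_cost("qwen", 100_000, 20_000), 2))
```

Because reasoning models tend to produce long outputs, the output price usually dominates the bill, which is why the 2x to 3x gap on output matters more than the gap on input.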

And here we come to the part that is usually left out of the conversation when everything focuses on power and price: what happens to your data when you use the model, and under what rules. In the case of Qwen, two worlds must be clearly separated. On the one hand there is the consumer web chat, whose terms provide for the use and storage of “User content” to develop and improve AI technologies, including de-identified content, and the possibility of processing it for new products and services. Furthermore, at least in our review, we have not found a clear control or a visible option that allows you to disable that use. Nor does any explicit reference to the EU or the GDPR appear in the reviewed material. In its privacy policy, Alibaba warns of international data transfers and notes that the service is generally provided from Singapore and that the data is usually processed in Singapore, Indonesia or China.

Alibaba, however, introduces important nuances. For its professional Alibaba Cloud environment, the company states that it does not use customer data for training and that it encrypts the information with AES-256. It also explains that the handling of conversations changes depending on the type of use: in direct API calls they are not saved, while in other modes history can be kept to improve the experience. Google introduces a comparable nuance: with the paid Gemini API, prompts and responses are not used to train models and are treated as confidential. To this framework we must add another element of context: the Chinese National Intelligence Law, in its article 7, establishes that organizations and citizens must, in accordance with the law, “support, assist and cooperate” with national intelligence work and keep secret whatever they come to know of it, a legal obligation that has raised concerns in the European Union and in other parts of the world.

Images | WorldOfSoftware with Gemini 3 Pro | Screenshot
