Key Takeaways
- A Q-learning RL agent autonomously learns optimal Spark configurations by observing dataset characteristics, experimenting with different settings, and learning from performance feedback.
- Combining an RL agent with Adaptive Query Execution (AQE) outperforms either approach alone, with RL choosing optimal initial configurations and AQE adapting them at runtime.
- Bucketing continuous dataset features (rows, size, cardinality, skew) into discrete categories allows tabular Q-learning to generalize across similar workloads, solving the fundamental challenge of learning from limited examples rather than requiring identical datasets for every decision.
- Starting with aggressive exploration (ε=0.3) and gradually reducing it (ε=0.05) allows the agent to discover optimal configurations early while increasingly exploiting learned knowledge for stable production performance.
- The partition optimizer agent provides a reusable design that can be extended to other configuration domains, such as memory, cores, and cache, where each agent can independently learn policies for its specific area.
Introduction
The rapid expansion of big data systems has exposed the limitations of traditional optimization techniques, particularly in environments characterized by distributed architectures, dynamic workloads, and incomplete information. Every day, organizations process massive datasets to extract business insights such as analyzing customer behavior, predicting equipment failures, optimizing supply chains, and detecting fraud. These analytics workloads run on a variety of distributed data processing frameworks, each exposing a large number of configuration parameters that critically influence performance. The survey “A Survey on Automatic Parameter Tuning for Big Data Processing Systems” (Herodotou, Chen, and Lu 2020) highlights the same problem and emphasizes the need for intelligent, automatic tuning systems that adapt to dynamic workloads and environments.
To address this need, this article presents a reinforcement learning (RL) approach that enables distributed computing systems to learn optimal configurations autonomously, much like an apprentice engineer who learns by doing. We implement a lightweight agent as a driver-side component that uses RL to choose configuration settings before a job runs.
We ground this approach in Apache Spark, a representative distributed computing framework that divides computation across hundreds of machines. Spark’s performance depends heavily on configuration parameters that are commonly set using static defaults or manually tuned by domain experts. Such approaches, however, fail to adapt as workload characteristics and data distributions evolve. When configurations are poorly chosen, analysis that should complete in minutes can stretch into hours, while cloud costs increase significantly. As datasets grow more diverse and workloads become increasingly dynamic, reliance on static or manual tuning becomes brittle and economically unsustainable.
After processing hundreds of jobs, the agent develops intuition about patterns: Small datasets with few categories require fewer partitions, while large datasets with many categories need more. The agent remembers every experiment perfectly, never forgets lessons learned, and automatically applies this accumulated wisdom to new workloads, essentially transforming months of expert tuning experience into intelligence that’s immediately available, 24/7. Rather than requiring engineers to reconfigure the system every time data characteristics change, the agent becomes smarter with every job it processes.
A Q-learning agent is a reinforcement learning agent that learns an optimal policy by iteratively estimating the expected long-term reward of taking specific actions in given states. In practice, such an agent observes dataset characteristics (e.g., row count, data size, cardinality, and skew). The agent experiments with different configuration settings, measures execution performance, and progressively learns which parameter choices work best for particular data patterns.
In this article, we compare three optimization strategies used by Apache Spark: the built-in Adaptive Query Execution (AQE), a standalone Q-learning-based agent, and a hybrid approach combining both. From this comparison, we observe that the hybrid strategy outperforms either approach alone by combining pre-execution intelligence (RL selecting optimal initial configurations) with runtime adaptation (AQE’s dynamic adjustments). Building on these single-agent results, the article then discusses a conceptual extension to a multi-agent reinforcement learning system composed of multiple independent agents, each specializing in a distinct configuration domain, such as memory allocation, CPU cores, or caching strategies. Each agent becomes an expert in its domain while collectively contributing to workload optimization. By bridging concepts from reinforcement learning and distributed systems, this work establishes a foundation for intelligent, self-tuning big data infrastructures that learn from experience rather than relying on static rules or manual intervention.
The Problem: Spark Configuration Optimization
Spark’s performance depends heavily on configuration parameters such as shuffle partitions, memory allocation, and parallelism settings. Static defaults (e.g., two hundred shuffle partitions) fail to adapt to varying dataset characteristics such as data size, cardinality, and skew. Manual tuning requires deep domain expertise and is time-consuming, and configurations optimized for one workload often perform poorly on others. Consider StreamMetrics, a fictional video analytics company that processes viewing data for content creators. Their data engineering team faces this challenge daily: every morning, they run a lightweight report analyzing the previous day’s views across categories (science, music, entertainment) on a dataset with a few thousand rows.
By noon, they process weekly trending analysis on five hundred thousand rows to identify viral content. By the end of the month, they generate comprehensive creator reports aggregating millions of rows across hundreds of categories, with heavily skewed distributions. Some categories, like “gaming”, have millions of views, while niche categories like “origami tutorials” have hundreds. Additionally, content creators request ad hoc analysis throughout the day, with unpredictable data volumes and patterns.
Using Spark’s default two hundred shuffle partitions, the morning report wastes resources coordinating two hundred nearly-empty tasks for a tiny dataset, the weekly analysis runs reasonably well by accident, and the monthly report struggles because two hundred partitions can’t handle the massive, skewed data efficiently. The team could manually tune configurations for each workload type, but this requires constant maintenance as data patterns evolve, and last month’s optimal settings fail this month when a viral trend shifts category distributions. This is precisely the kind of dynamic environment where reinforcement learning can transform operations.
Here, I have written a simple Spark SQL query that groups video views by category, such as Science, Music, and Entertainment, on a dataset with a few thousand rows.
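A minimal PySpark sketch of such a query might look like the following (the file path and column names are illustrative, not the actual dataset used in the experiments):
from pyspark.sql import SparkSession

# Start a Spark session (configuration left at defaults for illustration)
spark = SparkSession.builder.appName("DailyCategoryReport").getOrCreate()

# Load the previous day's views; path and schema are illustrative
views_df = spark.read.csv("daily_views.csv", header=True, inferSchema=True)

# Group views by category and count them; the groupBy triggers a shuffle
category_counts = views_df.groupBy("category").count()
category_counts.show()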
By default, this groupBy operation creates two hundred shuffle partitions as shown in Figure 1. However, this configuration is inefficient for a small dataset because Spark will launch an excessive number of tiny shuffle tasks and files, resulting in significant scheduling, disk I/O, and metadata overhead relative to the actual computation. Most partitions will be nearly empty, wasting CPU and memory resources while the driver and cluster spend more time coordinating tasks than processing data.

Figure 1: Two hundred tasks were created for a small file (Gandhi 2026)
In practice, Spark developers often address this by manually setting a static partition count, using heuristics such as setting the count to be two times the number of cores or three times the number of executors (refer to Tuning Spark) to ensure tasks are large enough to be efficient:

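For illustration, such a static heuristic might be applied as follows (the executor and core counts here are assumptions, not values from the experiments):
# Size the shuffle to the cluster instead of relying on the default of 200
num_executors = 10        # illustrative cluster size
cores_per_executor = 4    # illustrative executor size
total_cores = num_executors * cores_per_executor

# Rule of thumb: roughly 2-3x the total cores available to the job
spark.conf.set("spark.sql.shuffle.partitions", str(2 * total_cores))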
On the other hand, two hundred partitions may be too few for huge datasets: each partition becomes larger, takes more processing time, and risks running out of memory.
In some cases, the partition count is determined through trial and error, experimenting with different datasets and workloads to find a balance between performance and overhead. However, such configurations may not generalize well across varying data sizes or workload characteristics. Spark 3.0 introduced Adaptive Query Execution (AQE): when enabled, Spark dynamically adjusts query plans based on actual data characteristics observed during execution, rather than relying solely on static estimates made during query planning.
However, AQE still begins execution using the default configuration, typically two hundred shuffle partitions, and only merges or adjusts them after collecting runtime statistics. This means it optimizes the reduce phase but cannot avoid the initial overhead of writing many small shuffle files, leaving some inefficiency for small or moderate datasets. Also, AQE will not increase the number of partitions beyond two hundred if more partitions are needed to improve performance.
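For reference, AQE and its partition coalescing are controlled by standard Spark 3.x settings; a minimal sketch of enabling them:
# Enable Adaptive Query Execution (available since Spark 3.0)
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Let AQE merge small shuffle partitions after runtime statistics are collected
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")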
Reinforcement learning can play a key role by dynamically adjusting these parameters to optimize performance across different conditions.
Reinforcement Learning
As defined in Reinforcement Learning: An Introduction (Sutton and Barto 2018), reinforcement learning involves learning what to do—how to map situations to actions—so as to maximize a numerical reward signal. The learner is not told which actions to take, unlike with supervised learning, but instead must discover which actions yield the most reward by trying them. These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of reinforcement learning.
Formally, RL can be described as an AI agent interacting with an environment, sensing its state, taking actions, and receiving rewards as shown in Figure 2. Over time, the agent learns a policy (a mapping from states to actions) that maximizes expected long-term reward.
In this article, the RL agent observes dataset characteristics, tries different partition counts, measures performance, and gradually builds knowledge about which configurations work best for which data patterns. After repeated executions, the agent develops intuition comparable to that of an experienced engineer and automatically selects the correct number of partitions for different workloads.

Figure 2: Illustration of the standard reinforcement learning agent–environment interaction loop (Sutton and Barto 2018)
Implementation Workflow: Building a Q-Learning RL Agent
The Q-learning RL agent we built is a custom component that runs in the Spark driver program. This implementation extends Spark by wrapping job submissions with an intelligent agent layer.
The following workflow demonstrates how our custom Q-learning RL agent perceives the Spark environment, takes actions, receives feedback, and learns over time. In the real world, large-scale data platforms processing billions of events daily face similar challenges: their data engineering teams run diverse workloads, from real-time dashboards on recent data to periodic aggregation reports on millions of records and comprehensive analytical queries across highly skewed distributions. A Q-learning RL agent can automate configuration tuning for these varied workloads, eliminating manual intervention, reducing cloud costs by optimizing resource allocation, and accelerating query performance, allowing engineering teams to focus on building features rather than tuning parameters.
Step 1: Agent Perceives Environment (State Observation)
When a Spark job is submitted, the agent’s State Observer module intercepts the job and examines the dataset to understand the current environment state.
print("nLoading data...")
df = spark.read.csv(data_path, header=True, inferSchema=True)
row_count = df.count()
The agent then extracts key features that characterize the workload:
import numpy as np
from collections import Counter

num_rows = row_count  # reuse the count computed in the previous snippet
sample_rows = df.limit(1000).collect()
category_values = [row.category for row in sample_rows]
category_counter = Counter(category_values)
category_cardinality = len(category_counter)
counts = list(category_counter.values())
skew_factor = np.std(counts) / np.mean(counts)
Features the agent observes:
- Number of rows – larger datasets typically require more partitions
- Number of columns – wider datasets may need additional partitions
- Number of unique categories – higher cardinality implies more partitions
- Data size (MB) – bigger datasets benefit from more partitions
- Average row size (bytes) – helps gauge data density
- Skew factor – measures uneven distribution; high skew requires adjustment
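The snippet above covers rows, cardinality, and skew; the remaining features in this list can be derived with similarly lightweight calls. A rough sketch (the size estimate below is an approximation extrapolated from the sampled rows, not Spark's internal statistics):
import sys

# Number of columns comes straight from the DataFrame schema
num_columns = len(df.columns)

# Approximate average row size (bytes) from the 1,000-row sample
avg_row_size = sum(sys.getsizeof(str(row)) for row in sample_rows) / len(sample_rows)

# Rough total data size in MB, extrapolated from the sample
data_size_mb = (avg_row_size * num_rows) / (1024 * 1024)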
Agent Design Choice
The agent samples only one thousand rows (~100ms) rather than scanning the entire dataset, balancing accuracy with real-time decision-making. This lightweight observation mechanism allows the agent to make fast decisions even on large datasets.
Step 2: Agent Encodes State (Discretization for Generalization)
The State Encoder component converts continuous measurements into discrete state representations, enabling the agent to generalize learned knowledge across similar workloads.
# Custom discretization buckets designed for the agent
row_buckets = [100, 1000, 10000, 100000, 1000000]
size_buckets = [1, 10, 100, 1000]
card_buckets = [5, 10, 20, 50, 100]
skew_buckets = [0.1, 0.3, 0.5, 0.8, 1.0]
Example: Agent processes a dataset with five thousand rows, 1.23 MB, twelve categories, 0.48 skew
# Agent's discretization logic:
# Rows: 5000 falls between 1000 and 10000 → bucket_2
# Size: 1.23 MB falls between 1 and 10 → bucket_1
# Cardinality: 12 categories falls between 10 and 20 → bucket_2
# Skew: 0.48 falls between 0.3 and 0.5 → bucket_2
state_key = "rows_bucket_2|size_bucket_1|card_bucket_2|skew_bucket_2"
Why This Matters for Agent Learning:
Without discretization, the agent would treat a five-thousand-row dataset as completely different from a 5,001-row dataset, making generalization impossible. By bucketing, the agent recognizes that datasets with one thousand to ten thousand rows share similar optimization patterns, enabling it to apply knowledge learned from previous jobs to new but similar workloads.
Step 3: Agent Selects Action (Epsilon-Greedy Policy)
The Action Selector component queries the agent’s learned Q-table and decides which partition count to try, balancing exploration (trying new configurations) with exploitation (using known good configurations).
# Agent's action space (custom-defined partition options)
actions = [8, 16, 32, 64, 128, 200, 400]
# Agent's exploration parameter
epsilon = 0.3
# Agent's decision logic
if random.random() < epsilon:
action = random.choice(actions) # EXPLORE: Try something new
action_type = "explore"
else:
action = max(Q[state_key],key=Q[state_key].get)# EXPLOIT: Use best known
action_type = "exploit"
Agent’s Memory (Q-Table Lookup): The agent maintains a Q-table storing its learned value estimates for each state-action pair:
Q["rows_bucket_2|size_bucket_1|card_bucket_2|skew_bucket_3"] = {
8: -0.405, # Agent tried this, took 0.405 seconds
16: -0.523, # Agent tried this, took 0.523 seconds
32: -0.650, # Agent tried this, took 0.650 seconds
64: 0.0, # Agent hasn't tried this yet
128: 0.0, # Agent hasn't tried this yet
200: -0.745, # Agent tried this, took 0.745 seconds (worst so far)
400: 0.0 # Agent hasn't tried this yet
}
Agent’s Decision:
The agent selects eight partitions because it has the highest Q-value among the actions already tried (-0.405, closest to 0, meaning fastest). Note that untried actions still hold their initial value of 0.0; a purely greedy lookup would pick one of those first, which in practice nudges the agent to cover the full action space early in training.
Agent’s Learning Strategy:
- ε = 0.3 (thirty percent exploration): Early in learning, the agent frequently experiments with untried configurations.
- Epsilon decay: With each job, ε gradually decreases toward 0.05.
- Why not ε = 0? The agent maintains minimal exploration (five percent) to continuously discover better configurations as workload patterns evolve.
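The decay itself can be a single line applied after each completed job (the decay rate of 0.99 is an illustrative choice):
# Reduce exploration after each job, but never drop below the 5% floor
epsilon = max(0.05, epsilon * 0.99)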
Step 4: Agent Acts on Environment (Configuration Application)
The Configuration Manager component applies the agent’s chosen action to the Spark environment:
# Agent injects its learned configuration into Spark
spark.conf.set("spark.sql.shuffle.partitions", "8")
# Spark job executes with agent-selected configuration
result_df = df.groupBy("category").count()
result_df.show()
Critical Point:
The agent doesn’t modify Spark’s internal logic; it operates as an intelligent wrapper that sets optimal configurations before job execution and then lets Spark’s native execution engine run.
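Conceptually, the wrapper is a thin function around normal job submission. A simplified sketch (the helper names observe_features, select_action, and update_q are illustrative; the reward and update calls are covered in Steps 5 and 6):
import time

def run_with_agent(spark, df, job_fn):
    # Steps 1-2: observe and encode the environment state
    state_key = encode_state(*observe_features(df))
    # Step 3: choose a partition count via the epsilon-greedy policy
    action = select_action(state_key)
    # Step 4: apply the configuration, then let Spark run the job normally
    spark.conf.set("spark.sql.shuffle.partitions", str(action))
    start = time.time()
    result = job_fn(df)
    # Steps 5-6: compute the reward and update the Q-table
    update_q(state_key, action, reward=-(time.time() - start))
    return result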
Step 5: Agent Receives Reward (Performance Feedback)
After Spark completes the job, the Reward Calculator measures execution time, which serves as the agent’s learning signal. In this implementation, the reward is based solely on execution time (reward = -execution_time), ignoring other factors, such as running cost, memory pressure, failure risk, or resource utilization, that more complex multi-objective systems might optimize:
import time

# Agent measures job performance
start_time = time.time()
result_df = df.groupBy("category").count().collect()
execution_time = time.time() - start_time
# Agent's reward signal (negative because lower time is better)
reward = -execution_time  # e.g., -0.321 seconds
Step 6: Agent Learns (Q-Value Update)
The Learning Engine updates the agent’s Q-table using the Q-learning equation

$Q(s, a) \leftarrow Q(s, a) + \alpha \bigl( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \bigr)$

incorporating the observed reward:
# Q-learning update formula (implemented in agent's learning engine)
alpha = 0.3 # Learning rate: how much to adjust from new experience
gamma = 0.1 # Discount factor: how much to value future rewards
old_q_value = Q[state_key][action]
max_future_q = max(Q[state_key].values())
new_q_value = old_q_value + alpha * (reward + gamma * max_future_q - old_q_value)
# Agent updates its memory
Q[state_key][action] = new_q_value
Example Learning Scenario:
If the previous iteration for this state had a duration of 0.4s (reward = -0.4), but the latest execution took 0.6s (reward = -0.6), the agent updates the Q-value downward, signaling that this action performed worse than expected. In the next iteration, the agent is more likely to explore alternative partition counts for this state.
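Plugging these numbers into the update rule, and assuming the untried actions in this state still hold their initial value of 0.0 (so max_future_q = 0.0), the adjustment looks like this:
# old_q_value = -0.4, reward = -0.6, alpha = 0.3, gamma = 0.1, max_future_q = 0.0
# new_q_value = -0.4 + 0.3 * (-0.6 + 0.1 * 0.0 - (-0.4))
#             = -0.4 + 0.3 * (-0.2)
#             = -0.46   # lower than -0.4, so this action now looks less attractive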
Agent’s Continuous Improvement:
The agent persists its Q-table (as JSON) between jobs, accumulating organizational knowledge over weeks and months. Each new job provides a learning opportunity, and the agent’s policy becomes increasingly refined.
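Persistence can be as simple as serializing the Q-table to JSON after each job and reloading it on startup. A minimal sketch (the file path is illustrative):
import json
import os

Q_TABLE_PATH = "q_table.json"  # illustrative location for the persisted policy

def save_q_table(Q):
    with open(Q_TABLE_PATH, "w") as f:
        json.dump(Q, f)

def load_q_table():
    if not os.path.exists(Q_TABLE_PATH):
        return {}
    with open(Q_TABLE_PATH) as f:
        raw = json.load(f)
    # JSON stores keys as strings, so convert action keys back to integers
    return {state: {int(a): q for a, q in actions.items()}
            for state, actions in raw.items()}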
Experiment Results
To validate the agent’s effectiveness, comparative experiments were conducted using three optimization strategies on the same workload:
- AQE Only: Spark’s built-in Adaptive Query Execution
- RL Agent Only: Custom Q-learning agent with AQE disabled
- Hybrid (AQE + RL): Q-learning agent selecting initial configuration plus AQE runtime adaptation.
Performance Comparison:
The chart below (Figure 3) presents results from a small dataset (one thousand rows) with low-skew data (skew factor 0.162).

Figure 3: Execution Time for Small Dataset (Gandhi 2026)
The results demonstrate significant performance improvements:
| Strategy | Mean Execution Time | Improvement vs AQE only |
| --- | --- | --- |
| AQE | 0.263 secs | Baseline |
| RL Agent Only | 0.175 secs | 33.3% faster |
| Hybrid (AQE + RL) | 0.142 secs | 46.0% faster |
The same experimentation was conducted on a larger dataset of seventy-five thousand rows with highly skewed data (skew factor 1.241), and the results indicate that performance improvements scale with dataset size and skew complexity, as shown in Figure 4:

Figure 4: Execution Time for Very Large Skewed Dataset (Gandhi 2026)
| Strategy | Mean Execution Time | Improvement vs AQE only |
| --- | --- | --- |
| AQE | 0.457 secs | Baseline |
| RL Agent Only | 0.201 secs | 56.0% faster |
| Hybrid (AQE + RL) | 0.143 secs | 68.6% faster |
Key Findings:
The hybrid approach outperforms both AQE Only and RL Only, validating that pre-execution intelligence (RL choosing optimal initial configurations) and runtime adaptation (AQE’s dynamic adjustments) address complementary optimization opportunities.
Comparative Analysis: Two Key Insights
A Reinforcement Learning (RL) agent achieves a significantly faster execution time than the standard rule-based AQE. The agent’s advantage comes from its ability to learn and select optimal initial partition counts (e.g., eight for small datasets). This proactive configuration effectively eliminates shuffle overhead before execution even starts, a benefit that Spark’s default AQE cannot fully replicate, as AQE only addresses excessive partitions after the shuffle blocks have been materialized on disk.
A Hybrid Approach Delivers the Best Performance. Combining RL with AQE creates a two-stage optimization (a configuration sketch follows the list):
- Stage 1 (Pre-execution): RL agent sets optimal initial configuration based on learned patterns.
- Stage 2 (Runtime): AQE adapts to unexpected conditions (e.g., skew discovered during execution, partition size variance)
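In configuration terms, the hybrid setup simply layers the agent's chosen partition count on top of AQE; a minimal sketch (action is the partition count returned by the agent's Q-table lookup):
# Stage 1 (pre-execution): RL agent sets the initial shuffle partition count
spark.conf.set("spark.sql.shuffle.partitions", str(action))
# Stage 2 (runtime): AQE stays enabled so Spark can still adapt during execution
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")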
These experimental results demonstrate practical value for large-scale data platforms processing billions of events daily. By enabling AQE (which most Spark 3.0+ deployments already have) and implementing an RL agent, these platforms can potentially achieve performance improvements across varied workloads, as demonstrated in our experiments. These improvements can translate to reduced cloud costs through optimized resource allocation, accelerated query performance, delivering insights faster to business stakeholders, and freed engineering capacity.
Extending to a Multi-Agent System Architecture
While the single-agent partition optimizer delivers significant improvements, large-scale data platforms face a more complex reality. Consider our platform running the daily aggregation job: the RL agent sets the right shuffle partition count, yet the job still fails with out-of-memory errors because executor memory wasn’t configured for the large join operations. Or consider the real-time dashboard that runs efficiently with a well-chosen partition count but repeatedly recomputes the same intermediate data because caching wasn’t enabled, wasting CPU cycles.
A single agent optimizing only partitions leaves substantial performance and cost savings on the table. Production workloads require simultaneous optimization across multiple dimensions: memory allocation for different operation types (joins vs. aggregations), CPU parallelism for varying workload intensities (I/O-bound vs. compute-intensive), and intelligent caching decisions for data reuse patterns. Manually coordinating these configurations is far more complex than tuning partitions alone; an engineer must now consider how partition count interacts with memory settings, how memory allocation affects CPU utilization, and how caching impacts all of the above. This motivates extending the single-agent approach to multiple independent learning components, each optimizing a specific configuration domain.
Building on that motivation, a natural extension is to deploy specialized agents on the Spark driver, each responsible for a distinct configuration domain and each learning independently from job-execution feedback. In this multi-agent architecture, a coordinator (a lightweight control layer that applies agent decisions in a fixed order without learning or optimizing policies itself) orchestrates three additional agents alongside the partition agent.
The Memory Agent optimizes executor memory allocation by monitoring memory usage patterns, garbage collection frequency, and spill-to-disk events. Based on observed workload characteristics, such as join-heavy operations requiring large hash tables versus filter-only queries with minimal memory footprint, it dynamically configures spark.executor.memory, spark.memory.fraction, and spark.memory.storageFraction to balance performance against resource waste.
The Core Agent learns optimal CPU parallelism by tracking core utilization, task wait times, and thread contention. It adjusts spark.executor.cores, spark.task.cpus, and spark.executor.instances to match workload characteristics, recognizing that I/O-bound tasks benefit from higher parallelism while CPU-intensive operations suffer from context-switching overhead when over-parallelized.
The Cache Agent develops intelligent caching policies by measuring cache hit rates, eviction patterns, and recomputation costs. It decides whether to cache intermediate DataFrames, selects the appropriate storage level (memory-only, memory-and-disk, or disk-only), and configures spark.storage.memoryFraction and spark.rdd.compress based on data reuse patterns and available memory.
Each agent operates using the same Q-learning foundation as the partition optimizer, extracting relevant state features, maintaining its own Q-table, and updating based on job performance rewards. This decoupling allows each agent to specialize in its domain while the system achieves comprehensive workload optimization.
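As a conceptual sketch of this architecture (the Coordinator class and agent interfaces below are illustrative, not an implemented system), the coordinator simply applies each agent's decision in a fixed order and forwards the same job-level reward to every agent:
class Coordinator:
    """Applies each specialized agent's configuration choice before a job runs."""

    def __init__(self, agents):
        # e.g., [partition_agent, memory_agent, core_agent, cache_agent]
        self.agents = agents

    def configure(self, spark, state_key):
        for agent in self.agents:
            # Each agent consults its own Q-table for its configuration domain
            for conf_key, conf_value in agent.select_action(state_key).items():
                spark.conf.set(conf_key, str(conf_value))

    def learn(self, state_key, reward):
        # The same job-level reward is fed back to each agent independently
        for agent in self.agents:
            agent.update(state_key, reward)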
Figure 5 shows a high-level multi-agent system for the different agents we discussed above.

Figure 5: High-level multi-agent system for Apache Spark (Gandhi 2026)
Conclusion
This article demonstrates how reinforcement learning transforms the traditionally manual and error-prone process of Spark configuration tuning into an autonomous, adaptive optimization system. By implementing a Q-learning RL agent that observes dataset characteristics, experiments with different partition counts, and learns from performance feedback, the system develops expertise comparable to experienced engineers, but with perfect memory and systematic exploration. The experimental results validate this approach. The RL agent alone outperformed Spark’s default Adaptive Query Execution, while the hybrid strategy combining AQE with Q-learning delivered the best overall performance, revealing that pre-execution intelligence (RL choosing optimal initial configurations) and runtime adaptation (AQE’s dynamic adjustments) address complementary optimization opportunities.
It is important to acknowledge that our experiments were conducted on relatively small datasets (one thousand to seventy-five thousand rows) compared to production workloads that process billions of events daily. While these results demonstrate the viability of RL-based configuration optimization, larger-scale validation on petabyte-scale datasets with more complex query patterns would strengthen confidence in production deployment. Additionally, the current implementation focuses on a single configuration dimension (shuffle partitions); extending to multi-agent optimization across memory, CPU, and caching domains requires further experimentation to validate agent interactions and ensure stable convergence.
The proposed multi-agent architecture extends these concepts to comprehensive workload optimization, where multiple specialized agents independently learn policies for memory allocation, CPU core distribution, and caching strategies, each becoming an expert in its specific optimization domain. Looking forward, this architecture opens research directions including transfer learning across cluster environments, deep Q-networks for continuous state spaces, and context-aware policies that incorporate cluster topology. For data engineers managing production Spark workloads, this approach offers a practical path: instrument jobs to measure performance, implement a simple Q-learning RL agent for shuffle partitions, deploy it alongside existing systems, and let it learn from production traffic.
The approach also accumulates organizational knowledge, transforming months of tuning experience into reusable policies available to every future job. The convergence of reinforcement learning and distributed systems represents more than an optimization technique; it signals a shift toward autonomous infrastructure that learns from experience rather than following static rules. As big data systems grow more complex, with thousands of configuration parameters and constantly evolving workloads, intelligent agents that learn, adapt, and optimize autonomously are not just convenient; they become necessary.
